Is solid_queue designed for distributed systems like Kubernetes? #685

@candidosales

Hi there! 👋

I've been exploring solid_queue as a potential solution for our project, and I wanted to share some observations and questions about its architecture, particularly in the context of distributed systems and Kubernetes environments.

Context

I'm currently evaluating solid_queue for a project running on Kubernetes. Before investing time in a POC, I'd like to understand whether my assumptions about its design goals are correct.

My understanding

Based on the principle from Designing Data-Intensive Applications:

"There is no single system that can satisfy all data storage, querying and processing needs. In practice, most nontrivial applications need to combine several different technologies to satisfy their requirements."

In Kubernetes environments, an application typically runs as several pods at once, so multiple instances of the same code execute concurrently. This led me to wonder about a few architectural aspects of solid_queue.

Questions and observations

Job distribution in distributed environments

In solid_queue's architecture, there isn't a mechanism to determine which specific pod will process a given job. This differs from traditional message broker patterns where:

  • Producers and consumers are separate entities: In solid_queue, the same application acts as both producer and consumer
  • Centralized orchestration: Message brokers centralize data and can arbitrarily assign messages to consumers
  • Durability and reliability: By centralizing data in the broker, these systems can more easily tolerate clients that connect, disconnect, or crash
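
To make the contrast concrete, here is a minimal sketch of the model I have in mind: several workers polling one shared table and atomically claiming whatever job is next, with no broker deciding who gets what. This is my assumption about the general pattern, not solid_queue's actual code; `SharedJobTable`, `run_worker`, `process_all`, and the `pod-N` names are hypothetical, and the in-process lock only stands in for a database row lock.

```python
import threading


class SharedJobTable:
    """Hypothetical stand-in for a database table that every pod can reach."""

    def __init__(self, job_ids):
        self._lock = threading.Lock()  # plays the role of a row lock
        self._ready = list(job_ids)
        self.claimed_by = {}  # job id -> name of the worker that claimed it

    def claim(self, worker_name):
        """Atomically hand the next ready job to whichever worker asks first."""
        with self._lock:
            if not self._ready:
                return None
            job_id = self._ready.pop(0)
            self.claimed_by[job_id] = worker_name
            return job_id


def run_worker(table, worker_name):
    # Each worker polls until nothing is left; nobody assigns jobs to it.
    while table.claim(worker_name) is not None:
        pass


def process_all(job_count, worker_count):
    table = SharedJobTable(range(job_count))
    threads = [
        threading.Thread(target=run_worker, args=(table, f"pod-{i}"))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return table.claimed_by
```

In this model every job is claimed exactly once, but which pod ends up with a given job depends only on polling order, which is the behavior I am asking about.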

Potential challenges I'm considering

Backpressure handling: What happens if producers send messages faster than consumers can process them? Without a centralized server to orchestrate processing, how does solid_queue handle backpressure or buffer messages?
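
A small sketch of the scenario I am worried about, assuming waiting jobs simply accumulate somewhere as a backlog. The function `queue_depth_over_time` and its parameters are hypothetical and only model the arithmetic, not anything solid_queue does.

```python
def queue_depth_over_time(enqueue_rate, process_rate, ticks):
    """Return the number of waiting jobs after each tick."""
    depth = 0
    history = []
    for _ in range(ticks):
        depth += enqueue_rate              # producers add jobs
        depth -= min(depth, process_rate)  # consumers drain what they can
        history.append(depth)
    return history


# Producers outpace consumers: the backlog grows without bound.
print(queue_depth_over_time(enqueue_rate=5, process_rate=3, ticks=4))  # [2, 4, 6, 8]
# Consumers keep up: the backlog stays empty.
print(queue_depth_over_time(enqueue_rate=3, process_rate=5, ticks=4))  # [0, 0, 0, 0]
```

My question is what plays the role of `depth` in solid_queue, and whether anything limits or signals its growth.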

Fault tolerance: What happens if pods/nodes crash or temporarily go offline? Are any messages at risk of being lost?

Worker recovery: What happens if a worker is killed (e.g., OOMKill)? How does the system handle worker restart?
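
For the last two questions, the recovery behavior I would hope for looks roughly like the sketch below: workers record a heartbeat, and jobs held by a worker that has gone silent are returned to the ready list. Everything here (`JobStore`, `release_jobs_of_dead_workers`, `alive_threshold`) is a name I made up for illustration; I am not claiming this is how solid_queue implements it.

```python
from dataclasses import dataclass, field


@dataclass
class JobStore:
    ready: list = field(default_factory=list)
    claimed: dict = field(default_factory=dict)     # job id -> worker name
    heartbeats: dict = field(default_factory=dict)  # worker name -> last seen

    def claim(self, worker, now):
        """Record a heartbeat and hand the worker the next ready job, if any."""
        self.heartbeats[worker] = now
        if not self.ready:
            return None
        job = self.ready.pop(0)
        self.claimed[job] = worker
        return job

    def release_jobs_of_dead_workers(self, now, alive_threshold):
        """Put jobs held by silent workers back on the ready list."""
        dead = {
            worker
            for worker, seen in self.heartbeats.items()
            if now - seen > alive_threshold
        }
        released = [job for job, worker in self.claimed.items() if worker in dead]
        for job in released:
            del self.claimed[job]
            self.ready.append(job)
        for worker in dead:
            del self.heartbeats[worker]
        return released
```

For example, if `pod-1` claims job `"a"` at time 0 and is then OOMKilled, a call to `release_jobs_of_dead_workers(now=100, alive_threshold=60)` would make `"a"` available to the surviving pods again.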

Related GitHub issues

I noticed several issues that seem related to Kubernetes deployments:

My current hypothesis

It seems that solid_queue might be optimized for environments like Basecamp's, where they're not using Kubernetes. According to their blog posts, they use Kamal for deployment on bare-metal/VMs:

"It's kinda wild to think that it's been less than three months since we decided to scrap Kubernetes and pursue a simpler solution for the cloud exit with Kamal. And that we've already moved half of the cloud applications that need to come home!"

[Reference]

Alternative approaches

For comparison, systems like Temporal provide centralized orchestration that addresses these distributed system principles:

My question

Is my assumption correct that solid_queue was primarily designed for non-distributed, single-server or small-cluster environments rather than distributed systems like Kubernetes?

If I'm mistaken, I'd be very interested to learn about:

  • Use cases where solid_queue has been successfully deployed in Kubernetes environments
  • Recommended patterns or configurations for distributed deployments
  • Any architectural features I might have missed that address these concerns

Thank you for your time and for creating this project! I really appreciate the work that's gone into it 👏🏼, and I'm genuinely curious to understand its design philosophy better.
