Advanced super-stream usage #292
You can use the "low-level" Client API to look up the partitions of a super stream. This API is a Java-to-stream-protocol mapping; it's technically public but not part of the public API, so it can change, but it's acceptable to use in your use case I'd say. At least it can be good enough to explore the feasibility of a use case.

I'm not sure I understand the use case of the second request, that is, reading only a part of the partitions. How would you decide which partitions should be read? With some sort of filter on the partition name, like a wildcard? If this is what you expect, it should be easy enough to implement, but it may change the single active consumer semantics quite significantly, so we would need to take this into account.

For the last request — consuming messages and stopping automatically, IIUC — it could mean having some "condition" API that gets evaluated on message or chunk arrival and that would trigger the closing of the subscription. What could the condition be? A number of messages, a time limit?

The super stream API for this client library has been designed to be almost invisible to applications, and I think you're hitting the other side of this design because you need insight into the super stream topology. These are reasonable requests, but I don't know whether they would have value for other users. Nevertheless, we can keep refining the semantics and see how we can implement them.

Last thing: I think you meant "Amazon Kinesis" or "Zookeeper" cluster, not "RabbitMQ" cluster, because you'll need a RabbitMQ cluster anyway.
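To make the "condition" idea concrete, here is a minimal sketch in plain Java of what count- and time-based stop conditions could look like. All names here (`Envelope`, `StopConditions`) are hypothetical illustrations, not part of the client API:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Predicate;

// Hypothetical stand-in for a received message; a real client would pass
// its own Message type to the condition.
record Envelope(long offset) {}

// A stop condition evaluated on each message arrival: returning true
// means "close the subscription now".
final class StopConditions {

    // Stop once a fixed number of messages has been seen.
    static Predicate<Envelope> afterMessages(long max) {
        long[] seen = {0};
        return msg -> ++seen[0] >= max;
    }

    // Stop once a fixed amount of time has elapsed.
    static Predicate<Envelope> afterDuration(Duration d) {
        Instant deadline = Instant.now().plus(d);
        return msg -> Instant.now().isAfter(deadline);
    }
}
```

The predicate form keeps the conditions composable (`afterMessages(1000).or(afterDuration(...))`), which matches the "number of messages, time?" question above.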
I'm working on creating a super stream consumer for Apache Druid, in order to avoid the challenges of maintaining both a Kafka cluster and a RabbitMQ cluster. Theoretically, there are no gaps I'm aware of in the functionality of super streams themselves that would prevent plumbing this together properly. Practically, there are a few challenges I've run into, and observations from the existing Druid indexing code for both Amazon Kinesis and Apache Kafka, that I think point to features missing from this client (as well as the C# client, as far as I can tell).
In other streaming clients, it is quite easy to find the partitions or shards of any given stream. This is useful for a worker system that needs this information in order to batch up jobs. For Druid, these indexing tasks are both CPU and memory intensive as they crunch the data down into an index format. Based on the partitions and some configuration information, Druid decides how to batch the work and allocates workers as appropriate; to split the work correctly, it uses the list of partitions to create workers with specific tasks. The equivalent for RabbitMQ would be finding the sub-streams of a super stream. Right now the workaround is to list all the streams and filter them based on string matches.
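The string-matching workaround could be sketched like this. It assumes the default partition naming convention (`<super-stream>-<index>`), which is a convention rather than a guarantee, so this is fragile compared to a real topology API:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class PartitionFilter {

    // Filter a list of stream names down to the partitions of one
    // super stream, assuming the default "<name>-<index>" naming.
    static List<String> partitionsOf(String superStream, List<String> allStreams) {
        Pattern p = Pattern.compile(Pattern.quote(superStream) + "-\\d+");
        return allStreams.stream()
                .filter(s -> p.matcher(s).matches())
                .collect(Collectors.toList());
    }
}
```

`Pattern.quote` keeps super stream names containing regex metacharacters from breaking the match, but a stream that merely happens to be named like a partition would still be picked up incorrectly.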
The client also doesn't offer any real way to choose which partitions are read. This means that if I want to dispatch something to read from some subset of the sub-streams, I would need to use the single-stream client, then piece the streams together myself and hope I haven't made an ordering mistake.
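One way to avoid the ordering mistake when piecing streams together by hand is to sort partitions by their numeric suffix and assign them to workers deterministically. This is a hypothetical sketch (the class and method names are mine, not the client's), again assuming the `<name>-<index>` naming:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class PartitionAssignment {

    // Sort partitions by numeric suffix so the order is stable across
    // runs, then assign every workers-th partition to each worker.
    static List<List<String>> assign(List<String> partitions, int workers) {
        List<String> sorted = partitions.stream()
                .sorted(Comparator.comparingInt(
                        s -> Integer.parseInt(s.substring(s.lastIndexOf('-') + 1))))
                .collect(Collectors.toList());
        return IntStream.range(0, workers)
                .mapToObj(w -> IntStream.range(0, sorted.size())
                        .filter(i -> i % workers == w)
                        .mapToObj(sorted::get)
                        .collect(Collectors.toList()))
                .collect(Collectors.toList());
    }
}
```

A stable sort plus round-robin assignment means every indexing task computes the same partition-to-worker mapping without coordination, which is roughly what the Kafka and Kinesis indexers rely on.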
The last feature that would be helpful here is a non-subthread model for reading from streams. In the Kafka client, the Druid indexer reads from the stream for a period of time; in the Kinesis client, it reads a set number of messages. These features are useful because Druid works in 'slices': it reads for a certain amount of time, then turns what it read into an indexable chunk it can put into a backing store. With the existing RabbitMQ streaming consumer, this would involve starting and stopping the client arbitrarily with some kind of timer or hold condition. It's certainly possible, but not ideal. Where that gets more painful is that a Druid indexing task will read from several streams at once, meaning each stream would start and stop reading at different times.
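The slice pattern described above could be sketched like this, using a `BlockingQueue` as a stand-in for the message feed (a real integration would receive messages from the client's handler instead). `SliceReader` and its parameters are hypothetical names for illustration:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class SliceReader {

    // Drain messages from the feed until either the time budget is spent
    // or maxMessages have been read, then return the slice for indexing.
    static List<String> readSlice(BlockingQueue<String> feed,
                                  Duration budget, int maxMessages)
            throws InterruptedException {
        List<String> slice = new ArrayList<>();
        Instant deadline = Instant.now().plus(budget);
        while (slice.size() < maxMessages) {
            long waitMs = Duration.between(Instant.now(), deadline).toMillis();
            if (waitMs <= 0) break;               // time budget exhausted
            String msg = feed.poll(waitMs, TimeUnit.MILLISECONDS);
            if (msg == null) break;               // nothing arrived in time
            slice.add(msg);
        }
        return slice;
    }
}
```

The caller would invoke `readSlice` once per slice, hand the result to the indexer, and loop. The pain point in the original post stands: with one of these loops per sub-stream, each stream starts and stops reading at different times unless the client exposes this natively.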
This is just my perspective as a user of the super streams feature; hopefully it's helpful for guiding development on this tool.