Advanced super-stream usage #292
You can use the "low-level" Client API to look up the partitions of a super stream. This API is a Java-to-stream-protocol mapping; it's technically public but not part of the public API, so it can change, but it's acceptable to use in your use case I'd say. At least it can be good enough to explore the feasibility of a use case.

I'm not sure I understand the use case of the second request, that is, reading only a part of the partitions. How would you decide which partitions should be read? With some sort of filter on the partition name, like a wildcard? If this is what you expect, it should be easy enough to implement, but it may change the single active consumer semantics quite significantly, so we would need to take this into account.

For the last request — consuming messages and stopping automatically, IIUC — it could mean having some "condition" API that gets evaluated on message or chunk arrival and that would trigger the closing of the subscription. What could the condition be? A number of messages, a time limit?

The super stream API for this client library has been designed to be almost invisible to applications, and I think you're hitting the other side of this design because you need insight into the super stream topology. These are reasonable requests, but I don't know whether they would have value for other users. Nevertheless, we can keep refining the semantics and see how we can implement them.

Last thing: I think you meant "Amazon Kinesis" or "Zookeeper" cluster, not "RabbitMQ" cluster, because you'll need a RabbitMQ cluster anyway.
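To make the "condition" idea concrete, here is a minimal sketch in plain Java of what count- and time-based stop conditions could look like. All names here (`Envelope`, `StopConditions`) are hypothetical illustrations, not part of the client API:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Predicate;

// Hypothetical stand-in for a received message; a real client would pass
// its own Message type to the condition.
record Envelope(long offset) {}

// A stop condition evaluated on each message arrival: returning true
// means "close the subscription now".
final class StopConditions {

    // Stop once a fixed number of messages has been seen.
    static Predicate<Envelope> afterMessages(long max) {
        long[] seen = {0};
        return msg -> ++seen[0] >= max;
    }

    // Stop once a fixed amount of time has elapsed.
    static Predicate<Envelope> afterDuration(Duration d) {
        Instant deadline = Instant.now().plus(d);
        return msg -> Instant.now().isAfter(deadline);
    }
}
```

The predicate form keeps the conditions composable (`afterMessages(1000).or(afterDuration(...))`), which matches the "number of messages, time?" question above.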
I'm working on creating a super stream consumer for Apache Druid, in order to avoid the challenges of maintaining both a Kafka cluster and a RabbitMQ cluster. Theoretically, there are no gaps I'm aware of in the functionality of super streams themselves that would prevent plumbing this together properly. Practically, there are a few challenges I've run into, and observations from the existing Druid indexing code for both Amazon Kinesis and Apache Kafka, that I think point to features missing from this client (as well as the C# client, as far as I can tell).
In other streaming clients, it is quite easy to find the partitions or shards of any given stream. This is useful for a worker system that needs this information in order to batch up jobs. For Druid, these indexing tasks are both CPU and memory intensive as they crunch the data down into an index format. Based on the partitions and some configuration information, Druid decides how to batch the work and allocates workers as appropriate; to split the work correctly, it uses the list of partitions to create workers with specific tasks. The equivalent for RabbitMQ would be finding the sub-streams of a super stream. Right now the workaround is to list all the streams and filter them based on string matches.
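The string-matching workaround could be sketched like this. It assumes the default partition naming convention (`<super-stream>-<index>`), which is a convention rather than a guarantee, so this is fragile compared to a real topology API:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class PartitionFilter {

    // Filter a list of stream names down to the partitions of one
    // super stream, assuming the default "<name>-<index>" naming.
    static List<String> partitionsOf(String superStream, List<String> allStreams) {
        Pattern p = Pattern.compile(Pattern.quote(superStream) + "-\\d+");
        return allStreams.stream()
                .filter(s -> p.matcher(s).matches())
                .collect(Collectors.toList());
    }
}
```

`Pattern.quote` keeps super stream names containing regex metacharacters from breaking the match, but a stream that merely happens to be named like a partition would still be picked up incorrectly.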
The client also doesn't offer any real way to choose which partitions are read. This means that if I want to dispatch something to read from some subset of the sub-streams, I would need to use the single-stream client, then piece the streams together myself and hope I haven't made an ordering mistake.
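One way to avoid the ordering mistake when piecing streams together by hand is to sort partitions by their numeric suffix and assign them to workers deterministically. This is a hypothetical sketch (the class and method names are mine, not the client's), again assuming the `<name>-<index>` naming:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class PartitionAssignment {

    // Sort partitions by numeric suffix so the order is stable across
    // runs, then assign every workers-th partition to each worker.
    static List<List<String>> assign(List<String> partitions, int workers) {
        List<String> sorted = partitions.stream()
                .sorted(Comparator.comparingInt(
                        s -> Integer.parseInt(s.substring(s.lastIndexOf('-') + 1))))
                .collect(Collectors.toList());
        return IntStream.range(0, workers)
                .mapToObj(w -> IntStream.range(0, sorted.size())
                        .filter(i -> i % workers == w)
                        .mapToObj(sorted::get)
                        .collect(Collectors.toList()))
                .collect(Collectors.toList());
    }
}
```

A stable sort plus round-robin assignment means every indexing task computes the same partition-to-worker mapping without coordination, which is roughly what the Kafka and Kinesis indexers rely on.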
The last feature that would be helpful here is a non-subthread model for reading from streams. In the Kafka client, the Druid indexer reads from the stream for a period of time; in the Kinesis client, it reads a set number of messages. These features are useful because Druid works in 'slices': it reads for a certain amount of time, then turns what it read into an indexable chunk it can put into a backing store. With the existing RabbitMQ streaming consumer, this would involve starting and stopping the client arbitrarily with some kind of timer or hold condition. It's certainly possible, but not ideal. Where that gets more painful is that a Druid indexing task will read from several streams at once, meaning each stream would start and stop reading at different times.
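The slice pattern described above could be sketched like this, using a `BlockingQueue` as a stand-in for the message feed (a real integration would receive messages from the client's handler instead). `SliceReader` and its parameters are hypothetical names for illustration:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class SliceReader {

    // Drain messages from the feed until either the time budget is spent
    // or maxMessages have been read, then return the slice for indexing.
    static List<String> readSlice(BlockingQueue<String> feed,
                                  Duration budget, int maxMessages)
            throws InterruptedException {
        List<String> slice = new ArrayList<>();
        Instant deadline = Instant.now().plus(budget);
        while (slice.size() < maxMessages) {
            long waitMs = Duration.between(Instant.now(), deadline).toMillis();
            if (waitMs <= 0) break;               // time budget exhausted
            String msg = feed.poll(waitMs, TimeUnit.MILLISECONDS);
            if (msg == null) break;               // nothing arrived in time
            slice.add(msg);
        }
        return slice;
    }
}
```

The caller would invoke `readSlice` once per slice, hand the result to the indexer, and loop. The pain point in the original post stands: with one of these loops per sub-stream, each stream starts and stops reading at different times unless the client exposes this natively.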
This is just my perspective as a user of the super streams feature; hopefully it's helpful for guiding development on this tool.