Consumer gets stuck when it receives a control batch #403
Comments
I'm facing the same issue and have some findings after some debugging. Here's what's happening in our case.
One nasty workaround to escape from the empty-batch situation within kafkajs's runner would be to do something like this:

```js
if (batch.isEmpty()) {
  this.consumerGroup.offsetManager.resolvedOffsets[batch.topic][batch.partition] = batch.lastOffset()
  continue
}
```

instead of the current code:

```js
if (batch.isEmpty()) {
  continue
}
```

That kind of worked for me, but I don't think it's the proper solution; it only served as a way to confirm the root cause of the issue. Ideally we should just commit the control message's offset when finishing processing the batch, just like it would happen when […]. On the other hand, if using […], a naive solution within kafkajs itself could be to have the runner do

```js
if (this.eachBatchAutoResolve ||
    batch.lastOffset() !== batch.messages[batch.messages.length - 1].offset) {
  this.consumerGroup.resolveOffset({ topic, partition, offset: batch.lastOffset() })
}
```

instead of the current code:

```js
if (this.eachBatchAutoResolve) {
  this.consumerGroup.resolveOffset({ topic, partition, offset: batch.lastOffset() })
}
```

But this is also not the right solution in my opinion, because some use cases require that no offset is committed by the runner, including control message offsets (stream aggregations, for instance). Also, I don't know whether an empty batch with a control message can be generated by committing a transaction without messages, but I'd assume that is actually the case, and if we do that we would again see a loop fetching from that control message's offset over and over. So I believe that if we avoid creating empty transactional batches on our producers entirely, and on the consumer side we do one of: […] or […], then our consumers should be safe and free of this kind of loop. That said, if the runner did not just […]. I hope this helps @kkurten and others having problems due to this issue. PS: This is the first time I interact here, so I want to say thanks to @tulios and all contributors for sharing this great work with all of us. |
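To make the proposed resolve condition concrete, here is a minimal, self-contained sketch. The `shouldResolve` helper and the plain-object batch shape are invented for illustration; this is not kafkajs's actual `Batch` class or runner code, just the decision logic under those assumptions:

```javascript
// Hypothetical helper illustrating the "naive" resolve condition from the
// comment above. The batch argument is a simplified stand-in: `messages`
// holds the user messages left after control records are filtered out, and
// `lastOffset()` returns the last offset of the raw batch (which may belong
// to a control record).
function shouldResolve(batch, eachBatchAutoResolve) {
  const lastMessage = batch.messages[batch.messages.length - 1]
  return (
    eachBatchAutoResolve ||
    !lastMessage || // the batch was emptied by control-record filtering
    batch.lastOffset() !== lastMessage.offset // a trailing control record
  )
}
```

With auto-resolve disabled, the helper still returns `true` whenever the batch's last offset belongs to a filtered-out control record, which is exactly the case that leaves the consumer stuck today.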
Hey @kkurten and @drojas, thanks for the excellent report. I will investigate what's happening. It's summer holidays in Sweden (where I live), so we are a bit slow to reply this time of the year. We are planning the 1.9 release; I will take a quick look and check if I can include a fix for this in the upcoming release. |
Just wanted to chime in and say that these are some top-notch issues! I wish all issues had this much research put into them and were this clearly explained. |
We had a quick go at consuming messages written transactionally before, and I ran into a similar situation, but was clouded by a whole bunch of other issues. Reading what's happening here, I'm pretty sure that was at least a factor in it. Great find! |
any solution or workaround? |
Not yet, this is almost at the top of my list. 😭 |
any progress on this issue? |
I just wanted to chime in and confirm a possible situation @drojas mentioned above. It is indeed possible to get a batch consisting entirely of control messages, which causes the consumer to get stuck reprocessing the same control batch indefinitely. Using […] in […]. |
I started to look at this now, and I have a unit test that can reproduce the problem (a.k.a. a control record at the end of the batch). The Java client auto-increments the fetch offset:

```java
// control records are not returned to the user
if (!currentBatch.isControlBatch()) {
    return record;
} else {
    // Increment the next fetch offset when we skip a control batch.
    nextFetchOffset = record.offset() + 1;
}
```

KafkaJS will have to do a similar thing, where you can disable auto-resolve but still be able to fetch the next batch. I will have a PR soon so you can take a look. |
We have a simple data pipeline where events are consumed from topic A and written to topic B by a transactional (Java) Streams application. A Node.js microservice then consumes events from topic B (using kafkajs) and writes events to one of our APIs.
We noticed that when kafkajs receives a control batch (https://kafka.apache.org/documentation/#controlbatch) that contains only a single control record, the consumer gets permanently stuck. Since the batch contains only a control record, kafkajs filters that record out and the batch becomes empty. Since we are not actually consuming any messages, the offset is not advanced, so the next fetch receives the same control batch and the same thing happens again. At this point our consumer is effectively in a busy loop, constantly fetching the same data and not advancing anywhere (CPU usage also spikes quite a lot). We have a lot of microservices that use kafkajs, and this issue seems to affect only topics that contain control batches.
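The loop can be reproduced with a toy model (the `simulateFetchLoop` function and the in-memory log below are invented for illustration; this is not kafkajs code): the consumer filters control records before handing the batch to the application, so unless something advances the fetch position past the control record, every fetch returns the same batch.

```javascript
// Toy reproduction of the busy loop. The "log" is a partition containing
// one user record followed by one control record (as written by a
// transactional producer on commit).
function simulateFetchLoop(advancePastControlBatch, iterations) {
  const log = [
    { offset: 0, control: false },
    { offset: 1, control: true },
  ]
  let fetchOffset = 0
  for (let i = 0; i < iterations; i++) {
    const batch = log.filter(r => r.offset >= fetchOffset)
    const userRecords = batch.filter(r => !r.control)
    if (userRecords.length > 0) {
      // Normal case: consuming user messages advances the offset.
      fetchOffset = userRecords[userRecords.length - 1].offset + 1
    } else if (advancePastControlBatch && batch.length > 0) {
      // The fix: skip past the control record even though nothing was consumed.
      fetchOffset = batch[batch.length - 1].offset + 1
    }
    // Otherwise fetchOffset is unchanged and the next fetch repeats forever.
  }
  return fetchOffset
}
```

Without the fix, the offset stays pinned just before the control record no matter how many fetches run; with it, the consumer moves past the end of the log.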
I have not experienced this issue with the Java consumer, but I'm not sure whether this is actually a bug in kafkajs or "working as intended". We are currently using the eachMessage handler in our kafkajs microservice, so we don't have much control over how e.g. offsets are managed. Should we just use eachBatch in this kind of scenario to manually advance the offset, or is there perhaps something in the kafkajs implementation that could be improved to avoid this situation completely?
Kafka version: 2.2
Kafkajs version: 1.8
Kafka consumer group offsets
Kafkajs logs
Logs are full of these fetch requests; I copied only a few here as an example.
Example of control batch in kafkajs