Unexpected error "Offset out of range" causes the consumer group to drop messages #1119
It would be helpful for us to answer at least one of the following questions:

What causes the "Out Of Range" error?
Is there any way we could manually handle recovering from the fetch error?
Can this be related to corrupted metadata in Zookeeper?
Hi @solanoepalacio, sorry if I'm not answering any of your questions; I don't really know Kafka, but I did some investigation into the issue to try to help, and I found a related issue that may give you more background or clues about a possible root cause, in case you haven't seen it yet: #578 (comment). In addition, I was looking at the source code of this repo and found that inside the fetch() method, when an "Offset out of range" error is thrown, an exception of type KafkaJSOffsetOutOfRange (which extends KafkaJSProtocolError) is propagated as part of the promise chain that tries to recover (in this case by resetting to the latest offset); see ConsumerGroup.js, the fetch() method at line 410 and the propagation of the exception at line 622. Maybe you can use these to get more control and implement a custom way to recover a reliable offset.
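Purely as a starting point, a rough sketch of what hooking into that error could look like; it assumes the error actually surfaces through the consumer's CRASH instrumentation event (it may instead be swallowed by the internal reset-to-default logic), and the broker address, group id, topic and offset are placeholders:

```js
const { Kafka } = require('kafkajs')

const kafka = new Kafka({ clientId: 'my-app', brokers: ['broker:9092'] }) // placeholder broker
const consumer = kafka.consumer({ groupId: 'my-group' })

consumer.on(consumer.events.CRASH, ({ payload }) => {
  const { error } = payload
  if (error && error.name === 'KafkaJSOffsetOutOfRange') {
    // At this point you could pick a "known good" offset from your own
    // bookkeeping and seek to it, instead of letting the consumer fall back
    // to the default offset, e.g.:
    // consumer.seek({ topic: 'my-topic', partition: 0, offset: '12345' })
  }
})
```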
I have a hypothesis, but without more information it's very difficult to know if this is indeed the issue. At least it could be some place to start.

Offset out of range means that your consumer tried to either fetch or commit an offset that is not within the range of offsets maintained by the broker. Meaning if the broker has messages for offsets between 100 - 500, and you try to commit 99 or 501, the committed offset is clearly invalid and cannot be used. There are a few ways this could happen, but one of them is due to retention. Specifically, I suspect retention.bytes.

If you don't know the configuration of the topic, you can get it via KafkaJS:

```js
const { ConfigResourceTypes } = require('kafkajs')

// assumes a connected admin client, e.g. const admin = kafka.admin(); await admin.connect()
await admin.describeConfigs({
  includeSynonyms: false,
  resources: [
    {
      type: ConfigResourceTypes.TOPIC,
      name: 'topic-name',
      configNames: ['retention.bytes']
    }
  ]
})
```

There's also retention.ms for time-based retention.

As for the specific questions: keep in mind that my answers are to the best of my current knowledge. It's of course possible that there's some unintended interaction that I'm not aware of.
No. There's not really a difference between consuming from several topics and consuming from one topic but several partitions. The client primarily operates on partitions, not topics, so if this has something to do with it you'd see the same issue even when just consuming from a single topic. We also have tons of users consuming from several topics with no issues.
Multiple consumer instances are completely isolated from one another and share nothing except client configuration, so this should be unrelated. The one caveat here is to make sure that they belong to different consumer groups if they are consuming different topics. All members of a consumer group should be identical (or eventually become identical, in the case of a rolling deployment, for example). But whether you have several consumer instances in the same process or in different processes doesn't matter, other than them competing for time in the event loop.
Not in an automated way, as far as I can think of. You could manually reset the consumer group offsets to a known good value, but the question is what value to resume from. The core issue is that the consumer is trying to commit an offset that doesn't exist, so how do you know what offset to reset to? That's why the client falls back to the only reasonable default, which is what the consumer is set to start from if there is no offset to resume from (fromBeginning).
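For what it's worth, a rough sketch of what such a manual reset could look like with the admin client; the group/topic names and the offset value are placeholders, and the consumer group must have no running members while you do this:

```js
const { Kafka } = require('kafkajs')

const kafka = new Kafka({ clientId: 'my-app', brokers: ['broker:9092'] }) // placeholder broker
const admin = kafka.admin()

const resetGroup = async () => {
  await admin.connect()

  // Either reset the whole topic for the group to the earliest offsets still on the broker...
  await admin.resetOffsets({ groupId: 'my-group', topic: 'topic-name', earliest: true })

  // ...or set an explicit offset per partition, if you know a good value to resume from.
  await admin.setOffsets({
    groupId: 'my-group',
    topic: 'topic-name',
    partitions: [{ partition: 0, offset: '12345' }],
  })

  await admin.disconnect()
}

resetGroup().catch(console.error)
```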
I can't say with any certainty, but it wouldn't be my first guess. The brokers need to know which log offsets are valid, so they must have this information themselves, rather than having to go to Zookeeper for it every time.
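To see the range of offsets the broker currently holds (the 100 - 500 in the example above), you can ask for the per-partition watermarks; a small sketch with a placeholder topic name:

```js
const { Kafka } = require('kafkajs')

const kafka = new Kafka({ clientId: 'my-app', brokers: ['broker:9092'] }) // placeholder broker
const admin = kafka.admin()

const showWatermarks = async () => {
  await admin.connect()
  // One entry per partition: `low` and `high` are the earliest and latest
  // offsets still available on the broker for that partition.
  const offsets = await admin.fetchTopicOffsets('topic-name')
  console.log(offsets) // e.g. [{ partition: 0, offset: '501', high: '501', low: '100' }]
  await admin.disconnect()
}

showWatermarks().catch(console.error)
```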
I have the same issue.
Excuse me, has this problem been solved in the end?
For me the issue was using the same consumer group for 2 separate queues. I haven't researched deeper, just created an additional consumer group and called it a day.
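A minimal sketch of that workaround, with placeholder broker address and topic/group names (subscribe shown with the kafkajs v2 `topics` form; v1 uses `{ topic: '...' }`):

```js
const { Kafka } = require('kafkajs')

const kafka = new Kafka({ clientId: 'my-app', brokers: ['broker:9092'] }) // placeholder broker

// One consumer group per "queue" (topic), instead of sharing a single groupId.
const consumerA = kafka.consumer({ groupId: 'queue-a-consumer' })
const consumerB = kafka.consumer({ groupId: 'queue-b-consumer' })

const run = async () => {
  await consumerA.connect()
  await consumerA.subscribe({ topics: ['queue-a'] })

  await consumerB.connect()
  await consumerB.subscribe({ topics: ['queue-b'] })
}

run().catch(console.error)
```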
We experienced the same error message, and had the case that one consumer group would start over.

Could this also be related to a consumer group wanting to read messages from a partition after the partition has moved to another broker, so that the client/consumer now needs to switch to the new partition leader? This is only a guess from our side; there were "things" going on with our Kafka cluster, as this happened during a node pool upgrade on a Kubernetes cluster.

The Kafka server version we're using is currently 3.2.0, set up with the Strimzi operator.
We experienced this same issue while updating one of our busier MSK clusters to 3.3.1. Our CPU usage was quite high during the upgrade, which we think may have contributed to the issue.

We had a similar situation to others, though. We had a handful of partitions get reset back to the beginning (we have fromBeginning: true). We keep 7-day retention, so this resulted in random partitions suddenly being behind by nearly 2M offsets, which was very noticeable in our alerting. Looking back at the logs after resetting offsets to catch us back up and whatnot, we saw the same error.

Would it be reasonable to change this behaviour?
Any updates on this?
We were never able to find a solution to the issue. We had to change our client implementation. I'll close the issue since I couldn't add any more data on the topic, nor reproduce it again.
Describe the bug
Unexpected error
[ConsumerGroup] Offset out of range, resetting to default offset
is causing the consumer group to jump from an offset with lag to the latest offset, skipping a huge number of messages.

To Reproduce
I haven't been able to reproduce this issue outside of the production environment of the services. Also, it happens only in some of the environments (the ones with higher traffic).
The use case in which we are experiencing this issue is as follows:
We have a producer service that pulls CSV files from an external service and produces one Kafka message for each record in the CSV file.
Many instances of a consumer service are connected to the topic as a consumer group and consume the messages, processing them according to business logic in write-intensive work (slower than the producer service).
This makes it so that when the producer gets some CSV files, some lag accumulates (a few million messages in a production environment).
It all works correctly for some time. But after a few hours we see this error:
This error is thrown during the consumer group fetch, and recoverFromFetch is what triggers the change in the consumer offset (to "latest").
Expected behaviour
I would expect the consumer to keep fetching messages from the last committed offset until it reaches the latest offset after processing all messages in between.
Observed behaviour
When this error is thrown, we still have many hundreds of thousands of messages in each topic partition left to consume. All of those messages are skipped when the consumer "jumps" to the latest offset.
Environment:
Additional context
We use eachBatch to handle message consumption. We thought that calling resolveOffset after processing a batch would help, but the outcome was the same.

I doubt that the fetched offset is erroneous. What else could be causing this? Could I work around this by setting the default offset config to 'none' and handling the error myself (fetching the last committed offset of the topic and seeking to that one)?
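For context, a minimal sketch of the eachBatch/resolveOffset pattern described above (broker address and topic/group names are placeholders, not our actual setup):

```js
const { Kafka } = require('kafkajs')

const kafka = new Kafka({ clientId: 'my-app', brokers: ['broker:9092'] }) // placeholder broker
const consumer = kafka.consumer({ groupId: 'my-group' })

const run = async () => {
  await consumer.connect()
  await consumer.subscribe({ topics: ['topic-name'], fromBeginning: false })

  await consumer.run({
    eachBatch: async ({ batch, resolveOffset, heartbeat, commitOffsetsIfNecessary, isRunning, isStale }) => {
      for (const message of batch.messages) {
        if (!isRunning() || isStale()) break
        // ...write-intensive business logic per message...
        resolveOffset(message.offset) // mark this offset as processed
        await heartbeat()             // keep group membership alive during slow processing
      }
      await commitOffsetsIfNecessary() // commit whatever has been resolved so far
    },
  })
}

run().catch(console.error)
```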