"Specified group generation id is not valid" after broker maintenance, consumer stops receiving events #1466

Closed
hbazan-pp opened this issue Oct 17, 2022 · 11 comments · Fixed by #1474

Comments

@hbazan-pp

Hi, we are having an issue similar to #1009, but it happens after broker maintenance.
We have consumers running in parallel on different machines, with a heartbeat check triggered on eachBatch.
We consume multiple topics, with a dedicated instance of our service per topic.
All of this works fine, but we have had issues (twice already) when brokers go into maintenance.
Some of the instances (and thus some of the topics) stop consuming events, but they neither throw errors nor crash (if they crashed, we would respawn them and everything would be OK).
We do see the error message:
[Consumer] Crash: KafkaJSNonRetriableError: Specified group generation id is not valid
But the process doesn't actually crash, and the instance goes stale: it won't consume any new messages or trigger the heartbeat. If we restart the instance, it consumes all pending traffic (given the offset is still current).
The odd thing is that some of the topics keep working fine after the maintenance, so the overall system seems to be "up" unless we check each specific topic.

@IvanRogovskiy

IvanRogovskiy commented Oct 20, 2022

I have pretty much the same thing. I have a connection to 11 topics, and when I start receiving messages I see the logs below:

{"level":"WARN","timestamp":"2022-10-05T08:27:56.258Z","logger":"kafkajs","message":"[ConsumerGroup] Topic has been updated, resync group"


{"level":"ERROR","timestamp":"2022-10-05T08:27:58.856Z","logger":"kafkajs","message":"[Connection] Response SyncGroup(key: 14, version: 3)","error":"Specified group generation id is not valid","correlationId":87,"size":14}

and after that, the message that the consumer has been stopped. Increasing the heartbeat interval and sessionTimeout didn't help.

@alldayalone

alldayalone commented Nov 2, 2022

Same thing for us

Nov 2, 2022 @ 09:31:39.581 [error]: [Consumer] Response Heartbeat(key: 12, version: 2) {"broker":"xxx","clientId":"xxx","error":"Specified group generation id is not valid","correlationId":14,"size":10,}
Nov 2, 2022 @ 09:31:44.532 [error]: [Consumer] Crash: KafkaJSNonRetriableError: Specified group generation id is not valid {"stack":"KafkaJSNonRetriableError: Specified group generation id is not valid\n    at ..."}
Nov 2, 2022 @ 09:31:44.538 [info]: [Consumer] Consumer has crashed {"type":"consumer.crash","payload":{"error":{"name":"KafkaJSNonRetriableError","retriable":false,"cause":{"name":"KafkaJSProtocolError","retriable":false,"type":"ILLEGAL_GENERATION","code":22}},"restart":false}}
Nov 2, 2022 @ 09:31:44.538 [info]: [Consumer] Consumer has disconnected {"type":"consumer.disconnect"}
Nov 2, 2022 @ 09:31:44.538 [info]: [Consumer] Consumer has stopped {"type":"consumer.stop"}

After that, it just hangs until manually restarted.

This happened at the end of (or right after) the AWS Kafka maintenance operation "Heal cluster".
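
The crash payload in the log above (restart: false, with a cause of type ILLEGAL_GENERATION and code 22) can be recognized programmatically. The following is only a sketch: `isIllegalGenerationCrash` is a hypothetical helper name, not part of KafkaJS, and it assumes the event has the same shape as the logged `consumer.crash` payload.

```javascript
// Hypothetical helper (not a KafkaJS API): returns true when a "consumer.crash"
// event matches the shape logged above, i.e. the library reports restart: false
// and the error (or one of its nested causes) is ILLEGAL_GENERATION (code 22).
function isIllegalGenerationCrash(event) {
  const payload = (event && event.payload) || {};

  // Only crashes that the library will not restart on its own are of interest.
  if (payload.restart !== false) return false;

  // Walk the cause chain; the depth limit guards against unexpected cycles.
  let err = payload.error;
  for (let depth = 0; err && depth < 10; depth += 1) {
    if (err.type === 'ILLEGAL_GENERATION' || err.code === 22) return true;
    err = err.cause;
  }
  return false;
}
```

A check like this could be used inside a crash listener to alert on, or react to, exactly this failure rather than every non-retriable error.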

@jakewins
Contributor

jakewins commented Nov 2, 2022

Ran into this as well, proposed fix: #1474

@h0od

h0od commented Nov 14, 2022

I've also encountered this. Rejoin should be correct in this case.

@ErlendFax

We are seeing the same thing after a GKE update.

Does anyone know a workaround while we wait?

@rpastore-wolt

> We are seeing the same thing after a GKE update.
>
> Does anyone know a workaround while we wait?

@ErlendFax have you found a workaround other than manually restarting the consumer?

> I've also encountered this. Rejoin should be correct in this case.

@h0od when you say rejoin, should the library handle it, or should it be done within the consumer code?

Thanks 🙏

@ErlendFax

We have not. Just hoping it won't fail again. I'm interested in a workaround/solution as well.

@h0od

h0od commented Nov 28, 2022

> @h0od when you say rejoin, should the library handle it, or should it be done within the consumer code?

The library should try to rejoin, exactly like it does when the group is rebalancing.

@vpriem

vpriem commented Dec 13, 2022

Same here as well: nodes are being rotated, and then the consumer just stops consuming:

[Connection]: Response Fetch(key: 1, version: 11): This server is not the leader for that topic-partition
[Connection]: Response SyncGroup(key: 14, version: 3): This is not the correct coordinator for this group
[Connection]: Response JoinGroup(key: 11, version: 5): The coordinator is loading and hence can't process requests for this group
[Connection]: Response Heartbeat(key: 12, version: 3): Specified group generation id is not valid
... retries
[Consumer]: Crash: KafkaJSNonRetriableError: Specified group generation id is not valid
[Consumer]: Stopped

I think ILLEGAL_GENERATION should be considered retriable in KafkaJS, so that the consumer can be restarted via restartOnFailure.

@guiestimoneon

Hello guys

I am having this issue when I scale my application horizontally. The pod is processing normally and out of nowhere I get this error:

image

I suspect a rebalance has occurred and the pod still tries to commit a message. I'm using the .NET library.

@ErlendFax

As a workaround, one could try something like this:

kafkaClient.consumer.on("consumer.crash", (event) => {
  if (event.payload.error.name === "KafkaJSNonRetriableError") {
    process.exit(1); // will initiate a k8s restart

    // ... or do something else like reconnecting and starting `run` again ...
  }
});
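
The "reconnecting and starting `run` again" alternative mentioned in the comment could look roughly like the sketch below. It is an assumption-laden illustration, not a confirmed fix: `restartOnNonRetriableCrash`, `subscribeAndRun`, `delayMs` and `onFailure` are hypothetical names, and it assumes a KafkaJS consumer can be reconnected and run again after a crash, which has not been verified against a real cluster here.

```javascript
// Hypothetical helper (not a KafkaJS API): restarts the consumer in-process when
// it crashes with a non-retriable error that the library will not restart itself.
// `consumer` is expected to expose on() and connect() like a KafkaJS consumer.
// `subscribeAndRun` is a caller-supplied async function that calls
// consumer.subscribe(...) and consumer.run(...) with the application's handlers.
function restartOnNonRetriableCrash(consumer, subscribeAndRun, options = {}) {
  const delayMs = options.delayMs === undefined ? 5000 : options.delayMs;
  // Default fallback mirrors the workaround above: exit and let k8s respawn the pod.
  const onFailure = options.onFailure || (() => process.exit(1));
  let restarting = false;

  return consumer.on('consumer.crash', async (event) => {
    const { error, restart } = event.payload;

    // Skip if the library restarts the consumer itself, or a restart is in progress.
    if (restart || restarting) return;
    if (!error || error.name !== 'KafkaJSNonRetriableError') return;

    restarting = true;
    try {
      // Short pause so the brokers have a chance to settle after maintenance.
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      await consumer.connect();
      await subscribeAndRun(consumer);
    } catch (restartError) {
      onFailure(restartError);
    } finally {
      restarting = false;
    }
  });
}
```

Usage would be along the lines of `restartOnNonRetriableCrash(kafkaClient.consumer, startConsuming)`, where `startConsuming` is your own function. Falling back to `process.exit(1)` keeps the original behavior if the in-process restart itself fails.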
