Metadata is not refreshed on connection errors #950
So I've looked into it a bit more and checked out #919, which seems related to this one. I tried to simulate it locally using @Nevon's branch. My slightly modified fork of the `onTimeout` handler:
```js
const onTimeout = async () => {
  const error = new KafkaJSConnectionError('Connection timeout', {
    broker: `${this.host}:${this.port}`,
  })
  this.logError(error.message)
  // …
}
```
- We see `Failed to connect to broker`, which comes from here (line 328):
kafkajs/src/cluster/brokerPool.js, lines 315 to 328 in daeec89:

```js
// Connection refused means this node is down, or the cluster is restarting,
// which requires metadata refresh to discover the new nodes
if (e.code === 'ECONNREFUSED') {
  return bail(e)
}

// Rebuild the connection since it can't recover from illegal SASL state
broker.connection = await this.connectionBuilder.build({
  host: broker.connection.host,
  port: broker.connection.port,
  rack: broker.connection.rack,
})

this.logger.error(`Failed to connect to broker, reconnecting`, { retryCount, retryTime })
```
We know we're not getting `ECONNREFUSED`, because we are timing out, so this if-branch isn't executed, creating the loop we see in the logs. If we did get `ECONNREFUSED` here, we couldn't have logged `Failed to connect to broker`.
Hypothesis
The comment above the check for `ECONNREFUSED` says:

```js
// Connection refused means this node is down, or the cluster is restarting,
// which requires metadata refresh to discover the new nodes
```
This comment perfectly describes what is happening in our scenario, but the if-condition seems too specific: it should also take the timeout error into account. WDYT? @tulios @Nevon
@tulios, I know you mentioned that we get a timeout because the broker isn't completely shut down yet, but we can't really wait for that to happen, since it's evidently taking too long. Would there be any downside to bailing early on a timeout?
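To make the proposal concrete, here is a minimal sketch of a broadened bail condition. This is hypothetical, not the actual kafkajs patch: the helper name `shouldBailAndRefreshMetadata` and the timeout detection (the `ETIMEDOUT` code and the `Connection timeout` message from the `onTimeout` handler above) are assumptions for illustration.

```javascript
// Hypothetical sketch (not the actual kafkajs patch): broaden the bail
// condition so connection timeouts also trigger a metadata refresh,
// instead of retrying the same dead broker.
const shouldBailAndRefreshMetadata = e =>
  e.code === 'ECONNREFUSED' ||            // node down / cluster restarting
  e.code === 'ETIMEDOUT' ||               // TCP-level connect timeout
  /connection timeout/i.test(e.message)   // KafkaJSConnectionError('Connection timeout')

// Plain objects stand in for real connection errors:
console.log(shouldBailAndRefreshMetadata({ code: 'ECONNREFUSED', message: 'connect ECONNREFUSED' })) // true
console.log(shouldBailAndRefreshMetadata({ message: 'Connection timeout' }))                         // true
console.log(shouldBailAndRefreshMetadata({ message: 'Illegal SASL state' }))                         // false
```

With a check like this, both refused connections and timeouts would hit the `bail(e)` path and trigger a metadata refresh, while other errors (e.g. the illegal SASL state case) would still go through the connection rebuild.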
When we get a connection timeout, we currently don't always refresh metadata. Fixes #950.
Describe the bug
⚠️ Disclaimer: This is primarily a hypothesis to explain an issue I have in production (weekly maintenance).
When a broker is killed remotely, despite the resulting connection errors, kafkajs doesn't realize that the node is currently inactive and keeps trying to send messages until it inevitably fails with a `KafkaJSNonRetriableError`.
To Reproduce
I wrote a test and fix for this in a fork (commenting out the fix fails the test).
Expected behavior
I expect new metadata to be fetched if there's a connection error when attempting to send messages to the broker.
Observed behavior
Metadata does not seem to be refreshed, since kafkajs retries the exact same broker until it fails.
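For illustration, the observed loop can be modeled as a retry that never re-resolves its target (a hypothetical toy model, not kafkajs internals; the function name and broker address are made up):

```javascript
// Hypothetical model of the observed behavior (not kafkajs code): the leader
// for the partition is resolved once from cached metadata, so every retry
// targets the same dead broker until the retry budget runs out.
function attemptedBrokers(cachedLeader, maxRetries) {
  const attempts = []
  for (let i = 0; i < maxRetries; i++) {
    attempts.push(cachedLeader) // metadata is never refreshed between attempts
  }
  return attempts
}

console.log(attemptedBrokers('broker-1:9092', 3))
// -> [ 'broker-1:9092', 'broker-1:9092', 'broker-1:9092' ]
```

Refreshing metadata between attempts would instead let the client discover the replacement node.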
Logs from my service:
Environment