
Metadata is not refreshed on connection errors #950

Closed
smartinio opened this issue Nov 11, 2020 · 1 comment · Fixed by #962


smartinio commented Nov 11, 2020

Describe the bug
⚠️ Disclaimer: This is primarily a hypothesis to explain an issue I hit in production during weekly maintenance.

When a broker is killed remotely, kafkajs doesn't realize the node is down despite the resulting connection errors, and keeps trying to send messages to it until it inevitably fails with a KafkaJSNonRetriableError.

To Reproduce
I wrote a test and fix for this in a fork (commenting out the fix makes the test fail).
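
For context, a minimal producer along these lines is enough to hit the retry loop once the broker it's connected to is killed. Broker address, topic and clientId are placeholders, not taken from my setup:

const { Kafka } = require('kafkajs')

const kafka = new Kafka({
  clientId: 'repro-950',               // placeholder
  brokers: ['broker-1.internal:9094'], // placeholder seed broker
})

const producer = kafka.producer()

const run = async () => {
  await producer.connect()
  // While this loop is running, kill the broker the producer is connected to.
  // Instead of refreshing metadata and moving on to another node, kafkajs keeps
  // reconnecting to the same broker until the retrier gives up with a
  // KafkaJSNonRetriableError.
  setInterval(() => {
    producer
      .send({ topic: 'example-topic', messages: [{ value: 'ping' }] })
      .catch(e => console.error(e.name, e.message))
  }, 1000)
}

run().catch(console.error)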

Expected behavior
I expect new metadata to be fetched if there's a connection error when attempting to send messages to the broker.

Observed behavior
Metadata does not seem to be refreshed: kafkajs keeps retrying the exact same broker until it fails.

Logs from my service:

{"level":"ERROR","timestamp":"2020-11-10T11:09:29.843Z","logger":"kafkajs","message":"[Connection] Connection timeout","broker":"k-12.production.eu1.[masked]:9094","clientId":"[masked]"}

{"level":"ERROR","timestamp":"2020-11-10T11:09:29.844Z","logger":"kafkajs","message":"[BrokerPool] Failed to connect to broker, reconnecting","retryCount":0,"retryTime":301}

{"level":"ERROR","timestamp":"2020-11-10T11:09:35.145Z","logger":"kafkajs","message":"[Connection] Connection timeout","broker":"k-12.production.eu1.[masked]:9094","clientId":"[masked]"}

{"level":"ERROR","timestamp":"2020-11-10T11:09:35.146Z","logger":"kafkajs","message":"[BrokerPool] Failed to connect to broker, reconnecting","retryCount":1,"retryTime":660}

{"level":"ERROR","timestamp":"2020-11-10T11:09:40.806Z","logger":"kafkajs","message":"[Connection] Connection timeout","broker":"k-12.production.eu1.[masked]:9094","clientId":"[masked]"}

{"level":"ERROR","timestamp":"2020-11-10T11:09:40.807Z","logger":"kafkajs","message":"[BrokerPool] Failed to connect to broker, reconnecting","retryCount":2,"retryTime":1396}

{"level":"ERROR","timestamp":"2020-11-10T11:09:47.203Z","logger":"kafkajs","message":"[BrokerPool] Failed to connect to broker, reconnecting","retryCount":3,"retryTime":2594}

{"level":"ERROR","timestamp":"2020-11-10T11:09:54.797Z","logger":"kafkajs","message":"[Connection] Connection timeout","broker":"k-12.production.eu1.[masked]:9094","clientId":"[masked]"}

{"level":"ERROR","timestamp":"2020-11-10T11:09:54.798Z","logger":"kafkajs","message":"[BrokerPool] Failed to connect to broker, reconnecting","retryCount":4,"retryTime":4876}

{"level":"ERROR","timestamp":"2020-11-10T11:10:04.674Z","logger":"kafkajs","message":"[Connection] Connection timeout","broker":"k-12.production.eu1.[masked]:9094","clientId":"[masked]"}

{"level":"ERROR","timestamp":"2020-11-10T11:10:04.674Z","logger":"kafkajs","message":"[BrokerPool] Failed to connect to broker, reconnecting","retryCount":5,"retryTime":8040}

{"level":"ERROR","timestamp":"2020-11-10T11:10:04.675Z","logger":"kafkajs","message":"[Producer] Connection timeout","retryCount":0,"retryTime":284}

Environment

  • OS: Linux
  • KafkaJS version: 1.12.0
  • Kafka version: 2.4
  • NodeJS version: 12.18.4

smartinio commented Nov 13, 2020

So I've looked into this a bit more and checked out #919, which seems somewhat related. I tried to simulate it locally using @Nevon's repro-919 branch.

My slightly modified fork of repro-919 can be found here

  1. I rebased it onto latest master
  2. I started the local kafka cluster using ./scripts/dockerComposeUp.sh
  3. I ran node examples/producer.js
  4. The request hangs just like it does for @Nevon
  5. I followed the source code all the way down to scheduleRequestTimeoutCheck, which is responsible for making sure the request doesn't hang forever
  6. I found that it's essentially a no-op because enforceRequestTimeout is false by default (document enforceRequestTimeout #955)
  7. I set enforceRequestTimeout: true and now it doesn't hang anymore, but it eventually fails because the broker is blocked by toxiproxy
  8. I added one more seed broker to the setup (see the config sketch after this list), and it still fails without ever trying to use the second broker
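
For reference, here's roughly what the client config looked like for steps 7 and 8. This is only a sketch: the broker addresses and clientId are placeholders for the local toxiproxy setup from repro-919, not my real config.

const { Kafka } = require('kafkajs')

// Enforce the request timeout so scheduleRequestTimeoutCheck actually rejects
// hung requests, and list a second seed broker that kafkajs could fail over to
// (but currently doesn't).
const kafka = new Kafka({
  clientId: 'repro-919',                         // placeholder client id
  brokers: ['localhost:9092', 'localhost:9095'], // placeholder ports
  enforceRequestTimeout: true,
  requestTimeout: 30000,
})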

Since I'm on 1.12.0 and scheduleRequestTimeoutCheck was introduced after that, that likely explains why I don't hit the specific issue related to enforceRequestTimeout. But the problem of not switching brokers happens both in 1.12.0 and in later versions: I've seen it in services running 1.14.0, which is partly why we haven't upgraded yet.

What do we know?

  1. We see Connection timeout, which comes from here:

const onTimeout = async () => {
  const error = new KafkaJSConnectionError('Connection timeout', {
    broker: `${this.host}:${this.port}`,
  })
  this.logError(error.message)
  // ...
}
  2. We see Failed to connect to broker, which comes from here (line 328):

// Connection refused means this node is down, or the cluster is restarting,
// which requires metadata refresh to discover the new nodes
if (e.code === 'ECONNREFUSED') {
  return bail(e)
}

// Rebuild the connection since it can't recover from illegal SASL state
broker.connection = await this.connectionBuilder.build({
  host: broker.connection.host,
  port: broker.connection.port,
  rack: broker.connection.rack,
})

this.logger.error(`Failed to connect to broker, reconnecting`, { retryCount, retryTime })

We know we're not getting ECONNREFUSED, because we're timing out instead, so this if-branch isn't taken, creating the loop we see in the logs. If we did get ECONNREFUSED here, we would have bailed before ever logging Failed to connect to broker.

Hypothesis

The comment above the ECONNREFUSED check says:

// Connection refused means this node is down, or the cluster is restarting,
// which requires metadata refresh to discover the new nodes

This comment describes exactly what is happening in our scenario, but the if-condition seems too specific: it should also take the timeout error into account. WDYT? @tulios @Nevon

I know you mentioned, @tulios, that we get a timeout because the broker isn't completely shut down yet, but we can't really wait for that since it's evidently taking too long. Would there be any downside to also bailing early on a timeout?
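
To make the idea concrete, here is one possible shape of the change. This is purely a sketch, assuming the timeout surfaces as the KafkaJSConnectionError built in onTimeout above; matching on the message is brittle, so the real check would probably be on an error type or code instead.

// Hypothetical widening of the bail condition in the broker pool retrier:
// bail on connection timeouts as well, so the caller refreshes metadata
// instead of reconnecting to the same dead broker.
if (e.code === 'ECONNREFUSED' || e.message === 'Connection timeout') {
  return bail(e)
}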

Nevon added a commit that referenced this issue Nov 13, 2020
In case we get a connection timeout, we currently don't refresh metadata in all cases. Fixes #950.