
Metadata is not refreshed on connection errors #950

Closed
smartinio opened this issue Nov 11, 2020 · 1 comment · Fixed by #962


smartinio commented Nov 11, 2020

Describe the bug
⚠️ Disclaimer: This is primarily a hypothesis to explain an issue I hit in production during weekly maintenance.

When a broker is killed remotely, kafkajs doesn't realize the node is down despite the resulting connection errors, and keeps trying to send messages to it until it inevitably fails with a KafkaJSNonRetriableError.

To Reproduce
I wrote a test and fix for this in a fork (commenting out the fix makes the test fail).
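
For context, a minimal producer along these lines is enough to hit the retry loop once the broker it's connected to is killed. Broker address, topic and clientId are placeholders, not taken from my setup:

const { Kafka } = require('kafkajs')

const kafka = new Kafka({
  clientId: 'repro-950',               // placeholder
  brokers: ['broker-1.internal:9094'], // placeholder seed broker
})

const producer = kafka.producer()

const run = async () => {
  await producer.connect()
  // While this loop is running, kill the broker the producer is connected to.
  // Instead of refreshing metadata and moving on to another node, kafkajs keeps
  // reconnecting to the same broker until the retrier gives up with a
  // KafkaJSNonRetriableError.
  setInterval(() => {
    producer
      .send({ topic: 'example-topic', messages: [{ value: 'ping' }] })
      .catch(e => console.error(e.name, e.message))
  }, 1000)
}

run().catch(console.error)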

Expected behavior
I expect new metadata to be fetched if there's a connection error when attempting to send messages to the broker.

Observed behavior
Metadata does not seem to be refreshed: kafkajs keeps retrying the exact same broker until it fails.

Logs from my service:

{"level":"ERROR","timestamp":"2020-11-10T11:09:29.843Z","logger":"kafkajs","message":"[Connection] Connection timeout","broker":"k-12.production.eu1.[masked]:9094","clientId":"[masked]"}

{"level":"ERROR","timestamp":"2020-11-10T11:09:29.844Z","logger":"kafkajs","message":"[BrokerPool] Failed to connect to broker, reconnecting","retryCount":0,"retryTime":301}

{"level":"ERROR","timestamp":"2020-11-10T11:09:35.145Z","logger":"kafkajs","message":"[Connection] Connection timeout","broker":"k-12.production.eu1.[masked]:9094","clientId":"[masked]"}

{"level":"ERROR","timestamp":"2020-11-10T11:09:35.146Z","logger":"kafkajs","message":"[BrokerPool] Failed to connect to broker, reconnecting","retryCount":1,"retryTime":660}

{"level":"ERROR","timestamp":"2020-11-10T11:09:40.806Z","logger":"kafkajs","message":"[Connection] Connection timeout","broker":"k-12.production.eu1.[masked]:9094","clientId":"[masked]"}

{"level":"ERROR","timestamp":"2020-11-10T11:09:40.807Z","logger":"kafkajs","message":"[BrokerPool] Failed to connect to broker, reconnecting","retryCount":2,"retryTime":1396}

{"level":"ERROR","timestamp":"2020-11-10T11:09:47.203Z","logger":"kafkajs","message":"[BrokerPool] Failed to connect to broker, reconnecting","retryCount":3,"retryTime":2594}

{"level":"ERROR","timestamp":"2020-11-10T11:09:54.797Z","logger":"kafkajs","message":"[Connection] Connection timeout","broker":"k-12.production.eu1.[masked]:9094","clientId":"[masked]"}

{"level":"ERROR","timestamp":"2020-11-10T11:09:54.798Z","logger":"kafkajs","message":"[BrokerPool] Failed to connect to broker, reconnecting","retryCount":4,"retryTime":4876}

{"level":"ERROR","timestamp":"2020-11-10T11:10:04.674Z","logger":"kafkajs","message":"[Connection] Connection timeout","broker":"k-12.production.eu1.[masked]:9094","clientId":"[masked]"}

{"level":"ERROR","timestamp":"2020-11-10T11:10:04.674Z","logger":"kafkajs","message":"[BrokerPool] Failed to connect to broker, reconnecting","retryCount":5,"retryTime":8040}

{"level":"ERROR","timestamp":"2020-11-10T11:10:04.675Z","logger":"kafkajs","message":"[Producer] Connection timeout","retryCount":0,"retryTime":284}

Environment

  • OS: Linux
  • KafkaJS version: 1.12.0
  • Kafka version: 2.4
  • NodeJS version: 12.18.4

smartinio commented Nov 13, 2020

So I've looked into this a bit more and checked out #919, which seems somewhat related. I tried to simulate it locally using @Nevon's repro-919 branch.

My slightly modified fork of repro-919 can be found here

  1. I rebased it onto latest master
  2. I started the local kafka cluster using ./scripts/dockerComposeUp.sh
  3. I ran node examples/producer.js
  4. The request hangs just like it does for @Nevon
  5. I followed the source code all the way down to scheduleRequestTimeoutCheck, which is responsible for making sure the request doesn't hang forever
  6. I found that it's essentially a no-op because enforceRequestTimeout is false by default (document enforceRequestTimeout #955)
  7. I set enforceRequestTimeout: true and now it doesn't hang anymore, but it eventually fails because the broker is blocked by toxiproxy
  8. I added one more seed broker to the setup (see the config sketch after this list), and it still fails without ever trying to use the second broker
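
For reference, here's roughly what the client config looked like for steps 7 and 8. This is only a sketch: the broker addresses and clientId are placeholders for the local toxiproxy setup from repro-919, not my real config.

const { Kafka } = require('kafkajs')

// Enforce the request timeout so scheduleRequestTimeoutCheck actually rejects
// hung requests, and list a second seed broker that kafkajs could fail over to
// (but currently doesn't).
const kafka = new Kafka({
  clientId: 'repro-919',                         // placeholder client id
  brokers: ['localhost:9092', 'localhost:9095'], // placeholder ports
  enforceRequestTimeout: true,
  requestTimeout: 30000,
})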

Since I'm on 1.12.0 and scheduleRequestTimeoutCheck was introduced after that, that likely explains why I don't hit the specific issue related to enforceRequestTimeout. But the problem of not switching brokers happens both in 1.12.0 and in later versions: I've seen it in services running 1.14.0, which is partly why we haven't upgraded yet.

What do we know?

  1. We see Connection timeout, which comes from here:

const onTimeout = async () => {
  const error = new KafkaJSConnectionError('Connection timeout', {
    broker: `${this.host}:${this.port}`,
  })
  this.logError(error.message)
  // ...
}
  2. We see Failed to connect to broker, which comes from here (line 328):

// Connection refused means this node is down, or the cluster is restarting,
// which requires metadata refresh to discover the new nodes
if (e.code === 'ECONNREFUSED') {
  return bail(e)
}

// Rebuild the connection since it can't recover from illegal SASL state
broker.connection = await this.connectionBuilder.build({
  host: broker.connection.host,
  port: broker.connection.port,
  rack: broker.connection.rack,
})

this.logger.error(`Failed to connect to broker, reconnecting`, { retryCount, retryTime })

We know we're not getting ECONNREFUSED, because we're timing out instead, so this if-branch isn't taken, creating the loop we see in the logs. If we did get ECONNREFUSED here, we would have bailed before ever logging Failed to connect to broker.

Hypothesis

The comment above the ECONNREFUSED check says:

// Connection refused means this node is down, or the cluster is restarting,
// which requires metadata refresh to discover the new nodes

This comment describes exactly what is happening in our scenario, but the if-condition seems too specific: it should also take the timeout error into account. WDYT? @tulios @Nevon

I know you mentioned, @tulios, that we get a timeout because the broker isn't completely shut down yet, but we can't really wait for that since it's evidently taking too long. Would there be any downside to also bailing early on a timeout?
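
To make the idea concrete, here is one possible shape of the change. This is purely a sketch, assuming the timeout surfaces as the KafkaJSConnectionError built in onTimeout above; matching on the message is brittle, so the real check would probably be on an error type or code instead.

// Hypothetical widening of the bail condition in the broker pool retrier:
// bail on connection timeouts as well, so the caller refreshes metadata
// instead of reconnecting to the same dead broker.
if (e.code === 'ECONNREFUSED' || e.message === 'Connection timeout') {
  return bail(e)
}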

Nevon added a commit that referenced this issue Nov 13, 2020
In case we get a connection timeout, we currently don't refresh metadata in all cases. Fixes #950.