Remove nested retriers from producer #962
Conversation
In case we get a connection timeout, we currently don't refresh metadata in all cases. Fixes #950.
Co-authored-by: Sam <me@smartin.io>
This really just unwraps the error, as it would otherwise be wrapped in a KafkaJSNumberOfRetriesExceeded error (illustrated in the sketch below).
Co-authored-by: Sam <me@smartin.io>
…ios/kafkajs into remove-nested-retriers-from-producer
Co-authored-by: Sam <me@smartin.io>
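For a rough idea of what "unwrapping the error" means, here is an illustrative sketch (not the actual kafkajs producer code; it assumes KafkaJSNumberOfRetriesExceeded keeps the underlying error on an originalError property) of surfacing the real cause instead of the retry wrapper:

// Illustration only, not the actual kafkajs producer code.
// Assumption: KafkaJSNumberOfRetriesExceeded exposes the underlying error via
// `originalError`, so callers can react to the real cause (e.g. a connection
// timeout) instead of the generic "retries exceeded" wrapper.
const sendWithUnwrappedErrors = async send => {
  try {
    return await send()
  } catch (e) {
    if (e.name === 'KafkaJSNumberOfRetriesExceeded' && e.originalError) {
      throw e.originalError
    }
    throw e
  }
}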
@@ -393,14 +394,14 @@ describe('Cluster > BrokerPool', () => {
      expect(broker.isConnected()).toEqual(true)
    })

-   it('recreates the connection on connection errors', async () => {
+   it('recreates the connection on ILLEGAL_SASL_STATE error', async () => {
I'm not 100% sure about this part. The comment seems to indicate that we only recreate the connection because of ILLEGAL_SASL_STATE errors, but this test indicates that we want to do it on any connection error. Yet on ECONNREFUSED we don't actually recreate the connection, as we bail out before reaching that.
That said, recreating the connection on any connection error sounds like madness, as there's plenty of state there that needs to be managed. For example, the RequestQueue lives there, and any in-flight requests would need to be rejected (sketched below). So I don't think this was actually correct, even if we had a test that was trying to test for that.
Yes, I think the intention was to cover SASL errors
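To illustrate the state-management concern raised above, here is a minimal sketch (not KafkaJS's actual connection code) of why tearing down and recreating a connection on arbitrary errors is costly: the connection tracks its in-flight requests, and every one of them has to be rejected before the socket can be replaced.

// Minimal sketch of the problem, not KafkaJS's implementation: the connection
// owns a map of in-flight requests, so replacing it on any error means every
// pending request has to be explicitly rejected first.
class SketchConnection {
  constructor() {
    this.inflight = new Map() // correlationId -> { resolve, reject }
    this.nextCorrelationId = 0
  }

  send(payload) {
    return new Promise((resolve, reject) => {
      const correlationId = this.nextCorrelationId++
      this.inflight.set(correlationId, { resolve, reject, payload })
      // ...write the request to the socket here...
    })
  }

  destroy(reason) {
    // Orphaned requests would hang forever if they were not rejected here.
    for (const { reject } of this.inflight.values()) {
      reject(new Error(`Connection destroyed: ${reason}`))
    }
    this.inflight.clear()
  }
}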
This has been running in a fairly high-throughput service by @smartinio for about 24 hours now without any issues, including dealing with a Kafka cluster redeploy (the client reconnected without a bunch of pointless retries towards the same host).
There used to be a retrier in the messageProducer that would handle certain errors from sendMessage by doing cluster operations like refreshing metadata or reconnecting the cluster. sendMessages also needed to have a retrier, because it makes potentially many requests to different brokers and needs to retry only the failed requests (i.e. it needs to own the retry semantics). This was solved by having a hardcoded retrier inside sendMessages.

The problem with this is that if there's an error that leads to retries inside sendMessages, it may also get retried at the messageProducer level. So if you have 5 retries on the messageProducer, it could lead to up to 25 retries in total.

Closes #958, as this handles the issue by having sendMessage refresh metadata itself, instead of bailing and letting the messageProducer handle it.

Fixes #950
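To make the multiplication concrete, here is a small standalone sketch (a hypothetical retry helper, not the kafkajs retry module) of how nesting two retriers multiplies the number of attempts instead of adding them:

// Hypothetical helper, for illustration only: each retrier makes the initial
// call plus `retries` retries, so nesting them multiplies the attempt counts
// (6 x 6 = 36 calls with 5 retries at each level) instead of adding them.
const retry = retries => async fn => {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn()
    } catch (e) {
      if (attempt >= retries) throw e
    }
  }
}

const innerRetrier = retry(5) // stands in for the hardcoded retrier inside sendMessages
const outerRetrier = retry(5) // stands in for the retrier at the messageProducer level

let calls = 0
const alwaysFailingSend = async () => {
  calls++
  throw new Error('broker unreachable')
}

outerRetrier(() => innerRetrier(alwaysFailingSend)).catch(() => {
  console.log(`total calls: ${calls}`) // 36 — every outer attempt runs a full inner retry cycle
})

Removing the outer retry around sendMessages keeps it to a single retry cycle, which is the effect of this change.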