Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove nested retriers from producer #962

Merged
merged 9 commits into from
Nov 18, 2020

Conversation

Nevon
Copy link
Collaborator

@Nevon Nevon commented Nov 16, 2020

There used to be a retrier in the messageProducer that would handle certain errors from sendMessage by doing cluster operations like refreshing metadata or reconnecting the cluster. sendMessages also needed to have a retrier because it's doing potentially many requests to different brokers and needs to retry only the failed requests (i.e. it needs to own the retry semantics). This was solved by having a hardcoded retrier inside sendMessages.

The problem with this is that if there's an error that leads to retries inside sendMessages, it may also get retried on the messageProducer level. So if you have 5 retries on the messageProducer it could lead to up to 25 retries in total.

Closes #958 as this handles the issue by having sendMessage refresh metadata itself, instead of bailing and letting the messageProducer handle it.

Fixes #950

Nevon and others added 2 commits November 13, 2020 16:03
In case we get a connection timeout, we currently don't refresh metadata in all cases. Fixes #950.
Co-authored-by: Sam <me@smartin.io>
@Nevon Nevon requested a review from tulios November 16, 2020 10:49
Nevon and others added 4 commits November 16, 2020 12:06
This really just unwraps the error, as it would otherwise be wrapped in a KafkaJSNumberOfRetriesExceeded error

Co-authored-by: Sam <me@smartin.io>
…ios/kafkajs into remove-nested-retriers-from-producer
Co-authored-by: Sam <me@smartin.io>
@@ -393,14 +394,14 @@ describe('Cluster > BrokerPool', () => {
expect(broker.isConnected()).toEqual(true)
})

it('recreates the connection on connection errors', async () => {
it('recreates the connection on ILLEGAL_SASL_STATE error', async () => {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure about this part. The comment seems to indicate that we only recreate the connection because of ILLEGAL_SASL_STATE errors, but this test indicates that we want to do it on any connection error. Yet on ECONNREFUSED we don't actually recreate the connection, as we bail out before reaching that.

Yet recreating the connection on any connection error sounds like madness, as there's plenty of state there that needs to be managed. For example, the RequestQueue lives there, and any inflight requests would need to be rejected. So I don't think that this was actually correct, even if we had a test that was trying to test for that.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think the intention was to cover SASL errors

@Nevon
Copy link
Collaborator Author

Nevon commented Nov 18, 2020

This has been running in a fairly high throughput service by @smartinio for about 24h hours now without any issue, including dealing with a Kafka cluster redeploy (the client reconnected without a bunch of pointless retries towards the same host).

@Nevon Nevon requested a review from JaapRood November 18, 2020 15:41
@tulios tulios merged commit d13acd5 into master Nov 18, 2020
@tulios tulios deleted the remove-nested-retriers-from-producer branch November 18, 2020 15:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Metadata is not refreshed on connection errors
2 participants