Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: SIGNAL-7090 UPDATE async worker id setting per partition #97

Conversation

seungjinstord
Copy link
Contributor

@seungjinstord seungjinstord commented Sep 22, 2024

Related Ticket(s)

SIGNAL-7090

Checklist

Problem

While trying to use the BroadwayAdapter, I found that something in the AsyncAdapter was pushing the messages to just one partition.

Details

I was able to trace it back to the get_or_create_worker function. I found out that unless the :id is customized, and just AsyncWorker is used, then it doesn't matter whichever distinct name is used for Registry - the same worker pid is going be to tried to be retrieved. This means even though a different partition is found from AsyncAdapter, when it comes to get the associated worker pid - the first one created will be returned for any subsequent worker retrieval attempt.

Downstream effect is the consuming phase gets slowed down - only one partition will have all of the messages for the consumer group. Meaning, no matter how many consumers you have in the consumer group - only one consumer will be handling all of the messages pushed into the topic, across all of the partitions.

The fix is using a custom id that appends the partition number to the atom AsyncWorker. I string-concatenated it and didn't explicitly change it to atom, but I suspect it would be done under the hood.

As to why this was not detected in production, is probably because of the robustness of BEAM - a self-healing of crashed supervisor tree would end up picking random workers to be created, resulting in distributing messages across partitions.

But because for firehose we're passing a lot more messages, this became more visible.

TL;DR - I think this would increase overall performance of production consumption of Kafka messages for anywhere using AsyncAdapter. Meaning, this would increase consumption performance of services that consume from topic wms-service, which are GAS and WMS Bridge.

@seungjinstord seungjinstord self-assigned this Sep 22, 2024
@seungjinstord seungjinstord requested a review from a team September 23, 2024 18:59
@seungjinstord seungjinstord marked this pull request as ready for review September 23, 2024 19:00
@seungjinstord seungjinstord requested a review from a team as a code owner September 23, 2024 19:00
Copy link
Contributor

@btkostner btkostner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice find!

lib/kafee/producer/async_adapter.ex Outdated Show resolved Hide resolved
test/kafee/producer/async_adapter_test.exs Outdated Show resolved Hide resolved
test/kafee/producer/async_adapter_test.exs Outdated Show resolved Hide resolved
@seungjinstord seungjinstord merged commit 060dbd7 into main Sep 23, 2024
12 checks passed
@seungjinstord seungjinstord deleted the SIGNAL-7090-async-adapter-to-start-partition-based-async-worker branch September 23, 2024 22:26
seungjinstord pushed a commit that referenced this pull request Sep 23, 2024
An automated release has been created for you.
---


## [3.1.1](v3.1.0...v3.1.1)
(2024-09-23)


### Bug Fixes

* SIGNAL-7090 UPDATE async worker id setting per partition
([#97](#97))
([060dbd7](060dbd7))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants