Protocol error when connecting consumer to AWS MSK Serverless #1449
Running the same code through the Java client with debug logging, it looks like it picks protocol version 7:
So it works with version 7 in the Java client, but seemingly not with version 5 in kafkajs. So maybe MSK Serverless advertises that it supports protocol version 5, but actually it doesn't? That seems more likely than kafkajs incorrectly implementing version 5.
Update: no, if I force the Java client to use version 5, MSK Serverless still accepts it. But it's worth noting that the Java client's first message fails with an error saying it needs a member ID, and then it succeeds when it re-sends the message with the member ID:
This is expected and is how the protocol is supposed to work. As for the protocol version, the way it should work is that the broker and client compare their lists of supported versions and pick the highest one they both support. So if KafkaJS picked 5, that was the highest JoinGroup protocol version that both it and MSK support. You could try creating a regular MSK cluster and connecting to that, since for a regular MSK cluster you should have access to broker logs. That said, I'm not sure the logs will actually tell you anything useful. This is the problem with proprietary SaaS, where you don't have access to logs or source, and there's not even a way to test things without paying them for the privilege of supporting their software.
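The version negotiation described above can be sketched as follows. This is an illustrative helper, not the kafkajs API: each side advertises a min/max range per API key, and the effective version is the highest version inside both ranges.

```javascript
// Illustrative sketch of Kafka API version negotiation (not kafkajs API):
// each side advertises a [min, max] range per API key, and the effective
// version is the highest version inside both ranges.
function negotiateVersion(clientRange, brokerRange) {
  const min = Math.max(clientRange.min, brokerRange.min);
  const max = Math.min(clientRange.max, brokerRange.max);
  if (min > max) {
    throw new Error('No mutually supported protocol version');
  }
  return max;
}

// If the client supports JoinGroup up to v5 and the broker advertises
// up to v7, the negotiated version is 5.
console.log(negotiateVersion({ min: 0, max: 5 }, { min: 0, max: 7 })); // 5
```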
I am an engineer with the AWS MSK Serverless team. First of all, thanks for putting together a repro library, it was quite helpful in debugging and reproducing the issue. We have done an initial investigation and it seems like the JoinGroupRequest from the KafkaJS library doesn't adhere to the Kafka protocol. Using the repro code shared here, we seem to be getting the following request from the client library.
The metadata field is of interest to us: it encodes the ConsumerProtocolSubscription. The ConsumerProtocolSubscription structure contains the following fields:
Taking a deeper look into the metadata:
The first 2 bytes (a short) are the version of the metadata structure, i.e. [0, 1] => 1. Since the protocol version is 1, it is expected that the next 4 bytes of the byte array tell us the size of the "ownedPartitions" collection for the topic. However, in these requests, we do not find any. Looking further, I found that the version is always set to 1 for the assigner (RoundRobin) supported by kafkajs.
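The byte-level check described above can be sketched like this. The helper name and buffer contents are illustrative; the only assumption from the source is that the first two bytes are a big-endian int16 schema version, as is standard for Kafka protocol integers.

```javascript
// Sketch of inspecting the embedded subscription metadata: the first
// two bytes are an int16 schema version, big-endian per the Kafka
// protocol. The buffer contents below are illustrative.
function subscriptionVersion(metadata) {
  return metadata.readInt16BE(0); // int16 schema version
}

// Bytes [0, 1] at the start of the metadata decode to version 1.
const sample = Buffer.from([0, 1, 0, 0, 0, 1]);
console.log(subscriptionVersion(sample)); // 1
```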
We can also see that at the time of encoding these bytes, the library only adds version, topics and userData (notice that no ownedPartitions field is added). This conforms to version 0 of the protocol metadata schema, not version 1 (kafkajs/src/consumer/assignerProtocol.js, line 47, at 4a195d7).
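The mismatch can be shown with a simplified encoder. This is a sketch, not the actual kafkajs encoder: the header declares version 1, but only the version-0 fields (topics, userData) are written, so a strict decoder expecting the version-1 ownedPartitions array cannot parse the payload.

```javascript
// Simplified sketch (not the actual kafkajs encoder) of the mismatch:
// the header declares a schema version, but only the version-0 fields
// are written after it.
function encodeSubscription({ version, topics, userData }) {
  const head = Buffer.alloc(2);
  head.writeInt16BE(version, 0); // declared schema version

  const topicCount = Buffer.alloc(4);
  topicCount.writeInt32BE(topics.length, 0);
  const topicBufs = topics.flatMap((t) => {
    const len = Buffer.alloc(2);
    len.writeInt16BE(Buffer.byteLength(t), 0);
    return [len, Buffer.from(t)];
  });

  const userDataLen = Buffer.alloc(4);
  userDataLen.writeInt32BE(userData.length, 0);

  // A true version-1 payload would append an ownedPartitions array here;
  // omitting it while declaring version 1 is the mismatch described above.
  return Buffer.concat([head, topicCount, ...topicBufs, userDataLen, userData]);
}

const bytes = encodeSubscription({ version: 1, topics: ['topic-a'], userData: Buffer.alloc(0) });
console.log(bytes.readInt16BE(0)); // 1 — header claims v1, body is v0-shaped
```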
I have tried out a simple fix in my local setup, where I change the version of the RoundRobin assigner from 1 to 0. I have verified using the reproducer mentioned here https://github.com/jakewins/kafkajs-msk-protocolerr-repro/blob/main/index.ts that this works for both MSK Provisioned and MSK Serverless. Also note that we are aware this works with MSK Provisioned clusters and vanilla Apache Kafka clusters, but the MSK Serverless implementation requires the metadata field to adhere strictly to the protocol. Please let me know if you agree with the findings here or if we have missed something. I am happy to contribute the fix if needed.
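The shape of the fix is roughly the following. The property names approximate kafkajs's assigner interface and the assignment logic is omitted; the one substantive change is the version field.

```javascript
// Hedged sketch of the fix: the assigner declares version 0 so the
// header matches the version-0 fields ({topics, userData}) it actually
// encodes. Property names approximate kafkajs's assigner interface.
const roundRobinAssigner = {
  name: 'RoundRobinAssigner',
  version: 0, // was 1; 0 matches the fields actually written
  assign({ members, topics }) {
    // assignment logic is unchanged by the fix; omitted here
    return [];
  },
};

console.log(roundRobinAssigner.version); // 0
```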
Thanks for your investigation @sankalpbhatia! I'm curious why MSK Serverless is decoding the embedded protocol metadata. That isn't part of the network protocol, as you can see from the protocol definition, where it is defined as just:
So I would say it looks like MSK is peeking into client internals from the server side and assuming that the Java client implementation defines the protocol, which I don't think is quite right. That said, adding this field still makes sense, since it's a requirement for incremental cooperative rebalancing, so I don't see any particular reason to deviate from the Java implementation and the recommendation in the documentation. However, we need to think carefully about what to do here.
I'm leaning towards the second option, but we'll need to see what happens when a consumer group has members with both the old and new version.
Thanks @Nevon for the input, and sorry for the delayed reply. Currently, decoding the protocol metadata in MSK Serverless is an internal implementation detail. I prefer option 2 as well. Looking at the code, I do not see the assigner using the version to assign partitions. I will run a quick test with a consumer group containing two consumers on different versions, and will report back. I will also try to create a pull request in the next couple of working days.
+1, facing the same issue
…res to the standard embedded schema tulios#1449
@Nevon I did a simple test where I ran 2 consumers: the first had the version of RoundRobinAssigner set to 0, while the second had version 1. When running these two against an MSK Provisioned cluster, I observed that
This makes sense, as the consumer group coordinator in the library currently does not use 'version' to determine the assigner to be used. I have raised a pull request that changes the RoundRobin assigner version to 0; it seems to be passing the existing checks.
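The observation above — that mixed versions interoperate because the version field is never consulted — can be sketched with a hypothetical helper (not the kafkajs internals): the coordinator matches assigners by protocol name alone.

```javascript
// Sketch (hypothetical helper, not kafkajs internals) of why mixed
// versions interoperate: the assigner is selected by protocol name
// alone, and the version field is never consulted.
function selectAssigner(assigners, protocolName) {
  return assigners.find((assigner) => assigner.name === protocolName);
}

const assigners = [
  { name: 'RoundRobinAssigner', version: 0 },
  { name: 'RangeAssigner', version: 1 },
];
console.log(selectAssigner(assigners, 'RoundRobinAssigner').version); // 0
```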
Excellent! In a few minutes, the beta release channel will contain version 2.3.0-beta.1 with this fix. Could you please verify that connecting to MSK Serverless works with this version?
Thanks for merging the request. I have verified that version 2.3.0-beta.1 works with MSK Serverless.
Looks to be working fine now, thanks for the fix!
This fix will be out in v2.2.2.
@Nevon why isn't this in kafkajs? Or am I missing something? https://github.com/jmaver-plume/kafkajs-msk-iam-authentication-mechanism
Describe the bug
When I connect a kafkajs consumer to AWS MSK Serverless, it prints a protocol error to the log, and then the consumer stops.
Connecting and sending data work fine, as does using the admin client.
To Reproduce
Expected behavior
The consumer should start and listen for messages, and then disconnect. There should be no protocol errors.
Observed behavior
The consumer crashes with a protocol error:
As far as I can tell, broker logs are not available in MSK Serverless.
Environment: