[IPv6 configuration] A node is stuck with "?U" status and Host ID is "null", unclear reason #16039
Comments
Another reproducer: according to the log, host_id was set successfully. Other nodes also see this host_id.
Its status is UP and UNKNOWN seen on
Setting host_id is seen later on the
And status in the cluster is
|
@elcallio - please have a look. |
So the problem is that parts of the JMX interface nodetool uses rely on endpoints as strings, i.e. textual IPs. Scylla/seastar formats IPv6 addresses slightly differently from Java, which removes leading zeros in an address. A relatively easy stop-gap solution is to simply transform the strings in scylla-jmx, normalizing them to Java-style text. I am reluctant to modify the formatting in Scylla. |
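For readers following along, the normalization described above could look roughly like the sketch below. This is only an illustration, not the code that went into scylla-jmx: the class and method names are made up, and the addresses use the 2001:db8::/32 documentation prefix rather than anything from the failing cluster. It relies on the fact that java.net.InetAddress parses any valid IPv6 literal and that getHostAddress() prints it in Java's own form (eight groups, leading zeros dropped, no "::" compression).

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, NOT the actual scylla-jmx implementation.
class EndpointNormalizer {
    // Re-formats a textual IP address the way Java itself prints addresses.
    // Assumes the input is an IP literal; for anything else getByName()
    // would attempt a DNS lookup, which a real fix would want to avoid.
    static String normalize(String endpoint) {
        try {
            return InetAddress.getByName(endpoint).getHostAddress();
        } catch (UnknownHostException e) {
            // Not parseable: hand back the input unchanged.
            return endpoint;
        }
    }

    public static void main(String[] args) {
        // Fully expanded form and compressed form both map to the same Java text.
        System.out.println(normalize("2001:0db8:0000:0000:0000:0000:0000:0001")); // 2001:db8:0:0:0:0:0:1
        System.out.println(normalize("2001:db8::1"));                             // 2001:db8:0:0:0:0:0:1
        // IPv4 literals pass through unchanged.
        System.out.println(normalize("10.0.0.1"));                                // 10.0.0.1
    }
}
```

Whatever text the server side produces, every string ends up in the one form that nodetool's InetAddress values also print as, so string comparisons and map lookups line up again.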
The problem is of course making sure all relevant code paths are handled... |
Fixes scylladb#228 Endpoint strings in StorageService does not follow Java formatting for IPv6 nodes. This causes nodetool to break when mixing API:s returning actual IP addresses with info mapped by string. Example is nodetool status. Refs scylladb/scylladb#16039 While this is really broken nodetool code, the problem is probably easier to fix here, in Scylla-JMX, as a stopgap until we can fully replace nodetool.
@elcallio - any idea how it worked in the past? Doesn't sound like it ever did...? |
Probably not. I am however lying a bit as well: The current formatting was applied by 4ea6e06, which changed formatting to use a special, manual I.e. we have two different formatting paths here. Not sure if there would be any bad consequences of changing the But mainly I of course dislike the two paths... |
Does it impact 5.4? |
Yes. In fact, given that Java and inet_ntop format addresses differently, and given the way the APIs work, I don't think these nodetool ops ever worked properly for IPv6 hosts. Luckily, the Java-only (JMX) fix should backport easily to any release one might wish to fix. |
scylladb/scylla-jmx#229 is the fix on the JMX side (it then needs a backport to 5.4) |
Why did we discover it on 5.4 and not master? |
@avikivity - I would assume some sort of a 'race' on when we run this test on master (and apparently we don't regularly run IPv6 tests on master?) |
Fixes #228 Endpoint strings in StorageService does not follow Java formatting for IPv6 nodes. This causes nodetool to break when mixing API:s returning actual IP addresses with info mapped by string. Example is nodetool status. Refs scylladb/scylladb#16039 While this is really broken nodetool code, the problem is probably easier to fix here, in Scylla-JMX, as a stopgap until we can fully replace nodetool. Closes: #229
@denesb since you reviewed it, please complete the backport |
I thought they were all done? |
What's not clear to me is if it has scylladb/scylla-jmx@80ce599 in it as well. |
I think we should fix the way scylladb formats ipv6. |
The comment from elcallio/scylla-jmx@ccb1ccb says:
|
I can't evaluate it without understanding what's broken. |
See above - #16039 (comment) |
4ea6e06 should be fixed to restore the previous behavior. After that, we can change jmx/nodetool/whatever to be more permissive, and after that change the formatters to conform to canonical representation (without zeroes). |
@tchaikov - please see above. |
@avikivity - but I vote to get the java fix to 5.4, to unblock its release (if it's not in already - not sure) |
Don't know, @denesb should manage the backport. |
The problem is that it is the Java formatting that is the issue here. Java formatting != inet_ntop. Nodetool mixes data in InetAddress form with data in string form, the latter being addresses formatted in seastar/scylla. And I don't think we should change seastar to format IPv6 Java-style. Which is why, until we have a nodetool replacement and can bypass the archaic JMX APIs it uses, it is easier to just try to ensure strings are Java-formatted in the JMX stack. |
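To make the failure mode concrete, here is a small sketch of how mixing the two representations can produce a null Host ID. It is an assumption about the shape of the problem, not nodetool's real code: the class, the map, and the host ID value are invented, and the address comes from the documentation prefix. One side holds a map keyed by the server's textual address, the other side looks entries up with the text Java prints for an InetAddress.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the string/InetAddress mismatch.
class HostIdLookupDemo {
    // Parses an IP literal, wrapping the checked exception for brevity.
    static InetAddress parse(String literal) {
        try {
            return InetAddress.getByName(literal);
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException(literal, e);
        }
    }

    // Looks up a host ID using Java's textual form of the address as the key.
    static String lookup(Map<String, String> hostIdsByEndpoint, InetAddress endpoint) {
        return hostIdsByEndpoint.get(endpoint.getHostAddress());
    }

    // Re-keys the map so that every key is in Java's textual form.
    static Map<String, String> normalizeKeys(Map<String, String> raw) {
        Map<String, String> out = new HashMap<>();
        for (Map.Entry<String, String> e : raw.entrySet()) {
            out.put(parse(e.getKey()).getHostAddress(), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // Key as a non-Java formatter might print it (leading zeros kept).
        Map<String, String> raw = new HashMap<>();
        raw.put("2001:0db8:0000:0000:0000:0000:0000:0001", "made-up-host-id");

        InetAddress endpoint = parse("2001:db8::1"); // same address, different text

        System.out.println(lookup(raw, endpoint));                // null
        System.out.println(lookup(normalizeKeys(raw), endpoint)); // made-up-host-id
    }
}
```

The first lookup misses even though both sides refer to the same address, which matches the symptom in this issue: the node is known, but its state and Host ID cannot be found by string.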
Fixes #228 Endpoint strings in StorageService does not follow Java formatting for IPv6 nodes. This causes nodetool to break when mixing API:s returning actual IP addresses with info mapped by string. Example is nodetool status. Refs scylladb/scylladb#16039 While this is really broken nodetool code, the problem is probably easier to fix here, in Scylla-JMX, as a stopgap until we can fully replace nodetool. Closes: #229 (cherry picked from commit 80ce599)
* ./tools/jmx 9a03d4fa...166599f0 (1): > StorageService: Normalize endpoint inetaddress strings to java form Fixes: #16039
Backport queued to 5.4 as 1a0424d |
In 4ea6e06, we specialized fmt::formatter<gms::inet_address> using the formatter of bytes if the underlying address is an IPv6 address. This breaks the tests with JMX, which expected the shortened form of the text representation of the IPv6 address. In this change, instead of reinventing the wheel, let's reuse the existing formatter of net::inet_address, which is able to handle both IPv4 and IPv6 addresses and also follows https://datatracker.ietf.org/doc/html/rfc5952 by compressing the consecutive zeros. Since this new formatter is a thin wrapper of seastar::net::inet_address, the corresponding unit test will be added to Seastar. Refs scylladb#16039 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
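As a reference point for what "compressing the consecutive zeros" means, below is a minimal sketch of the RFC 5952 text form. It is written in Java only to stay in one language with the JMX side; it is not the Seastar formatter, the class name is made up, and it skips the RFC's mixed notation for IPv4-mapped addresses. The rules it applies are: lowercase hex, no leading zeros in a group, and "::" replacing the longest run of two or more all-zero groups (the first such run on a tie).

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical sketch of RFC 5952 formatting, NOT Seastar's implementation.
class Rfc5952Sketch {
    static String format(String literal) {
        byte[] b;
        try {
            b = InetAddress.getByName(literal).getAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException(literal, e);
        }
        if (b.length != 16) {
            return literal; // IPv4 (or IPv4-mapped) input: left untouched here
        }
        // Split the 16 bytes into eight 16-bit groups.
        int[] g = new int[8];
        for (int i = 0; i < 8; i++) {
            g[i] = ((b[2 * i] & 0xff) << 8) | (b[2 * i + 1] & 0xff);
        }
        // Find the longest run of zero groups (length >= 2, first one wins).
        int bestStart = -1, bestLen = 0;
        for (int i = 0; i < 8; ) {
            if (g[i] != 0) { i++; continue; }
            int j = i;
            while (j < 8 && g[j] == 0) j++;
            if (j - i >= 2 && j - i > bestLen) { bestStart = i; bestLen = j - i; }
            i = j;
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 8; ) {
            if (i == bestStart) {
                sb.append("::");
                i += bestLen;
                continue;
            }
            // Separate groups with ':' unless we just emitted "::".
            if (sb.length() > 0 && sb.charAt(sb.length() - 1) != ':') sb.append(':');
            sb.append(Integer.toHexString(g[i]));
            i++;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(format("2001:0db8:0000:0000:0000:0000:0000:0001")); // 2001:db8::1
        System.out.println(format("2001:0:0:1:0:0:0:1"));                      // 2001:0:0:1::1
    }
}
```

Note that this RFC 5952 text still differs from what Java's getHostAddress() prints for the same address (2001:db8:0:0:0:0:0:1), which is why the normalization on the JMX side remains relevant whichever form the server emits.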
should be fixed by #16267 |
to ensure that we adhere to the related RFC. Refs scylladb/scylladb#16039 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
to ensure that we adhere to the related RFC. because seastar::inet_address does not have its own test suite, let's colocate it with the tests for network_interfaces() at this moment. we can extract them out once there are more of them. Refs scylladb/scylladb#16039 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
In 4ea6e06, we specialized fmt::formatter<gms::inet_address> using the formatter of bytes if the underlying address is an IPv6 address. This breaks the tests with JMX, which expected the shortened form of the text representation of the IPv6 address. In this change, instead of reinventing the wheel, let's reuse the existing formatter of net::inet_address, which is able to handle both IPv4 and IPv6 addresses and also follows https://datatracker.ietf.org/doc/html/rfc5952 by compressing the consecutive zeros. Since this new formatter is a thin wrapper of seastar::net::inet_address, the corresponding unit test will be added to Seastar. Refs #16039 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #16267
Issue description
Scylla configuration: all addresses are IPv6
scylla.yaml
When the cluster was created, node longevity-10gb-3h-5-4-db-node-ba01d950-1 remained in status "?N":
Host ID is null despite having been set:
Setting the host ID for the node was seen by other nodes, along with status "NORMAL". For example:
But nodetool status shows "?N" status.
As a result, the test failed with the error "Not all nodes joined the cluster".
I did not find any errors in the DB node and test logs. It is really not clear why the status is UNKNOWN for this node.
When I ran the reproducers, I hit a case where 2 nodes were in UNKNOWN status.
This issue is observed with versions 5.3.0 and 5.4.0.
Impact
Cluster is not ready
How frequently does it reproduce?
I ran a few reproducers with different versions:
5.4.0~rc1 - the problem is reproduced on almost every run.
5.3.0-rc0 - the problem is reproduced (on the first run). The official IPv6 test was not run with this version.
2023.1.2 - the problem is NOT reproduced (ran twice). The official IPv6 test was run and passed.
5.2.9 - the problem is NOT reproduced (ran once). The official IPv6 test was run and passed.
Installation details
Cluster size: 6 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-01715aa610de633df (aws: undefined_region)
Test: longevity-10gb-3h-ipv6-test
Test id: ba01d950-928e-49f4-a81e-1bb54e52f9ad
Test name: scylla-5.4/longevity/longevity-10gb-3h-ipv6-test
Test config file(s):
Logs and commands
$ hydra investigate show-monitor ba01d950-928e-49f4-a81e-1bb54e52f9ad
$ hydra investigate show-logs ba01d950-928e-49f4-a81e-1bb54e52f9ad
Logs:
Jenkins job URL
Argus