Internal Replication Factor settings are silently changed when onlining a replacement cluster node #15017
@ptrsmrn - please take a look - is this system_auth related or a general issue? |
This happens on non-system_auth keyspaces as well. |
I should be able to perform almost any log collection/config changes as required to provide additional information, and I'm happy to do so. |
It's really odd. |
This isn't just an appearance/UI bug. I did a node replacement while writes were ongoing, but didn't get a chance to do a rolling restart until Monday. I'm now running a repair and disk utilization is going up by a not-insignificant amount, meaning the incorrect metadata is being used for write persistence, is probably used for repair, and won't be noticeable if you're doing most of your read/write ops with consistency level 1/2. Because of the relatively silent nature of this issue, it has potentially catastrophic consequences for the data stored in, as well as written to, the cluster. Documentation on increasing the replication factor via ALTER KEYSPACE statements suggests that, say, going from a replication factor of 5 to 2 (with an unknown strategy), then back to 5 (via a rolling restart, not via an ALTER KEYSPACE command), could result in data consistency/loss issues. This scares the hell out of me.
Edit: one thing we/I would appreciate guidance on is what we ought to do in the meantime for data integrity. This only showed up in dev when we started to cycle out the nodes to upgrade to Ubuntu 22.04, so we essentially induced this failure, but if we were to get an instance retirement notification from AWS for a production host, we think we'd end up in this situation due to replacing the node.
Edit 2: going to spin up a second cluster on 5.2.5 to see if it happens, and if so, will spin up a cluster on 5.1.x to see if it happens, as that's still a supported release version. |
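For context, the check-and-remediate sequence being described can be sketched roughly as follows. The keyspace/table/key here are the stock system_auth.roles table and the cassandra role mentioned later in the thread, and the repair step reflects what the reporter says they are doing, not confirmed guidance:

```sh
# 1. What the schema claims - per this report, DESCRIBE stays correct:
cqlsh -e "DESCRIBE KEYSPACE system_auth;"

# 2. Which replicas Scylla would actually use for a given key - per this
#    report, this is what silently changes after the node replacement:
nodetool getendpoints system_auth roles cassandra

# 3. A rolling restart restores the effective topology; a primary-range
#    repair afterwards re-replicates anything written in the window:
nodetool repair -pr
```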
I managed to reproduce it on a fresh 5.2.5 cluster while doing writes. Next steps are going to be doing this with a completely fresh cluster with only the system_auth replication factor change and creation of a new keyspace, but no writes. If it happens while doing that, I'll check with latest 5.1. This sucks. |
It was reproduced on an idle cluster; now trying with 5.1.15. |
Not reproducible on 5.1.15. Will be giving 5.2.0-rc0 a shot tomorrow. scylla-5.1.15...scylla-5.2.0-rc0. Just a few changes to bisect :( |
Reproducible on 5.2.0-rc0. At this point, should I massively upsize root volume, turn on trace logging on a single cluster host, and redo this? |
Decided to see if things were any different if I used |
I wish I could provide some guidance already, but we first need more data to analyze.
Trace log level can be too much, but we can give it a shot, so if you're able to do it, then yes please.
5.1.15 is "newer" than 5.2.0-rc0, and I wonder if an older 5.1.x would fail. Could you possibly try with something older, e.g. 5.1.4? But generally I agree with the reasoning that 5.2+ introduced a potential bug.
|
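For what it's worth, bumping logging on a single node looks roughly like this; the logger names are illustrative guesses rather than a confirmed list, and the REST address/port come from the config posted further down in this thread:

```sh
# Raise specific loggers to trace on one node (names here are examples):
nodetool setlogginglevel gossip trace
nodetool setlogginglevel token_metadata trace

# Scylla's REST API (api_address/api_port from scylla.yaml, 127.0.0.1:10000
# in the config below) can list the loggers that actually exist:
curl -s http://127.0.0.1:10000/system/logger
```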
Alrighty. I have a disgusting volume of logs now, although a lot of them are thrown exceptions due to the deleted node; it's 978MB gzipped for all 6 nodes. We can figure out how to get you that after the weekend. I did, however, notice something when going through the logs: when update_endpoint is getting called, a lot of the time the rack info is missing, and this only happens during the node replacement. From the "new" node: From a node in the cluster: That log line is written here: Lines 66 to 77 in 34ab98e
Given the comparison in the condition above the log statement, and the subsequent lines inserting the node IP address into a map of sets, where the map keys are what's supposed to be logged there, I would expect that those strings wouldn't be zero-length. It would really suck if this is a red herring. |
@bhalevy can you please advise here? Wasn't that something that got accidentally changed during one of refactors? |
Not that I am aware of. |
@ptrsmrn let's try to reproduce this locally. |
I can grab the scylla & snitch config I’m using in an hour or two for you |
Thanks. The problem you pointed out around update_endpoint missing dc/rack information may be very relevant. |
@lattwood did you upload the (huge) log file anywhere? |
Instances are in eu-central-1, have the instance metadata service enabled, and have IMDSv2 set to
Config files:
# Configured by Ansible Scylla role
# Additional parameters can be edited right here for all-node distribution
cluster_name: "a-cluster-name"
data_file_directories:
- /var/lib/scylla/data
commitlog_directory: /var/lib/scylla/commitlog
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer
num_tokens: 256
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
commitlog_segment_size_in_mb: 32
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
parameters:
- seeds: "10.1.90.69"
listen_address: 10.1.93.255
native_transport_port: 9042
read_request_timeout_in_ms: 5000
write_request_timeout_in_ms: 2000
cas_contention_timeout_in_ms: 1000
endpoint_snitch: Ec2Snitch
rpc_address: 10.1.93.255
broadcast_address: 10.1.93.255
broadcast_rpc_address: 10.1.93.255
rpc_port: 9160
api_port: 10000
api_address: 127.0.0.1
batch_size_warn_threshold_in_kb: 5
batch_size_fail_threshold_in_kb: 50
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
commitlog_total_space_in_mb: -1
murmur3_partitioner_ignore_msb_bits: 12
api_ui_dir: /opt/scylladb/swagger-ui/dist/
api_doc_dir: /opt/scylladb/api/api-doc/
enable_sstables_mc_format: True
internode_compression: all
consistent_cluster_management: True
|
@bhalevy I'm going to upload them to an S3 bucket this morning (I'm in Atlantic Time, +1 from Eastern) and give you a presigned URL via a side-channel. Compressed, the logs from the 6 nodes are < 1GB, so it won't take long to upload. I noticed there's a 5.3.0-rc0, but it's from May. Is that worth me trying to reproduce this on, or should I take a stab at an earlier 5.1 build? |
@bhalevy s3 presigned url provided via Slack |
I wasn't able to reproduce the issue yet. @xemul did you encounter the Ec2Snitch returning empty dc/rack values by any chance? @lattwood what is the cluster topology and which nodes did you try replacing? BTW, I'm not sure if the update_endpoint messages with empty dc/rack are indeed the smoking gun, since apparently they can happen normally in 5.2 when we clone the topology structure internally. |
@bhalevy 1 DC, 3 racks, 2 nodes per rack. I’ll see about reproducing this again with tcpdump running during the critical period & trace logging enabled again today |
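For the capture, a minimal sketch, assuming the default Scylla inter-node ports (the filename is arbitrary):

```sh
# Capture inter-node (gossip/streaming) traffic during the replacement
# window; Scylla's default inter-node ports are 7000 (and 7001 for SSL).
sudo tcpdump -i any -w replace-window.pcap 'port 7000 or port 7001'
```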
And it happens when replacing any of the nodes after a termination in the EC2 Console, akin to what would happen if there was a hardware failure of an EC2 instance and AWS terminated it, a not-uncommon occurrence on EC2. |
I'm going to be building ScyllaDB from source in debug mode, and then adding log statements until it's obvious where the issue is. |
@lattwood I'm not sure if you need to build in debug mode - it will cause scylla to be very slow, unless we suspect the address or undefined-behavior sanitizers to come to the rescue. |
Hmm, got it. Will try with a dev build; the debug build was quite slow on master, but the issue didn't show up with that build. Going to see if it happens with a dev build this morning. If it does, it means a race condition 😭 |
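For reference, a rough sketch of the dev-mode build flow using Scylla's toolchain container, assuming a stock checkout (verify the exact steps against the project's HACKING.md for your branch):

```sh
# Build a dev-mode binary (fast compile, no sanitizers, much faster at
# runtime than debug) inside Scylla's frozen toolchain container.
git clone https://github.com/scylladb/scylladb
cd scylladb
git submodule update --init --recursive
./tools/toolchain/dbuild ./configure.py --mode=dev
./tools/toolchain/dbuild ninja build/dev/scylla
```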
I need to do some more testing tomorrow, but I think this is resolved with the release of 5.2.7, specifically this change- |
Great news. Let us know if it's indeed that. |
Confirmed this is fixed via aca9e41 by cherry-picking that on top of 5.2.6, compiling and testing here: https://github.com/lattwood/scylladb/tree/lattwood/scylla-5.2.6-hotfix Thanks for the pointers on all of this. |
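For anyone who wants to replicate the verification, the cherry-pick amounts to roughly the following; the tag name follows the scylla-x.y.z convention seen earlier in the thread, and the build steps are as in the sketch above:

```sh
# Apply the fix commit on top of the 5.2.6 release tag, then rebuild.
git clone https://github.com/scylladb/scylladb
cd scylladb
git checkout scylla-5.2.6
git cherry-pick aca9e41
```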
Installation details
Scylla version: 5.2.5 (this didn't happen on 4.6.10)
Cluster size: 6 nodes, i4i.2xlarge
OS (RHEL/CentOS/Ubuntu/AWS AMI): Ubuntu 20.04, issue was discovered while moving to 22.04
On a six node cluster where the following CQL has been run, resulting in the following nodetool output...
And where a node has been forcibly removed from the cluster, and its replacement is brought up with
replace_node_first_boot: UUID-OF-REMOVED-NODE
in the ScyllaDB config, when it has successfully replaced the missing node, the nodetool output for the above command becomes... and the same problem exists on the akeyspace keyspace. If you describe the keyspace, replication is still set to the above value (i.e., it doesn't change from what was set in CQL). So in other words, something is happening to the effective replication configuration ScyllaDB is using.
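To make the failure mode concrete, a hypothetical before/after sequence; akeyspace is from the report, while the table and key names are invented for illustration:

```sh
# Before the replacement: schema and effective placement agree.
cqlsh -e "DESCRIBE KEYSPACE akeyspace;"          # shows the RF set via CQL
nodetool getendpoints akeyspace atable somekey   # expected replica set

# Replace the terminated node by setting, in the new node's scylla.yaml:
#   replace_node_first_boot: UUID-OF-REMOVED-NODE

# After the replacement: DESCRIBE output is unchanged, but getendpoints
# reports a different (wrong) replica set until a rolling restart.
nodetool getendpoints akeyspace atable somekey
```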
When a rolling restart of Scylla is performed, the output of nodetool getendpoints ... is corrected.
This issue was discovered when attempting to do LIST ROLES for cassandra, as Scylla/Cassandra special-cases the read CL for that role name with QUORUM (it succeeds after the rolling restart). I have no idea what this implies for data consistency/resiliency, but I suspect it isn't great.
Additional details: the cluster was created on 5.2.5 with raft disabled (unintentionally). The issue occurs with raft disabled, and it also occurs when replacing a node after raft has been enabled.