
Internal Replication Factor settings are silently changed when onlining a replacement cluster node #15017

Closed
lattwood opened this issue Aug 10, 2023 · 30 comments

@lattwood

lattwood commented Aug 10, 2023

Installation details
Scylla version: 5.2.5 (this didn't happen on 4.6.10)
Cluster size: 6 nodes, i4i.2xlarge
OS (RHEL/CentOS/Ubuntu/AWS AMI): Ubuntu 20.04, issue was discovered while moving to 22.04


On a six-node cluster where the following CQL has been run, resulting in the following nodetool output...

ALTER KEYSPACE system_auth WITH replication = {'class': 'NetworkTopologyStrategy', 'eu-central': '6'};
CREATE KEYSPACE akeyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'eu-central': '3'}  AND durable_writes = true;
> nodetool getendpoints system_auth roles cassandra
10.1.106.70
10.1.78.113
10.1.101.109
10.1.66.242
10.1.87.51
10.1.90.69

When a node has been forcibly removed from the cluster and its replacement is brought up with replace_node_first_boot: UUID-OF-REMOVED-NODE in the ScyllaDB config, then once it has successfully replaced the missing node, the output of the above nodetool command becomes...

> nodetool getendpoints system_auth roles cassandra
10.1.106.70
10.1.78.113

The same problem exists on the akeyspace keyspace. If you DESCRIBE the keyspace, replication is still set to the value above (i.e., it doesn't change from what was set in CQL).
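
For reference, the replacement is triggered with a single extra setting in the new node's scylla.yaml before its first boot - roughly like this (illustrative only; the UUID is a placeholder for the removed node's Host ID):

# added to scylla.yaml on the replacement node before first start
replace_node_first_boot: 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c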

In other words, something is happening to the effective replication configuration ScyllaDB is using.

When a rolling restart of Scylla is performed, the output of nodetool getendpoints ... is corrected.

> nodetool getendpoints system_auth roles cassandra
10.1.106.70
10.1.78.113
10.1.101.109
10.1.66.242
10.1.93.255
10.1.90.69

This issue was discovered when attempting to run LIST ROLES as cassandra, since Scylla/Cassandra special-cases the read consistency level for that role to QUORUM. (It succeeds after the rolling restart.)

I have no idea what this implies for data consistency/resiliency, but I suspect it isn't great.


Additional details: the cluster was created on 5.2.5 with Raft disabled (unintentionally). The issue occurs with Raft disabled, and it also occurs when replacing a node after Raft has been enabled.

@mykaul
Contributor

mykaul commented Aug 11, 2023

@ptrsmrn - please take a look - is this system_auth related or a general issue?

@lattwood
Author

This happens on non-system_auth keyspaces as well.

@lattwood
Author

I should be able to perform almost any log collection/config changes as required to provide additional information, and I'm Logan Attwood on the ScyllaDB-Users Slack.

@avikivity
Member

It's really odd.

@lattwood
Author

lattwood commented Aug 14, 2023

This isn't just an appearance/UI bug.

I did a node replacement while writes were ongoing, but didn't get a chance to do a rolling restart until Monday. I'm now running a repair, and disk utilization is going up by a not insignificant amount...

Meaning the incorrect metadata is being used for write persistence, is probably used for repair, and won't be noticeable if you're doing most of your read/write ops at consistency level ONE or TWO. Because of the relatively silent nature of this issue, it has potentially catastrophic consequences for data already stored in, as well as newly written to, the cluster.

The documentation on increasing the replication factor via ALTER KEYSPACE statements makes it seem like, say, going from an effective replication factor of 5 to 2 (with an unknown strategy), then back to 5 (via a rolling restart rather than an ALTER KEYSPACE command), could result in data consistency/loss issues.

This scares the hell out of me.

Edit: one thing we/I would appreciate guidance on is what we ought to do in the meantime for data integrity. This only showed up in dev when we started to cycle out the nodes to upgrade to Ubuntu 22.04, so we essentially induced this failure, but if we were to get an instance retirement notification from AWS for a production host, we think we'd end up in this situation due to replacing the node.

Edit 2: Going to spin up a second cluster on 5.2.5 to see if it happens, and if so, will spin up a cluster on 5.1.x to check there as well, since that's still a supported release version.

@lattwood
Author

lattwood commented Aug 15, 2023

I managed to reproduce it on a fresh 5.2.5 cluster while doing writes.

Next steps are going to be doing this with a completely fresh cluster with only the system_auth replication factor change & creation of new keyspace- but no writes. If it happens while doing that, I'll check with latest 5.1.

This sucks.

@lattwood
Author

lattwood commented Aug 15, 2023

Reproduced it on an idle cluster; now trying with 5.1.15.

@lattwood
Author

Not reproducible on 5.1.15.

Will be giving 5.2.0-rc0 a shot tomorrow.

scylla-5.1.15...scylla-5.2.0-rc0

Just a few changes to bisect :(

@lattwood
Author

Reproducible on 5.2.0-rc0. At this point, should I massively upsize root volume, turn on trace logging on a single cluster host, and redo this?

@lattwood
Author

Decided to see if things were any different if I used replace_address_first_boot vs replace_node_first_boot on 5.2 and didn't see any difference.

@ptrsmrn
Contributor

ptrsmrn commented Aug 17, 2023

I wish I could provide some guidance already, but we first need more data to analyze.

Reproducible on 5.2.0-rc0. At this point, should I massively upsize root volume, turn on trace logging on a single cluster host, and redo this?

Trace log level can be too much, but we can give it a shot, so if you're able to do it, then yes please.

Will be giving 5.2.0-rc0 a shot tomorrow.
scylla-5.1.15...scylla-5.2.0-rc0
Just a few changes to bisect :(

5.1.15 is "newer" than 5.2.0-rc0 (by commit date), and I wonder if an older 5.1.x would fail too; could you possibly try an older one, e.g. 5.1.4? But generally I agree with the reasoning that 5.2+ introduced a potential bug.

$ git log -1 scylla-5.1.15
commit 12966e84352e22147750b7a1ebf60f6901935900 (tag: scylla-5.1.15)
Author: Beni Peled <beni.peled@scylladb.com>
Date:   Sun Jul 30 14:30:16 2023 +0300

    release: prepare for 5.1.15
$ git log -1 scylla-5.2.0-rc0 
commit 34ab98e1be2e39e484b2ea7a57898e6ad16694de (HEAD, tag: scylla-5.2.0-rc0)
Author: Anna Mikhlin <anna.mikhlin@scylladb.com>
Date:   Wed Jan 18 14:54:36 2023 +0200

    release: prepare for 5.2.0-rc0

@lattwood
Author

lattwood commented Aug 18, 2023

Alrighty. I have a disgusting volume of logs now, although a lot of them are thrown exceptions due to the deleted node - it's 978MB gzipped for all 6 nodes - we can figure out how to get you that after the weekend.

I did, however, notice something when going through the logs: when update_endpoint is getting called, a lot of the time the rack info is missing, and this only happens during the node replacement. From the "new" node:

[screenshot: update_endpoint log lines from the "new" node]

From a node in the cluster-

[screenshot: update_endpoint log lines from an existing cluster node]

That log line is written here-

if (current != _current_locations.end()) {
    if (current->second.dc == dr.dc && current->second.rack == dr.rack) {
        return;
    }
    remove_endpoint(ep);
}
tlogger.debug("update_endpoint: {} {}/{}", ep, dr.dc, dr.rack);
_dc_endpoints[dr.dc].insert(ep);
_dc_racks[dr.dc][dr.rack].insert(ep);
_datacenters.insert(dr.dc);
_current_locations[ep] = std::move(dr);

Given the comparison in the condition above the log statement, and that the subsequent lines insert the node's IP address into maps of sets keyed by the very dc/rack strings being logged, I would expect those strings not to be zero length. It would really suck if this is a red herring.
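
To illustrate the concern with a toy sketch (not Scylla code): if update_endpoint ever runs with empty dc/rack strings, the endpoint lands in a separate "" bucket, and anything that walks the real DC's bucket no longer sees it:

#include <iostream>
#include <map>
#include <set>
#include <string>

int main() {
    // simplified stand-in for _dc_endpoints: dc name -> set of endpoint addresses
    std::map<std::string, std::set<std::string>> dc_endpoints;

    // normal update: dc resolved correctly
    dc_endpoints["eu-central"].insert("10.1.90.69");

    // update arriving with an empty dc, as the log lines suggest
    dc_endpoints[""].insert("10.1.93.255");

    // anything iterating the "eu-central" bucket now misses the replacement node
    std::cout << "eu-central has " << dc_endpoints["eu-central"].size()
              << " endpoint(s)\n";  // prints 1, not 2
}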

@ptrsmrn
Contributor

ptrsmrn commented Aug 21, 2023

@bhalevy can you please advise here? Wasn't that something that got accidentally changed during one of the refactors?

@bhalevy
Member

bhalevy commented Aug 21, 2023

@bhalevy can you please advise here? Wasn't that something that got accidentally changed during one of the refactors?

Not that I am aware of.
nodetool getendpoints uses an internal function, get_natural_endpoints, that's used extensively on the common path, and we didn't see any such issues there.
@tgrabiec any insight?

@bhalevy
Member

bhalevy commented Aug 21, 2023

@ptrsmrn let's try to reproduce this locally.

@lattwood
Author

I can grab the scylla & snitch config I’m using in an hour or two for you

@bhalevy
Member

bhalevy commented Aug 21, 2023

Thanks. The problem you pointed out around update_endpoint missing dc/rack information may be very relevant.
And there were changes in this area in 5.2 vs. 5.1 (and further changes in later releases that likely fix this issue).

@bhalevy
Member

bhalevy commented Aug 21, 2023

@lattwood did you upload the (huge) log file anywhere?

@lattwood
Author

Instances are in eu-central-1, have the instance metadata service enabled, and have IMDSv2 set to optional.

Config files-

/etc/scylla/scylla.yaml

# Configured by Ansible Scylla role
# Additional parameters can be edited right here for all-node distribution

cluster_name: "a-cluster-name"

data_file_directories:
    - /var/lib/scylla/data

commitlog_directory: /var/lib/scylla/commitlog

authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer

num_tokens: 256
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
commitlog_segment_size_in_mb: 32

seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.1.90.69"

listen_address: 10.1.93.255
native_transport_port: 9042
read_request_timeout_in_ms: 5000
write_request_timeout_in_ms: 2000
cas_contention_timeout_in_ms: 1000
endpoint_snitch: Ec2Snitch
rpc_address: 10.1.93.255
broadcast_address: 10.1.93.255
broadcast_rpc_address: 10.1.93.255
rpc_port: 9160
api_port: 10000
api_address: 127.0.0.1
batch_size_warn_threshold_in_kb: 5
batch_size_fail_threshold_in_kb: 50
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
commitlog_total_space_in_mb: -1
murmur3_partitioner_ignore_msb_bits: 12
api_ui_dir: /opt/scylladb/swagger-ui/dist/
api_doc_dir: /opt/scylladb/api/api-doc/


enable_sstables_mc_format: True
internode_compression: all
consistent_cluster_management: True

/etc/scylla/cassandra-rackdc.properties

# Configured by Ansible Scylla role

#
# cassandra-rackdc.properties
#
# The lines may include white spaces at the beginning and the end.
# The rack and data center names may also include white spaces.
# All trailing and leading white spaces will be trimmed.
#
# dc=my_data_center
# rack=my_rack
# prefer_local=<false | true>
# dc_suffix=<Data Center name suffix, used by EC2SnitchXXX snitches>
#


@lattwood
Author

@bhalevy I'm going to upload them to an S3 bucket this morning (I'm in Atlantic Time, +1 from eastern) and give you a presigned URL via a side-channel. Compressed the logs from the 6 nodes are < 1GB so it won't take long to upload.
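
(The upload itself is just the standard AWS CLI flow, something like the following - bucket and object names are placeholders:)

aws s3 cp scylla-logs.tar.gz s3://some-log-bucket/scylla-logs.tar.gz
aws s3 presign s3://some-log-bucket/scylla-logs.tar.gz --expires-in 604800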

I noticed there's a 5.3.0-rc0 but it's from May. Is that worth me trying to reproduce this on, or should I take a stab at an earlier 5.1 build?

@lattwood
Author

@bhalevy s3 presigned url provided via Slack

@bhalevy
Member

bhalevy commented Aug 22, 2023

I haven't been able to reproduce the issue yet.

@xemul did you encounter the Ec2Snitch returning empty dc/rack values by any chance?

@lattwood what is the cluster topology and what nodes did you try replacing?
BTW, I'm not sure if the update_endpoint messages with empty dc/rack are indeed the smoking gun since apparently they can happen normally in 5.2 when we clone the topology structure internally.

@lattwood
Author

@bhalevy 1 DC, 3 racks, 2 nodes per rack.

I'll see about reproducing this again today with tcpdump running during the critical period and trace logging enabled.
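
The capture plan is roughly the following, assuming the default inter-node ports 7000/7001 (file name is a placeholder):

# on the replacement node, for the duration of the replace
sudo tcpdump -i any -w replace.pcap 'port 7000 or port 7001'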

@lattwood
Author

And it happens when replacing any of the nodes after a termination in the EC2 Console, akin to what would happen if an EC2 instance had a hardware failure and AWS terminated it - not an uncommon occurrence on EC2.

@lattwood
Author

lattwood commented Aug 23, 2023 via email

@bhalevy
Member

bhalevy commented Aug 24, 2023

@lattwood I'm not sure you need to build in debug mode - it will make Scylla very slow - unless we expect the address or undefined-behavior sanitizers to come to the rescue.
It should be fine to use dev mode for such cases.

@lattwood
Author

Hmm, got it. Will try with a dev build; the debug build was quite slow on master, and the issue didn't show up with that build. Going to see if it happens with a dev build this morning. If it does, it means a race condition 😭
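
For the record, the dev build I have in mind is roughly this, assuming the in-tree toolchain wrapper (checked out at whatever tag/branch is under test):

./tools/toolchain/dbuild ./configure.py --mode=dev
./tools/toolchain/dbuild ninja build/dev/scylla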

@lattwood
Author

I need to do some more testing tomorrow, but I think this is resolved with the release of 5.2.7, specifically this change-

aca9e41

@bhalevy
Member

bhalevy commented Aug 25, 2023

I need to do some more testing tomorrow, but I think this is resolved with the release of 5.2.7, specifically this change-

aca9e41

Great news. Let us know if it's indeed that.

@lattwood
Author


Confirmed this is fixed via aca9e41 by cherry-picking that on top of 5.2.6, compiling and testing here: https://github.com/lattwood/scylladb/tree/lattwood/scylla-5.2.6-hotfix
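
For anyone wanting to repeat the check, the branch above boils down to roughly this (then build and rerun the node-replacement scenario):

git clone https://github.com/scylladb/scylladb.git
cd scylladb
git checkout -b lattwood/scylla-5.2.6-hotfix scylla-5.2.6
git cherry-pick aca9e41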

Thanks for the pointers on all of this.

@DoronArazii added this to the 5.4 milestone Aug 29, 2023