
Internal Replication Factor settings are silently changed when onlining a replacement cluster node #15017

Closed
lattwood opened this issue Aug 10, 2023 · 30 comments

@lattwood

lattwood commented Aug 10, 2023

Installation details
Scylla version: 5.2.5 (this didn't happen on 4.6.10)
Cluster size: 6 nodes, i4i.2xlarge
OS (RHEL/CentOS/Ubuntu/AWS AMI): Ubuntu 20.04, issue was discovered while moving to 22.04


On a six-node cluster where the following CQL has been run, resulting in the following nodetool output...

ALTER KEYSPACE system_auth WITH replication = {'class': 'NetworkTopologyStrategy', 'eu-central': '6'};
CREATE KEYSPACE akeyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'eu-central': '3'}  AND durable_writes = true;
> nodetool getendpoints system_auth roles cassandra
10.1.106.70
10.1.78.113
10.1.101.109
10.1.66.242
10.1.87.51
10.1.90.69

When a node has been forcibly removed from the cluster and its replacement is brought up with replace_node_first_boot: UUID-OF-REMOVED-NODE in the ScyllaDB config, then once it has successfully replaced the missing node, the output of the above nodetool command becomes...

> nodetool getendpoints system_auth roles cassandra
10.1.106.70
10.1.78.113

The same problem exists on the akeyspace keyspace. If you DESCRIBE the keyspace, replication is still set to the value above (i.e., it doesn't change from what was set in CQL).
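
For reference, the replacement is triggered with a single extra setting in the new node's scylla.yaml before its first boot - roughly like this (illustrative only; the UUID is a placeholder for the removed node's Host ID):

# added to scylla.yaml on the replacement node before first start
replace_node_first_boot: 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c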

In other words, something is happening to the effective replication configuration ScyllaDB is using.

When a rolling restart of Scylla is performed, the output of nodetool getendpoints ... is corrected.

> nodetool getendpoints system_auth roles cassandra
10.1.106.70
10.1.78.113
10.1.101.109
10.1.66.242
10.1.93.255
10.1.90.69

This issue was discovered when attempting to run LIST ROLES as cassandra, since Scylla/Cassandra special-cases the read consistency level for that role to QUORUM. (It succeeds after the rolling restart.)

I have no idea what this implies for data consistency/resiliency, but I suspect it isn't great.


Additional details: the cluster was created on 5.2.5 with Raft disabled (unintentionally). The issue occurs with Raft disabled, and it also occurs when replacing a node after Raft has been enabled.

@mykaul
Contributor

mykaul commented Aug 11, 2023

@ptrsmrn - please take a look - is this system_auth related or a general issue?

@lattwood
Author

This happens on non-system_auth keyspaces as well.

@lattwood
Author

I should be able to perform almost any log collection/config changes as required to provide additional information, and I'm Logan Attwood on the ScyllaDB-Users Slack.

@avikivity
Member

It's really odd.

@lattwood
Author

lattwood commented Aug 14, 2023

This isn't just an appearance/UI bug.

I did a node replacement while writes were ongoing, but didn't get a chance to do a rolling restart until Monday. I'm now running a repair, and disk utilization is going up by a not insignificant amount...

Meaning the incorrect metadata is being used for write persistence, is probably used for repair, and won't be noticeable if you're doing most of your read/write ops at consistency level ONE or TWO. Because of the relatively silent nature of this issue, it has potentially catastrophic consequences for data already stored in, as well as newly written to, the cluster.

The documentation on increasing the replication factor via ALTER KEYSPACE statements makes it seem like, say, going from an effective replication factor of 5 to 2 (with an unknown strategy), then back to 5 (via a rolling restart rather than an ALTER KEYSPACE command), could result in data consistency/loss issues.

This scares the hell out of me.

Edit: one thing we/I would appreciate guidance on is what we ought to do in the meantime for data integrity. This only showed up in dev when we started to cycle out the nodes to upgrade to Ubuntu 22.04, so we essentially induced this failure, but if we were to get an instance retirement notification from AWS for a production host, we think we'd end up in this situation due to replacing the node.

Edit 2: Going to spin up a second cluster on 5.2.5 to see if it happens, and if so, will spin up a cluster on 5.1.x to check there as well, since that's still a supported release version.

@lattwood
Author

lattwood commented Aug 15, 2023

I managed to reproduce it on a fresh 5.2.5 cluster while doing writes.

Next steps are going to be doing this with a completely fresh cluster with only the system_auth replication factor change & creation of new keyspace- but no writes. If it happens while doing that, I'll check with latest 5.1.

This sucks.

@lattwood
Author

lattwood commented Aug 15, 2023

Reproduced it on an idle cluster; now trying with 5.1.15.

@lattwood
Author

Not reproducible on 5.1.15.

Will be giving 5.2.0-rc0 a shot tomorrow.

scylla-5.1.15...scylla-5.2.0-rc0

Just a few changes to bisect :(

@lattwood
Author

Reproducible on 5.2.0-rc0. At this point, should I massively upsize root volume, turn on trace logging on a single cluster host, and redo this?

@lattwood
Author

Decided to see if things were any different if I used replace_address_first_boot vs replace_node_first_boot on 5.2 and didn't see any difference.

@ptrsmrn
Contributor

ptrsmrn commented Aug 17, 2023

I wish I could provide some guidance already, but we first need more data to analyze.

Reproducible on 5.2.0-rc0. At this point, should I massively upsize root volume, turn on trace logging on a single cluster host, and redo this?

Trace log level can be too much, but we can give it a shot, so if you're able to do it, then yes please.

Will be giving 5.2.0-rc0 a shot tomorrow.
scylla-5.1.15...scylla-5.2.0-rc0
Just a few changes to bisect :(

5.1.15 is "newer" than 5.2.0-rc0 (by commit date), and I wonder if an older 5.1.x would fail too; could you possibly try an older one, e.g. 5.1.4? But generally I agree with the reasoning that 5.2+ introduced a potential bug.

$ git log -1 scylla-5.1.15
commit 12966e84352e22147750b7a1ebf60f6901935900 (tag: scylla-5.1.15)
Author: Beni Peled <beni.peled@scylladb.com>
Date:   Sun Jul 30 14:30:16 2023 +0300

    release: prepare for 5.1.15
$ git log -1 scylla-5.2.0-rc0 
commit 34ab98e1be2e39e484b2ea7a57898e6ad16694de (HEAD, tag: scylla-5.2.0-rc0)
Author: Anna Mikhlin <anna.mikhlin@scylladb.com>
Date:   Wed Jan 18 14:54:36 2023 +0200

    release: prepare for 5.2.0-rc0

@lattwood
Author

lattwood commented Aug 18, 2023

Alrighty. I have a disgusting volume of logs now, although a lot of them are thrown exceptions due to the deleted node - it's 978MB gzipped for all 6 nodes - we can figure out how to get you that after the weekend.

I did, however, notice something when going through the logs: when update_endpoint is getting called, a lot of the time the rack info is missing, and this only happens during the node replacement. From the "new" node:

[screenshot: update_endpoint log lines from the "new" node]

From a node in the cluster-

[screenshot: update_endpoint log lines from an existing cluster node]

That log line is written here-

if (current != _current_locations.end()) {
    if (current->second.dc == dr.dc && current->second.rack == dr.rack) {
        return;
    }
    remove_endpoint(ep);
}
tlogger.debug("update_endpoint: {} {}/{}", ep, dr.dc, dr.rack);
_dc_endpoints[dr.dc].insert(ep);
_dc_racks[dr.dc][dr.rack].insert(ep);
_datacenters.insert(dr.dc);
_current_locations[ep] = std::move(dr);

Given the comparison in the condition above the log statement, and that the subsequent lines insert the node's IP address into maps of sets keyed by the very dc/rack strings being logged, I would expect those strings not to be zero length. It would really suck if this is a red herring.
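
To illustrate the concern with a toy sketch (not Scylla code): if update_endpoint ever runs with empty dc/rack strings, the endpoint lands in a separate "" bucket, and anything that walks the real DC's bucket no longer sees it:

#include <iostream>
#include <map>
#include <set>
#include <string>

int main() {
    // simplified stand-in for _dc_endpoints: dc name -> set of endpoint addresses
    std::map<std::string, std::set<std::string>> dc_endpoints;

    // normal update: dc resolved correctly
    dc_endpoints["eu-central"].insert("10.1.90.69");

    // update arriving with an empty dc, as the log lines suggest
    dc_endpoints[""].insert("10.1.93.255");

    // anything iterating the "eu-central" bucket now misses the replacement node
    std::cout << "eu-central has " << dc_endpoints["eu-central"].size()
              << " endpoint(s)\n";  // prints 1, not 2
}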

@ptrsmrn
Contributor

ptrsmrn commented Aug 21, 2023

@bhalevy can you please advise here? Wasn't that something that got accidentally changed during one of the refactors?

@bhalevy
Member

bhalevy commented Aug 21, 2023

@bhalevy can you please advise here? Wasn't that something that got accidentally changed during one of the refactors?

Not that I am aware of.
nodetool getendpoints uses an internal function, get_natural_endpoints, that's used extensively on the common path, and we didn't see any such issues there.
@tgrabiec any insight?

@bhalevy
Member

bhalevy commented Aug 21, 2023

@ptrsmrn let's try to reproduce this locally.

@lattwood
Author

I can grab the scylla & snitch config I’m using in an hour or two for you

@bhalevy
Member

bhalevy commented Aug 21, 2023

Thanks. The problem you pointed out around update_endpoint missing dc/rack information may be very relevant.
And there were changes in this area in 5.2 vs. 5.1 (and further changes in later releases that likely fix this issue).

@bhalevy
Member

bhalevy commented Aug 21, 2023

@lattwood did you upload the (huge) log file anywhere?

@lattwood
Author

Instances are in eu-central-1, have the instance metadata service enabled, and have IMDSv2 set to optional.

Config files-

/etc/scylla/scylla.yaml

# Configured by Ansible Scylla role
# Additional parameters can be edited right here for all-node distribution

cluster_name: "a-cluster-name"

data_file_directories:
    - /var/lib/scylla/data

commitlog_directory: /var/lib/scylla/commitlog

authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer

num_tokens: 256
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
commitlog_segment_size_in_mb: 32

seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.1.90.69"

listen_address: 10.1.93.255
native_transport_port: 9042
read_request_timeout_in_ms: 5000
write_request_timeout_in_ms: 2000
cas_contention_timeout_in_ms: 1000
endpoint_snitch: Ec2Snitch
rpc_address: 10.1.93.255
broadcast_address: 10.1.93.255
broadcast_rpc_address: 10.1.93.255
rpc_port: 9160
api_port: 10000
api_address: 127.0.0.1
batch_size_warn_threshold_in_kb: 5
batch_size_fail_threshold_in_kb: 50
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
commitlog_total_space_in_mb: -1
murmur3_partitioner_ignore_msb_bits: 12
api_ui_dir: /opt/scylladb/swagger-ui/dist/
api_doc_dir: /opt/scylladb/api/api-doc/


enable_sstables_mc_format: True
internode_compression: all
consistent_cluster_management: True

/etc/scylla/cassandra-rackdc.properties

# Configured by Ansible Scylla role

#
# cassandra-rackdc.properties
#
# The lines may include white spaces at the beginning and the end.
# The rack and data center names may also include white spaces.
# All trailing and leading white spaces will be trimmed.
#
# dc=my_data_center
# rack=my_rack
# prefer_local=<false | true>
# dc_suffix=<Data Center name suffix, used by EC2SnitchXXX snitches>
#


@lattwood
Author

@bhalevy I'm going to upload them to an S3 bucket this morning (I'm in Atlantic Time, +1 from eastern) and give you a presigned URL via a side-channel. Compressed the logs from the 6 nodes are < 1GB so it won't take long to upload.
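
(The upload itself is just the standard AWS CLI flow, something like the following - bucket and object names are placeholders:)

aws s3 cp scylla-logs.tar.gz s3://some-log-bucket/scylla-logs.tar.gz
aws s3 presign s3://some-log-bucket/scylla-logs.tar.gz --expires-in 604800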

I noticed there's a 5.3.0-rc0 but it's from May. Is that worth me trying to reproduce this on, or should I take a stab at an earlier 5.1 build?

@lattwood
Author

@bhalevy s3 presigned url provided via Slack

@bhalevy
Member

bhalevy commented Aug 22, 2023

I haven't been able to reproduce the issue yet.

@xemul did you encounter the Ec2Snitch returning empty dc/rack values by any chance?

@lattwood what is the cluster topology and what nodes did you try replacing?
BTW, I'm not sure if the update_endpoint messages with empty dc/rack are indeed the smoking gun since apparently they can happen normally in 5.2 when we clone the topology structure internally.

@lattwood
Author

@bhalevy 1 DC, 3 racks, 2 nodes per rack.

I'll see about reproducing this again today with tcpdump running during the critical period and trace logging enabled.
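
The capture plan is roughly the following, assuming the default inter-node ports 7000/7001 (file name is a placeholder):

# on the replacement node, for the duration of the replace
sudo tcpdump -i any -w replace.pcap 'port 7000 or port 7001'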

@lattwood
Author

And it happens when replacing any of the nodes after a termination in the EC2 Console, akin to what would happen if an EC2 instance had a hardware failure and AWS terminated it - not an uncommon occurrence on EC2.

@lattwood
Author

lattwood commented Aug 23, 2023 via email

@bhalevy
Member

bhalevy commented Aug 24, 2023

@lattwood I'm not sure you need to build in debug mode - it will make Scylla very slow - unless we expect the address or undefined-behavior sanitizers to come to the rescue.
It should be fine to use dev mode for such cases.

@lattwood
Author

Hmm, got it. Will try with a dev build; the debug build was quite slow on master, and the issue didn't show up with that build. Going to see if it happens with a dev build this morning. If it does, it means a race condition 😭
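
For the record, the dev build I have in mind is roughly this, assuming the in-tree toolchain wrapper (checked out at whatever tag/branch is under test):

./tools/toolchain/dbuild ./configure.py --mode=dev
./tools/toolchain/dbuild ninja build/dev/scylla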

@lattwood
Author

I need to do some more testing tomorrow, but I think this is resolved with the release of 5.2.7, specifically this change-

aca9e41

@bhalevy
Member

bhalevy commented Aug 25, 2023

I need to do some more testing tomorrow, but I think this is resolved with the release of 5.2.7, specifically this change-

aca9e41

Great news. Let us know if it's indeed that.

@lattwood
Author


Confirmed this is fixed via aca9e41 by cherry-picking that on top of 5.2.6, compiling and testing here: https://github.com/lattwood/scylladb/tree/lattwood/scylla-5.2.6-hotfix
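
For anyone wanting to repeat the check, the branch above boils down to roughly this (then build and rerun the node-replacement scenario):

git clone https://github.com/scylladb/scylladb.git
cd scylladb
git checkout -b lattwood/scylla-5.2.6-hotfix scylla-5.2.6
git cherry-pick aca9e41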

Thanks for the pointers on all of this.

@DoronArazii added this to the 5.4 milestone Aug 29, 2023