During rollout restart operator doesn't wait until previously restarted pod becomes part of a Scylla cluster #1077

Open
Tracked by #1577
vponomaryov opened this issue Oct 28, 2022 · 37 comments · May be fixed by #1614
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

@vponomaryov
Contributor

Describe the bug
If we change the number of cores for Scylla (which triggers a rolling restart of the pods), we start hitting the following race condition:
the first pod gets restarted, and if it rejoins the cluster with some unexpected delay, the operator doesn't wait for it and restarts the second pod.
In a cluster of 3 nodes this makes the cluster inoperable, without quorum, for several minutes.
It happens on GKE and doesn't happen on EKS, because on EKS a restarted pod rejoins the cluster much faster.

Below are the logs from a run on GKE cluster.
pod-2 (the first restarted) logs:

2022-10-27 18:40:55,710 INFO waiting for scylla to stop
INFO  2022-10-27 18:40:55,710 [shard 0] init - Signal received; shutting down
...
INFO  2022-10-27 18:40:57,897 [shard 0] gossip - Gossip is now stopped
...
INFO  2022-10-27 18:40:59,866 [shard 0] init - Scylla version 5.1.0~rc3-0.20221009.9deeeb4db1cd shutdown complete.
2022-10-27 18:40:59,940 INFO stopped: scylla (exit status 0)
...
I1027 18:41:03.305900       1 operator/sidecar.go:438] "Scylla process finished"
rpc error: code = NotFound desc = an error occurred when try to find container "ef77bed95aa5838282bc5d55420a2718d6b111912fc2954dc645c35c7ce87d3f": not found
I1027 18:41:11.371885       1 operator/sidecar.go:147] sidecar version "v1.8.0-alpha.0-133-g97c831e"
...
I1027 18:41:12.363766       1 operator/sidecar.go:360] "Starting scylla"
...
INFO  2022-10-27 18:41:15,764 [shard 0] database - Resharded 7kB for system.compaction_history in 0.54 seconds, 13kB/s
...
INFO  2022-10-27 18:41:18,583 [shard 0] init - loading system_schema sstables
...
INFO  2022-10-27 18:41:19,311 [shard 0] init - setting up system keyspace
...
INFO  2022-10-27 18:41:21,558 [shard 0] database - Resharded 235MB for keyspace1.standard1 in 2.02 seconds, 116MB/s
...
INFO  2022-10-27 18:41:21,621 [shard 0] storage_service - entering STARTING mode
...
INFO  2022-10-27 18:41:21,700 [shard 0] storage_service - Starting up server gossip
...
E1027 18:41:24.931854       1 sidecar/probes.go:110] "readyz probe: can't get scylla native transport" err="giving up after 1 attempts: agent [HTTP 404] Not found" Service="scylla/sct-cluster-us-east1-b-us-east1-2" Node="10.80.11.2"
...
E1027 18:41:34.932431       1 sidecar/probes.go:110] "readyz probe: can't get scylla native transport" err="giving up after 1 attempts: agent [HTTP 404] Not found" Service="scylla/sct-cluster-us-east1-b-us-east1-2" Node="10.80.11.2"
...
INFO  2022-10-27 18:41:44,256 [shard 0] storage_service - entering JOINING mode
...
INFO  2022-10-27 18:41:44,828 [shard 0] storage_service - Node 10.80.11.2 state jump to normal
E1027 18:41:44.931320       1 sidecar/probes.go:110] "readyz probe: can't get scylla native transport" err="giving up after 1 attempts: agent [HTTP 404] Not found" Service="scylla/sct-cluster-us-east1-b-us-east1-2" Node="10.80.11.2"
INFO  2022-10-27 18:41:45,035 [shard 0] storage_service - entering NORMAL mode
...
INFO  2022-10-27 18:41:57,051 [shard 0] init - Scylla version 5.1.0~rc3-0.20221009.9deeeb4db1cd initialization completed.
WARN  2022-10-27 18:42:45,043 [shard 0] cdc - Could not update CDC description table with generation (2022/10/27 18:24:22, 69265668-5186-4a22-a2b1-6b6bad5a0f55): exceptions::unavailable_exception (Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1). Will try again.
WARN  2022-10-27 18:43:14,796 [shard 0] gossip - Skip marking node 10.80.4.107 with status = shutdown as UP
INFO  2022-10-27 18:43:14,796 [shard 0] gossip - InetAddress 10.80.14.183 is now UP, status = NORMAL
INFO  2022-10-27 18:43:14,798 [shard 0] storage_service - Node 10.80.14.183 state jump to normal
INFO  2022-10-27 18:43:14,801 [shard 0] storage_service - Node 10.80.4.107 state jump to normal
INFO  2022-10-27 18:43:14,950 [shard 0] gossip - InetAddress 10.80.12.113 is now DOWN, status = LEFT
INFO  2022-10-27 18:43:14,952 [shard 0] gossip - Node 10.80.12.113 will be removed from gossip at [2022-10-30 18:31:53]: (expire = 1667154713831435293, now = 1666896194952348069, diff = 258518 seconds)
...
WARN  2022-10-27 18:43:36,752 [shard 0] gossip - Fail to send EchoMessage to 10.80.4.107: seastar::rpc::closed_error (connection is closed)
INFO  2022-10-27 18:43:36,892 [shard 0] gossip - InetAddress 10.80.4.107 is now UP, status = NORMAL

pod-1 logs (second restarted):

...
INFO  2022-10-27 18:32:55,041 [shard 0] gossip - 60000 ms elapsed, 10.80.12.113 gossip quarantine over
INFO  2022-10-27 18:40:55,897 [shard 0] gossip - Got shutdown message from 10.80.11.2, received_generation=1666893961, local_generation=1666893961
INFO  2022-10-27 18:40:55,898 [shard 0] gossip - InetAddress 10.80.11.2 is now DOWN, status = shutdown
2022-10-27 18:42:05,276 INFO waiting for scylla to stop
INFO  2022-10-27 18:42:05,291 [shard 0] compaction_manager - Asked to stop
...

So, from the logs above, we can see that at the time pod-1 (the second one restarted) got the shutdown message, the first one
was not ready yet. The quorum error that appeared in the CDC log proves it.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy operator
  2. Deploy 3-pod Scylla cluster on GKE
  3. Run some constant load
  4. Change the number of CPUs in the scyllacluster spec
  5. Wait for the restart of the second pod
  6. See quorum-errors on the loader side.

Expected behavior
The operator should wait for the previously restarted pod to rejoin the Scylla cluster before restarting the next one.

Logs
K8S logs: https://cloudius-jenkins-test.s3.amazonaws.com/0512c157-c4d7-4d3f-9b44-2bcce9d34de9/20221027_191640/kubernetes-0512c157.tar.gz
DB logs: https://cloudius-jenkins-test.s3.amazonaws.com/0512c157-c4d7-4d3f-9b44-2bcce9d34de9/20221027_191640/db-cluster-0512c157.tar.gz

Environment:

  • Platform: GKE
  • Kubernetes version: 1.22
  • Scylla version: 5.0.5 , 5.1.0-rc3
  • Scylla-operator version: v1.8.0-alpha.0-133-g97c831e | v1.7.4
@vponomaryov vponomaryov added the kind/bug Categorizes issue or PR as related to a bug. label Oct 28, 2022
@tnozicka
Member

We wait for readiness, I wonder if this is the case where it becomes ready without having a quorum.

@tnozicka tnozicka added this to Backlog in Release 1.9 via automation Oct 31, 2022
@tnozicka tnozicka added this to the v1.9 milestone Oct 31, 2022
@tnozicka tnozicka added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Oct 31, 2022
@fruch

fruch commented Nov 2, 2022

@vponomaryov do we have a run of this with scylla-core latest master?

@vponomaryov
Contributor Author

  • 5.1.0-rc3

I didn't run it with master. Only with the specified versions in the bug report.

@fruch

fruch commented Nov 2, 2022

Can we run it at least once with the master version?

And then we should try having a dtest reproducer of it. Our rolling restart process shouldn't be that brittle.

@vponomaryov
Contributor Author

Can we run it at least once with the master version?

And then we should try having a dtest reproducer of it. Our rolling restart process shouldn't be that brittle.

Done here: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/vp-longevity-scylla-operator-3h-gke/17/consoleFull

The error is expected because a Scylla node always becomes UN for itself first and only later for the others.
So, what we really need to reproduce it is slowness like on the GKE platform.
For example, this bug is not reproduced using an all-in-one K8S (kind).
So, a dtest reproducer will need some tricks to cause slow status updates.

@fruch

fruch commented Nov 2, 2022

We can slow down traffic between two nodes

@tnozicka tnozicka assigned rzetelskik and unassigned zimnx Nov 14, 2022
vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this issue Nov 15, 2022
The 'test_nodetool_flush_and_reshard' test fails due to the following
bug: scylladb/scylla-operator#1077
So, skip it.
@vponomaryov
Contributor Author

vponomaryov commented Nov 15, 2022

The probability of hitting this bug is very high.
Our nodetool flush and reshard K8S test, which triggers a rolling restart, constantly suffers from it.
The latest failure is here: https://jenkins.scylladb.com/job/scylla-operator/job/operator-master/job/functional/job/functional-k8s-local-kind-aws/25/consoleFull
The corresponding nemesis is already being skipped constantly.

vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this issue Nov 16, 2022
The 'test_nodetool_flush_and_reshard' test is affected by the following
bug:

    scylladb/scylla-operator#1077

using non-fast K8S backends such as 'k8s-gke' and 'k8s-local-kind'.
So, skip it on the slow backends.
fruch pushed a commit to scylladb/scylla-cluster-tests that referenced this issue Nov 16, 2022
The 'test_nodetool_flush_and_reshard' test is affected by the following
bug:

    scylladb/scylla-operator#1077

using non-fast K8S backends such as 'k8s-gke' and 'k8s-local-kind'.
So, skip it on the slow backends.
@mykaul
Contributor

mykaul commented Nov 22, 2022

@tnozicka - I wonder whether our logic should be converted to 'ask other nodes what node X's status is' instead of 'ask node X what its status is' ...

vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this issue Nov 23, 2022
Operator's racy bug [1] gets reproduced on slow backends and
nodes with small number of cores.
So, skip it for the nodes that are not running on the EKS and have
less than 14 cores.

[1] scylladb/scylla-operator#1077
@tnozicka
Member

tnozicka commented Nov 23, 2022

Yeah, currently readiness projects the ability to talk CONSISTENCY_ONE, and we are talking about migrating it to project CONSISTENCY_QUORUM.

@rzetelskik can you summarize the update of your discoveries we talked about today?

fruch pushed a commit to scylladb/scylla-cluster-tests that referenced this issue Nov 27, 2022
Operator's racy bug [1] gets reproduced on slow backends and
nodes with small number of cores.
So, skip it for the nodes that are not running on the EKS and have
less than 14 cores.

[1] scylladb/scylla-operator#1077
@fruch

fruch commented Nov 28, 2022

Yeah, currently readiness projects the ability to talk CONSISTENCY_ONE, and we are talking about migrating it to project CONSISTENCY_QUORUM.

@rzetelskik can you summarize the update of your discoveries we talked about today?

@rzetelskik @tnozicka can one of you update on this one, and shed some light on what "project CONSISTENCY_QUORUM" means in this context, because I don't understand what it means...

@rzetelskik
Member

rzetelskik commented Nov 28, 2022

@rzetelskik can you summarize the update of your discoveries we talked about today?

Sure. In an effort to reproduce this I've added a simple coroutine to our E2E test suite that just sends read queries periodically with quorum consistency, and it caused the suite to fail quite consistently. So I haven't yet followed up on why this only kept happening on GKE for you, since the issue seems more general anyway.
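For illustration, such a reader coroutine can be sketched with gocql roughly as below; the contact point, keyspace and table are placeholders, not the actual E2E suite code:

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Placeholder contact point and schema; the real E2E suite wires these up differently.
	cluster := gocql.NewCluster("sct-cluster-client.scylla.svc")
	cluster.Keyspace = "keyspace1"
	cluster.Consistency = gocql.Quorum // every read requires a quorum of replicas
	cluster.Timeout = 5 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("create session: %v", err)
	}
	defer session.Close()

	// Periodically issue a QUORUM read; an unavailable_exception raised during
	// the rolling restart surfaces here as a query error.
	for range time.Tick(time.Second) {
		var key []byte
		if err := session.Query(`SELECT key FROM standard1 LIMIT 1`).Scan(&key); err != nil {
			log.Printf("quorum read failed: %v", err)
		}
	}
}
```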

Our readiness probe only checks whether the node considers itself UN. We've hit issues related to it and discussed it many times so that's nothing new. By the time a node declares readiness, the gossip should've settled, although nothing stops it from settling with a split brain, so it's quite easy to hit the following sequence:

  1. Entire 3-node cluster up: {1: {1: UN, 2: UN, 3: UN}, 2: {1: UN, 2: UN, 3: UN}, 3: {1: UN, 2: UN, 3: UN}}
  2. Restart node 3 and wait for it to come back up and declare readiness.
  3. Node 3 is up but not a part of the cluster: {1: {1: UN, 2: UN, 3: DN}, 2: {1: UN, 2: UN, 3: DN}, 3: {1: DN, 2: DN, 3: UN}}
  4. Restart node 2: {1: {1: UN, 2: DN, 3: DN}, 3: {1: DN, 2: DN, 3: UN}}

At this point you'll get errors from queries with quorum consistency regardless of which node you hit.

So we've discussed different approaches to how we could modify the readiness probe or the operator to get closer to what the documentation advises, i.e. all nodes UP before performing a topology change.
Essentially we want our readiness probes' semantics to no longer just be "I'm ready to serve traffic" but "I'm ready for a topology change". However we weren't able to just declare readiness when, and only when, a node considers all of its peers UN, since that caused some issues with rollouts.

The current idea suggested by @tnozicka is that each node should ask all of its peers about the statuses of their peers and only declare readiness if all of them consider at least N-1 of their peers UN, where N is the total number of peers.

@rzetelskik @tnozicka can one of you update on this one, and shed some light on what "project CONSISTENCY_QUORUM" means in this context, because I don't understand what it means...

I think what Tomas's message meant is that currently the readiness probe only declares that a given node is able to serve queries with consistency ONE (which is only true if that node replicates the data in question) and at that point we were discussing implementing a heuristic where a node would check whether a quorum of its peers is UN.
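A minimal sketch of the decision logic behind that idea (illustration only, not the actual probe code; how the per-peer views would be collected, e.g. via the agents, is left out):

```go
package readiness

// peerViews maps each peer to the set of nodes it currently considers UN.
// Collecting these views (e.g. by querying every peer's API) is out of scope
// here; this only illustrates the readiness decision discussed above.
type peerViews map[string]map[string]bool

// readyForTopologyChange returns true only if every peer reports at least
// N-1 of the N known nodes as UN.
func readyForTopologyChange(views peerViews, nodes []string) bool {
	n := len(nodes)
	for _, seenUN := range views {
		up := 0
		for _, node := range nodes {
			if seenUN[node] {
				up++
			}
		}
		if up < n-1 {
			return false
		}
	}
	return true
}
```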

@soyacz

soyacz commented Jan 31, 2023

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

It happens not only when changing the number of cores. The same situation happened during a typical rolling restart on GKE.
node-2:

INFO  2023-01-29 02:30:46,965 [shard 0] init - Scylla version 5.3.0~dev-0.20230127.5eadea301ecc initialization completed.
WARN  2023-01-29 02:31:34,956 [shard 0] cdc - Could not update CDC description table with generation (2023/01/29 01:40:35, 970df7c8-82ce-403b-b4a9-4804eaa2cee8): exceptions::unavailable_exception (Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1). Will try again.

just before the CDC errors, node-1 restarted:

2023-01-29 02:30:57,175 INFO waiting for scylla to stop
INFO  2023-01-29 02:30:57,180 [shard 0] compaction_manager - Asked to stop

Impact

Cluster is not operable for CL=QUORUM when RF=3

How frequently does it reproduce?

During this run we had only one RollingRestart nemesis and it failed. Observed on GKE.

Installation details

Kernel Version: 5.15.0-1023-gke
Scylla version (or git commit hash): 5.3.0~dev-20230127.5eadea301ecc with build-id 057196c711d6c52ca6781558c576c858a3cf049b

Operator Image: scylladb/scylla-operator:latest
Operator Helm Version: v1.9.0-alpha.0-41-g04c20d3
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (n1-standard-8)

Scylla Nodes used in this run:
No resources left at the end of the run

OS / Image: N/A (k8s-gke: us-east1)

Test: longevity-scylla-operator-3h-gke
Test id: c3397295-e68c-4ce5-bc39-3bd16e034133
Test name: scylla-operator/operator-master/gke/longevity-scylla-operator-3h-gke
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor c3397295-e68c-4ce5-bc39-3bd16e034133
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs c3397295-e68c-4ce5-bc39-3bd16e034133

Logs:

Jenkins job URL

@fruch

fruch commented Oct 9, 2023

During testing of operator 1.11, I've hit this in multiple cases, both on GKE, where we see it quite often, and on EKS longevity runs (the 12h run and multi-tenancy).

@tnozicka
Member

Thanks, we are having some architectural discussions about what we could do for v1.12, so this is not forgotten, but we have limited bandwidth with everything else going on.

@zimnx
Collaborator

zimnx commented Oct 18, 2023

I ran experiments on GKE with minReadySeconds on StatefulSets - in short, it's a sleep before updating the next node - on different setups.

Each setup consisted of 3 unbound loaders sending read requests continuously using the gocql driver with default config. After the cluster was fully up and running, the test triggered a rolling restart. Traffic was stopped 5s after the restart was completed.

setup | requests | failures | success ratio
--- | --- | --- | ---
ClusterIP TLS minReadySeconds=0 | 360008 | 278367 | 0.226
ClusterIP TLS minReadySeconds=60 | 161707 | 18085 | 0.888
ClusterIP TLS minReadySeconds=120 | 293230 | 12554 | 0.957
PodIP TLS minReadySeconds=0 | 80427 | 78 | 0.999
PodIP TLS minReadySeconds=60 | 118925 | 53 | 0.999
PodIP TLS minReadySeconds=120 | 244028 | 74 | 0.999

The discrepancy between the ClusterIP and PodIP results might suggest that kube-proxy, which provides the ClusterIP overlay, causes most of the failures. To verify whether that's the case, I repeated the tests on GKE with Dataplane V2 (Cilium managed by Google) where kube-proxy is not present.

setup | requests | failures | success ratio
--- | --- | --- | ---
ClusterIP TLS minReadySeconds=0 | 54367 | 979 | 0.982
ClusterIP TLS minReadySeconds=60 | 113112 | 441 | 0.996
ClusterIP TLS minReadySeconds=120 | 201340 | 789 | 0.996
PodIP TLS minReadySeconds=0 | 61446 | 1701 | 0.972
PodIP TLS minReadySeconds=60 | 114669 | 159 | 0.999
PodIP TLS minReadySeconds=120 | 198945 | 638 | 0.997

The results show the Operator is not able to provide a 100% success rate in any setup, even when minReadySeconds is high. Still, setting it to 60s would help a lot on the default ClusterIP configuration while not influencing bootstrap time that much.

Similar results were observed when traffic was non-TLS.

@rzetelskik
Member

rzetelskik commented Oct 18, 2023

@zimnx good job with getting all these results. Were you able to find out the mode that kube-proxy is running in in the default scenario? Given the results I'd assume it's iptables but we need to check.

Also, coming back to what the original issue says:

It happens on GKE and doesn't happen on EKS, because on EKS a restarted pod rejoins the cluster much faster.

It was over a year ago, so it might have changed, but @vponomaryov do you modify the networking stack in EKS anyhow, or is it vanilla?

@vponomaryov
Contributor Author

It was over a year ago, so it might have changed, but @vponomaryov do you modify the networking stack in EKS anyhow, or is it vanilla?

  • We enable the vpc-cni addon.
  • We set WARM_ENI_TARGET=0, MINIMUM_IP_TARGET=8 and WARM_IP_TARGET=2 env vars to the aws-node daemonset.
  • We add security groups for each of the network interfaces of K8S nodes.

@zimnx
Collaborator

zimnx commented Oct 26, 2023

After my last experiment, I aggregated the errors by type.

Most of the errors were unavailability errors, and I found out that in the setups where the total number of errors is low, almost all of them show up when a node is tearing down.

I looked up the Scylla documentation about how nodes should be torn down, and we have a mismatch. Changing the PreStopHook to nodetool drain + supervisorctl stop scylla caused Scylla to start printing lots of messages about operations being aborted on other nodes. I found a related issue (scylladb/scylladb#10447); unfortunately the fix is not easy to backport, so it wasn't backported to recent versions. Apparently master was fixed, but when I tried it I wasn't able to restart any node because Scylla got stuck. I tried older versions and found out that 5.0.13 doesn't print these abort operation messages, and it also solved the failures happening during node teardown.
On the PodIP setup I was left with only 2-3 EOF failures, which are either a Scylla bug (not gracefully shutting down the connection) or the gocql driver misbehaving. I left these unresolved to proceed further; we can tackle them later.

This unfortunately didn't solve the traffic disruption on ClusterIP setups, where the errors were showing more than 1 node being down. This meant there was a split brain in the gossip state. Looking at multiple nodetool status outputs from all nodes confirmed it.

I initially thought that maybe kube-proxy lags and iptables are not updated fast enough, but I ruled that out, as experiments showed the Service ClusterIP mappings in iptables are updated right after the Pod is recreated with a new PodIP.

Scylla keeps 4 connections between each shard and node on port 7000, used for inter-node communication. One of them is used for gossip. These connections are lazily initialized, so there might be fewer than 4 connections if one was dropped and nothing has needed a new one since.

If we look at a stable 2-node ClusterIP cluster where each node has 1 shard, we can see the existing connections and their mapping using conntrack.
Brief info about the cluster state:

pod-0: 10.85.0.28
pod-1: 10.85.0.31

svc-0: 10.106.100.33
svc-1: 10.99.34.233
tcp      6 431999 ESTABLISHED src=10.85.0.31 dst=10.106.100.33 sport=61507 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=61507 [ASSURED] mark=0 use=1
tcp      6 431999 ESTABLISHED src=10.85.0.31 dst=10.106.100.33 sport=58256 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=58256 [ASSURED] mark=0 use=1
tcp      6 431972 ESTABLISHED src=10.85.0.31 dst=10.106.100.33 sport=60959 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=60959 [ASSURED] mark=0 use=1
tcp      6 431972 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=54112 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=54112 [ASSURED] mark=0 use=1 
tcp      6 431999 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=49551 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=49551 [ASSURED] mark=0 use=1 
tcp      6 431999 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=55568 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=55568 [ASSURED] mark=0 use=1 
tcp      6 431943 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=49506 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=49506 [ASSURED] mark=0 use=1 

There are 3 connections from pod-1 to svc-0 with the correctly NATed address of pod-0, and 4 connections from pod-0 to svc-1 with the correctly NATed address of pod-1.

When pod-1 is being deleted, conntrack shows multiple attempts where pod-0 is trying to reconnect to svc-1 but fails, which is expected.
When pod-1 is recreated with the new IP 10.85.0.32, we can see that there are 3 ongoing attempts to connect to svc-1, but with the old pod-1 IP address, in SYN_SENT state, meaning they are awaiting a SYN-ACK:

tcp      6 75 SYN_SENT src=10.85.0.28 dst=10.99.34.233 sport=53454 dport=7000 [UNREPLIED] src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=53454 mark=0 use=1
tcp      6 74 SYN_SENT src=10.85.0.28 dst=10.99.34.233 sport=57715 dport=7000 [UNREPLIED] src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=57715 mark=0 use=1
tcp      6 75 SYN_SENT src=10.85.0.28 dst=10.99.34.233 sport=64317 dport=7000 [UNREPLIED] src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=64317 mark=0 use=1

Between the time pod-1 was deleted and the new one was recreated, pod-0 tried to reconnect but the traffic was blackholed, meaning this SYN is lost and the session needs to time out.
At the same time, we can see that the old sessions entered the TIME_WAIT state, which is normal, and that pod-1 managed to connect to svc-0:

tcp      6 65 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=54112 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=54112 [ASSURED] mark=0 use=2       
tcp      6 65 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=49551 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=49551 [ASSURED] mark=0 use=1       
tcp      6 65 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=55568 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=55568 [ASSURED] mark=0 use=1       
tcp      6 69 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=58760 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=58760 [ASSURED] mark=0 use=1       
tcp      6 65 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=49506 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=49506 [ASSURED] mark=0 use=1       
tcp      6 69 TIME_WAIT src=10.85.0.31 dst=10.106.100.33 sport=61507 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=61507 [ASSURED] mark=0 use=1     
tcp      6 69 TIME_WAIT src=10.85.0.31 dst=10.106.100.33 sport=58256 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=58256 [ASSURED] mark=0 use=1      
tcp      6 69 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=50906 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=50906 [ASSURED] mark=0 use=1      
tcp      6 69 TIME_WAIT src=10.85.0.31 dst=10.106.100.33 sport=60959 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=60959 [ASSURED] mark=0 use=1      
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=62421 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=62421 [ASSURED] mark=0 use=1
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=55614 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=55614 [ASSURED] mark=0 use=1
tcp      6 431998 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=60774 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=60774 [ASSURED] mark=0 use=1

After the SYN_SENT sessions expired, there were no sessions from pod-0 to svc-1, only from pod-1 to svc-0:

tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=62421 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=62421 [ASSURED] mark=0 use=1
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=55614 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=55614 [ASSURED] mark=0 use=1
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=60774 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=60774 [ASSURED] mark=0 use=1

Eventually pod-0 tried to reconnect and succeeded with correct pod-1 IP:

tcp      6 431999 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=54055 dport=7000 src=10.85.0.32 dst=10.85.0.28 sport=7000 dport=54055 [ASSURED] mark=0 use=1 
tcp      6 431991 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=55146 dport=7000 src=10.85.0.32 dst=10.85.0.28 sport=7000 dport=55146 [ASSURED] mark=0 use=1 
tcp      6 431999 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=52300 dport=7000 src=10.85.0.32 dst=10.85.0.28 sport=7000 dport=52300 [ASSURED] mark=0 use=1 
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=55614 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=55614 [ASSURED] mark=0 use=1
tcp      6 431991 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=60774 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=60774 [ASSURED] mark=0 use=1
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=62421 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=62421 [ASSURED] mark=0 use=1

Looks like the SYN_SENT sessions are the root cause of our gossip split brain, as nodes cannot establish a session to the node being restarted. The restarted node reports UN, as it is able to talk to the others, while the others are stuck on their connection attempts. Immediately after these stuck sessions expire, new sessions are established and the gossip state synchronizes.

On GKE, SYN packets are retransmitted at most 6 times, meaning that from the first SYN to failure it can take 127s, and then Scylla needs to reconnect, which can take another couple of seconds, and then exchange gossip.

To solve it we have several options:

  • Introduce a new configuration value into Scylla which would control the timeout on the rpc client connect - a new feature, meaning we would get it only in the next version.
  • Remove UNREPLIED conntrack entries for ScyllaCluster Pods (see the sketch after this list). When the next SYN retransmission happens and there's no conntrack entry, the connection attempt fails and Scylla reconnects. This could be a good workaround until the timeout is introduced in Scylla, as we already have privileged nodesetup Pods running in the host network on every Scylla node.
  • Set minReadySeconds to a value high enough to outlast the maximum SYN retransmission time. This has the big downside of increasing cluster bootstrap time.
  • Come back to the initial idea about collecting the state of all (QUORUM?) nodes in the readiness probe.
  • Maybe something else, suggestions are welcome.
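For the second option, a rough sketch of what a privileged node-setup helper could do is below; it shells out to conntrack-tools, so the binary, flags and the port filter are assumptions that would need to be verified against the actual node image:

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	// Periodically delete conntrack entries stuck in SYN_SENT towards the Scylla
	// inter-node port (7000). The next SYN retransmission then fails fast and
	// Scylla re-resolves the ClusterIP mapping when it reconnects.
	for range time.Tick(20 * time.Second) {
		out, err := exec.Command(
			"conntrack", "-D", "-p", "tcp", "--orig-port-dst", "7000", "--state", "SYN_SENT",
		).CombinedOutput()
		if err != nil {
			// A non-zero exit may simply mean no entries matched; log and keep going.
			log.Printf("conntrack delete: %v: %s", err, out)
		}
	}
}
```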

@rzetelskik
Member

rzetelskik commented Oct 27, 2023

@zimnx thanks for diving into it. I wonder how many more times conntrack is going to bite us. 😛

On GKE, SYN packets are retransmitted at most 6 times, meaning that from the first SYN to failure it can take 127s

Just to understand it, could you explain the calculation here?

  • Remove UNREPLIED conntrack entries for ScyllaCluster Pods. When the next SYN retransmission happens and there's no conntrack entry, the connection attempt fails and Scylla reconnects. This could be a good workaround until the timeout is introduced in Scylla, as we already have privileged nodesetup Pods running in the host network on every Scylla node.

Remember that such workaround would require running with additional linux capabilities. Looking at the conntrack cleaner daemonset (9a63e01), which we used in our CI to resolve a similar issue in the past (#971), it would require NET_ADMIN.

I also wonder if that wouldn't affect stability of otherwise healthy clusters.

  • Set minReadySeconds to a value high enough to outlast the maximum SYN retransmission time. This has the big downside of increasing cluster bootstrap time.

I recall you mentioned 180s not being a value high enough to completely rid us of errors, which is quite surprising now given the 127s above. Was that caused by a higher syn_sent conntrack timeout?

Maybe something else, suggestions are welcome.

It's worth explicitly stating that the issue comes from running kube-proxy depending on netfilter's conntrack. From my understanding at this point we're not sure if this is specific to running kube-proxy in iptables mode, or if that also occurs in ipvs mode. @zimnx have you tried it?

Anyway, an approach that gets rid of the issue without us implementing any workarounds is to advise our users to remove the netfilter dependency if they require exposing their nodes on ClusterIPs. Like you said above, you haven't hit the issue while running in GKE with Dataplane V2 (kube-proxy-less Cilium).

@zimnx
Collaborator

zimnx commented Oct 27, 2023

On GKE, SYN packets are retransmitted at most 6 times, meaning that from the first SYN to failure it can take 127s

Just to understand it, could you explain the calculation here?

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

tcp_syn_retries - INTEGER
Number of times initial SYNs for an active TCP connection attempt
will be retransmitted. Should not be higher than 127. Default value
is 6, which corresponds to 63seconds till the last retransmission
with the current initial RTO of 1second. With this the final timeout
for an active TCP connection attempt will happen after 127seconds.
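
Spelling out the arithmetic from those defaults (an initial RTO of 1 s, doubled after every retransmission, 6 retransmissions in total):

$$1 + 2 + 4 + 8 + 16 + 32 + 64 = 127\ \text{s}$$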


Remember that such workaround would require running with additional linux capabilities

Our nodesetup pods are already running as root, so no extra permissions are required.

I also wonder if that wouldn't affect stability of otherwise healthy clusters.

Removing only SYN_SENT entries to ClusterIPs and Scylla ports after a configurable and reasonable timeout should only cause more reconnection attempts.

I recall you mentioned 180s not being a value high enough to completely rid us of errors, which is quite surprising now given the 127s above. Was that caused by a higher syn_sent conntrack timeout?

That's what I plan to look into next, maybe there's something else causing disruption.

It's worth explicitly stating that the issue comes from running kube-proxy depending on netfilter's conntrack. From my understanding at this point we're not sure if this is specific to running kube-proxy in iptables mode, or if that also occurs in ipvs mode. @zimnx have you tried it?

Nope

Like you said above, you haven't hit the issue while running in GKE with Dataplane V2 (kube-proxy-less Cilium).

I haven't tried it with different node teardown logic. It's something I want to try out later as well.

@zimnx
Collaborator

zimnx commented Oct 27, 2023

Remove UNREPLIED conntrack entries for ScyllaCluster Pods. When the next SYN retransmission happens and there's no conntrack entry, the connection attempt fails and Scylla reconnects. This could be a good workaround until the timeout is introduced in Scylla, as we already have privileged nodesetup Pods running in the host network on every Scylla node.

Another option would be to lower net.netfilter.nf_conntrack_tcp_timeout_syn_sent from the default 120s to a lower value from our sysctls-buddy container. I haven't tested it, but with this the conntrack entry translating the ClusterIP into the stale Pod IP would vanish earlier, and the next TCP retransmission would fail, triggering a reconnection. It is a global setting, though.

Leaving it as a workaround we might consider.

Doesn't work, this setting needs to be changed on the node, not in the container network namespace.

@zimnx
Collaborator

zimnx commented Oct 30, 2023

I added setting the conntrack TCP timeout for SYN_SENT entries to our node setup DaemonSet; it solved the availability issues on both ClusterIP and PodIP setups without minReadySeconds.

setup | requests | EOF/request timeout | availability failures | success ratio
--- | --- | --- | --- | ---
ClusterIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 56621 | 4 | 0 | 0.999
PodIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 56411 | 4 | 0 | 0.999

We no longer see any availability errors, meaning Scylla rolls out without losing quorum.
The EOFs might be a Scylla or gocql bug, not related to the rollout.
Timeouts may happen, as the setups run with low, non-guaranteed resources.
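
As a rough sketch of what such a node setup step could look like (the 20s value matches the experiments above, but the exact mechanism used by the node setup DaemonSet is an assumption, not the merged implementation):

```go
package main

import (
	"log"
	"os"
)

func main() {
	// Lower the conntrack SYN_SENT timeout so that stale ClusterIP -> old PodIP
	// entries expire quickly and the next SYN retransmission triggers a fresh
	// iptables lookup. Must run in the host's network namespace (node setup pod).
	const sysctl = "/proc/sys/net/netfilter/nf_conntrack_tcp_timeout_syn_sent"
	if err := os.WriteFile(sysctl, []byte("20"), 0o644); err != nil {
		log.Fatalf("failed to set %s: %v", sysctl, err)
	}
	log.Printf("set %s=20", sysctl)
}
```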

Since we found a configuration where we no longer observe any availability issues, I verified how different Scylla versions behave.

setup | requests | EOF/request timeout | availability failures | success ratio
--- | --- | --- | --- | ---
ClusterIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 33926 | 3 | 0 | 0.999
PodIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 33750 | 4 | 0 | 0.999
PodIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=20s | 31458 | 1 | 19 | 0.999
ClusterIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=20s | 34919 | 0 | 15 | 0.999
ClusterIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=20s | 30176 | 2 | 16 | 0.999
PodIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=20s | 30512 | 3 | 13 | 0.999
ClusterIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=20s | 32213 | 1 | 11 | 0.999
PodIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=20s | 31847 | 2 | 20 | 0.999
PodIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=20s | 40221 | 1 | 5 | 0.999
ClusterIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=20s | 40892 | 2 | 7 | 0.999

Versions >=5.1 cause request failures while a node is shutting down.

Changing net.netfilter.nf_conntrack_tcp_timeout_syn_sent to 60s breaks ClusterIP scenarios:

setup | requests | EOF/request timeout | availability failures | success ratio
--- | --- | --- | --- | ---
ClusterIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=60s | 35792 | 9 | 0 | 0.999
PodIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=60s | 35688 | 2 | 0 | 0.999
PodIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=60s | 36021 | 1 | 21 | 0.999
ClusterIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=60s | 37997 | 9 | 2620 | 0.93
ClusterIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=60s | 36494 | 10 | 3055 | 0.92
PodIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=60s | 33809 | 3 | 10 | 0.999
ClusterIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=60s | 38529 | 12 | 2816 | 0.93
PodIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=60s | 40817 | 2 | 9 | 0.999
PodIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=60s | 41197 | 1 | 3 | 0.999
ClusterIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=60s | 42154 | 12 | 3599 | 0.91

Looks like setting net.netfilter.nf_conntrack_tcp_timeout_syn_sent to 20s would fix ClusterIP setups, as it would enforce a shorter timeout on rpc_client connection attempts.
Adding a configuration option to Scylla which would allow controlling this timeout would let us get rid of this workaround.
We also need to fix Scylla, as the supported versions have a bug causing availability issues when a node is shutting down.

@rzetelskik
Member

rzetelskik commented Oct 30, 2023

@zimnx could we make it an optional feature in node setup? The users would then have a choice between running on PodIPs without any issues, using this workaround, or running a kube-proxy-less cluster, which sounds like a reasonable set of choices.

@zimnx
Collaborator

zimnx commented Oct 31, 2023

Scylla issues regarding the loss of availability during node shutdown on newer versions and the rpc_client connection timeout:
scylladb/scylladb#15899
scylladb/scylladb#15901

@tnozicka
Member

Great job drilling down into this!

I don't think we want to get into managing conntrack settings for a subset of platforms in the node setup DS. It seems fragile, potentially affecting other workloads, and quite platform dependent. A Scylla timeout seems like a more fitting and stable place, even if we have to wait a bit.

@mykaul
Contributor

mykaul commented Nov 2, 2023

Are we sure shutdown is not hung, which causes RPC connections not to close properly, from both sides?

@mykaul
Contributor

mykaul commented Nov 2, 2023

@xemul looked at multiple RPC issues recently, maybe he has some ideas.

@zimnx
Collaborator

zimnx commented Nov 3, 2023

Are we sure shutdown is not hung, which causes RPC connections not to close properly, from both sides?

The rollout proceeds; the process dies some time after we drain it and kill it, without the need for an ungraceful shutdown.
Kostia suggested it might be related to a too short shutdown-announce-in-ms, but when I set it to 10 minutes it didn't help. Details and logs from the run are available under scylladb/scylladb#15899

@zimnx
Collaborator

zimnx commented Nov 7, 2023

After a couple of days I have doubts whether a timeout on the rpc_client connection would solve the issue. I think it would only postpone fixing the real issue.

The reason why the timeout would help us is that it would give nodes another attempt to connect to the restarted node. A new attempt usually means a new source port; the session would have a new ip/port tuple which requires iptables traversal, effectively using the updated ClusterIP mapping. And this is what fixes the issue we hit with ClusterIP clusters.
The problem is that it's not reliable.

  1. If, for some reason, the new attempt isn't successful or takes longer to connect, a timeout on the other nodes wouldn't prevent the node being restarted from passing the readiness check.
  2. If Scylla is able to boot within seconds, a timeout set to a value higher than the boot time wouldn't prevent the restarted node from passing the readiness check.
  3. A new source port is not guaranteed. If it isn't new, nodes would be stuck with the same stale conntrack entry, preventing them from using the updated ClusterIP mapping.
  4. Setting the timeout to too low a value would cause trouble on environments with slow connections between nodes.

We can force a connection timeout with the namespaced net.ipv4.tcp_syn_retries, which would mimic the proposed timeout only in the Scylla Pod. It wouldn't help in all of these cases by itself. Additionally we could set the non-namespaced net.netfilter.nf_conntrack_tcp_timeout_syn_sent, but it would only help with case 3. Also, on setups where the ClusterIP is NATed not by netfilter but by something else, a similar situation with a stale entry in the NAT map may happen. We won't be able to solve it once and for all.
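For illustration, the namespaced sysctl could be requested through the Pod's security context, roughly as below; note that net.ipv4.tcp_syn_retries is not in kubelet's default safe-sysctl list, so it would have to be allow-listed on the nodes, and the value 3 (roughly 15s until connect failure) is only an example:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Request a lower SYN retry count inside the Scylla Pod's network namespace.
	sc := &corev1.PodSecurityContext{
		Sysctls: []corev1.Sysctl{
			{Name: "net.ipv4.tcp_syn_retries", Value: "3"},
		},
	}
	fmt.Printf("pod securityContext sysctls: %+v\n", sc.Sysctls)
}
```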

Looks like even with a configurable timeout we wouldn't be able to survive rollouts without availability loss, because UN information from a single node is not enough.
We need to be sure every required connection is up both ways. Gossip is exchanged over one connection. Depending on the timing, some connections may be established and some may be stuck. If one of the established ones happens to be the gossip connection, we could have UN state agreement across all nodes; whether a gossip exchange alone is enough to call a node UN, I don't know - worth asking core.

Seems like one of the first ideas, about asking multiple nodes about the status of ourselves, would be a good way to solve these availability issues. Together with the recommendation to set additional sysctls, it should make readiness pass quickly enough.
I wonder if we will be able to ask the status of a given node using raft.

@zimnx
Collaborator

zimnx commented Nov 7, 2023

I wonder if we will be able to ask the status of a given node using raft.

We won't.
The status of a node consists of two states: the node state (normal, bootstrapping, joining, leaving) and the failure detector state (up/down). The node state will be consistent across the cluster with raft, but the failure detector state will still be per node.
