During rollout restart operator doesn't wait until previously restarted pod becomes part of a Scylla cluster #1077

Open
Tracked by #1577
vponomaryov opened this issue Oct 28, 2022 · 37 comments · May be fixed by #1614
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

@vponomaryov
Contributor

Describe the bug
If we change the number of cores for Scylla (which triggers a rolling restart of the pods), we start hitting the following race condition:
the first pod gets restarted, and if it rejoins the cluster with some unexpected delay, the operator doesn't wait for it and restarts the second pod.
In a cluster of 3 nodes this makes the cluster inoperable, without quorum, for several minutes.
It happens on GKE and doesn't happen on EKS, because on EKS a restarted pod rejoins the cluster much faster.

Below are the logs from a run on GKE cluster.
pod-2 (the first restarted) logs:

2022-10-27 18:40:55,710 INFO waiting for scylla to stop
INFO  2022-10-27 18:40:55,710 [shard 0] init - Signal received; shutting down
...
INFO  2022-10-27 18:40:57,897 [shard 0] gossip - Gossip is now stopped
...
INFO  2022-10-27 18:40:59,866 [shard 0] init - Scylla version 5.1.0~rc3-0.20221009.9deeeb4db1cd shutdown complete.
2022-10-27 18:40:59,940 INFO stopped: scylla (exit status 0)
...
I1027 18:41:03.305900       1 operator/sidecar.go:438] "Scylla process finished"
rpc error: code = NotFound desc = an error occurred when try to find container "ef77bed95aa5838282bc5d55420a2718d6b111912fc2954dc645c35c7ce87d3f": not found
I1027 18:41:11.371885       1 operator/sidecar.go:147] sidecar version "v1.8.0-alpha.0-133-g97c831e"
...
I1027 18:41:12.363766       1 operator/sidecar.go:360] "Starting scylla"
...
INFO  2022-10-27 18:41:15,764 [shard 0] database - Resharded 7kB for system.compaction_history in 0.54 seconds, 13kB/s
...
INFO  2022-10-27 18:41:18,583 [shard 0] init - loading system_schema sstables
...
INFO  2022-10-27 18:41:19,311 [shard 0] init - setting up system keyspace
...
INFO  2022-10-27 18:41:21,558 [shard 0] database - Resharded 235MB for keyspace1.standard1 in 2.02 seconds, 116MB/s
...
INFO  2022-10-27 18:41:21,621 [shard 0] storage_service - entering STARTING mode
...
INFO  2022-10-27 18:41:21,700 [shard 0] storage_service - Starting up server gossip
...
E1027 18:41:24.931854       1 sidecar/probes.go:110] "readyz probe: can't get scylla native transport" err="giving up after 1 attempts: agent [HTTP 404] Not found" Service="scylla/sct-cluster-us-east1-b-us-east1-2" Node="10.80.11.2"
...
E1027 18:41:34.932431       1 sidecar/probes.go:110] "readyz probe: can't get scylla native transport" err="giving up after 1 attempts: agent [HTTP 404] Not found" Service="scylla/sct-cluster-us-east1-b-us-east1-2" Node="10.80.11.2"
...
INFO  2022-10-27 18:41:44,256 [shard 0] storage_service - entering JOINING mode
...
INFO  2022-10-27 18:41:44,828 [shard 0] storage_service - Node 10.80.11.2 state jump to normal
E1027 18:41:44.931320       1 sidecar/probes.go:110] "readyz probe: can't get scylla native transport" err="giving up after 1 attempts: agent [HTTP 404] Not found" Service="scylla/sct-cluster-us-east1-b-us-east1-2" Node="10.80.11.2"
INFO  2022-10-27 18:41:45,035 [shard 0] storage_service - entering NORMAL mode
...
INFO  2022-10-27 18:41:57,051 [shard 0] init - Scylla version 5.1.0~rc3-0.20221009.9deeeb4db1cd initialization completed.
WARN  2022-10-27 18:42:45,043 [shard 0] cdc - Could not update CDC description table with generation (2022/10/27 18:24:22, 69265668-5186-4a22-a2b1-6b6bad5a0f55): exceptions::unavailable_exception (Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1). Will try again.
WARN  2022-10-27 18:43:14,796 [shard 0] gossip - Skip marking node 10.80.4.107 with status = shutdown as UP
INFO  2022-10-27 18:43:14,796 [shard 0] gossip - InetAddress 10.80.14.183 is now UP, status = NORMAL
INFO  2022-10-27 18:43:14,798 [shard 0] storage_service - Node 10.80.14.183 state jump to normal
INFO  2022-10-27 18:43:14,801 [shard 0] storage_service - Node 10.80.4.107 state jump to normal
INFO  2022-10-27 18:43:14,950 [shard 0] gossip - InetAddress 10.80.12.113 is now DOWN, status = LEFT
INFO  2022-10-27 18:43:14,952 [shard 0] gossip - Node 10.80.12.113 will be removed from gossip at [2022-10-30 18:31:53]: (expire = 1667154713831435293, now = 1666896194952348069, diff = 258518 seconds)
...
WARN  2022-10-27 18:43:36,752 [shard 0] gossip - Fail to send EchoMessage to 10.80.4.107: seastar::rpc::closed_error (connection is closed)
INFO  2022-10-27 18:43:36,892 [shard 0] gossip - InetAddress 10.80.4.107 is now UP, status = NORMAL

pod-1 logs (second restarted):

...
INFO  2022-10-27 18:32:55,041 [shard 0] gossip - 60000 ms elapsed, 10.80.12.113 gossip quarantine over
INFO  2022-10-27 18:40:55,897 [shard 0] gossip - Got shutdown message from 10.80.11.2, received_generation=1666893961, local_generation=1666893961
INFO  2022-10-27 18:40:55,898 [shard 0] gossip - InetAddress 10.80.11.2 is now DOWN, status = shutdown
2022-10-27 18:42:05,276 INFO waiting for scylla to stop
INFO  2022-10-27 18:42:05,291 [shard 0] compaction_manager - Asked to stop
...

So, from the logs above, we can see that at the time pod-1 (the second one restarted) got the shutdown message, the first one
was not ready yet. The quorum error that appeared in the CDC log proves it.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy operator
  2. Deploy 3-pod Scylla cluster on GKE
  3. Run some constant load
  4. Change the number of CPUs in the scyllacluster spec
  5. Wait for the restart of the second pod
  6. See quorum-errors on the loader side.

Expected behavior
The operator should wait for the previously restarted pod to rejoin the Scylla cluster before restarting the next one.

Logs
K8S logs: https://cloudius-jenkins-test.s3.amazonaws.com/0512c157-c4d7-4d3f-9b44-2bcce9d34de9/20221027_191640/kubernetes-0512c157.tar.gz
DB logs: https://cloudius-jenkins-test.s3.amazonaws.com/0512c157-c4d7-4d3f-9b44-2bcce9d34de9/20221027_191640/db-cluster-0512c157.tar.gz

Environment:

  • Platform: GKE
  • Kubernetes version: 1.22
  • Scylla version: 5.0.5 , 5.1.0-rc3
  • Scylla-operator version: v1.8.0-alpha.0-133-g97c831e | v1.7.4
@vponomaryov vponomaryov added the kind/bug Categorizes issue or PR as related to a bug. label Oct 28, 2022
@tnozicka
Member

We wait for readiness, I wonder if this is the case where it becomes ready without having a quorum.

@tnozicka tnozicka added this to Backlog in Release 1.9 via automation Oct 31, 2022
@tnozicka tnozicka added this to the v1.9 milestone Oct 31, 2022
@tnozicka tnozicka added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Oct 31, 2022
@fruch

fruch commented Nov 2, 2022

@vponomaryov do we have a run of this with scylla-core latest master?

@vponomaryov
Contributor Author

  • 5.1.0-rc3

I didn't run it with master. Only with the specified versions in the bug report.

@fruch

fruch commented Nov 2, 2022

Can we run it at least once with the master version?

And then we should try having a dtest reproducer of it. Our rolling restart process shouldn't be that brittle.

@vponomaryov
Contributor Author

Can we run it at least once with the master version?

And then we should try having a dtest reproducer of it. Our rolling restart process shouldn't be that brittle.

Done here: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/vp-longevity-scylla-operator-3h-gke/17/consoleFull

The error is expected because a Scylla node always becomes UN for itself first and only later for the others.
So, what we really need to reproduce it is slowness like on the GKE platform.
For example, this bug is not reproduced using an all-in-one K8S (kind).
So, a dtest reproducer will need some tricks to cause slow status updates.

@fruch

fruch commented Nov 2, 2022

We can slow down traffic between two nodes

@tnozicka tnozicka assigned rzetelskik and unassigned zimnx Nov 14, 2022
vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this issue Nov 15, 2022
The 'test_nodetool_flush_and_reshard' test fails due to the following
bug: scylladb/scylla-operator#1077
So, skip it.
@vponomaryov
Contributor Author

vponomaryov commented Nov 15, 2022

The probability of hitting this bug is very high.
Our nodetool flush and reshard K8S test, which triggers a rolling restart, constantly suffers from it.
The latest failure is here: https://jenkins.scylladb.com/job/scylla-operator/job/operator-master/job/functional/job/functional-k8s-local-kind-aws/25/consoleFull
The corresponding nemesis is already being skipped constantly.

vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this issue Nov 16, 2022
The 'test_nodetool_flush_and_reshard' test is affected by the following
bug:

    scylladb/scylla-operator#1077

using non-fast K8S backends such as 'k8s-gke' and 'k8s-local-kind'.
So, skip it on the slow backends.
fruch pushed a commit to scylladb/scylla-cluster-tests that referenced this issue Nov 16, 2022
The 'test_nodetool_flush_and_reshard' test is affected by the following
bug:

    scylladb/scylla-operator#1077

using non-fast K8S backends such as 'k8s-gke' and 'k8s-local-kind'.
So, skip it on the slow backends.
@mykaul
Contributor

mykaul commented Nov 22, 2022

@tnozicka - I wonder whether our logic should be converted to 'ask other nodes what node X's status is' instead of 'ask node X what its status is' ...

vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this issue Nov 23, 2022
Operator's racy bug [1] gets reproduced on slow backends and
nodes with small number of cores.
So, skip it for the nodes that are not running on the EKS and have
less than 14 cores.

[1] scylladb/scylla-operator#1077
@tnozicka
Member

tnozicka commented Nov 23, 2022

Yeah, currently readiness projects the ability to talk CONSISTENCY_ONE, and we are talking about migrating it to project CONSISTENCY_QUORUM.

@rzetelskik can you summarize the update of your discoveries we talked about today?

fruch pushed a commit to scylladb/scylla-cluster-tests that referenced this issue Nov 27, 2022
Operator's racy bug [1] gets reproduced on slow backends and
nodes with small number of cores.
So, skip it for the nodes that are not running on the EKS and have
less than 14 cores.

[1] scylladb/scylla-operator#1077
@fruch

fruch commented Nov 28, 2022

Yeah, currently readiness projects the ability to talk CONSISTENCY_ONE, and we are talking about migrating it to project CONSISTENCY_QUORUM.

@rzetelskik can you summarize the update of your discoveries we talked about today?

@rzetelskik @tnozicka can one of you update on this one, and shed some light on what "project CONSISTENCY_QUORUM" means in this context, because I don't understand what it means...

@rzetelskik
Member

rzetelskik commented Nov 28, 2022

@rzetelskik can you summarize the update of your discoveries we talked about today?

Sure. In an effort to reproduce this I've added a simple coroutine to our E2E test suite that just sends read queries periodically with quorum consistency, and it caused the suite to fail quite consistently. So I haven't yet followed up on why this only kept happening on GKE for you, since the issue seems more general anyway.
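For illustration, such a reader coroutine can be sketched with gocql roughly as below; the contact point, keyspace and table are placeholders, not the actual E2E suite code:

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Placeholder contact point and schema; the real E2E suite wires these up differently.
	cluster := gocql.NewCluster("sct-cluster-client.scylla.svc")
	cluster.Keyspace = "keyspace1"
	cluster.Consistency = gocql.Quorum // every read requires a quorum of replicas
	cluster.Timeout = 5 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("create session: %v", err)
	}
	defer session.Close()

	// Periodically issue a QUORUM read; an unavailable_exception raised during
	// the rolling restart surfaces here as a query error.
	for range time.Tick(time.Second) {
		var key []byte
		if err := session.Query(`SELECT key FROM standard1 LIMIT 1`).Scan(&key); err != nil {
			log.Printf("quorum read failed: %v", err)
		}
	}
}
```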

Our readiness probe only checks whether the node considers itself UN. We've hit issues related to it and discussed it many times so that's nothing new. By the time a node declares readiness, the gossip should've settled, although nothing stops it from settling with a split brain, so it's quite easy to hit the following sequence:

  1. Entire 3-node cluster up: {1: {1: UN, 2: UN, 3: UN}, 2: {1: UN, 2: UN, 3: UN}, 3: {1: UN, 2: UN, 3: UN}}
  2. Restart node 3 and wait for it to come back up and declare readiness.
  3. Node 3 is up but not a part of the cluster: {1: {1: UN, 2: UN, 3: DN}, 2: {1: UN, 2: UN, 3: DN}, 3: {1: DN, 2: DN, 3: UN}}
  4. Restart node 2: {1: {1: UN, 2: DN, 3: DN}, 3: {1: DN, 2: DN, 3: UN}}

At this point you'll get errors from queries with quorum consistency regardless of which node you hit.

So we've discussed different approaches to how we could modify the readiness probe or the operator to get closer to what the documentation advises, i.e. all nodes UP before performing a topology change.
Essentially we want our readiness probes' semantics to no longer just be "I'm ready to serve traffic" but "I'm ready for a topology change". However we weren't able to just declare readiness when, and only when, a node considers all of its peers UN, since that caused some issues with rollouts.

The current idea suggested by @tnozicka is that each node should ask all of its peers about the statuses of their peers and only declare readiness if all of them consider at least N-1 of their peers UN, where N is the total number of peers.

@rzetelskik @tnozicka can one of you update on this one, and shed some light on what "project CONSISTENCY_QUORUM" means in this context, because I don't understand what it means...

I think what Tomas's message meant is that currently the readiness probe only declares that a given node is able to serve queries with consistency ONE (which is only true if that node replicates the data in question) and at that point we were discussing implementing a heuristic where a node would check whether a quorum of its peers is UN.
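A minimal sketch of the decision logic behind that idea (illustration only, not the actual probe code; how the per-peer views would be collected, e.g. via the agents, is left out):

```go
package readiness

// peerViews maps each peer to the set of nodes it currently considers UN.
// Collecting these views (e.g. by querying every peer's API) is out of scope
// here; this only illustrates the readiness decision discussed above.
type peerViews map[string]map[string]bool

// readyForTopologyChange returns true only if every peer reports at least
// N-1 of the N known nodes as UN.
func readyForTopologyChange(views peerViews, nodes []string) bool {
	n := len(nodes)
	for _, seenUN := range views {
		up := 0
		for _, node := range nodes {
			if seenUN[node] {
				up++
			}
		}
		if up < n-1 {
			return false
		}
	}
	return true
}
```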

@soyacz

soyacz commented Jan 31, 2023

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

It happens not only when changing the number of cores. The same situation happened during a typical rolling restart on GKE.
node-2:

INFO  2023-01-29 02:30:46,965 [shard 0] init - Scylla version 5.3.0~dev-0.20230127.5eadea301ecc initialization completed.
WARN  2023-01-29 02:31:34,956 [shard 0] cdc - Could not update CDC description table with generation (2023/01/29 01:40:35, 970df7c8-82ce-403b-b4a9-4804eaa2cee8): exceptions::unavailable_exception (Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1). Will try again.

just before the CDC errors, node-1 restarted:

2023-01-29 02:30:57,175 INFO waiting for scylla to stop
INFO  2023-01-29 02:30:57,180 [shard 0] compaction_manager - Asked to stop

Impact

Cluster is not operable for CL=QUORUM when RF=3

How frequently does it reproduce?

During this run we had only one RollingRestart nemesis and it failed. Observed on GKE.

Installation details

Kernel Version: 5.15.0-1023-gke
Scylla version (or git commit hash): 5.3.0~dev-20230127.5eadea301ecc with build-id 057196c711d6c52ca6781558c576c858a3cf049b

Operator Image: scylladb/scylla-operator:latest
Operator Helm Version: v1.9.0-alpha.0-41-g04c20d3
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (n1-standard-8)

Scylla Nodes used in this run:
No resources left at the end of the run

OS / Image: N/A (k8s-gke: us-east1)

Test: longevity-scylla-operator-3h-gke
Test id: c3397295-e68c-4ce5-bc39-3bd16e034133
Test name: scylla-operator/operator-master/gke/longevity-scylla-operator-3h-gke
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor c3397295-e68c-4ce5-bc39-3bd16e034133
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs c3397295-e68c-4ce5-bc39-3bd16e034133

Logs:

Jenkins job URL

@fruch

fruch commented Oct 9, 2023

During testing of operator 1.11, I've hit this in multiple cases, both on GKE, where we see it quite often, and on EKS longevity runs (the 12h run and multi-tenancy).

@tnozicka
Member

Thanks, we are having some architectural discussions about what we could do for v1.12, so this is not forgotten, but we have limited bandwidth with everything else going on.

@zimnx
Collaborator

zimnx commented Oct 18, 2023

I ran experiments on GKE with minReadySeconds on StatefulSets - in short, it's a sleep before updating the next node - on different setups.

Each setup consisted of 3 unbound loaders sending read requests continuously using the gocql driver with default config. After the cluster was fully up and running, the test triggered a rolling restart. Traffic was stopped 5s after the restart was completed.

setup | requests | failures | success ratio
--- | --- | --- | ---
ClusterIP TLS minReadySeconds=0 | 360008 | 278367 | 0.226
ClusterIP TLS minReadySeconds=60 | 161707 | 18085 | 0.888
ClusterIP TLS minReadySeconds=120 | 293230 | 12554 | 0.957
PodIP TLS minReadySeconds=0 | 80427 | 78 | 0.999
PodIP TLS minReadySeconds=60 | 118925 | 53 | 0.999
PodIP TLS minReadySeconds=120 | 244028 | 74 | 0.999

The discrepancy between the ClusterIP and PodIP results might suggest that kube-proxy, which provides the ClusterIP overlay, causes most of the failures. To verify whether that's the case, I repeated the tests on GKE with Dataplane V2 (Cilium managed by Google) where kube-proxy is not present.

setup | requests | failures | success ratio
--- | --- | --- | ---
ClusterIP TLS minReadySeconds=0 | 54367 | 979 | 0.982
ClusterIP TLS minReadySeconds=60 | 113112 | 441 | 0.996
ClusterIP TLS minReadySeconds=120 | 201340 | 789 | 0.996
PodIP TLS minReadySeconds=0 | 61446 | 1701 | 0.972
PodIP TLS minReadySeconds=60 | 114669 | 159 | 0.999
PodIP TLS minReadySeconds=120 | 198945 | 638 | 0.997

The results show the Operator is not able to provide a 100% success rate in any setup, even when minReadySeconds is high. Still, setting it to 60s would help a lot on the default ClusterIP configuration while not influencing bootstrap time that much.

Similar results were observed when traffic was non-TLS.

@rzetelskik
Member

rzetelskik commented Oct 18, 2023

@zimnx good job with getting all these results. Were you able to find out the mode that kube-proxy is running in in the default scenario? Given the results I'd assume it's iptables but we need to check.

Also, coming back to what the original issue says:

It happens on GKE and doesn't happen on EKS, because on EKS a restarted pod rejoins the cluster much faster.

It was over a year ago, so it might have changed, but @vponomaryov do you modify the networking stack in EKS anyhow, or is it vanilla?

@vponomaryov
Contributor Author

It was over a year ago, so it might have changed, but @vponomaryov do you modify the networking stack in EKS anyhow, or is it vanilla?

  • We enable the vpc-cni addon.
  • We set WARM_ENI_TARGET=0, MINIMUM_IP_TARGET=8 and WARM_IP_TARGET=2 env vars to the aws-node daemonset.
  • We add security groups for each of the network interfaces of K8S nodes.

@zimnx
Collaborator

zimnx commented Oct 26, 2023

After my last experiment, I aggregated the errors by type.

Most of the errors were unavailability errors, and I found out that in the setups where the total number of errors is low, almost all of them show up when a node is tearing down.

I looked up the Scylla documentation about how nodes should be torn down, and we have a mismatch. Changing the PreStopHook to nodetool drain + supervisorctl stop scylla caused Scylla to start printing lots of messages about operations being aborted on other nodes. I found a related issue (scylladb/scylladb#10447); unfortunately the fix is not easy to backport, so it wasn't backported to recent versions. Apparently master was fixed, but when I tried it I wasn't able to restart any node because Scylla got stuck. I tried older versions and found out that 5.0.13 doesn't print these abort operation messages, and it also solved the failures happening during node teardown.
On the PodIP setup I was left with only 2-3 EOF failures, which are either a Scylla bug (not gracefully shutting down the connection) or the gocql driver misbehaving. I left these unresolved to proceed further; we can tackle them later.

This unfortunately didn't solve the traffic disruption on ClusterIP setups, where the errors were showing more than 1 node being down. This meant there was a split brain in the gossip state. Looking at multiple nodetool status outputs from all nodes confirmed it.

I initially thought that maybe kube-proxy lags and iptables are not updated fast enough, but I ruled that out, as experiments showed the Service ClusterIP mappings in iptables are updated right after the Pod is recreated with a new PodIP.

Scylla keeps 4 connections between each shard and node on port 7000, used for inter-node communication. One of them is used for gossip. These connections are lazily initialized, so there might be fewer than 4 connections if one was dropped and nothing has needed a new one since.

If we look at a stable 2-node ClusterIP cluster where each node has 1 shard, we can see the existing connections and their mapping using conntrack.
Brief info about the cluster state:

pod-0: 10.85.0.28
pod-1: 10.85.0.31

svc-0: 10.106.100.33
svc-1: 10.99.34.233
tcp      6 431999 ESTABLISHED src=10.85.0.31 dst=10.106.100.33 sport=61507 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=61507 [ASSURED] mark=0 use=1
tcp      6 431999 ESTABLISHED src=10.85.0.31 dst=10.106.100.33 sport=58256 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=58256 [ASSURED] mark=0 use=1
tcp      6 431972 ESTABLISHED src=10.85.0.31 dst=10.106.100.33 sport=60959 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=60959 [ASSURED] mark=0 use=1
tcp      6 431972 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=54112 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=54112 [ASSURED] mark=0 use=1 
tcp      6 431999 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=49551 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=49551 [ASSURED] mark=0 use=1 
tcp      6 431999 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=55568 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=55568 [ASSURED] mark=0 use=1 
tcp      6 431943 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=49506 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=49506 [ASSURED] mark=0 use=1 

There are 3 connections from pod-1 to svc-0 with the correctly NATed address of pod-0, and 4 connections from pod-0 to svc-1 with the correctly NATed address of pod-1.

When pod-1 is being deleted, conntrack shows multiple attempts where pod-0 is trying to reconnect to svc-1 but fails, which is expected.
When pod-1 is recreated with the new IP 10.85.0.32, we can see that there are 3 ongoing attempts to connect to svc-1, but with the old pod-1 IP address, in SYN_SENT state, meaning they are awaiting a SYN-ACK:

tcp      6 75 SYN_SENT src=10.85.0.28 dst=10.99.34.233 sport=53454 dport=7000 [UNREPLIED] src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=53454 mark=0 use=1
tcp      6 74 SYN_SENT src=10.85.0.28 dst=10.99.34.233 sport=57715 dport=7000 [UNREPLIED] src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=57715 mark=0 use=1
tcp      6 75 SYN_SENT src=10.85.0.28 dst=10.99.34.233 sport=64317 dport=7000 [UNREPLIED] src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=64317 mark=0 use=1

Between the time pod-1 was deleted and the new one was recreated, pod-0 tried to reconnect but the traffic was blackholed, meaning this SYN is lost and the session needs to time out.
At the same time, we can see that the old sessions entered the TIME_WAIT state, which is normal, and that pod-1 managed to connect to svc-0:

tcp      6 65 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=54112 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=54112 [ASSURED] mark=0 use=2       
tcp      6 65 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=49551 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=49551 [ASSURED] mark=0 use=1       
tcp      6 65 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=55568 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=55568 [ASSURED] mark=0 use=1       
tcp      6 69 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=58760 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=58760 [ASSURED] mark=0 use=1       
tcp      6 65 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=49506 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=49506 [ASSURED] mark=0 use=1       
tcp      6 69 TIME_WAIT src=10.85.0.31 dst=10.106.100.33 sport=61507 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=61507 [ASSURED] mark=0 use=1     
tcp      6 69 TIME_WAIT src=10.85.0.31 dst=10.106.100.33 sport=58256 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=58256 [ASSURED] mark=0 use=1      
tcp      6 69 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=50906 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=50906 [ASSURED] mark=0 use=1      
tcp      6 69 TIME_WAIT src=10.85.0.31 dst=10.106.100.33 sport=60959 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=60959 [ASSURED] mark=0 use=1      
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=62421 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=62421 [ASSURED] mark=0 use=1
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=55614 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=55614 [ASSURED] mark=0 use=1
tcp      6 431998 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=60774 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=60774 [ASSURED] mark=0 use=1

After the SYN_SENT sessions expired, there were no sessions from pod-0 to svc-1, only from pod-1 to svc-0:

tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=62421 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=62421 [ASSURED] mark=0 use=1
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=55614 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=55614 [ASSURED] mark=0 use=1
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=60774 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=60774 [ASSURED] mark=0 use=1

Eventually pod-0 tried to reconnect and succeeded with correct pod-1 IP:

tcp      6 431999 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=54055 dport=7000 src=10.85.0.32 dst=10.85.0.28 sport=7000 dport=54055 [ASSURED] mark=0 use=1 
tcp      6 431991 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=55146 dport=7000 src=10.85.0.32 dst=10.85.0.28 sport=7000 dport=55146 [ASSURED] mark=0 use=1 
tcp      6 431999 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=52300 dport=7000 src=10.85.0.32 dst=10.85.0.28 sport=7000 dport=52300 [ASSURED] mark=0 use=1 
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=55614 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=55614 [ASSURED] mark=0 use=1
tcp      6 431991 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=60774 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=60774 [ASSURED] mark=0 use=1
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=62421 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=62421 [ASSURED] mark=0 use=1

Looks like the SYN_SENT sessions are the root cause of our gossip split brain, as nodes cannot establish a session to the node being restarted. The restarted node reports UN, as it is able to talk to the others, while the others are stuck on their connection attempts. Immediately after these stuck sessions expire, new sessions are established and the gossip state synchronizes.

On GKE, SYN packets are retransmitted at most 6 times, meaning that from the first SYN to failure it can take 127s, and then Scylla needs to reconnect, which can take another couple of seconds, and then exchange gossip.

To solve it we have several options:

  • Introduce a new configuration value into Scylla which would control the timeout on the rpc client connect - a new feature, meaning we would get it only in the next version.
  • Remove UNREPLIED conntrack entries for ScyllaCluster Pods (see the sketch after this list). When the next SYN retransmission happens and there's no conntrack entry, the connection attempt fails and Scylla reconnects. This could be a good workaround until the timeout is introduced in Scylla, as we already have privileged nodesetup Pods running in the host network on every Scylla node.
  • Set minReadySeconds to a value high enough to outlast the maximum SYN retransmission time. This has the big downside of increasing cluster bootstrap time.
  • Come back to the initial idea about collecting the state of all (QUORUM?) nodes in the readiness probe.
  • Maybe something else, suggestions are welcome.
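For the second option, a rough sketch of what a privileged node-setup helper could do is below; it shells out to conntrack-tools, so the binary, flags and the port filter are assumptions that would need to be verified against the actual node image:

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	// Periodically delete conntrack entries stuck in SYN_SENT towards the Scylla
	// inter-node port (7000). The next SYN retransmission then fails fast and
	// Scylla re-resolves the ClusterIP mapping when it reconnects.
	for range time.Tick(20 * time.Second) {
		out, err := exec.Command(
			"conntrack", "-D", "-p", "tcp", "--orig-port-dst", "7000", "--state", "SYN_SENT",
		).CombinedOutput()
		if err != nil {
			// A non-zero exit may simply mean no entries matched; log and keep going.
			log.Printf("conntrack delete: %v: %s", err, out)
		}
	}
}
```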

@rzetelskik
Member

rzetelskik commented Oct 27, 2023

@zimnx thanks for diving into it. I wonder how many more times conntrack is going to bite us. 😛

On GKE, SYN packets are retransmitted at most 6 times, meaning that from the first SYN to failure it can take 127s

Just to understand it, could you explain the calculation here?

  • Remove UNREPLIED conntrack entries for ScyllaCluster Pods. When the next SYN retransmission happens and there's no conntrack entry, the connection attempt fails and Scylla reconnects. This could be a good workaround until the timeout is introduced in Scylla, as we already have privileged nodesetup Pods running in the host network on every Scylla node.

Remember that such workaround would require running with additional linux capabilities. Looking at the conntrack cleaner daemonset (9a63e01), which we used in our CI to resolve a similar issue in the past (#971), it would require NET_ADMIN.

I also wonder if that wouldn't affect stability of otherwise healthy clusters.

  • Set minReadySeconds to a value high enough to outlast the maximum SYN retransmission time. This has the big downside of increasing cluster bootstrap time.

I recall you mentioned 180s not being a value high enough to completely rid us of errors, which is quite surprising now given the 127s above. Was that caused by a higher syn_sent conntrack timeout?

Maybe something else, suggestions are welcome.

It's worth explicitly stating that the issue comes from running kube-proxy depending on netfilter's conntrack. From my understanding at this point we're not sure if this is specific to running kube-proxy in iptables mode, or if that also occurs in ipvs mode. @zimnx have you tried it?

Anyway, an approach that gets rid of the issue without us implementing any workarounds is to advise our users to remove the netfilter dependency if they require exposing their nodes on ClusterIPs. Like you said above, you haven't hit the issue while running in GKE with Dataplane V2 (kube-proxy-less Cilium).

@zimnx
Collaborator

zimnx commented Oct 27, 2023

On GKE, SYN packets are retransmitted at most 6 times, meaning that from the first SYN to failure it can take 127s

Just to understand it, could you explain the calculation here?

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

tcp_syn_retries - INTEGER
Number of times initial SYNs for an active TCP connection attempt
will be retransmitted. Should not be higher than 127. Default value
is 6, which corresponds to 63seconds till the last retransmission
with the current initial RTO of 1second. With this the final timeout
for an active TCP connection attempt will happen after 127seconds.
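
Spelling out the arithmetic from those defaults (an initial RTO of 1 s, doubled after every retransmission, 6 retransmissions in total):

$$1 + 2 + 4 + 8 + 16 + 32 + 64 = 127\ \text{s}$$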


Remember that such workaround would require running with additional linux capabilities

Our nodesetup pods are already running as root, so no extra permissions are required.

I also wonder if that wouldn't affect stability of otherwise healthy clusters.

Removing only SYN_SENT entries to ClusterIPs and Scylla ports after a configurable and reasonable timeout should only cause more reconnection attempts.

I recall you mentioned 180s not being a value high enough to completely rid us of errors, which is quite surprising now given the 127s above. Was that caused by a higher syn_sent conntrack timeout?

That's what I plan to look into next, maybe there's something else causing disruption.

It's worth explicitly stating that the issue comes from running kube-proxy depending on netfilter's conntrack. From my understanding at this point we're not sure if this is specific to running kube-proxy in iptables mode, or if that also occurs in ipvs mode. @zimnx have you tried it?

Nope

Like you said above, you haven't hit the issue while running in GKE with Dataplane V2 (kube-proxy-less Cilium).

I haven't tried it with different node teardown logic. It's something I want to try out later as well.

@zimnx
Collaborator

zimnx commented Oct 27, 2023

Remove UNREPLIED conntrack entries for ScyllaCluster Pods. When the next SYN retransmission happens and there's no conntrack entry, the connection attempt fails and Scylla reconnects. This could be a good workaround until the timeout is introduced in Scylla, as we already have privileged nodesetup Pods running in the host network on every Scylla node.

Another option would be to lower net.netfilter.nf_conntrack_tcp_timeout_syn_sent from the default 120s to a lower value from our sysctls-buddy container. I haven't tested it, but with this the conntrack entry translating the ClusterIP into the stale Pod IP would vanish earlier, and the next TCP retransmission would fail, triggering a reconnection. It is a global setting, though.

Leaving it as a workaround we might consider.

Doesn't work, this setting needs to be changed on the node, not in the container network namespace.

@zimnx
Collaborator

zimnx commented Oct 30, 2023

I added setting the conntrack TCP timeout for SYN_SENT entries to our node setup DaemonSet; it solved the availability issues on both ClusterIP and PodIP setups without minReadySeconds.

setup | requests | EOF/request timeout | availability failures | success ratio
--- | --- | --- | --- | ---
ClusterIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 56621 | 4 | 0 | 0.999
PodIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 56411 | 4 | 0 | 0.999

We no longer see any availability errors, meaning Scylla rolls out without losing quorum.
The EOFs might be a Scylla or gocql bug, not related to the rollout.
Timeouts may happen, as the setups run with low, non-guaranteed resources.
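
As a rough sketch of what such a node setup step could look like (the 20s value matches the experiments above, but the exact mechanism used by the node setup DaemonSet is an assumption, not the merged implementation):

```go
package main

import (
	"log"
	"os"
)

func main() {
	// Lower the conntrack SYN_SENT timeout so that stale ClusterIP -> old PodIP
	// entries expire quickly and the next SYN retransmission triggers a fresh
	// iptables lookup. Must run in the host's network namespace (node setup pod).
	const sysctl = "/proc/sys/net/netfilter/nf_conntrack_tcp_timeout_syn_sent"
	if err := os.WriteFile(sysctl, []byte("20"), 0o644); err != nil {
		log.Fatalf("failed to set %s: %v", sysctl, err)
	}
	log.Printf("set %s=20", sysctl)
}
```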

Since we found a configuration where we no longer observe any availability issues, I verified how different Scylla versions behave.

setup | requests | EOF/request timeout | availability failures | success ratio
--- | --- | --- | --- | ---
ClusterIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 33926 | 3 | 0 | 0.999
PodIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 33750 | 4 | 0 | 0.999
PodIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=20s | 31458 | 1 | 19 | 0.999
ClusterIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=20s | 34919 | 0 | 15 | 0.999
ClusterIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=20s | 30176 | 2 | 16 | 0.999
PodIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=20s | 30512 | 3 | 13 | 0.999
ClusterIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=20s | 32213 | 1 | 11 | 0.999
PodIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=20s | 31847 | 2 | 20 | 0.999
PodIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=20s | 40221 | 1 | 5 | 0.999
ClusterIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=20s | 40892 | 2 | 7 | 0.999

Versions >=5.1 cause request failures while a node is shutting down.

Changing net.netfilter.nf_conntrack_tcp_timeout_syn_sent to 60s breaks ClusterIP scenarios:

setup | requests | EOF/request timeout | availability failures | success ratio
--- | --- | --- | --- | ---
ClusterIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=60s | 35792 | 9 | 0 | 0.999
PodIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=60s | 35688 | 2 | 0 | 0.999
PodIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=60s | 36021 | 1 | 21 | 0.999
ClusterIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=60s | 37997 | 9 | 2620 | 0.93
ClusterIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=60s | 36494 | 10 | 3055 | 0.92
PodIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=60s | 33809 | 3 | 10 | 0.999
ClusterIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=60s | 38529 | 12 | 2816 | 0.93
PodIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=60s | 40817 | 2 | 9 | 0.999
PodIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=60s | 41197 | 1 | 3 | 0.999
ClusterIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=60s | 42154 | 12 | 3599 | 0.91

Looks like setting net.netfilter.nf_conntrack_tcp_timeout_syn_sent to 20s would fix ClusterIP setups, as it would enforce a shorter timeout on rpc_client connection attempts.
Adding a configuration option to Scylla which would allow controlling this timeout would let us get rid of this workaround.
We also need to fix Scylla, as the supported versions have a bug causing availability issues when a node is shutting down.

@rzetelskik
Member

rzetelskik commented Oct 30, 2023

@zimnx could we make it an optional feature in node setup? The users would then have a choice between running on PodIPs without any issues, using this workaround, or running a kube-proxy-less cluster, which sounds like a reasonable set of choices.

@zimnx
Collaborator

zimnx commented Oct 31, 2023

Scylla issues regarding the loss of availability during node shutdown on newer versions and the rpc_client connection timeout:
scylladb/scylladb#15899
scylladb/scylladb#15901

@tnozicka
Member

Great job drilling down into this!

I don't think we want to get into managing conntrack settings for a subset of platforms in the node setup DS. It seems fragile, potentially affecting other workloads, and quite platform dependent. A Scylla timeout seems like a more fitting and stable place, even if we have to wait a bit.

@mykaul
Contributor

mykaul commented Nov 2, 2023

Are we sure shutdown is not hung, which causes RPC connections not to close properly, from both sides?

@mykaul
Contributor

mykaul commented Nov 2, 2023

@xemul looked at multiple RPC issues recently, maybe he has some ideas.

@zimnx
Collaborator

zimnx commented Nov 3, 2023

Are we sure shutdown is not hung, which causes RPC connections not to close properly, from both sides?

The rollout proceeds; the process dies some time after we drain it and kill it, without the need for an ungraceful shutdown.
Kostia suggested it might be related to a too short shutdown-announce-in-ms, but when I set it to 10 minutes it didn't help. Details and logs from the run are available under scylladb/scylladb#15899

@zimnx
Collaborator

zimnx commented Nov 7, 2023

After a couple of days I have doubts whether a timeout on the rpc_client connection would solve the issue. I think it would only postpone fixing the real issue.

The reason why the timeout would help us is that it would give nodes another attempt to connect to the restarted node. A new attempt usually means a new source port; the session would have a new ip/port tuple which requires iptables traversal, effectively using the updated ClusterIP mapping. And this is what fixes the issue we hit with ClusterIP clusters.
The problem is that it's not reliable.

  1. If, for some reason, the new attempt isn't successful or takes longer to connect, a timeout on the other nodes wouldn't prevent the node being restarted from passing the readiness check.
  2. If Scylla is able to boot within seconds, a timeout set to a value higher than the boot time wouldn't prevent the restarted node from passing the readiness check.
  3. A new source port is not guaranteed. If it isn't new, nodes would be stuck with the same stale conntrack entry, preventing them from using the updated ClusterIP mapping.
  4. Setting the timeout to too low a value would cause trouble on environments with slow connections between nodes.

We can force a connection timeout with the namespaced net.ipv4.tcp_syn_retries, which would mimic the proposed timeout only in the Scylla Pod. It wouldn't help in all of these cases by itself. Additionally we could set the non-namespaced net.netfilter.nf_conntrack_tcp_timeout_syn_sent, but it would only help with case 3. Also, on setups where the ClusterIP is NATed not by netfilter but by something else, a similar situation with a stale entry in the NAT map may happen. We won't be able to solve it once and for all.
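For illustration, the namespaced sysctl could be requested through the Pod's security context, roughly as below; note that net.ipv4.tcp_syn_retries is not in kubelet's default safe-sysctl list, so it would have to be allow-listed on the nodes, and the value 3 (roughly 15s until connect failure) is only an example:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Request a lower SYN retry count inside the Scylla Pod's network namespace.
	sc := &corev1.PodSecurityContext{
		Sysctls: []corev1.Sysctl{
			{Name: "net.ipv4.tcp_syn_retries", Value: "3"},
		},
	}
	fmt.Printf("pod securityContext sysctls: %+v\n", sc.Sysctls)
}
```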

Looks like even with a configurable timeout we wouldn't be able to survive rollouts without availability loss, because UN information from a single node is not enough.
We need to be sure every required connection is up both ways. Gossip is exchanged over one connection. Depending on the timing, some connections may be established and some may be stuck. If one of the established ones happens to be the gossip connection, we could have UN state agreement across all nodes; whether a gossip exchange alone is enough to call a node UN, I don't know - worth asking core.

Seems like one of the first ideas, about asking multiple nodes about the status of ourselves, would be a good way to solve these availability issues. Together with the recommendation to set additional sysctls, it should make readiness pass quickly enough.
I wonder if we will be able to ask the status of a given node using raft.

@zimnx
Collaborator

zimnx commented Nov 7, 2023

I wonder if we will be able to ask the status of a given node using raft.

We won't.
The status of a node consists of two states: the node state (normal, bootstrapping, joining, leaving) and the failure detector state (up/down). The node state will be consistent across the cluster with raft, but the failure detector state will still be per node.
