
Session might not reconnect if all nodes are restarted at once #230

Open
piodul opened this issue Jun 16, 2023 · 4 comments
Labels: bug

Comments

piodul commented Jun 16, 2023

Observed in the following test run: https://jenkins.scylladb.com/job/scylla-master/job/next/6139/artifact/testlog/x86_64/dev/topology.test_cluster_features.2.log
I'm also attaching the logs to the issue: jenkins.scylladb.com_job_scylla-master_job_next_6139_artifact_testlog_x86_64_dev_topology.test_cluster_features.2.log

Link to the source of the failing test: https://github.com/scylladb/scylladb/blob/3a73048bc9a15bebca78dc89143e4e332fb50645/test/topology/test_cluster_features.py#L150

The test test_downgrade_after_successful_upgrade_fails shuts down all nodes in the cluster, reconfigures them, and then starts them again. In the linked run, the driver session didn't reconnect after the restart:

14:31:46.962 DEBUG> refresh driver node list
14:31:46.963 DEBUG> [control connection] Refreshing node list and token map
14:31:46.963 DEBUG> [control connection] Error refreshing node list and token map
Traceback (most recent call last):
  File "cassandra/cluster.py", line 3790, in cassandra.cluster.ControlConnection.refresh_node_list_and_token_map
  File "cassandra/cluster.py", line 3816, in cassandra.cluster.ControlConnection._refresh_node_list_and_token_map
  File "cassandra/connection.py", line 1097, in cassandra.connection.Connection.wait_for_responses
cassandra.connection.ConnectionShutdown: Connection <LibevConnection(139961298484496) 127.76.173.27:9042 (closed)> is already closed
14:31:47.003 DEBUG> refresh driver node list
14:31:47.003 DEBUG> [control connection] Refreshing node list and token map
14:31:47.003 DEBUG> [control connection] Error refreshing node list and token map
Traceback (most recent call last):
  File "cassandra/cluster.py", line 3790, in cassandra.cluster.ControlConnection.refresh_node_list_and_token_map
  File "cassandra/cluster.py", line 3816, in cassandra.cluster.ControlConnection._refresh_node_list_and_token_map
  File "cassandra/connection.py", line 1097, in cassandra.connection.Connection.wait_for_responses
cassandra.connection.ConnectionShutdown: Connection <LibevConnection(139961298484496) 127.76.173.27:9042 (closed)> is already closed
14:31:49.004 DEBUG> Error querying host 127.76.173.27:9042
Traceback (most recent call last):
  File "cassandra/cluster.py", line 4581, in cassandra.cluster.ResponseFuture._query
  File "cassandra/connection.py", line 1066, in cassandra.connection.Connection.send_msg
cassandra.connection.ConnectionShutdown: Connection to 127.76.173.27:9042 is closed
14:31:49.004 DEBUG> Defunct or closed connection (139961296996112) returned to pool, potentially marking host 127.76.173.27:9042 as down
14:31:49.004 DEBUG> Shutting down connections to 127.76.173.27:9042
14:31:49.004 DEBUG> Closing connection (139961296996112) to 127.76.173.27:9042
14:31:51.005 DEBUG> Error querying host 127.76.173.21:9042
Traceback (most recent call last):
  File "cassandra/cluster.py", line 4581, in cassandra.cluster.ResponseFuture._query
  File "cassandra/connection.py", line 1066, in cassandra.connection.Connection.send_msg
cassandra.connection.ConnectionShutdown: Connection to 127.76.173.21:9042 is closed
14:31:51.005 DEBUG> Defunct or closed connection (139961297243024) returned to pool, potentially marking host 127.76.173.21:9042 as down
14:31:51.005 DEBUG> Shutting down connections to 127.76.173.21:9042
14:31:51.005 DEBUG> Closing connection (139961297243024) to 127.76.173.21:9042
14:31:51.005 DEBUG> Closing excess connection (139961297729488) to 127.76.173.21:9042
14:31:53.006 DEBUG> Error querying host 127.76.173.6:9042
Traceback (most recent call last):
  File "cassandra/cluster.py", line 4581, in cassandra.cluster.ResponseFuture._query
  File "cassandra/connection.py", line 1066, in cassandra.connection.Connection.send_msg
cassandra.connection.ConnectionShutdown: Connection to 127.76.173.6:9042 is closed
14:31:53.006 DEBUG> Defunct or closed connection (139961297252176) returned to pool, potentially marking host 127.76.173.6:9042 as down
14:31:53.006 DEBUG> Shutting down connections to 127.76.173.6:9042
14:31:53.006 DEBUG> Closing connection (139961297252176) to 127.76.173.6:9042
14:31:53.007 INFO> Driver not connected to 127.76.173.27:9042 yet
14:31:53.007 INFO> Driver not connected to 127.76.173.21:9042 yet
14:31:53.007 INFO> Driver not connected to 127.76.173.6:9042 yet

...

14:32:46.100 INFO> Driver not connected to 127.76.173.27:9042 yet
14:32:46.100 INFO> Driver not connected to 127.76.173.21:9042 yet
14:32:46.101 INFO> Driver not connected to 127.76.173.6:9042 yet
---------------------------- Captured log teardown -----------------------------
14:32:47.140 DEBUG> after_test for test_downgrade_after_successful_upgrade_fails (success: False)

The driver doesn't reconnect automatically within a minute, even though the IPs of the restarted nodes are the same as before.
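
For reference, a minimal sketch of the pattern the test harness appears to follow: poll the session after the full cluster restart until it can reach the nodes again. The host addresses and the one-minute budget come from the log above; the query and the exception handling are assumptions, not an excerpt from the test:

```python
import time

from cassandra.cluster import Cluster, NoHostAvailable

hosts = ["127.76.173.27", "127.76.173.21", "127.76.173.6"]
cluster = Cluster(contact_points=hosts)
session = cluster.connect()

# ... all nodes are shut down, reconfigured, and started again here ...

deadline = time.time() + 60
while time.time() < deadline:
    try:
        # A trivial query; it succeeds only once the driver has
        # re-established live connections to the cluster.
        session.execute("SELECT key FROM system.local")
        break
    except NoHostAvailable:
        time.sleep(1)
else:
    # The state described in this issue: the nodes are back up,
    # but the driver never reconnects.
    raise RuntimeError("driver did not reconnect within a minute")
```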

@kbr-scylla

I wonder if it's the same thing as #170 - which was supposedly fixed - or maybe it's a different kind of race that only happens when we restart all nodes?

piodul added a commit to piodul/scylla that referenced this issue Jun 16, 2023
The test `test_downgrade_after_successful_upgrade_fails` stops all nodes, reconfigures them to support the test-only feature, and restarts them. Unfortunately, it looks like the Python driver sometimes does not handle this properly and might not reconnect after all nodes are shut down.

This commit adds a workaround for scylladb/python-driver#230 - the test re-creates the Python driver session right after the nodes are restarted.
kbr-scylla added a commit to scylladb/scylladb that referenced this issue Jun 16, 2023
…ver not reconnecting after full cluster restart' from Piotr Dulikowski

The test `test_downgrade_after_successful_upgrade_fails` shuts down the whole cluster, reconfigures the nodes, and then restarts them. Apparently, the Python driver sometimes does not handle this correctly; in one test run we observed that the driver did not manage to reconnect to any of the nodes, even though the nodes themselves started successfully.

More context can be found on the python driver issue.

This PR works around the issue by using the existing `reconnect_driver` function (which is already a workaround for a _different_ Python driver issue) to help the driver reconnect after the full cluster restart.

Refs: scylladb/python-driver#230

Closes #14276

* github.com:scylladb/scylladb:
  tests/topology: work around python driver issue in cluster feature tests
  test/topology{_raft_disabled}: move reconnect_driver to topology utils
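
For illustration, a minimal sketch of what a reconnect workaround of this kind amounts to when using the bare Python driver. The real `reconnect_driver` lives in the scylladb test framework and goes through its test manager; the helper below is a simplified stand-in under that assumption, not its actual implementation:

```python
from cassandra.cluster import Cluster, Session


def reconnect_driver(old_cluster: Cluster, contact_points: list[str]) -> Session:
    """Discard the wedged driver state and open a brand-new session."""
    # shutdown() closes the control connection and every pooled connection,
    # so nothing from the old, possibly stuck, state is reused.
    old_cluster.shutdown()
    new_cluster = Cluster(contact_points=contact_points)
    return new_cluster.connect()
```

The point of the workaround is to sidestep the driver's own reconnection logic entirely: instead of waiting for it to notice that the nodes are back, the test throws the old session away and connects from scratch.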
@kostja

kostja commented Nov 22, 2023

Please, we need some traction on this; it affects the stability of the tests.

piodul added a commit to piodul/scylla that referenced this issue Feb 12, 2024
Unfortunately, scylladb/python-driver#230 is not fixed yet, so for the sake of our CI's stability it is necessary to re-create the driver session after all nodes in the cluster are restarted.

There is one place in test_topology_recovery_basic where all nodes are restarted but the driver session is not re-created. Even though the nodes are restarted sequentially rather than all at once, we observed a failure with similar symptoms in a CI run for scylla-enterprise.

Add the missing driver reconnect as a workaround for the issue.

Fixes: scylladb#17277
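
A sketch of how the workaround slots into such a test. The `manager` fixture and the helper names (`running_servers`, `server_restart`) are hypothetical stand-ins for the topology framework's API, not an excerpt from `test_topology_recovery_basic`; `reconnect_driver` here is the framework's async helper mentioned in the commits above (taking the manager), not the simplified stand-in sketched earlier:

```python
import pytest


@pytest.mark.asyncio
async def test_rolling_restart(manager):
    # Restart every node sequentially rather than all at once.
    for server in await manager.running_servers():
        await manager.server_restart(server.server_id)
    # Workaround for scylladb/python-driver#230: re-create the driver
    # session instead of relying on automatic reconnection.
    session = await reconnect_driver(manager)
    session.execute("SELECT key FROM system.local")  # sanity check
```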
@mykaul

mykaul commented Feb 12, 2024

@avelanarius - ping

@avelanarius

@sylwiaszunejko is currently working on this issue: #295, which seemed more current (before the issues reported by @piodul today).

I'll try to get to this soon.

kbr-scylla pushed a commit to scylladb/scylladb that referenced this issue Feb 13, 2024
Fixes: #17277

Closes #17278
dgarcia360 pushed a commit to dgarcia360/scylla that referenced this issue Apr 30, 2024
raphaelsc added a commit to raphaelsc/scylla that referenced this issue May 22, 2024
One source of flakiness is in test_tablet_metadata_propagates_with_schema_changes_in_snapshot_mode, where the gossiper is aborted prematurely, causing a reconnection storm.

Another is test_tablet_missing_data_repair, which is flaky due to an issue in the Python driver where the session might not reconnect on a rolling restart (tracked by scylladb/python-driver#230).

Refs scylladb#15356.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
mergify bot pushed a commit to scylladb/scylladb that referenced this issue May 27, 2024
(cherry picked from commit e724675)
@Lorak-mmk added the bug label Jun 18, 2024