
Session might not reconnect if all nodes are restarted at once #230

Open
piodul opened this issue Jun 16, 2023 · 4 comments
Labels: bug

Comments

piodul commented Jun 16, 2023

Observed in the following test run: https://jenkins.scylladb.com/job/scylla-master/job/next/6139/artifact/testlog/x86_64/dev/topology.test_cluster_features.2.log
I'm also attaching the logs to the issue: jenkins.scylladb.com_job_scylla-master_job_next_6139_artifact_testlog_x86_64_dev_topology.test_cluster_features.2.log

Link to the source of the failing test: https://github.com/scylladb/scylladb/blob/3a73048bc9a15bebca78dc89143e4e332fb50645/test/topology/test_cluster_features.py#L150

The test test_downgrade_after_successful_upgrade_fails shuts down all nodes in the cluster, reconfigures them, and then starts them again. In the linked run, the driver session didn't reconnect after the restart:

14:31:46.962 DEBUG> refresh driver node list
14:31:46.963 DEBUG> [control connection] Refreshing node list and token map
14:31:46.963 DEBUG> [control connection] Error refreshing node list and token map
Traceback (most recent call last):
  File "cassandra/cluster.py", line 3790, in cassandra.cluster.ControlConnection.refresh_node_list_and_token_map
  File "cassandra/cluster.py", line 3816, in cassandra.cluster.ControlConnection._refresh_node_list_and_token_map
  File "cassandra/connection.py", line 1097, in cassandra.connection.Connection.wait_for_responses
cassandra.connection.ConnectionShutdown: Connection <LibevConnection(139961298484496) 127.76.173.27:9042 (closed)> is already closed
14:31:47.003 DEBUG> refresh driver node list
14:31:47.003 DEBUG> [control connection] Refreshing node list and token map
14:31:47.003 DEBUG> [control connection] Error refreshing node list and token map
Traceback (most recent call last):
  File "cassandra/cluster.py", line 3790, in cassandra.cluster.ControlConnection.refresh_node_list_and_token_map
  File "cassandra/cluster.py", line 3816, in cassandra.cluster.ControlConnection._refresh_node_list_and_token_map
  File "cassandra/connection.py", line 1097, in cassandra.connection.Connection.wait_for_responses
cassandra.connection.ConnectionShutdown: Connection <LibevConnection(139961298484496) 127.76.173.27:9042 (closed)> is already closed
14:31:49.004 DEBUG> Error querying host 127.76.173.27:9042
Traceback (most recent call last):
  File "cassandra/cluster.py", line 4581, in cassandra.cluster.ResponseFuture._query
  File "cassandra/connection.py", line 1066, in cassandra.connection.Connection.send_msg
cassandra.connection.ConnectionShutdown: Connection to 127.76.173.27:9042 is closed
14:31:49.004 DEBUG> Defunct or closed connection (139961296996112) returned to pool, potentially marking host 127.76.173.27:9042 as down
14:31:49.004 DEBUG> Shutting down connections to 127.76.173.27:9042
14:31:49.004 DEBUG> Closing connection (139961296996112) to 127.76.173.27:9042
14:31:51.005 DEBUG> Error querying host 127.76.173.21:9042
Traceback (most recent call last):
  File "cassandra/cluster.py", line 4581, in cassandra.cluster.ResponseFuture._query
  File "cassandra/connection.py", line 1066, in cassandra.connection.Connection.send_msg
cassandra.connection.ConnectionShutdown: Connection to 127.76.173.21:9042 is closed
14:31:51.005 DEBUG> Defunct or closed connection (139961297243024) returned to pool, potentially marking host 127.76.173.21:9042 as down
14:31:51.005 DEBUG> Shutting down connections to 127.76.173.21:9042
14:31:51.005 DEBUG> Closing connection (139961297243024) to 127.76.173.21:9042
14:31:51.005 DEBUG> Closing excess connection (139961297729488) to 127.76.173.21:9042
14:31:53.006 DEBUG> Error querying host 127.76.173.6:9042
Traceback (most recent call last):
  File "cassandra/cluster.py", line 4581, in cassandra.cluster.ResponseFuture._query
  File "cassandra/connection.py", line 1066, in cassandra.connection.Connection.send_msg
cassandra.connection.ConnectionShutdown: Connection to 127.76.173.6:9042 is closed
14:31:53.006 DEBUG> Defunct or closed connection (139961297252176) returned to pool, potentially marking host 127.76.173.6:9042 as down
14:31:53.006 DEBUG> Shutting down connections to 127.76.173.6:9042
14:31:53.006 DEBUG> Closing connection (139961297252176) to 127.76.173.6:9042
14:31:53.007 INFO> Driver not connected to 127.76.173.27:9042 yet
14:31:53.007 INFO> Driver not connected to 127.76.173.21:9042 yet
14:31:53.007 INFO> Driver not connected to 127.76.173.6:9042 yet

...

14:32:46.100 INFO> Driver not connected to 127.76.173.27:9042 yet
14:32:46.100 INFO> Driver not connected to 127.76.173.21:9042 yet
14:32:46.101 INFO> Driver not connected to 127.76.173.6:9042 yet
---------------------------- Captured log teardown -----------------------------
14:32:47.140 DEBUG> after_test for test_downgrade_after_successful_upgrade_fails (success: False)

The driver doesn't reconnect automatically within a minute, even though the IPs of the restarted nodes are the same as before.
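
For reference, a minimal sketch of the pattern the test harness appears to follow: poll the session after the full cluster restart until it can reach the nodes again. The host addresses and the one-minute budget come from the log above; the query and the exception handling are assumptions, not an excerpt from the test:

```python
import time

from cassandra.cluster import Cluster, NoHostAvailable

hosts = ["127.76.173.27", "127.76.173.21", "127.76.173.6"]
cluster = Cluster(contact_points=hosts)
session = cluster.connect()

# ... all nodes are shut down, reconfigured, and started again here ...

deadline = time.time() + 60
while time.time() < deadline:
    try:
        # A trivial query; it succeeds only once the driver has
        # re-established live connections to the cluster.
        session.execute("SELECT key FROM system.local")
        break
    except NoHostAvailable:
        time.sleep(1)
else:
    # The state described in this issue: the nodes are back up,
    # but the driver never reconnects.
    raise RuntimeError("driver did not reconnect within a minute")
```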

@kbr-scylla

I wonder if it's the same thing as #170 - which was supposedly fixed - or maybe it's a different kind of race that only happens when we restart all nodes?

piodul added a commit to piodul/scylla that referenced this issue Jun 16, 2023
The test `test_downgrade_after_successful_upgrade_fails` stops all nodes, reconfigures them to support the test-only feature, and restarts them. Unfortunately, it looks like the Python driver sometimes does not handle this properly and might not reconnect after all nodes are shut down.

This commit adds a workaround for scylladb/python-driver#230 - the test re-creates the Python driver session right after the nodes are restarted.
kbr-scylla added a commit to scylladb/scylladb that referenced this issue Jun 16, 2023
…ver not reconnecting after full cluster restart' from Piotr Dulikowski

The test `test_downgrade_after_successful_upgrade_fails` shuts down the whole cluster, reconfigures the nodes, and then restarts them. Apparently, the Python driver sometimes does not handle this correctly; in one test run we observed that the driver did not manage to reconnect to any of the nodes, even though the nodes themselves started successfully.

More context can be found on the python driver issue.

This PR works around the issue by using the existing `reconnect_driver` function (which is already a workaround for a _different_ Python driver issue) to help the driver reconnect after the full cluster restart.

Refs: scylladb/python-driver#230

Closes #14276

* github.com:scylladb/scylladb:
  tests/topology: work around python driver issue in cluster feature tests
  test/topology{_raft_disabled}: move reconnect_driver to topology utils
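
For illustration, a minimal sketch of what a reconnect workaround of this kind amounts to when using the bare Python driver. The real `reconnect_driver` lives in the scylladb test framework and goes through its test manager; the helper below is a simplified stand-in under that assumption, not its actual implementation:

```python
from cassandra.cluster import Cluster, Session


def reconnect_driver(old_cluster: Cluster, contact_points: list[str]) -> Session:
    """Discard the wedged driver state and open a brand-new session."""
    # shutdown() closes the control connection and every pooled connection,
    # so nothing from the old, possibly stuck, state is reused.
    old_cluster.shutdown()
    new_cluster = Cluster(contact_points=contact_points)
    return new_cluster.connect()
```

The point of the workaround is to sidestep the driver's own reconnection logic entirely: instead of waiting for it to notice that the nodes are back, the test throws the old session away and connects from scratch.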
@kostja

kostja commented Nov 22, 2023

Please, we need some traction on this; it affects the stability of the tests.

piodul added a commit to piodul/scylla that referenced this issue Feb 12, 2024
Unfortunately, scylladb/python-driver#230 is not fixed yet, so for the sake of our CI's stability it is necessary to re-create the driver session after all nodes in the cluster are restarted.

There is one place in test_topology_recovery_basic where all nodes are restarted but the driver session is not re-created. Even though the nodes are restarted sequentially rather than all at once, we observed a failure with similar symptoms in a CI run for scylla-enterprise.

Add the missing driver reconnect as a workaround for the issue.

Fixes: scylladb#17277
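
A sketch of how the workaround slots into such a test. The `manager` fixture and the helper names (`running_servers`, `server_restart`) are hypothetical stand-ins for the topology framework's API, not an excerpt from `test_topology_recovery_basic`; `reconnect_driver` here is the framework's async helper mentioned in the commits above (taking the manager), not the simplified stand-in sketched earlier:

```python
import pytest


@pytest.mark.asyncio
async def test_rolling_restart(manager):
    # Restart every node sequentially rather than all at once.
    for server in await manager.running_servers():
        await manager.server_restart(server.server_id)
    # Workaround for scylladb/python-driver#230: re-create the driver
    # session instead of relying on automatic reconnection.
    session = await reconnect_driver(manager)
    session.execute("SELECT key FROM system.local")  # sanity check
```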
@mykaul

mykaul commented Feb 12, 2024

@avelanarius - ping

@avelanarius

@sylwiaszunejko is currently working on this issue: #295, which seemed more current (before the issues reported by @piodul today).

I'll try to get to this soon.

kbr-scylla pushed a commit to scylladb/scylladb that referenced this issue Feb 13, 2024
Fixes: #17277

Closes #17278
dgarcia360 pushed a commit to dgarcia360/scylla that referenced this issue Apr 30, 2024
raphaelsc added a commit to raphaelsc/scylla that referenced this issue May 22, 2024
One source of flakiness is in test_tablet_metadata_propagates_with_schema_changes_in_snapshot_mode, where the gossiper is aborted prematurely, causing a reconnection storm.

Another is test_tablet_missing_data_repair, which is flaky due to an issue in the Python driver where the session might not reconnect on a rolling restart (tracked by scylladb/python-driver#230).

Refs scylladb#15356.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
mergify bot pushed a commit to scylladb/scylladb that referenced this issue May 27, 2024
(cherry picked from commit e724675)
@Lorak-mmk added the bug label Jun 18, 2024