cluster: fence host state and pool races by endpoint by dkropachev · Pull Request #836 · scylladb/python-driver

dkropachev · 2026-04-30T15:57:52Z

Summary

Add identity-keyed event fence state for host liveness and session pool creation, so that concurrent up/down handling, replay of queued UP events, reconnectors, and pool callbacks can be invalidated by epoch rather than by mutable Host equality.
Replay queued UP events after active DOWN handling finishes, while preserving newer down events, removals, and endpoint swaps that should cancel stale replay.
Scope status events, reconnection success, connection failures, auth-failure quarantine, and pool cleanup to the endpoint observed when the work started, including SNI and client-routes endpoints whose identity is not just address/port.
Make session pool creation and removal race-safe: reuse in-flight pool creation, discard stale creations after endpoint changes, remove pools by host identity when endpoint hashes change, and avoid signaling down from stale keyspace/auth failures.
Carry expected endpoints through control-connection status changes, defunct connection returns, shard-aware replacement connections, and scheduler de-duplication so delayed work cannot affect a replacement endpoint.
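
The epoch-based invalidation described in the summary can be illustrated with a small standalone sketch. This is not the code from the branch: `HostFence`, `FakeHost`, `advance`, and `is_current` are hypothetical names, and keying by `id(host)` is a simplification that assumes the host object stays alive for as long as its entry exists.

```python
import threading


class HostFence:
    """Hypothetical per-host fence: an epoch counter guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        # Keyed by object identity, not by Host equality. A real
        # implementation would also need to drop entries for removed hosts.
        self._epochs = {}

    def advance(self, host):
        """Start a new up/down transition and return its epoch token."""
        with self._lock:
            epoch = self._epochs.get(id(host), 0) + 1
            self._epochs[id(host)] = epoch
            return epoch

    def is_current(self, host, epoch):
        """True only if no newer transition has started for this object."""
        with self._lock:
            return self._epochs.get(id(host)) == epoch


class FakeHost:
    """Stand-in for the driver's Host; equality is endpoint-based."""

    def __init__(self, endpoint):
        self.endpoint = endpoint

    def __eq__(self, other):
        return isinstance(other, FakeHost) and self.endpoint == other.endpoint

    def __hash__(self):
        return hash(self.endpoint)


fence = HostFence()
host = FakeHost("10.0.0.1:9042")

up_token = fence.advance(host)    # an UP handler starts
down_token = fence.advance(host)  # a DOWN event arrives before UP finishes

# The UP handler checks its token before mutating state and backs off.
stale_up_applies = fence.is_current(host, up_token)
down_applies = fence.is_current(host, down_token)

# An equal-but-distinct Host object does not inherit the token.
lookalike_applies = fence.is_current(FakeHost("10.0.0.1:9042"), down_token)
```

The point of the token is that a late callback only has to compare an integer; it never has to reason about whether the Host it captured still compares equal to the one currently registered.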

Driver surface and risks

Touches internal cluster host-state handling, session pool lifecycle, control-connection status handling, scheduler uniqueness, and HostConnection reconnect/shard-aware paths.
No CQL protocol or public API change is intended. The main compatibility risk is changed timing around concurrent host up/down and endpoint migration handling; the branch adds focused unit coverage for those race cases.
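
For the session pool lifecycle part, the "reuse in-flight pool creation, discard stale creations after endpoint changes" behavior from the summary might look roughly like the sketch below. `PoolRegistry`, `add_or_renew_pool`, and `FakeHost` are illustrative names assumed here, and the "concurrent" caller is simulated by a re-entrant call so the example stays deterministic.

```python
import threading
from concurrent.futures import Future


class PoolRegistry:
    """Hypothetical registry that de-duplicates pool creation per host."""

    def __init__(self, create_pool):
        self._create_pool = create_pool
        self._lock = threading.Lock()
        self._in_flight = {}  # id(host) -> (endpoint, Future)
        self.pools = {}       # id(host) -> pool

    def add_or_renew_pool(self, host):
        endpoint = host.endpoint  # endpoint observed when the work started
        with self._lock:
            entry = self._in_flight.get(id(host))
            if entry is not None and entry[0] == endpoint:
                return entry[1]  # reuse the creation already in flight
            future = Future()
            self._in_flight[id(host)] = (endpoint, future)

        pool = self._create_pool(host, endpoint)  # slow; runs outside the lock

        with self._lock:
            current = self._in_flight.get(id(host))
            is_current = current is not None and current[1] is future
            if is_current:
                del self._in_flight[id(host)]
            # Discard the result if we were superseded or the endpoint moved.
            stale = not is_current or host.endpoint != endpoint
            if not stale:
                self.pools[id(host)] = pool
        # A real driver would also shut down a discarded pool here.
        future.set_result(None if stale else pool)
        return future


class FakeHost:
    def __init__(self, endpoint):
        self.endpoint = endpoint


created, reused = [], []


def create_pool(host, endpoint):
    # Simulate a second request arriving while creation is in progress.
    reused.append(registry.add_or_renew_pool(host))
    created.append(endpoint)
    return "pool@" + endpoint


registry = PoolRegistry(create_pool)
host = FakeHost("10.0.0.1:9042")
future = registry.add_or_renew_pool(host)


def swapping_create(host, endpoint):
    host.endpoint = "10.0.0.2:9042"  # endpoint swap during creation
    return "pool@" + endpoint


stale_registry = PoolRegistry(swapping_create)
stale_future = stale_registry.add_or_renew_pool(FakeHost("10.0.0.1:9042"))
```

In this sketch the second caller receives the same future as the first, only one pool is created, and a creation that finishes after the endpoint changed resolves to `None` without being stored.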

Fixes #317

Test coverage

Added unit coverage in tests/unit/test_cluster.py for pool-creation races, stale up/down callbacks, queued-up replay, endpoint swaps, reconnector fencing, and non-retryable auth failure handling.
Added unit coverage in tests/unit/test_control_connection.py for preserving endpoint identity in delayed status events.
Added unit coverage in tests/unit/test_host_connection_pool.py for ignoring defunct stale connections after endpoint/client-route changes.

Validation

uv run pytest tests/unit/test_cluster.py -q
git diff --check

Pre-review checklist

I have split my patch into logically separate commits.
All commit messages clearly explain what they change and why.
I added relevant tests for new features and bug fixes.
All commits compile, pass static checks and pass tests.
PR description sums up the changes and reasons why they should be introduced.
I have provided docstrings for the public items that I want to introduce.
I have adjusted the documentation in ./docs/source/.
I added appropriate Fixes: annotations to PR description.

Track the endpoint tied to host up/down handling, reconnection callbacks, and pool cleanup so stale work from a previous endpoint cannot mark or reconnect a replacement host. Preserve endpoint-specific identity for SNI and client-routes endpoints, scope non-retryable auth failures to the matching endpoint, and remove stale pools by host identity instead of endpoint equality. Add unit coverage for endpoint swaps, queued up/down races, stale reconnector success, and defunct connection handling after client-route port changes.
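
Why removal "by host identity instead of endpoint equality" matters can be shown with plain Python dictionaries. The following is a sketch under assumptions, not the driver's code: `FakeHost` and `remove_pool_by_identity` are hypothetical, and the hash is assumed to derive from a mutable endpoint.

```python
class FakeHost:
    """Stand-in Host whose hash and equality follow its mutable endpoint."""

    def __init__(self, endpoint):
        self.endpoint = endpoint

    def __eq__(self, other):
        return isinstance(other, FakeHost) and self.endpoint == other.endpoint

    def __hash__(self):
        return hash(self.endpoint)


def remove_pool_by_identity(pools, host):
    """Remove the entry whose key *is* host, even if its hash has changed.

    Deleting by key would use the key's current hash, which may no longer
    match the hash stored in the dict, so the dict is rebuilt in place.
    """
    removed = None
    kept = []
    for key, pool in list(pools.items()):
        if key is host:
            removed = pool
        else:
            kept.append((key, pool))
    pools.clear()
    pools.update(kept)
    return removed


host = FakeHost(("10.0.0.1", 9042))
pools = {host: "pool-a"}

# A client-route change moves the port; the key's hash changes with it.
host.endpoint = ("10.0.0.1", 9142)

# Equality-based lookup is unreliable now, because the entry was stored
# under the old hash. The result is intentionally not relied upon here.
found_by_equality = host in pools

# A distinct Host object that compares equal gets its own pool.
other = FakeHost(("10.0.0.1", 9142))
pools[other] = "pool-b"

removed = remove_pool_by_identity(pools, host)
```

After the call, only the pool registered under `other` remains, even though `host` and `other` compare equal at that moment.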

+    def _on_up(self, host, expected_epoch=None, expected_endpoint=None,
+               expected_reconnector=None):
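
Only the signature is visible in this hunk. As a hedged illustration of how such optional guard parameters are commonly used, the sketch below treats `None` as "no check requested", which is an assumption and not something the diff confirms. `should_apply_up`, `current_epoch`, and `current_reconnector` are hypothetical names.

```python
from types import SimpleNamespace


def should_apply_up(host, current_epoch, current_reconnector,
                    expected_epoch=None, expected_endpoint=None,
                    expected_reconnector=None):
    """Return False when the caller's snapshot no longer matches the host."""
    if expected_epoch is not None and expected_epoch != current_epoch:
        return False  # a newer up/down transition has started
    if expected_endpoint is not None and host.endpoint != expected_endpoint:
        return False  # the host now points at a different endpoint
    if (expected_reconnector is not None
            and current_reconnector is not expected_reconnector):
        return False  # the reconnector that scheduled us was replaced
    return True


host = SimpleNamespace(endpoint="10.0.0.2:9042")
reconnector = object()

fresh = should_apply_up(host, 5, reconnector,
                        expected_epoch=5,
                        expected_endpoint="10.0.0.2:9042",
                        expected_reconnector=reconnector)
stale_epoch = should_apply_up(host, 5, reconnector, expected_epoch=4)
stale_endpoint = should_apply_up(host, 5, reconnector,
                                 expected_endpoint="10.0.0.1:9042")
stale_reconnector = should_apply_up(host, 5, reconnector,
                                    expected_reconnector=object())
unguarded = should_apply_up(host, 5, reconnector)
```

Keeping the guards optional would let existing callers that pass only `host` keep their current behavior, while delayed callbacks can opt in to fencing.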


github-code-quality Bot found potential problems Apr 30, 2026

Comment thread cassandra/cluster.py Fixed

dkropachev force-pushed the fix/replay-up-after-down-handling branch 3 times, most recently from 2094ebd to db683a0 Compare April 30, 2026 17:33

github-code-quality Bot found potential problems Apr 30, 2026

Comment thread cassandra/cluster.py Fixed

dkropachev force-pushed the fix/replay-up-after-down-handling branch from db683a0 to cff85ac Compare April 30, 2026 17:56

github-code-quality Bot found potential problems Apr 30, 2026

Comment thread cassandra/cluster.py Fixed

dkropachev force-pushed the fix/replay-up-after-down-handling branch from cff85ac to 368e7e6 Compare April 30, 2026 21:19

github-code-quality Bot found potential problems Apr 30, 2026

Comment thread cassandra/cluster.py Fixed

github-code-quality Bot found potential problems May 2, 2026

Comment thread cassandra/cluster.py Fixed

dkropachev added 3 commits May 2, 2026 09:14

cluster: introduce event fence state

ccf262c

cluster: replay up events after down handling

c3ad1ec

session: fence pool creation and stale endpoints

72c4f4c

dkropachev force-pushed the fix/replay-up-after-down-handling branch from f72c05d to 72c4f4c Compare May 2, 2026 13:36

github-code-quality Bot found potential problems May 4, 2026

Comment thread cassandra/cluster.py Fixed

dkropachev mentioned this pull request May 4, 2026

session: fix pool renewal race causing double statement execution #838

Open

dkropachev changed the title ~~cluster: replay queued up events~~ cluster: fence host state and pool races by endpoint May 4, 2026

dkropachev force-pushed the fix/replay-up-after-down-handling branch from f5a6b91 to f77de24 Compare May 4, 2026 15:32

github-code-quality Bot found potential problems May 4, 2026

Comment thread cassandra/cluster.py

Comment on lines +2290 to +2291

    def _on_up(self, host, expected_epoch=None, expected_endpoint=None,
               expected_reconnector=None):

dkropachev wants to merge 4 commits into master from fix/replay-up-after-down-handling
