cluster: fence host state and pool races by endpoint #836

Draft
dkropachev wants to merge 4 commits into master from fix/replay-up-after-down-handling

dkropachev (Collaborator) commented Apr 30, 2026

Summary

  • Add identity-keyed event fence state for host liveness and session pool creation so that concurrent up/down handling, replay of queued UP events, reconnectors, and pool callbacks can be invalidated by epoch instead of by mutable Host equality.
  • Replay queued UP events after active DOWN handling finishes, while preserving newer DOWN events, removals, and endpoint swaps that should cancel a stale replay.
  • Scope status events, reconnection success, connection failures, auth-failure quarantine, and pool cleanup to the endpoint observed when the work started, including SNI and client-routes endpoints whose identity is not just address/port.
  • Make session pool creation and removal race-safe: reuse in-flight pool creation, discard stale creations after endpoint changes, remove pools by host identity when endpoint hashes change, and avoid signaling down from stale keyspace/auth failures.
  • Carry expected endpoints through control-connection status changes, defunct connection returns, shard-aware replacement connections, and scheduler de-duplication so delayed work cannot affect a replacement endpoint.
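The epoch fencing and queued UP replay described in the bullets above can be pictured with a minimal sketch. The `HostFence` class and all of its method names are hypothetical and are not taken from the PR; the sketch only assumes that each host has a stable identity (`host_id`) and that every DOWN, removal, or endpoint swap bumps a per-host epoch.

```python
import threading


class HostFence:
    """Hypothetical per-host fence: one epoch counter per stable host identity."""

    def __init__(self):
        self._lock = threading.Lock()
        self._epochs = {}          # host_id -> current epoch
        self._down_active = set()  # host_ids with DOWN handling in progress
        self._queued_up = {}       # host_id -> epoch at which an UP was queued

    def is_current(self, host_id, epoch):
        """True if work started under `epoch` has not been invalidated."""
        with self._lock:
            return self._epochs.get(host_id, 0) == epoch

    def invalidate(self, host_id):
        """Removal or endpoint swap: bump the epoch and cancel any queued UP."""
        with self._lock:
            self._epochs[host_id] = self._epochs.get(host_id, 0) + 1
            self._queued_up.pop(host_id, None)
            return self._epochs[host_id]

    def begin_down(self, host_id):
        """Start DOWN handling; a newer DOWN cancels a previously queued UP."""
        with self._lock:
            self._epochs[host_id] = self._epochs.get(host_id, 0) + 1
            self._down_active.add(host_id)
            self._queued_up.pop(host_id, None)
            return self._epochs[host_id]

    def on_up(self, host_id):
        """Return True if UP may run now, False if it was queued behind a DOWN."""
        with self._lock:
            if host_id in self._down_active:
                self._queued_up[host_id] = self._epochs.get(host_id, 0)
                return False
            return True

    def end_down(self, host_id, epoch):
        """Finish DOWN handling; return True if a queued UP should be replayed."""
        with self._lock:
            if self._epochs.get(host_id, 0) != epoch:
                return False  # a newer event owns this host now
            self._down_active.discard(host_id)
            return self._queued_up.pop(host_id, None) == epoch
```

In this sketch a callback captures the epoch when its work starts and calls `is_current` before applying any result, so the check does not depend on comparing Host objects whose endpoint may have changed in the meantime.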

Driver surface and risks

  • Touches internal cluster host-state handling, session pool lifecycle, control-connection status handling, scheduler uniqueness, and HostConnection reconnect/shard-aware paths.
  • No CQL protocol or public API change is intended. The main compatibility risk is changed timing around concurrent host up/down and endpoint migration handling; the branch adds focused unit coverage for those race cases.

Fixes #317

Test coverage

  • Added unit coverage in tests/unit/test_cluster.py for pool-creation races, stale up/down callbacks, replay of queued UP events, endpoint swaps, reconnector fencing, and non-retryable auth failure handling.
  • Added unit coverage in tests/unit/test_control_connection.py for preserving endpoint identity in delayed status events.
  • Added unit coverage in tests/unit/test_host_connection_pool.py for ignoring defunct stale connections after endpoint/client-route changes.
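As an illustration of the pool-creation race these tests target, the following sketch shows one way to reuse an in-flight creation and discard a stale one after an endpoint change. `PoolCreations` and its methods are hypothetical names, not the PR's code, and the sketch is single-threaded; a real implementation would need locking around the dictionary.

```python
from concurrent.futures import Future


class PoolCreations:
    """Hypothetical tracker for in-flight pool creations, keyed by host identity."""

    def __init__(self):
        self._in_flight = {}  # host_id -> (endpoint, Future)

    def start(self, host_id, endpoint):
        """Return (future, is_new); reuse an in-flight creation for the same endpoint."""
        current = self._in_flight.get(host_id)
        if current is not None and current[0] == endpoint:
            return current[1], False
        future = Future()
        # A different endpoint supersedes whatever creation was in flight.
        self._in_flight[host_id] = (endpoint, future)
        return future, True

    def finish(self, host_id, endpoint, future, pool):
        """Publish the pool only if this creation is still the current one."""
        current = self._in_flight.get(host_id)
        if current is None or current[1] is not future or current[0] != endpoint:
            future.set_result(None)  # stale: the endpoint changed while connecting
            return False
        del self._in_flight[host_id]
        future.set_result(pool)
        return True
```

A second `start` call for the same host and endpoint gets the existing future back, while a creation that completes after the endpoint moved resolves to `None` and is never published.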

Validation

  • uv run pytest tests/unit/test_cluster.py -q
  • git diff --check

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass tests.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

Comment thread cassandra/cluster.py Fixed
dkropachev force-pushed the fix/replay-up-after-down-handling branch 3 times, most recently from 2094ebd to db683a0 on April 30, 2026 17:33
Comment thread cassandra/cluster.py Fixed
dkropachev force-pushed the fix/replay-up-after-down-handling branch from db683a0 to cff85ac on April 30, 2026 17:56
Comment thread cassandra/cluster.py Fixed
dkropachev force-pushed the fix/replay-up-after-down-handling branch from cff85ac to 368e7e6 on April 30, 2026 21:19
Comment thread cassandra/cluster.py Fixed
Comment thread cassandra/cluster.py Fixed
dkropachev force-pushed the fix/replay-up-after-down-handling branch from f72c05d to 72c4f4c on May 2, 2026 13:36
Comment thread cassandra/cluster.py Fixed
dkropachev changed the title from "cluster: replay queued up events" to "cluster: fence host state and pool races by endpoint" on May 4, 2026
Track the endpoint tied to host up/down handling, reconnection callbacks, and pool cleanup so stale work from a previous endpoint cannot mark or reconnect a replacement host.

Preserve endpoint-specific identity for SNI and client-routes endpoints, scope non-retryable auth failures to the matching endpoint, and remove stale pools by host identity instead of endpoint equality.

Add unit coverage for endpoint swaps, queued up/down races, stale reconnector success, and defunct connection handling after client-route port changes.
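To see why removing pools by host identity matters, consider a minimal sketch in which pools are stored in a dict keyed by a host object whose hash includes its endpoint. The `Host` tuple and `remove_pools_for` helper are hypothetical stand-ins, not the driver's real types; the only assumption is that `host_id` stays stable while the endpoint (for example a client-route port) can change.

```python
from collections import namedtuple

# Hypothetical stand-in: equality and hash cover both identity and endpoint.
Host = namedtuple("Host", ["host_id", "endpoint"])


def remove_pools_for(pools, host):
    """Remove every pool owned by host.host_id, whatever endpoint it was keyed under."""
    stale_keys = [key for key in pools if key.host_id == host.host_id]
    return [pools.pop(key) for key in stale_keys]


pools = {Host("a", ("10.0.0.1", 9042)): "pool-a"}
moved = Host("a", ("10.0.0.1", 19042))  # same node, new port

# An equality-based lookup misses the pool created under the old endpoint.
missed = pools.pop(moved, None)

# An identity-based removal finds it.
removed = remove_pools_for(pools, moved)
```

Here `missed` is `None` because the changed endpoint alters the key's hash, while `removed` contains the old pool, which would otherwise linger after the endpoint swap.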
dkropachev force-pushed the fix/replay-up-after-down-handling branch from f5a6b91 to f77de24 on May 4, 2026 15:32
Comment thread cassandra/cluster.py
Comment on lines +2290 to +2291
    def _on_up(self, host, expected_epoch=None, expected_endpoint=None,
               expected_reconnector=None):
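The body of `_on_up` is not visible in this excerpt. One plausible reading of the three optional parameters is that each acts as a guard that is checked only when the caller supplies it. The helper below is a hypothetical sketch of that pattern, not the PR's implementation; `up_is_still_valid`, `current_epoch`, and `current_reconnector` are assumed names.

```python
from types import SimpleNamespace


def up_is_still_valid(host, current_epoch, current_reconnector,
                      expected_epoch=None, expected_endpoint=None,
                      expected_reconnector=None):
    """Return False if any expectation supplied by the caller no longer holds."""
    if expected_epoch is not None and expected_epoch != current_epoch:
        return False  # a newer up/down event has superseded this one
    if expected_endpoint is not None and expected_endpoint != host.endpoint:
        return False  # the host moved to a different endpoint
    if (expected_reconnector is not None
            and expected_reconnector is not current_reconnector):
        return False  # success reported by a reconnector that was replaced
    return True


# Usage with a stand-in host object (assumption: hosts expose `.endpoint`).
host = SimpleNamespace(endpoint=("10.0.0.1", 9042))
ok = up_is_still_valid(host, current_epoch=3, current_reconnector=None,
                       expected_epoch=3, expected_endpoint=("10.0.0.1", 9042))
stale = up_is_still_valid(host, current_epoch=4, current_reconnector=None,
                          expected_epoch=3)
```

Defaulting every expectation to `None` keeps existing callers, which pass only `host`, behaving as before.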
Development

Successfully merging this pull request may close these issues.

Connection pool renewal after concurrent node bootstraps causes double statement execution