session: fix pool renewal race causing double statement execution #838
nyh wants to merge 1 commit into scylladb:master
Conversation
This is another fix for #317. This one was written entirely by AI (vscode + Claude Sonnet 4.6). Unlike #835, which uses "optimistic locking" (it can create two pools but then closes the unneeded one), this implementation tries to avoid creating a new pool if one is already in the process of being created, so the implementation is a bit longer and it is not clear whether the performance advantage is worth it.
Pull request overview
Fixes a race during concurrent host additions where Session.update_created_pools() could schedule duplicate pool creations for the same host, leading to an in-flight request being orphaned and retried on a new pool (double execution / “already exists” errors).
Changes:
- Track in-flight per-host pool-creation tasks via `Session._pending_pool_futures` and reuse them to avoid duplicate scheduling.
- Make pool replacement/discard decisions atomic by reading `previous = self._pools.get(host)` inside the session lock and discarding newly-created pools when a live pool already exists.
- Add unit tests covering reuse of in-flight futures and pending-future invalidation on `remove_pool`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| cassandra/cluster.py | Adds `_pending_pool_futures` and uses it to dedupe concurrent pool creation / renewal work; adjusts locking to prevent replacing a live pool. |
| tests/unit/test_cluster.py | Adds regression tests for the pool-renewal race and for `_pending_pool_futures` behavior. |
@dkropachev one of the test runs, "libev 3.11", hung. I don't know what to make of it, or how to restart it. Anyway, @avikivity @mykaul this (issue #317) is one of the issues that has hit MANY MANY Scylla "cluster tests" ever since the Scylla "cluster test" framework got a function that spawns several servers concurrently. This is perhaps the driver issue that has caused the most flaky tests ever. Let's get it fixed for good (there's this patch, and also #835, as candidate fixes).
Thanks for fixing it!
dkropachev left a comment
I think this still has a few race holes around coalescing by host only:
cassandra/cluster.py:3321
_pending_pool_futures is keyed only by host, so a later add_or_renew_pool(host, is_host_addition=True) can reuse a future created by update_created_pools() with is_host_addition=False (cluster.py:3376). The submitted closure keeps the first flag and passes it to signal_connection_failure() at cluster.py:3254 / 3262. On reconnect, _HostReconnectionHandler dispatches based on that flag (pool.py:360-363), so a failed new-host add can reconnect through on_up() instead of on_add().
cassandra/cluster.py:3321
The reused future also keeps the first computed distance from cluster.py:3245, used to construct HostConnection at cluster.py:3251. If distance changes while creation is pending, the later call cannot replace the old creation. A later update_created_pools() can update pool.host_distance (cluster.py:3377-3382), but it cannot fix construction done with the old distance, e.g. a REMOTE pool created with connect_to_remote_hosts=False has no connections.
cassandra/cluster.py:3342
remove_pool() clears _pending_pool_futures, but the old running task can still publish its pool at cluster.py:3301. If a fresh creation starts after removal, the stale task can finish first, install its pool, and make the fresh task discard its pool via the live-pool check at cluster.py:3288. Need a generation/identity guard before publishing, not just cleanup of the pending map.
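A generation guard of the kind described here could look roughly like the following sketch. The names (`_pool_generations`, `make_pool`, `_Session`) are assumptions for illustration, not the driver's actual code, and the live-pool check discussed elsewhere in this review is omitted for brevity:

```python
# Rough sketch of a generation guard before publishing. remove_pool() bumps
# the host's generation; a creation task publishes its pool only if the
# generation it started under is still current, so a task that raced a
# removal discards its pool instead of installing a stale one.
import itertools
import threading

class _Session:
    def __init__(self):
        self._lock = threading.Lock()
        self._pools = {}
        self._pool_generations = {}              # host -> generation token
        self._gen_counter = itertools.count()

    def remove_pool(self, host):
        with self._lock:
            self._pool_generations[host] = next(self._gen_counter)
            pool = self._pools.pop(host, None)
        if pool is not None:
            pool.shutdown()

    def run_add_or_renew_pool(self, host, make_pool):
        with self._lock:
            started_gen = self._pool_generations.get(host)
        new_pool = make_pool(host)               # slow: opens connections
        with self._lock:
            if self._pool_generations.get(host) != started_gen:
                doomed = new_pool                # removal raced us: do not publish
            else:
                doomed = self._pools.get(host)   # live-pool handling omitted here
                self._pools[host] = new_pool
        if doomed is not None:
            doomed.shutdown()
```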
When two or more nodes are bootstrapped concurrently the Python driver
can execute the same CQL statement twice, causing spurious "already
exists" errors in the caller. This has been observed as flaky test
failures across the ScyllaDB test suite for the past two years, and
worked around by using idempotent DDL forms (IF NOT EXISTS / IF EXISTS)
in dozens of tests.
Root cause
----------
The race unfolds as follows:
1. Two on_add notifications arrive at roughly the same time, one for
each new node. Each one calls session.add_or_renew_pool(), which
submits run_add_or_renew_pool() to the thread pool and returns.
Both submissions are in-flight concurrently.
2. The first add_or_renew_pool() finishes and calls _finalize_add(),
which notifies load-balancing policies and then calls
session.update_created_pools() for every live session.
3. update_created_pools() iterates all known hosts. For the second
host, whose run_add_or_renew_pool() has not yet completed, it sees
self._pools.get(host) == None (or a shut-down pool) and therefore
submits *another* run_add_or_renew_pool() for that host.
4. Now two tasks are connecting to the same host. The first one
finishes and installs pool-A in self._pools, then runs a statement
(e.g. CREATE ROLE) that is in-flight on pool-A.
5. The second task finishes, reads the stale `previous = self._pools.get(host)`
value (captured *before* the lock was taken — another bug), installs
pool-B and then shuts down pool-A. The in-flight CREATE ROLE request
is orphaned; the driver retries it on pool-B. The server executes it
a second time and returns "Role ... already exists".
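In code terms, the pre-fix flow sketched in steps 1-5 behaved roughly like this (a simplified illustration, not the driver's exact code):

```python
# Simplified illustration of the pre-fix behaviour: `previous` is read before
# the lock is taken, and nothing stops a second creation task from replacing,
# and then shutting down, a live pool that still has requests in flight.
def run_add_or_renew_pool(self, host, distance):
    new_pool = HostConnection(host, distance, self)   # slow: opens connections
    previous = self._pools.get(host)                  # read outside the lock (stale)
    with self._lock:
        self._pools[host] = new_pool                  # may replace a live pool
    if previous:
        previous.shutdown()                           # orphans its in-flight requests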
Fix
---
Three coordinated changes to cassandra/cluster.py:
* Session.__init__: add self._pending_pool_futures = {}, a dict mapping
host -> entry (with future, creation_id, distance,
is_host_addition_cell) for any in-flight pool creation, guarded by
_lock.
* add_or_renew_pool: before submitting run_add_or_renew_pool(), check
_pending_pool_futures under _lock. If an in-flight future already
exists for the host with the same distance, return it immediately —
this is the primary fix that prevents the duplicate submission from
update_created_pools. If is_host_addition=True on the new call but
the existing entry has False, upgrade it in-place via a shared
is_host_addition_cell so the closure passes the correct flag to
signal_connection_failure() and _HostReconnectionHandler dispatches
through on_add() instead of on_up() on reconnect. If the distance
changed, submit a fresh task (the old HostConnection was constructed
with stale distance, e.g. no connections for REMOTE with
connect_to_remote_hosts=False).
Each submission gets a unique creation_id token. The closure checks
_pending_pool_futures[host].creation_id before installing its pool:
if remove_pool() ran and a fresher creation was submitted while this
task was connecting, the stale task discards its pool rather than
overwriting the fresher one.
Additionally, move the `previous = self._pools.get(host)` read inside
the lock so the live-pool check is atomic with the installation of the
new pool: if a concurrent creation has already installed a live pool
by the time we finish connecting, discard our new pool instead of
replacing the live one (defense-in-depth).
Cleanup of _pending_pool_futures is handled by a done_callback
registered on the future. The callback acquires _lock and only clears
the entry if it still holds the same creation_id it was registered
on, so a concurrent remove_pool followed by a new add_or_renew_pool
is not affected. The entry is stored before calling submit() so that
the closure always finds a valid creation_id in the dict, even when
the executor runs the task synchronously.
* remove_pool: clear _pending_pool_futures[host] under _lock so that
if a host is removed and immediately re-added, add_or_renew_pool
submits a fresh creation rather than reusing a stale done future.
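Put together, the path described above looks roughly like the condensed sketch below. Names from the driver (HostConnection, self.submit, self._lock) are used as in the surrounding module, the HostDistance.IGNORED early return and error handling are omitted, and the ordering is simplified: this sketch records the pending entry only after submit(), whereas the patch stores it first so the creation_id guard also works with a synchronous executor.

```python
# Condensed sketch of the dedup/guard logic described above; not the patch's
# exact code.
def add_or_renew_pool(self, host, is_host_addition):
    distance = self._profile_manager.distance(host)
    with self._lock:
        entry = self._pending_pool_futures.get(host)
        if entry is not None and entry['distance'] == distance:
            if is_host_addition:
                entry['is_host_addition_cell'][0] = True   # upgrade flag in place
            return entry['future']                         # coalesce onto in-flight work

    creation_id = object()                                 # unique token per submission
    is_host_addition_cell = [is_host_addition]             # read by error paths (not shown)

    def run_add_or_renew_pool():
        new_pool = HostConnection(host, distance, self)    # slow: opens connections
        with self._lock:
            current = self._pending_pool_futures.get(host)
            mine = current is not None and current['creation_id'] is creation_id
            previous = self._pools.get(host)
            if not mine or (previous is not None and not previous.is_shutdown):
                doomed = new_pool                          # lost the race: discard our pool
            else:
                self._pools[host] = new_pool               # publish
                doomed = previous                          # None or already shut down
        if doomed is not None:
            doomed.shutdown()
        return doomed is not new_pool

    def clear_pending(_):
        with self._lock:
            current = self._pending_pool_futures.get(host)
            if current is not None and current['creation_id'] is creation_id:
                del self._pending_pool_futures[host]

    future = self.submit(run_add_or_renew_pool)
    with self._lock:
        self._pending_pool_futures[host] = {
            'future': future, 'creation_id': creation_id,
            'distance': distance, 'is_host_addition_cell': is_host_addition_cell,
        }
    future.add_done_callback(clear_pending)
    return future
```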
Tests
-----
Five new unit tests are added in PoolRenewalRaceTest
(tests/unit/test_cluster.py). They exercise the new code paths without
requiring a real cluster connection by constructing a minimal Session
via object.__new__ and mocking the executor and profile manager.
The tests use the new dict-based entry format for _pending_pool_futures:
* test_add_or_renew_pool_reuses_inflight_future: places a pending
entry in _pending_pool_futures and verifies that add_or_renew_pool
returns the existing future without submitting a new task.
* test_add_or_renew_pool_discards_duplicate_when_live_pool_exists:
exercises the real production code path by patching HostConnection
to a lightweight stub and using a synchronous executor shim that
runs the submitted callable inline. Pre-installs a live pool for
the host, then calls add_or_renew_pool() and asserts that the live
pool is not replaced and the newly connected stub pool is shut down.
* test_remove_pool_clears_pending_future: verifies that remove_pool
clears _pending_pool_futures so the next add_or_renew_pool call
submits a fresh task.
* test_done_callback_clears_pending_future: verifies that the
done_callback fires and removes the entry from _pending_pool_futures
once the future completes.
* test_done_callback_does_not_clear_newer_future: verifies the
creation_id guard — an old future's callback does not evict a newer
entry installed in its place after a remove_pool + add_or_renew_pool.
Fixes: scylladb#317
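The synchronous executor shim those tests rely on could look like this minimal sketch (InlineExecutor is an assumed name, not the test file's actual helper):

```python
# Minimal synchronous executor shim for driving add_or_renew_pool() inline in
# unit tests; the name and exact shape are assumptions, not the PR's test code.
from concurrent.futures import Future

class InlineExecutor:
    def submit(self, fn, *args, **kwargs):
        future = Future()
        try:
            future.set_result(fn(*args, **kwargs))   # run the callable right now
        except Exception as exc:
            future.set_exception(exc)
        return future
```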
(My AI) fixed all three issues identified by @dkropachev ('s AI). Thanks for the detailed review! You identified three real races. Here's what the new version does to fix each one:
Instead of a plain bool, the closure now captures is_host_addition_cell = [is_host_addition] — a mutable one-element list. If a second call arrives with is_host_addition=True while the first future is still in-flight, the coalescing path upgrades the cell in-place (entry['is_host_addition_cell'][0] = True) before returning the existing future. The closure then reads is_host_addition_cell[0] at call time, so it always sees the stricter on_add() reconnect path when needed.
The entry dict now stores the distance computed at submission time. When a second add_or_renew_pool() call arrives and finds an in-flight entry, it only coalesces if entry['distance'] == distance. If the distance has changed, the code falls through and submits a fresh task with the new distance. The creation_id guard (see below) ensures the fresh task's pool wins.
Each submission generates a unique creation_id = object(). Before calling submit(), the entry (including creation_id) is stored in _pending_pool_futures under _lock. The closure, still under _lock, checks entry['creation_id'] is creation_id before installing the pool — if it doesn't match (because remove_pool() cleared the entry, or a fresh submission replaced it), the closure discards its pool. remove_pool() pops the entry under _lock, so any task that started before the removal and checks afterward will see the mismatch and discard. The done-callback that clears the entry also verifies the creation_id before removing, so a concurrent fresh submission isn't cleared. The entry is stored before submit() returns so the guard works even when the executor runs the task synchronously (e.g. in tests with a direct-call executor).
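As a self-contained toy (illustrative names only, not the PR's code), the mutable-cell upgrade works like this:

```python
# Toy demonstration of the mutable one-element list: the submitted closure
# reads the flag only when it needs it, so an in-place upgrade performed by a
# later, coalescing caller is still observed.
from concurrent.futures import ThreadPoolExecutor
import threading

is_host_addition_cell = [False]          # shared, mutable flag
upgrade_done = threading.Event()

def run_add_or_renew_pool():
    upgrade_done.wait()                  # stand-in for slow connection setup
    return is_host_addition_cell[0]      # read at call time, not capture time

with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(run_add_or_renew_pool)
    is_host_addition_cell[0] = True      # coalescing caller upgrades in place
    upgrade_done.set()
    print(future.result())               # True: the upgrade was seen
```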
```python
def run_add_or_renew_pool():
    try:
        new_pool = HostConnection(host, distance, self)
```
@nyh, I have found 3 more correctness issues and it will go like this forever; I have already tried to solve it this way. I am working on another solution at #836 that basically does event queueing and fencing; it will 100% work better.
When two or more nodes are bootstrapped concurrently the Python driver
can execute the same CQL statement twice, causing spurious "already
exists" errors in the caller. This has been observed as flaky test
failures across the ScyllaDB test suite for the past two years, and
worked around by using idempotent DDL forms (IF NOT EXISTS / IF EXISTS)
in dozens of tests.
Root cause
The race unfolds as follows:
1. Two on_add notifications arrive at roughly the same time, one for
   each new node. Each one calls session.add_or_renew_pool(), which
   submits run_add_or_renew_pool() to the thread pool and returns.
   Both submissions are in-flight concurrently.
2. The first add_or_renew_pool() finishes and calls _finalize_add(),
   which notifies load-balancing policies and then calls
   session.update_created_pools() for every live session.
3. update_created_pools() iterates all known hosts. For the second
   host, whose run_add_or_renew_pool() has not yet completed, it sees
   self._pools.get(host) == None (or a shut-down pool) and therefore
   submits another run_add_or_renew_pool() for that host.
4. Now two tasks are connecting to the same host. The first one
   finishes and installs pool-A in self._pools, then runs a statement
   (e.g. CREATE ROLE) that is in-flight on pool-A.
5. The second task finishes, reads the stale
   previous = self._pools.get(host) value (captured before the lock
   was taken — another bug), installs pool-B and then shuts down
   pool-A. The in-flight CREATE ROLE request is orphaned; the driver
   retries it on pool-B. The server executes it a second time and
   returns "Role ... already exists".
Fix
Three coordinated changes to cassandra/cluster.py:
* Session.__init__: add self._pending_pool_futures = {}, a dict mapping
  host -> Future for any in-flight pool creation, guarded by _lock.
* add_or_renew_pool: before submitting run_add_or_renew_pool(), check
  _pending_pool_futures under _lock. If an in-flight future already
  exists for the host, return it immediately — this is the primary fix
  that prevents the duplicate submission from update_created_pools.
  Additionally, move the previous = self._pools.get(host) read inside
  the lock so the live-pool check is atomic with the installation of
  the new pool: if a concurrent creation has already installed a live
  pool by the time we finish connecting, discard our new pool instead
  of replacing the live one (defense-in-depth).
  Cleanup of _pending_pool_futures is handled by a done_callback
  registered on the future immediately after it is stored, both
  operations performed under _lock. The callback only removes the entry
  if it still points at the same future it was registered on, so a
  concurrent remove_pool followed by a new add_or_renew_pool is not
  affected. This guarantees cleanup under all exit paths including
  unhandled exceptions inside run_add_or_renew_pool, and avoids the
  race where a fast-completing task pops the key before the outer code
  has stored the future.
* remove_pool: clear _pending_pool_futures[host] under _lock so that
  if a host is removed and immediately re-added, add_or_renew_pool
  submits a fresh creation rather than reusing a stale done future.
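The identity-guarded cleanup could be sketched like this (assumed helper name, not the patch's exact code; the callback is registered outside the lock here for simplicity):

```python
# Sketch of the done_callback cleanup with an identity guard. The callback
# removes the pending entry only if the map still holds the very future it
# was registered on, so a remove_pool followed by a fresh add_or_renew_pool
# is unaffected.
def _register_pending(self, host, future):
    def clear_pending(completed_future):
        with self._lock:
            if self._pending_pool_futures.get(host) is completed_future:
                del self._pending_pool_futures[host]

    with self._lock:
        self._pending_pool_futures[host] = future
    # Registered outside the lock so an already-completed future, whose
    # callback runs immediately in this thread, does not re-enter self._lock.
    future.add_done_callback(clear_pending)
```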
Tests
Five new unit tests are added in PoolRenewalRaceTest
(tests/unit/test_cluster.py). They exercise the new code paths without
requiring a real cluster connection by constructing a minimal Session
via object.__new__ and mocking the executor and profile manager:
* test_add_or_renew_pool_reuses_inflight_future: places a pending
  Future in _pending_pool_futures and verifies that add_or_renew_pool
  returns it without submitting a new task to the executor.
* test_add_or_renew_pool_discards_duplicate_when_live_pool_exists:
  exercises the real production code path by patching HostConnection
  to a lightweight stub and using a synchronous executor shim that
  runs the submitted callable inline. Pre-installs a live pool for
  the host, then calls add_or_renew_pool() and asserts that the live
  pool is not replaced and the newly connected stub pool is shut down.
* test_remove_pool_clears_pending_future: verifies that remove_pool
  clears _pending_pool_futures so the next add_or_renew_pool call
  submits a fresh task.
* test_done_callback_clears_pending_future: verifies that the
  done_callback fires and removes the entry from _pending_pool_futures
  once the future completes.
* test_done_callback_does_not_clear_newer_future: verifies the identity
  guard — an old future's callback does not evict a newer future that
  was installed in its place after a remove_pool + add_or_renew_pool.
Fixes: #317