session: fix pool renewal race causing double statement execution#838

Open
nyh wants to merge 1 commit into scylladb:master from nyh:fix-317

Conversation

@nyh nyh commented May 3, 2026

When two or more nodes are bootstrapped concurrently the Python driver
can execute the same CQL statement twice, causing spurious "already
exists" errors in the caller. This has been observed as flaky test
failures across the ScyllaDB test suite for the past two years, and
worked around by using idempotent DDL forms (IF NOT EXISTS / IF EXISTS)
in dozens of tests.

Root cause

The race unfolds as follows:

  1. Two on_add notifications arrive at roughly the same time, one for
    each new node. Each one calls session.add_or_renew_pool(), which
    submits run_add_or_renew_pool() to the thread pool and returns.
    Both submissions are in-flight concurrently.

  2. The first add_or_renew_pool() finishes and calls _finalize_add(),
    which notifies load-balancing policies and then calls
    session.update_created_pools() for every live session.

  3. update_created_pools() iterates all known hosts. For the second
    host, whose run_add_or_renew_pool() has not yet completed, it sees
    self._pools.get(host) == None (or a shut-down pool) and therefore
    submits another run_add_or_renew_pool() for that host.

  4. Now two tasks are connecting to the same host. The first one
    finishes and installs pool-A in self._pools, then runs a statement
    (e.g. CREATE ROLE) that is in-flight on pool-A.

  5. The second task finishes, reads the stale previous = self._pools.get(host)
    value (captured before the lock was taken — another bug), installs
    pool-B and then shuts down pool-A. The in-flight CREATE ROLE request
    is orphaned; the driver retries it on pool-B. The server executes it
    a second time and returns "Role ... already exists".
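To make the replacement in steps 4-5 concrete, here is a small self-contained toy model of the pattern described above (hypothetical names; the driver's real run_add_or_renew_pool has more detail):

```python
import threading

class ToySession:
    """Toy model of the pre-fix behavior; not the driver's actual code."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pools = {}

    def run_add_or_renew_pool(self, host, make_pool):
        previous = self._pools.get(host)    # stale read, taken outside the lock
        new_pool = make_pool()              # slow: opens connections to the host
        with self._lock:
            self._pools[host] = new_pool    # may silently replace a live pool
        if previous:
            previous.shutdown()             # orphans requests in flight on it
```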

Fix

Three coordinated changes to cassandra/cluster.py:

  • Session.__init__: add self._pending_pool_futures = {}, a dict mapping
    host -> Future for any in-flight pool creation, guarded by _lock.

  • add_or_renew_pool: before submitting run_add_or_renew_pool(), check
    _pending_pool_futures under _lock. If an in-flight future already
    exists for the host, return it immediately — this is the primary fix
    that prevents the duplicate submission from update_created_pools.
    Additionally, move the previous = self._pools.get(host) read inside
    the lock so the live-pool check is atomic with the installation of the
    new pool: if a concurrent creation has already installed a live pool
    by the time we finish connecting, discard our new pool instead of
    replacing the live one (defense-in-depth).
    Cleanup of _pending_pool_futures is handled by a done_callback
    registered on the future immediately after it is stored, both
    operations performed under _lock. The callback only removes the entry
    if it still points at the same future it was registered on, so a
    concurrent remove_pool followed by a new add_or_renew_pool is not
    affected. This guarantees cleanup under all exit paths including
    unhandled exceptions inside run_add_or_renew_pool, and avoids the
    race where a fast-completing task pops the key before the outer code
    has stored the future.

  • remove_pool: clear _pending_pool_futures[host] under _lock so that
    if a host is removed and immediately re-added, add_or_renew_pool
    submits a fresh creation rather than reusing a stale done future.
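A minimal self-contained sketch of this scheme, with hypothetical names (the real change lives in Session.add_or_renew_pool in cassandra/cluster.py and differs in detail):

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor

class PendingPoolRegistry:
    """Toy model of the per-host coalescing described above."""

    def __init__(self):
        # Reentrant lock: a done-callback can fire on the registering thread
        # (when the future is already finished) while the lock is still held.
        self._lock = threading.RLock()
        self._pending_pool_futures = {}          # host -> in-flight Future
        self._executor = ThreadPoolExecutor(max_workers=4)

    def add_or_renew_pool(self, host, create_pool) -> Future:
        with self._lock:
            pending = self._pending_pool_futures.get(host)
            if pending is not None and not pending.done():
                # Primary fix: a creation for this host is already in flight,
                # so reuse it instead of submitting a duplicate task.
                return pending

            future = self._executor.submit(create_pool, host)
            self._pending_pool_futures[host] = future

            def _clear_pending(done_future, host=host):
                with self._lock:
                    # Identity guard: only drop the entry if it still points at
                    # the future this callback was registered on; a newer entry
                    # installed after remove_pool() is left alone.
                    if self._pending_pool_futures.get(host) is done_future:
                        del self._pending_pool_futures[host]

            future.add_done_callback(_clear_pending)
            return future

    def remove_pool(self, host):
        with self._lock:
            # Forget any pending creation so a host that is removed and then
            # re-added gets a fresh creation, not a stale done future.
            self._pending_pool_futures.pop(host, None)
```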

Tests

Five new unit tests are added in PoolRenewalRaceTest
(tests/unit/test_cluster.py). They exercise the new code paths without
requiring a real cluster connection by constructing a minimal Session
via object.__new__ and mocking the executor and profile manager:

  • test_add_or_renew_pool_reuses_inflight_future: places a pending
    Future in _pending_pool_futures and verifies that add_or_renew_pool
    returns it without submitting a new task to the executor.

  • test_add_or_renew_pool_discards_duplicate_when_live_pool_exists:
    exercises the real production code path by patching HostConnection
    to a lightweight stub and using a synchronous executor shim that
    runs the submitted callable inline. Pre-installs a live pool for
    the host, then calls add_or_renew_pool() and asserts that the live
    pool is not replaced and the newly connected stub pool is shut down.

  • test_remove_pool_clears_pending_future: verifies that remove_pool
    clears _pending_pool_futures so the next add_or_renew_pool call
    submits a fresh task.

  • test_done_callback_clears_pending_future: verifies that the
    done_callback fires and removes the entry from _pending_pool_futures
    once the future completes.

  • test_done_callback_does_not_clear_newer_future: verifies the identity
    guard — an old future's callback does not evict a newer future that
    was installed in its place after a remove_pool + add_or_renew_pool.
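As an illustration, the first test's shape could look roughly like this; the mocked attribute names below are inferred from the description and may not match the actual test in tests/unit/test_cluster.py:

```python
import unittest
from concurrent.futures import Future
from threading import RLock
from unittest.mock import Mock

from cassandra.cluster import Session
from cassandra.policies import HostDistance

class PoolRenewalRaceTestSketch(unittest.TestCase):
    def test_add_or_renew_pool_reuses_inflight_future(self):
        # Minimal Session built without a real cluster connection.
        session = Session.__new__(Session)
        session._lock = RLock()
        session._pools = {}
        session._pending_pool_futures = {}
        session.is_shutdown = False
        session.cluster = Mock()                       # mocked executor lives here
        session._profile_manager = Mock()              # assumed attribute name
        session._profile_manager.distance.return_value = HostDistance.LOCAL

        host = Mock(name="host")
        inflight = Future()                            # still pending
        session._pending_pool_futures[host] = inflight

        result = session.add_or_renew_pool(host, is_host_addition=False)

        self.assertIs(result, inflight)                # in-flight future is reused
        session.cluster.executor.submit.assert_not_called()
```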

Fixes: #317


nyh commented May 3, 2026

This is another fix for #317. This one was written entirely by AI (vscode + Claude Sonnet 4.6). Unlike #835, which uses "optimistic locking" (it can create two pools but then closes the unneeded one), this implementation tries to avoid creating a new pool while one is already in the process of being created, so the implementation is a bit longer and it's not clear whether the performance advantage is worth it.


Copilot AI left a comment


Pull request overview

Fixes a race during concurrent host additions where Session.update_created_pools() could schedule duplicate pool creations for the same host, leading to an in-flight request being orphaned and retried on a new pool (double execution / “already exists” errors).

Changes:

  • Track in-flight per-host pool-creation tasks via Session._pending_pool_futures and reuse them to avoid duplicate scheduling.
  • Make pool replacement/discard decisions atomic by reading previous = self._pools.get(host) inside the session lock and discarding newly-created pools when a live pool already exists.
  • Add unit tests covering reuse of in-flight futures and pending-future invalidation on remove_pool.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

  • cassandra/cluster.py: Adds _pending_pool_futures and uses it to dedupe concurrent pool creation / renewal work; adjusts locking to prevent replacing a live pool.
  • tests/unit/test_cluster.py: Adds regression tests for the pool-renewal race and for _pending_pool_futures behavior.


@nyh nyh requested a review from Lorak-mmk May 3, 2026 19:14
@nyh nyh requested a review from Copilot May 3, 2026 19:25

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.




Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.




nyh commented May 4, 2026

@dkropachev one of the test runs, "libev 3.11", hung. I don't know what to make of it, or how to restart it.

Anyway, @avikivity @mykaul, this (issue #317) is one of the issues that hit MANY MANY Scylla "cluster tests" ever since the Scylla "cluster test" framework got a function that spawns several servers concurrently. This is perhaps the driver issue that caused the most flaky tests ever. Let's get it fixed for good (there's this patch, and also #835, as candidate fixes).

@avikivity (Member)

> @dkropachev one of the test runs, "libev 3.11", hung. I don't know what to make of it, or how to restart it.
>
> Anyway, @avikivity @mykaul, this (issue #317) is one of the issues that hit MANY MANY Scylla "cluster tests" ever since the Scylla "cluster test" framework got a function that spawns several servers concurrently. This is perhaps the driver issue that caused the most flaky tests ever. Let's get it fixed for good (there's this patch, and also #835, as candidate fixes).

Thanks for fixing it!

@dkropachev (Collaborator) left a comment

I think this still has a few race holes around coalescing by host only:

  1. cassandra/cluster.py:3321

_pending_pool_futures is keyed only by host, so a later add_or_renew_pool(host, is_host_addition=True) can reuse a future created by update_created_pools() with is_host_addition=False (cluster.py:3376). The submitted closure keeps the first flag and passes it to signal_connection_failure() at cluster.py:3254 / 3262. On reconnect, _HostReconnectionHandler dispatches based on that flag (pool.py:360-363), so a failed new-host add can reconnect through on_up() instead of on_add().

  2. cassandra/cluster.py:3321

The reused future also keeps the first computed distance from cluster.py:3245, used to construct HostConnection at cluster.py:3251. If distance changes while creation is pending, the later call cannot replace the old creation. A later update_created_pools() can update pool.host_distance (cluster.py:3377-3382), but it cannot fix construction done with the old distance, e.g. a REMOTE pool created with connect_to_remote_hosts=False has no connections.

  3. cassandra/cluster.py:3342

remove_pool() clears _pending_pool_futures, but the old running task can still publish its pool at cluster.py:3301. If a fresh creation starts after removal, the stale task can finish first, install its pool, and make the fresh task discard its pool via the live-pool check at cluster.py:3288. Need a generation/identity guard before publishing, not just cleanup of the pending map.

When two or more nodes are bootstrapped concurrently the Python driver
can execute the same CQL statement twice, causing spurious "already
exists" errors in the caller.  This has been observed as flaky test
failures across the ScyllaDB test suite for the past two years, and
worked around by using idempotent DDL forms (IF NOT EXISTS / IF EXISTS)
in dozens of tests.

Root cause
----------
The race unfolds as follows:

1. Two on_add notifications arrive at roughly the same time, one for
   each new node.  Each one calls session.add_or_renew_pool(), which
   submits run_add_or_renew_pool() to the thread pool and returns.
   Both submissions are in-flight concurrently.

2. The first add_or_renew_pool() finishes and calls _finalize_add(),
   which notifies load-balancing policies and then calls
   session.update_created_pools() for every live session.

3. update_created_pools() iterates all known hosts.  For the second
   host, whose run_add_or_renew_pool() has not yet completed, it sees
   self._pools.get(host) == None (or a shut-down pool) and therefore
   submits *another* run_add_or_renew_pool() for that host.

4. Now two tasks are connecting to the same host.  The first one
   finishes and installs pool-A in self._pools, then runs a statement
   (e.g. CREATE ROLE) that is in-flight on pool-A.

5. The second task finishes, reads the stale `previous = self._pools.get(host)`
   value (captured *before* the lock was taken — another bug), installs
   pool-B and then shuts down pool-A.  The in-flight CREATE ROLE request
   is orphaned; the driver retries it on pool-B.  The server executes it
   a second time and returns "Role ... already exists".

Fix
---
Three coordinated changes to cassandra/cluster.py:

* Session.__init__: add self._pending_pool_futures = {}, a dict mapping
  host -> entry (with future, creation_id, distance,
  is_host_addition_cell) for any in-flight pool creation, guarded by
  _lock.

* add_or_renew_pool: before submitting run_add_or_renew_pool(), check
  _pending_pool_futures under _lock.  If an in-flight future already
  exists for the host with the same distance, return it immediately —
  this is the primary fix that prevents the duplicate submission from
  update_created_pools.  If is_host_addition=True on the new call but
  the existing entry has False, upgrade it in-place via a shared
  is_host_addition_cell so the closure passes the correct flag to
  signal_connection_failure() and _HostReconnectionHandler dispatches
  through on_add() instead of on_up() on reconnect.  If the distance
  changed, submit a fresh task (the old HostConnection was constructed
  with stale distance, e.g. no connections for REMOTE with
  connect_to_remote_hosts=False).
  Each submission gets a unique creation_id token.  The closure checks
  _pending_pool_futures[host].creation_id before installing its pool:
  if remove_pool() ran and a fresher creation was submitted while this
  task was connecting, the stale task discards its pool rather than
  overwriting the fresher one.
  Additionally, move the `previous = self._pools.get(host)` read inside
  the lock so the live-pool check is atomic with the installation of the
  new pool: if a concurrent creation has already installed a live pool
  by the time we finish connecting, discard our new pool instead of
  replacing the live one (defense-in-depth).
  Cleanup of _pending_pool_futures is handled by a done_callback
  registered on the future.  The callback acquires _lock and only clears
  the entry if it still holds the same creation_id it was registered
  on, so a concurrent remove_pool followed by a new add_or_renew_pool
  is not affected.  The entry is stored before calling submit() so that
  the closure always finds a valid creation_id in the dict, even when
  the executor runs the task synchronously.

* remove_pool: clear _pending_pool_futures[host] under _lock so that
  if a host is removed and immediately re-added, add_or_renew_pool
  submits a fresh creation rather than reusing a stale done future.
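Putting the three mechanisms together, a condensed self-contained sketch (hypothetical names; the real code inside Session.add_or_renew_pool differs in detail):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class EntryBasedRegistry:
    """Toy model of the entry-based coalescing described above."""

    def __init__(self):
        self._lock = threading.RLock()
        self._pools = {}
        self._pending = {}   # host -> {future, creation_id, distance, is_host_addition_cell}
        self._executor = ThreadPoolExecutor(max_workers=4)

    def add_or_renew_pool(self, host, is_host_addition, distance, connect):
        with self._lock:
            entry = self._pending.get(host)
            if entry and not entry['future'].done() and entry['distance'] == distance:
                if is_host_addition:
                    # Upgrade in place; the in-flight closure reads the cell at
                    # failure time, so reconnection goes through on_add().
                    entry['is_host_addition_cell'][0] = True
                return entry['future']        # coalesce with the in-flight creation

            creation_id = object()            # unique token for this submission
            cell = [is_host_addition]

            def run():
                # (On a connection failure the real code would report it with
                # is_host_addition=cell[0], read at call time.)
                new_pool = connect(host, distance)   # slow part, outside the lock
                with self._lock:
                    current = self._pending.get(host)
                    stale = current is None or current['creation_id'] is not creation_id
                    live = self._pools.get(host)
                    if stale or (live is not None and not live.is_shutdown):
                        new_pool.shutdown()          # lost the race: discard ours
                        return False
                    self._pools[host] = new_pool
                    return True

            # Store the entry before submit() so even a synchronous executor
            # finds a valid creation_id when the closure runs.
            entry = {'future': None, 'creation_id': creation_id,
                     'distance': distance, 'is_host_addition_cell': cell}
            self._pending[host] = entry
            future = self._executor.submit(run)
            entry['future'] = future

            def clear(_done, host=host, creation_id=creation_id):
                with self._lock:
                    cur = self._pending.get(host)
                    if cur is not None and cur['creation_id'] is creation_id:
                        del self._pending[host]

            future.add_done_callback(clear)
            return future

    def remove_pool(self, host):
        with self._lock:
            self._pending.pop(host, None)    # a re-added host gets a fresh creation
            return self._pools.pop(host, None)
```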

Tests
-----
Five new unit tests are added in PoolRenewalRaceTest
(tests/unit/test_cluster.py).  They exercise the new code paths without
requiring a real cluster connection by constructing a minimal Session
via object.__new__ and mocking the executor and profile manager.
The tests use the new dict-based entry format for _pending_pool_futures:

* test_add_or_renew_pool_reuses_inflight_future: places a pending
  entry in _pending_pool_futures and verifies that add_or_renew_pool
  returns the existing future without submitting a new task.

* test_add_or_renew_pool_discards_duplicate_when_live_pool_exists:
  exercises the real production code path by patching HostConnection
  to a lightweight stub and using a synchronous executor shim that
  runs the submitted callable inline.  Pre-installs a live pool for
  the host, then calls add_or_renew_pool() and asserts that the live
  pool is not replaced and the newly connected stub pool is shut down.

* test_remove_pool_clears_pending_future: verifies that remove_pool
  clears _pending_pool_futures so the next add_or_renew_pool call
  submits a fresh task.

* test_done_callback_clears_pending_future: verifies that the
  done_callback fires and removes the entry from _pending_pool_futures
  once the future completes.

* test_done_callback_does_not_clear_newer_future: verifies the
  creation_id guard — an old future's callback does not evict a newer
  entry installed in its place after a remove_pool + add_or_renew_pool.

Fixes: scylladb#317

nyh commented May 4, 2026

(My AI) fixed all three issues identified by @dkropachev ('s AI).

Thanks for the detailed review! You identified three real races. Here's what the new version does to fix each one:

  1. is_host_addition flag race

Instead of a plain bool, the closure now captures is_host_addition_cell = [is_host_addition] — a mutable one-element list. If a second call arrives with is_host_addition=True while the first future is still in-flight, the coalescing path upgrades the cell in-place (entry['is_host_addition_cell'][0] = True) before returning the existing future. The closure then reads is_host_addition_cell[0] at call time, so it always sees the stricter on_add() reconnect path when needed.

  2. Distance coalescing race

The entry dict now stores the distance computed at submission time. When a second add_or_renew_pool() call arrives and finds an in-flight entry, it only coalesces if entry['distance'] == distance. If the distance has changed, the code falls through and submits a fresh task with the new distance. The creation_id guard (see below) ensures the fresh task's pool wins.

  3. Stale task publishing stale pool

Each submission generates a unique creation_id = object(). Before calling submit(), the entry (including creation_id) is stored in _pending_pool_futures under _lock. The closure, still under _lock, checks entry['creation_id'] is creation_id before installing the pool — if it doesn't match (because remove_pool() cleared the entry, or a fresh submission replaced it), the closure discards its pool. remove_pool() pops the entry under _lock, so any task that started before the removal and checks afterward will see the mismatch and discard. The done-callback that clears the entry also verifies the creation_id before removing, so a concurrent fresh submission isn't cleared.

The entry is stored before submit() returns so the guard works even when the executor runs the task synchronously (e.g. in tests with a direct-call executor).
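For point 1, a tiny runnable illustration of the mutable-cell upgrade (hypothetical names; the real closure also creates the pool and reports connection failures):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def demo():
    is_host_addition_cell = [False]          # shared, mutable flag

    def run_add_or_renew_pool():
        time.sleep(0.1)                      # stand-in for a slow connection attempt
        # Read at call time: if a later caller upgraded the cell while this task
        # was in flight, the stricter on_add() reconnect path would be chosen.
        return "on_add" if is_host_addition_cell[0] else "on_up"

    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(run_add_or_renew_pool)
        is_host_addition_cell[0] = True      # a second caller upgrades in place
        print(future.result())               # prints "on_add"

if __name__ == "__main__":
    demo()
```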


Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.



Comment thread cassandra/cluster.py

    def run_add_or_renew_pool():
        try:
            new_pool = HostConnection(host, distance, self)
@dkropachev (Collaborator)

@nyh, I have found 3 more correctness issues, and it will go on like this forever; I have already tried to solve it this way.
Unfortunately the driver code is too interconnected to allow solving problems like this in this way.

I am working on another solution that will work better, at #836 - it basically does event queuing and fencing, and it will 100% work better.


nyh commented May 4, 2026

> I am working on another solution that will work better, at #836

Funny how we didn't touch this issue #317 for a couple of years (since test/cluster became more aggressive in creating nodes in parallel), and now in two days we have three different pull requests to fix it ;-)

Successfully merging this pull request may close these issues.

Connection pool renewal after concurrent node bootstraps causes double statement execution (#317)
