This PR fixes UNPREPARED handling in the Python driver execution path… by calebxyz · Pull Request #725 · scylladb/python-driver

calebxyz · 2026-03-04T09:25:18Z

Problem

When the server returned UNPREPARED (PreparedQueryNotFound) during an EXECUTE, the driver could fail to recover in some valid cases (cache miss, race, coordinator restart, id mismatch edge cases or crushes), even when the in-flight request still had enough PreparedStatement context to recover.

Expected behavior

If the driver still has the original PreparedStatement context, it should:

issue PREPARE,
then retry the original EXECUTE, regardless of why the coordinator no longer recognizes the prepared id.

What this PR changes

In ResponseFuture PreparedQueryNotFound handling:

Cache the in-flight self.prepared_statement first (when present).
Attempt lookup by the returned UNPREPARED id from cluster._prepared_statements.
Reprepare using the resolved statement.
If returned-id lookup fails:
- fallback to in-flight context if available,
- otherwise fail as unknown prepared statement.

This preserves correctness while improving recovery in real-world coordinator invalidation/race/restart scenarios.

Tests

Added/updated unit coverage in tests/unit/test_response_future.py:

validates reprepare path for PreparedQueryNotFound,
validates fallback to in-flight context,
validates preference for statement resolved by returned UNPREPARED id when available.

Fixes: scylladb/scylladb#27657

Pre-review checklist

I have split my patch into logically separate commits.
All commit messages clearly explain what they change and why.
I added relevant tests for new features and bug fixes.
All commits compile, pass static checks and pass test.
PR description sums up the changes and reasons why they should be introduced.
I have provided docstrings for the public items that I want to introduce.
I have adjusted the documentation in ./docs/source/.
I added appropriate Fixes: annotations to PR description.

Copilot

Pull request overview

This PR improves the Python driver’s recovery behavior when an EXECUTE receives an UNPREPARED (PreparedQueryNotFound) error, allowing the driver to re-PREPARE and retry using either the returned query id (preferred) or the in-flight prepared statement context (fallback).

Changes:

Update ResponseFuture._set_result() to cache in-flight prepared_statement, resolve by returned UNPREPARED id when possible, and fall back to in-flight context when not.
Adjust logging for unknown-prepared and mismatch/fallback scenarios.
Add unit tests covering fallback to local context and preference for returned-id cache resolution.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
`cassandra/cluster.py`	Updates UNPREPARED handling in `ResponseFuture._set_result()` to improve reprepare/retry correctness across cache-miss and id-mismatch scenarios.
`tests/unit/test_response_future.py`	Adds unit coverage for the new UNPREPARED resolution and fallback behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Copilot · 2026-03-04T10:24:57Z

cassandra/cluster.py

-                    else:
-                        query_id = response.info
+                        # Cache local in-flight context first so lookup by either id can succeed.
+                        self.session.cluster._prepared_statements[self.prepared_statement.query_id] = self.prepared_statement


This adds to cluster._prepared_statements via direct dict assignment. The Cluster has add_prepared() which wraps updates in _prepared_statement_lock (and _prepared_statements is a WeakValueDictionary, which is not thread-safe). To avoid races with concurrent prepares/updates, use self.session.cluster.add_prepared(...) (or take _prepared_statement_lock) instead of writing the dict directly.

Suggested change

self.session.cluster._prepared_statements[self.prepared_statement.query_id] = self.prepared_statement

self.session.cluster.add_prepared(self.prepared_statement)

well almost right the api is def add_prepared(self, query_id, prepared_statement):

Copilot · 2026-03-04T10:24:58Z

cassandra/cluster.py

                        if not self.prepared_statement:
-                            log.error("Tried to execute unknown prepared statement: id=%s",
-                                      query_id.encode('hex'))
+                            log.error(f"Tried to execute unknown prepared statement: id={hexlify(query_id)}")


The new log.error uses an f-string with hexlify(query_id), which eagerly formats even if the log level is disabled and also renders the bytes representation (e.g., b'616263') in the message. Prefer logger parameterized formatting (e.g., log.error("... %s", ...)) and consider decoding/formatting the hex value to a plain string for readability/consistency with other messages (e.g., .decode('ascii') or query_id.hex()).

Suggested change

log.error(f"Tried to execute unknown prepared statement: id={hexlify(query_id)}")

log.error(

"Tried to execute unknown prepared statement: id=%s",

hexlify(query_id).decode('ascii'),

)

cassandra/cluster.py

… for prepared statements. ### Problem When the server returned UNPREPARED (PreparedQueryNotFound) during an EXECUTE, the driver could fail to recover in some valid cases (cache miss, race, coordinator restart, id mismatch edge cases), even when the in-flight request still had enough PreparedStatement context to recover. ### Expected behavior If the driver still has the original PreparedStatement context, it should: 1. issue PREPARE, 2. then retry the original EXECUTE, regardless of why the coordinator no longer recognizes the prepared id. ### What this PR changes In ResponseFuture PreparedQueryNotFound handling: 1. Cache the in-flight self.prepared_statement first (when present). 2. Attempt lookup by the returned UNPREPARED id from cluster._prepared_statements. 3. Reprepare using the resolved statement. 4. If returned-id lookup fails: - fallback to in-flight context if available, - otherwise fail as unknown prepared statement. This preserves correctness while improving recovery in real-world coordinator invalidation/race/restart scenarios. ### Tests Added/updated unit coverage in tests/unit/test_response_future.py: - validates reprepare path for PreparedQueryNotFound, - validates fallback to in-flight context, - validates preference for statement resolved by returned UNPREPARED id when available. Fixes: scylladb/scylladb#27657

Lorak-mmk · 2026-03-04T12:33:34Z

tests/unit/test_response_future.py

+        rf.prepared_statement.query_string = "SELECT * FROM foobar"
+        rf.prepared_statement.keyspace = "FooKeyspace"
+
+        # Different query id in UNPREPARED response should not prevent reprepare when local context exists.


Why not? Prepared id suddenly changing seems like a serious protocol violation. What are the scenarios when it can happen?

because thats what i understand , whats expected , there can be a crush a race with the server , and then the prepared statement will be evicted and we need to rerun it.
it has lower performance but returning an error is incorrect.

I'm not questioning the need for automatic repreparation. It is a well-known mechanism, and all drivers do this, including Python driver before your changes.
What this review comment is about is about prepared statements id. Those IDs are constant - dependent on query string, and connection keyspace. All existing drivers depend on that.

Statement getting evicted from the cache is a known and handled scenario, but the id doesn't change in this scenario, even across restarts.

Lorak-mmk · 2026-03-04T12:39:14Z

cassandra/cluster.py

@@ -4745,11 +4743,13 @@ def _set_result(self, host, connection, pool, response):
                                       (current_keyspace, prepared_keyspace)))
                        return

-                    log.debug("Re-preparing unrecognized prepared statement against host %s: %s",
-                              host, prepared_statement.query_string)
-                    prepared_keyspace = prepared_statement.keyspace \
+                    log.debug(
+                        "Re-preparing unrecognized prepared statement against host %s: %s",
+                        host, self.prepared_statement.query_string
+                    )
+                    prepared_keyspace = self.prepared_statement.keyspace \
                        if ProtocolVersion.uses_keyspace_flag(self.session.cluster.protocol_version) else None
-                    prepare_message = PrepareMessage(query=prepared_statement.query_string,
+                    prepare_message = PrepareMessage(query=self.prepared_statement.query_string,


I'm looking at those changes and I don't see what you are trying to achieve.
In PR description you wrote:

Expected behavior

If the driver still has the original PreparedStatement context, it should:

issue PREPARE,

then retry the original EXECUTE, regardless of why the coordinator no longer recognizes the prepared id.

But this is what the driver already does?

Cache the in-flight self.prepared_statement first (when present).

Why does it need to be done first?

In general the existing logic looks sound to me:

In first step it determines id of preprared statement, trying both self.prepared_statement and response.info, and also validating protocol assumptions.

Then it resolves prepared statement object, also trying both cluster-scoped cache (self.session.cluster._prepared_statements) and self.prepared_statement. It also adds self.prepared_statement to cache in that case.

So, what problem is this PR trying to solve? In what scenario does existing logic fail?

this is one of this scenarios scylladb/scylladb#27657
the driver should retry but it doesn't

Linked issue describes the need for automatic reprepare. Driver already had mechanism for automatic repreparation - your PR doesn't add it, just modifies its logic. Which is why I'm asking what EXACTLY was wrong with the previous logic, how exactly it not working.

Ideally we'd have a test that fails before the fix and passes after it. May not be easy to reproduce perhaps.

Ideally we'd have a test that fails before the fix and passes after it.

Sure, but for now I'd be happy with a textual reasoning, so that I can understand the changes.

@Lorak-mmk , here is the part of the mentioned discussion: scylladb/scylladb#27657 (comment)

If I understand correctly there was an experiment when Scylla returned UNPREPARED = 0x2500 sc to driver and it didnt repreapre the statement. So if driver already has mechanism for automatic repreparation it didnt work during the experiment - scylladb/scylladb#27657 (comment)

Yes, I got that. What I don't understand from this PR description is why exactly it didn't work, and how those changes address that.

Before this PR the logic works roughly like this:

If self.prepared_statement exists, the code forces the query_id to be the local one (self.prepared_statement.query_id).
It then asserts that this id matches the one returned by the server (response.info).
After that it tries to resolve the statement:
It looks it up in the cache using
cluster._prepared_statements[query_id].
If that lookup fails (KeyError), it falls back to the local prepared statement (self.prepared_statement) and stores it in the cache under that query_id.

now this is might be problematic because:

It ignores the server’s returned ID when a local statement exists
If we already have a local prepared statement, the code overwrites the ID with the local one before doing the lookup.
That means the ID returned by the server is effectively ignored in this path, even though it might be the correct one to use.

The assertion can break recovery
The code asserts that the local ID and the server-returned ID must match.
However, in real situations (for example cache invalidation, server restart or crushes), the server might legitimately return a different ID.
When that happens, the assertion triggers and kills the flow before the driver has a chance to recover or re-prepare the statement gracefully.

Cache updates may reinforce the wrong mapping
If the cache lookup fails, the code stores the local prepared statement back into the cache under the current query_id.
But in this branch the query_id is the local ID, not necessarily the one the server returned.
Because of this, the logic never explicitly tries the clean path of:
First resolve using the ID the server returned, and only fall back to the local statement if needed.

In some LWT tests we actually see this behavior, instead of re-sending its just returns does nothing and waits forever

Yes, I got that. What I don't understand from this PR description is why exactly it didn't work, and how those changes address that.

why exactly it didn't work

This was actually a question for the drivers team :)

This was actually a question for the drivers team :)

sadly almost 2 months ago :(

mykaul · 2026-03-04T15:19:27Z

Dear @copilot (1) explain what is wrong in the current implementation and how this code change fixes the issue (2) can you come up with a test that fails before the change and passes after?

Copilot · 2026-03-04T15:19:36Z

@mykaul I've opened a new pull request, #726, to work on those changes. Once the pull request is ready, I'll request review from you.

mykaul · 2026-03-04T15:23:17Z

@mykaul I've opened a new pull request, #726, to work on those changes. Once the pull request is ready, I'll request review from you.

Oh dear. Oh well. Will take a look.

calebxyz · 2026-03-04T15:24:04Z

@mykaul I've opened a new pull request, #726, to work on those changes. Once the pull request is ready, I'll request review from you.

Oh dear. Oh well. Will take a look.

@mykaul was my explanation bad ? :) #725 (comment)

mykaul · 2026-03-04T16:01:23Z

@mykaul I've opened a new pull request, #726, to work on those changes. Once the pull request is ready, I'll request review from you.

Oh dear. Oh well. Will take a look.

@mykaul was my explanation bad ? :) #725 (comment)

Honestly - #726 (comment) is not bad as an explanation.

(What I'm still missing is how we can get to a situation of a different ID. That's not very clear to me, but I suspect it's a server-side issue?)

calebxyz · 2026-03-04T16:13:18Z

server-side issue

yes exactly

calebxyz requested review from dkropachev and timtimb0t March 4, 2026 09:25

calebxyz mentioned this pull request Mar 4, 2026

test/pylib: centralize timeout scaling and propagate build_mode in LWT helpers scylladb/scylladb#28840

Open

calebxyz requested a review from Copilot March 4, 2026 10:19

Copilot started reviewing on behalf of calebxyz March 4, 2026 10:20 View session

Copilot AI reviewed Mar 4, 2026

View reviewed changes

Lorak-mmk self-requested a review March 4, 2026 10:36

calebxyz force-pushed the scylladb-27657-v2 branch from 0f3737f to 8ed0715 Compare March 4, 2026 10:36

github-code-quality bot found potential problems Mar 4, 2026

View reviewed changes

cassandra/cluster.py Fixed Show fixed Hide fixed

calebxyz force-pushed the scylladb-27657-v2 branch 2 times, most recently from c7706e5 to 3d3dc24 Compare March 4, 2026 10:45

calebxyz force-pushed the scylladb-27657-v2 branch from 3d3dc24 to 9815e0f Compare March 4, 2026 10:48

Lorak-mmk reviewed Mar 4, 2026

View reviewed changes

Copilot AI mentioned this pull request Mar 4, 2026

Fix UNPREPARED handling in ResponseFuture to allow recovery on query ID mismatch #726

Draft

8 tasks

	self.session.cluster._prepared_statements[self.prepared_statement.query_id] = self.prepared_statement
	self.session.cluster.add_prepared(self.prepared_statement)

Conversation

calebxyz commented Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Expected behavior

What this PR changes

Tests

Pre-review checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Expected behavior

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Lorak-mmk Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Lorak-mmk Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mykaul commented Mar 4, 2026

Uh oh!

Copilot AI commented Mar 4, 2026

Uh oh!

mykaul commented Mar 4, 2026

Uh oh!

calebxyz commented Mar 4, 2026

Uh oh!

mykaul commented Mar 4, 2026

Uh oh!

calebxyz commented Mar 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

calebxyz commented Mar 4, 2026 •

edited

Loading

Lorak-mmk Mar 4, 2026 •

edited

Loading

Lorak-mmk Mar 4, 2026 •

edited

Loading