This PR fixes UNPREPARED handling in the Python driver execution path…#725
This PR fixes UNPREPARED handling in the Python driver execution path…#725
Conversation
There was a problem hiding this comment.
Pull request overview
This PR improves the Python driver’s recovery behavior when an EXECUTE receives an UNPREPARED (PreparedQueryNotFound) error, allowing the driver to re-PREPARE and retry using either the returned query id (preferred) or the in-flight prepared statement context (fallback).
Changes:
- Update
ResponseFuture._set_result()to cache in-flightprepared_statement, resolve by returned UNPREPARED id when possible, and fall back to in-flight context when not. - Adjust logging for unknown-prepared and mismatch/fallback scenarios.
- Add unit tests covering fallback to local context and preference for returned-id cache resolution.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
cassandra/cluster.py |
Updates UNPREPARED handling in ResponseFuture._set_result() to improve reprepare/retry correctness across cache-miss and id-mismatch scenarios. |
tests/unit/test_response_future.py |
Adds unit coverage for the new UNPREPARED resolution and fallback behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
cassandra/cluster.py
Outdated
| else: | ||
| query_id = response.info | ||
| # Cache local in-flight context first so lookup by either id can succeed. | ||
| self.session.cluster._prepared_statements[self.prepared_statement.query_id] = self.prepared_statement |
There was a problem hiding this comment.
This adds to cluster._prepared_statements via direct dict assignment. The Cluster has add_prepared() which wraps updates in _prepared_statement_lock (and _prepared_statements is a WeakValueDictionary, which is not thread-safe). To avoid races with concurrent prepares/updates, use self.session.cluster.add_prepared(...) (or take _prepared_statement_lock) instead of writing the dict directly.
| self.session.cluster._prepared_statements[self.prepared_statement.query_id] = self.prepared_statement | |
| self.session.cluster.add_prepared(self.prepared_statement) |
There was a problem hiding this comment.
well almost right the api is def add_prepared(self, query_id, prepared_statement):
cassandra/cluster.py
Outdated
| if not self.prepared_statement: | ||
| log.error("Tried to execute unknown prepared statement: id=%s", | ||
| query_id.encode('hex')) | ||
| log.error(f"Tried to execute unknown prepared statement: id={hexlify(query_id)}") |
There was a problem hiding this comment.
The new log.error uses an f-string with hexlify(query_id), which eagerly formats even if the log level is disabled and also renders the bytes representation (e.g., b'616263') in the message. Prefer logger parameterized formatting (e.g., log.error("... %s", ...)) and consider decoding/formatting the hex value to a plain string for readability/consistency with other messages (e.g., .decode('ascii') or query_id.hex()).
| log.error(f"Tried to execute unknown prepared statement: id={hexlify(query_id)}") | |
| log.error( | |
| "Tried to execute unknown prepared statement: id=%s", | |
| hexlify(query_id).decode('ascii'), | |
| ) |
0f3737f to
8ed0715
Compare
c7706e5 to
3d3dc24
Compare
… for prepared statements.
### Problem
When the server returned UNPREPARED (PreparedQueryNotFound) during an EXECUTE, the driver could fail to recover in some valid cases (cache miss, race, coordinator restart, id mismatch edge cases), even when the in-flight request still had enough PreparedStatement context to recover.
### Expected behavior
If the driver still has the original PreparedStatement context, it should:
1. issue PREPARE,
2. then retry the original EXECUTE,
regardless of why the coordinator no longer recognizes the prepared id.
### What this PR changes
In ResponseFuture PreparedQueryNotFound handling:
1. Cache the in-flight self.prepared_statement first (when present).
2. Attempt lookup by the returned UNPREPARED id from cluster._prepared_statements.
3. Reprepare using the resolved statement.
4. If returned-id lookup fails:
- fallback to in-flight context if available,
- otherwise fail as unknown prepared statement.
This preserves correctness while improving recovery in real-world coordinator invalidation/race/restart scenarios.
### Tests
Added/updated unit coverage in tests/unit/test_response_future.py:
- validates reprepare path for PreparedQueryNotFound,
- validates fallback to in-flight context,
- validates preference for statement resolved by returned UNPREPARED id when available.
Fixes: scylladb/scylladb#27657
3d3dc24 to
9815e0f
Compare
| rf.prepared_statement.query_string = "SELECT * FROM foobar" | ||
| rf.prepared_statement.keyspace = "FooKeyspace" | ||
|
|
||
| # Different query id in UNPREPARED response should not prevent reprepare when local context exists. |
There was a problem hiding this comment.
Why not? Prepared id suddenly changing seems like a serious protocol violation. What are the scenarios when it can happen?
There was a problem hiding this comment.
because thats what i understand , whats expected , there can be a crush a race with the server , and then the prepared statement will be evicted and we need to rerun it.
it has lower performance but returning an error is incorrect.
There was a problem hiding this comment.
I'm not questioning the need for automatic repreparation. It is a well-known mechanism, and all drivers do this, including Python driver before your changes.
What this review comment is about is about prepared statements id. Those IDs are constant - dependent on query string, and connection keyspace. All existing drivers depend on that.
Statement getting evicted from the cache is a known and handled scenario, but the id doesn't change in this scenario, even across restarts.
| @@ -4745,11 +4743,13 @@ def _set_result(self, host, connection, pool, response): | |||
| (current_keyspace, prepared_keyspace))) | |||
| return | |||
|
|
|||
| log.debug("Re-preparing unrecognized prepared statement against host %s: %s", | |||
| host, prepared_statement.query_string) | |||
| prepared_keyspace = prepared_statement.keyspace \ | |||
| log.debug( | |||
| "Re-preparing unrecognized prepared statement against host %s: %s", | |||
| host, self.prepared_statement.query_string | |||
| ) | |||
| prepared_keyspace = self.prepared_statement.keyspace \ | |||
| if ProtocolVersion.uses_keyspace_flag(self.session.cluster.protocol_version) else None | |||
| prepare_message = PrepareMessage(query=prepared_statement.query_string, | |||
| prepare_message = PrepareMessage(query=self.prepared_statement.query_string, | |||
There was a problem hiding this comment.
I'm looking at those changes and I don't see what you are trying to achieve.
In PR description you wrote:
Expected behavior
If the driver still has the original PreparedStatement context, it should:
- issue PREPARE,
- then retry the original EXECUTE, regardless of why the coordinator no longer recognizes the prepared id.
But this is what the driver already does?
- Cache the in-flight self.prepared_statement first (when present).
Why does it need to be done first?
In general the existing logic looks sound to me:
- In first step it determines id of preprared statement, trying both
self.prepared_statementandresponse.info, and also validating protocol assumptions. - Then it resolves prepared statement object, also trying both cluster-scoped cache (
self.session.cluster._prepared_statements) andself.prepared_statement. It also addsself.prepared_statementto cache in that case.
So, what problem is this PR trying to solve? In what scenario does existing logic fail?
There was a problem hiding this comment.
this is one of this scenarios scylladb/scylladb#27657
the driver should retry but it doesn't
There was a problem hiding this comment.
Linked issue describes the need for automatic reprepare. Driver already had mechanism for automatic repreparation - your PR doesn't add it, just modifies its logic. Which is why I'm asking what EXACTLY was wrong with the previous logic, how exactly it not working.
There was a problem hiding this comment.
Ideally we'd have a test that fails before the fix and passes after it. May not be easy to reproduce perhaps.
There was a problem hiding this comment.
Ideally we'd have a test that fails before the fix and passes after it.
Sure, but for now I'd be happy with a textual reasoning, so that I can understand the changes.
There was a problem hiding this comment.
@Lorak-mmk , here is the part of the mentioned discussion: scylladb/scylladb#27657 (comment)
If I understand correctly there was an experiment when Scylla returned UNPREPARED = 0x2500 sc to driver and it didnt repreapre the statement. So if driver already has mechanism for automatic repreparation it didnt work during the experiment - scylladb/scylladb#27657 (comment)
There was a problem hiding this comment.
Yes, I got that. What I don't understand from this PR description is why exactly it didn't work, and how those changes address that.
There was a problem hiding this comment.
Before this PR the logic works roughly like this:
If self.prepared_statement exists, the code forces the query_id to be the local one (self.prepared_statement.query_id).
It then asserts that this id matches the one returned by the server (response.info).
After that it tries to resolve the statement:
It looks it up in the cache using
cluster._prepared_statements[query_id].
If that lookup fails (KeyError), it falls back to the local prepared statement (self.prepared_statement) and stores it in the cache under that query_id.
now this is might be problematic because:
-
It ignores the server’s returned ID when a local statement exists
If we already have a local prepared statement, the code overwrites the ID with the local one before doing the lookup.
That means the ID returned by the server is effectively ignored in this path, even though it might be the correct one to use. -
The assertion can break recovery
The code asserts that the local ID and the server-returned ID must match.
However, in real situations (for example cache invalidation, server restart or crushes), the server might legitimately return a different ID.
When that happens, the assertion triggers and kills the flow before the driver has a chance to recover or re-prepare the statement gracefully. -
Cache updates may reinforce the wrong mapping
If the cache lookup fails, the code stores the local prepared statement back into the cache under the current query_id.
But in this branch the query_id is the local ID, not necessarily the one the server returned.
Because of this, the logic never explicitly tries the clean path of:
First resolve using the ID the server returned, and only fall back to the local statement if needed.
In some LWT tests we actually see this behavior, instead of re-sending its just returns does nothing and waits forever
There was a problem hiding this comment.
Yes, I got that. What I don't understand from this PR description is why exactly it didn't work, and how those changes address that.
why exactly it didn't work
This was actually a question for the drivers team :)
There was a problem hiding this comment.
This was actually a question for the drivers team :)
sadly almost 2 months ago :(
|
Dear @copilot (1) explain what is wrong in the current implementation and how this code change fixes the issue (2) can you come up with a test that fails before the change and passes after? |
@mykaul was my explanation bad ? :) #725 (comment) |
Honestly - #726 (comment) is not bad as an explanation. (What I'm still missing is how we can get to a situation of a different ID. That's not very clear to me, but I suspect it's a server-side issue?) |
yes exactly |
Problem
When the server returned UNPREPARED (PreparedQueryNotFound) during an EXECUTE, the driver could fail to recover in some valid cases (cache miss, race, coordinator restart, id mismatch edge cases or crushes), even when the in-flight request still had enough PreparedStatement context to recover.
Expected behavior
If the driver still has the original PreparedStatement context, it should:
What this PR changes
In ResponseFuture PreparedQueryNotFound handling:
This preserves correctness while improving recovery in real-world coordinator invalidation/race/restart scenarios.
Tests
Added/updated unit coverage in tests/unit/test_response_future.py:
Fixes: scylladb/scylladb#27657
Pre-review checklist
./docs/source/.Fixes:annotations to PR description.