
Add LWTRetryPolicy: retry CAS timeouts on same host with backoff#783

Draft
mykaul wants to merge 1 commit into scylladb:master from mykaul:feature/lwt-retry-policy

Conversation


@mykaul mykaul commented Apr 1, 2026

Summary

LWT queries use Paxos consensus where the first replica (Paxos coordinator/leader) drives the consensus rounds. When a CAS write times out, retrying on a different host causes Paxos contention — the new coordinator must compete with the original, potentially causing cascading timeouts across the cluster.

Currently, no built-in retry policy retries CAS write timeouts at all — they are all RETHROWN immediately:

  • RetryPolicy.on_write_timeout: CAS → RETHROW
  • ExponentialBackoffRetryPolicy.on_write_timeout: CAS → RETHROW
  • DowngradingConsistencyRetryPolicy.on_write_timeout: CAS → RETHROW

This PR adds LWTRetryPolicy, a new retry policy that extends ExponentialBackoffRetryPolicy with LWT-aware behavior:

| Scenario | Decision | Rationale |
| --- | --- | --- |
| CAS write timeout | RETRY same host + backoff | Stay on Paxos coordinator to avoid contention |
| Serial read timeout | RETRY same host + backoff | CAS read at serial CL, same coordinator logic |
| Serial unavailable | RETRY next host + backoff | Paxos quorum lost on this node, try another |
| Non-CAS operations | Delegate to parent | Standard ExponentialBackoffRetryPolicy behavior |
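The decision logic in the table above can be sketched as a small standalone function. This is an illustration only, not the PR's code: the constant names mirror the driver's RetryPolicy decisions, the function name and tuple shape are assumptions, and the real LWTRetryPolicy lives in cassandra/policies.py.

```python
import time

# Decision constants mirroring cassandra.policies.RetryPolicy
# (RETRY, RETRY_NEXT_HOST, RETHROW). Everything below is an
# illustrative sketch, not the PR's actual implementation.
RETRY, RETRY_NEXT_HOST, RETHROW = 0, 1, 2
WRITE_TYPE_CAS = "CAS"

def lwt_on_write_timeout(write_type, consistency, retry_num,
                         max_num_retries=3, base_delay=0.0):
    """Handle a write timeout the LWT-aware way: stay on the same
    host (the Paxos coordinator) and back off exponentially."""
    if write_type != WRITE_TYPE_CAS:
        return (RETHROW, None)   # non-CAS: the parent policy decides
    if retry_num >= max_num_retries:
        return (RETHROW, None)   # retries exhausted
    time.sleep(base_delay * (2 ** retry_num))  # exponential backoff
    # RETRY (not RETRY_NEXT_HOST): keep the original Paxos coordinator
    # and preserve the original consistency level.
    return (RETRY, consistency)
```

A serial-read-timeout handler would follow the same shape, while a serial-unavailable handler would return RETRY_NEXT_HOST instead, since the Paxos quorum was lost on the current node.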

This is modeled after gocql's LWTRetryPolicy interface, which retries LWT queries on the same host to avoid Paxos contention. The key comment from gocql (line 188):

"Retrying on a different host is fine for normal (non-LWT) queries, but in case of LWTs it will cause Paxos contention and possibly even timeouts if other clients send statements touching the same partition to the same time."

Usage

from cassandra.cluster import Cluster
from cassandra.policies import LWTRetryPolicy

# Use as the default retry policy
cluster = Cluster(default_retry_policy=LWTRetryPolicy(max_num_retries=3))

# Or assign to a specific statement
statement.retry_policy = LWTRetryPolicy(max_num_retries=5)

Changes

  • cassandra/policies.py: Added LWTRetryPolicy class (extends ExponentialBackoffRetryPolicy)
  • tests/unit/test_policies.py: Added LWTRetryPolicyTest with 21 tests

Tests

21 new tests covering:

  • CAS write timeout retries on same host with backoff
  • Backoff delay increases with retry attempts
  • Max retries exceeded → RETHROW
  • Consistency level preserved across retries
  • Non-CAS writes delegate to parent (SIMPLE→RETHROW, BATCH_LOG→RETRY, COUNTER→RETHROW)
  • Serial read timeout retries on same host (SERIAL and LOCAL_SERIAL)
  • Serial unavailable retries on next host
  • Non-serial operations delegate to parent policy
  • Request errors inherit parent behavior
  • Constructor defaults and customization
  • All methods return proper 3-tuples

All 103 tests in tests/unit/test_policies.py pass.
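For intuition on the "backoff delay increases with retry attempts" point, a doubling schedule is the usual ExponentialBackoffRetryPolicy shape; the base delay and any cap used by the PR are assumptions here, not taken from its code.

```python
# Illustrative exponential backoff schedule: delay = base * 2**retry_num.
# The base value and retry count are placeholders; see the PR's
# ExponentialBackoffRetryPolicy parameters for the real constants.
def backoff_delays(base, retries):
    return [base * (2 ** n) for n in range(retries)]

print(backoff_delays(0.1, 4))  # each retry waits twice as long as the last
```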

Related


mykaul commented Apr 7, 2026

CC @calebxyz
It needs more review (for me first of all), but looks important to push for at some point.


calebxyz commented Apr 7, 2026

> CC @calebxyz It needs more review (for me first of all), but looks important to push for at some point.

If this behavior is something that we have on go drivers it should be good, do we know the performance for LWT on go vs java for example? Or vs python.
Cc @temichus


mykaul commented Apr 7, 2026

> CC @calebxyz It needs more review (for me first of all), but looks important to push for at some point.

> If this behavior is something that we have on go drivers it should be good, do we know the performance for LWT on go vs java for example? Or vs python. Cc @temichus

@calebxyz - it's pointless to compare the different drivers' performance - they differ greatly. What is important is the correct and optimized behavior - and there we still have gaps. I think we are very far from testing the correct behavior: we need many more system-level tests on one hand, and on the other hand I'm against testing it in a full setup - which is why I've created scylladb/scylla-ccm#731 (that is probably not ready yet, but that's a different issue).


calebxyz commented Apr 7, 2026

> CC @calebxyz It needs more review (for me first of all), but looks important to push for at some point.

> If this behavior is something that we have on go drivers it should be good, do we know the performance for LWT on go vs java for example? Or vs python. Cc @temichus

> @calebxyz - it's pointless to compare the different drivers' performance - they differ greatly. What is important is the correct and optimized behavior - and there we still have gaps.

This is sad, the amount of unpredictability is horrible


mykaul commented Apr 7, 2026

> CC @calebxyz It needs more review (for me first of all), but looks important to push for at some point.

> If this behavior is something that we have on go drivers it should be good, do we know the performance for LWT on go vs java for example? Or vs python. Cc @temichus

> @calebxyz - it's pointless to compare the different drivers' performance - they differ greatly. What is important is the correct and optimized behavior - and there we still have gaps.

> This is sad, the amount of unpredictability is horrible

That's one of the major reasons to move some drivers to be Rust-based: Rust, CPP-over-Rust, NodeJS-over-Rust, Python-over-Rust (and we'll stay with Java and Go, I reckon).
Same situation with our Alternator clients!

LWT queries use Paxos consensus where the coordinator is the Paxos leader.
Retrying on a different host causes Paxos contention — the new coordinator
must compete with the original one, potentially causing cascading timeouts.

LWTRetryPolicy (extends ExponentialBackoffRetryPolicy) handles this by:
- CAS write timeouts: retry on SAME host with exponential backoff
- Serial consistency read timeouts: retry on SAME host with backoff
- Serial consistency unavailable: retry on NEXT host (paxos quorum lost)
- Non-CAS operations: delegate to base ExponentialBackoffRetryPolicy

Modeled after gocql's LWTRetryPolicy interface.
mykaul force-pushed the feature/lwt-retry-policy branch from f1a865b to d2a8538 on April 7, 2026 at 16:08