Skip to content

fix: improve driver-side timeout logging with timeout value and pool state (3.x)#867

Merged
nikagra merged 1 commit intoscylladb:scylla-3.xfrom
nikagra:mhradovich/3.x-improve-timeout-logging
Apr 15, 2026
Merged

fix: improve driver-side timeout logging with timeout value and pool state (3.x)#867
nikagra merged 1 commit intoscylladb:scylla-3.xfrom
nikagra:mhradovich/3.x-improve-timeout-logging

Conversation

@nikagra
Copy link
Copy Markdown

@nikagra nikagra commented Apr 14, 2026

Fixes: DRIVER-541

Problem

When OperationTimedOutException fires, the existing log output gives no information about why it happened. There are two distinct failure modes that look identical in the logs:

  1. Pool contention — the request was queued inside the driver waiting to borrow a connection, and the cumulative wait pushed the total elapsed time past readTimeoutMillis. The request may never have reached the server.
  2. Server-side slowness — the request reached the server but the server took longer than readTimeoutMillis to respond (e.g. due to a reader_concurrency_semaphore stall). The driver gave up, but the server may still be processing it, orphaning the stream ID.

Without knowing the configured timeout value or the connection state at the moment of timeout, diagnosing which scenario occurred requires guesswork.

Motivated by CUSTOMER-301 (Whoop): repeated bursts of OperationTimedOutException on the client side with no corresponding server-side timeouts. The server-side investigation pointed to a reader_concurrency_semaphore bug in ScyllaDB 2025.4.5, but the driver-side picture was unclear without the pool state at timeout time.

Changes

Connection.java

ResponseHandler.readTimeoutMillis: private → package-private.

This field holds the exact timeout value used to schedule the HashedWheelTimer for this request (either the per-statement override from Statement.setReadTimeoutMillis(), or the driver-wide SocketOptions.readTimeoutMillis() as fallback). Making it package-private allows RequestHandler (same package) to read it directly at timeout time, guaranteeing the logged value matches what the timer was set to.

RequestHandler.java

In SpeculativeExecution.onTimeout(), add a WARN log line immediately before the OperationTimedOutException is constructed:

[<execution-id>] Driver-side timeout waiting for response from <host> —
    configured timeout: <N>ms,
    connection in-flight: <N>,
    pool pending borrows: <N>,
    pool total in-flight: <N>

What each field tells you:

Field Meaning
configured timeout The readTimeoutMillis value actually used for the timer — confirms whether a per-statement or driver-wide timeout was in effect
connection in-flight Number of requests currently awaiting a response on this specific connection (Connection.inFlight)
pool pending borrows Number of requests waiting to acquire any connection from the pool (HostConnectionPool.pendingBorrowCount) — non-zero means pool contention
pool total in-flight Total in-flight across all connections to this host (HostConnectionPool.totalInFlight)

Diagnosing failure modes with this output:

  • pool pending borrows > 0 → pool contention; requests are queuing inside the driver before reaching the server
  • pool pending borrows = 0 + high connection in-flight → server is slow; requests are sent but not answered within the timeout
  • configured timeout is unexpectedly small → non-default readTimeoutMillis is set somewhere (e.g. in a framework wrapper)

Notes

  • No logic changes; the new log line fires at existing WARN level alongside the existing timeout handling path.
  • connectionHandler is null-checked defensively; in practice it will always be set when the timer fires, but the field is volatile and set to null at the start of each retry write.
  • pool is null-checked since manager.pools.get(queriedHost) can return null if the host was removed between request dispatch and timeout.
  • No test changes: there are no existing unit tests for SpeculativeExecution.onTimeout() in RequestHandlerTest; adding meaningful coverage requires a non-trivial mock of HostConnectionPool and is better tracked as a follow-up.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves driver-side diagnostics for OperationTimedOutException by logging the effective read timeout and relevant pool/connection state at the moment a driver-side timeout fires, helping distinguish pool contention vs server slowness.

Changes:

  • Add a WARN log in RequestHandler.SpeculativeExecution.onTimeout() including configured timeout, connection in-flight, and pool queue/load counters.
  • Expose the exact timeout used by the request timer by making Connection.ResponseHandler.readTimeoutMillis package-private.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
driver-core/src/main/java/com/datastax/driver/core/RequestHandler.java Adds timeout WARN logging with connection/pool state at the time of driver-side timeout.
driver-core/src/main/java/com/datastax/driver/core/Connection.java Makes ResponseHandler.readTimeoutMillis accessible within the package so the timeout log can reflect the timer’s configured value.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread driver-core/src/main/java/com/datastax/driver/core/RequestHandler.java Outdated
Comment thread driver-core/src/main/java/com/datastax/driver/core/RequestHandler.java Outdated
Comment on lines +1011 to +1023
logger.warn(
"[{}] Driver-side timeout waiting for response from {} — "
+ "configured timeout: {}ms, "
+ "connection in-flight: {}, "
+ "pool pending borrows: {}, "
+ "pool total in-flight: {}",
id,
connection.endPoint,
connectionHandler != null ? connectionHandler.readTimeoutMillis : "unknown",
connection.inFlight.get(),
pool != null ? pool.pendingBorrowCount.get() : "unknown",
pool != null ? pool.totalInFlight.get() : "unknown");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be using warn in something that can happen in a typical operation, and is produced in the request path. It may result in a lot of spam.

Can you instead add new fields to OperationTimedOutException?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — logger.warn() is gone entirely. I've added four new fields to OperationTimedOutException:

  • configuredTimeoutMs — the driver-side read timeout in effect for the request
  • connectionInFlight — in-flight count on the specific connection at timeout time
  • poolPendingBorrows — requests queued waiting for a connection from the pool
  • poolTotalInFlight — total in-flight across all connections to that host

A named sentinel constant UNAVAILABLE = -1 is used when any value cannot be determined. The enriched message is baked into the exception's message string (visible in stack traces automatically) and all fields are also accessible via getters for programmatic use (e.g. in a LatencyTracker). No logging added to the hot path.

@nikagra nikagra force-pushed the mhradovich/3.x-improve-timeout-logging branch from a421668 to 3d9f63f Compare April 14, 2026 14:04
@nikagra nikagra requested a review from Lorak-mmk April 14, 2026 14:05
@roydahan
Copy link
Copy Markdown
Collaborator

@nikagra please change the JIRA reference to https://scylladb.atlassian.net/browse/DRIVER-541

…state

When OperationTimedOutException fires in SpeculativeExecution.onTimeout(),
log the configured driver-side timeout, per-connection in-flight count,
pool pending borrow queue length, and total pool in-flight count.

This helps distinguish between two failure modes:
- Pool contention (requests queued inside the driver before reaching server):
  visible as pool pending borrows > 0
- Server-side slowness (requests reached server but driver gave up first):
  visible as pool pending borrows = 0 with high connection in-flight

To expose the timeout value, ResponseHandler.readTimeoutMillis is widened
from private to package-private. All access remains within
com.datastax.driver.core.

Motivated by CUSTOMER-301 (Whoop): client-side OperationTimedOutException
bursts with no corresponding server-side timeouts, requiring diagnostics
to determine whether the bottleneck is in the driver pool or on the server.
@nikagra nikagra force-pushed the mhradovich/3.x-improve-timeout-logging branch from 3d9f63f to 887cf96 Compare April 15, 2026 12:25
@nikagra nikagra merged commit cfb7018 into scylladb:scylla-3.x Apr 15, 2026
14 checks passed
@nikagra nikagra deleted the mhradovich/3.x-improve-timeout-logging branch April 15, 2026 16:48
@roydahan
Copy link
Copy Markdown
Collaborator

Looking at the integration tests, it looks like the fix is working:

Starting com.datastax.driver.core.QueryLoggerErrorsTest.should_log_timed_out_queries [Test #714]...
    1423749 [cluster1017-timeouter-0] DEBUG com.datastax.driver.core.QueryLogger.ERROR - [cluster1017] [/127.0.1.1:46829] Query error after 95 ms: [0 bound values] SELECT foo FROM bar;
com.datastax.driver.core.exceptions.OperationTimedOutException: [/127.0.1.1:46829] Timed out waiting for server response [configured timeout: 1ms, connection in-flight: 1, pool pending borrows: 0, pool total in-flight: 1]
	at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:1021)
	at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1943)
	at io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:713)
	at io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34)
	at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:701)
	at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:787)
	at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:501)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:750)
SUCCESS: should_log_timed_out_queries

@nikagra Am i correct?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants