fix: improve driver-side timeout logging with timeout value and pool state (3.x)#867
Conversation
There was a problem hiding this comment.
Pull request overview
Improves driver-side diagnostics for OperationTimedOutException by logging the effective read timeout and relevant pool/connection state at the moment a driver-side timeout fires, helping distinguish pool contention vs server slowness.
Changes:
- Add a WARN log in
RequestHandler.SpeculativeExecution.onTimeout()including configured timeout, connection in-flight, and pool queue/load counters. - Expose the exact timeout used by the request timer by making
Connection.ResponseHandler.readTimeoutMillispackage-private.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| driver-core/src/main/java/com/datastax/driver/core/RequestHandler.java | Adds timeout WARN logging with connection/pool state at the time of driver-side timeout. |
| driver-core/src/main/java/com/datastax/driver/core/Connection.java | Makes ResponseHandler.readTimeoutMillis accessible within the package so the timeout log can reflect the timer’s configured value. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| logger.warn( | ||
| "[{}] Driver-side timeout waiting for response from {} — " | ||
| + "configured timeout: {}ms, " | ||
| + "connection in-flight: {}, " | ||
| + "pool pending borrows: {}, " | ||
| + "pool total in-flight: {}", | ||
| id, | ||
| connection.endPoint, | ||
| connectionHandler != null ? connectionHandler.readTimeoutMillis : "unknown", | ||
| connection.inFlight.get(), | ||
| pool != null ? pool.pendingBorrowCount.get() : "unknown", | ||
| pool != null ? pool.totalInFlight.get() : "unknown"); | ||
|
|
There was a problem hiding this comment.
I don't think we should be using warn in something that can happen in a typical operation, and is produced in the request path. It may result in a lot of spam.
Can you instead add new fields to OperationTimedOutException?
There was a problem hiding this comment.
Done — logger.warn() is gone entirely. I've added four new fields to OperationTimedOutException:
configuredTimeoutMs— the driver-side read timeout in effect for the requestconnectionInFlight— in-flight count on the specific connection at timeout timepoolPendingBorrows— requests queued waiting for a connection from the poolpoolTotalInFlight— total in-flight across all connections to that host
A named sentinel constant UNAVAILABLE = -1 is used when any value cannot be determined. The enriched message is baked into the exception's message string (visible in stack traces automatically) and all fields are also accessible via getters for programmatic use (e.g. in a LatencyTracker). No logging added to the hot path.
a421668 to
3d9f63f
Compare
|
@nikagra please change the JIRA reference to https://scylladb.atlassian.net/browse/DRIVER-541 |
…state When OperationTimedOutException fires in SpeculativeExecution.onTimeout(), log the configured driver-side timeout, per-connection in-flight count, pool pending borrow queue length, and total pool in-flight count. This helps distinguish between two failure modes: - Pool contention (requests queued inside the driver before reaching server): visible as pool pending borrows > 0 - Server-side slowness (requests reached server but driver gave up first): visible as pool pending borrows = 0 with high connection in-flight To expose the timeout value, ResponseHandler.readTimeoutMillis is widened from private to package-private. All access remains within com.datastax.driver.core. Motivated by CUSTOMER-301 (Whoop): client-side OperationTimedOutException bursts with no corresponding server-side timeouts, requiring diagnostics to determine whether the bottleneck is in the driver pool or on the server.
3d9f63f to
887cf96
Compare
|
Looking at the integration tests, it looks like the fix is working: @nikagra Am i correct? |
Fixes: DRIVER-541
Problem
When
OperationTimedOutExceptionfires, the existing log output gives no information about why it happened. There are two distinct failure modes that look identical in the logs:readTimeoutMillis. The request may never have reached the server.readTimeoutMillisto respond (e.g. due to areader_concurrency_semaphorestall). The driver gave up, but the server may still be processing it, orphaning the stream ID.Without knowing the configured timeout value or the connection state at the moment of timeout, diagnosing which scenario occurred requires guesswork.
Motivated by CUSTOMER-301 (Whoop): repeated bursts of
OperationTimedOutExceptionon the client side with no corresponding server-side timeouts. The server-side investigation pointed to areader_concurrency_semaphorebug in ScyllaDB 2025.4.5, but the driver-side picture was unclear without the pool state at timeout time.Changes
Connection.javaResponseHandler.readTimeoutMillis:private→ package-private.This field holds the exact timeout value used to schedule the
HashedWheelTimerfor this request (either the per-statement override fromStatement.setReadTimeoutMillis(), or the driver-wideSocketOptions.readTimeoutMillis()as fallback). Making it package-private allowsRequestHandler(same package) to read it directly at timeout time, guaranteeing the logged value matches what the timer was set to.RequestHandler.javaIn
SpeculativeExecution.onTimeout(), add aWARNlog line immediately before theOperationTimedOutExceptionis constructed:What each field tells you:
configured timeoutreadTimeoutMillisvalue actually used for the timer — confirms whether a per-statement or driver-wide timeout was in effectconnection in-flightConnection.inFlight)pool pending borrowsHostConnectionPool.pendingBorrowCount) — non-zero means pool contentionpool total in-flightHostConnectionPool.totalInFlight)Diagnosing failure modes with this output:
pool pending borrows > 0→ pool contention; requests are queuing inside the driver before reaching the serverpool pending borrows = 0+ highconnection in-flight→ server is slow; requests are sent but not answered within the timeoutconfigured timeoutis unexpectedly small → non-defaultreadTimeoutMillisis set somewhere (e.g. in a framework wrapper)Notes
connectionHandleris null-checked defensively; in practice it will always be set when the timer fires, but the field isvolatileand set tonullat the start of each retry write.poolis null-checked sincemanager.pools.get(queriedHost)can return null if the host was removed between request dispatch and timeout.SpeculativeExecution.onTimeout()inRequestHandlerTest; adding meaningful coverage requires a non-trivial mock ofHostConnectionPooland is better tracked as a follow-up.