Skip to content

Retry once on stale-connection ConnectionError in Pool#with#2

Merged
tycooon merged 2 commits into
masterfrom
feat/retry-stale-connection
Apr 29, 2026
Merged

Retry once on stale-connection ConnectionError in Pool#with#2
tycooon merged 2 commits into
masterfrom
feat/retry-stale-connection

Conversation

@tycooon
Copy link
Copy Markdown
Member

@tycooon tycooon commented Apr 29, 2026

Summary

We hit this in production:

ClickhouseNative::ConnectionError: closed: Success
  at app/data/ch.rb:53 in CH#execute

The error originates in clickhouse-cpp's SocketInput::DoRead:

if (ret == 0) {
    throw std::system_error(getSocketErrorCode(), getErrorCategory(), "closed");
}

recv() == 0 means the peer cleanly closed the TCP connection (FIN). On a clean close errno isn't set, so strerror(0) == "Success" — hence the cryptic "closed: Success" message.

Common causes for a pooled connection going stale:

  • CH server's idle_connection_timeout / tcp_keep_alive_timeout (defaults: 3600s / 290s).
  • Load balancer / firewall idle-timeout drops (often 60–300s).
  • Server restart / OOM kill.

In the prod log, the failed INSERT took only 8.388ms — far faster than any real round-trip; recv returned 0 immediately. The connection was already half-closed by the server before any meaningful work happened, so the INSERT didn't execute server-side and a retry on a fresh socket is safe.

Implementation

Pool#with already discards the dead client on error (discard_current_connection) so the next checkout would be fresh. We just need to actually do that next checkout instead of bubbling the error to callers. New behavior:

  • ConnectionError → discard, retry once. If the retry also fails (e.g. server genuinely down), raise.
  • All other errors (ServerError, ProtocolError, EncoderError, UnsupportedTypeError, …) → discard, raise immediately. No retry.

The retry is intentionally narrow — ConnectionError is what fires for stale-pool sockets where the server closed before we sent anything meaningful. Other error classes can fire mid-operation and re-running them could mask bugs or double-execute writes.

Tests

Three new specs in the existing Pool describe block:

  • success on retry after a ConnectionError,
  • raise after the second ConnectionError,
  • no retry on ServerError.

Full suite: 63 examples, 0 failures. Rubocop clean.

Test plan

  • bundle exec rspec
  • bundle exec rubocop

🤖 Generated with Claude Code

tycooon and others added 2 commits April 29, 2026 12:48
Pool checkouts can hand out connections whose server-side end has
been closed while idle (CH idle_connection_timeout / tcp_keep_alive_timeout,
LB / firewall idle drops, server restart). The first read on such a
socket sees recv() == 0, which clickhouse-cpp surfaces as

  std::system_error(errno=0, "closed")

and we wrap as ClickhouseNative::ConnectionError("closed: Success").
The pool already discards the dead client (`discard_current_connection`
in the rescue) so the next checkout is fresh — but we used to re-raise
the error to the caller, who'd see a noisy "closed: Success" failure
on the very first operation after their pooled client went stale.

Now Pool#with discards-and-retries once on ConnectionError. If the
retry also fails (e.g. server actually down), we raise. Other error
classes (ServerError, ProtocolError, EncoderError, etc.) are still
not retried — those are real and re-running could mask logic bugs
or, for ServerError on a half-completed write, double-execute.

Three new specs cover the path: success on retry, raise after the
second failure, and no retry on non-ConnectionError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tycooon tycooon merged commit 5ade825 into master Apr 29, 2026
9 checks passed
@tycooon tycooon deleted the feat/retry-stale-connection branch April 29, 2026 09:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant