Make transparent proxy health check parameters configurable #4084

## Problem

When a transient network disruption occurs between the transparent proxy and a container, the proxy's health check fails consecutively and triggers proxy shutdown. This invalidates the client's MCP session ("Session not found"), requiring a full client restart to recover.

The total window from first failure to shutdown is ~40 seconds with current defaults (3 ticks at 10s intervals + 5s retry delays). Any network blip lasting longer than this kills the proxy, even though the MCP server process inside the container is perfectly healthy.
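The ~40 second figure can be reproduced with a rough model. The sketch below is one possible reading of the timing, not the proxy's actual logic: it assumes up to one interval passes before the first failing check, and that each failed attempt then costs one ping timeout plus one retry delay. The function and constant names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Current defaults as listed in this issue.
const (
	defaultInterval    = 10 * time.Second
	defaultPingTimeout = 5 * time.Second
	defaultRetryDelay  = 5 * time.Second
)

// estimateWindow approximates the time from the start of a disruption to
// proxy shutdown. Assumed model (not taken from the proxy source): one
// interval of slack, then each failed attempt costs pingTimeout + retryDelay.
func estimateWindow(interval, pingTimeout, retryDelay time.Duration, threshold int) time.Duration {
	return interval + time.Duration(threshold)*(pingTimeout+retryDelay)
}

func main() {
	w := estimateWindow(defaultInterval, defaultPingTimeout, defaultRetryDelay, 3)
	fmt.Println(w) // 40s
}
```

Under this model every additional tolerated failure adds roughly 10 seconds to the window.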

## Root cause

The transparent proxy health check parameters are mostly hardcoded:

| Parameter | Default | Configurable? |
|---|---|---|
| Health check interval | 10s | Yes — `TOOLHIVE_HEALTH_CHECK_INTERVAL` |
| Ping timeout | 5s | No — `DefaultPingerTimeout` in `pinger.go` |
| Retry delay between failures | 5s | No — `DefaultHealthCheckRetryDelay` |
| Consecutive failures before shutdown | 3 | No — `healthCheckRetryCount` constant |

The `with*` option functions (`withHealthCheckPingTimeout()`, `withHealthCheckRetryDelay()`) already exist but are unexported and only used in tests.

## Proposed fix

1. Expose the hardcoded parameters as environment variables, following the existing pattern of `TOOLHIVE_HEALTH_CHECK_INTERVAL`:

| Env Var | Controls | Default |
|---|---|---|
| `TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT` | Per-attempt ping timeout | 5s |
| `TOOLHIVE_HEALTH_CHECK_RETRY_DELAY` | Delay between consecutive failures | 5s |
| `TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD` | Consecutive failures before shutdown | 5 |

2. Raise the default failure threshold from 3 to 5 to improve resilience out of the box. This extends the tolerance window from ~40s to ~60s, reducing false positives from transient network disruptions without significantly delaying detection of genuinely dead servers.
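A minimal sketch of how the three proposed variables might be read, assuming durations are given as Go duration strings such as `5s` and that invalid or non-positive values fall back to the default. The `healthCheckConfig` type and the helper functions are hypothetical and do not reflect the existing ToolHive code; only the variable names and defaults come from the table above.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// healthCheckConfig is a hypothetical holder for the three proposed settings.
type healthCheckConfig struct {
	PingTimeout      time.Duration
	RetryDelay       time.Duration
	FailureThreshold int
}

// lookupFunc matches os.LookupEnv so a fake environment can be injected.
type lookupFunc func(name string) (string, bool)

// durationOr parses a Go duration string (assumed format) and returns def
// when the variable is unset, malformed, or not positive.
func durationOr(lookup lookupFunc, name string, def time.Duration) time.Duration {
	if v, ok := lookup(name); ok {
		if d, err := time.ParseDuration(v); err == nil && d > 0 {
			return d
		}
	}
	return def
}

// intOr parses a positive integer and returns def otherwise.
func intOr(lookup lookupFunc, name string, def int) int {
	if v, ok := lookup(name); ok {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return def
}

// loadHealthCheckConfig applies the proposed defaults (5s, 5s, 5).
func loadHealthCheckConfig(lookup lookupFunc) healthCheckConfig {
	return healthCheckConfig{
		PingTimeout:      durationOr(lookup, "TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT", 5*time.Second),
		RetryDelay:       durationOr(lookup, "TOOLHIVE_HEALTH_CHECK_RETRY_DELAY", 5*time.Second),
		FailureThreshold: intOr(lookup, "TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD", 5),
	}
}

func main() {
	cfg := loadHealthCheckConfig(os.LookupEnv)
	fmt.Printf("%+v\n", cfg)
}
```

Falling back silently on bad input is a design choice made here for brevity; logging a warning or failing at startup would be equally reasonable.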

### Why not just increase the interval?

Increasing `TOOLHIVE_HEALTH_CHECK_INTERVAL` widens the window only by slowing down every check, because the failure count is still fixed. A user who wants to tolerate longer network blips would need a very large interval, which means legitimate failures also take a long time to detect. Configuring the failure threshold separately allows both quick detection (short interval) and network tolerance (higher threshold).
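To illustrate the effect of the threshold on its own, the sketch below replays a sequence of check results against two thresholds. It assumes a successful check resets the failure streak, which is what "consecutive" implies; `shutsDown` is a hypothetical helper, not a function in the proxy.

```go
package main

import "fmt"

// shutsDown reports whether a sequence of check results (true = healthy)
// ever contains `threshold` failures in a row. Illustrative only.
func shutsDown(results []bool, threshold int) bool {
	consecutive := 0
	for _, ok := range results {
		if ok {
			consecutive = 0 // any success resets the streak
			continue
		}
		consecutive++
		if consecutive >= threshold {
			return true
		}
	}
	return false
}

func main() {
	// A blip spanning four consecutive checks, followed by recovery.
	blip := []bool{true, false, false, false, false, true}
	fmt.Println(shutsDown(blip, 3)) // true
	fmt.Println(shutsDown(blip, 5)) // false
}
```

With the check cadence unchanged, the same four-check blip kills the proxy at a threshold of 3 and is survived at a threshold of 5.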

### Scope

This proposal covers only the transparent proxy (SSE and streamable HTTP transports). The vMCP server uses a separate circuit breaker pattern and is out of scope.

## Related

- ToolHive issues: #3783, #1357
- ToolHive PR: #4063
