
PDO pool deadlocks when over-saturated against a permanently-failing connection (chaos finding) #141

@EdmondDantes


Summary

The PDO connection pool can deadlock its waiters when establishing the underlying connection fails permanently. With pool max=N and M > N concurrent coroutines, the first N coroutines each acquire a slot, hit the connection failure, and release the slot. The remaining M-N waiters never acquire one, because every new connection attempt fails too. The pool reports Pool(idle=0, active=0, max=N): the slot is conceptually free, but the pool cannot establish a connection to back it. The waiters stay parked until the global Async\DeadlockError detector trips.

Surfaced while writing db/pool_max_reset_chaos.feature for #138 (see #140) and documented in that feature's header as out-of-scope for the chaos backstop.

Reproducer (~1 s, requires Toxiproxy + MySQL)

<?php
require_once 'ext/async/fuzzy-tests/_peers/ToxiproxyClient.php';
use Async\Chaos\ToxiproxyClient;
use function Async\spawn;
use function Async\await_all;

$client    = new ToxiproxyClient('127.0.0.1:8474');
$proxyName = 'pool_deadlock_' . getmypid();
$listen    = $client->createProxy($proxyName, '127.0.0.1:0', '127.0.0.1:3306');
register_shutdown_function(fn() => $client->deleteProxy($proxyName));

[$h, $p] = explode(':', $listen);
$pdo = new \PDO("mysql:host=$h;port=$p;dbname=chaos_test", 'test', 'test', [
    \PDO::ATTR_ERRMODE      => \PDO::ERRMODE_EXCEPTION,
    \PDO::ATTR_POOL_ENABLED => true,
    \PDO::ATTR_POOL_MIN     => 0,
    \PDO::ATTR_POOL_MAX     => 1,   // deliberately undersized
    \PDO::ATTR_TIMEOUT      => 5,
]);

// permanent reset_peer — every new connection through the proxy gets RST
$client->addToxic($proxyName, $proxyName . '_rst', 'reset_peer', 'downstream',
    ['timeout' => 0], 1.0);

$tasks = [];
for ($i = 1; $i <= 3; $i++) {
    $tasks[] = spawn(function () use ($pdo, $i) {
        try {
            $stmt = $pdo->query('SELECT 1');
            while ($stmt->fetch(\PDO::FETCH_NUM) !== false) {}
            echo "coro $i: ok\n";
        } catch (\Throwable $e) {
            echo "coro $i: " . $e::class . ': ' . $e->getMessage() . "\n";
        }
    });
}
await_all($tasks);
echo "main: all coroutines joined\n";

Observed output

coro 1: PDOException: SQLSTATE[HY000]: General error: 2006 MySQL server has gone away

=== DEADLOCK REPORT START ===
Coroutines waiting: 3, active_events: 0

Coroutine 5 spawned at :0, suspended at /tmp/pool_deadlock_repro.php:58 (main)
  waiting for:
    - Coroutine 9 spawned at … line 50 ({closure})
    - Coroutine 11 spawned at … line 50 ({closure})

Coroutine 9 …
  waiting for:
    - Pool(idle=0, active=0, max=1)

Coroutine 11 …
  waiting for:
    - Pool(idle=0, active=0, max=1)
=== DEADLOCK REPORT END   ===

coro 2: PDOException: SQLSTATE[HY000]: General error: Failed to acquire connection from pool
coro 3: PDOException: SQLSTATE[HY000]: General error: Failed to acquire connection from pool

Fatal error: Uncaught Async\DeadlockError: Deadlock detected …

Expected

Either:

  1. Fail-fast on the pool acquire path — when the pool tries to back a free slot with a fresh connection and that connection establishment fails, the waiter should receive the same "Failed to acquire connection from pool" PDOException immediately, not after the global deadlock detector trips.
  2. Bounded retry budget — N retries with backoff, then fail-fast.
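
Option 2 could look roughly like the sketch below. It is plain PHP with no dependency on the async extension, and every name in it (connectWithBudget, the injected $sleep callback, the budget of 3 attempts and the 10 ms base delay) is a hypothetical choice for illustration, not an existing pool API. The final exception only reuses the message text from the observed output.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: call $connect up to $maxAttempts times with
 * exponential backoff, then give up and throw (fail-fast).
 * $sleep is injected so the sketch runs without real delays.
 */
function connectWithBudget(callable $connect, int $maxAttempts, int $baseDelayMs, callable $sleep): mixed
{
    $last = null;
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            return $connect();
        } catch (\Throwable $e) {
            $last = $e;
            if ($attempt < $maxAttempts) {
                // Delay doubles each round: base, 2*base, 4*base, ...
                $sleep($baseDelayMs * (2 ** ($attempt - 1)));
            }
        }
    }
    // Budget exhausted: surface a per-pool error with the last cause attached.
    throw new \RuntimeException('Failed to acquire connection from pool', 0, $last);
}

// Usage: a connector that always fails, like the permanent reset_peer toxic.
$attempts = 0;
$delays   = [];
$message  = null;
try {
    connectWithBudget(
        function () use (&$attempts) {
            $attempts++;
            throw new \RuntimeException('connection reset');
        },
        3,   // retry budget (assumed value)
        10,  // base delay in ms (assumed value)
        function (int $ms) use (&$delays) { $delays[] = $ms; }
    );
} catch (\RuntimeException $e) {
    $message = $e->getMessage();
}
```

With a budget of 3 the connector is called three times, with recorded delays of 10 ms and 20 ms between attempts, and the caller then gets the pool-level exception instead of parking indefinitely.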

Crucially, the Async\DeadlockError thrown at process level should not be the mechanism that wakes the waiters. It is a system-wide signal, not a per-pool one, and it makes the failure look like a runtime bug to the application code.

Note that coro 2/coro 3 do eventually print their "Failed to acquire connection from pool" — meaning the pool already has the fail-fast code path. The bug is the ordering: the per-waiter fail-fast triggers after the deadlock detector, instead of being the proximate cause of the waiter waking up.
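
The ordering asked for here can be illustrated with a toy model. This is not the extension's code: ToyPool, its callback-based waiters, and the settle() step are all hypothetical stand-ins for coroutines and the real pool internals. The one policy it demonstrates is an assumption about a possible fix: when establishing a connection fails and no live connection remains that could later be released, fail every parked waiter right away.

```php
<?php
declare(strict_types=1);

/** Toy model only: waiters are [onReady, onError] callback pairs, not coroutines. */
final class ToyPool
{
    private int $active = 0;
    private array $connecting = []; // requesters that hold a slot and are connecting
    private array $waiters = [];    // requesters parked because the pool is full

    public function __construct(private int $max, private \Closure $connect) {}

    public function acquire(callable $onReady, callable $onError): void
    {
        if ($this->active < $this->max) {
            $this->active++;
            $this->connecting[] = [$onReady, $onError];
        } else {
            $this->waiters[] = [$onReady, $onError];
        }
    }

    /** Resolve all pending connection attempts. */
    public function settle(): void
    {
        $pending = $this->connecting;
        $this->connecting = [];
        foreach ($pending as [$onReady, $onError]) {
            try {
                $conn = ($this->connect)();
            } catch (\Throwable $e) {
                $this->active--;        // give the slot back
                $onError($e);
                $this->failWaiters($e); // the fail-fast step
                continue;
            }
            $onReady($conn);
        }
    }

    private function failWaiters(\Throwable $cause): void
    {
        if ($this->active > 0) {
            return; // a live connection may still be released to a waiter
        }
        $parked = $this->waiters;
        $this->waiters = [];
        foreach ($parked as [, $onError]) {
            $onError(new \RuntimeException('Failed to acquire connection from pool', 0, $cause));
        }
    }

    public function waiting(): int
    {
        return count($this->waiters);
    }
}

// Usage: max=1, three requesters, and a connector that always fails.
$log  = [];
$pool = new ToyPool(1, function () { throw new \RuntimeException('server has gone away'); });
for ($i = 1; $i <= 3; $i++) {
    $pool->acquire(
        function ($conn) use (&$log, $i) { $log[] = "coro $i: ok"; },
        function (\Throwable $e) use (&$log, $i) { $log[] = "coro $i: " . $e->getMessage(); }
    );
}
$pool->settle();
```

In this model the first requester sees the connection error, and the two parked requesters receive "Failed to acquire connection from pool" in the same step, so nothing is left waiting for a global detector.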

Reproduction environment

Related

Scope

This concerns only the pool-acquire path. In-flight query teardown is not affected: a query that loses its connection raises PDOException normally, as coro 1 above demonstrates.
