
VIP31: Backend connection queue

Dridi Boukelmoune edited this page May 12, 2021 · 2 revisions

Synopsis

We want fetch transactions to stop failing instantly when their backend has reached its max_connections limit.

Why

Currently, we can mitigate a backend overload by limiting the number of concurrent connections, and for HTTP/1 that implies the same limit on concurrent requests. However, once a backend is saturated, any new fetch attempt fails immediately. During traffic spikes we would rather wait for a connection to become available before actually failing on the max_connections criterion.

How

Connection queue

A transaction trying to acquire a connection when the backend is already saturated should be queued. We don't want this queue to grow without bound, so we want both a limit and a timeout. Additionally, while waiting for a connection the task should disembark from the fetch state machine instead of blocking a worker thread. When a transaction reembarks successfully, it should be guaranteed to either reuse a connection or attempt a new one. In other words, a successful reembark should not run into max_connections saturation again.
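The queueing and reembark guarantee described above can be illustrated with a toy model. This is only a sketch in Python, not Varnish code: the Backend class, the Outcome enum and their methods are hypothetical names, and timeouts are left out for brevity. The point it tries to show is that handing a freed slot directly to a waiter keeps a newcomer from taking it first.

```python
from collections import deque
from enum import Enum


class Outcome(Enum):
    CONNECTED = "connected"  # a connection slot was granted immediately
    QUEUED = "queued"        # saturated, the transaction waits (disembarked)
    FAILED = "failed"        # saturated and the queue is full


class Backend:
    """Toy model of a backend with a bounded wait queue (hypothetical)."""

    def __init__(self, max_connections, wait_limit=0):
        self.max_connections = max_connections
        self.wait_limit = wait_limit  # 0 means no queuing, as today
        self.in_use = 0
        self.waiters = deque()

    def acquire(self, txn):
        if self.in_use < self.max_connections:
            self.in_use += 1
            return Outcome.CONNECTED
        if len(self.waiters) < self.wait_limit:
            self.waiters.append(txn)
            return Outcome.QUEUED
        return Outcome.FAILED

    def release(self):
        """Free a slot and return the transaction that reembarks, if any."""
        if self.waiters:
            # Hand the slot over directly: in_use is not decremented, so a
            # newcomer cannot take the slot before the waiter reembarks.
            return self.waiters.popleft()
        self.in_use -= 1
        return None
```

With max_connections=1 and wait_limit=1, a first acquire() connects, a second one is queued and a third one fails. With the default wait_limit of 0 the model behaves like the current implementation and fails as soon as the backend is saturated.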

New parameters

New global parameters are needed to define a default queue limit and timeout:

  • backend_wait_timeout (defaults to 0, meaning no timeout)
  • backend_wait_limit (defaults to 0, meaning no queuing)

The default values don't change the current max_connections behavior. The parameter names were inspired by the existing backend_idle_timeout parameter, which also relates directly to backend connection management.

VCL changes

It should be possible to override the global parameters on a per-backend basis:

backend unreliable {
    .host = "unreliable.example.com";
    .max_connections = 100;
    .wait_timeout = 10s;
    .wait_limit = 20;
}

In addition, it should be possible to override the timeout on a per-transaction basis:

sub vcl_backend_fetch {
    if (bereq.backend == unreliable && bereq.url ~ "/non/critical") {
        set bereq.wait_timeout = 1m; # we can afford to wait longer
    }
}

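Since each queued transaction may have its own timeout, waiters can expire in a different order than they arrived. This means that the max_connections queue cannot be a mere FIFO.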
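One way to picture this is a queue ordered by deadline rather than by arrival. The Python sketch below is an assumption, not the proposed implementation: effective_timeout() and DeadlineQueue are hypothetical names, the precedence (transaction, then backend, then global parameter) is inferred from the overrides described above, and how a timeout of 0 is represented is deliberately left out.

```python
import heapq
import itertools

GLOBAL_WAIT_TIMEOUT = 0.0  # stands in for the backend_wait_timeout parameter


def effective_timeout(bereq_timeout=None, backend_timeout=None):
    """Most specific setting wins: transaction, then backend, then global."""
    if bereq_timeout is not None:
        return bereq_timeout
    if backend_timeout is not None:
        return backend_timeout
    return GLOBAL_WAIT_TIMEOUT


class DeadlineQueue:
    """Waiters ordered by deadline instead of arrival (hypothetical)."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker for equal deadlines

    def push(self, txn, now, timeout):
        # Only positive timeouts are modeled in this sketch.
        heapq.heappush(self._heap, (now + timeout, next(self._seq), txn))

    def expire(self, now):
        """Remove and return every waiter whose deadline has passed."""
        expired = []
        while self._heap and self._heap[0][0] <= now:
            expired.append(heapq.heappop(self._heap)[2])
        return expired
```

A transaction queued at t=0 with a 1m timeout outlives one queued at t=5 with the backend's 10s timeout, so the later arrival is the first to give up. A real queue would also need to remove a waiter when it is handed a connection, which this sketch does not cover.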

Disembarking

We will need to change the backend API and fetch state machine to introduce a step dedicated to attempting a backend connection. This could involve the waiter facility.

Backend resolution

Ideally, a director should not pick a backend that can neither connect immediately nor queue the transaction until a connection becomes available. We could teach regular backends to report sick when saturated, but only in the context of a transaction (note: vbe_healthy() currently disregards ctx).
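The resolution rule could look roughly like the following Python sketch. BackendState, usable() and resolve() are hypothetical names, the loop only imitates a fallback-style director, and the transaction context mentioned above is not modeled. It also assumes a positive max_connections.

```python
from dataclasses import dataclass


@dataclass
class BackendState:
    """Snapshot of a backend as seen by a director (hypothetical)."""
    name: str
    max_connections: int
    wait_limit: int
    in_use: int = 0
    queued: int = 0
    probe_healthy: bool = True

    def usable(self):
        """True if a new transaction could connect now or join the queue."""
        if not self.probe_healthy:
            return False
        if self.in_use < self.max_connections:
            return True
        return self.queued < self.wait_limit


def resolve(backends):
    """Pick the first usable backend, skipping saturated ones."""
    for be in backends:
        if be.usable():
            return be
    return None
```

A saturated backend with room left in its queue is still eligible, while a saturated backend with a full queue (or with queuing disabled) is skipped, just like a sick one.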
