net-tcp_bbr: v2: BBRv2 ("bbr2") congestion control for Linux TCP
BBR v2 is an enhancement to the BBR v1 algorithm. It's designed to aim for lower
queues, lower loss, and better Reno/CUBIC coexistence than BBR v1.
BBR v2 maintains the core of BBR v1: an explicit model of the network
path that is two-dimensional, adapting to estimate the (a) maximum
available bandwidth and (b) maximum safe volume of data a flow can
keep in-flight in the network. It maintains the estimated BDP as a
core guide for estimating an appropriate level of in-flight data.
BBR v2 makes several key enhancements:
o Its bandwidth-probing time scale is adapted, within bounds, to allow improved
coexistence with Reno and CUBIC. The bandwidth-probing time scale is (a)
extended dynamically based on estimated BDP to improve coexistence with
Reno/CUBIC; (b) bounded by an interactive wall-clock time-scale to be more
scalable and responsive than Reno and CUBIC.
o Rather than being largely agnostic to loss and ECN marks, it explicitly uses
loss and (DCTCP-style) ECN signals to maintain its model.
o It aims for lower losses than v1 by adjusting its model to attempt to stay
within loss rate and ECN mark rate bounds (loss_thresh and ecn_thresh,
respectively).
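The bounds check described above can be sketched roughly as follows. This
is a minimal illustration, not the code in this patch: "struct
round_sample", rate_exceeds(), round_exceeds_bounds() and the THRESH_*
fixed-point scale are hypothetical names chosen for the example, and
measuring the loss rate against packets sent is an assumption. Only
loss_thresh and ecn_thresh come from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fixed-point scale: a threshold of 1.0 equals THRESH_UNIT. */
#define THRESH_SHIFT 8
#define THRESH_UNIT  (1u << THRESH_SHIFT)

/* Hypothetical summary of one round trip's delivery and congestion signals. */
struct round_sample {
	uint32_t sent;       /* packets sent in the round */
	uint32_t lost;       /* packets marked lost */
	uint32_t delivered;  /* packets delivered */
	uint32_t ce_marked;  /* delivered packets carrying an ECN CE mark */
};

/* True if events/total is strictly greater than thresh/THRESH_UNIT. */
static bool rate_exceeds(uint32_t events, uint32_t total, uint32_t thresh)
{
	assert(thresh <= THRESH_UNIT);
	if (total == 0)
		return false;
	/* Cross-multiply in 64 bits to avoid division and overflow. */
	return ((uint64_t)events << THRESH_SHIFT) > (uint64_t)total * thresh;
}

/* True if either the loss rate or the ECN mark rate is out of bounds. */
static bool round_exceeds_bounds(const struct round_sample *rs,
				 uint32_t loss_thresh, uint32_t ecn_thresh)
{
	return rate_exceeds(rs->lost, rs->sent, loss_thresh) ||
	       rate_exceeds(rs->ce_marked, rs->delivered, ecn_thresh);
}
```

With illustrative thresholds of about 2% loss (THRESH_UNIT * 2 / 100) and
50% ECN marks (THRESH_UNIT / 2), a round with 3 losses out of 100 packets
sent would be flagged, while a round with 1 loss and 10 CE marks out of 99
delivered packets would not.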
o It adapts to loss/ECN signals even when the application is running out of
data ("application-limited"), in case the "application-limited" flow is also
"network-limited" (the bw and/or inflight available to this flow is lower than
previously estimated when the flow ran out of data).
o It has a three-part model: the model explicitly tracks three operating points,
where an operating point is a tuple: (bandwidth, inflight). The three operating
points are:
  o latest: the latest measurement from the current round trip
  o upper bound: robust, optimistic, long-term upper bound
  o lower bound: robust, conservative, short-term lower bound
These are stored in the following state variables:
  o latest: bw_latest, inflight_latest
  o lo: bw_lo, inflight_lo
  o hi: bw_hi[2], inflight_hi
To gain intuition about the meaning of the three operating points, it
may help to consider the analogs in CUBIC, which has a somewhat
analogous three-part model used by its probing state machine:
  BBR param     CUBIC param
  -----------   -------------
  latest     ~  cwnd
  lo         ~  ssthresh
  hi         ~  last_max_cwnd
The analogy is only a loose one, though, since the BBR operating
points are calculated differently, and are 2-dimensional (bw,inflight)
rather than CUBIC's one-dimensional notion of operating point
(inflight).
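To make the three operating points concrete, here is a minimal sketch of
how they might be stored and combined. The field names match the state
variables listed above; "struct op_point_model", model_bw(),
model_inflight_cap() and the use of UINT32_MAX to mean "bound not set" are
assumptions made for this example. Taking the max of the two bw_hi slots
and then capping by bw_lo is one plausible combination, not necessarily
what the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the three (bandwidth, inflight) operating points. */
struct op_point_model {
	uint32_t bw_latest;        /* latest: measured in the current round trip */
	uint32_t inflight_latest;
	uint32_t bw_lo;            /* lo: short-term bound; UINT32_MAX = unset */
	uint32_t inflight_lo;
	uint32_t bw_hi[2];         /* hi: long-term bound, kept in two slots */
	uint32_t inflight_hi;
};

static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }
static uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }

/* Bandwidth to use: the long-term estimate, capped by the short-term bound. */
static uint32_t model_bw(const struct op_point_model *m)
{
	assert(m != NULL);
	return min_u32(max_u32(m->bw_hi[0], m->bw_hi[1]), m->bw_lo);
}

/* Inflight cap: the tighter of the long-term and short-term bounds. */
static uint32_t model_inflight_cap(const struct op_point_model *m)
{
	assert(m != NULL);
	return min_u32(m->inflight_hi, m->inflight_lo);
}
```

In this sketch an unset lo bound (UINT32_MAX) leaves the hi values in
control, and a congestion signal that lowers bw_lo or inflight_lo takes
effect immediately without discarding the long-term hi estimates.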
o It uses the three-part model to adapt the magnitude of its bandwidth probing
to match the estimated space available in the buffer, rather than (as
in BBR v1) assuming that it was always acceptable to place 0.25*BDP in
the bottleneck buffer when probing (commodity datacenter switches
commonly do not have that much buffer for WAN flows). When BBR v2
estimates it hit a buffer limit during probing, its bandwidth probing
then starts gently in case little space is still available in the
buffer, and then accelerates, slowly at first and then rapidly if it
can grow inflight without seeing congestion signals. In such cases,
probing is bounded by inflight_hi + inflight_probe, where
inflight_probe grows as: [0, 1, 2, 4, 8, 16,...]. This allows BBR to
keep losses low and bounded if a bottleneck remains congested, while
rapidly/scalably utilizing free bandwidth when it becomes available.
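The growth of inflight_probe and the resulting probing bound can be
sketched as below. The helper names next_inflight_probe(), probe_bound()
and fill_probe_schedule() are hypothetical, and the sketch assumes one
growth step per call with saturating arithmetic; when and how often the
real code takes a step is not modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Next value in the sequence 0, 1, 2, 4, 8, 16, ... (saturating). */
static uint32_t next_inflight_probe(uint32_t probe)
{
	if (probe == 0)
		return 1;
	if (probe > UINT32_MAX / 2)
		return UINT32_MAX;
	return probe * 2;
}

/* Upper bound on inflight while probing: inflight_hi + inflight_probe. */
static uint32_t probe_bound(uint32_t inflight_hi, uint32_t inflight_probe)
{
	if (inflight_probe > UINT32_MAX - inflight_hi)
		return UINT32_MAX;  /* saturate instead of wrapping */
	return inflight_hi + inflight_probe;
}

/* Fill out[0..n-1] with the first n values of the probe sequence. */
static void fill_probe_schedule(uint32_t *out, size_t n)
{
	uint32_t probe = 0;
	size_t i;

	assert(out != NULL || n == 0);
	for (i = 0; i < n; i++) {
		out[i] = probe;
		probe = next_inflight_probe(probe);
	}
}
```

Starting from zero means the first probing step adds nothing beyond
inflight_hi, which matches the "starts gently" behavior described above;
the doubling afterwards gives the rapid growth once no congestion signals
are seen.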
o It has a slightly revised state machine, to achieve the goals above.
BBR_BW_PROBE_UP: pushes up inflight to probe for bw/vol
BBR_BW_PROBE_DOWN: drain excess inflight from the queue
BBR_BW_PROBE_CRUISE: use pipe, w/ headroom in queue/pipe
BBR_BW_PROBE_REFILL: try to refill the pipe again to 100%, leaving queue empty
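As a rough illustration, the four phases can be written as an enum with a
helper that steps through them. The phase names come from the list above;
the numeric values, the BBR_BW_PROBE_NUM_PHASES sentinel and
next_bw_probe_phase() are hypothetical, and the sketch assumes the phases
simply cycle in the order listed. The real transitions depend on timers
and congestion signals that are not modeled here.

```c
#include <assert.h>

/* Bandwidth-probing phases, in the order listed above. */
enum bbr_bw_probe_phase {
	BBR_BW_PROBE_UP,      /* push up inflight to probe for bw/vol */
	BBR_BW_PROBE_DOWN,    /* drain excess inflight from the queue */
	BBR_BW_PROBE_CRUISE,  /* use pipe, with headroom in queue/pipe */
	BBR_BW_PROBE_REFILL,  /* refill the pipe to 100%, queue left empty */
	BBR_BW_PROBE_NUM_PHASES  /* hypothetical sentinel: number of phases */
};

/* Advance to the next phase, wrapping from REFILL back to UP. */
static enum bbr_bw_probe_phase
next_bw_probe_phase(enum bbr_bw_probe_phase phase)
{
	assert((unsigned int)phase < BBR_BW_PROBE_NUM_PHASES);
	return (enum bbr_bw_probe_phase)
		(((unsigned int)phase + 1) % BBR_BW_PROBE_NUM_PHASES);
}
```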
o The estimated BDP: BBR v2 continues to maintain an estimate of the
path's two-way propagation delay, by tracking a windowed min_rtt, and
coordinating (on an as-needed basis) to try to expose the two-way
propagation delay by draining the bottleneck queue.
BBR v2 continues to use its min_rtt and (currently-applicable) bandwidth
estimate to estimate the current bandwidth-delay product. The estimated BDP
still provides one important guideline for bounding inflight data. However,
because a min-filtered RTT and a max-filtered bw both inherently tend to
overestimate, the estimated BDP is often too high; in this case loss or ECN
marks can ensue, in which case BBR v2 adjusts inflight_hi and inflight_lo to
adapt its sending rate and inflight down to match the available capacity of the
path.
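The relationship between the estimated BDP and the inflight bounds can be
sketched as follows. The helper names bdp_packets() and bounded_inflight(),
and the units (bandwidth in packets per second, min_rtt in microseconds),
are assumptions made for this example; the kernel code uses its own
fixed-point units.

```c
#include <assert.h>
#include <stdint.h>

#define USEC_PER_SEC 1000000ULL

/* Estimated BDP in packets: bw * min_rtt, rounded up. */
static uint32_t bdp_packets(uint64_t bw_pps, uint32_t min_rtt_us)
{
	uint64_t bdp;

	/* Keep bw within 32 bits so the 64-bit product cannot overflow. */
	assert(bw_pps <= UINT32_MAX);
	bdp = (bw_pps * min_rtt_us + USEC_PER_SEC - 1) / USEC_PER_SEC;
	return bdp > UINT32_MAX ? UINT32_MAX : (uint32_t)bdp;
}

/* The BDP is one guideline; inflight_hi and inflight_lo can tighten it. */
static uint32_t bounded_inflight(uint32_t bdp, uint32_t inflight_hi,
				 uint32_t inflight_lo)
{
	uint32_t cap = inflight_hi < inflight_lo ? inflight_hi : inflight_lo;

	return bdp < cap ? bdp : cap;
}
```

For example, 10000 packets per second with a 20 ms min_rtt gives an
estimated BDP of 200 packets; if loss has pulled inflight_hi down to 150,
the sketch bounds inflight at 150 rather than trusting the BDP estimate.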
o Space: Note that ICSK_CA_PRIV_SIZE increased. This is because BBR v2
requires more space. Much of the added space supports per-socket
parameterization and debugging, which this release includes for
research purposes. With that state removed, the full "struct bbr" is 140
bytes, or 144 with padding. This is an increase of 40 bytes over the
existing ca_priv space.
o Code: BBR v2 reuses many pieces from BBR v1. But it omits the following
significant pieces:
  o "packet conservation" (bbr_set_cwnd_to_recover_or_restore(),
    bbr_can_grow_inflight())
  o long-term bandwidth estimator ("policer mode")
The code layout tries to keep BBR v2 code near the bottom of the
file, so that v1-applicable code in the top does not accidentally
refer to v2 code.
o Docs:
See the following docs for more details and diagrams describing the BBR v2
algorithm:
https://datatracker.ietf.org/meeting/104/materials/slides-104-iccrg-an-update-on-bbr-00
https://datatracker.ietf.org/meeting/102/materials/slides-102-iccrg-an-update-on-bbr-work-at-google-00
o Internal notes:
For this upstream rebase, Neal started from:
git show fed518041ac6:net/ipv4/tcp_bbr.c > net/ipv4/tcp_bbr.c
then removed dev instrumentation (dynamic get/set for parameters)
and code that was only used by BBRv1
Effort: net-tcp_bbr
Origin-9xx-SHA1: 2c84098e60bed6d67dde23cd7538c51dee273102
Change-Id: I125cf26ba2a7a686f2fa5e87f4c2afceb65f7a05