net-tcp_bbr: v3: update TCP "bbr" congestion control module to BBRv3
BBR v3 is an enhancement to the BBR v1 algorithm. It's designed to aim for lower
queues, lower loss, and better Reno/CUBIC coexistence than BBR v1.

BBR v3 maintains the core of BBR v1: an explicit model of the network
path that is two-dimensional, adapting to estimate the (a) maximum
available bandwidth and (b) maximum safe volume of data a flow can
keep in-flight in the network. It maintains the estimated BDP as a
core guide for estimating an appropriate level of in-flight data.

BBR v3 makes several key enhancements:

o Its bandwidth-probing time scale is adapted, within bounds, to allow improved
coexistence with Reno and CUBIC. The bandwidth-probing time scale is (a)
extended dynamically based on estimated BDP to improve coexistence with
Reno/CUBIC; (b) bounded by an interactive wall-clock time-scale to be more
scalable and responsive than Reno and CUBIC.

o Rather than being largely agnostic to loss and ECN marks, it explicitly uses
loss and (DCTCP-style) ECN signals to maintain its model.

o It aims for lower losses than v1 by adjusting its model to attempt to stay
within loss rate and ECN mark rate bounds (loss_thresh and ecn_thresh,
respectively).
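
A minimal sketch of the kind of bound check described above, using integer
math only. The helper name rate_exceeds() and the threshold values in the usage
note are hypothetical; the commit message names loss_thresh and ecn_thresh but
does not give their values here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if events/total is strictly above the
 * threshold thresh_num/thresh_den. Cross-multiplies in 64 bits to avoid
 * division and 32-bit overflow.
 */
static bool rate_exceeds(uint32_t events, uint32_t total,
			 uint32_t thresh_num, uint32_t thresh_den)
{
	if (total == 0)
		return false;	/* nothing delivered: no rate to compare */
	return (uint64_t)events * thresh_den > (uint64_t)total * thresh_num;
}
```

For example, with an assumed threshold of 2/100, rate_exceeds(3, 100, 2, 100)
is true and rate_exceeds(2, 100, 2, 100) is false.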

o It adapts to loss/ECN signals even when the application is running out of
data ("application-limited"), in case the "application-limited" flow is also
"network-limited" (the bw and/or inflight available to this flow is lower than
previously estimated when the flow ran out of data).

o It has a three-part model: the model explicitly tracks three operating points,
where an operating point is a tuple: (bandwidth, inflight). The three operating
points are:

  o latest:        the latest measurement from the current round trip
  o upper bound:   robust, optimistic, long-term upper bound
  o lower bound:   robust, conservative, short-term lower bound

These are stored in the following state variables:

  o latest:  bw_latest, inflight_latest
  o lo:      bw_lo,     inflight_lo
  o hi:      bw_hi[2],  inflight_hi

To gain intuition about the meaning of the three operating points, it
may help to consider the analogs in CUBIC, which has a somewhat
analogous three-part model used by its probing state machine:

  BBR param     CUBIC param
  -----------   -------------
  latest     ~  cwnd
  lo         ~  ssthresh
  hi         ~  last_max_cwnd

The analogy is only a loose one, though, since the BBR operating
points are calculated differently, and are 2-dimensional (bw,inflight)
rather than CUBIC's one-dimensional notion of operating point
(inflight).
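
As a rough illustration of the three-part model above, the sketch below groups
the state variables named in this commit message into (bw, inflight) pairs.
The struct and helper names (op_point, bbr_model_sketch,
model_inflight_bound) are hypothetical, and the field widths and the
"tighter of lo and hi" rule are assumptions, not the layout or logic in
tcp_bbr.c.

```c
#include <stdint.h>

/* Hypothetical illustration; not the "struct bbr" layout in tcp_bbr.c. */
struct op_point {
	uint32_t bw;		/* bandwidth estimate, arbitrary units */
	uint32_t inflight;	/* volume of data in flight, in packets */
};

struct bbr_model_sketch {
	struct op_point latest;	/* bw_latest, inflight_latest */
	struct op_point lo;	/* bw_lo,     inflight_lo     */
	struct op_point hi;	/* bw_hi (stored as bw_hi[2]), inflight_hi */
};

/* Assumption: the usable inflight bound is the tighter of lo and hi. */
static uint32_t model_inflight_bound(const struct bbr_model_sketch *m)
{
	return m->lo.inflight < m->hi.inflight ? m->lo.inflight
					       : m->hi.inflight;
}
```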

o It uses the three-part model to adapt the magnitude of its bandwidth
probing to match the estimated space available in the buffer, rather than (as
in BBR v1) assuming that it was always acceptable to place 0.25*BDP in
the bottleneck buffer when probing (commodity datacenter switches
commonly do not have that much buffer for WAN flows). When BBR v3
estimates it hit a buffer limit during probing, its bandwidth probing
then starts gently in case little space is still available in the
buffer, and then accelerates, slowly at first and then rapidly if it
can grow inflight without seeing congestion signals. In such cases,
probing is bounded by inflight_hi + inflight_probe, where
inflight_probe grows as: [0, 1, 2, 4, 8, 16,...]. This allows BBR to
keep losses low and bounded if a bottleneck remains congested, while
rapidly/scalably utilizing free bandwidth when it becomes available.
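
The growth sequence and probing bound described above can be sketched as
follows. The helper names are hypothetical, and the sketch says nothing about
when each growth step is taken, which this commit message does not specify.

```c
#include <stdint.h>

/* Hypothetical helper: next value in the sequence 0, 1, 2, 4, 8, 16, ... */
static uint32_t next_inflight_probe(uint32_t inflight_probe)
{
	if (inflight_probe == 0)
		return 1;
	if (inflight_probe > UINT32_MAX / 2)
		return UINT32_MAX;	/* saturate instead of overflowing */
	return inflight_probe * 2;
}

/* Bound on inflight while probing: inflight_hi + inflight_probe. */
static uint32_t probe_inflight_bound(uint32_t inflight_hi,
				     uint32_t inflight_probe)
{
	uint64_t bound = (uint64_t)inflight_hi + inflight_probe;

	return bound > UINT32_MAX ? UINT32_MAX : (uint32_t)bound;
}
```

Starting from inflight_probe = 0, repeated calls to next_inflight_probe()
yield 1, 2, 4, 8, 16, so the bound initially rises by only a packet or two
above inflight_hi and then doubles its headroom each step.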

o It has a slightly revised state machine, to achieve the goals above.
    BBR_BW_PROBE_UP:    pushes up inflight to probe for bw/vol
    BBR_BW_PROBE_DOWN:  drain excess inflight from the queue
    BBR_BW_PROBE_CRUISE: use pipe, w/ headroom in queue/pipe
    BBR_BW_PROBE_REFILL: try to refill the pipe again to 100%, leaving queue empty
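
A small sketch of how these four phases might cycle. The phase names come from
the list above; the cycle order DOWN -> CRUISE -> REFILL -> UP is an
assumption based on public descriptions of the algorithm, the enum values are
whatever the compiler assigns, and next_phase() is a hypothetical helper that
ignores the timing and congestion conditions that drive real transitions.

```c
/* Names from the commit message; numeric values are not claimed to match
 * tcp_bbr.c. */
enum bbr_bw_probe_phase {
	BBR_BW_PROBE_UP,	/* push up inflight to probe for bw/vol */
	BBR_BW_PROBE_DOWN,	/* drain excess inflight from the queue */
	BBR_BW_PROBE_CRUISE,	/* use pipe, with headroom in queue/pipe */
	BBR_BW_PROBE_REFILL,	/* refill the pipe, leaving queue empty */
};

/* Assumed cycle order: DOWN -> CRUISE -> REFILL -> UP -> DOWN. */
static enum bbr_bw_probe_phase next_phase(enum bbr_bw_probe_phase p)
{
	switch (p) {
	case BBR_BW_PROBE_DOWN:   return BBR_BW_PROBE_CRUISE;
	case BBR_BW_PROBE_CRUISE: return BBR_BW_PROBE_REFILL;
	case BBR_BW_PROBE_REFILL: return BBR_BW_PROBE_UP;
	case BBR_BW_PROBE_UP:     return BBR_BW_PROBE_DOWN;
	}
	return BBR_BW_PROBE_DOWN;	/* defensive default for bad input */
}
```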

o The estimated BDP: BBR v3 continues to maintain an estimate of the
path's two-way propagation delay, by tracking a windowed min_rtt, and
coordinating (on an as-needed basis) to try to expose the two-way
propagation delay by draining the bottleneck queue.

BBR v3 continues to use its min_rtt and (currently-applicable) bandwidth
estimate to estimate the current bandwidth-delay product. The estimated BDP
still provides one important guideline for bounding inflight data. However,
because any min-filtered RTT and max-filtered bw inherently tend to both
overestimate, the estimated BDP is often too high; in this case loss or ECN
marks can ensue, in which case BBR v3 adjusts inflight_hi and inflight_lo to
adapt its sending rate and inflight down to match the available capacity of the
path.
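
The BDP estimate described above is the product of the bandwidth estimate and
min_rtt. A minimal sketch of that arithmetic, assuming bandwidth in bytes per
second and min_rtt in microseconds (the kernel uses its own fixed-point units,
which are not shown in this commit message); bdp_bytes() is a hypothetical
helper:

```c
#include <stdint.h>

/*
 * Hypothetical helper: BDP in bytes from a bandwidth in bytes/sec and a
 * min_rtt in microseconds. The intermediate product can overflow 64 bits
 * for extreme inputs; this sketch does not guard against that.
 */
static uint64_t bdp_bytes(uint64_t bw_bytes_per_sec, uint32_t min_rtt_us)
{
	return bw_bytes_per_sec * min_rtt_us / 1000000ULL;
}
```

For example, a 100 Mbit/s path (12,500,000 bytes/sec) with a 40 ms min_rtt
gives bdp_bytes(12500000, 40000) == 500000 bytes.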

o Space: Note that ICSK_CA_PRIV_SIZE increased. This is because BBR v3
requires more space. Note that much of the space is due to support for
per-socket parameterization and debugging in this release, for research
purposes. With that state removed, the full "struct bbr" is 140
bytes, or 144 with padding. This is an increase of 40 bytes over the
existing ca_priv space.
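
The size arithmetic above can be checked at compile time. This is only a
sketch: the macro and helper names are hypothetical, and the two sizes are the
104 and 144 byte values that appear in this commit's change to
inet_connection_sock.h.

```c
#include <stddef.h>
#include <stdint.h>

#define CA_PRIV_SIZE_OLD 104u	/* previous icsk_ca_priv size, in bytes */
#define CA_PRIV_SIZE_NEW 144u	/* size after this commit, in bytes */

/* Number of u64 slots backing a ca_priv area of the given byte size. */
static size_t ca_priv_slots(size_t bytes)
{
	return bytes / sizeof(uint64_t);
}

_Static_assert(CA_PRIV_SIZE_NEW - CA_PRIV_SIZE_OLD == 40,
	       "ca_priv grows by 40 bytes");
_Static_assert(CA_PRIV_SIZE_NEW % sizeof(uint64_t) == 0,
	       "ca_priv must be a whole number of u64 slots");
```

With these values the array grows from 13 to 18 u64 slots.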

o Code: BBR v3 reuses many pieces from BBR v1. But it omits the following
  significant pieces:

  o "packet conservation" (bbr_set_cwnd_to_recover_or_restore(),
    bbr_can_grow_inflight())
  o long-term bandwidth estimator ("policer mode")

  The code layout tries to keep BBR v3 code near the bottom of the
  file, so that v1-applicable code in the top does not accidentally
  refer to v3 code.

o Docs:
  See the following docs for more details and diagrams describing the BBR v3
  algorithm:
    https://datatracker.ietf.org/meeting/104/materials/slides-104-iccrg-an-update-on-bbr-00
    https://datatracker.ietf.org/meeting/102/materials/slides-102-iccrg-an-update-on-bbr-work-at-google-00

o Internal notes:
  For this upstream rebase, Neal started from:
    git show fed518041ac6:net/ipv4/tcp_bbr.c > net/ipv4/tcp_bbr.c
  then removed dev instrumentation (dynamic get/set for parameters)
  and code that was only used by BBRv1

Effort: net-tcp_bbr
Origin-9xx-SHA1: 2c84098e60bed6d67dde23cd7538c51dee273102
Change-Id: I125cf26ba2a7a686f2fa5e87f4c2afceb65f7a05
Signed-off-by: Alexandre Frade <kernel@xanmod.org>
nealcardwell authored and xanmod committed Aug 16, 2023
1 parent 3f63752 commit 8e2c994
Showing 5 changed files with 1,742 additions and 525 deletions.
4 changes: 2 additions & 2 deletions include/net/inet_connection_sock.h
@@ -135,8 +135,8 @@ struct inet_connection_sock {
 	u32 icsk_probes_tstamp;
 	u32 icsk_user_timeout;

-	u64 icsk_ca_priv[104 / sizeof(u64)];
-#define ICSK_CA_PRIV_SIZE sizeof_field(struct inet_connection_sock, icsk_ca_priv)
+#define ICSK_CA_PRIV_SIZE (144)
+	u64 icsk_ca_priv[ICSK_CA_PRIV_SIZE / sizeof(u64)];
 };

 #define ICSK_TIME_RETRANS 1 /* Retransmit timer */
2 changes: 1 addition & 1 deletion include/net/tcp.h
@@ -2232,7 +2232,7 @@ struct tcp_plb_state {
 	u8 consec_cong_rounds:5, /* consecutive congested rounds */
 		unused:3;
 	u32 pause_until; /* jiffies32 when PLB can resume rerouting */
-};
+} __attribute__ ((__packed__));

 static inline void tcp_plb_init(const struct sock *sk,
 				struct tcp_plb_state *plb)
23 changes: 23 additions & 0 deletions include/uapi/linux/inet_diag.h
@@ -229,6 +229,29 @@ struct tcp_bbr_info {
 	__u32 bbr_min_rtt; /* min-filtered RTT in uSec */
 	__u32 bbr_pacing_gain; /* pacing gain shifted left 8 bits */
 	__u32 bbr_cwnd_gain; /* cwnd gain shifted left 8 bits */
+	__u32 bbr_bw_hi_lsb; /* lower 32 bits of bw_hi */
+	__u32 bbr_bw_hi_msb; /* upper 32 bits of bw_hi */
+	__u32 bbr_bw_lo_lsb; /* lower 32 bits of bw_lo */
+	__u32 bbr_bw_lo_msb; /* upper 32 bits of bw_lo */
+	__u8 bbr_mode; /* current bbr_mode in state machine */
+	__u8 bbr_phase; /* current state machine phase */
+	__u8 unused1; /* alignment padding; not used yet */
+	__u8 bbr_version; /* BBR algorithm version */
+	__u32 bbr_inflight_lo; /* lower short-term data volume bound */
+	__u32 bbr_inflight_hi; /* higher long-term data volume bound */
+	__u32 bbr_extra_acked; /* max excess packets ACKed in epoch */
+};
+
+/* TCP BBR congestion control bbr_phase as reported in netlink/ss stats. */
+enum tcp_bbr_phase {
+	BBR_PHASE_INVALID = 0,
+	BBR_PHASE_STARTUP = 1,
+	BBR_PHASE_DRAIN = 2,
+	BBR_PHASE_PROBE_RTT = 3,
+	BBR_PHASE_PROBE_BW_UP = 4,
+	BBR_PHASE_PROBE_BW_DOWN = 5,
+	BBR_PHASE_PROBE_BW_CRUISE = 6,
+	BBR_PHASE_PROBE_BW_REFILL = 7,
 };

 union tcp_cc_info {
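
A userspace consumer of the new tcp_bbr_info fields might decode them roughly
as sketched below. The helper names are hypothetical and the block does not
include the uapi header; it relies only on the field comments in the hunk
above (split 32-bit halves, gains "shifted left 8 bits") and the
tcp_bbr_phase values listed there.

```c
#include <stdint.h>

/* Rebuild a 64-bit value such as bw_hi from its _lsb and _msb halves. */
static uint64_t bbr_join_u64(uint32_t lsb, uint32_t msb)
{
	return ((uint64_t)msb << 32) | lsb;
}

/* Gains are shifted left 8 bits, so 256 represents a gain of 1.0. */
static double bbr_gain_to_double(uint32_t gain)
{
	return gain / 256.0;
}

/* Map a bbr_phase value to a short name; values as in enum tcp_bbr_phase. */
static const char *bbr_phase_name(uint8_t phase)
{
	switch (phase) {
	case 1: return "STARTUP";
	case 2: return "DRAIN";
	case 3: return "PROBE_RTT";
	case 4: return "PROBE_BW_UP";
	case 5: return "PROBE_BW_DOWN";
	case 6: return "PROBE_BW_CRUISE";
	case 7: return "PROBE_BW_REFILL";
	default: return "INVALID";
	}
}
```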
21 changes: 12 additions & 9 deletions net/ipv4/Kconfig
@@ -668,15 +668,18 @@ config TCP_CONG_BBR
 	default n
 	help

-	  BBR (Bottleneck Bandwidth and RTT) TCP congestion control aims to
-	  maximize network utilization and minimize queues. It builds an explicit
-	  model of the bottleneck delivery rate and path round-trip propagation
-	  delay. It tolerates packet loss and delay unrelated to congestion. It
-	  can operate over LAN, WAN, cellular, wifi, or cable modem links. It can
-	  coexist with flows that use loss-based congestion control, and can
-	  operate with shallow buffers, deep buffers, bufferbloat, policers, or
-	  AQM schemes that do not provide a delay signal. It requires the fq
-	  ("Fair Queue") pacing packet scheduler.
+	  BBR (Bottleneck Bandwidth and RTT) TCP congestion control is a
+	  model-based congestion control algorithm that aims to maximize
+	  network utilization, keep queues and retransmit rates low, and to be
+	  able to coexist with Reno/CUBIC in common scenarios. It builds an
+	  explicit model of the network path. It tolerates a targeted degree
+	  of random packet loss and delay. It can operate over LAN, WAN,
+	  cellular, wifi, or cable modem links, and can use shallow-threshold
+	  ECN signals. It can coexist to some degree with flows that use
+	  loss-based congestion control, and can operate with shallow buffers,
+	  deep buffers, bufferbloat, policers, or AQM schemes that do not
+	  provide a delay signal. It requires pacing, using either TCP internal
+	  pacing or the fq ("Fair Queue") pacing packet scheduler.

 choice
 	prompt "Default TCP congestion control"