net-tcp_bbr: v2: BBRv2 ("bbr2") congestion control for Linux TCP
BBR v2 is an enhancement to the BBR v1 algorithm. It is designed to aim for
lower queues, lower loss, and better Reno/CUBIC coexistence than BBR v1.

BBR v2 maintains the core of BBR v1: an explicit model of the network
path that is two-dimensional, adapting to estimate the (a) maximum
available bandwidth and (b) maximum safe volume of data a flow can
keep in-flight in the network. It maintains the estimated BDP as a
core guide for estimating an appropriate level of in-flight data.
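
As a minimal sketch (not the kernel code): the two model dimensions
combine into the estimated BDP roughly as follows, with packet-based
units assumed for illustration:

  #include <stdint.h>

  /* Illustrative only: BDP = estimated max bw * estimated min RTT. */
  static uint64_t bdp_packets(uint64_t bw_pkts_per_sec, uint32_t min_rtt_us)
  {
          return bw_pkts_per_sec * min_rtt_us / 1000000; /* usec -> sec */
  }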

BBR v2 makes several key enhancements:

o Its bandwidth-probing time scale is adapted, within bounds, to allow improved
coexistence with Reno and CUBIC. The bandwidth-probing time scale is (a)
extended dynamically based on estimated BDP to improve coexistence with
Reno/CUBIC; (b) bounded by an interactive wall-clock time-scale to be more
scalable and responsive than Reno and CUBIC.
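
To make the two bounds concrete, here is a sketch of how a probe
deadline could combine them; the 63-round bound and the 2-3 second
wall-clock cap are illustrative assumptions, not necessarily the
patch's exact constants:

  #include <stdint.h>

  struct probe_plan {
          uint32_t rounds;  /* probe after this many round trips... */
          uint32_t wait_us; /* ...or after this much wall-clock time */
  };

  /* Reno/CUBIC's time between loss events grows with cwnd, so scaling
   * the probe interval with the estimated BDP (in packets) mimics
   * their cadence, while the wall-clock cap keeps probing responsive
   * on long-RTT paths.
   */
  static struct probe_plan plan_bw_probe(uint32_t bdp_pkts, uint32_t rand_us)
  {
          struct probe_plan p;

          p.rounds = bdp_pkts < 63 ? bdp_pkts : 63;
          p.wait_us = 2000000 + rand_us % 1000000;
          return p;
  }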

o Rather than being largely agnostic to loss and ECN marks, it explicitly uses
loss and (DCTCP-style) ECN signals to maintain its model.

o It aims for lower losses than v1 by adjusting its model to attempt to stay
within loss rate and ECN mark rate bounds (loss_thresh and ecn_thresh,
respectively).
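
A sketch of the bound check, assuming the 2% loss and 50% ECN mark
thresholds presented in the IETF materials (treat the exact values as
assumptions):

  #include <stdint.h>

  static int exceeds_congestion_bounds(uint32_t lost, uint32_t delivered,
                                       uint32_t delivered_ce)
  {
          if (!delivered)
                  return 0;
          /* loss rate > 2% or CE mark rate > 50% over the interval? */
          return (uint64_t)lost * 100 > (uint64_t)delivered * 2 ||
                 (uint64_t)delivered_ce * 2 > delivered;
  }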

o It adapts to loss/ECN signals even when the application is running out of
data ("application-limited"), in case the "application-limited" flow is also
"network-limited" (the bw and/or inflight available to this flow is lower than
previously estimated when the flow ran out of data).
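
In sketch form: a sample can always pull the short-term lower bounds
down, even when application-limited; it is only raising the estimates
that must be gated on the sample not being app-limited (names here
are illustrative):

  #include <stdint.h>

  static void adapt_lower_bounds(uint32_t sample_bw, uint32_t *bw_lo)
  {
          /* Adapt downward even for app-limited samples: the flow may
           * also be network-limited, so the bad news is still valid.
           */
          if (sample_bw < *bw_lo)
                  *bw_lo = sample_bw;
  }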

o It has a three-part model: the model explicitly tracks three operating
points, where an operating point is a tuple: (bandwidth, inflight). The
three operating points are:

  o latest:        the latest measurement from the current round trip
  o upper bound:   robust, optimistic, long-term upper bound
  o lower bound:   robust, conservative, short-term lower bound

These are stored in the following state variables (a struct sketch
follows this list):

  o latest:  bw_latest, inflight_latest
  o lo:      bw_lo,     inflight_lo
  o hi:      bw_hi[2],  inflight_hi
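
In sketch form, the model state mirrors the variables above (the
field layout here is illustrative, not the kernel struct):

  #include <stdint.h>

  struct bbr_model_sketch {
          /* latest: raw measurement from the current round trip */
          uint32_t bw_latest, inflight_latest;
          /* lo: robust, conservative, short-term lower bound */
          uint32_t bw_lo, inflight_lo;
          /* hi: robust, optimistic, long-term upper bound;
           * bw_hi[2] holds two samples for a windowed max filter
           */
          uint32_t bw_hi[2], inflight_hi;
  };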

To gain intuition about the meaning of the three operating points, it
may help to consider the analogs in CUBIC, which has a somewhat
analogous three-part model used by its probing state machine:

  BBR param     CUBIC param
  -----------   -------------
  latest     ~  cwnd
  lo         ~  ssthresh
  hi         ~  last_max_cwnd

The analogy is only a loose one, though, since the BBR operating
points are calculated differently, and are 2-dimensional (bw,inflight)
rather than CUBIC's one-dimensional notion of operating point
(inflight).

o It uses the three-part model to adapt the magnitude of its bandwidth
probing to match the estimated space available in the buffer, rather
than (as in BBR v1) assuming that it was always acceptable to place
0.25*BDP in the bottleneck buffer when probing (commodity datacenter
switches commonly do not have that much buffer for WAN flows). When
BBR v2 estimates it hit a buffer limit during probing, its bandwidth
probing then starts gently in case little space is still available in
the buffer, and then accelerates, slowly at first and then rapidly if
it can grow inflight without seeing congestion signals. In such cases,
probing is bounded by inflight_hi + inflight_probe, where
inflight_probe grows as: [0, 1, 2, 4, 8, 16,...]. This allows BBR to
keep losses low and bounded if a bottleneck remains congested, while
rapidly/scalably utilizing free bandwidth when it becomes available.
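
A sketch of that growth schedule (illustrative helper, not the
kernel's):

  #include <stdint.h>

  /* Doubling per round yields the 0, 1, 2, 4, 8, 16, ... sequence,
   * so inflight stays bounded by inflight_hi + inflight_probe.
   */
  static uint32_t next_inflight_probe(uint32_t inflight_probe)
  {
          return inflight_probe ? 2 * inflight_probe : 1;
  }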

o It has a slightly revised state machine, to achieve the goals above
(a sketch of the cycle follows this list):
    BBR_BW_PROBE_UP:    pushes up inflight to probe for bw/vol
    BBR_BW_PROBE_DOWN:  drain excess inflight from the queue
    BBR_BW_PROBE_CRUISE: use pipe, w/ headroom in queue/pipe
    BBR_BW_PROBE_REFILL: try to refill the pipe again to 100%, leaving queue empty
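
A sketch of the cycle order implied by those states (the enum values
and transition order here are inferred from the descriptions above):

  enum bbr_bw_probe_phase {
          BBR_BW_PROBE_UP,
          BBR_BW_PROBE_DOWN,
          BBR_BW_PROBE_CRUISE,
          BBR_BW_PROBE_REFILL,
  };

  /* DOWN -> CRUISE -> REFILL -> UP -> DOWN -> ... */
  static enum bbr_bw_probe_phase next_phase(enum bbr_bw_probe_phase p)
  {
          switch (p) {
          case BBR_BW_PROBE_DOWN:         return BBR_BW_PROBE_CRUISE;
          case BBR_BW_PROBE_CRUISE:       return BBR_BW_PROBE_REFILL;
          case BBR_BW_PROBE_REFILL:       return BBR_BW_PROBE_UP;
          case BBR_BW_PROBE_UP:           return BBR_BW_PROBE_DOWN;
          }
          return BBR_BW_PROBE_DOWN;
  }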

o The estimated BDP: BBR v2 continues to maintain an estimate of the
path's two-way propagation delay, by tracking a windowed min_rtt, and
coordinating (on an as-needed basis) to try to expose the two-way
propagation delay by draining the bottleneck queue.

BBR v2 continues to use its min_rtt and (currently-applicable) bandwidth
estimate to estimate the current bandwidth-delay product. The estimated BDP
still provides one important guideline for bounding inflight data. However,
because any min-filtered RTT and max-filtered bw inherently tend to both
overestimate, the estimated BDP is often too high; in this case loss or ECN
marks can ensue, in which case BBR v2 adjusts inflight_hi and inflight_lo to
adapt its sending rate and inflight down to match the available capacity of the
path.
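
A rough composition of that behavior (not the kernel's exact logic,
which applies the bounds in a mode-dependent way):

  #include <stdint.h>

  static uint32_t inflight_target(uint32_t est_bdp, uint32_t inflight_hi,
                                  uint32_t inflight_lo)
  {
          uint32_t target = est_bdp;      /* BDP as the baseline guide */

          if (target > inflight_hi)       /* long-term loss/ECN bound  */
                  target = inflight_hi;
          if (target > inflight_lo)       /* short-term bound after    */
                  target = inflight_lo;   /* recent congestion signals */
          return target;
  }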

o Space: Note that ICSK_CA_PRIV_SIZE increased. This is because BBR v2
requires more space. Much of the space is due to support for
per-socket parameterization and debugging in this release, for
research purposes. With that state removed, the full "struct bbr" is
140 bytes, or 144 with padding. This is an increase of 40 bytes over
the existing ca_priv space.
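
A congestion control module can assert this fit at build time; the
kernel uses BUILD_BUG_ON() for this, shown here as a standalone C11
sketch with a hypothetical stand-in struct:

  #include <assert.h>
  #include <stdint.h>

  #define ICSK_CA_PRIV_SIZE (216)                 /* from this patch */

  struct bbr_sketch { uint64_t state[18]; };      /* 144-byte stand-in */

  static_assert(sizeof(struct bbr_sketch) <= ICSK_CA_PRIV_SIZE,
                "struct bbr must fit in icsk_ca_priv");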

o Code: BBR v2 reuses many pieces from BBR v1. But it omits the following
  significant pieces:

  o "packet conservation" (bbr_set_cwnd_to_recover_or_restore(),
    bbr_can_grow_inflight())
  o long-term bandwidth estimator ("policer mode")

  The code layout tries to keep BBR v2 code near the bottom of the
  file, so that v1-applicable code in the top does not accidentally
  refer to v2 code.

o Docs:
  See the following docs for more details and diagrams describing the BBR v2
  algorithm:
    https://datatracker.ietf.org/meeting/104/materials/slides-104-iccrg-an-update-on-bbr-00
    https://datatracker.ietf.org/meeting/102/materials/slides-102-iccrg-an-update-on-bbr-work-at-google-00

o Internal notes:
  For this upstream rebase, Neal started from:
    git show fed518041ac6:net/ipv4/tcp_bbr.c > net/ipv4/tcp_bbr.c
  then removed dev instrumentation (dynamic get/set for parameters)
  and code that was only used by BBR v1.

Effort: net-tcp_bbr
Origin-9xx-SHA1: 2c84098e60bed6d67dde23cd7538c51dee273102
Change-Id: I125cf26ba2a7a686f2fa5e87f4c2afceb65f7a05

Signed-off-by: Alexandre Frade <kernel@xanmod.org>
nealcardwell authored and xanmod committed Dec 31, 2020
1 parent 2760431 commit b3b500b
Showing 15 changed files with 2,879 additions and 46 deletions.
1 change: 1 addition & 0 deletions include/linux/tcp.h
@@ -226,6 +226,7 @@ struct tcp_sock {
u8 dup_ack_counter:2,
tlp_retrans:1, /* TLP is a retransmission */
unused:5;
u8 fast_ack_mode:2; /* which fast ack mode ? */
u32 chrono_start; /* Start time in jiffies of a TCP chrono */
u32 chrono_stat[3]; /* Time in jiffies for chrono_stat stats */
u8 chrono_type:2, /* current chronograph type */
5 changes: 3 additions & 2 deletions include/net/inet_connection_sock.h
@@ -131,8 +131,9 @@ struct inet_connection_sock {
} icsk_mtup;
u32 icsk_user_timeout;

u64 icsk_ca_priv[104 / sizeof(u64)];
#define ICSK_CA_PRIV_SIZE (13 * sizeof(u64))
/* XXX inflated by temporary internal debugging info */
#define ICSK_CA_PRIV_SIZE (216)
u64 icsk_ca_priv[ICSK_CA_PRIV_SIZE / sizeof(u64)];
};

#define ICSK_TIME_RETRANS 1 /* Retransmit timer */
44 changes: 37 additions & 7 deletions include/net/tcp.h
@@ -792,6 +792,11 @@ static inline u32 tcp_stamp_us_delta(u64 t1, u64 t0)
return max_t(s64, t1 - t0, 0);
}

static inline u32 tcp_stamp32_us_delta(u32 t1, u32 t0)
{
return max_t(s32, t1 - t0, 0);
}

static inline u32 tcp_skb_timestamp(const struct sk_buff *skb)
{
return tcp_ns_to_ts(skb->skb_mstamp_ns);
@@ -859,16 +864,22 @@ struct tcp_skb_cb {
__u32 ack_seq; /* Sequence number ACK'd */
union {
struct {
#define TCPCB_DELIVERED_CE_MASK ((1U<<20) - 1)
/* There is space for up to 24 bytes */
__u32 in_flight:30,/* Bytes in flight at transmit */
is_app_limited:1, /* cwnd not fully used? */
unused:1;
__u32 is_app_limited:1, /* cwnd not fully used? */
delivered_ce:20,
unused:11;
/* pkts S/ACKed so far upon tx of skb, incl retrans: */
__u32 delivered;
/* start of send pipeline phase */
u64 first_tx_mstamp;
u32 first_tx_mstamp;
/* when we reached the "delivered" count */
u64 delivered_mstamp;
u32 delivered_mstamp;
#define TCPCB_IN_FLIGHT_BITS 20
#define TCPCB_IN_FLIGHT_MAX ((1U << TCPCB_IN_FLIGHT_BITS) - 1)
u32 in_flight:20, /* packets in flight at transmit */
unused2:12;
u32 lost; /* packets lost so far upon tx of skb */
} tx; /* only used for outgoing skbs */
union {
struct inet_skb_parm h4;
@@ -1019,6 +1030,8 @@ enum tcp_ca_ack_event_flags {
/* Requires ECN/ECT set on all packets */
#define TCP_CONG_NEEDS_ECN 0x2
#define TCP_CONG_MASK (TCP_CONG_NON_RESTRICTED | TCP_CONG_NEEDS_ECN)
/* Wants notification of CE events (CA_EVENT_ECN_IS_CE, CA_EVENT_ECN_NO_CE). */
#define TCP_CONG_WANTS_CE_EVENTS 0x100000

union tcp_cc_info;

@@ -1038,8 +1051,13 @@ struct ack_sample {
*/
struct rate_sample {
u64 prior_mstamp; /* starting timestamp for interval */
u32 prior_lost; /* tp->lost at "prior_mstamp" */
u32 prior_delivered; /* tp->delivered at "prior_mstamp" */
u32 prior_delivered_ce;/* tp->delivered_ce at "prior_mstamp" */
u32 tx_in_flight; /* packets in flight at starting timestamp */
s32 lost; /* number of packets lost over interval */
s32 delivered; /* number of packets delivered over interval */
s32 delivered_ce; /* packets delivered w/ CE mark over interval */
long interval_us; /* time for tp->delivered to incr "delivered" */
u32 snd_interval_us; /* snd interval for delivered packets */
u32 rcv_interval_us; /* rcv interval for delivered packets */
@@ -1050,6 +1068,7 @@ struct rate_sample {
bool is_app_limited; /* is sample from packet with bubble in pipe? */
bool is_retrans; /* is sample from retransmission? */
bool is_ack_delayed; /* is this (likely) a delayed ACK? */
bool is_ece; /* did this ACK have ECN marked? */
};

struct tcp_congestion_ops {
@@ -1076,10 +1095,12 @@ struct tcp_congestion_ops {
u32 (*undo_cwnd)(struct sock *sk);
/* hook for packet ack accounting (optional) */
void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);
/* override sysctl_tcp_min_tso_segs */
u32 (*min_tso_segs)(struct sock *sk);
/* pick target number of segments per TSO/GSO skb (optional): */
u32 (*tso_segs)(struct sock *sk, unsigned int mss_now);
/* returns the multiplier used in tcp_sndbuf_expand (optional) */
u32 (*sndbuf_expand)(struct sock *sk);
/* react to a specific lost skb (optional) */
void (*skb_marked_lost)(struct sock *sk, const struct sk_buff *skb);
/* call when packets are delivered to update cwnd and pacing rate,
* after all the ca_state processing. (optional)
*/
@@ -1125,6 +1146,14 @@ static inline char *tcp_ca_get_name_by_key(u32 key, char *buffer)
}
#endif

static inline bool tcp_ca_wants_ce_events(const struct sock *sk)
{
const struct inet_connection_sock *icsk = inet_csk(sk);

return icsk->icsk_ca_ops->flags & (TCP_CONG_NEEDS_ECN |
TCP_CONG_WANTS_CE_EVENTS);
}

static inline bool tcp_ca_needs_ecn(const struct sock *sk)
{
const struct inet_connection_sock *icsk = inet_csk(sk);
@@ -1150,6 +1179,7 @@ static inline void tcp_ca_event(struct sock *sk, const enum tcp_ca_event event)
}

/* From tcp_rate.c */
void tcp_set_tx_in_flight(struct sock *sk, struct sk_buff *skb);
void tcp_rate_skb_sent(struct sock *sk, struct sk_buff *skb);
void tcp_rate_skb_delivered(struct sock *sk, struct sk_buff *skb,
struct rate_sample *rs);
33 changes: 33 additions & 0 deletions include/uapi/linux/inet_diag.h
@@ -231,9 +231,42 @@ struct tcp_bbr_info {
__u32 bbr_cwnd_gain; /* cwnd gain shifted left 8 bits */
};

/* Phase as reported in netlink/ss stats. */
enum tcp_bbr2_phase {
BBR2_PHASE_INVALID = 0,
BBR2_PHASE_STARTUP = 1,
BBR2_PHASE_DRAIN = 2,
BBR2_PHASE_PROBE_RTT = 3,
BBR2_PHASE_PROBE_BW_UP = 4,
BBR2_PHASE_PROBE_BW_DOWN = 5,
BBR2_PHASE_PROBE_BW_CRUISE = 6,
BBR2_PHASE_PROBE_BW_REFILL = 7
};

struct tcp_bbr2_info {
/* u64 bw: bandwidth (app throughput) estimate in Byte per sec: */
__u32 bbr_bw_lsb; /* lower 32 bits of bw */
__u32 bbr_bw_msb; /* upper 32 bits of bw */
__u32 bbr_min_rtt; /* min-filtered RTT in uSec */
__u32 bbr_pacing_gain; /* pacing gain shifted left 8 bits */
__u32 bbr_cwnd_gain; /* cwnd gain shifted left 8 bits */
__u32 bbr_bw_hi_lsb; /* lower 32 bits of bw_hi */
__u32 bbr_bw_hi_msb; /* upper 32 bits of bw_hi */
__u32 bbr_bw_lo_lsb; /* lower 32 bits of bw_lo */
__u32 bbr_bw_lo_msb; /* upper 32 bits of bw_lo */
__u8 bbr_mode; /* current bbr_mode in state machine */
__u8 bbr_phase; /* current state machine phase */
__u8 unused1; /* alignment padding; not used yet */
__u8 bbr_version; /* MUST be at this offset in struct */
__u32 bbr_inflight_lo; /* lower/short-term data volume bound */
__u32 bbr_inflight_hi; /* higher/long-term data volume bound */
__u32 bbr_extra_acked; /* max excess packets ACKed in epoch */
};

union tcp_cc_info {
struct tcpvegas_info vegas;
struct tcp_dctcp_info dctcp;
struct tcp_bbr_info bbr;
struct tcp_bbr2_info bbr2;
};
#endif /* _UAPI_INET_DIAG_H_ */
22 changes: 22 additions & 0 deletions net/ipv4/Kconfig
@@ -669,6 +669,24 @@ config TCP_CONG_BBR
AQM schemes that do not provide a delay signal. It requires the fq
("Fair Queue") pacing packet scheduler.

config TCP_CONG_BBR2
tristate "BBR2 TCP"
default n
help

BBR2 TCP congestion control is a model-based congestion control
algorithm that aims to maximize network utilization, keep queues and
retransmit rates low, and to be able to coexist with Reno/CUBIC in
common scenarios. It builds an explicit model of the network path. It
tolerates a targeted degree of random packet loss and delay that are
unrelated to congestion. It can operate over LAN, WAN, cellular, wifi,
or cable modem links, and can use DCTCP-L4S-style ECN signals. It can
coexist with flows that use loss-based congestion control, and can
operate with shallow buffers, deep buffers, bufferbloat, policers, or
AQM schemes that do not provide a delay signal. It requires pacing,
using either TCP internal pacing or the fq ("Fair Queue") pacing packet
scheduler.

choice
prompt "Default TCP congestion control"
default DEFAULT_CUBIC
@@ -706,6 +724,9 @@ choice
config DEFAULT_BBR
bool "BBR" if TCP_CONG_BBR=y

config DEFAULT_BBR2
bool "BBR2" if TCP_CONG_BBR2=y

config DEFAULT_RENO
bool "Reno"
endchoice
@@ -730,6 +751,7 @@ config DEFAULT_TCP_CONG
default "dctcp" if DEFAULT_DCTCP
default "cdg" if DEFAULT_CDG
default "bbr" if DEFAULT_BBR
default "bbr2" if DEFAULT_BBR2
default "cubic"

config TCP_MD5SIG
1 change: 1 addition & 0 deletions net/ipv4/Makefile
@@ -46,6 +46,7 @@ obj-$(CONFIG_INET_TCP_DIAG) += tcp_diag.o
obj-$(CONFIG_INET_UDP_DIAG) += udp_diag.o
obj-$(CONFIG_INET_RAW_DIAG) += raw_diag.o
obj-$(CONFIG_TCP_CONG_BBR) += tcp_bbr.o
obj-$(CONFIG_TCP_CONG_BBR2) += tcp_bbr2.o
obj-$(CONFIG_TCP_CONG_BIC) += tcp_bic.o
obj-$(CONFIG_TCP_CONG_CDG) += tcp_cdg.o
obj-$(CONFIG_TCP_CONG_CUBIC) += tcp_cubic.o
2 changes: 1 addition & 1 deletion net/ipv4/bpf_tcp_ca.c
@@ -16,7 +16,7 @@ static u32 optional_ops[] = {
offsetof(struct tcp_congestion_ops, cwnd_event),
offsetof(struct tcp_congestion_ops, in_ack_event),
offsetof(struct tcp_congestion_ops, pkts_acked),
offsetof(struct tcp_congestion_ops, min_tso_segs),
offsetof(struct tcp_congestion_ops, tso_segs),
offsetof(struct tcp_congestion_ops, sndbuf_expand),
offsetof(struct tcp_congestion_ops, cong_control),
};
2 changes: 1 addition & 1 deletion net/ipv4/tcp.c
@@ -2742,7 +2742,7 @@ int tcp_disconnect(struct sock *sk, int flags)
tp->rx_opt.dsack = 0;
tp->rx_opt.num_sacks = 0;
tp->rcv_ooopack = 0;

tp->fast_ack_mode = 0;

/* Clean up fastopen related fields */
tcp_free_fastopen_req(tp);
39 changes: 26 additions & 13 deletions net/ipv4/tcp_bbr.c
@@ -292,26 +292,39 @@ static void bbr_set_pacing_rate(struct sock *sk, u32 bw, int gain)
sk->sk_pacing_rate = rate;
}

/* override sysctl_tcp_min_tso_segs */
static u32 bbr_min_tso_segs(struct sock *sk)
{
return sk->sk_pacing_rate < (bbr_min_tso_rate >> 3) ? 1 : 2;
}

static u32 bbr_tso_segs_goal(struct sock *sk)
/* Return the number of segments BBR would like in a TSO/GSO skb, given
* a particular max gso size as a constraint.
*/
static u32 bbr_tso_segs_generic(struct sock *sk, unsigned int mss_now,
u32 gso_max_size)
{
struct tcp_sock *tp = tcp_sk(sk);
u32 segs, bytes;
u32 segs;
u64 bytes;

/* Sort of tcp_tso_autosize() but ignoring
* driver provided sk_gso_max_size.
*/
bytes = min_t(unsigned long,
sk->sk_pacing_rate >> READ_ONCE(sk->sk_pacing_shift),
GSO_MAX_SIZE - 1 - MAX_TCP_HEADER);
segs = max_t(u32, bytes / tp->mss_cache, bbr_min_tso_segs(sk));
/* Budget a TSO/GSO burst size allowance based on bw (pacing_rate). */
bytes = sk->sk_pacing_rate >> sk->sk_pacing_shift;

bytes = min_t(u32, bytes, gso_max_size - 1 - MAX_TCP_HEADER);
segs = max_t(u32, bytes / mss_now, bbr_min_tso_segs(sk));
return segs;
}

return min(segs, 0x7FU);
/* Custom tcp_tso_autosize() for BBR, used at transmit time to cap skb size. */
static u32 bbr_tso_segs(struct sock *sk, unsigned int mss_now)
{
return bbr_tso_segs_generic(sk, mss_now, sk->sk_gso_max_size);
}

/* Like bbr_tso_segs(), using mss_cache, ignoring driver's sk_gso_max_size. */
static u32 bbr_tso_segs_goal(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
return bbr_tso_segs_generic(sk, tp->mss_cache, GSO_MAX_SIZE);
}

/* Save "last known good" cwnd so we can restore it after losses or PROBE_RTT */
@@ -1147,7 +1160,7 @@ static struct tcp_congestion_ops tcp_bbr_cong_ops __read_mostly = {
.undo_cwnd = bbr_undo_cwnd,
.cwnd_event = bbr_cwnd_event,
.ssthresh = bbr_ssthresh,
.min_tso_segs = bbr_min_tso_segs,
.tso_segs = bbr_tso_segs,
.get_info = bbr_get_info,
.set_state = bbr_set_state,
};
