
Sender-Controlled Delayed ACK Ratio #1978

Closed
bbriscoe opened this issue Nov 7, 2018 · 30 comments
Labels
-recovery -transport design future-version parked

Comments

@bbriscoe commented Nov 7, 2018

Yesterday Jana said the WG is open to the possibility of a sender-controlled ack_ratio.

A simpler approach: a 3-bit field in each direction, in every packet, carrying ack_exp where

ack_ratio = 2^ack_exp

Exactly how many bits to use is up for discussion, but I believe ack_ratio=16 and growing is already common in DCs, so 2 bits isn't quite enough.

Receiving ack_exp would be defined to mean: "From now, please try to ACK at least every 2^ack_exp packets". No negotiation. No need to detect loss of messages. Just repetition of a hint.

Of course a variable can be constant, e.g. for implementers of a sender who want to leave adaptation code for later, just set ack_exp=1 in every packet.

I think this would address all the scenarios in #1428 (related but closed).
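For concreteness, the proposed encoding could be sketched as follows (the helper names are mine, not part of any spec; only the 3-bit field and ack_ratio = 2^ack_exp are from the proposal):

```python
# Sketch of the proposed 3-bit ack_exp field (hypothetical helper names).
# The wire value is ack_exp; the meaning is ack_ratio = 2**ack_exp,
# so 3 bits cover ratios from 1 up to 128.

def encode_ack_exp(desired_ack_ratio: int) -> int:
    """Sender side: largest 3-bit exponent with 2**exp <= desired_ack_ratio."""
    assert desired_ack_ratio >= 1
    return min(desired_ack_ratio.bit_length() - 1, 7)

def decode_ack_ratio(ack_exp: int) -> int:
    """Receiver side: 'please ACK at least every 2**ack_exp packets'."""
    assert 0 <= ack_exp <= 7
    return 1 << ack_exp
```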

Problem/goals:

  1. Getting up to speed fast without inducing much queue is one of the main areas where latency reductions are needed (for short and long flows). The sender needs frequent ACKing during this phase in many approaches (HyStart, etc.). A Linux TCP receiver starts with ack_ratio 1 and uses heuristics to identify the end of the sender's slow-start (or a re-start after idle). This has ossified a certain slow-start behaviour into all TCP senders (whether Linux or not). We are trying to enable new sender behaviours (paced chirping being an example), but we need a way for the sender to suppress delayed ACKs.
  2. At the other end of the scale, for long-running flows, many current CC approaches (e.g. BBR) use pacing not ACK-clocking so they can use very few ACKs per RTT. And fewer means less cache ejections for GRO. But the receiver doesn't know what CC the sender is using, so it doesn't know how many or few is too many or too few.
  3. Some middleboxes and the majority of link technologies (e.g. DOCSIS, LTE, Satellite) thin TCP ACKs when they detect the upstream is filled with TCP ACK stream(s), which are unresponsive. Otherwise the ACKs constrain downstream throughput (and any upstream data either in the flow itself, or in others). We don't want these links to attempt to guess which are the QUIC ACKs and try to thin them. QUIC can and should do this itself. QUIC has all the machinery for the sender CC to detect ACK congestion, but not the protocol to tell the receiver to do the thinning. This protocol addition would provide a sufficient hook for hosts to unilaterally add this to their CC behaviour.
@bbriscoe commented Nov 7, 2018

Some scenarios where the sender's preferred ack_ratio changes through the connection:

  1. A sender CC that wants the receiver to turn off DelAcks during flow-start (e.g. it's using hybrid slow-start as in Cubic and wants to get delay measurements more frequently) sets ack_exp=0 during flow-start (ack_ratio=1), then increases ack_exp during congestion avoidance. If it goes idle, then re-starts, it would set ack_exp=0 again.

    Note on heuristics: A Linux TCP receiver currently uses a heuristic to determine when the sender has exited slow-start. However, heuristics -> ossification. A Linux receiver's heuristic only works with the current pattern of slow-start. In TCP, when we tried to improve the pattern on the sender (paced chirping), the heuristic on the receiver killed us.

  2. Imagine a paced sender has hardware generic receive offload (GRO), so for a long-running flow it doesn't want a high rate of QUIC ACKs that are opaque to GRO. Let's say it would prefer at least 8 ACKs per RTT. Again, it starts with ack_exp=0, but in congestion avoidance it would use:

ack_ratio <= cwnd_in_packets/8

Upshot: By remote controlling the receiver, the server offloads nearly all the ACKs from large downloads, but still focuses its ACK-receiving resources on getting each client up to speed.

Note that calculation of ack_ratio lends itself to fast integer arithmetic.
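As a sketch of that integer arithmetic (names are illustrative, assuming the 3-bit ack_exp field proposed in the first post):

```python
def ack_exp_for_cwnd(cwnd_in_packets: int, min_acks_per_rtt: int = 8) -> int:
    """Pick ack_exp so that ack_ratio = 2**ack_exp <= cwnd/min_acks_per_rtt,
    using only integer ops (bit_length is a cheap floor(log2))."""
    target = cwnd_in_packets // min_acks_per_rtt
    if target < 1:
        return 0  # cwnd too small to thin ACKs; ask for an ACK per packet
    return min(target.bit_length() - 1, 7)  # clamp to the 3-bit field
```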

@bbriscoe commented Nov 7, 2018

Units

Packets? bytes? time? fraction of RTT?

Think of this question in two stages:
Q1. what units the sender starts from
Q2. what units the message is in, which the sender has to derive from Q1

If the message were a fraction of 1 RTT, the receiver would have to calculate the sender's RTT. It can do that, using ACKs of ACKs, but that implies work and delay.

In all cases that I could see in #1428, the sender had all the info to translate from one unit to another (it knows its own SMSS, its window, its RTT). This leaves Q2, to which a good enough answer is packets (which also compresses the message size).

@ianswett commented Nov 7, 2018

I agree it should change throughout the connection, but I believe the right mechanism for this would be a combination of a transport param and an UPDATE_TRANSPORT_PARAMS frame. It turns out there's at least one other use for this frame already, which is initial stream flow control limit.

One reason why I believe this is a better approach is there's nowhere good to stick 3 bits in every QUIC packet. Another reason is I expect people will not want to update this value that often, maybe once every few round trips at most.

An ability to request a MaxAckDelay would also be useful, but it would likely require a new MinMaxAckDelay transport param to inform the sender of the minimum value the sender could request and expect it to be honored.

@ianswett commented Nov 7, 2018

Also, would you mind restating(editing) your original comment with information about the issue you're trying to solve and your goals? I think I know what mine are, but I'd prefer you state yours since you filed the issue.

@bbriscoe commented Nov 8, 2018

Regarding whether the interaction mode is 'always repeat' or 'set and change'. I have no strong views. My goal was to suggest the simplest possible protocol that would induce minimal push-back from implementers. Also, perhaps one that could be back-ported to TCP, to leverage shared experience.

On the question of whether ack_ratio would change frequently, if the sender is routinely setting it to cwnd/8 (say) it will change frequently. Similarly if the sender has implemented AckCC (but I suspect AckCC won't often be active).

Whatever, this mechanism is not an invariant, so the approach can be changed based on experience.

@ianswett I've added goals (my subsequent scenarios posting was intended to serve that purpose, but yes that's good practice for a first post).

@siyengar commented Nov 8, 2018

I think this is great, but would definitely add a requirement that this behavior should be a SHOULD instead of a MUST. The receiver should be able to say no to a too-small value, for instance. Sending more frequent ACKs over UDP can generate a lot of small packets, which has a CPU impact. To prevent DoS, the receiver should be able to reject insane values.

@mikkelfj commented Nov 8, 2018

I think sending an update frame is right but I really would like to avoid a generic UPDATE_TRANSPORT_PARAMS frame because it vastly complicates implementations that distribute settings to subcomponents.

@martinthomson martinthomson added design -transport -recovery labels Nov 8, 2018
@janaiyengar commented Nov 12, 2018

I think this is basically good and doable. I agree with @siyengar that it should be a SHOULD; there is no reason to require that a peer ack every K packets, especially since it's foolish to do so if the peer receives more than K packets in one receive event.

On mechanism: if there is enough argument for doing an UPDATE_TP, that might be a way, but that's too heavy-weight for just this parameter and I don't think this parameter is compelling enough for it. OTOH, we have only two bits left (https://github.com/quicwg/wg-materials/blob/master/ietf103/octet0.pdf), which is not enough I think even for Internet environments.

How about we do a TP and then follow it up with an UPDATE_ACK_RATIO frame? This can be as small as 2 bytes in all imaginable cases, making the encoding really low-overhead.
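For illustration, a 2-byte frame is plausible with QUIC's variable-length integer encoding (the frame type 0x30 below is made up; no codepoint was assigned):

```python
def quic_varint(v: int) -> bytes:
    """QUIC variable-length integer: the two MSBs encode the length."""
    if v < 0x40:
        return v.to_bytes(1, "big")
    if v < 0x4000:
        return (v | 0x4000).to_bytes(2, "big")
    if v < 0x40000000:
        return (v | 0x80000000).to_bytes(4, "big")
    return (v | 0xC000000000000000).to_bytes(8, "big")

def update_ack_ratio_frame(ack_ratio: int, frame_type: int = 0x30) -> bytes:
    """Hypothetical UPDATE_ACK_RATIO frame: [type][ack_ratio], both varints.
    Two bytes total whenever both values are below 64."""
    return quic_varint(frame_type) + quic_varint(ack_ratio)
```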

@ianswett commented Nov 13, 2018

@janaiyengar, if we're not going to add a generic UPDATE_TRANSPORT_PARAMs frame, then I think it makes more sense to ONLY have an UPDATE_ACK_RATIO/DELAY/etc frame and not add any new transport params, since I think what we have now is good enough to complete the handshake?

I'm favoring adding the ability to update two params:

  1. MaxAckDelay (default 25ms)
  2. MaxPacketsBeforeAck (currently always 2) - aka how many packets can be received before sending an acknowledgement.

Even though gQUIC has had success with fraction of RTT, I think it's less robust than specifying an actual amount of time, and specifying an amount of time allows for more precise TLP and RTO timeouts, which is very nice.

I agree there need to be bounds on how small an ack delay a sender can request and how small a MaxPacketsBeforeAck can be. I'd prefer a hard failure (i.e. connection close) to a "I'm just not going to do what you ask and not tell you", so I'd prefer a transport param or other mechanism to communicate the MinMaxAckDelay for a connection, and a rule that you can't request a MaxPacketsBeforeAck smaller than 2.
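A receiver-side bounds check along those lines might look like this (the MinMaxAckDelay parameter and the error type are placeholders, not agreed protocol):

```python
class ProtocolViolation(Exception):
    """Placeholder for a connection-closing error."""

# Lower bound the peer advertised via a hypothetical MinMaxAckDelay TP.
MIN_MAX_ACK_DELAY_US = 1_000

def validate_ack_request(max_ack_delay_us: int, max_packets_before_ack: int) -> None:
    """Hard-fail (connection close) on out-of-bounds requests, rather than
    silently ignoring them, per the preference stated above."""
    if max_ack_delay_us < MIN_MAX_ACK_DELAY_US:
        raise ProtocolViolation("requested MaxAckDelay below MinMaxAckDelay")
    if max_packets_before_ack < 2:
        raise ProtocolViolation("MaxPacketsBeforeAck must be at least 2")
```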

@mikkelfj commented Nov 13, 2018

The benefit of a TP is that an algorithm that is dead set on, say, a 1/10 ratio does not have to adjust for the initial handshake. Although that probably breaks the handshake. All the more reason for implicit ACKs ...

@ianswett commented Nov 13, 2018

One more thought on fraction of CWND vs # of packets and fraction of RTT vs absolute time is that the receiver doesn't know the sender's CWND or the sender's current RTT. Having exact values removes ambiguity.

@mikkelfj commented Nov 13, 2018

If time based, is the unit microseconds? There can be extremely small RTTs on fast short-distance networks.

@ianswett commented Nov 13, 2018

I think it'd be best to make it microseconds, though I'll note that a lot of environments don't have microsecond timer granularity.

We could re-use the shift value in transport params if we're concerned about constantly consuming 4 bytes for the value. Or we could agree it's in units of 100s of microseconds as a compromise.
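QUIC's ACK Delay field already takes this shift approach: the encoded value is microseconds right-shifted by the ack_delay_exponent transport parameter (default 3, i.e. 8 µs units), which keeps common values small without giving up microsecond-scale resolution:

```python
def encode_ack_delay(delay_us: int, ack_delay_exponent: int = 3) -> int:
    """Encode as in QUIC: microseconds shifted down by ack_delay_exponent."""
    return delay_us >> ack_delay_exponent

def decode_ack_delay(encoded: int, ack_delay_exponent: int = 3) -> int:
    """Decode; precision below 2**ack_delay_exponent microseconds is lost."""
    return encoded << ack_delay_exponent
```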

@mikkelfj commented Nov 13, 2018

For high-speed LANs, latency is only a few µs if the stack can keep up.

@janaiyengar commented Nov 14, 2018

@ianswett commented Nov 14, 2018

But in order to achieve the "ACK about 4 times per RTT" you need to give the sender the ability to change the receiver's MaxAckDelay.

Otherwise, I might be happy with you ACKing 10 packets at once, but I don't want you delaying that ack a full MaxAckDelay.

@janaiyengar commented Nov 14, 2018

@ianswett commented Nov 14, 2018

I was assuming SuggestedAckDelay would replace MaxAckDelay, but I agree it adds some complexities.

Maybe the sender controlling NumPacketsBeforeFastAck(?) is sufficiently good for now. It does avoid complex issues about how small an ack delay the sender can request of the receiver.

We keep wanting the ability to update MaxAckDelay. Does it make sense for the frame to include both an updated NumPacketsBeforeFastAck for the receiver to use as well as communicate a new MaxAckDelay? If so, that's feeling more like it should be UPDATE_TRANSPORT_PARAMs, but we can discuss that further.

@martinthomson commented Nov 15, 2018

This looks like a great discussion, but I have a request. Please consider not fixing this issue. This is starting to look very complex, and the benefits are not clear.

I realize that this is a long-standing issue, and fixing it in this way seems appealing, but I think that we need to clearly understand the implications of shipping a protocol without this. An extension along the lines discussed is possible if that turns out to be a mistake.

@janaiyengar commented Nov 15, 2018

Going into AckDelay suggestions adds complexity, so I agree that we should not get into it. The NumPacketsBeforeAck parameter however is a fairly straightforward mechanism, and is arguably trivial to implement at a receiver. This is a part of recovery and congestion control that could have used some airtime earlier, but we've punted on it. Fortunately, I don't think it'll be onerous.

As @bbriscoe noted, this allows us to bring the QUIC receiver up to speed with what Linux TCP receivers do, but in a cleaner manner. I had thought that we might address these things within the recovery draft, but doing it with this dynamic mechanism is better.

I'd like to write up a simple proposal here and if the wg thinks it's too much to do, I'm happy to draw back.

@ianswett commented Nov 15, 2018

I'll note that I can quantify the benefits of a proposal like this, at least for users of BBR congestion control.

What I'm trying to settle on is what is the minimum, simplest feature(s) that can achieve what gQUIC is currently doing in production.

If we don't do this in v1, we'll want to have an extension around the same time. Not doing something like this can really crush high bandwidth download performance on the client side, since sending UDP packets is still slow, particularly on mobile devices.

@NicoKos commented Nov 19, 2018

The ACK_RATIO may not be the only transport parameter we would want to change to better adapt the congestion control to specific environments. As one example, we may want to force a "pacing+60 packets" IW for high BDP networks in order to get up to speed quickly.

Should we make a more general issue on "tunable congestion control parameters"?

@IngJohEricsson commented Dec 5, 2018

A bit late into this discussion.
Have you discussed something like two dimensionless parameters K and N, where

MaxAckDelay = RTT/K
MaxPacketsBeforeAck = N

The values K and N are signaled/negotiated between the peers. Suggested default values would be K=4 and N=10, or whatever is deemed proper.
One issue I see is that RTT may be unknown in (the first part of) the initial handshake. And then one can discuss whether it should be RTT or SRTT.
One positive property (as I see it) is that we avoid fixing ourselves into discussions about time units, which can be good as QUIC may be used in long-haul as well as data-centre applications.

/Ingemar
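The K/N policy described above can be sketched as follows (using SRTT on the receiver side, with a fixed fallback before any RTT sample; all names are illustrative):

```python
DEFAULT_MAX_ACK_DELAY_US = 25_000  # fallback before any RTT estimate exists

def max_ack_delay_us(srtt_us: int, k: int = 4) -> int:
    """MaxAckDelay = RTT/K, using the smoothed RTT to damp noise."""
    if srtt_us <= 0:
        return DEFAULT_MAX_ACK_DELAY_US
    return srtt_us // k

def should_ack(packets_unacked: int, elapsed_us: int, srtt_us: int,
               k: int = 4, n: int = 10) -> bool:
    """ACK when N packets are pending or the RTT/K delay budget has elapsed."""
    return packets_unacked >= n or elapsed_us >= max_ack_delay_us(srtt_us, k)
```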

@ianswett commented Dec 5, 2018

@IngJohEricsson, gQUIC has an implementation of what you're describing where K is 4 or 8 and N is 10 or 20, and it works very well in production.

There are some drawbacks to this approach:

  1. The sender's and receiver's views of RTT are different, and could be substantially so, so the sender can't rely on the receiver's max ack delay being RTT/K for timeouts.
  2. gQUIC's version is fixed over the lifetime of the connection, but acks every 2 packets for the first 100, under the theory that you want to increase the window quickly in slow start. But 100 is a fixed value: too short for some connections' slow start and too long for others. And maybe it should increase to 4 or 8 later in slow start as well?

Given there's interest in this, but it's clearly an area that needs further experimentation, I believe @janaiyengar and I are going to write up an extension draft to allow changing the N value during the connection via a new frame and leave MaxAckDelay fixed for now.
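The gQUIC receiver policy described in point 2 can be sketched as follows (the 100-packet cutoff being the fixed value criticised above; names are illustrative):

```python
def ack_threshold(total_packets_received: int, n: int = 10) -> int:
    """gQUIC-style policy: ACK every 2nd packet for the first 100 packets,
    so the sender's window grows quickly in slow start; every Nth after."""
    return 2 if total_packets_received < 100 else n
```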

@IngJohEricsson commented Dec 7, 2018

Thanks for the explanation and nice to hear that you have tried this N and K version. I overlooked the possibility of different estimated RTTs on the two sides of a connection.
Looking at TCP traces, one can see that, for example, Android clients seem to ACK every 2nd segment in the initial phase and later typically go down to every 8th segment. This is not conclusive, though, and it is difficult to tell whether the ACKs are pruned at lower protocol layers; in some cases one can see very long stretch ACKs (up to 100 segments).

/Ingemar

@janaiyengar commented Dec 11, 2018

I'll park this issue, and it can be picked up when we do an extension for it.

@janaiyengar janaiyengar added the parked label Dec 11, 2018
@martinthomson martinthomson moved this from Parked to Recovery in Transport / Recovery / TLS Jan 10, 2019
@ianswett ianswett moved this from Recovery to Close in Transport / Recovery / TLS Jan 11, 2019
@ianswett ianswett moved this from Close to Misc in Transport / Recovery / TLS Jan 11, 2019
@janaiyengar janaiyengar moved this from Misc to Path in Transport / Recovery / TLS Jan 15, 2019
@janaiyengar janaiyengar moved this from Path to Recovery in Transport / Recovery / TLS Jan 15, 2019
@mnot commented Feb 26, 2019

I think the appropriate tag for this is quicv2 -- i.e., we're not going to do it in the scope of our current work. Any disagreement?

@janaiyengar commented Mar 4, 2019

If that's what we're doing for extensions that might show up before v2 does, then yes, that's fine.

@ianswett ianswett added the future-version label Mar 26, 2019
@ianswett commented Mar 26, 2019

As discussed in Prague, the plan is to write an extension and get some more experimental data, but close this for v1.

@ianswett ianswett closed this Mar 26, 2019
@c-taylor commented Nov 20, 2019

Within this thread there was an observation that there was no 'problem' that reduction in ACKs is solving.

ACKs cannot be coalesced with udp_gro and this ends up being a significant scaling issue.

10 participants