
ACK generation recommendation #3304

Closed
janaiyengar opened this issue Dec 17, 2019 · 34 comments
Labels
-transport design has-consensus

Comments

@janaiyengar (Contributor) commented Dec 17, 2019

The transport draft currently says:

An ACK frame SHOULD be generated for at least every second ack-eliciting packet.
This recommendation is in keeping with standard practice for TCP {{?RFC5681}}.

Gorry raised the point that in experiments, this generates way too many ACK packets on high-bandwidth networks, such as satellite networks. This has noticeable CPU costs for QUIC, for both sending and receiving. Satellite networks use middleboxes that collapse TCP ACKs, but they can't do the same for QUIC.

We have talked about doing a more general strategy separately as an extension, but we have experience with a fairly straightforward one. Chrome uses an ACK coalescing strategy that does the following: after the first 100 packets, ACK once every 10 packets or 1/4 of an RTT, whichever comes earlier. If we agree that this might be a good general strategy, we should suggest it in the transport document.
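For concreteness, the coalescing heuristic described above might look something like the following minimal sketch. The names, thresholds, and structure here are illustrative assumptions, not Chrome's actual code:

```python
# Hypothetical sketch of a "100 packets, then every 10 or 1/4 RTT" ACK
# coalescing heuristic. All names and constants are illustrative.

INITIAL_FULL_ACK_PACKETS = 100   # ACK every other packet during ramp-up
ACK_EVERY_N = 10                 # afterwards, ACK every 10th packet...
RTT_FRACTION = 0.25              # ...or every 1/4 RTT, whichever comes first

def should_send_ack(packets_received, unacked_count,
                    time_since_last_ack, smoothed_rtt):
    """Return True if an ACK frame should be generated now."""
    if packets_received <= INITIAL_FULL_ACK_PACKETS:
        # Standard every-other-packet ACKing while the sender ramps up.
        return unacked_count >= 2
    if unacked_count >= ACK_EVERY_N:
        return True
    # Time-based fallback so ACKs are not delayed more than a fraction of RTT.
    return time_since_last_ack >= RTT_FRACTION * smoothed_rtt
```

The time-based branch is what keeps feedback flowing on low-rate connections where 10 packets may take a long time to arrive.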

@ianswett: my recollection is that the strategy had some throughput reduction when used with Cubic, is that correct? Do you think using 1/8th RTT might resolve this?

@larseggert (Member) commented Dec 17, 2019

Why only after the first 100 packets? To have fine-grained feedback during ramp-up?

@nibanks (Member) commented Dec 17, 2019

Gorry raised the point that in experiments, this generates way too many ACK packets on high-bandwidth networks, such as satellite networks. This has noticeable CPU costs for QUIC, for both sending and receiving.

Could you expand on this? I find this to be an extremely general statement, and not very helpful in understanding the real motivation for any possible changes. You say it "has noticeable CPU costs for QUIC", but are you referring to the client, the server, or some middleboxes somehow?

Assuming you're worried about the CPU costs on the server side, has anyone explicitly measured the difference in CPU cost for generating different numbers of ACKs per RTT? What's the effect on FC? Fewer ACKs will mean the sender (I'm assuming the server?) is going to have to buffer more, and FC windows might be hit more often. Is this really such a big problem that it requires a spec change? In V1?

@RyanTheOptimist (Contributor) commented Dec 17, 2019

It can happen to either endpoint. Consider an HTTP/3 download.

Oh, and of course this can happen in the other direction too. In an HTTP/3 upload the sender is the client, which is less likely to have hardware offload, so the CPU concerns can be quite acute.

I think a QUIC sender which sends ACKs of every 2 packets without any further limits will be unable to take full advantage of the network

Err, a QUIC sender obviously isn't the ACK sender. sigh. My point remains, though. Senders may not be able to take full advantage of the network if every two packets are ACK'd.

@ianswett (Contributor) commented Dec 17, 2019

There was still a regression with 1/8th RTT and Cubic. We were using pacing at the time, FYI. Not pacing, or pacing differently, may lead to slightly different results.

Given TCP has no formal recommendations on this that I'm aware of, I tend to think we should solve this problem properly or punt it to an extension. Recommending the heuristic that Chrome arrived at through some (though not exhaustive) experimentation makes me concerned.

Sending fewer ACKs impacts the sender's congestion controller, so ideally the sender would be in control of this, or at least be aware of what the peer's algorithm is. We could add some transport params to unilaterally communicate that, as we do with max_ack_delay, but then do we need to specify how to compensate for that in the recovery draft?

I'm having a really hard time coming up with a good recommendation here without some combination of a new frame and/or transport params.

@mjoras commented Dec 18, 2019

At Facebook, ACK handling is the largest component of what I'd call "discretionary" CPU cost (i.e. not crypto, not syscall/userspace->kernel copying). I expect that will be true in general once people have optimized their implementations. We see fine results with the suggested basic Chrome heuristic, and haven't experimented with much beyond that, but it's on our roadmap. I don't think we should keep the current ACK-every-other recommendation in the draft, as it will largely end up being ignored. Deployments will use what ends up working better in practice, and we know ACKing every other is measurably problematic (I can share numbers if people are interested).

That being said, I feel uncomfortable recommending the Chrome heuristic in the draft, as it reminds me of some of the early TCP RFCs which have recommendations and constants that did not age well and are largely ignored today. Can we punt to something similar to @ianswett's initial idea of having this controlled via a TP or new frame type?

@janaiyengar (Contributor, Author) commented Dec 18, 2019

@mjoras: I'm very sympathetic to your (and Ian's) thinking. That said, we do have a recommendation in the draft right now, which is that a receiver SHOULD ACK every other packet. To your point about aging: that arguably has not aged well for TCP, but we're recommending it here. I am now wondering if we should drop that SHOULD as well and make a weaker recommendation, noting that there are tradeoffs here.

@kazuho (Member) commented Dec 18, 2019

@ianswett

There was still a regression with 1/8th RTT and Cubic. We were using pacing at the time, FYI. Not pacing, or pacing differently, may lead to slightly different results.

Oh. Then, am I correct in assuming that Chrome is (or will be) sending an ACK for at least every two ack-eliciting packets?

IIUC, the recovery draft recommends use of Cubic. We'd definitely want to see good performance with what we recommend, when Chrome is acting as a client.

@mjoras commented Dec 18, 2019

@mjoras: I'm very sympathetic to your (and Ian's) thinking. That said, we do have a recommendation in the draft right now, which is that a receiver SHOULD ACK every other packet. To your point about aging: that arguably has not aged well for TCP, but we're recommending it here. I am now wondering if we should drop that SHOULD as well and make a weaker recommendation, noting that there are tradeoffs here.

@janaiyengar I would be supportive of dropping the SHOULD and having a brief text discussing the tradeoffs involved, possibly at least mentioning the 100-10-1/4RTT recommendation, or however you'd like to refer to it.

@janaiyengar (Contributor, Author) commented Dec 18, 2019

@mjoras: Sorry, I misread your comment above. I was not proposing recommending the 100-10-1/4RTT strategy, I was proposing suggesting it. You said that the heuristic worked fine for you; did you see any performance regressions?

You also say:

Deployments will use what ends up working better in practice and we know ACKing
every other is measurably problematic (I can share numbers if people are interested).

If it isn't too much trouble, that would be super valuable.

@janaiyengar (Contributor, Author) commented Dec 18, 2019

@ianswett: TCP does have a formal requirement which we cite in the recovery draft right now. From RFC 5681:

   The delayed ACK algorithm specified in [RFC1122] SHOULD be used by a
   TCP receiver.  When using delayed ACKs, a TCP receiver MUST NOT
   excessively delay acknowledgments.  Specifically, an ACK SHOULD be
   generated for at least every second full-sized segment, and MUST be
   generated within 500 ms of the arrival of the first unacknowledged
   packet.

The problem is that this is quite dated, and the TCP ecosystem has corrected for it by collapsing ACKs in the network. Additionally, neither of the two major QUIC clients deployed right now -- FB's and Chrome's -- follows this recommendation. It seems silly to continue saying SHOULD when we expect it to not be followed.

Also, @kazuho raises an interesting point. Chrome is likely to be speaking to servers that might be using Cubic or Reno in the near future. Do you recall how serious the degradation with Chrome's ACKing scheme was? With 1/8th RTT?

@janaiyengar (Contributor, Author) commented Dec 18, 2019

@larseggert :

Why only after the first 100 packets? To have fine-grained feedback during ramp-up?

Yes, and to avoid potential issues during slow start.

@yangchi commented Dec 18, 2019

Chrome is likely to be speaking to servers that might be using Cubic or Reno in the near future. Do you recall how serious the degradation with Chrome's ACKing scheme was? With 1/8th RTT?

Shouldn't that largely depend on the server implementation? A server that implements ACK handling poorly and one that implements it well will see very different throughput for the same ACK frequency from the same peer, assuming everything else is equal.

And that's why using a specific recommendation in the draft is tricky. This is a receiver behavior, but it's the sender that knows the best way to do it.

@junhochoi (Contributor) commented Dec 18, 2019

Cloudflare (quiche) is implementing Reno at this point (and will soon add more congestion controllers). Since there is now a variety of parties implementing/choosing their own congestion control module (Reno as in the draft, Cubic is mentioned, Google has BBR(v2), Facebook has COPA...), I think it's good to keep a general strategy in the draft (every-2-packet ACK is what TCP recommends, so I think it's fine for most cases), but it's worth mentioning that alternative strategies are possible (what Chrome does, or something like Linux TCP_QUICKACK?).

If we want to use a TP, the server could send its congestion control algorithm in a transport parameter so that the client can pick its own ACK strategy if needed.

@janaiyengar (Contributor, Author) commented Dec 19, 2019

@yangchi:

Shouldn't that largely depend on server implementation?

Yes, specifically the congestion controller. My question was directed at @ianswett, about experiments with Chrome's strategy for a server that does Cubic. My point is that the client here (Chrome) has decided to use an ACKing strategy that works well with BBR (that's what Google servers use), but as Ian notes, that strategy may cause perf regressions for a server that speaks Cubic or Reno. Since most QUIC servers out there are unlikely to be speaking BBR (this is simply my assertion), my question was about the extent of this degradation.

I understand why this is not simple, and that is why I've opened this issue -- we have a recommendation right now in the draft that suggests that we know a good answer.

I will send out a PR shortly, which should help make this conversation more concrete.

@ianswett (Contributor) commented Dec 19, 2019

@kazuho I was hoping to have the Chrome default be ACK every two packets and then use a frame or TP to change the behavior everywhere we're running BBR. But I haven't done that yet.

@janaiyengar Your points are very strong. We have loads of evidence from QUIC and TCP that ACKing every 2 packets is not the right choice in many circumstances, so saying SHOULD seems quite odd. On the other hand, I feel like we punted this issue when it was raised before (#1978), and this feels a bit late to make large changes.

@junhochoi I would prefer to avoid sending a congestion controller, since that limits innovation going forward (i.e., what if someone uses a new CC?).

Here are some possible options (feel free to suggest others):

  1. We make the existing text a default, but not really a recommendation, i.e.: you should ACK every two packets, unless you know better.
  2. We add a 'fraction of RTT' ACK aggregation transport param that a peer can specify, which kicks in at 'some point' when the receiver thinks the sender is out of slow start or the receiver is sending too many ACKs per RTT. Also limit the number of packets covered by a single ACK to IW, in order to limit bursts to IW.
  3. We add a frame for "number of packets before sending an ACK", as discussed in #1978.
  4. We say something about not ACKing more frequently than the timer granularity, unless IW has arrived.
  5. We say you have to ACK every IW and within max_ack_delay, and that's it. If you want an immediate ACK, skip a packet number. This has costs in terms of ACK frame size and data structure size, and means we have to be more conservative about how many times to send an immediate ACK after receiving a gap (i.e., probably only once).

There are (at least) three resources being conserved here: sender CPU, receiver CPU, and path bandwidth (or transmission opportunities, packet counts, etc.), so no one has perfect information.

@ianswett (Contributor) commented Dec 19, 2019

@janaiyengar In terms of 'how bad was 1/8 RTT with Cubic', I can't find any results, only results for 1/4 RTT. 1/8 RTT was an option added after 1/4 RTT was tested and showed a clear regression, so possibly it was never tested with Cubic. That was around the time BBR was being developed, so I think the focus was on BBR.

If we head in the direction of recommending a fixed fraction, I'll need to re-run those experiments with Cubic to quantify the regression.

I'll note that the way Chrome implements 1/4 and 1/8 RTT ACK decimation means it takes the min of max_ack_delay and the RTT fraction, so in some cases receivers will send more ACKs for a given max_ack_delay.
https://cs.chromium.org/chromium/src/net/third_party/quiche/src/quic/core/quic_received_packet_manager.cc?sq=package:chromium&g=0&l=246
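The "min of max_ack_delay and an RTT fraction" timer described above can be sketched in a couple of lines. This is a simplified illustration of the idea, not the Chromium code linked here:

```python
# Sketch: the delay before flushing a pending ACK is the smaller of the
# negotiated max_ack_delay and a fixed fraction of the smoothed RTT.

def ack_flush_delay(smoothed_rtt, max_ack_delay, rtt_fraction=0.25):
    """All values in seconds; returns the delayed-ACK timer to arm."""
    return min(max_ack_delay, rtt_fraction * smoothed_rtt)
```

On long-RTT paths max_ack_delay dominates (so receivers ACK more often than the RTT fraction alone would suggest), while on short-RTT paths the RTT fraction dominates.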

@larseggert (Member) commented Dec 20, 2019

As a datapoint, quant currently ACKs every burst, and that is too infrequent. But every two packets caused way too much overhead at gigabit speeds. Some better strategy would be highly appreciated.

@ianswett (Contributor) commented Dec 20, 2019

@larseggert Can you clarify what "ACK every burst" is? Is this along the lines of the optimization described in recovery: "As an optimization, a receiver MAY process multiple packets before sending any ACK frames in response. In this case the receiver can determine whether an immediate or delayed acknowledgement should be generated after processing incoming packets."

Would adding an "Always ACK every IW to limit bursts" be a helpful addition?

@larseggert (Member) commented Dec 21, 2019

Yes, I implement the optimization, but that doesn't work so well when multiple = hundreds.

I wonder if once per IW is still too frequent at gigabit speeds.

@gorryfair (Contributor) commented Dec 22, 2019

I like ACK every 10 (after the initial 100 received packets) as the basic recommendation. This is consistent with the idea that a sender can release packets in tens (IW)... and for rates of 10s-100s of Mbps this provides a significant benefit in the cases we looked at, in terms of not being limited by paths with asymmetry (it also helps with CPU usage) -- which is where we arrived in our PANRG talk. For example, without this, the satellite broadband systems we work with perform significantly worse than they do currently. Many networks have deployed ways to do this for TCP, and it's important we don't make the experience with QUIC much worse than it needs to be on the paths that care about this.

At higher rates, you may well benefit from doing more -- especially to reduce endpoint processing -- but I think that such larger changes (e.g., to 1/4 RTT) do interact with the CC, and that level of change requires a lot more thought because of those interactions.

@huitema (Contributor) commented Dec 23, 2019

I have been experimenting with various ACK reduction schemes in picoquic, largely as part of experimentation with satellite links. There are two distinct motivations: CPU reduction, and asymmetric links in which the return path is much narrower than the data path.

In the asymmetric scenario, ACKing every two packets leads to a lot of queuing and losses on the return path. The end-to-end delay becomes very large, and the default congestion controller becomes very confused. On these paths, ACK reduction is essential. The right thing to do is probably to write a control loop that increases the ACK interval when detecting ACK congestion. I have not written that yet, and did something simpler: just go to interval=10 when the delay is long and the bandwidth is high enough.

The actual algorithm in Picoquic is:

    if more than 128 packets received:
        if rtt_min > 100ms and receive_rate > 10MB/s:
            ack_gap = 10
        else if no_holes_detected:
            ack_gap = 4
    else if FIN received on last active stream:
        ack_gap = 1
    else:
        ack_gap = 2

The max_ack_delay is set to:

    max(1ms, min(25ms, RTT/4))

@huitema (Contributor) commented Dec 23, 2019

The reason for not going above ack_gap=4 in the normal case is similar to what @ianswett mentioned: a gap of 10 does cause regressions in some of the test scenarios. The funky test on the last active stream is also there to avoid a regression: the sender is typically trying to resend the non-acked packets, so ACKing sooner results in better performance overall. The test on 128 packets is because of slow start; large ACK intervals there do cause regressions.

@gorryfair (Contributor) commented Dec 23, 2019

I agree keeping 1:2 (or even 1:4) is useful for low-rate flows; the age-old 1:2 in TCP originated at a time when rates were << 1 Mbps.

The decision to switch based on the number of packets received, or some heuristic based on the ACK rate, is probably what is needed. Detecting the impact of ACKs by looking for an overloaded return path is very hard -- basically a design needs to understand that there is a "cost" to sending ACKs. Sending an ACK every other data packet consumes 1/3 of transmit opportunities in overhead. That's expensive in radio resource -- and this value fails to take into account that in many scenarios the physical layer is less efficient in the return direction (for various physical reasons), which will result in greatly increased cost and/or greater variability if the ACK rate is high. There is a reason why RFC3349 mechanisms have been widely used with TCP for over 25 years. However, I'd suggest 10 Mbps of data already makes for a rather high ACK rate; probably 1 Mbps is nearer the point at which the cost matters.
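As a back-of-envelope illustration of that return-path cost (assumed sizes: 1500-byte data packets and roughly 80-byte ACK-only packets; both are illustrative, not measured):

```python
# Rough return-path bandwidth consumed by ACK-only packets for a given
# forward data rate and ACK ratio. Packet sizes are assumptions.

def ack_return_rate_kbps(data_rate_mbps, ack_ratio,
                         data_pkt_bytes=1500, ack_pkt_bytes=80):
    pkts_per_sec = data_rate_mbps * 1e6 / 8 / data_pkt_bytes
    acks_per_sec = pkts_per_sec / ack_ratio
    return acks_per_sec * ack_pkt_bytes * 8 / 1e3

# At 10 Mbps forward rate:
#   ack_return_rate_kbps(10, 2)  -> roughly 267 kbps of ACK traffic
#   ack_return_rate_kbps(10, 10) -> roughly 53 kbps
```

On a highly asymmetric satellite return link, the 5x reduction from ACK-every-2 to ACK-every-10 is the difference the comment above is pointing at.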

@huitema (Contributor) commented Dec 23, 2019

@gorryfair RFC3349 is "A Transient Prefix for Identifying [BEEP] Profiles under Development by the
Working Groups of the Internet Engineering Task Force". You probably meant RFC3449...

@gorryfair (Contributor) commented Dec 24, 2019

Indeed - sorry - RFC3449.

@dtikhonov (Contributor) commented Dec 26, 2019

[W]e know ACKing every other is measurably problematic (I can share numbers if people are interested).

@mjoras, I am interested.

@yangchi commented Dec 27, 2019

@dtikhonov: the last time we measured ACKing every other packet vs ACKing every 10 packets, we saw a 12% throughput difference. But I think this is quite YMMV.

@mnot added this to Triage in Late Stage Processing Jan 6, 2020
@mnot added the design label Jan 6, 2020
@mnot moved this from Triage to Design Issues in Late Stage Processing Jan 6, 2020
@mjoras commented Jan 6, 2020

I'll do some tests with mvfst this week or next and post numbers. We have tests for both single stream throughput and an http server.

@mirjak (Contributor) commented Jan 7, 2020

I think the algorithms proposed here are actually protocol-independent and should not be specified in any QUIC draft, but in a separate draft (in tsvwg...?).

I actually think that the SHOULD we currently have is fine, because it really just means: if you implement this protocol and you don't know any better, we recommend you use this value.

However, we could of course add one more sentence and explain the tradeoffs.

@mjoras commented Jan 14, 2020

Some tests, as promised, cc @janaiyengar @dtikhonov.

The first test is using an "iperf-style" single stream, single conn, single thread on client and server. The server uses BBR as the congestion controller. The server is an infinite source and the client is an infinite sink and the test runs for 60s and measures the average throughput.

The first situation is using the loopback interface on Linux, standard MTU size, and no introduced delay or loss. Changing the ACK generation interval from 10 -> 2, we observe a 20% relative decrease in throughput.

Using the same test but with a 15ms netem delay and mild loss, the results are more dramatic. The ACK ranges expand considerably, which ends up being a significant cost for both the client and the server, and results in a 50% relative decrease in throughput.

These tests don't really reflect how most people plan to use QUIC (on the internet, not with multi-Gbps sustained transfers), but I believe they are illustrative of the costs we're dealing with. Note that the vast majority of the profiled time for mvfst in these tests is spent in sendmsg, recvmsg, crypto, and serializing the QUIC frames. There are still some opportunities for us to micro-optimize our implementation, but relative to the "fixed" costs most implementations are paying, it is likely only a few percentage points of reduction here and there. ACK handling, for example, which is very implementation-dependent, ends up being less costly to the server than writing ACK frames when the ACK interval is 2 versus 10.

We also have a way to test this using a real reverse proxy with synthetic traffic. This particular setup uses a real transatlantic backbone link with minimal loss. The link has an RTT of ~100ms, and the server is using BBR as the congestion controller. Two tests are of interest: one is generally not CPU bound for the server, while the other is. The one typically not CPU bound is many clients each requesting one 1MB resource. The test which typically becomes CPU bound is many clients each requesting ten 1MB resources.

For one 1MB resource per client, changing the ACK interval from 10 -> 2 actually increases RPS by about 20%. I think this is largely due to congestion controller benefits from a higher ACK frequency (note that the ACK-every-ten baseline does not start out by ACKing every other).

For the ten 1MB resources per client case, we see the dramatic results again, where the increased ACK frequency causes a 60-70% drop in RPS. In this case it seems that the client is the one really causing the regression, by having to ACK much more frequently.

All of this is to say, I think there are dragons lurking here. We've had good success with the heuristic of ACKing every 10 combined with 1/4 or 1/8 RTT. After some thought, I don't think including this (and the "after 100" optimization) as the default recommendation is problematic, as long as we have some nice language conveying the basics of the tradeoffs at play.

@ianswett (Contributor) commented Jan 28, 2020

In case anyone interested in this issue hasn't seen it, please review the draft Jana and I wrote to allow senders to reduce the rate of ACK-only packets sent by receivers:
https://datatracker.ietf.org/doc/draft-iyengar-quic-delayed-ack/

My hope is that this draft allows this issue to be closed with no action in the core transport draft, but if people believe some or all of the draft's functionality needs to be present in the core draft, please indicate that on the list.

@gorryfair (Contributor) commented Jan 29, 2020

To me the problem is in two parts... The first part is "what is the default?" I still think we shouldn't be designing a QUIC default with significantly worse performance than TCP, simply because TCP can take advantage of an in-network device to thin ACKs; we should think carefully about setting an appropriate default. To avoid a long "issue", we wrote a draft on this and can present results if there is interest, with a proposal to change the default in the spec:
https://datatracker.ietf.org/doc/draft-fairhurst-quic-ack-scaling/

draft-iyengar-quic-delayed-ack provides opportunities to tune the ACK policy for a server and CC algorithm. I don't see the two drafts in competition. Both could proceed?

@larseggert (Member) commented Feb 5, 2020

Discussed in ZRH. Proposed resolution is to close with no action. Someone may propose an Editorial issue on clarifying that the current text is not a bad starting point for a given implementation.

@larseggert added the proposal-ready label Feb 5, 2020
@project-bot moved this from Design Issues to Consensus Emerging in Late Stage Processing Feb 5, 2020
@LPardue moved this from Consensus Emerging to Consensus Call issued in Late Stage Processing Feb 19, 2020
@LPardue removed the proposal-ready label Feb 19, 2020
@LPardue added the call-issued label Feb 26, 2020
@LPardue added has-consensus and removed call-issued labels Mar 4, 2020
@project-bot moved this from Consensus Call issued to Consensus Declared in Late Stage Processing Mar 4, 2020
Late Stage Processing automation moved this from Consensus Declared to Text Incorporated Mar 5, 2020
gorryfair added a commit to gorryfair/base-drafts that referenced this issue Mar 5, 2020
This will update the default recommendation to ACK at least every 10th ack-eliciting packet. This reduces ACK traffic. It also improves QUIC performance on asymmetric paths.

Related issue quicwg#3304.

QUIC ACKs are significantly larger than TCP ACKs (e.g. 1.5-2 times), which means additional processing overhead and link usage on all Internet paths, with a significant impact on asymmetric links, where this can also limit throughput.

Additional methods, such as the one described in https://datatracker.ietf.org/doc/draft-iyengar-quic-delayed-ack/, can be used to modify the ACK ratio in use cases where further tuning is required, such as high-speed networks, or when a different congestion controller is used. Using an appropriate value for max_ack_delay, or ensuring a minimum number of ACKs per RTT (e.g., 8), would mitigate the effect of ACK loss on RTT estimation and aid performance for low-rate interactive applications.