quicwg / base-drafts Public
ACK generation recommendation #3304
Comments
---
Why only after the first 100 packets? To have fine-grained feedback during ramp-up?
Could you expand on this? I find this to be an extremely general statement, and not very helpful in understanding the real motivation for any possible changes. You say it "has noticeable CPU costs for QUIC", but are you referring to the client, the server, or some middleboxes somehow? Assuming you're worried about the CPU costs on the server side, has anyone explicitly measured the differences in CPU cost for generating different numbers of ACKs per RTT? What's the effect on FC? Fewer ACKs will mean the sender (I'm assuming the server?) is going to have to buffer more, and FC windows might be hit more often. Is this really such a big problem that it requires a spec change? In V1?
---
On Tue, Dec 17, 2019 at 6:48 AM Nick Banks ***@***.***> wrote:
> Gorry raised the point that in experiments, this generates way too many ACK packets in high bandwidth networks, such as satellite networks. This has noticeable CPU costs for QUIC, for both sending as well as for receiving.
> Could you expand on this? I find this to be an extremely general statement, and not very helpful in understanding the real motivation for any possible changes. You say it "has noticeable CPU costs for QUIC", but are you referring to the client, server or some middle boxes somehow?
It can happen to either endpoint. Consider an HTTP/3 download: the client ends up burning a bunch of CPU *sending* ACKs, and the server a bunch of CPU *processing* ACKs.
> Assuming you're worried about the CPU costs on the server side, has anyone explicitly measured the differences in CPU cost for generating different numbers of ACKs per RTT? What's the effect on FC? Fewer ACKs will mean the sender (I'm assuming the server?) is going to have to buffer more, and FC windows might be hit more often. Is this really such a big problem that it requires a spec change? In V1?
We've definitely noticed the CPU costs at Google on high bandwidth connections with long flows. This led us to implement a variety of ACK reduction strategies, which reduced CPU use significantly. I think a QUIC sender which sends ACKs for every 2 packets without any further limits will be unable to take full advantage of the network, and yes, we should explicitly permit this in V1.
I don't believe we measured the impact on flow control. Our clients typically advertise a very large flow control window, which basically does not get hit.
Cheers,
Ryan
---
Oh, and of course this can happen in the other direction too. In an HTTP/3 upload the sender is the client, which is less likely to have hardware offload, so the CPU concerns can be quite acute.
Err, a QUIC sender obviously isn't the ACK sender. Sigh. My point remains, though: senders may not be able to take full advantage of the network if every two packets are ACK'd.
---
There was still a regression with 1/8th RTT and Cubic. We were using pacing at the time, FYI. Not pacing, or pacing differently, may lead to slightly different results. Given TCP has no formal recommendations on this that I'm aware of, I tend to think we should solve this problem properly or punt it to an extension. Recommending the heuristic that Chrome arrived at through some (though not exhaustive) experimentation makes me concerned. Sending fewer ACKs impacts the sender's congestion controller, so ideally the sender would be in control of this, or at least be aware of what the peer's algorithm is. We could add some transport params to unilaterally communicate that, as we do with max_ack_delay, but then do we need to specify how to compensate for that in the recovery draft? I'm having a really hard time coming up with a good recommendation here without some combination of a new frame and/or transport params.
---
At Facebook, ACK handling is the largest component of what I'd call "discretionary" CPU cost (i.e. not crypto, not syscall/userspace->kernel copying). I expect that will be true in general once people have optimized their implementations. We see fine results with the suggested basic Chrome heuristic, and haven't experimented with much beyond that, but it's on our roadmap. I don't think we should keep the current ACK-every-other in the draft, as it will largely end up being a fiction: deployments will use what ends up working better in practice, and we know ACKing every other is measurably problematic (I can share numbers if people are interested). That being said, I feel uncomfortable recommending the Chrome heuristic in the draft, as it reminds me of some of the early TCP RFCs which have recommendations and constants that did not age well and are largely ignored today. Can we punt to something similar to @ianswett's initial idea of having this controlled via a TP or new frame type?
---
@mjoras: I'm very sympathetic to your (and Ian's) thinking. That said, we do have a recommendation in the draft right now, which is that a receiver SHOULD ACK every other packet. To your point about aging: that arguably has not aged well for TCP, but we're recommending it here. I am now wondering if we should drop that SHOULD as well and make a weaker recommendation, noting that there are tradeoffs here.
Oh. Then, am I correct in assuming that Chrome is (or will be) sending an ACK for at least every two ack-eliciting packets? IIUC, the recovery draft recommends use of Cubic. We'd definitely want to see good performance with what we recommend, when Chrome is acting as a client.
@janaiyengar I would be supportive of dropping the SHOULD and having brief text discussing the tradeoffs involved, possibly at least mentioning the 100-10-1/4RTT recommendation, or however you'd like to refer to it.
---
@mjoras: Sorry, I misread your comment above. I was not proposing recommending the 100-10-1/4RTT strategy, I was proposing suggesting it. You said that the heuristic worked fine for you; did you see any performance regressions? You also say: If it isn't too much trouble, that would be super valuable.
---
@ianswett: TCP does have a formal requirement, which we cite in the recovery draft right now. From RFC 5681: The problem is that this is quite dated, and the TCP ecosystem has corrected for this by collapsing ACKs in the network. Additionally, neither of the two major QUIC clients deployed right now -- FB and Chrome -- follows this recommendation. It seems silly to continue saying SHOULD when we expect it to not be followed. Also, @kazuho raises an interesting point. Chrome is likely to be speaking to servers that might be using Cubic or Reno in the near future. Do you recall how serious the degradation with Chrome's ACKing scheme was? With 1/8th RTT?
Yes, and to avoid potential issues during slow start. |
Shouldn't that largely depend on the server implementation? A server that implements ACK handling poorly vs. one that implements it well will see very different throughput for the same ACK frequency from the same peer, assuming everything else is equal. And that's why using a specific recommendation in the draft is tricky: this is a receiver behavior, but it's the sender that knows the best way to do it.
---
Cloudflare (quiche) is implementing Reno at this point (and will soon add more congestion controllers). Since there is now a variety of parties implementing or choosing their own congestion control module (Reno as in the draft, Cubic is mentioned, Google has BBR(v2), Facebook has COPA...), I think it's good to keep a general strategy in the draft (every-2-packet ACK is what TCP recommends, so I think it's fine for most cases), but it's worth mentioning that alternative strategies are possible (what Chrome does, or something like Linux TCP_QUICKACK?). If we want to use a TP, the server could send its congestion control algorithm as a transport parameter so that the client can pick its own ACK strategy if needed.
Yes, specifically the congestion controller. My question was directed at @ianswett, about experiments with Chrome's strategy against a server that does Cubic. My point is that the client here (Chrome) has decided to use an ACKing strategy that works well with BBR (that's what Google servers use), but as Ian notes, that strategy may cause perf regressions for a server that speaks Cubic or Reno. Since most QUIC servers out there are unlikely to be speaking BBR (this is simply my assertion), my question was about the extent of this degradation. I understand why this is not simple, and that is why I've opened this issue -- we have a recommendation right now in the draft that suggests that we know a good answer. I will send out a PR shortly, which should help make this conversation more concrete.
---
@kazuho I was hoping to have the Chrome default be ACK every two packets and then use a frame or TP to change the behavior everywhere we're running BBR. But I haven't done that yet. @janaiyengar Your points are very strong. We have loads of evidence from QUIC and TCP that ACKing every 2 packets is not the right choice in many circumstances, so saying SHOULD seems quite odd. On the other hand, I feel like we punted this issue when it was raised before (#1978), and I feel like this is a bit late to make large changes. @junhochoi I would prefer to avoid sending a congestion controller, since that limits innovation going forward (i.e., what if someone uses a new CC?). Here are some possible options (feel free to suggest others):
There are (at least) three resources being conserved here: sender CPU, receiver CPU, and path bandwidth (or transmission opportunities, packet counts, etc.), so no one has perfect information.
---
@janaiyengar In terms of 'how bad was 1/8 RTT with Cubic', I can't find any results, only results for 1/4 RTT. 1/8 RTT was an option added after 1/4 RTT was tested and showed a clear regression, so possibly it was never tested with Cubic. That was around the time BBR was being developed, so I think the focus was on BBR. If we head in the direction of recommending a fixed fraction, I'll need to re-run those experiments with Cubic to quantify the regression. I'll note that the way Chrome implements 1/4 and 1/8 RTT ACK decimation means it takes the min of max_ack_delay and the RTT fraction, so in some cases receivers will send more ACKs for a given max_ack_delay.
---
As a datapoint, quant currently ACKs every burst, and that is too infrequent. But every two packets caused way too much overhead at gigabit speeds. Some better strategy would be highly appreciated.
---
@larseggert Can you clarify what "ACK every burst" is? Is this along the lines of the optimization described in recovery: "As an optimization, a receiver MAY process multiple packets before sending any ACK frames in response. In this case the receiver can determine whether an immediate or delayed acknowledgement should be generated after processing incoming packets." Would adding an "Always ACK every IW to limit bursts" be a helpful addition?
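The quoted optimization -- processing a whole batch of incoming packets before making a single ACK decision -- could be sketched as follows. The types and the `ack_threshold` value are illustrative assumptions, not from any real implementation; the threshold stands in for ideas like the "always ACK every IW" floated above.

```python
from dataclasses import dataclass, field

@dataclass
class Packet:
    number: int
    ack_eliciting: bool = True  # assumed attribute for this sketch

@dataclass
class ReceiverState:
    ack_threshold: int = 10               # hypothetical, e.g. one IW of packets
    unacked: int = 0                      # ack-eliciting packets not yet ACKed
    received: list = field(default_factory=list)

def process_batch(packets: list, state: ReceiverState) -> bool:
    """Process a burst of packets, then decide once whether to ACK immediately."""
    for pkt in packets:
        state.received.append(pkt.number)  # stand-in for real packet handling
        if pkt.ack_eliciting:
            state.unacked += 1
    # One ACK decision per batch, per the recovery-draft optimization quoted above.
    if state.unacked >= state.ack_threshold:
        state.unacked = 0
        return True   # send an immediate ACK covering the whole batch
    return False      # leave it to the delayed-ACK timer
```

The point of the batch structure is that a burst of hundreds of packets triggers at most one immediate ACK, rather than one per pair of packets.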
---
Yes, I implement the optimization, but that doesn't work so well when "multiple" means hundreds. I wonder if once per IW is still too frequent at gigabit speeds.
---
I like ACK every 10 (after the initial 100 received packets) as the basic recommendation. This is consistent with the idea that a sender can release packets in 10s (IW), and for rates of 10s-100s of Mbps this provides a significant benefit, in the cases we looked at, in terms of not being limited by paths that have asymmetry (it also helps with CPU usage) -- which is where we arrived in our PANRG talk. For example, without this the satellite broadband systems we work with perform significantly worse than currently. Many networks have deployed ways to do this for TCP, and it's important we don't make experience with QUIC much worse than it needs to be on the paths that care about this. At higher rates, you may well benefit from doing more, especially to reduce endpoint processing, but I think that such larger changes (e.g., to 1/4 RTT) do interact with the CC, and that level of change requires a lot more thought, because it interacts with other mechanisms.
---
I have been experimenting with various ACK reduction schemes in picoquic, largely as part of experimentation with satellite links. There are two distinct motivations: CPU reduction, and asymmetric links in which the return path is much narrower than the data path. In the asymmetric scenario, ACKing every two packets leads to a lot of queuing and losses on the return path. The end-to-end delay becomes very large, and the default congestion controller becomes very confused. On these paths, ACK reduction is essential. The right thing to do is probably to write a control loop that increases the ACK interval when detecting ACK congestion. I have not written that yet, and did something simpler: just go to interval=10 when the delay is long and the bandwidth is high enough. The actual algorithm in Picoquic is: The max_ack_delay is set to:
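The actual picoquic formulas are cut off in the comment above, but the simpler fallback it describes ("go to interval=10 when the delay is long and the bandwidth is high enough") can be sketched independently. The threshold constants here are invented for illustration and are not picoquic's real values.

```python
# Illustrative thresholds only -- the real picoquic constants are not given above.
LONG_RTT_SECONDS = 0.25     # e.g. a GEO satellite path has ~500+ ms RTT
HIGH_BANDWIDTH_BPS = 10e6   # 10 Mbit/s estimated delivery rate

def ack_interval(smoothed_rtt: float, est_bandwidth_bps: float) -> int:
    """Packets between ACKs: every 2 by default, every 10 on long, fat paths."""
    if smoothed_rtt >= LONG_RTT_SECONDS and est_bandwidth_bps >= HIGH_BANDWIDTH_BPS:
        return 10
    return 2
```

For example, a 600 ms, 50 Mbit/s satellite path would ACK every 10 packets, while a 20 ms LAN path would keep the default of every 2.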
---
The reason for not going above
---
I agree keeping 1:2 (or even 1:4) is useful for low-rate flows, since 1:2 in TCP originated at a time when rates were << 1 Mbps. The decision to switch based on the number of packets received, or on some heuristic based on ACK rate, is probably what is needed. Detecting the impact of ACKs by looking for an overloaded return path is very hard; basically a design needs to understand there is a "cost" to sending ACKs: sending an ACK every other data packet consumes 1/3 of transmit opportunities in overhead. That's expensive in radio resource; moreover, that value fails to take into account that in many scenarios the physical layer is less efficient in the return direction (for various physical reasons), which will result in greatly increased cost and/or greater variability if the ACK rate is high. There is a reason why RFC3349 mechanisms have been widely used with TCP for over 25 years. However, I'd suggest 10 Mbps of data already produces a rather high ACK rate; probably 1 Mbps is nearer the point at which the cost matters.
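The "1/3 of transmit opportunities" figure above follows directly from the ratio of ACKs to data packets. A quick check, under the simplifying assumption that each ACK consumes one transmit opportunity on a shared channel:

```python
def ack_overhead(data_packets_per_ack: int) -> float:
    """Fraction of all transmit opportunities consumed by ACKs, assuming one
    ACK per `data_packets_per_ack` data packets and that each ACK uses one
    transmit opportunity (a simplification for illustration)."""
    return 1 / (data_packets_per_ack + 1)
```

ACKing every other data packet gives `ack_overhead(2) == 1/3`, matching the figure above; relaxing to one ACK per 10 packets drops the overhead to 1/11, roughly 9%.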
---
@gorryfair RFC3349 is "A Transient Prefix for Identifying [BEEP] Profiles under Development by the |
---
Indeed - sorry - RFC3449. |
@mjoras, I am interested. |
---
@dtikhonov: the last time we measured ACKing every other packet vs. ACKing every 10 packets, we saw a 12% throughput difference. But I think this is quite YMMV.
---
I'll do some tests with mvfst this week or next and post numbers. We have tests for both single stream throughput and an http server. |
---
I think the algorithms proposed here are actually protocol-independent and should be specified not in any QUIC draft but in a separate draft (in tsvwg...?). I actually think that the SHOULD we currently have is fine, because it really just means: if you implement this protocol and you don't know any better, we recommend you use this value. However, we could of course add one more sentence and explain the tradeoffs.
---
Some tests, as promised, cc @janaiyengar @dtikhonov.

The first test is an "iperf-style" single stream, single connection, single thread on client and server. The server uses BBR as the congestion controller and is an infinite source; the client is an infinite sink. The test runs for 60s and measures the average throughput. The first situation uses the loopback interface on Linux, standard MTU size, and no introduced delay or loss. Changing the ACK generation interval from 10 -> 2, we observe a 20% relative decrease in throughput. Using the same test but with a 15ms netem delay and mild loss, the results are more dramatic: the ACK ranges expand considerably, which ends up being a significant cost for both the client and the server, resulting in a 50% relative decrease in throughput. These tests don't really reflect how most people plan to use QUIC (on the internet, not with multi-Gbps sustained transfers), but I believe they are illustrative of the costs we're dealing with. Note that the vast majority of the profiled stacks for mvfst in these tests are in sendmsg, recvmsg, crypto, and serializing the QUIC frames. There are still some opportunities for us to micro-optimize our implementation, but relative to the "fixed" costs most implementations are paying, it is likely only a few percentage reductions here and there. ACK handling, for example, which is very implementation-dependent, ends up being less costly to the server than writing ACK frames when the ACK interval is 2 versus 10.

We also have a way to test this using a real reverse proxy with synthetic traffic. This particular setup uses a real transatlantic backbone link with minimal loss. The link has an RTT of ~100ms, and the server is using BBR as the congestion controller. Two tests are of interest: one is generally not CPU bound for the server, while the other is. The one typically not CPU bound is many clients each requesting one 1MB resource; the test which typically becomes CPU bound is many clients each requesting ten 1MB resources. For one 1MB resource per client, changing the ACK interval from 10 -> 2 actually increases RPS by about 20%. I think this is largely due to congestion controller benefits from a higher ACK frequency (note that the ACK-every-ten baseline does not start out by ACKing every other). For the ten 1MB resources per client case we see the dramatic results again, where the increased ACK frequency causes a 60-70% drop in RPS. In this case it seems that the client is the one that's really causing the regression, by having to ACK much more frequently.

All of this is to say, I think there are dragons lurking here. We've had good success with the heuristic of ACKing every 10 combined with 1/4 or 1/8 RTT. After some thought I don't think including this (and the "after 100" optimization) as the default recommendation is problematic, as long as we have some nice language conveying the basics of the tradeoffs at play here.
---
In case anyone interested in this issue hasn't seen it, please review the draft Jana and I wrote to allow senders to reduce the rate of ACK-only packets sent by receivers: My hope is that this draft allows this issue to be closed with no action in the core transport draft, but if people believe some or all of the draft's functionality needs to be present in the core draft, please indicate that on the list.
---
To me the problem is in two parts. The first part is "what is the default?" I still think we shouldn't be designing a QUIC default with significantly worse performance than TCP, simply because TCP can take advantage of an in-network device to thin ACKs; we should think carefully about setting an appropriate default. To avoid a long "issue" we wrote a draft on this, and can present results if there is interest, with a proposal to change the default in the spec. The second part is tuning: draft-iyengar-quic-delayed-ack provides opportunities to tune the ACK policy for a server and CC algorithm. I don't see the two drafts in competition. Both could proceed?
---
Discussed in ZRH. Proposed resolution is to close with no action. Someone may propose an Editorial issue on clarifying that the current text is not a bad starting point for a given implementation. |
This will update the default recommendation to ACK at least every 10th ack-eliciting packet. This reduces ACK traffic. It also improves QUIC performance on asymmetric paths. Related issue quicwg#3304. QUIC ACKs are significantly larger than TCP ACKs (e.g. 1.5-2 times), which means additional processing overhead and link usage on all Internet paths, with a significant impact on asymmetric links, where this can also limit throughput. Additional methods, such as the one described in https://datatracker.ietf.org/doc/draft-iyengar-quic-delayed-ack/, can be used to modify the ACK ratio in use cases where further tuning is required, such as high-speed networks, or when a different congestion controller is used. Using an appropriate value for max_ack_delay, or ensuring a minimum number of ACKs per RTT (e.g. 8), would mitigate the effect of ACK loss on RTT estimation and aid performance for low-rate interactive applications.
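The "minimum number of ACKs per RTT" mitigation mentioned above can be realized by capping the delayed-ACK timer against a fraction of the smoothed RTT. A minimal sketch, taking only the 8-ACKs-per-RTT example figure from the text; the function name and structure are assumptions for illustration:

```python
MIN_ACKS_PER_RTT = 8  # the example value suggested above

def effective_ack_delay(smoothed_rtt: float, max_ack_delay: float) -> float:
    """Delayed-ACK timer in seconds: never wait longer than 1/8 of the RTT,
    so at least MIN_ACKS_PER_RTT acknowledgements fit into one round trip."""
    return min(max_ack_delay, smoothed_rtt / MIN_ACKS_PER_RTT)
```

On a long path (say 800 ms RTT) the configured max_ack_delay dominates, while on a short path (say 80 ms RTT) the RTT/8 cap takes over and keeps the ACK clock dense enough for RTT estimation.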
The transport draft currently says:
Gorry raised the point that in experiments, this generates way too many ACK packets in high bandwidth networks, such as satellite networks. This has noticeable CPU costs for QUIC, for both sending as well as for receiving. Satellite networks use middleboxes that collapse TCP ACKs, but they can't for QUIC.
We have talked about doing a more general strategy separately as an extension, but we have experience with a fairly straightforward one. Chrome uses an ACK coalescing strategy that does the following: after the first 100 packets, ACK once every 10 packets or every 1/4th of an RTT, whichever comes earlier. If we agree that this might be a good general strategy, we should suggest it in the transport document.
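In outline, the strategy described above might look like the following sketch. The type, constants, and timer handling here are illustrative assumptions, not Chrome's actual code; only the 100/10/¼-RTT numbers come from the text.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AckState:
    packets_received: int = 0
    unacked: int = 0
    max_ack_delay: float = 0.025          # 25 ms, an assumed default
    ack_deadline: Optional[float] = None  # absolute time of pending delayed ACK

PACKETS_BEFORE_DECIMATION = 100  # ACK every other packet until this many received
ACK_EVERY_N = 10                 # then ACK every 10th ack-eliciting packet...
RTT_FRACTION = 0.25              # ...or after 1/4 RTT, whichever comes first

def on_ack_eliciting_packet(state: AckState, smoothed_rtt: float, now: float) -> bool:
    """Return True if an ACK should be sent immediately."""
    state.packets_received += 1
    state.unacked += 1
    threshold = 2 if state.packets_received <= PACKETS_BEFORE_DECIMATION else ACK_EVERY_N
    if state.unacked >= threshold:
        state.unacked = 0
        state.ack_deadline = None
        return True
    # Otherwise arm (or keep) a delayed-ACK timer at min(max_ack_delay, RTT/4).
    deadline = now + min(state.max_ack_delay, smoothed_rtt * RTT_FRACTION)
    state.ack_deadline = min(state.ack_deadline or deadline, deadline)
    return False
```

During the first 100 packets this behaves like the draft's ACK-every-other rule; afterwards, only every 10th ack-eliciting packet (or the ¼-RTT timer) triggers an ACK.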
@ianswett: my recollection is that the strategy had some throughput reduction when used with Cubic, is that correct? Do you think using 1/8th RTT might resolve this?