Remove ack_delay_exponent TP #2670
Comments
I don't think this adds a lot of complexity, so I'm happy with the status quo. |
The default value is not 8, it's 3. So what quicly does makes a pretty big difference here: They're using a multiplier of 1024 instead of 8. As I've already remarked on Slack, if we want to allow implementations to scale this value, it seems to make more sense to specify a multiplier instead of an exponent. |
@marten-seemann You're right. I was mixing up 3 and 2^3 (8). So, right now the default suggests an 8 microsecond ack_delay accuracy and quicly is using a ~1 millisecond ack_delay accuracy. The more I think about it, the higher value might well be the better one. The goal of this issue was to see if we could actually all just agree on a single scaling (or multiplier) value. To me, configurability is only useful if we have good reason for different values. As I understand it, changing this value just affects the accuracy of the ack_delay field in the ACK frame. This, in turn, affects the accuracy of the network RTT calculation for the largest acknowledged packet in the ACK frame. My question is: what truly is the benefit of being able to measure the RTT with sub-millisecond accuracy (even intra-DC) if the timers aren't that accurate? The primary point of measuring RTT is to have it factor into loss recovery timers. Additionally, since the endpoint setting the timer is the one with the actual restriction on timer accuracy, it seems to me that, if this value is to be configurable, it should be the receiver of the ACK frame that dictates how accurate the ack_delay needs to be, not the sender. It just feels like this is really overcomplicating things. If folks really feel like this field needs to be configurable, I'd like to understand why, so that I might make the same optimizations/choices in winquic. I personally think the argument of "it's not that much more complex to keep it" isn't a great one. IMO, any additional complexity for no explicit reason is bad complexity. These things add up pretty fast. |
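For reference, here is a minimal sketch in C (names are illustrative, not from any codebase) of the adjustment being discussed, roughly following the recovery draft: the receiver of an ACK scales the wire value by 2^ack_delay_exponent and subtracts it from the raw RTT sample, without letting the result drop below min_rtt.

```c
#include <stdint.h>

/* Minimal sketch, names illustrative: adjust a raw RTT sample using the
 * peer's advertised ack_delay_exponent (default 3, i.e. units of 8 us). */
static uint64_t rtt_sample_us(uint64_t raw_rtt_us,      /* time since the acked packet was sent */
                              uint64_t wire_ack_delay,  /* ack_delay field from the ACK frame */
                              unsigned ack_delay_exponent,
                              uint64_t min_rtt_us)
{
    uint64_t ack_delay_us = wire_ack_delay << ack_delay_exponent;
    /* Only subtract the reported delay if the result stays above min_rtt. */
    if (raw_rtt_us >= min_rtt_us + ack_delay_us)
        return raw_rtt_us - ack_delay_us;
    return raw_rtt_us;
}
```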
Clock accuracies vary a lot by environment (e.g., ours are better than 1 ms, but not as good as being in the Linux kernel), as do RTTs, so I would expect a datacenter to set this to 0 or 1 and user-facing traffic to pick the default of 3 or a higher value. |
Sure, I understand that. Being able to query an accurate time is one thing; scheduling a timer with that accuracy is another. On Windows you can query microsecond-resolution time, but only schedule timers with 1 ms accuracy at best (depending on configuration and scheduler). But is that a reason to have 1 microsecond ack_delay accuracy? Also, as I pointed out above, the ack_delay accuracy is dependent on your peer. They don't necessarily know your timer or clock accuracy. |
The idea is if you schedule a timer for 1ms, but it goes off at 1357us, you can indicate the actual delay instead of the intended delay. So the peer shouldn't need to know your timer accuracy, but ideally your clock accuracy is fairly good. |
Sorry if I was unclear. I wasn't referring to computing how long you actually delayed. I was referring to calculating the time you set for your PTO timer. That's effectively where the adjusted rtt measurement comes into play, right? Does it really matter if the rtt is 53.112 ms or just 53.1 ms or 53 ms, when you set the PTO timer, if your timer doesn't have that accuracy? |
I'm fine with the status quo, but if we are to make a change, I like @marten-seemann's proposal of changing ack_delay_exponent to a multiplier. As @nibanks has pointed out, quicly uses an ack_delay_exponent of 10 because it uses a millisecond timer. However, an exponent of 10 is slightly incorrect, because 2^10 is 1,024, not 1,000. Generally speaking, I do not think that people would have a timer granularity of 2^N microseconds, where N is an integer. I expect people to have timers with 10^N microsecond granularity instead. To put it another way, the base of the exponent not being 10 leads to either inaccuracy of ACK.ack_delay, or the ack_delay field becoming larger than it could be. That said, @marten-seemann's suggestion of using a multiplier allows accurate representation of ack_delay, in the most compact form, for any timer that has X microsecond granularity, at the cost of probably spending 1 more byte in the TP. I think it has the best flexibility without sacrificing anything in practice. |
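To make the exactness argument concrete, here is a small illustrative program (not from any implementation) that puts a 3 ms delay on the wire under both schemes: with ack_delay_exponent = 10 the wire unit is 1024 µs and the value cannot be represented exactly, while a hypothetical multiplier of 1000 represents it exactly.

```c
#include <stdio.h>
#include <stdint.h>

/* Illustrative arithmetic only: encoding a 3 ms (3000 us) ack delay. */
int main(void)
{
    uint64_t delay_us = 3000;

    /* Exponent scheme, ack_delay_exponent = 10: the wire unit is 1024 us. */
    uint64_t w_exp = delay_us >> 10;
    printf("exponent 10:     sent %llu, decoded as %llu us\n",
           (unsigned long long)w_exp, (unsigned long long)(w_exp << 10));   /* 2 -> 2048 us */

    /* Hypothetical multiplier TP of 1000: the wire unit is exactly 1 ms. */
    uint64_t w_mul = delay_us / 1000;
    printf("multiplier 1000: sent %llu, decoded as %llu us\n",
           (unsigned long long)w_mul, (unsigned long long)(w_mul * 1000));  /* 3 -> 3000 us */
    return 0;
}
```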
@kazuho makes a good point about the inaccuracy of 2^N when it comes to timer granularity. I don't think a multiplier is any more expensive, and it's certainly more accurate. |
I'd be happy with just saying ack_delay is units of milliseconds and do away with ack_delay_exponent. I really don't see a reason to have more accurate time for this field. |
While millisecond granularity would be sufficient (and also efficient in terms of the size of the ACK frame) for connections between the web browsers and the servers, more accuracy is desirable for intra-datacenter connections (as @ianswett has pointed out). Therefore, my preference would be to retain a TP that can be used to scale the granularity. |
This really isn't much complexity, so I would like to keep the TP around, albeit preferably as a multiplier. |
Looks like multiplication is beating exponentiation. (Also consistency over inconsistency.) |
In practice you can't trust any timer to be at any resolution, so measuring the diff in nanoseconds and downshifting by an exponent would be simpler, faster, and more precise. Otherwise you have to deal with division, or choose a power of 2 anyway.
|
I don't see a reason to change this to a multiplier, except for personal taste. Shifts are objectively faster and I haven't seen anyone indicate they need or really want a multiplier that can't be expressed today. |
First, if the parameter ends up staying, I agree with @ianswett and I would just leave things as they are, with no changes (though maybe we update the default to 10?). But I still haven't seen a specific answer to two things:
|
On timers: you don't know what hardware and software will be deployed in the future or in special projects. Whether that precision is really important is a different question. |
@mikkelfj I can agree to that. But as far as "what might happen in the future", in the QUIC WG, it seems that the general guideline has been: if you don't have a scenario for it now (i.e. not in the charter) then defer it as either a possible extension or a future version. |
So, I don't know what you can typically expect, but I have a timer implementation that uses nanosleep because why not? Recent Linux kernels can wait on microsecond resolution on x86 arch even if its general Hz/jiffie setting is higher. |
|
Thinking a bit more about the problem, to me it seems that using examples might help in moving the discussion forward. Consider the case where an endpoint uses a millisecond timer, sets max_ack_delay to 25ms, and the RTT is 5 milliseconds. The endpoint sends an ACK when the ack-delay timer fires. If the endpoint uses an ack_delay_exponent of 10 (i.e., 1024 microseconds as an approximation of 1 millisecond), the sender would set ACK.ack_delay to 25, and the receiver would interpret the value as 25.6 milliseconds (i.e., 1024*25 = 25,600 microseconds). Then, the calculated RTT becomes 4.4ms (the measured 30ms minus the reported 25.6ms), which is 12% smaller than the actual value. Is this error acceptable? I think it is, because a millisecond timer would have ±1ms inaccuracy. Although it is a bit concerning that the RTT would always be underestimated in the above case. But the question is: what is the rationale for using shifts instead of multipliers, when there are these types of errors? While it is true that a bit shift is faster than a multiplication, the difference does not matter for an encrypted transport. |
Are you not forgetting that numbers are stored in base 2, so if you divide by, say, 100 or 25,000, you do not get an exact representation in the low bits? A receiver-side multiply by said factor will then reintroduce an inaccuracy. On the contrary, if you keep your time management in base 2 on the sender side (if you can be bothered), there is no additional precision loss beyond the chosen resolution. |
@mikkelfj Could you please elaborate where division happens? ack_delay_exponent is chosen by the sender of the ACK frame. Therefore, no division is happening in the example above. |
One clarification on your example @kazuho. The rtt sample that would result would not go lower than min_rtt, so it would only be 4.4ms if the min_rtt was <=4.4ms. |
It happens because timers don't work as you describe. |
Good point. Thank you for pointing that out. Though I'd assume that having +-12% jitter between RTT samples would be common. @mikkelfj Thank you for the clarification. I now see the confusion. I think I should have said "1ms granularity clock" rather than "timer". Does that make things clear? |
Not really, because except for rare RTOS systems used in ABS braking systems and the like, there is no such thing as 1 ms granularity. It may be that your best resolution is 1 ms, but you might wake up 2500 ms later after a disk swap is done, or at half the expected sleep time because of some Unix signal you don't care to re-wait after, or just from general noise of kernel processes, interrupts, TLB misses, etc. Of course, the smaller the RTT, the more critical this gets. For 25 ms the chance of staying within a few percent is reasonable. I still think the float16 approach is better. It takes all the guessing out by introducing an exponent per timestamp, and it can be decoded in a few nanoseconds. |
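For readers unfamiliar with the float16 idea referenced here, one possible shape of such a self-scaling encoding is sketched below. This is purely illustrative and not a concrete format from the thread: an unsigned 16-bit value with an 11-bit mantissa and a 5-bit exponent, so each reported delay carries its own scale.

```c
#include <stdint.h>

/* Hypothetical 16-bit "unsigned float" delay encoding: 5-bit exponent,
 * 11-bit mantissa, i.e. value = mantissa << exponent microseconds.
 * Covers 0 to roughly 2^42 us with a relative error of at most ~0.1%.
 * Saturation for absurdly large inputs is omitted from this sketch. */
static uint16_t fp16_delay_encode(uint64_t us)
{
    unsigned exp = 0;
    while (us >= (1u << 11) && exp < 31) {  /* shrink until the mantissa fits */
        us >>= 1;
        exp++;
    }
    return (uint16_t)((exp << 11) | (us & 0x7ff));
}

static uint64_t fp16_delay_decode(uint16_t v)
{
    return (uint64_t)(v & 0x7ff) << (v >> 11);
}
```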
These rounding issues make no sense to me, as I have argued above. On the contrary, division becomes cheaper with an exponent when you have to compute the actual timer interval. But even with a multiplier, the sender controls the factor and can therefore divide by a constant, so a multiplier is not fatal compared to an exponent. I just don't see a great benefit. I'm inclined to think that we could do without a multiplier or exponent altogether if we always used microsecond units. |
The whole thing looks like a storm in a teacup. I don't buy the implementation issues. On a modern computer, computing "x = (internal-value * unit)" has exactly the same cost as "x = (internal-value * unit) >> exponent". There is no branching involved, the instructions are pipelined, and the next pipeline stall will come during the encoding to varint in all cases. My preference, if we do change things, is to have a straight microsecond count, no scaling involved. But I get Ian's point that this makes the varint encoding slightly longer, which causes some overhead. That would be an argument for just keeping the spec as is. |
Unsurprisingly, I prefer option b (multiplication), and the reason is as follows. As @huitema points out, the computational cost of using a multiplier is no worse than using an exponent, and I second @marten-seemann's comment that use of an exponent is strictly worse than using a multiplier. I do not think there is a justifiable reason to choose option a (an exponent). The network overhead of using no scaling factor would be 2 to 3 bytes per packet containing an ACK, for the primary use-case of QUIC v1 traffic. This is because we can assume that the RTTs of connections between web browsers and servers would in most cases be greater than 16.383 milliseconds, the largest value that fits in a 2-byte varint when expressed in microseconds. Considering the effort we have spent in reducing the number of bytes being transmitted on the wire, I think it might make more sense to have a multiplier than to remove the scaling factor. |
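To see where the 2-3 byte figure comes from, recall QUIC's variable-length integer sizes. The sketch below (illustrative numbers; the thresholds are the standard varint boundaries) compares a 25 ms delay encoded raw, with the default exponent, and with a hypothetical multiplier of 1000.

```c
#include <stdint.h>

/* Length in bytes of a QUIC variable-length integer encoding. */
static unsigned varint_len(uint64_t v)
{
    if (v < (1ull << 6))  return 1;
    if (v < (1ull << 14)) return 2;
    if (v < (1ull << 30)) return 4;
    return 8;
}

/* For a 25 ms ack delay:
 *   raw microseconds:             varint_len(25000)      == 4 bytes
 *   default exponent (2^3):       varint_len(25000 >> 3) == 2 bytes
 *   hypothetical multiplier 1000: varint_len(25)         == 1 byte   */
```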
Maybe it is worth saving those bytes. But another argument not prominent in the discussion is the risk of getting the scale wrong since it applies to all paths and might be hardcoded in implementations that do not adapt to different networking environments. This is why I think the fp16 approach is better (with overhead slightly above a multiplication), but I can see why people dislike a non-standard algorithm. Therefore the question is really if the bytes saved are worth the problems a bad scale might induce. |
I'm very much inclined to propose a vote (i.e., to invoke RFC 3929) on this one in Montreal. I get the strong impression that no one really cares about this that much. Would anyone object if I put up a slide with the three options and we could all hum or wave hands at the ones we liked the least? |
@martinthomson I think that's a good idea, as long as we can summarize the pros and cons of each proposal before the vote. |
As I understand the state of play, the main costs are bytes on the wire (potentially across a wide range of operating points: short RTT, fast processor to long RTT, slow processor), code complexity, and efficiency of calculating the values for encode (e.g., division vs. shift) or decode (e.g., multiply vs. shift). Did I miss anything? |
I'm fine with a vote in Montreal, though I'd really prefer not to remove this entirely. If there is not a clear favorite, then I'd suggest doing a hum for which option people dislike most and not doing that one? That list covers everything that comes to mind. |
Doing a hum sounds good to me. My preference goes to arguing for dislikes, as @ianswett says.
FWIW, this is the table I have illustrating the differences on the encoder side. I do not think division is necessary even if we adopt the multiplier proposal, because the encoder has the freedom to choose the multiplier. By choosing a multiplier that is a power of two close to its timer granularity, an encoder can use a bit-shift operation.
For the decoder side, it's noop vs. shift vs. multiply. |
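A sketch of that encoder-side freedom (the names and the multiplier TP itself are hypothetical): an endpoint with a 1 ms timer can advertise a power-of-two multiplier such as 1024 so that its own encode path is a pure shift, while the decoder on the other side only ever multiplies.

```c
#include <stdint.h>

/* Hypothetical multiplier TP: the ACK sender picks the value, so it can
 * pick a power of two and keep its encode path division-free. */
#define MY_ACK_DELAY_MULTIPLIER 1024u   /* ~1 ms units, advertised to the peer */

static uint64_t encode_ack_delay(uint64_t delay_us)
{
    return delay_us >> 10;               /* divide by 1024 is just a shift */
}

static uint64_t decode_ack_delay(uint64_t wire_value, uint64_t peer_multiplier)
{
    return wire_value * peer_multiplier; /* decoder: a single multiplication */
}
```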
Yeah, we have times expressed in nanoseconds, so we have to divide always. I heard the "dislike most" suggestion; I was planning on that anyway. I've heard a lot stronger negative sentiment related to the "remove the TP" option. |
For those following along, here is a draft of some slides for this process. Comments are open if you want to improve them in any way. |
So I am one of the few holding out for shift. I disagree with many of the arguments made about shift being imprecise, but I come to some of the same conclusions @kazuho has recently posted: the encoder can choose its multiplier, and it can choose one that is a power of 2. The native representation of time may require scaling anyway, so you might end up with a non-power-of-two factor. However, this is typically a constant, and division by a constant is a multiplication if you do it at compile time. It is a long multiplication, which is slower, but not that much. Therefore I am revising my stance and think that a multiplier is acceptable, as long as it can be chosen statically for the application, or it can choose a power of 2 dynamically. More generally, I still think it is a problem to guess the time scale, and I would like to have an adaptive scale without transport parameters, using 16-bit unsigned floats, but there is zero traction on that here. |
It should be noted though, that 64-bit multiplication on a 32-bit, or smaller MCU, is comparatively slow. These will also suffer from GCM, but there is other crypto. |
I'm fairly confident that these operations will be done on relatively small values, and so optimization to that degree will be possible. I plan to use Rust u128 values and to ignore the cost, but I realize that not every implementation can afford to spend upwards of 4 CPU cycles on an operation that you perform once every other packet. |
In general, if CPU is a problem, there's a simple fix: ignore ack delay entirely. Your RTT values will be higher, causing recovery to take longer, but that is the tradeoff. At any rate, this recovery time won't be worse than TCP, since TCP includes receiver delays in its RTT estimates.
|
Assigned to @martinthomson since he presented in Montreal. |
Discussed in Montreal. Option B eventually got consensus in the room. |
The agreed option requires no action, so this is ready for a consensus call. |
How so - isn't option B about a multiplier TP - that would require a PR no? |
@mikkelfj The "option B" discussed during the WG meeting was to effectively keep the text as is. |
OK, thanks. |
I asked the question on Slack and got quite a bit of feedback so I figured I'd open an issue.
Right now, it seems all but one implementation (quicly) is using the default ack_delay_exponent value of 8. Quicly is using an even higher value of 10 because they don't care about more than millisecond timer resolution.
This leads me to believe that having this be a configurable value (in the QUIC TP) is an over design, and we should simplify this. Pick a value (8 or 10, I personally don't care which) and just leave it at that.
The only argument I've heard for keeping it configurable is that intra-DC environments might want a smaller exponent to get more fine-grained ACK delays. I'm not sure that's a great reason, personally. At that small a time granularity, most OS timers aren't going to be very accurate. So, even if you can calculate an extremely accurate RTT, it's not going to help your loss recovery timers any.