Remove ack_delay_exponent TP #2670
Comments
I don't think this adds a lot of complexity, so I'm happy with the status quo. |
The default value is not 8, it's 3. So what quicly does makes a pretty big difference here: They're using a multiplier of 1024 instead of 8. As I've already remarked on Slack, if we want to allow implementations to scale this value, it seems to make more sense to specify a multiplier instead of an exponent. |
@marten-seemann You're right. I was mixing up 3 and 2^3 (8). So, right now the default suggests an 8 microsecond ack_delay accuracy and quicly is using a ~1 millisecond ack_delay accuracy. The more I think about it, the higher value might well be the better one. The goal of this issue was to see if we could actually all just agree on a single scaling (or multiplier) value. To me, configurability is only useful if we have good reason for different values. As I understand it, changing this value just affects the accuracy of the ack_delay field in the ACK frame. This, in turn, affects the accuracy of the network RTT calculation for the largest acknowledged packet in the ACK frame. My question is: what truly is the benefit of being able to measure the RTT with sub-millisecond accuracy (even intra-DC) if the timers aren't that accurate? The primary point of measuring RTT is to have it factor into loss recovery timers. Additionally, since the endpoint setting the timer is the one with the actual restriction on timer accuracy, it seems to me that, if this value is to be configurable, it should be the receiver of the ACK frame that dictates how accurate the ack_delay needs to be, not the sender. It just feels like this is really overcomplicating things. If folks really feel like this field needs to be configurable, I'd like to understand why, so that I might make the same optimizations/choices in winquic. I personally think the argument of "it's not that much more complex to keep it" isn't a great one. IMO, any additional complexity for no explicit reason is bad complexity. These things add up pretty fast. |
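For reference, here is a minimal sketch in C (names are illustrative, not from any codebase) of the adjustment being discussed, roughly following the recovery draft: the receiver of an ACK scales the wire value by 2^ack_delay_exponent and subtracts it from the raw RTT sample, without letting the result drop below min_rtt.

```c
#include <stdint.h>

/* Minimal sketch, names illustrative: adjust a raw RTT sample using the
 * peer's advertised ack_delay_exponent (default 3, i.e. units of 8 us). */
static uint64_t rtt_sample_us(uint64_t raw_rtt_us,      /* time since the acked packet was sent */
                              uint64_t wire_ack_delay,  /* ack_delay field from the ACK frame */
                              unsigned ack_delay_exponent,
                              uint64_t min_rtt_us)
{
    uint64_t ack_delay_us = wire_ack_delay << ack_delay_exponent;
    /* Only subtract the reported delay if the result stays above min_rtt. */
    if (raw_rtt_us >= min_rtt_us + ack_delay_us)
        return raw_rtt_us - ack_delay_us;
    return raw_rtt_us;
}
```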
Clock accuracies vary a lot by environment (e.g., ours are better than 1 ms, but not as good as being in the Linux kernel), as do RTTs, so I would expect a datacenter to set this to 0 or 1 and user-facing traffic to pick the default of 3 or a higher value. |
Sure, I understand that. Being able to query an accurate time is one thing; scheduling a timer with that accuracy is another. On Windows you can query microsecond-resolution time, but only schedule timers with 1 ms accuracy at best (depending on configuration and scheduler). But is that a reason to have 1 microsecond ack_delay accuracy? Also, as I pointed out above, the ack_delay accuracy is dependent on your peer. They don't necessarily know your timer or clock accuracy. |
The idea is if you schedule a timer for 1ms, but it goes off at 1357us, you can indicate the actual delay instead of the intended delay. So the peer shouldn't need to know your timer accuracy, but ideally your clock accuracy is fairly good. |
Sorry if I was unclear. I wasn't referring to computing how long you actually delayed. I was referring to calculating the time you set for your PTO timer. That's effectively where the adjusted rtt measurement comes into play, right? Does it really matter if the rtt is 53.112 ms or just 53.1 ms or 53 ms, when you set the PTO timer, if your timer doesn't have that accuracy? |
I'm fine with the status quo, but if we are to make a change, I like @marten-seemann's proposal of changing ack_delay_exponent to a multiplier. As @nibanks has pointed out, quicly uses an ack_delay_exponent of 10 because it uses a millisecond timer. However, an exponent of 10 is slightly incorrect, because 2^10 is 1,024, not 1,000. Generally speaking, I do not think that people would have a timer granularity of 2^N microseconds, where N is an integer. I expect people to have timers with 10^N microsecond granularity instead. To put it another way, the base of the exponent not being 10 leads to either inaccuracy of ACK.ack_delay, or the ack_delay field becoming larger than it could be. That said, @marten-seemann's suggestion of using a multiplier allows accurate representation of ack_delay, in the most compact form, for any timer that has X microsecond granularity, at the cost of probably spending 1 more byte in the TP. I think it has the best flexibility without sacrificing anything in practice. |
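To make the exactness argument concrete, here is a small illustrative program (not from any implementation) that puts a 3 ms delay on the wire under both schemes: with ack_delay_exponent = 10 the wire unit is 1024 µs and the value cannot be represented exactly, while a hypothetical multiplier of 1000 represents it exactly.

```c
#include <stdio.h>
#include <stdint.h>

/* Illustrative arithmetic only: encoding a 3 ms (3000 us) ack delay. */
int main(void)
{
    uint64_t delay_us = 3000;

    /* Exponent scheme, ack_delay_exponent = 10: the wire unit is 1024 us. */
    uint64_t w_exp = delay_us >> 10;
    printf("exponent 10:     sent %llu, decoded as %llu us\n",
           (unsigned long long)w_exp, (unsigned long long)(w_exp << 10));   /* 2 -> 2048 us */

    /* Hypothetical multiplier TP of 1000: the wire unit is exactly 1 ms. */
    uint64_t w_mul = delay_us / 1000;
    printf("multiplier 1000: sent %llu, decoded as %llu us\n",
           (unsigned long long)w_mul, (unsigned long long)(w_mul * 1000));  /* 3 -> 3000 us */
    return 0;
}
```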
@kazuho makes a good point about the inaccuracy of 2^N when it comes to timer granularity. I don't think a multiplier is any more expensive, and it's certainly more accurate. |
I'd be happy with just saying ack_delay is units of milliseconds and do away with ack_delay_exponent. I really don't see a reason to have more accurate time for this field. |
While millisecond granularity would be sufficient (and also efficient in terms of the size of the ACK frame) for connections between the web browsers and the servers, more accuracy is desirable for intra-datacenter connections (as @ianswett has pointed out). Therefore, my preference would be to retain a TP that can be used to scale the granularity. |
This really isn't much complexity, so I would like to keep the TP around, albeit preferably as a multiplier. |
Looks like multiplication is beating exponentiation. (Also consistency over inconsistency.) |
In practice you can't trust any timer to be at any resolution, so measuring the diff in nanoseconds and downshifting by an exponent would be simpler, faster, and more precise. Otherwise you have to deal with division, or choose a power of 2 anyway.
|
I don't see a reason to change this to a multiplier, except for personal taste. Shifts are objectively faster and I haven't seen anyone indicate they need or really want a multiplier that can't be expressed today. |
First, if the parameter ends up staying, I agree with @ianswett and I would just leave things as they are, with no changes (though maybe we update the default to 10?). But I still haven't seen a specific answer to two things:
|
On timers: you don't know what hardware and software will be deployed in the future or in special projects. Whether that precision is really important is a different question. |
@mikkelfj I can agree to that. But as far as "what might happen in the future", in the QUIC WG, it seems that the general guideline has been: if you don't have a scenario for it now (i.e. not in the charter) then defer it as either a possible extension or a future version. |
So, I don't know what you can typically expect, but I have a timer implementation that uses nanosleep because why not? Recent Linux kernels can wait on microsecond resolution on x86 arch even if its general Hz/jiffie setting is higher. |
|
Thinking a bit more about the problem, to me it seems that using examples might help in moving the discussion forward. Consider the case where an endpoint uses a millisecond timer, sets max_ack_delay to 25ms, and the RTT is 5 milliseconds. The endpoint sends an ACK when the ack-delay timer fires. If the endpoint uses an ack_delay_exponent of 10 (i.e., 1024 microseconds as an approximation of 1 millisecond), the sender would set ACK.ack_delay to 25, and the receiver would interpret the value as 25.6 milliseconds (i.e., 1024*25 = 25,600 microseconds). Then, the calculated RTT becomes 4.4ms (the measured 30ms minus the reported 25.6ms), which is 12% smaller than the actual value. Is this error acceptable? I think it is, because a millisecond timer would have ±1ms inaccuracy. Although it is a bit concerning that the RTT would always be underestimated in the above case. But the question is: what is the rationale for using shifts instead of multipliers, when there are these types of errors? While it is true that a bit shift is faster than a multiplication, the difference does not matter for an encrypted transport. |
Are you not forgetting that numbers are stored in base 2, so if you divide by, say, 100 or 25,000, you do not get an exact representation in the low bits? A receiver-side multiply by said factor will then reintroduce an inaccuracy. On the contrary, if you keep your time management in base 2 on the sender side (if you can be bothered), there is no additional precision loss beyond the chosen resolution. |
@mikkelfj Could you please elaborate where division happens? ack_delay_exponent is chosen by the sender of the ACK frame. Therefore, no division is happening in the example above. |
One clarification on your example @kazuho. The rtt sample that would result would not go lower than min_rtt, so it would only be 4.4ms if the min_rtt was <=4.4ms. |
It happens because timers don't work as you describe. |
Good point. Thank you for pointing that out. Though I'd assume that having +-12% jitter between RTT samples would be common. @mikkelfj Thank you for the clarification. I now see the confusion. I think I should have said "1ms granularity clock" rather than "timer". Does that make things clear? |
Not really, because except for rare RTOS systems used in ABS braking systems and the like, there is no such thing as 1 ms granularity. It may be that your best resolution is 1 ms, but you might wake up 2500 ms later after a disk swap is done, or at half the expected sleep time because of some Unix signal you don't care to re-wait after, or just from general noise of kernel processes, interrupts, TLB misses, etc. Of course, the smaller the RTT, the more critical this gets. For 25 ms the chance of staying within a few percent is reasonable. I still think the float16 approach is better. It takes all the guessing out by introducing an exponent per timestamp, and it can be decoded in a few nanoseconds. |
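For readers unfamiliar with the float16 idea referenced here, one possible shape of such a self-scaling encoding is sketched below. This is purely illustrative and not a concrete format from the thread: an unsigned 16-bit value with an 11-bit mantissa and a 5-bit exponent, so each reported delay carries its own scale.

```c
#include <stdint.h>

/* Hypothetical 16-bit "unsigned float" delay encoding: 5-bit exponent,
 * 11-bit mantissa, i.e. value = mantissa << exponent microseconds.
 * Covers 0 to roughly 2^42 us with a relative error of at most ~0.1%.
 * Saturation for absurdly large inputs is omitted from this sketch. */
static uint16_t fp16_delay_encode(uint64_t us)
{
    unsigned exp = 0;
    while (us >= (1u << 11) && exp < 31) {  /* shrink until the mantissa fits */
        us >>= 1;
        exp++;
    }
    return (uint16_t)((exp << 11) | (us & 0x7ff));
}

static uint64_t fp16_delay_decode(uint16_t v)
{
    return (uint64_t)(v & 0x7ff) << (v >> 11);
}
```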
These rounding issues make no sense to me, as I have argued above. On the contrary, division becomes cheaper with an exponent when you have to compute the actual timer interval. But even with a multiplier, the sender controls the factor and can therefore divide by a constant, so a multiplier is not fatal compared to an exponent. I just don't see a great benefit. I'm inclined to think that we could do without a multiplier or exponent altogether if we always used microsecond units. |
The whole thing looks like a storm in a teacup. I don't buy the implementation issues. On a modern computer, computing "x = (internal-value * unit)" has exactly the same cost as "x = (internal-value * unit) >> exponent". There is no branching involved, the instructions are pipelined, and the next pipeline stall will come during the encoding to varint in all cases. My preference, if we do change things, is to have a straight microsecond count, no scaling involved. But I get Ian's point that this makes the varint encoding slightly longer, which causes some overhead. That would be an argument for just keeping the spec as is. |
Unsurprisingly, I prefer option b (multiplication), and the reason is as follows. As @huitema points out, the computational cost of using a multiplier is no worse than using an exponent, and I second @marten-seemann's comment that use of an exponent is strictly worse than using a multiplier. I do not think there is a justifiable reason to choose option a (an exponent). The network overhead of using no scaling factor would be 2 to 3 bytes per packet containing an ACK, for the primary use-case of QUIC v1 traffic. This is because we can assume that the RTTs of connections between web browsers and servers would in most cases be greater than 16.383 milliseconds, the largest value that fits in a 2-byte varint when expressed in microseconds. Considering the effort we have spent in reducing the number of bytes being transmitted on the wire, I think it might make more sense to have a multiplier than to remove the scaling factor. |
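To see where the 2-3 byte figure comes from, recall QUIC's variable-length integer sizes. The sketch below (illustrative numbers; the thresholds are the standard varint boundaries) compares a 25 ms delay encoded raw, with the default exponent, and with a hypothetical multiplier of 1000.

```c
#include <stdint.h>

/* Length in bytes of a QUIC variable-length integer encoding. */
static unsigned varint_len(uint64_t v)
{
    if (v < (1ull << 6))  return 1;
    if (v < (1ull << 14)) return 2;
    if (v < (1ull << 30)) return 4;
    return 8;
}

/* For a 25 ms ack delay:
 *   raw microseconds:             varint_len(25000)      == 4 bytes
 *   default exponent (2^3):       varint_len(25000 >> 3) == 2 bytes
 *   hypothetical multiplier 1000: varint_len(25)         == 1 byte   */
```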
Maybe it is worth saving those bytes. But another argument not prominent in the discussion is the risk of getting the scale wrong since it applies to all paths and might be hardcoded in implementations that do not adapt to different networking environments. This is why I think the fp16 approach is better (with overhead slightly above a multiplication), but I can see why people dislike a non-standard algorithm. Therefore the question is really if the bytes saved are worth the problems a bad scale might induce. |
I'm very much inclined to propose a vote (i.e., to invoke RFC 3929) on this one in Montreal. I get the strong impression that no one really cares about this that much. Would anyone object if I put up a slide with the three options and we could all hum or wave hands at the ones we liked the least? |
@martinthomson I think that's a good idea, as long as we can summarize the pros and cons of each proposal before the vote. |
As I understand the state of play, the main costs are bytes on the wire (potentially across a wide range of operating points: short RTT, fast processor to long RTT, slow processor), code complexity, and efficiency of calculating the values for encode (e.g., division vs. shift) or decode (e.g., multiply vs. shift). Did I miss anything? |
I'm fine with a vote in Montreal, though I'd really prefer not to remove this entirely. If there is not a clear favorite, then I'd suggest doing a hum for which option people dislike most and not doing that one? That list covers everything that comes to mind. |
Doing a hum sounds good to me. My preference goes to arguing for dislikes, as @ianswett says.
FWIW, this is the table I have illustrating the differences on the encoder side. I do not think division is necessary even if we adopt the multiplier proposal, because the encoder has the freedom to choose the multiplier. By choosing a multiplier that is a power of two close to its timer granularity, an encoder can use a bit-shift operation.
For the decoder side, it's noop vs. shift vs. multiply. |
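A sketch of that encoder-side freedom (the names and the multiplier TP itself are hypothetical): an endpoint with a 1 ms timer can advertise a power-of-two multiplier such as 1024 so that its own encode path is a pure shift, while the decoder on the other side only ever multiplies.

```c
#include <stdint.h>

/* Hypothetical multiplier TP: the ACK sender picks the value, so it can
 * pick a power of two and keep its encode path division-free. */
#define MY_ACK_DELAY_MULTIPLIER 1024u   /* ~1 ms units, advertised to the peer */

static uint64_t encode_ack_delay(uint64_t delay_us)
{
    return delay_us >> 10;               /* divide by 1024 is just a shift */
}

static uint64_t decode_ack_delay(uint64_t wire_value, uint64_t peer_multiplier)
{
    return wire_value * peer_multiplier; /* decoder: a single multiplication */
}
```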
Yeah, we have times expressed in nanoseconds, so we have to divide always. I heard the "dislike most" suggestion; I was planning on that anyway. I've heard a lot stronger negative sentiment related to the "remove the TP" option. |
For those following along, here is a draft of some slides for this process. Comments are open if you want to improve them in any way. |
So I am one of the few holding out for shift. I disagree with many of the arguments made about shift being imprecise, but I come to some of the same conclusions @kazuho has recently posted: the encoder can choose its multiplier, and it can choose one that is a power of 2. The native representation of time may require scaling anyway, so you might end up with a non-power-of-two factor. However, this is typically a constant, and division by a constant is a multiplication if you do it at compile time. It is a long multiplication, which is slower, but not that much. Therefore I am revising my stance and think that a multiplier is acceptable, as long as it can be chosen statically for the application, or it can choose a power of 2 dynamically. More generally, I still think it is a problem to guess the time scale, and I would like to have an adaptive scale without transport parameters, using 16-bit unsigned floats, but there is zero traction on that here. |
It should be noted though, that 64-bit multiplication on a 32-bit, or smaller MCU, is comparatively slow. These will also suffer from GCM, but there is other crypto. |
I'm fairly confident that these operations will be done on relatively small values, and so optimization to that degree will be possible. I plan to use Rust u128 values and to ignore the cost, but I realize that not every implementation can afford to spend upwards of 4 CPU cycles on an operation that you perform once every other packet. |
In general, if CPU is a problem, there's a simple fix: ignore ack delay entirely. Your RTT values will be higher, causing recovery to take longer, but that is the tradeoff. At any rate, this recovery time won't be worse than TCP, since TCP includes receiver delays in its RTT estimates.
|
Assigned to @martinthomson since he presented in Montreal. |
Discussed in Montreal. Option B eventually got consensus in the room. |
The agreed option requires no action, so this is ready for a consensus call. |
How so - isn't option B about a multiplier TP - that would require a PR no? |
@mikkelfj The "option B" discussed during the WG meeting was to effectively keep the text as is. |
OK, thanks. |
I asked the question on Slack and got quite a bit of feedback so I figured I'd open an issue.
Right now, it seems all but one implementation (quicly) is using the default ack_delay_exponent value of 8. Quicly is using an even higher value of 10 because they don't care about more than millisecond timer resolution.
This leads me to believe that having this be a configurable value (in the QUIC TP) is an over design, and we should simplify this. Pick a value (8 or 10, I personally don't care which) and just leave it at that.
The only argument I've heard for keeping it configurable is that intra-DC environments might want a smaller exponent to get more fine-grained ACK delays. I'm not sure that's a great reason, personally. At that small a time granularity, most OS timers aren't going to be very accurate. So, even if you can calculate an extremely accurate RTT, it's not going to help your loss recovery timers any.