Remove ack_delay_exponent TP #2670

Closed
nibanks opened this issue May 7, 2019 · 58 comments
Labels: -transport, design (an issue that affects the design of the protocol; resolution requires consensus), has-consensus (an issue that the Chairs have determined has consensus, by canvassing the mailing list)

Comments

@nibanks (Member) commented May 7, 2019

I asked the question on Slack and got quite a bit of feedback so I figured I'd open an issue.

Right now, it seems all but one implementation (quicly) is using the default ack_delay_exponent value of 8. Quicly is using an even higher value of 10 because they don't care about more than millisecond timer resolution.

This leads me to believe that having this be a configurable value (in the QUIC TP) is over-design, and we should simplify this. Pick a value (8 or 10, I personally don't care which) and just leave it at that.

The only argument I've heard for keeping it configurable is that intra-DC environments might want a smaller exponent to get finer-grained ACK delays. I'm not sure that's a great reason, personally. At that small a time granularity, most OS timers aren't going to be very accurate. So, even if you can calculate an extremely accurate RTT, it's not going to help your loss recovery timers any.
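
For reference, a minimal sketch (in C, with illustrative names not taken from any particular stack) of how the receiver of an ACK applies the peer's advertised exponent when decoding the ACK Delay field:

```c
#include <stdint.h>

/* The ACK Delay field is carried in units of 2^ack_delay_exponent
 * microseconds, so decoding is just a left shift by the advertised
 * exponent. Names are illustrative. */
static uint64_t decode_ack_delay_us(uint64_t raw_ack_delay,
                                    unsigned ack_delay_exponent)
{
    return raw_ack_delay << ack_delay_exponent;
}

/* e.g. with quicly's exponent of 10, a raw value of 25 decodes to
 * 25 << 10 = 25,600 microseconds (~25.6 ms). */
```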

@ianswett (Contributor) commented May 7, 2019

I don't think this adds a lot of complexity, so I'm happy with the status quo.

@marten-seemann (Contributor) commented May 8, 2019

> Right now, it seems all but one implementation (quicly) is using the default ack_delay_exponent value of 8. Quicly is using an even higher value of 10 because they don't care about more than millisecond timer resolution.

The default value is not 8, it's 3. So what quicly does makes a pretty big difference here: They're using a multiplier of 1024 instead of 8.

As I've already remarked on Slack, if we want to allow implementations to scale this value, it seems to make more sense to specify a multiplier instead of an exponent.

@nibanks (Member, Author) commented May 8, 2019

@marten-seemann You're right. I was mixing up 3 and 2^3 (8). So, right now the default suggests an 8 microsecond ack_delay accuracy and quicly is using a ~1 millisecond ack_delay accuracy. The more I think about it, the more the higher value seems like the better one.

The goal of this issue was to see if we could actually all just agree on a single scaling (or multiplier) value. To me, configurability is only useful if we have good reason for different values.

As I understand it, changing this value just affects the accuracy of the ack_delay field in the ACK frame. This, in turn, affects the accuracy of the network RTT calculation for the largest acknowledged packet in the ACK frame. My question is: what does being able to measure the RTT with sub-millisecond accuracy (even intra-DC) truly buy you if the timers aren't that accurate? The primary point of measuring RTT is to have it factor into loss recovery timers.

Additionally, since the endpoint setting the timer is the one with the actual restriction on timer accuracy, it seems to me that, if this value is to be configurable, it should be the receiver of the ACK frame that dictates how accurate the ack_delay needs to be, not the sender.

It just feels like this is really over complicating things. If folks really feel like this field needs to be configurable, I'd like to understand why, so that I might make the same optimizations/choices in winquic.

I personally think the argument of "it's not that much more complex to keep it" isn't a great one. IMO, any additional complexity for no explicit reason is bad complexity. These things add up pretty fast.

@ianswett (Contributor) commented May 8, 2019

Clock accuracies vary a lot by environment (i.e., ours are better than 1 ms, but not as good as being in the Linux kernel), as do RTTs, so I would expect a datacenter to set this to 0 or 1 and user-facing traffic to pick the default of 3 or a higher value.

@nibanks (Member, Author) commented May 8, 2019

Sure, I understand that. Being able to query an accurate time is one thing; scheduling a timer that accurately is another. On Windows you can query microsecond time, but only schedule a timer with 1 ms accuracy at best (depending on configuration and scheduler). But is that a reason to have 1 microsecond ack_delay accuracy?

Also, like I pointed out above, the ack_delay accuracy is dependent on your peer. They don't necessarily know your timer or clock accuracy.

@ianswett (Contributor) commented May 8, 2019

The idea is if you schedule a timer for 1ms, but it goes off at 1357us, you can indicate the actual delay instead of the intended delay.

So the peer shouldn't need to know your timer accuracy, but ideally your clock accuracy is fairly good.

@nibanks (Member, Author) commented May 8, 2019

Sorry if I was unclear. I wasn't referring to computing how long you actually delayed. I was referring to calculating the time you set for your PTO timer. That's effectively where the adjusted rtt measurement comes into play, right? Does it really matter if the rtt is 53.112 ms or just 53.1 ms or 53 ms, when you set the PTO timer, if your timer doesn't have that accuracy?

@kazuho (Member) commented May 9, 2019

I'm fine with status quo, but if we are to make a change, I like @marten-seemann's proposal of changing ack_delay_exponent to a multiplier.

As @nibanks has pointed out, quicly uses an ack_delay_exponent of 10 because it uses a millisecond timer. However, an exponent of 10 is slightly incorrect, because 2^10 is 1,024.

Generally speaking, I do not think that people would have a timer granularity of 2^N microseconds, where N is an integer. I expect people to have timer granularities of 10^N microseconds instead.

To put it another way, the base of the exponent not being 10 leads to either inaccuracy of ACK.ack_delay, or the size of the ack_delay field becoming larger than what it could be.

That said, @marten-seemann's suggestion of using a multiplier allows accurate representation of the ack delay, in the most compact form, for any timer that has an X microsecond granularity, at the cost of probably spending 1 more byte in the TP. I think it has the best flexibility without sacrificing anything in practice.

@janaiyengar (Contributor) commented:

@kazuho makes a good point about the inaccuracy of 2^N when it comes to timer granularity. I don't think a multiplier is any more expensive, and it's certainly more accurate.

@nibanks (Member, Author) commented May 9, 2019

I'd be happy with just saying ack_delay is units of milliseconds and do away with ack_delay_exponent. I really don't see a reason to have more accurate time for this field.

@kazuho (Member) commented May 9, 2019

> I'd be happy with just saying ack_delay is units of milliseconds and do away with ack_delay_exponent.

While millisecond granularity would be sufficient (and also efficient in terms of the size of the ACK frame) for connections between the web browsers and the servers, more accuracy is desirable for intra-datacenter connections (as @ianswett has pointed out).

Therefore, my preference would be to retain a TP that can be used to scale the granularity.

@janaiyengar (Contributor) commented:

This really isn't much complexity, so I would like to keep the TP around, albeit preferably as a multiplier.

@martinthomson added the proposal-ready label on May 9, 2019
@martinthomson (Member) commented:

Looks like multiplication is beating exponentiation. (Also consistency over inconsistency.)

@mikkelfj (Contributor) commented May 9, 2019 via email

@ianswett (Contributor) commented May 9, 2019

I don't see a reason to change this to a multiplier, except for personal taste. Shifts are objectively faster and I haven't seen anyone indicate they need or really want a multiplier that can't be expressed today.

@nibanks (Member, Author) commented May 9, 2019

First, if the parameter ends up staying, I agree with @ianswett and I would just leave things as they are, with no changes (though maybe we update the default to 10?).

But I still haven't seen a specific answer to two things:

  1. The endpoint setting the timers for loss detection is the one that knows the accuracy of the timer, not the one sending the ACK frame. So, shouldn't the endpoint that sets the timer tell its peer what ack_delay_exponent to use, not the other way around?

  2. Even in intra-DC environments, where the RTT is nearly 0 ms, that doesn't change how accurately timers can be set. In the default case for Windows, for instance, 15.6 ms is still the average accuracy a timer can get. At best, that can be reduced to 1 ms. But at that small an interval, any kind of scheduling interruption will add extra delay too. So practically, I don't ever see sub-millisecond timers being viable. Therefore, why do we ever need sub-millisecond ack_delay values communicated over the wire? Any fractional delay would just be included in the rtt variance (and IMO, rightly so).

@mikkelfj (Contributor) commented May 9, 2019

On timers: you don't know what hardware and software will be deployed in the future or in special projects. Whether that precision is really important is a different question.

@nibanks (Member, Author) commented May 9, 2019

@mikkelfj I can agree to that. But as far as "what might happen in the future", in the QUIC WG, it seems that the general guideline has been: if you don't have a scenario for it now (i.e. not in the charter) then defer it as either a possible extension or a future version.

@mikkelfj (Contributor) commented May 9, 2019

So, I don't know what you can typically expect, but I have a timer implementation that uses nanosleep because why not? Recent Linux kernels can wait with microsecond resolution on the x86 arch even if the general HZ/jiffy setting is coarser.
http://man7.org/linux/man-pages/man7/time.7.html
http://man7.org/linux/man-pages/man2/nanosleep.2.html
That doesn't change my view that exponents are more manageable though.

@ianswett (Contributor) commented May 9, 2019

  1. I think it makes slightly more sense for the receiver to specify it, since the receiver is the one setting a delayed ack alarm, but I can imagine it working ok either way. It pairs with max_ack_delay in some ways, and that's receiver specified, so I think keeping it on the receiver side is preferable.

  2. If you have a 100us RTT and >1ms timer granularity, then I think this becomes more important, not less. In that case, you want to pace over the true RTT, not the SRTT without adjusting for max_ack_delay, because that may be 10x or more too slow. And to second @mikkelfj's point, sub-millisecond timers are now fairly common.

@kazuho (Member) commented May 9, 2019

@ianswett

> I don't see a reason to change this to a multiplier, except for personal taste. Shifts are objectively faster and I haven't seen anyone indicate they need or really want a multiplier that can't be expressed today.

Thinking a bit more about the problem, to me it seems that using examples might help in moving the discussion forward.

Consider the case where an endpoint uses a millisecond timer, sets max_ack_delay to 25ms, and the RTT is 5 milliseconds. The endpoint sends an ACK when the ack-delay timer fires.

If the endpoint uses ack_delay_exponent of 10 (i.e., 1024 microseconds as an approximation of 1 millisecond), the sender would set ACK.ack_delay to 25, and the receiver would interpret the value as 25.6 milliseconds (i.e. 1024*25=25,600). Then, the calculated RTT becomes 4.4ms, which is 12% smaller than the actual value.

Is this error acceptable? I think it is, because a millisecond timer would have +- 1ms inaccuracy. Although it is a bit concerning that the RTT would always be underestimated in the above case.

But the question is: what is the rationale for using shifts instead of multipliers, when there are these types of errors? While it is true that a bit shift is faster than a multiplication, the difference does not matter for an encrypted transport.
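
To make the arithmetic above concrete, here is a small sketch that reproduces the numbers in the example (names are illustrative; the usual latest_rtt minus decoded ack_delay adjustment is assumed):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t actual_rtt_us   = 5000;   /* 5 ms path RTT                    */
    uint64_t actual_delay_us = 25000;  /* ACK held back for 25 ms          */
    uint64_t exponent        = 10;     /* 1024 us units                    */

    /* Sender encodes its 25 ms delay with a millisecond clock: 25 "ticks". */
    uint64_t encoded_delay = actual_delay_us / 1000;          /* 25        */

    /* Receiver decodes with the exponent: 25 << 10 = 25,600 us = 25.6 ms. */
    uint64_t decoded_delay_us = encoded_delay << exponent;

    /* Time from send to ACK is RTT + ack delay = 30 ms; subtracting the
     * decoded (overstated) delay yields 4.4 ms instead of 5 ms (~12% low). */
    uint64_t measured_us     = actual_rtt_us + actual_delay_us;
    uint64_t adjusted_rtt_us = measured_us - decoded_delay_us;

    printf("adjusted RTT = %llu us\n", (unsigned long long)adjusted_rtt_us);
    return 0;
}
```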

@mikkelfj (Contributor) commented May 9, 2019

Are you not forgetting that numbers are stored in base 2, so if you divide by, say, 100 or 25000, you do not get an exact representation in the lower bits? A receiver-side multiply by said factor will then reintroduce that inaccuracy. On the contrary, if you keep your time management in base 2 on the sender side (if you can be bothered), there is no additional precision loss beyond the chosen resolution.

@kazuho (Member) commented May 9, 2019

@mikkelfj Could you please elaborate where division happens? ack_delay_exponent is chosen by the sender of the ACK frame. Therefore, no division is happening in the example above.

@ianswett (Contributor) commented May 9, 2019

One clarification on your example @kazuho. The rtt sample that would result would not go lower than min_rtt, so it would only be 4.4ms if the min_rtt was <=4.4ms.
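
For readers following along, the clamp being described is roughly the following (a paraphrase of the recovery draft's logic, not a verbatim excerpt; names are illustrative):

```c
#include <stdint.h>

/* Only subtract the peer-reported ack delay when the result would not fall
 * below min_rtt; otherwise keep the unadjusted sample. */
static uint64_t adjusted_rtt_us(uint64_t latest_rtt_us,
                                uint64_t ack_delay_us,
                                uint64_t min_rtt_us)
{
    if (latest_rtt_us >= ack_delay_us &&
        latest_rtt_us - ack_delay_us >= min_rtt_us)
        return latest_rtt_us - ack_delay_us;
    return latest_rtt_us;
}

/* In the example above: adjusted_rtt_us(30000, 25600, min_rtt) is 4400 only
 * when min_rtt is 4.4 ms or less; otherwise the sample stays at 30 ms. */
```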

@mikkelfj (Contributor) commented May 9, 2019

It happens because timers don't work as you describe.
When the timer fires it is either a spurious wake-up or a proper wake-up, but likely rounded up to some internal quantum that is not a clean multiple of anything. Additionally, the wake-up is possibly coalesced, so different things might happen before your event gets activated.
Hence, the sane thing to do is:
t1 = now()
delay = 25ms
sleep(delay)
t2 = now()
ack.delay = (t2 - t1) / unit_multiplier
If you do not do that, any rounding error is dwarfed by assumptions on time precision.
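
Spelled out with a real clock (POSIX clock_gettime; unit_multiplier stands for whatever scaling the endpoint advertises, and the other names are illustrative), the above looks roughly like:

```c
#include <stdint.h>
#include <time.h>

/* Monotonic microsecond clock used to report the *actual* delay, not the
 * intended one. */
static uint64_t now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)(ts.tv_nsec / 1000);
}

/* Usage sketch:
 *   uint64_t t1 = now_us();      // ack-eliciting packet received
 *   // ... wait for the delayed-ack timer, however late it actually fires
 *   uint64_t t2 = now_us();      // ACK frame is built
 *   uint64_t ack_delay_field = (t2 - t1) / unit_multiplier;
 */
```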

@kazuho (Member) commented May 9, 2019

@ianswett

> The rtt sample that would result would not go lower than min_rtt, so it would only be 4.4ms if the min_rtt was <=4.4ms.

Good point. Thank you for pointing that out. Though I'd assume that having +-12% jitter between RTT samples would be common.

@mikkelfj Thank you for the clarification. I now see the confusion. I think I should have said "1ms granularity clock" rather than "timer". Does that make things clear?

@mikkelfj (Contributor) commented May 9, 2019

Not really, because except for rare RTOS systems used in ABS braking systems and such, there is no such thing as 1 ms granularity. It may be that your best resolution is 1 ms, but you might wake up 2500 ms later after a disk swap is done, or at half the expected sleep time because of some Unix signal you don't care to re-wait after, or just because of the general noise of kernel processes, interrupts, TLB misses, etc. Of course, the smaller the RTT the more critical this gets. For 25 ms the chance of staying within a few percent is reasonable.

I still think the float16 approach is better. It takes all the guessing out by introducing an exponent per timestamp, and it can be decoded in a few nanoseconds.

@mikkelfj (Contributor) commented:

These rounding issues make no sense to me, as I have argued above. On the contrary, division becomes cheaper with exponents when you have to compute the actual timer interval. But even with a multiplier, the sender controls the factor and can therefore divide by a constant, so a multiplier is not fatal compared to an exponent. I just don't see a great benefit.

I'm inclined to think that we could do without a multiplier or exponent altogether if we have microsecond units always.

@huitema (Contributor) commented Jun 12, 2019

The whole thing looks like a storm in a cup of tea. I don't buy the implementation issues. On a modern computer, computing "x = (internal-value * unit)" has exactly the same cost as "x = (internal-value * unit)>>exponent". There is no branching involved, the instructions are pipe-lined, the next pipeline stall will come during the encoding to varint in all cases.
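
As a concrete illustration of the cost point, the two decode forms being compared amount to something like this (illustrative names; either compiles to a single pipelined ALU operation):

```c
#include <stdint.h>

static uint64_t decode_multiplier(uint64_t raw, uint64_t unit_us)
{
    return raw * unit_us;        /* multiplier proposal */
}

static uint64_t decode_exponent(uint64_t raw, unsigned exponent)
{
    return raw << exponent;      /* current ack_delay_exponent TP */
}
```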

My preference if we do change things is to have a straight microsecond count, no scaling involved. But I get Ian's point that this makes the varint encoding slightly longer, which causes some overhead. That would be an argument for just keeping the spec as is.

@kazuho (Member) commented Jun 13, 2019

Unsurprisingly, I prefer b (multiplication), and the reason is as follows.

As @huitema points out, the computational cost of using a multiplier is no worse than using an exponent, and I second @marten-seemann's comment that use of exponent is strictly worse than using a multiplier. I do not think there is a justifiable reason to choose a (exponent).

The network overhead of using no scaling factor would be 2 to 3 bytes per packet containing an ACK, for the primary use-case of QUIC v1 traffic. This is because we can assume that, for connections between web browsers and servers, the values being encoded would in most cases be greater than 16.383 milliseconds.

Considering the effort we have spent in reducing the number of bytes being transmitted on the wire, I think it might make more sense to have a multiplier than removing the scaling factor.
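
The byte counts here come from QUIC's variable-length integer encoding; a quick sketch of the thresholds (an illustrative helper, not taken from any implementation):

```c
#include <stddef.h>
#include <stdint.h>

/* QUIC varints take 1 byte up to 63, 2 bytes up to 16,383, 4 bytes up to
 * 1,073,741,823, and 8 bytes beyond. An unscaled microsecond ack delay above
 * 16.383 ms therefore costs 4 bytes, while the same delay scaled by the
 * default 2^3 (or a multiplier) usually fits in 1-2 bytes. */
static size_t varint_len(uint64_t v)
{
    if (v <= 63)         return 1;
    if (v <= 16383)      return 2;
    if (v <= 1073741823) return 4;
    return 8;
}

/* e.g. a 30 ms delay: varint_len(30000) == 4, varint_len(30000 >> 3) == 2. */
```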

@mikkelfj (Contributor) commented:

Maybe it is worth saving those bytes. But another argument not prominent in the discussion is the risk of getting the scale wrong since it applies to all paths and might be hardcoded in implementations that do not adapt to different networking environments. This is why I think the fp16 approach is better (with overhead slightly above a multiplication), but I can see why people dislike a non-standard algorithm.

Therefore the question is really if the bytes saved are worth the problems a bad scale might induce.

@martinthomson (Member) commented:

I'm very much inclined to propose a vote (i.e., to invoke RFC 3929) on this one in Montreal. I get the strong impression that no one really cares about this that much.

Would anyone object if I put up a slide with the three options up and we could all hum or wave hands at the ones we liked the least?

@marten-seemann (Contributor) commented:

@martinthomson I think that's a good idea, as long as we can summarize the pros and cons of each proposal before the vote.

@martinthomson (Member) commented:

As I understand the state of play, the main costs are bytes on the wire (potentially across a wide range of operating points: short RTT, fast processor to long RTT, slow processor), code complexity, and efficiency of calculating the values for encode (e.g., division vs. shift) or decode (e.g., multiply vs. shift).

Did I miss anything?

@ianswett (Contributor) commented Jul 9, 2019

I'm fine with a vote in Montreal, though I'd really prefer not to remove this entirely. If there is not a clear favorite, then I'd suggest doing a hum for which option people dislike most and not doing that one?

That list covers everything that comes to mind.

@kazuho (Member) commented Jul 9, 2019

Doing a hum sounds good to me. My preference goes to asking for dislikes, as @ianswett says.

@martinthomson

> As I understand the state of play, the main costs are bytes on the wire (potentially across a wide range of operating points: short RTT, fast processor to long RTT, slow processor), code complexity, and efficiency of calculating the values for encode (e.g., division vs. shift) or decode (e.g., multiply vs. shift).

FWIW, this is the table I have illustrating the differences on the encoder side. I do not think division is necessary even if we adopt the multiplier proposal, because the encoder has the freedom to choose the multiplier. By choosing a multiplier that is a power-of-two multiple of the timer granularity, an encoder can use a bit-shift operation.

design \ timer granularity   1 us        2^n us      10^n us
no scaling                   noop        bit-shift   multiply
bit-shift                    bit-shift   bit-shift   becomes inaccurate
multiplier                   bit-shift   bit-shift   bit-shift

For the decoder side, it's noop vs. shift vs. multiply.
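
An encoder-side sketch of the simplest instance of the multiplier row: advertise the clock granularity itself as the multiplier, so the field is just the tick count and the peer's decode is exact. The TP name and types here are assumed for illustration, since the multiplier proposal hasn't been specified:

```c
#include <stdint.h>

#define ACK_DELAY_MULTIPLIER_US 1000u  /* millisecond clock, advertised in TP */

/* With multiplier == granularity, encoding is a no-op on the tick count;
 * with a power-of-two multiple of the granularity it is a shift, and with
 * any other constant the compiler turns the division into a multiply. */
static uint64_t encode_ack_delay(uint64_t delay_in_ms_ticks)
{
    return delay_in_ms_ticks;
}

static uint64_t decode_ack_delay_us(uint64_t raw)
{
    return raw * ACK_DELAY_MULTIPLIER_US;   /* exact, unlike 25 << 10 */
}
```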

@martinthomson (Member) commented:

Yeah, we have times expressed in nanoseconds, so we have to divide always.

I heard the "dislike most" suggestion; I was planning on that anyway. I've heard a lot stronger negative sentiment related to the "remove the TP" option.

@martinthomson (Member) commented:

For those following along, here is a draft of some slides for this process. Comments are open if you want to improve them in any way.

@mikkelfj (Contributor) commented:

So I'm one of the few holding out for shift. I disagree with many of the arguments made about shift being imprecise, but I come to some of the same conclusions @kazuho has recently posted:

The encoder can choose its multiplier. It can choose one that is a power of 2.

The native representation of time may require scaling anyway, so you might end up with a non-power of two. However, this is typically a constant, and division by a constant is a multiplication if you do it at compile time. It is a long multiplication, which is slower, but not that much.

Therefore I am reversing my stance and think that a multiplier is acceptable, as long as it can be chosen statically for the application, or it can choose a power of 2 dynamically.

More generally I still think it is a problem to guess the time scale and would like to have an adaptive scale without transport parameters, using 16 bit unsigned floats, but there is zero traction on that here.

@mikkelfj (Contributor) commented:

It should be noted though, that 64-bit multiplication on a 32-bit, or smaller MCU, is comparatively slow. These will also suffer from GCM, but there is other crypto.

@martinthomson (Member) commented:

I'm fairly confident that these operations will be done on relatively small values, and so optimization to that degree will be possible. I plan to use Rust u128 values and to ignore the cost, but I realize that not every implementation can afford to spend upwards of 4 CPU cycles on an operation that you perform once every other packet.

@janaiyengar (Contributor) commented Jul 22, 2019 via email

@larseggert (Member) commented:

Assigned to @martinthomson since he presented in Montreal.

@larseggert (Member) commented:

Discussed in Montreal. Option B eventually got consensus in the room.

@martinthomson (Member) commented:

The agreed option requires no action, so this is ready for a consensus call.

@martinthomson added the proposal-ready label on Jul 25, 2019
@mikkelfj (Contributor) commented:

> The agreed option requires no action, so this is ready for a consensus call.

How so - isn't option B about a multiplier TP - that would require a PR no?

@huitema (Contributor) commented Jul 25, 2019

@mikkelfj The "option B" discussed during the WG meeting was to effectively keep the text as is.

@mikkelfj (Contributor) commented:

OK, thanks.

@mnot added the has-consensus label and removed the proposal-ready label on Aug 16, 2019