Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consider making the varint encoding unique #2299

Closed
marten-seemann opened this issue Jan 6, 2019 · 31 comments
Closed

consider making the varint encoding unique #2299

marten-seemann opened this issue Jan 6, 2019 · 31 comments
Labels
-transport design An issue that affects the design of the protocol; resolution requires consensus. has-consensus An issue that the Chairs have determined has consensus, by canvassing the mailing list.

Comments

@marten-seemann
Copy link
Contributor

Our varint encoding has the unpleasant property that it's possible to encode numbers in multiple ways. This leads to rather clumsy language when it comes to the frame encoding:

The Frame Type field uses a variable length integer encoding (see Section 16) with one exception. To ensure simple and efficient implementations of frame parsing, a frame type MUST use the shortest possible encoding. Though a two-, four- or eight-byte encoding of the frame types defined in this document is possible, the Frame Type field for these frames is encoded on a single byte.

The easy fix for this is to make the varint encoding unique, such that every number can only be encoded in a single way. This is a very straightforward change from what we have today:
The 1-byte numbers would still encode the values from 0 to 63, but the 2-byte numbers would encode the values between 64 and 16,447 (instead of encoding the range between 0 and 16,383), etc.

While a simplification of the spec is the main advantage of this change, it has the nice property that it slightly increases the encoding efficiency.

@mikkelfj
Copy link
Contributor

mikkelfj commented Jan 6, 2019

I have also argued for this in the past, but it breaks down when conservatively allocate two bytes for a length that is unknown. Then you'd have to memmove if the length only requires one byte.

@marten-seemann
Copy link
Contributor Author

marten-seemann commented Jan 6, 2019

The solution is simple: Don't do that! It's definitely possible to use varints everywhere (I have a (roughly) draft-17 compliant implementation, I'm always using the shortest possible varint encoding, and I don't need to memmove anywhere in order to achieve that).

@ianswett
Copy link
Contributor

ianswett commented Jan 6, 2019

Conceptually this is nice and I think @ekr has mentioned this as well in the past.

But as @mikkelfj said, we'd have to carefully think through whether it causes some packetization headaches in practice. Thinking through our current gQUIC implementation, I don't think it would cause problems. I guess one could always add a byte or two of padding if the number was shorter than the reserved length?

This prevents or makes very awkward (for better or worse) picking fixed serialization sizes for specific cases(ie: ack blocks and gaps are always 2 bytes) and writing serialization code that assumes that.

@dtikhonov
Copy link
Member

There are places where the shortest possible encoding is an important property to have, such as frame types. There are also places where it is not important, such as long header's payload length.

Losing the flexibility afforded by allowing fixed-size varints is one way to fix the clumsy language. The other way is to fix the clumsy language.

I suggest we add a definition for the shortest possible varint encoding to Section 16—we can call it minimal varint encoding or something of the sort—and reference it in other parts of the documents as appropriate.

@marten-seemann
Copy link
Contributor Author

Losing the flexibility afforded by allowing fixed-size varints is one way to fix the clumsy language. The other way is to fix the clumsy language.

It's not only the language that's clumsy. The fact that we have to make this distinction at all is clumsy.
I see little value in having flexibility where it's not needed, especially if it adds weird corner cases elsewhere. I'm aware that reserving a longer value and filling it in later is a lazy way to implement the varint encoding, but I didn't find it particularly hard to write my code in a way that allows me to always use the optimal encoding.

@kazuho
Copy link
Member

kazuho commented Jan 6, 2019

I have a mixed feeling about the proposal. Generally speaking, I prefer using a canonical encoding.

OTOH, there are cases that you do not know the length beforehand, and IIRC we have in the past built consensus based on the fact that people can use 2-byte varint for expressing values between 0 to 63.

FWIW, the fields that would be problemsome would be: length field of long header packets, frame lengths of HEADERS and PUSH_PROMISE frames. The lengths are determined after building the contents, and therefore the length of the lengths cannot be predetermined.

@nibanks
Copy link
Member

nibanks commented Jan 6, 2019

In WinQuic we have a special function just for encoding a varint explicitly two bytes that we use for the payload length in long headers. It would be extremely difficult to update this code to always use the shortest encoding without having to memmove the entire payload after it has been written to the packet buffer.

Looking up at the requirement for shortest possible frame type, I don't understand why that's even necessary to require this. Obviously it's more efficient on the wire, but as far as implementation processing efficiency goes, it would make little difference in the grand scheme of things (a majority of all CPU will be spent in encryption/decryption). So, we can recommend the shortest encoding, but I don't think we should kill the connection if a particular implementation does always use that.

I'd prefer to leave the varint as is and remove any requirements for a particular encoding length.

@mikkelfj
Copy link
Contributor

mikkelfj commented Jan 6, 2019

Looking up at the requirement for shortest possible frame type, I don't understand why that's even necessary to require this.

This is because it allows for efficient switch constructs when decoding the frames. This avoids additional complexity during decode. There is difference between enumerations such as frame types, and length values. Enumerations should be uniquely encoded whereas it could be argued that lengths need not due to the memmov issue.

I'm not sure that vast overhead goes to decryption because this is typically 3 cycles per byte so you could easily double the time overhead if you have many small frames that are processed inefficiently.

@ianswett
Copy link
Contributor

ianswett commented Jan 6, 2019

Good point on long header packets. I think we should consider changing the long header length to 2 byte lengths if we did this.

Given this is introducing churn and I'm not clear it's adding a lot of value, I'd say we shouldn't do this at the moment.

@mikkelfj
Copy link
Contributor

mikkelfj commented Jan 6, 2019

@ianswett I was almost about to propose the same thing. This would make all encodings canonical - I think there is value in that. The reason I hesitated is that perhaps there are cases where 1 byte lengths makes a difference even if lazy implementations use 2 bytes.

I think ACK block count were not mentioned in this discussion. This is the most important case I can think of.

@martinthomson
Copy link
Member

I'm pretty sure that this was a case where @ekr said that a canonical encoding would be silly. That's certainly my view.

I believe that this would be a fairly significant hit both in terms of code complexity and performance. If you think that you can make the case, I think we might start with a PR here to show that a) the code is simple enough, and b) the performance is tolerable. Then there is the point about churn.

The requirement on minimal encodings for frame types exists so that a frame decoder can be more efficient, as @mikkelfj says. I'm reluctant to add that requirement in too many places though (even in the h3 cases we are currently considering).

@martinthomson martinthomson added design An issue that affects the design of the protocol; resolution requires consensus. -transport labels Jan 7, 2019
@janaiyengar
Copy link
Contributor

I think we should leave this as it is, primarily for the reason that this is an unnecessary constraint. The efficiency argument isn't strong for non-frame use cases, and the loss of flexibility can be an issue. @larseggert might have trouble with forcing packet lengths to be minimally represented, for the same reason as @nibanks.

On changing the long header packet size to 2 bytes, we discussed in #1577 and decided against it.

@martinthomson
Copy link
Member

I'm seeing fair support for closing with no action, but I'm happy to leave this until we meet in Tokyo. Does anyone want to add anything? I'll close if there is no new information or argument.

@siyengar
Copy link

siyengar commented Jan 9, 2019

I've mentioned this previously that this might be good to do, but I also have mixed feelings now because of the reasons @nibanks mentioned about pre-reserving the length. It would seem nasty to work around that.

@ianswett
Copy link
Contributor

I'd rather close this with no action at this point. I can't see us getting to consensus to do this in v1.

@larseggert
Copy link
Member

I'm OK with closing, but it is not difficult to know how long a packet number will need to be, and to leave space for it.

   The sender MUST use a packet number size able to represent more than
   twice as large a range than the difference between the largest
   acknowledged packet and packet number being sent.

@mikkelfj
Copy link
Contributor

@larseggert The problem is with formatting ACK frames in-place.

@larseggert
Copy link
Member

Depends on your data structures. My ACKing code doesn't have this issue.

@ianswett
Copy link
Contributor

I think long header packets are the largest issue I'm aware of as @nibanks pointed out above.

ACK frames are a bit annoying, but easier.

@larseggert
Copy link
Member

Well, as I said, I'm not objecting to closing this now

@marten-seemann
Copy link
Contributor Author

Not feeling strongly about this issue, but for how many implementations would this change actually be difficult to implement? So far, the only definitive statement was about WinQuic. Does this apply to other implementations as well, or is this something that's limited to one implementation?

@ianswett
Copy link
Contributor

I believe @DavidSchinazi indicated it'd be more difficult in our coalesced packets implementation, but I'll let him confirm.

@martinthomson
Copy link
Member

At this stage, I think that one objection is sufficient. The advantage of a minimal encoding is in wire efficiency. Something that can be achieved with the current scheme, and not guaranteed with the proposed one (because there are ample ways in which to suboptimally encode packets, PADDING notwithstanding).

Add the unproven performance characteristics, and I'm not seeing much incentive to change anything.

@mikkelfj
Copy link
Contributor

I don't disagree. But I'd like to add that non-canonical form should only apply to lengths, not enumerations such as frame headers or error codes, as a general principle. The latter can be inconvenient to process if not predictable and have no benefit of being written non-canonical (other than being robust against implementation mistakes).

@mikkelfj
Copy link
Contributor

Honestly, if it is only about long headers length field, then by all means make it canonical. There are so few of those packages, and memmove is relatively cheap, especially compared to the handshake. Furthermore, you can special case those packets to start at an offset and move the preceding data down, or even write them after the length. Typically you can even with benefit use a vectored write with a separate buffer for the header. And for higher level languages that care less about performance, you can just concatenate two buffers.

If it were post handshake with lots of data it would be different. This hangs on ACK's being easy to generate with known lengths.

@mikkelfj
Copy link
Contributor

A canonical length means you use memcmp and hash keys directly, for example for the token length + token.

@janaiyengar
Copy link
Contributor

I'm disinclined to pull for a change that doesn't have strong proponents. @nibanks seems to have a strong reservation against this change. On balance, I'm supportive of closing this issue.

@nibanks
Copy link
Member

nibanks commented Jan 17, 2019

Yeah, I'd prefer not to take a change for this, at this time.

@DavidSchinazi
Copy link
Contributor

(somehow I accidentally deleted my earlier comment, so I'm reposting it)

@ianswett This would add complexity to our long header write path, but it wouldn't be insurmountable. I personally don't have a strong opinion on this change either way.

@huitema
Copy link
Contributor

huitema commented Jan 20, 2019

I would rather leave the spec as is. There are places where short encoding is natural, but there are also many place in picoquic in which the code reserves two bytes, encode an element, find the coded length, and then update the length. Yes there are workarounds but as many said, please keep the rules flexible.

@martinthomson
Copy link
Member

@marten-seemann is happy closing this. We will not enforce minimum-length encoding.

@mnot mnot added the has-consensus An issue that the Chairs have determined has consensus, by canvassing the mailing list. label Mar 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
-transport design An issue that affects the design of the protocol; resolution requires consensus. has-consensus An issue that the Chairs have determined has consensus, by canvassing the mailing list.
Projects
No open projects
Development

No branches or pull requests