New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
consider making the varint encoding unique #2299
Comments
I have also argued for this in the past, but it breaks down when conservatively allocate two bytes for a length that is unknown. Then you'd have to memmove if the length only requires one byte. |
The solution is simple: Don't do that! It's definitely possible to use varints everywhere (I have a (roughly) draft-17 compliant implementation, I'm always using the shortest possible varint encoding, and I don't need to memmove anywhere in order to achieve that). |
Conceptually this is nice and I think @ekr has mentioned this as well in the past. But as @mikkelfj said, we'd have to carefully think through whether it causes some packetization headaches in practice. Thinking through our current gQUIC implementation, I don't think it would cause problems. I guess one could always add a byte or two of padding if the number was shorter than the reserved length? This prevents or makes very awkward (for better or worse) picking fixed serialization sizes for specific cases(ie: ack blocks and gaps are always 2 bytes) and writing serialization code that assumes that. |
There are places where the shortest possible encoding is an important property to have, such as frame types. There are also places where it is not important, such as long header's payload length. Losing the flexibility afforded by allowing fixed-size varints is one way to fix the clumsy language. The other way is to fix the clumsy language. I suggest we add a definition for the shortest possible varint encoding to Section 16—we can call it minimal varint encoding or something of the sort—and reference it in other parts of the documents as appropriate. |
It's not only the language that's clumsy. The fact that we have to make this distinction at all is clumsy. |
I have a mixed feeling about the proposal. Generally speaking, I prefer using a canonical encoding. OTOH, there are cases that you do not know the length beforehand, and IIRC we have in the past built consensus based on the fact that people can use 2-byte varint for expressing values between 0 to 63. FWIW, the fields that would be problemsome would be: length field of long header packets, frame lengths of HEADERS and PUSH_PROMISE frames. The lengths are determined after building the contents, and therefore the length of the lengths cannot be predetermined. |
In WinQuic we have a special function just for encoding a varint explicitly two bytes that we use for the payload length in long headers. It would be extremely difficult to update this code to always use the shortest encoding without having to memmove the entire payload after it has been written to the packet buffer. Looking up at the requirement for shortest possible frame type, I don't understand why that's even necessary to require this. Obviously it's more efficient on the wire, but as far as implementation processing efficiency goes, it would make little difference in the grand scheme of things (a majority of all CPU will be spent in encryption/decryption). So, we can recommend the shortest encoding, but I don't think we should kill the connection if a particular implementation does always use that. I'd prefer to leave the varint as is and remove any requirements for a particular encoding length. |
This is because it allows for efficient switch constructs when decoding the frames. This avoids additional complexity during decode. There is difference between enumerations such as frame types, and length values. Enumerations should be uniquely encoded whereas it could be argued that lengths need not due to the memmov issue. I'm not sure that vast overhead goes to decryption because this is typically 3 cycles per byte so you could easily double the time overhead if you have many small frames that are processed inefficiently. |
Good point on long header packets. I think we should consider changing the long header length to 2 byte lengths if we did this. Given this is introducing churn and I'm not clear it's adding a lot of value, I'd say we shouldn't do this at the moment. |
@ianswett I was almost about to propose the same thing. This would make all encodings canonical - I think there is value in that. The reason I hesitated is that perhaps there are cases where 1 byte lengths makes a difference even if lazy implementations use 2 bytes. I think ACK block count were not mentioned in this discussion. This is the most important case I can think of. |
I'm pretty sure that this was a case where @ekr said that a canonical encoding would be silly. That's certainly my view. I believe that this would be a fairly significant hit both in terms of code complexity and performance. If you think that you can make the case, I think we might start with a PR here to show that a) the code is simple enough, and b) the performance is tolerable. Then there is the point about churn. The requirement on minimal encodings for frame types exists so that a frame decoder can be more efficient, as @mikkelfj says. I'm reluctant to add that requirement in too many places though (even in the h3 cases we are currently considering). |
I think we should leave this as it is, primarily for the reason that this is an unnecessary constraint. The efficiency argument isn't strong for non-frame use cases, and the loss of flexibility can be an issue. @larseggert might have trouble with forcing packet lengths to be minimally represented, for the same reason as @nibanks. On changing the long header packet size to 2 bytes, we discussed in #1577 and decided against it. |
I'm seeing fair support for closing with no action, but I'm happy to leave this until we meet in Tokyo. Does anyone want to add anything? I'll close if there is no new information or argument. |
I've mentioned this previously that this might be good to do, but I also have mixed feelings now because of the reasons @nibanks mentioned about pre-reserving the length. It would seem nasty to work around that. |
I'd rather close this with no action at this point. I can't see us getting to consensus to do this in v1. |
I'm OK with closing, but it is not difficult to know how long a packet number will need to be, and to leave space for it.
|
@larseggert The problem is with formatting ACK frames in-place. |
Depends on your data structures. My ACKing code doesn't have this issue. |
I think long header packets are the largest issue I'm aware of as @nibanks pointed out above. ACK frames are a bit annoying, but easier. |
Well, as I said, I'm not objecting to closing this now |
Not feeling strongly about this issue, but for how many implementations would this change actually be difficult to implement? So far, the only definitive statement was about WinQuic. Does this apply to other implementations as well, or is this something that's limited to one implementation? |
I believe @DavidSchinazi indicated it'd be more difficult in our coalesced packets implementation, but I'll let him confirm. |
At this stage, I think that one objection is sufficient. The advantage of a minimal encoding is in wire efficiency. Something that can be achieved with the current scheme, and not guaranteed with the proposed one (because there are ample ways in which to suboptimally encode packets, PADDING notwithstanding). Add the unproven performance characteristics, and I'm not seeing much incentive to change anything. |
I don't disagree. But I'd like to add that non-canonical form should only apply to lengths, not enumerations such as frame headers or error codes, as a general principle. The latter can be inconvenient to process if not predictable and have no benefit of being written non-canonical (other than being robust against implementation mistakes). |
Honestly, if it is only about long headers length field, then by all means make it canonical. There are so few of those packages, and memmove is relatively cheap, especially compared to the handshake. Furthermore, you can special case those packets to start at an offset and move the preceding data down, or even write them after the length. Typically you can even with benefit use a vectored write with a separate buffer for the header. And for higher level languages that care less about performance, you can just concatenate two buffers. If it were post handshake with lots of data it would be different. This hangs on ACK's being easy to generate with known lengths. |
A canonical length means you use memcmp and hash keys directly, for example for the token length + token. |
I'm disinclined to pull for a change that doesn't have strong proponents. @nibanks seems to have a strong reservation against this change. On balance, I'm supportive of closing this issue. |
Yeah, I'd prefer not to take a change for this, at this time. |
(somehow I accidentally deleted my earlier comment, so I'm reposting it) @ianswett This would add complexity to our long header write path, but it wouldn't be insurmountable. I personally don't have a strong opinion on this change either way. |
I would rather leave the spec as is. There are places where short encoding is natural, but there are also many place in picoquic in which the code reserves two bytes, encode an element, find the coded length, and then update the length. Yes there are workarounds but as many said, please keep the rules flexible. |
@marten-seemann is happy closing this. We will not enforce minimum-length encoding. |
Our varint encoding has the unpleasant property that it's possible to encode numbers in multiple ways. This leads to rather clumsy language when it comes to the frame encoding:
The easy fix for this is to make the varint encoding unique, such that every number can only be encoded in a single way. This is a very straightforward change from what we have today:
The 1-byte numbers would still encode the values from 0 to 63, but the 2-byte numbers would encode the values between 64 and 16,447 (instead of encoding the range between 0 and 16,383), etc.
While a simplification of the spec is the main advantage of this change, it has the nice property that it slightly increases the encoding efficiency.
The text was updated successfully, but these errors were encountered: