Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling of corrupt Retry packets #3014

Closed
kazuho opened this issue Sep 10, 2019 · 61 comments · Fixed by #3120
Closed

Handling of corrupt Retry packets #3014

kazuho opened this issue Sep 10, 2019 · 61 comments · Fixed by #3120
Assignees
Labels
-transport design An issue that affects the design of the protocol; resolution requires consensus. has-consensus An issue that the Chairs have determined has consensus, by canvassing the mailing list.

Comments

@kazuho
Copy link
Member

kazuho commented Sep 10, 2019

Retry packet does not have a checksum (e.g., CRC32, AEAD tag), and therefore is fragile to accidental bit flips that happen on the wire.

Are we fine with that? I ask this, because it seems to me that an one bit flip in some parts of a Retry packet can cause the connection establishment effort to fail in unexpected ways. To paraphrase, are we fine with certain deployments of QUIC being more fragile to accidental bit flips during connection establishment, than compared to TCP (that always have the checksum)? Note that checksum is an optional feature in UDP.

Below are what I believe would happen when there is a bit flip in a Retry packet.

Consider the case where a bit in the SCID field of a Retry packet is flipped. The client would send back an Initial packet that carries this CID as the DCID. To the server, such an Initial packet would look like an attack, as it contains a Retry Token issued for a different server CID. Therefore, the server is likely to either drop the Initial packet, or close the connection immediately. This behavior would be a must for firewall generating Retry packets in order to redistribute the load between the servers running behind, as the load would be distributed by assigning different server CIDs to different servers.

Consider the case where a bit in the Token field of a Retry packet is flipped. The client would send back an Initial packet that carries this corrupt token. To the server, such an Initial packet might look like an Initial packet carrying an outdated token. Then, that server would send back a Retry packet carrying a new Retry token. But the client would drop the new Retry packet (because the client is required to accept only one Retry packet). The client continues retransmitting Initial packets using the corrupt token, until it time outs.

Bit flips in the DCID, ODCID, are non-issues as such packets would be dropped by the client. Bit flips in the length fields are likely to cause no harm, and in the worst case scenario it would look either of the above.

PS. Version Negotiation packet also lacks a checksum. Though I think we are not going to use that packet type in V1 (see #3013).

@nibanks
Copy link
Member

nibanks commented Sep 10, 2019

One option would be to require UDP checksum for QUIC, right?

@mikkelfj
Copy link
Contributor

@nibanks I don't think we should mix those layers. It would prevent QUIC from running elsewhere.

I do think that everything should be checksummed, like the initial packet. The problem is versioning, so each version need some official checksum schema.

@nibanks
Copy link
Member

nibanks commented Sep 10, 2019

I don't think it's a huge requirement that QUIC must be sent over a medium that is checksummed. IPv6 also has checksum already too, right?

@mikkelfj
Copy link
Contributor

Well, sort of, but since QUIC in 99.999% if the cases does its own QUICsumming, it would be an odd requirement.

@mikkelfj
Copy link
Contributor

Of course UDP sums still make sense for intermediate routing but that is fabric internal.

@nibanks
Copy link
Member

nibanks commented Sep 10, 2019

I'd personally prefer to depend on UDP in this case, especially since it already has performant hardware solutions. Why invent something new when there is a pre-existing solution? Also, we are only charted for UDP, as far as I know.

@mikkelfj
Copy link
Contributor

Sure, only UDP, I'd just like to avoid hardcoding that dependency for a single case when you could just add a sha256 at the end or something.

@ianswett
Copy link
Contributor

I think it's reasonable to recommend using UDP checksums, particularly for unprotected packets such as Retry and VN.

I hadn't thought through the implications of a corrupted Retry token to realize it makes the handshake unrecoverable. I'm not sure I want to fix that issue now, but I think it's worth documenting.

@kazuho
Copy link
Member Author

kazuho commented Sep 10, 2019

FWIW, I think that there are three options:

  1. Document the behavior, recommend using UDP checksums.
  2. Define a QUIC error code that recommends the client to retry establishing a connection. Then, a server recognizing the situations listed above can send that error code.
  3. Add a checksum field to the Retry packet (the use of the checksum can be an optional feature, much like that of UDP). The algorithm could be CRC, or GMAC (i.e. AES-GCM without any encrypted data).

@mikkelfj
Copy link
Contributor

mikkelfj commented Sep 10, 2019

I'd suggest not using GMAC, at least for VNEG, because GMAC might be outdated before VNEG, and it is complex.

@kazuho
Copy link
Member Author

kazuho commented Sep 10, 2019

@mikkelfj Retry is a concept specific to QUIC v1, which already depends on AES-GCM for protecting the integrity of Initial packets. Therefore, IMO use of GMAC would be a reasonable choice if we decide to give it a checksum. Use of GMAC would be simpler than other algorithms, because then QUIC stacks can use the existing code that encrypts the Initial packet; Retry packet would be a packet that is encrypted exactly the same way as the Initial packet, with the only exceptions being that the nonce is fixed to zero and that the encrypted part is zero bytes.

PS. Yes, we need to think about GCM becoming obsolete for Version Negotiation packets.

@mikkelfj
Copy link
Contributor

I don't think we disagree. I did mention VNEG explicitly. But, one could argue that that if VNEG uses a different hash, why not use it for Retry as well - not that I care that much.

@RyanTheOptimist
Copy link
Contributor

This definitely seems like a reasonable concern to me. I'd be inclined to extend to Retry packets the use of obfuscation/encryption similar to what we do for Initial packets. Heck, we could even encrypt the retry token itself if we really wished to (in keeping with our philosophy of obfuscating/hiding as much as humanly possible)

@DavidSchinazi
Copy link
Contributor

+1. Using AES-GCM-128 with initial salts for retry could solve this elegantly.

@kazuho
Copy link
Member Author

kazuho commented Sep 11, 2019

@nibanks

I don't think it's a huge requirement that QUIC must be sent over a medium that is checksummed.

I am not sure. I doubt if it would be practical to suggest the cloud users to ask their cloud service providers "are you currently using UDP checksums, and committed to using that in the future" before starting to use QUIC. Also, UDP checksums are rewritten by NAT devices, so there is no way to tell for an endpoint if a UDP packet is (was) protected by the UDP checksum.

@martinthomson
Copy link
Member

Presumably devices that rewrite addresses and checksums also check checksums on the way in... Right? :)

This seems like a fine design issue. It's a fairly big change, for Retry, and late, but not impossible. Version Negotiation is much, much more difficult to retrofit, so I think that we should concentrate on Retry here and open a VN issue if anyone feels strongly about that.

Using Initial packet protection isn't crazy. We'd be able to drop ODCID if we did that.

Saying that we don't care is also rational. The UDP checksum is imperfect (especially when it comes to catching bit flips), but we are talking about having maybe a few connections per day time out, globally. I didn't do the numbers on that, but my recollection is that the odds are that small.

@RyanTheOptimist
Copy link
Contributor

Totally agree that VN is a different issue and this should focus on Retry.

I also love the idea of getting rid of ODCID! That has always seems like an ugly wart to me and replacing it with initial packet protection seems much more elegant. (As well as solving the packet corruption issue)

@mnot mnot added the design An issue that affects the design of the protocol; resolution requires consensus. label Sep 13, 2019
@janaiyengar
Copy link
Contributor

Jon Stone's work [1] showed that the internet is actually quite terrible and that the TCP/UDP checksums are super important. From the conclusion section: "Our trace data shows that the TCP and UDP checksums are catching a signi cant number of persistent errors. In practice, the checksum is being asked to detect an error every few thousand packets." Of course, this data is dated, but given the I wouldn't be surprised if the numbers today were not too far off from the ones he reports.

@kazuho's right that we cannot rely on the UDP checksum being on in all circumstances, but I would guess that it'll be on in most deployments. That said, the failure rate can be high enough to be bothersome in specific deployments. I would argue that it's a valuable goal to ensure that QUIC is protected against trivial corruption in all deployments.

I agree with @RyanatGoogle that encrypting this packet is aligned with how we do things anyways, and there's additional benefit in being able to remove the ODCID. It is a bit late in the process, but this is a fairly isolated and local change.

I would argue that unless there's a strong reason to not encrypt this packet, we should encrypt it.

1: http://conferences.sigcomm.org/sigcomm/2000/conf/paper/sigcomm2000-9-1.pdf

@nibanks
Copy link
Member

nibanks commented Sep 17, 2019

I would prefer not to encrypt the Retry packet and keep things how they are. We've done a lot of work to create a design that requires the least amount of work to accomplish the stateless retry scenario. Adding an additional level of encryption, just to get a checksum (something that is practically, already accomplished in most deployments with UDP) is just going to add complexity (we've already started work with HW vendors on the current model, and it's complexity, compared to TCP, is already a major hurdle).

@dtikhonov
Copy link
Member

To quote the paper @janaiyengar links to above:

Traces of Internet packets from the past two years show that between 1 packet in 1,100 and 1 packet in 32,000 fails the TCP checksum

That translates to less than 0.1% error rate on the high side. We already know that a significant percentage of QUIC connections will fail -- 3%, IIRC, according to Google -- for other reasons. <0.1% is 3% of the 3%... Is it worth it?

@ianswett
Copy link
Contributor

I think the concern here is not that the handshake fails, but that it takes a long time to fail in a circumstance where UDP connectivity does work. Chrome currently uses quite a short handshake timeout of 4 seconds, but if clients use longer timeouts, then users could wait a long time in this unrecoverable state.

I think UDP checksums are probably sufficient here, but these long timeout based handshake failures are very user visible.

@dtikhonov
Copy link
Member

@kazuho writes:

Therefore, IMO use of GMAC would be a reasonable choice if we decide to give it a checksum.

This is an overkill. Just use CRC32, which is simple, cheap (there is a dedicated CPU instruction for it!), and does what we need.

@mikkelfj
Copy link
Contributor

Not arguing against CRC but intels CRC instruction is not necessarily standard CRC checksum for reasons that elude me.

@kazuho
Copy link
Member Author

kazuho commented Sep 17, 2019

@dtikhonov As people have been discussing, use of AES-GCM (or GMAC) is a simplification in terms of code reuse and also provides us the chance to get rid of the ODCID field.

Clarification: GMAC is any-cipher-with-GCM with zero-byte encrypted text, so if it's going to be called AES-GCM or GMAC simply depends on if we decide to encrypt the token. If we are going to make the use of the QUIC-level checksum an optional feature, it would make more sense to not encrypt the token regardless of the QUIC-level checksum being used - which means GMAC.

@janaiyengar
Copy link
Contributor

@nibanks: Do you believe that obfuscating the packet is inconvenient at this late time, or that it is a lot of work for the HW vendors?

@nibanks
Copy link
Member

nibanks commented Sep 17, 2019

It's more about the cost per packet for the HW. With TCP, the devices handle millions, if not tens of millions of connections per second using the SYN cookie method. They won't be able to get anywhere close to that with QUIC Retry packets, and it will be even worse if every packet must be individually encrypted.

@mikkelfj
Copy link
Contributor

mikkelfj commented Oct 1, 2019

I don't really care much how Retry is protected, whether it is a checksum at the end or the Initial format, but aside from Version Negotiation, it is the only packet without self-contained integrity checks. That is a pain when forwarding packets, for example over a bluetooth or USB serial gateway. Now you have to add a packet version specific framing layer and since packets can be coalesced, that makes it all the more fun.

@martinthomson
Copy link
Member

Even if we do nothing here, we need to note that UDP checksums are a good idea (cite RFC 8085, Section 3.4). And we need to point out that a bad Retry token will result in the server dropping the packet. Also, corruption of Retry could make it hard to distinguish between Retry tokens and NEW_TOKEN tokens unless the discriminator the server uses is in the clear (in which case, the distinction is erased, but that seems unlikely, or maybe you can create markers with a Hamming distance >1 so that bit flips are less likely to erase that signal.

Advice for clients would be to make a new connection if an Initial with a token times out. What is likely to happen for HTTP - at least in the near term - is that clients will fall back to TCP.

@ianswett suggests that we could allow a server to generate a CONNECTION_CLOSE in response to an Initial with a token that is probably an invalid Retry token. This would allow the server to signal to the client that the connection is busted. However, this won't work if we make the proposed change to retain the Initial keys after Retry (#2823), because the server won't be able to get the original keys.

@MikeBishop
Copy link
Contributor

@ianswett suggests that we could allow a server to generate a CONNECTION_CLOSE in response to an Initial with a token that is probably an invalid Retry token. This would allow the server to signal to the client that the connection is busted.

This also becomes an attack vector, since an attacker could inject such a CONNECTION_CLOSE.

@martinthomson
Copy link
Member

That sort of injection is already possible.

@huitema
Copy link
Contributor

huitema commented Oct 16, 2019

How is the bit flip error different from an attacker spoofing a Retry packet with made up values?

@larseggert
Copy link
Member

Discussed in Cupertino. Agreement in the room for MANDATORY use of AES-GCM with zero key and zero nonce to protect the Retry token. The entire packet would be AAD. @DavidSchinazi's addition of potentially eliding DCID to be thought through by assignee of issue.

@DavidSchinazi
Copy link
Contributor

Created PR #3120 to try to address this.

@martinthomson martinthomson added the proposal-ready An issue which has a proposal that is believed to be ready for a consensus call. label Nov 5, 2019
@martinthomson
Copy link
Member

Moving discussion back here to address @ekr's request to encrypt the token.

Discussion thus far identified a very minor performance/value trade-off. Running AES might cost some. This is minor, as the key is fixed and so the AES key preparation costs are erased; you might even pregenerate the fixed key stream and just run XOR. With that realization, we observed that the value provided by encryption was equivalent to the difficulty of applying that XOR.

There is a step higher, which is to use a key derived from the connection ID, but that depends on running a KDF and setting up AES, which I think we all agreed was prohibitive.

I think that the idea of moving from a 0 key to a fixed key is easy, so we should definitely do that even if we do nothing else.

@DavidSchinazi
Copy link
Contributor

I've incorporated @martinthomson 's suggestion and switched the key from all-zeroes to the initial_salt

@mnot mnot removed the proposal-ready An issue which has a proposal that is believed to be ready for a consensus call. label Nov 29, 2019
@janaiyengar
Copy link
Contributor

In PR #3120, @huitema wrote:

In Singapore, we were debating whether to just do checksum, versus encrypt and use an initial vector based on the CID. I wanted to compare the two variants using a benchmark designed for Picotls, but I ran into a little bit of trouble because of the structure of the picotls encryption API, which looks like:

  1. Initialize an encryption context with the key context, the IV, and the authenticated data
  2. Encrypt the message
  3. Compute the AEAD checksum

With that API, there is no option to not use an IV,. Using a constant IV would have exactly the same cost as using a variable IV. I went on to compare the time to encrypt 64 bytes versus the time to just authenticate 64 bytes. Results are below for two implementations of AES128 GCM, one from OpenSSL crypto library and another from Windows Bcrypt library -- both measured using 64 bit code on my Windows laptop:

  encrypt decrypt authenticate verify
bcrypt 0.35751 0.22602 0.24687 0.20189
openssl 1.1.1c 0.30211 0.23956 0.24296 0.23926

The times are measured here in microseconds -- these are the average on 100,000 operations. We see that the times are fractions of microseconds in both cases. The encryption operation is more expensive than a simple authentication, 45% more with brcrypt, 24% more with openssl. The decrypt operation is 12% more expensive than the simple authenticate with brcypt, almost the same with openssl.

@kazuho
Copy link
Member Author

kazuho commented Dec 5, 2019

@janaiyengar I appreciate your effort to coordinate the threads; I think the discussion is mostly happening in #3274.

@martinthomson
Copy link
Member

Marking as proposal-ready on the basis that we can proceed with just GMAC and later encrypt if we decide.

@martinthomson martinthomson added the proposal-ready An issue which has a proposal that is believed to be ready for a consensus call. label Dec 11, 2019
@mnot mnot added has-consensus An issue that the Chairs have determined has consensus, by canvassing the mailing list. and removed proposal-ready An issue which has a proposal that is believed to be ready for a consensus call. labels Jan 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
-transport design An issue that affects the design of the protocol; resolution requires consensus. has-consensus An issue that the Chairs have determined has consensus, by canvassing the mailing list.
Projects
None yet
Development

Successfully merging a pull request may close this issue.