Handling of corrupt Retry packets #3014

kazuho · 2019-09-10T14:40:01Z

Retry packet does not have a checksum (e.g., CRC32, AEAD tag), and therefore is fragile to accidental bit flips that happen on the wire.

Are we fine with that? I ask this, because it seems to me that an one bit flip in some parts of a Retry packet can cause the connection establishment effort to fail in unexpected ways. To paraphrase, are we fine with certain deployments of QUIC being more fragile to accidental bit flips during connection establishment, than compared to TCP (that always have the checksum)? Note that checksum is an optional feature in UDP.

Below are what I believe would happen when there is a bit flip in a Retry packet.

Consider the case where a bit in the SCID field of a Retry packet is flipped. The client would send back an Initial packet that carries this CID as the DCID. To the server, such an Initial packet would look like an attack, as it contains a Retry Token issued for a different server CID. Therefore, the server is likely to either drop the Initial packet, or close the connection immediately. This behavior would be a must for firewall generating Retry packets in order to redistribute the load between the servers running behind, as the load would be distributed by assigning different server CIDs to different servers.

Consider the case where a bit in the Token field of a Retry packet is flipped. The client would send back an Initial packet that carries this corrupt token. To the server, such an Initial packet might look like an Initial packet carrying an outdated token. Then, that server would send back a Retry packet carrying a new Retry token. But the client would drop the new Retry packet (because the client is required to accept only one Retry packet). The client continues retransmitting Initial packets using the corrupt token, until it time outs.

Bit flips in the DCID, ODCID, are non-issues as such packets would be dropped by the client. Bit flips in the length fields are likely to cause no harm, and in the worst case scenario it would look either of the above.

PS. Version Negotiation packet also lacks a checksum. Though I think we are not going to use that packet type in V1 (see #3013).

nibanks · 2019-09-10T14:47:09Z

One option would be to require UDP checksum for QUIC, right?

mikkelfj · 2019-09-10T14:57:05Z

@nibanks I don't think we should mix those layers. It would prevent QUIC from running elsewhere.

I do think that everything should be checksummed, like the initial packet. The problem is versioning, so each version need some official checksum schema.

nibanks · 2019-09-10T14:58:53Z

I don't think it's a huge requirement that QUIC must be sent over a medium that is checksummed. IPv6 also has checksum already too, right?

mikkelfj · 2019-09-10T15:03:28Z

Well, sort of, but since QUIC in 99.999% if the cases does its own QUICsumming, it would be an odd requirement.

mikkelfj · 2019-09-10T15:04:19Z

Of course UDP sums still make sense for intermediate routing but that is fabric internal.

nibanks · 2019-09-10T15:06:38Z

I'd personally prefer to depend on UDP in this case, especially since it already has performant hardware solutions. Why invent something new when there is a pre-existing solution? Also, we are only charted for UDP, as far as I know.

mikkelfj · 2019-09-10T15:08:28Z

Sure, only UDP, I'd just like to avoid hardcoding that dependency for a single case when you could just add a sha256 at the end or something.

ianswett · 2019-09-10T15:09:59Z

I think it's reasonable to recommend using UDP checksums, particularly for unprotected packets such as Retry and VN.

I hadn't thought through the implications of a corrupted Retry token to realize it makes the handshake unrecoverable. I'm not sure I want to fix that issue now, but I think it's worth documenting.

kazuho · 2019-09-10T15:25:24Z

FWIW, I think that there are three options:

Document the behavior, recommend using UDP checksums.
Define a QUIC error code that recommends the client to retry establishing a connection. Then, a server recognizing the situations listed above can send that error code.
Add a checksum field to the Retry packet (the use of the checksum can be an optional feature, much like that of UDP). The algorithm could be CRC, or GMAC (i.e. AES-GCM without any encrypted data).

mikkelfj · 2019-09-10T15:58:47Z

I'd suggest not using GMAC, at least for VNEG, because GMAC might be outdated before VNEG, and it is complex.

kazuho · 2019-09-10T16:05:44Z

@mikkelfj Retry is a concept specific to QUIC v1, which already depends on AES-GCM for protecting the integrity of Initial packets. Therefore, IMO use of GMAC would be a reasonable choice if we decide to give it a checksum. Use of GMAC would be simpler than other algorithms, because then QUIC stacks can use the existing code that encrypts the Initial packet; Retry packet would be a packet that is encrypted exactly the same way as the Initial packet, with the only exceptions being that the nonce is fixed to zero and that the encrypted part is zero bytes.

PS. Yes, we need to think about GCM becoming obsolete for Version Negotiation packets.

mikkelfj · 2019-09-10T16:14:58Z

I don't think we disagree. I did mention VNEG explicitly. But, one could argue that that if VNEG uses a different hash, why not use it for Retry as well - not that I care that much.

RyanTheOptimist · 2019-09-10T17:31:04Z

This definitely seems like a reasonable concern to me. I'd be inclined to extend to Retry packets the use of obfuscation/encryption similar to what we do for Initial packets. Heck, we could even encrypt the retry token itself if we really wished to (in keeping with our philosophy of obfuscating/hiding as much as humanly possible)

DavidSchinazi · 2019-09-10T19:15:08Z

+1. Using AES-GCM-128 with initial salts for retry could solve this elegantly.

kazuho · 2019-09-11T02:30:46Z

@nibanks

I don't think it's a huge requirement that QUIC must be sent over a medium that is checksummed.

I am not sure. I doubt if it would be practical to suggest the cloud users to ask their cloud service providers "are you currently using UDP checksums, and committed to using that in the future" before starting to use QUIC. Also, UDP checksums are rewritten by NAT devices, so there is no way to tell for an endpoint if a UDP packet is (was) protected by the UDP checksum.

martinthomson · 2019-09-11T02:49:59Z

Presumably devices that rewrite addresses and checksums also check checksums on the way in... Right? :)

This seems like a fine design issue. It's a fairly big change, for Retry, and late, but not impossible. Version Negotiation is much, much more difficult to retrofit, so I think that we should concentrate on Retry here and open a VN issue if anyone feels strongly about that.

Using Initial packet protection isn't crazy. We'd be able to drop ODCID if we did that.

Saying that we don't care is also rational. The UDP checksum is imperfect (especially when it comes to catching bit flips), but we are talking about having maybe a few connections per day time out, globally. I didn't do the numbers on that, but my recollection is that the odds are that small.

RyanTheOptimist · 2019-09-11T02:57:27Z

Totally agree that VN is a different issue and this should focus on Retry.

I also love the idea of getting rid of ODCID! That has always seems like an ugly wart to me and replacing it with initial packet protection seems much more elegant. (As well as solving the packet corruption issue)

janaiyengar · 2019-09-17T02:09:01Z

Jon Stone's work [1] showed that the internet is actually quite terrible and that the TCP/UDP checksums are super important. From the conclusion section: "Our trace data shows that the TCP and UDP checksums are catching a signicant number of persistent errors. In practice, the checksum is being asked to detect an error every few thousand packets." Of course, this data is dated, but given the I wouldn't be surprised if the numbers today were not too far off from the ones he reports.

@kazuho's right that we cannot rely on the UDP checksum being on in all circumstances, but I would guess that it'll be on in most deployments. That said, the failure rate can be high enough to be bothersome in specific deployments. I would argue that it's a valuable goal to ensure that QUIC is protected against trivial corruption in all deployments.

I agree with @RyanatGoogle that encrypting this packet is aligned with how we do things anyways, and there's additional benefit in being able to remove the ODCID. It is a bit late in the process, but this is a fairly isolated and local change.

I would argue that unless there's a strong reason to not encrypt this packet, we should encrypt it.

1: http://conferences.sigcomm.org/sigcomm/2000/conf/paper/sigcomm2000-9-1.pdf

nibanks · 2019-09-17T13:31:26Z

I would prefer not to encrypt the Retry packet and keep things how they are. We've done a lot of work to create a design that requires the least amount of work to accomplish the stateless retry scenario. Adding an additional level of encryption, just to get a checksum (something that is practically, already accomplished in most deployments with UDP) is just going to add complexity (we've already started work with HW vendors on the current model, and it's complexity, compared to TCP, is already a major hurdle).

dtikhonov · 2019-09-17T14:43:27Z

To quote the paper @janaiyengar links to above:

Traces of Internet packets from the past two years show that between 1 packet in 1,100 and 1 packet in 32,000 fails the TCP checksum

That translates to less than 0.1% error rate on the high side. We already know that a significant percentage of QUIC connections will fail -- 3%, IIRC, according to Google -- for other reasons. <0.1% is 3% of the 3%... Is it worth it?

ianswett · 2019-09-17T14:46:54Z

I think the concern here is not that the handshake fails, but that it takes a long time to fail in a circumstance where UDP connectivity does work. Chrome currently uses quite a short handshake timeout of 4 seconds, but if clients use longer timeouts, then users could wait a long time in this unrecoverable state.

I think UDP checksums are probably sufficient here, but these long timeout based handshake failures are very user visible.

dtikhonov · 2019-09-17T15:00:40Z

@kazuho writes:

Therefore, IMO use of GMAC would be a reasonable choice if we decide to give it a checksum.

This is an overkill. Just use CRC32, which is simple, cheap (there is a dedicated CPU instruction for it!), and does what we need.

mikkelfj · 2019-09-17T16:01:55Z

Not arguing against CRC but intels CRC instruction is not necessarily standard CRC checksum for reasons that elude me.

kazuho · 2019-09-17T16:02:25Z

@dtikhonov As people have been discussing, use of AES-GCM (or GMAC) is a simplification in terms of code reuse and also provides us the chance to get rid of the ODCID field.

Clarification: GMAC is any-cipher-with-GCM with zero-byte encrypted text, so if it's going to be called AES-GCM or GMAC simply depends on if we decide to encrypt the token. If we are going to make the use of the QUIC-level checksum an optional feature, it would make more sense to not encrypt the token regardless of the QUIC-level checksum being used - which means GMAC.

janaiyengar · 2019-09-17T20:50:11Z

@nibanks: Do you believe that obfuscating the packet is inconvenient at this late time, or that it is a lot of work for the HW vendors?

nibanks · 2019-09-17T21:00:14Z

It's more about the cost per packet for the HW. With TCP, the devices handle millions, if not tens of millions of connections per second using the SYN cookie method. They won't be able to get anywhere close to that with QUIC Retry packets, and it will be even worse if every packet must be individually encrypted.

mikkelfj · 2019-10-01T07:28:19Z

I don't really care much how Retry is protected, whether it is a checksum at the end or the Initial format, but aside from Version Negotiation, it is the only packet without self-contained integrity checks. That is a pain when forwarding packets, for example over a bluetooth or USB serial gateway. Now you have to add a packet version specific framing layer and since packets can be coalesced, that makes it all the more fun.

martinthomson · 2019-10-15T20:23:16Z

Even if we do nothing here, we need to note that UDP checksums are a good idea (cite RFC 8085, Section 3.4). And we need to point out that a bad Retry token will result in the server dropping the packet. Also, corruption of Retry could make it hard to distinguish between Retry tokens and NEW_TOKEN tokens unless the discriminator the server uses is in the clear (in which case, the distinction is erased, but that seems unlikely, or maybe you can create markers with a Hamming distance >1 so that bit flips are less likely to erase that signal.

Advice for clients would be to make a new connection if an Initial with a token times out. What is likely to happen for HTTP - at least in the near term - is that clients will fall back to TCP.

@ianswett suggests that we could allow a server to generate a CONNECTION_CLOSE in response to an Initial with a token that is probably an invalid Retry token. This would allow the server to signal to the client that the connection is busted. However, this won't work if we make the proposed change to retain the Initial keys after Retry (#2823), because the server won't be able to get the original keys.

MikeBishop · 2019-10-15T20:36:15Z

@ianswett suggests that we could allow a server to generate a CONNECTION_CLOSE in response to an Initial with a token that is probably an invalid Retry token. This would allow the server to signal to the client that the connection is busted.

This also becomes an attack vector, since an attacker could inject such a CONNECTION_CLOSE.

martinthomson · 2019-10-15T22:46:26Z

That sort of injection is already possible.

huitema · 2019-10-16T17:16:44Z

How is the bit flip error different from an attacker spoofing a Retry packet with made up values?

larseggert · 2019-10-16T17:45:35Z

Discussed in Cupertino. Agreement in the room for MANDATORY use of AES-GCM with zero key and zero nonce to protect the Retry token. The entire packet would be AAD. @DavidSchinazi's addition of potentially eliding DCID to be thought through by assignee of issue.

DavidSchinazi · 2019-10-17T23:16:57Z

Created PR #3120 to try to address this.

martinthomson · 2019-11-18T00:03:35Z

Moving discussion back here to address @ekr's request to encrypt the token.

Discussion thus far identified a very minor performance/value trade-off. Running AES might cost some. This is minor, as the key is fixed and so the AES key preparation costs are erased; you might even pregenerate the fixed key stream and just run XOR. With that realization, we observed that the value provided by encryption was equivalent to the difficulty of applying that XOR.

There is a step higher, which is to use a key derived from the connection ID, but that depends on running a KDF and setting up AES, which I think we all agreed was prohibitive.

I think that the idea of moving from a 0 key to a fixed key is easy, so we should definitely do that even if we do nothing else.

DavidSchinazi · 2019-11-18T11:17:40Z

I've incorporated @martinthomson 's suggestion and switched the key from all-zeroes to the initial_salt

janaiyengar · 2019-12-05T03:47:02Z

In PR #3120, @huitema wrote:

In Singapore, we were debating whether to just do checksum, versus encrypt and use an initial vector based on the CID. I wanted to compare the two variants using a benchmark designed for Picotls, but I ran into a little bit of trouble because of the structure of the picotls encryption API, which looks like:

Initialize an encryption context with the key context, the IV, and the authenticated data
Encrypt the message
Compute the AEAD checksum

With that API, there is no option to not use an IV,. Using a constant IV would have exactly the same cost as using a variable IV. I went on to compare the time to encrypt 64 bytes versus the time to just authenticate 64 bytes. Results are below for two implementations of AES128 GCM, one from OpenSSL crypto library and another from Windows Bcrypt library -- both measured using 64 bit code on my Windows laptop:

	encrypt	decrypt	authenticate	verify
bcrypt	0.35751	0.22602	0.24687	0.20189
openssl 1.1.1c	0.30211	0.23956	0.24296	0.23926

The times are measured here in microseconds -- these are the average on 100,000 operations. We see that the times are fractions of microseconds in both cases. The encryption operation is more expensive than a simple authentication, 45% more with brcrypt, 24% more with openssl. The decrypt operation is 12% more expensive than the simple authenticate with brcypt, almost the same with openssl.

kazuho · 2019-12-05T03:51:09Z

@janaiyengar I appreciate your effort to coordinate the threads; I think the discussion is mostly happening in #3274.

martinthomson · 2019-12-11T01:18:12Z

Marking as proposal-ready on the basis that we can proceed with just GMAC and later encrypt if we decide.

martinthomson added the -transport label Sep 11, 2019

mnot added the design An issue that affects the design of the protocol; resolution requires consensus. label Sep 13, 2019

larseggert assigned DavidSchinazi Oct 16, 2019

ianswett mentioned this issue Oct 17, 2019

Immediately close with INVALID_TOKEN #3107

Merged

DavidSchinazi mentioned this issue Oct 17, 2019

Add retry integrity tag #3120

Merged

martinthomson added the proposal-ready An issue which has a proposal that is believed to be ready for a consensus call. label Nov 5, 2019

mnot removed the proposal-ready An issue which has a proposal that is believed to be ready for a consensus call. label Nov 29, 2019

martinthomson added the proposal-ready An issue which has a proposal that is believed to be ready for a consensus call. label Dec 11, 2019

janaiyengar mentioned this issue Dec 19, 2019

Encrypting Retry token #3274

Closed

mnot added has-consensus An issue that the Chairs have determined has consensus, by canvassing the mailing list. and removed proposal-ready An issue which has a proposal that is believed to be ready for a consensus call. labels Jan 17, 2020

janaiyengar closed this as completed in #3120 Jan 17, 2020

nqv mentioned this issue Feb 22, 2020

Encrypting token in Retry packet #3477

Closed

Handling of corrupt Retry packets #3014

Handling of corrupt Retry packets #3014

Comments

kazuho commented Sep 10, 2019 • edited Loading

nibanks commented Sep 10, 2019

mikkelfj commented Sep 10, 2019

nibanks commented Sep 10, 2019

mikkelfj commented Sep 10, 2019

mikkelfj commented Sep 10, 2019

nibanks commented Sep 10, 2019

mikkelfj commented Sep 10, 2019

ianswett commented Sep 10, 2019

kazuho commented Sep 10, 2019 • edited Loading

mikkelfj commented Sep 10, 2019 • edited Loading

kazuho commented Sep 10, 2019 • edited Loading

mikkelfj commented Sep 10, 2019

RyanTheOptimist commented Sep 10, 2019

DavidSchinazi commented Sep 10, 2019

kazuho commented Sep 11, 2019 • edited Loading

martinthomson commented Sep 11, 2019

RyanTheOptimist commented Sep 11, 2019

janaiyengar commented Sep 17, 2019

nibanks commented Sep 17, 2019

dtikhonov commented Sep 17, 2019

ianswett commented Sep 17, 2019

dtikhonov commented Sep 17, 2019

mikkelfj commented Sep 17, 2019

kazuho commented Sep 17, 2019

janaiyengar commented Sep 17, 2019

nibanks commented Sep 17, 2019

mikkelfj commented Oct 1, 2019

martinthomson commented Oct 15, 2019

MikeBishop commented Oct 15, 2019

martinthomson commented Oct 15, 2019

huitema commented Oct 16, 2019

larseggert commented Oct 16, 2019

DavidSchinazi commented Oct 17, 2019

martinthomson commented Nov 18, 2019

DavidSchinazi commented Nov 18, 2019

janaiyengar commented Dec 5, 2019

kazuho commented Dec 5, 2019

martinthomson commented Dec 11, 2019

kazuho commented Sep 10, 2019 •

edited

Loading

kazuho commented Sep 10, 2019 •

edited

Loading

mikkelfj commented Sep 10, 2019 •

edited

Loading

kazuho commented Sep 10, 2019 •

edited

Loading

kazuho commented Sep 11, 2019 •

edited

Loading