
Preventing KEY_PHASE bit from being used as a tool to correlate CIDs #1322

Closed
kazuho opened this issue Apr 24, 2018 · 43 comments
Labels
-tls -transport design An issue that affects the design of the protocol; resolution requires consensus. has-consensus An issue that the Chairs have determined has consensus, by canvassing the mailing list.

Comments

@kazuho (Member) commented Apr 24, 2018

Once we decide either to encrypt the PN (#1079) or to have per-CID PNs (#1317), the KEY_PHASE bit will be left as a tool for an observer to correlate CIDs, i.e. to determine which CIDs belong to a particular QUIC connection.

Let's consider the case where a client, in an attempt to grease the bit, flips the KEY_PHASE bit for every 65,536 packets it sends. That would mean that it is effectively leaking the 17th bit (from the LSB) of an unencrypted PN. The leak would be sufficient for an observer to determine which CIDs belonged to the same connection, for connections whose CIDs carry more than 65,536 packets.
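This first example can be made concrete with a toy model (illustrative only; the function below is not any implementation's API): the observer simply reads bit 16 of the otherwise hidden packet number off the wire.

```python
# Toy model of the leak: a client that flips KEY_PHASE every 65,536 packets
# is effectively publishing bit 16 of its packet number in the clear.
def observed_key_phase(pn: int) -> int:
    """KEY_PHASE as sent by a client that flips the bit every 2**16 packets."""
    return (pn >> 16) & 1

# Two CIDs whose observed bits flip in lockstep are trivially linkable:
assert observed_key_phase(65535) == 0
assert observed_key_phase(65536) == 1
assert observed_key_phase(131072) == 0
```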

Another bad example is a client that greases the bit for only a small fraction of connections (e.g. 1%). An observer then has a better chance of tracking that 1% of connections.

Considering these two attacks, I think that we should require implementations to flip the KEY_PHASE bit at random, and that the behavior should cover either 50% or 100% of the connections that we handle.

The questions are:

  • do we need to document such requirements?
  • are there any other ways for an observer to use the bit?
@kazuho (Member, Author) commented Apr 24, 2018

The other way to address the issue is to remove the KEY_PHASE bit.

For example, we can remove the bit and instead state that the key phase can be changed only when a new CID is used. When an endpoint receives a new CID, it trial decrypts the packet using the current generation key and the next generation key to see if the key phase has shifted.
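A minimal sketch of this trial-decryption idea, using a toy HMAC-based stand-in for the real AEAD (names such as `toy_seal` and `decrypt_on_new_cid` are hypothetical, not from any QUIC API):

```python
import hashlib
import hmac

TAG_LEN = 16

def toy_seal(key: bytes, payload: bytes) -> bytes:
    # Toy AEAD: append an HMAC tag; stands in for real packet protection.
    return payload + hmac.new(key, payload, hashlib.sha256).digest()[:TAG_LEN]

def toy_open(key: bytes, packet: bytes):
    payload, tag = packet[:-TAG_LEN], packet[-TAG_LEN:]
    expect = hmac.new(key, payload, hashlib.sha256).digest()[:TAG_LEN]
    return payload if hmac.compare_digest(tag, expect) else None

def decrypt_on_new_cid(packet: bytes, current_key: bytes, next_key: bytes):
    """On the first packet of a new CID, learn the key phase by trial."""
    for phase, key in enumerate((current_key, next_key)):
        payload = toy_open(key, packet)
        if payload is not None:
            return phase, payload
    return None  # neither generation authenticates: drop the packet

# The peer rolled its key before switching CIDs:
cur, nxt = b"k0" * 16, b"k1" * 16
phase, payload = decrypt_on_new_cid(toy_seal(nxt, b"hello"), cur, nxt)
assert (phase, payload) == (1, b"hello")
```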

@martinthomson added the design, -transport, and -tls labels Apr 26, 2018
@martinthomson (Member)

Once we decide to either encrypt PN (#1079) or have per-CID PN (#1317),

Why not both?!?

@kazuho (Member, Author) commented Apr 26, 2018

I agree that we should do both.

The intent of the text was to state that preventing the KEY_PHASE bit from leaking information that could be used to correlate CIDs makes sense only if we also prevent the leak from the PN in some way.

@huitema (Contributor) commented Apr 30, 2018

I like the idea of using migration to a new CID as an alternative to key rollover. Come to think of it, we could transmit the key phase bit as part of the NEW_CONNECTION_ID frame. We could even make it a key rollover sequence. But that requires allowing different paths to use different key phases. If all paths migrate to new CIDs at the same time, we still have leakage.

@martinthomson (Member)

That sounds like budget encryption, which I think we should avoid like the plague. See #1317 (comment) for an alternative design.

@kazuho (Member, Author) commented May 1, 2018

Come to think of it, we could transmit the key phase bit as part of the NEW_CONNECTION_ID frame. Could even make it a key rollover sequence.

The positive side of the current design is that key updates are unilateral. Using NEW_CONNECTION_ID for signalling means that an endpoint will instruct the peer to update the key. I am not sure if doing so is a good thing.

If all paths migrate to new CID at the same time, we still have leakage.

I agree that that is an issue. One trivial way of avoiding such an issue is to implement the following strategy:

  • endpoints should switch to a new path after sending 2^32 packets on that path
  • endpoints should start using an updated key for every 2^61 packets sent over one QUIC connection

We might want to suggest something like that.

@martinthomson (Member)

Key updates require reciprocation. Using a new connection ID would be the same. So no change there.

One advantage of coupling NEW_CONNECTION_ID to key updates is that it means that there is a single mechanism that supports migration (and multipath) and key update. You can't support one without also enabling the other.

One thing I've been concerned about with key update is that no one will implement it. NSS only very recently grew that capability in TLS 1.3; it's a small feature, but one that took ages to implement because there wasn't much incentive to build it. We still don't have the DTLS variant because it messes with our timers in ways that are hard to reason about.

I agree that simultaneous migration would be a mistake. Better to drive rollovers based on packet counts as @kazuho suggests. Though I think that we can probably ignore the 2^62 limit and throw connections away if they ever get to that point.

@mikkelfj (Contributor) commented May 1, 2018

I support the idea of changing keys only when changing CIDs, but make it optional until the packet count reaches at most 2^32. The CID change can be probabilistic after, say, 2^31 packets to reduce linkage.

@kazuho (Member, Author) commented May 1, 2018

@martinthomson

One thing I've been concerned about with key update is that no one will implement it. NSS only very recently grew that capability in TLS 1.3; it's a small feature, but one that took ages to implement because there wasn't much incentive to build it. We still don't have the DTLS variant because it messes with our timers in ways that are hard to reason about.

I agree that simultaneous migration would be a mistake. Better to drive rollovers based on packet counts as @kazuho suggests. Though I think that we can probably ignore the 2^62 limit and throw connections away if they ever get to that point.

FWIW, picotls doesn't yet have key updates (and I wonder if I will ever implement it). Considering that, I am perfectly fine with dropping key update entirely from QUIC v1, leaving it as an issue to be solved by an extension or in v2.

@mikkelfj (Contributor) commented May 1, 2018

@kazuho I don't have the numbers fresh in memory, but if you don't do key updates, you either limit yourself to a few GB of transfer or you create a serious security problem, because GCM is weak when you send too much data under one key.

Or am I missing something about key updates?

If it isn't already stated, an endpoint should close a connection with a protocol violation if the peer does not refresh keys in due time.

@mikkelfj (Contributor) commented May 1, 2018

@kazuho (Member, Author) commented May 2, 2018

@mikkelfj Thank you for the link. I have actually been looking for it. And rereading it, I agree that we need to have some form of key update in QUIC (or change PN to a 32-bit value, considering that AES-GCM is the default cipher).

@kazuho (Member, Author) commented May 3, 2018

Now that #1079 has been merged, and assuming that we would not have per-CID PNs, let me propose another simple solution: never update the PN protection key throughout the connection, and define the generation of the payload protection key as PN mod 2^32.

Since the sample used for PNE is 128 bits long, I would assume (or hope) that we can encrypt 2^62 packet numbers using a single key. In other words, there is no need to update the PN protection key. Then, we can use the value of the decrypted PN to determine the generation of the payload protection key.
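Reading the proposal as "the payload-key generation advances once every 2^32 packet numbers" (an assumption about its intent, not spec text), a minimal sketch:

```python
# With the PN-protection key fixed for the connection's lifetime, the
# decrypted PN alone selects which payload-protection key generation to use.
KEY_UPDATE_INTERVAL = 2 ** 32

def payload_key_generation(pn: int) -> int:
    # The generation advances once per 2**32 packet numbers.
    return pn // KEY_UPDATE_INTERVAL

assert payload_key_generation(2 ** 32 - 1) == 0
assert payload_key_generation(2 ** 32) == 1
```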

@huitema (Contributor) commented May 3, 2018

I like that proposal. It makes key rotation straightforward, and it does solve the phase bit leakage issue.

@mikkelfj (Contributor) commented May 3, 2018

That could work. There is an on-path attack where large packet numbers are skipped to make the peer expend resources on frequent rekeying, so perhaps some text to guard against this - notably in relation to optimistic-ACK packet skipping. I was considering switching between CIDs instead of skipping packets - but that is no good if the ACK doesn't care about the path and PNs share one linear space.

@marten-seemann (Contributor)

That’s an interesting proposal.
I see a problem that the key rotation is exercised very infrequently. 2^32 packets allow the transfer of roughly 5 TB of data, so only a tiny fraction of QUIC connections will use it at all. And code that doesn’t get used is prone to bugs.
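A quick sanity check of that estimate, assuming an average packet size of roughly 1200 bytes (the exact figure is an assumption; real packets vary):

```python
# 2**32 packets at ~1200 bytes each is on the order of 5 TB, matching the
# estimate above -- so only very large transfers ever reach the rollover.
packets = 2 ** 32
total_bytes = packets * 1200
assert round(total_bytes / 1e12, 1) == 5.2  # ~5 TB
```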

@janaiyengar (Contributor)

@kazuho: I still believe that per-CID PN is the best way to get to multipath. Even if we don't do it yet, I expect that we will get there eventually. I wouldn't want to make the assumption that we use a single PN space yet.

I like the idea of tying key updates to new CIDs, but there's a gotcha. Both endpoints might decide to do zero-length CIDs, and you then lose any ability to do a key update.

Why not do both? That is, KEY_PHASE bit used as normal, but CID resets the key phase. In other words, CID represents the high-order bits of the key phase, and the KEY_PHASE bit represents the lowest order bit. Does this make sense?

@kazuho (Member, Author) commented May 4, 2018

@mikkelfj

There is an on-path attack where large packet numbers are skipped to expend resources on frequent rekeying, so perhaps some text to guard against this - notably in relation to optimistic-ACK packet skipping.

I am not sure that is a realistic attack, considering that the attacker would be required to spend the same amount of CPU, and that an attacker cannot trigger a key update more than once every 2 RTT (note: a sender cannot send a PN that is greater than the largest acknowledged value + 2^31).

@kazuho (Member, Author) commented May 4, 2018

@marten-seemann

I see a problem that the key rotation is exercised very infrequently. 2^32 packets allow the transfer of roughly 5 TB of data, so only a tiny fraction of QUIC connections will use it at all. And code that doesn’t get used is prone to bugs.

I share the concern, and I think that we can grease the feature. One way would be to occasionally start with a PN near 2^32. The other way would be, for some portion of connections, to inject huge gaps during the early stages of the connection so that the PN soon rises above 2^32.

@marten-seemann (Contributor)

Is there any limit on how many key updates can safely be done? If not, another option would be to lower the threshold to e.g. 2^16 (this would correspond to roughly 80MB of data transferred). This would ensure that a significant fraction of QUIC connections would do a key update.

@kazuho (Member, Author) commented May 4, 2018

@janaiyengar

I like the idea of tying key updates to new CIDs, but there's a gotcha. Both endpoints might decide to do zero-length CIDs, and you then lose any ability to do a key update.

Agreed.

Why not do both?

I think that I at least agree with you that my proposal of eliminating the KEY_PHASE bit by using PN mod 2^32 as the generation number is orthogonal to the discussion of #1317.

That is, KEY_PHASE bit used as normal, but CID resets the key phase. In other words, CID represents the high-order bits of the key phase, and the KEY_PHASE bit represents the lowest order bit. Does this make sense?

That should be possible, but it might be better to derive the first generation of the encryption key using an HKDF that takes the CID as input, then update the key for every 2^32 PNs. This is because we cannot skip key generations in the current derivation formula, which is: client_pp_secret<N+1> = QHKDF-Expand(client_pp_secret<N>, "client 1rtt", Hash.length).
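The no-skipping property can be sketched as follows. The `qhkdf_expand_sketch` helper is a simplification (the real QHKDF-Expand label and length encoding is omitted); only the chain structure matters here:

```python
import hashlib
import hmac

def qhkdf_expand_sketch(secret: bytes, label: bytes, length: int) -> bytes:
    # Single-block HKDF-Expand stand-in (assumes length <= hash output size).
    assert length <= hashlib.sha256().digest_size
    return hmac.new(secret, label + b"\x01", hashlib.sha256).digest()[:length]

def secret_for_generation(secret_0: bytes, n: int) -> bytes:
    # Generation N is reachable only by walking through every generation
    # before it: the chain cannot be skipped.
    s = secret_0
    for _ in range(n):
        s = qhkdf_expand_sketch(s, b"client 1rtt", 32)
    return s

s0 = b"\x00" * 32
assert secret_for_generation(s0, 3) == qhkdf_expand_sketch(
    secret_for_generation(s0, 2), b"client 1rtt", 32)
```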

@kazuho (Member, Author) commented May 4, 2018

@marten-seemann

Is there any limit on how many key updates can safely be done? If not, another option would be to lower the threshold to e.g. 2^16 (this would correspond to roughly 80MB of data transferred). This would ensure that a significant fraction of QUIC connections would do a key update.

I would prefer setting the threshold to something greater than the largest delta that can be represented by the PN field of a QUIC packet. That property ensures that having two generations of encryption keys is enough, regardless of how the client injects gaps. That is why I proposed 2^32.

@marten-seemann (Contributor)

I don’t know if many implementations would actually implement a key update that only takes place after 5 TB of transferred data. In most cases, it would probably be easier to shut down the connection and reconnect.
I also don’t see how we could possibly integration-test and interop-test this without manually modifying the constant every time we run a test (which kind of defeats the purpose of such a test anyway).

@kazuho (Member, Author) commented May 4, 2018

@marten-seemann Maybe you missed #1322 (comment)? I think you can enforce the use in some ways.

@martinthomson (Member)

I don't think that PN mod 2^32 works without trial decryption.

Let's say that we have an 8-bit encoding and we are approaching the crossover point. When the crossover happens, the recipient uses the old PN key and gets a random 8-bit value, which might be a valid value less than the crossover point. That packet will fail to be decrypted.

I also agree with @marten-seemann that we should favour designs that are based on explicit signals rather than simple counts. 2^32 is large enough that some implementations might never encounter the rollover point. I have experience with fixed rollovers in NSS, and testing them is ugly. On the other hand, sending KeyUpdate is easy. I don't think that a randomized packet number is the right fix for that.

Finally, I suspect @janaiyengar is right with respect to per-CID spaces, but the decision to roll keys based on a change in connection ID can be made without going that far. All we need to do is decide to keep the sequencing of connection IDs. We also need to recognize that two sets of keys might be maintained for some time during migration; that would be somewhat longer than the text currently suggests because of probing, but not that much.

@kazuho (Member, Author) commented May 4, 2018

I don't think that PN mod 2^32 works without trial decryption.
Let's say that we have an 8-bit encoding and we are approaching the crossover point. When the crossover happens, the recipient uses the old PN key and gets a random 8-bit value, which might be a valid value less than the crossover point. That packet will fail to be decrypted.

@martinthomson Maybe you missed the point that I proposed keeping the PN key stable across the entire connection? #1322 (comment)

@martinthomson (Member)

No, I saw that - and it's a necessary precondition for this. Nonetheless, the problem remains.

@kazuho (Member, Author) commented May 4, 2018

Then, would you mind clarifying what you mean by “old PN key”?

@martinthomson (Member)

Oh, I'm being daft. I still think that the other reasons are enough.

@kazuho (Member, Author) commented May 4, 2018

PS. assuming that we can use one PN key across the entire connection, the other solution would be to steal one bit from the varint encoding of the PN (in clear) and use that as the KEY_PHASE bit. The bit will then become unobservable.

@martinthomson (Member)

Yes, I have considered that option. I still prefer coupling connection ID change to key migration. It has the benefit of consolidating mechanisms for migration and key update, which means that the mechanism is more likely to see use.

@kazuho (Member, Author) commented May 4, 2018

I still prefer coupling connection ID change to key migration. It has the benefit of consolidating mechanisms for migration and key update, which means that the mechanism is more likely to see use.

I can see the benefit. OTOH, there is a downside. As @janaiyengar has pointed out, a server with a zero-length CID cannot let the Connection ID change.

@kazuho (Member, Author) commented May 4, 2018

FWIW, in #989 (comment) I have proposed encrypting certain bits (including the KEY_PHASE bit) of the first octet in-place. The pros / cons of the approach are discussed in the comment there.

@mikkelfj (Contributor) commented May 4, 2018

Skipping packet numbers to induce key rotation seems like (Danish proverb) crossing the creek to fetch water, since PNE just gained a linear number space.

@mikkelfj (Contributor) commented May 4, 2018

@marten-seemann

Is there any limit on how many key updates can safely be done? If not, another option would be to lower the threshold to e.g. 2^16 (this would correspond to roughly 80MB of data transferred). This would ensure that a significant fraction of QUIC connections would do a key update.

Yes, sort of. If there is a 1/N chance of a successful attack on a key, and you have M keys, the chance improves to M/N. As I understand it, this ties more to AES than to GCM, and since attacking AES is much harder than attacking GCM this is not an immediate concern, but too many keys is not a good idea.

See §1.3, "Security Degradation for Multiple Keys", in http://www.isg.rhul.ac.uk/~kp/TLS-AEbounds.pdf

@kazuho (Member, Author) commented May 4, 2018

In #1322 (comment):

I still prefer coupling connection ID change to key migration. It has the benefit of consolidating mechanisms for migration and key update, which means that the mechanism is more likely to see use.

I can see the benefit. OTOH, there is a downside. As @janaiyengar has pointed out, a server with a zero-length CID cannot let the Connection ID change.

@martinthomson I think I missed one important aspect in my comment here.

The issue is that allowing key updates only on CID change would mean that an endpoint (both the client and the server) cannot update the key if the peer advertises a zero-length CID.

I'd assume that most clients will be advertising zero-length CIDs. That would mean that for most connections a server would not be able to trigger a key update, unless we allow key updates to happen without migrating the connection.

@mikkelfj (Contributor) commented May 4, 2018

Unless the key phase is a degenerate CID. If so, CIDs are not zero-length; they are at least one bit.

@martinthomson (Member)

I think that all of this is too complicated. Here's a simple solution for this issue:

NEW_CONNECTION_ID can carry an extra hard-to-guess bit that masks the KEY_PHASE bit so that it can't be used to correlate flows.

We could also recommend that endpoints that don't use connection IDs randomly do a key update when they migrate to a new path, to break linkability. That creates uncertainty about which half of the population they are correlated with.
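A hedged sketch of that masking idea (the mask bit carried in NEW_CONNECTION_ID is a hypothetical extension of the frame, not a v1 field):

```python
import secrets

class CidState:
    """Per-CID state; the mask bit would travel with NEW_CONNECTION_ID."""

    def __init__(self) -> None:
        self.mask_bit = secrets.randbits(1)  # agreed when the CID is issued

    def wire_key_phase(self, key_phase: int) -> int:
        # The on-wire bit is KEY_PHASE XOR mask, so its absolute value no
        # longer identifies the phase across different CIDs.
        return key_phase ^ self.mask_bit

cid = CidState()
# The mask hides the value of the phase, but a phase change still flips
# the wire bit, regardless of the mask:
assert cid.wire_key_phase(0) ^ cid.wire_key_phase(1) == 1
```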

@kazuho (Member, Author) commented May 7, 2018

NEW_CONNECTION_ID can carry an extra hard-to-guess bit that masks the KEY_PHASE bit so that it can't be used to correlate flows.

I am not sure that is a good solution. While it is true that we can scramble the initial value, I do not think that it prevents the leak, because an observer can use the fact that the bit is flipped for multiple CIDs at the same time as a correlator.

Considering that, I think that we need to encrypt the information in some way. So far, we have seen three proposals:

  • a. tie key phase to CID
  • b. tie key phase to PN (which is now encrypted)
  • c. encrypt key phase using PN key

As discussed, a does not work well when the peer advertises a zero-length CID (which would be the typical behavior for clients). I personally prefer b because it does not waste a bit on the wire; however, I can understand people looking for a more flexible way of flipping the bit, which is c.

@martinthomson (Member)

Oh, I see, you are also trying to solve the problem for multipath at the same time. I agree that what I suggested doesn't work for that.

I think that your b and c are shades of the same design. Both require that the PN protection key is fixed (or maybe dependent on connection ID). I'm OK with c, but the reduced ability to force an update in b is still a concern. Why not just encrypt phase along with the packet number?

@kazuho (Member, Author) commented May 7, 2018

Oh, I see, you are also trying to solve the problem for multipath at the same time. I agree that what I suggested doesn't work for that.

Yes. That's correct. And we have a restricted form of multipath in v1 (i.e. path probing).

I think that your b and c are shades of the same design. Both require that the PN protection key is fixed (or maybe dependent on connection ID). I'm OK with c, but the reduced ability to force an update in b is still a concern. Why not just encrypt phase along with the packet number?

I am fine with c. The trade-offs between b and c (i.e. spending one bit vs. ease of updating the key) seem minor to me.

@britram (Contributor) commented May 7, 2018

Aside from the architectural correctness of this proposal -- the path really doesn't need to see KEY_PHASE, so we shouldn't expose it -- this discussion does seem to take as given that protecting the KEY_PHASE bit will have actual benefits in preventing linkability, which seems to me to be a very dubious assertion.

Very simple timing analysis in most cases will be so much more useful in achieving CID linkability than any of the edge cases we're optimizing away now that I'm afraid any additional benefit here is lost in the underflow -- I've filed details on this on ops-drafts as quicwg/ops-drafts#31

Again, this seems like a perfectly reasonable thing to do for the sake of cleanliness and correctness, but I wouldn't want us to do this with any hope that it is actually useful against linkability.

@martinthomson (Member)

Fixed in #2006.

@mnot added the has-consensus label Mar 5, 2019

8 participants