Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed handshake key deadlock issue #3093

Closed
wants to merge 1 commit into from
Closed

Fixed handshake key deadlock issue #3093

wants to merge 1 commit into from

Conversation

martinduke
Copy link
Contributor

Along with some other PRs, I believe this fixes #2863.

Fundamentally, there are four ways to solve the problem of one endpoint throwing away the handshake keys while the other has no 1-RTT data.

  1. Keep the handshake keys forever
  2. Make the endpoint start a 1-RTT PTO timer whether or not it has data to send
  3. Require each endpoint to send a retransmittable frame soon after the handshake is complete; or
  4. Make the endpoint wait until it receives an ack-eliciting 1-RTT packet, in addition to a 1-RTT ack.

I believe that the discard keys group might have a plurality for #4. The added condition makes sure that the peer's PTO machinery has started and will have little effect on most applications, including HTTP3.

This PR is a concrete statement of the fourth option.

@martinthomson
Copy link
Member

I don't think that this fixes the problem. What we had discussed was requiring that some ack-eliciting 1-RTT data is sent after handshake completion. With that, the existing definition of "confirmed" is adequate.

@martinthomson martinthomson added -tls -transport design An issue that affects the design of the protocol; resolution requires consensus. labels Oct 16, 2019
@marten-seemann
Copy link
Contributor

2863

@kazuho
Copy link
Member

kazuho commented Oct 16, 2019

Note: at the moment, recovery states "When a PTO timer expires, a sender MUST send at least one ack-eliciting packet as a probe, unless there is no data available to send."

We need remove the "unless".

@martinduke
Copy link
Contributor Author

@martinthomson I believe discussion in Cupertino resolved your concern.

I believe @marten-seemann's chart has been addressed by discussion in Cupertino, pending further analysis that shows a different problem.

@martinthomson
Copy link
Member

martinthomson commented Oct 16, 2019

Belay that last comment from me. The friendly amendment here that we need is to the text on discarding Handshake keys that says "no sooner than handshake confirmed" rather than requiring Handshake keys to be discarded at that point.

@martinduke
Copy link
Contributor Author

martinduke commented Oct 16, 2019

@marten-seemann has a fairly silly corner case that breaks this:

  1. the ack of the CFIN is lost; the server sends no 1-RTT packet.
  2. the client acks and sends a PING
  3. the server acks it.
  4. the client sends ACK + PADDING (which is technically legal). This is sufficient for the server to discard state.
  5. The client has not started the 1RTT PTO timer, and we are going to fail the connection.

This is fixable by disallowing ACK+PADDING in response to a non-ack-eliciting packet. At the moment 13.2.1 of quic-transport does not explicitly forbid this. IMO this is an oversight and we should change 13.2.1.

After a brief offline chat with some people, I'll include that change to 13.2.1 to the PR if people agree with this conclusion.

@marten-seemann
Copy link
Contributor

In my opinion, what this corner case shows is that option 4 is pretty brittle. Disallowing ACK +PADDING seems to work to fix this particular corner case, but it doesn't guarantee that there isn't any other corner case that will lead to a deadlock.

@ianswett
Copy link
Contributor

ACK+PADDING packets aren't ack-eliciting, so the server shouldn't discard keys at step 4.

@marten-seemann
Copy link
Contributor

@ianswett The server already received a PING in 1-RTT packets in step 2.

@martinduke
Copy link
Contributor Author

@ianswett it has received an ack-eliciting packet in step 2 and a 1-RTT ack in 4. This is also fixable by changing the condition to "ack of an ack-eliciting packet"

@marten-seemann
Copy link
Contributor

@martinduke The ACK sent in step 4 is an acknowledgement for an ack-eliciting packet.

@martinduke
Copy link
Contributor Author

@marten-seemann No it isn't, it's an ack of an ack.

@martinthomson
Copy link
Member

martinthomson commented Oct 22, 2019

As we concluded at the meeting, we're going to keep Handshake keys indefinitely.

FWIW, I don't believe that it is possible to find a confirmation step that doesn't involve both peers actively driving the state machine. Otherwise, peers might not reach the same state within a reasonable time. This change still has problems (more below) and only adds delay to the confirmed state by making the requirements more complex.

Here's an example of how this fix breaks. I'm sure that there are more:

image

web sequence diagram code
title Bad scenario

Client->Server: Initial
Client->Server: 0-RTT (ack-eliciting)
Server->Client: Initial + Handshake
Server->Client: 1-RTT (ACK for 0-RTT)

Client->Server: Handshake
Server-->Client: ACK (lost)

Client->Server: 1-RTT (ack-eliciting + ACK for 0.5-RTT)
Server->Client: 1-RTT (ACK for 1-RTT)
note right of Server
   The server now believes that 
   the handshake is confirmed
end note

note left of Client
   The client isn't confirmed
end note

note over Client,Server
   Considerable time can now elapse
end note

Client->Server: Handshake (retransmission)
note over Server: drop garbage
Client->Server: Handshake (retransmission)
note over Server: drop garbage ...

@ianswett
Copy link
Contributor

Thanks for the diagram, and I'm happy to stick with the approach of keeping the keys forever.

If we change nothing about the existing text, what you describe occurs, but as I suggested elsewhere, if we are going to delay dropping the Handshake keys(somewhat or indefinitely), I think we should update the text of transport or recovery to be clear that you can stop retransmitting any outstanding data in Handshake once you've received a 1-RTT ACK.

@martinduke
Copy link
Contributor Author

martinduke commented Oct 22, 2019 via email

@martinthomson
Copy link
Member

@martinduke, the diagram is correct. There is no ack-eliciting packet in 0.5-RTT. The ACK of the 0.5-RTT data is bundled with the other ack-eliciting frames in the packet, that's all.

@martinduke
Copy link
Contributor Author

@martinthomson Ah, I see that now. Yeah, I think we need an explicit signal, which I would prefer over keeping the keys forever.

@martinduke martinduke deleted the hold-hs-key-longer branch October 22, 2019 03:36
@mikkelfj
Copy link
Contributor

mikkelfj commented Oct 22, 2019

I don't see the problem with an explicit signal. The handshake is uniquely complex already and it helps isolate that complexity. You could even say that if you cannot drive a handshake to completion within a timeout period, the connection is broken. An explicit signal helps deal with cases where one endpoint otherwise has nothing to send. Now it has.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
-tls -transport design An issue that affects the design of the protocol; resolution requires consensus.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

unrecoverable loss pattern leads to deadlock
6 participants