Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Which DCID do Handshake retransmissions use? #3348

Closed
marten-seemann opened this issue Jan 15, 2020 · 28 comments · Fixed by #3438
Closed

Which DCID do Handshake retransmissions use? #3348

marten-seemann opened this issue Jan 15, 2020 · 28 comments · Fixed by #3438
Assignees
Labels
-transport design has-consensus

Comments

@marten-seemann
Copy link
Contributor

@marten-seemann marten-seemann commented Jan 15, 2020

We're currently not specifying which DCID is supposed to be used on retransmissions of Handshake packets.

A client might receive a NEW_CONNECTION_ID frame in the first couple of 1-RTT packets it receives, before receiving either a HANDSHAKE_DONE or an acknowledgement for a 1-RTT packet is sent itself. Therefore, it's still running loss recovery on the Handshake packets it sent out earlier.

If there's an acknowledgement outstanding for one of those Handshake packets, the client will have to retransmit that packet. Which DCID is it supposed to use on the retransmission? Is it the DCID that was used on the original packet? Or is it the CID provided in the NEW_CONNECTION_ID frame?
Note that the NEW_CONNECTION_ID frame might already have requested the retiring of the CID used during the Handshake.

@kazuho
Copy link
Member

@kazuho kazuho commented Jan 15, 2020

As I can see, there are three options:

  • a) let the receiver of NCID to send Handshake packets using the new CIDs
  • b) require (or recommend) receiver of NCID to continue using the original DCID until the handshake is confirmed
  • c) require (or recommend) the sender to withhold the emission NCID (with a non-zero RTP) to until it sees an ACK for handshake confirmation

I prefer (c) (or maybe (b)), because (a) has corner cases even though it might seem easy.

Consider the case where a server sends TP.preferred_address then as the first packet, sends the NCID frame with RTP set to 2. If we adopt (a), this would mean that the client would have only one CID that it can use (the one being embedded in the NCID frame) even though it needs two (one for the original path, and another for the path specified by TP.preferred_address).

As you can see, the design of the NCID has been based on the assumption that providing one new CID alongside a retirement request guarantees that the receiver would have enough active CIDs. But in this particular case, that is not the case.

Based on this observation, and based on my assumption that a non-zero RTP would be sent by only some of the endpoints (the necessity of CID retirement is questionable for HTTP/3, the scope of QUIC v1, as the HTTP connections are typcially short-lived and can most likely be gracefully terminated to induce a new connection when retirement of CIDs is necessary), my preference goes to (c): an endpoint that uses RTP should cover the cost.

@janaiyengar
Copy link
Contributor

@janaiyengar janaiyengar commented Jan 15, 2020

@kazuho: (I think you mean RPT, not RTP) Thank you for the analysis. I prefer (c) as well, except that I do not want to condition it on non-zero RPT, to keep things simpler. Otherwise we're still left with what to do at the receiver. Ensuring that the handshake confirmation is acked means that there are no HS packets pending.

@janaiyengar
Copy link
Contributor

@janaiyengar janaiyengar commented Jan 15, 2020

I'll also note that this issue was found through a resilience test we've built for the interop runner -- high loss handshake test. We're excited to see how everyone's handshake machinery performs under 50% loss :-)

@tatsuhiro-t
Copy link
Contributor

@tatsuhiro-t tatsuhiro-t commented Jan 15, 2020

At the time of this writing, ngtcp2 does (b).

@mnot mnot added this to Triage in Late Stage Processing Jan 15, 2020
@nibanks
Copy link
Member

@nibanks nibanks commented Jan 15, 2020

@kazuho for you problem with (a) isn't that just a general problem with having multiple paths and then getting a NEW_CONNECTION_ID with RPT for all previous CIDs? You end up in a scenario where you only have 1 CID (the new one) and multiple paths that just had all their CIDs retired. If you can get away with using that one CID for the active path, you can still make forward progress. If not, the connection dies.

@ianswett
Copy link
Contributor

@ianswett ianswett commented Jan 15, 2020

I believe we specify (b) already:
"A client MUST only change the value it sends in the Destination Connection ID in response to the first packet of each type it receives from the server (Retry or Initial); a server MUST set its value based on the Initial packet. Any additional changes are not permitted; if subsequent packets of those types include a different Source Connection ID, they MUST be discarded. This avoids problems that might arise from stateless processing of multiple Initial packets producing different connection IDs."
https://github.com/quicwg/base-drafts/blob/master/draft-ietf-quic-transport.md#negotiating-connection-ids-negotiating-connection-ids

@nibanks
Copy link
Member

@nibanks nibanks commented Jan 15, 2020

@ianswett that is just indicating when a client should change DCID in response to receiving a Initial/Retry packet with a new CID. I don't think anything restricts either side from sending NEW_CONNECTION_ID frames in their first 1-RTT packets and their peers immediately using them for any packets (including handshake) they send.

Personally, I don't have a problem with (a). I think that no matter what receivers will have to handle the multiple paths but single CID scenario, so I would prefer not to special case the handshake. I also don't think we have handshake confirmation problems because an endpoint should only send a NEW_CONNECTION_ID out when it's able to accept it. Then, if the peer can decrypt the packet with the new CID, I see no reason why it shouldn't be allowed to use it immediately.

@ianswett
Copy link
Contributor

@ianswett ianswett commented Jan 15, 2020

@nibanks my reading of the text I quoted is that you cannot use any DCIDs in long header packets that weren't either the original client DCID/SCID or from a server Initial or Retry.

@marten-seemann
Copy link
Contributor Author

@marten-seemann marten-seemann commented Jan 15, 2020

  • a) let the receiver of NCID to send Handshake packets using the new CIDs
  • b) require (or recommend) receiver of NCID to continue using the original DCID until the handshake is confirmed
  • c) require (or recommend) the sender to withhold the emission NCID (with a non-zero RTP) to until it sees an ACK for handshake confirmation

@kazuho There's a fourth option:

  • d) require that the sender withholds the NCID frame until it has confirmed the handshake.

As soon as an endpoint confirms the handshake, it drops the Handshake keys (I'm assuming #3145 here). As a result, the endpoint obviously doesn't care about Handshake packets any longer (and especially doesn't care which DCID was used to send the packet).
Therefore, there's no problem if the receiver of the NCID frame immediately retires the connection ID used during the handshake, even if the the NCID and the HANDSHAKE_DONE frame were reordered.

@janaiyengar
Copy link
Contributor

@janaiyengar janaiyengar commented Jan 15, 2020

@ianswett: The part about what to do with DCID changes is underspecified. From the part of the spec you quote -- "Any additional changes are not permitted; if subsequent packets of those types include a different Source Connection ID, they MUST be discarded." -- it's not clear what these additional changes are limited to. I think the intent was for all long-form packets, but that's certainly not clear here.

In any case, the problem with that approach is that the client will have to carefully construct HS packets with the old DCID while sending everything else with the new DCID.

@janaiyengar
Copy link
Contributor

@janaiyengar janaiyengar commented Jan 15, 2020

@marten-seemann : (d) works too, but an endpoint still has to decide which DCID to use for sending HS packets in the case that the HANDSHAKE_DONE frame is lost. I'd rather not leave that unspecified.

@MikeBishop
Copy link
Contributor

@MikeBishop MikeBishop commented Jan 15, 2020

@ianswett: The part about what to do with DCID changes is underspecified. From the part of the spec you quote -- "Any additional changes are not permitted; if subsequent packets of those types include a different Source Connection ID, they MUST be discarded." -- it's not clear what these additional changes are limited to. I think the intent was for all long-form packets, but that's certainly not clear here.

"of those types" => Retry or Initial, since that's what it was talking about. That's trying to avoid the case where the server statelessly generates a new Retry (or Initial, in the rare case that would be stateless) with a second SCID upon receiving a retransmission or duplication of the client's Initial. But it doesn't say anything about the server issuing new CIDs via an NCID frame in 1-RTT before the handshake is confirmed.

I think (a) or (b) would be fine; the server is clearly intending to accept the other CID if it issued it, but could conceivably use separate logic for long-header versus short-header routing.

@martinthomson
Copy link
Member

@martinthomson martinthomson commented Jan 16, 2020

Can someone explain why using the new connection ID is a problem?

@kazuho
Copy link
Member

@kazuho kazuho commented Jan 16, 2020

@martinthomson I think that that is a good way at looking into the issue. I agree that the endpoints can start using the new CIDs during handshake, if the peer sends NCID frames.

The only problem that I can think of is related to retirement (see #3348 (comment)). When the server provides TP.preferred_address, and also decides to retire the CIDs immediately after sending all the handshake transcript, there's going to be a very small time window in which it has to send two CIDs (one for the current path and another for the path using the preferred address). Unless the QUIC stack allows bundling of two NCID frames when building a QUIC packet, the client might end up not having sufficient amount of CIDs, in which case it might terminate the connection.

It is my view that this is very different from the ordinary case of retiring CIDs, in which case there would be enough time window to guarantee that a handful number of new generation CIDs are delivered before the older ones are delivered (cc @nibanks).

That said, I now tend to think that the most we might have to do is the following two:

  • clarify that CIDs received using NCID frames can be used for Handshake packets
  • have a cautionary text stating that a server needs to make sure that the client has enough number of new generation CIDs when it retires an older generation of CIDs

@marten-seemann
Copy link
Contributor Author

@marten-seemann marten-seemann commented Jan 16, 2020

  • have a cautionary text stating that a server needs to make sure that the client has enough number of new generation CIDs when it retires an older generation of CIDs

@kazuho Wouldn't we be able to solve this by using my proposed option d?

@martinthomson
Copy link
Member

@martinthomson martinthomson commented Jan 16, 2020

Artificial constraints on when the frame can be sent would be terrible to enforce and would introduce a performance cost for using NEW_CONNECTION_ID. In a great many cases, the 0.5RTT from the server is idle, which makes it a great way to get things like NEW_CONNECTION_ID sent without taking capacity from real work. Asking the server to defer sending would mean that NEW_CONNECTION_ID could compete with HTTP responses and the like.

I see no reason not to allow use of connection IDs when they are available. They apply to the connection as a whole. You will note that we allow changing them with the Initial/Retry mini-protocol in ways that don't correlate with other connection-level events, so this isn't any different in that regard. Forcing the connection ID state to synchronize with other state changes seems more likely to complicate things than help.

The caution @kazuho mentions is always true. Whenever an endpoint is forced to retire connection IDs, a short supply is always a risk to the connection.

@tatsuhiro-t
Copy link
Contributor

@tatsuhiro-t tatsuhiro-t commented Jan 16, 2020

Unless the QUIC stack allows bundling of two NCID frames when building a QUIC packet, the client might end up not having sufficient amount of CIDs, in which case it might terminate the connection.

I think QUIC allows more than 1 NCID frames in a single QUIC packet.

Now my preference is (a). If server utilizes retire-prior-to, it should know the consequences and should provide sufficient backup connection IDs. Otherwise, it would be misconfiguration of server software.

@kazuho
Copy link
Member

@kazuho kazuho commented Jan 16, 2020

@tatsuhiro-t

I think QUIC allows more than 1 NCID frames in a single QUIC packet.

The problem is that the client might end up receiving only some of the NCID frames, unless the server always sends (and retransmits) the set of NCID frames as a whole.

But anyways, I agree that (a) is probably fine.

That is because it is trivial to design a server that never requests CID retirement during the handshake. All you need to do is start issuing the new generation of CIDs at least N seconds before sending NCID frames that asks for the retirement of the previous generation, where N is your handshake timeout.

Knowing that this design is why I initially preferred (c), not realizing that it works equally well with (a). But know that I understand that, I prefer (a) as it is simpler.

@ianswett
Copy link
Contributor

@ianswett ianswett commented Jan 16, 2020

Given a server has to sent NCID to have the CID change, I'm fine with (a), but I thought it was not what we'd previously agreed to and specified, so I believed (b) was the status quo.

@DavidSchinazi
Copy link
Contributor

@DavidSchinazi DavidSchinazi commented Jan 16, 2020

I don't think this is a serious problem, because if the server doesn't want the long header DCID to change, it can wait before sending NEW_CONNECTION_ID. That said, we do have an issue here in that different readers of the draft had different interpretations of what was allowed. All that to say I don't much care if this is allowed or prohibited, but we should at least make an editorial change to clarify things.

@mnot mnot added the design label Jan 17, 2020
@mnot mnot moved this from Triage to Design Issues in Late Stage Processing Jan 17, 2020
@igorlord
Copy link
Contributor

@igorlord igorlord commented Feb 2, 2020

I agree with @DavidSchinazi here. Any alternative is ok, but the draft should say whether servers can expect that the client might switch DCID and whether the clients can expect the server to send NCID w/ RPT>0 before the handshake is complete.

@huitema
Copy link
Contributor

@huitema huitema commented Feb 5, 2020

Let's keep it simple: no migration before Handshake Done, period.
That would include prohibit NCID in 0-RTT
Also state that implementations MAY drop any packet received with NCID different from original before Handshake Done.

@kazuho
Copy link
Member

@kazuho kazuho commented Feb 5, 2020

@huitema The problem is not that simple.

Consider the case where the server receives ClientFinished and sends two 1-RTT packets, the first packet containing HANDSHAKE_DONE, and the second packet containing an NCID with RPT set to 1.

If the first 1-RTT packet gets lost, the client receives a request to retire the connection ID before it confirms the handshake.

@kazuho
Copy link
Member

@kazuho kazuho commented Feb 5, 2020

I think I'd prefer a design that meets the two requirements stated below:

  • A server should not be required to withhold NCID frames until it receives an ACK for HANDSHAKE_DONE.
  • A client should be allowed to use the original CID until the handshake is confirmed.

The intention is to minimize additional complexity to existing (and future) QUIC stacks.

One solution that meets these requirements is to state that a NCID frame confirms the handshake, exactly the same way as HANDSHAKE_DONE frame does. That would be a trivial change to any of the QUIC stacks, and there would be no ambiguity regarding the state changes.

@larseggert
Copy link
Member

@larseggert larseggert commented Feb 5, 2020

Discussed in ZRH. Proposed resolution is to clarify that issuing a CID means being prepared to seeing it used at any time. Additional Editorial issue on packetization issues.

@ianswett
Copy link
Contributor

@ianswett ianswett commented Feb 5, 2020

Editorial PR on packetization challenges with retransmitted frames is #3424.

erickinnear pushed a commit to ekinnear/quic-base-drafts that referenced this issue Feb 6, 2020
Late Stage Processing automation moved this from Design Issues to Text Incorporated Feb 11, 2020
martinthomson added a commit that referenced this issue Feb 11, 2020
#3348 Clarify that you can use any issued CID at any time, even during the handshake
@martinthomson martinthomson reopened this Feb 11, 2020
Late Stage Processing automation moved this from Text Incorporated to Triage Feb 11, 2020
@martinthomson martinthomson added the proposal-ready label Feb 11, 2020
@project-bot project-bot bot moved this from Triage to Consensus Emerging in Late Stage Processing Feb 11, 2020
@martinthomson
Copy link
Member

@martinthomson martinthomson commented Feb 11, 2020

Ugh, editorial changes for a design issue. Reopened for consensus calling.

@LPardue LPardue moved this from Consensus Emerging to Consensus Call issued in Late Stage Processing Feb 19, 2020
@LPardue LPardue removed the proposal-ready label Feb 19, 2020
@LPardue LPardue added the call-issued label Feb 26, 2020
@LPardue LPardue added has-consensus and removed call-issued labels Mar 4, 2020
@project-bot project-bot bot moved this from Consensus Call issued to Consensus Declared in Late Stage Processing Mar 4, 2020
@janaiyengar
Copy link
Contributor

@janaiyengar janaiyengar commented Mar 9, 2020

This was fixed by #3438. #3349 is related to this issue, but that simply clarifies things. To keep things sane, let's just close this when #3349 is merged.

Late Stage Processing automation moved this from Consensus Declared to Text Incorporated Mar 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
-transport design has-consensus
Projects
Late Stage Processing
  
Issue Handled