-
Notifications
You must be signed in to change notification settings - Fork 205
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prune transport error codes #467
Conversation
baa43bf
to
640d6d2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need a bit more time to go through the errors here. I'll get back by tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should replace the finer-grained errors that are replaced with PROTOCOL_VIOLATION_ERROR or FRAME_FORMAT_ERROR. Having details is very useful to a peer that can monitor these errors and debug quickly. Otherwise, knowing there was a frame that wasn't formatted correctly is almost useless information.
I'll come up with a list of things that I think we should retain, and I have an idea for how to simplify the description. I'll give it a shot this evening.
@@ -2522,13 +2520,12 @@ An endpoint MUST NOT send data on a stream at or beyond the final offset. | |||
|
|||
Once a final offset for a stream is known, it cannot change. If a RST_STREAM or | |||
STREAM frame causes the final offset to change for a stream, an endpoint SHOULD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't part of your PR, but SHOULD --> MUST
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's fix/discuss that separately.
draft-ietf-quic-transport.md
Outdated
@@ -2565,6 +2562,10 @@ risks a peer missing the first such packet. The only mechanism available to an | |||
endpoint that continues to receive data for a terminated connection is to send a | |||
Public Reset packet. | |||
|
|||
An endpoint that receives an invalid error code in a CONNECTION_CLOSE frame MUST | |||
NOT signal the existence of the error to its peer. It MAY treat the error as an | |||
INTERNAL_ERROR on the basis that there is some fault with the peer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I understand what this para is trying to say... do you mean "existence of the error" or "absence of the error"? What's this trying to achieve?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this needs work. It's trying to say that you can't respond to a CONNECTION_CLOSE with a PROTOCOL_VIOLATION. Give me another shot at this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok, that makes sense.
draft-ietf-quic-transport.md
Outdated
@@ -2575,7 +2576,7 @@ affected stream. | |||
|
|||
Stream 1 is critical to the functioning of the entire connection. If stream 1 | |||
is closed with either a RST_STREAM or STREAM frame bearing the FIN flag, an | |||
endpoint MUST generate a connection error of type QUIC_CLOSED_CRITICAL_STREAM. | |||
endpoint MUST generate a connection error of type PROTOCOL_VIOLATION. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove this error? It's strictly more information, and is surely useful for debugging on the side that's likely messing up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment below.
draft-ietf-quic-transport.md
Outdated
|
||
QUIC_NETWORK_IDLE_TIMEOUT (0x80000019): | ||
: The connection timed out due to no network activity. | ||
ENHANCE_YOUR_CALM (0x80000002): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where's this intended to be used? It's not clear to me what exactly the endpoint wants the peer to reduce generation of -- Data? Acks? Window updates? -- and what the peer is expected to do when this error code is received. Is this a stream or connection error? It would be an odd time to say reduce generation of something if used as a connection error... the connection is now closed. Does this error code mean that the peer SHOULD NOT reconnect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, this does not seem helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the analogous code in h2. There isn't a whole lot of text in QUIC for this, but you can find the sorts of applications we use this for in h2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's already in H2, should it be in the http mapping? I can't find a good link for H2's usage, the best I found was it's informal use by twitter: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
And why is it not error code 420?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that it is part of the transport in h2, not the application. If you disagree with that assessment, let me know.
It was 420, for a while, but then we decided that a contiguous error space was more valuable. However, it's not a status code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I said anything to warrant a rude response. Perhaps you didn't mean it to be rude, but it sure sounds it.
I know of the error code from HTTP/2, and I looked it up before my first comment on this PR. RFC 7540 is not helpful in answering the questions I'm asking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I apologize, I just think that me trying to restate its purpose wouldn't really help because it's got a lot of context attached to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
link?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still confusing, particularly on a transport level.
And if we're going to add it, we're going to need to restate it's purpose, we can't just assume readers know what it means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From RFC 7540:
ENHANCE_YOUR_CALM (0xb): The endpoint detected that its peer is
exhibiting a behavior that might be generating excessive load.
draft-ietf-quic-transport.md
Outdated
: QUIC timed out after too many RTOs. | ||
: An endpoint received a frame that was badly formatted. For instance, an empty | ||
STREAM frame that omitted the FIN flag, or an ACK frame that has more | ||
acknowledgment ranges than the remainder of the packet could carry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are very different errors in terms of what a peer would do with this information. I don't think collapsing these two examples is a good idea, since the peer can actually do something concrete with this information -- separating STREAM_FRAME_FORMAT_ERROR from ACK_FRAME_FORMAT_ERROR is a useful for monitoring and debugging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Gah, this is already rotten :( I think that you might have misunderstood the approach that I've taken. The point is to identify classes of errors that can have operational practices associated with them. Finer grained errors are certainly useful during the development of a protocol, but I don't know how someone would deploy an operational practice to distinguish between a cases where it sent a badly formatted STREAM frame vs. a badly formatted ACK frame. Those errors are going to need someone to sit down with traces and work out what is going on. I'm not suggesting that an implementation couldn't generate different error codes for local consumption. That's great for debugging. But we're talking protocol signals here. And when we use a different code point in a protocol, we have to have something in those signals that can be acted upon differently. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the removals seem good, but I think you went too far in removing some error codes.
(connection termination is described in {{termination}}.) | ||
reaches 2^64 - 1, the sender MUST close the connection without sending a | ||
CONNECTION_CLOSE frame or any further packets; the sender MAY send a Public | ||
Reset packet in response to further packets that it receives. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it use the last packet number to send a connection close?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, that is a possibility, but think about what it will take to reach this number. We could put lots of effort into dealing with this case, but you have to ask yourself if that's worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, fair enough. I'm happy handling that edge case in whatever way is easiest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree. I also realized that this max packet number is known to the receiver, and the receiver knows that the sender is near the max, so clearly it knows that the sender would close the connection.
draft-ietf-quic-transport.md
Outdated
|
||
QUIC_NETWORK_IDLE_TIMEOUT (0x80000019): | ||
: The connection timed out due to no network activity. | ||
ENHANCE_YOUR_CALM (0x80000002): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, this does not seem helpful.
draft-ietf-quic-transport.md
Outdated
: QUIC timed out after too many RTOs. | ||
: An endpoint received a frame that was badly formatted. For instance, an empty | ||
STREAM frame that omitted the FIN flag, or an ACK frame that has more | ||
acknowledgment ranges than the remainder of the packet could carry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Generating different error codes for local consumption is not a direction I'd like to go in if I can avoid it. We have a huge error code space available, so as long as it's clear when one should use an error code, I'm not sure why we wouldn't want a relatively complete set? |
It is very helpful operationally to have these codes available, since a server, for instance, can monitor the different error codes that are received. A spike in FRAME_ERROR for a STREAM frame is much easier to diagnose than a general FRAME_ERROR. As to why this may be useful later and not just during development -- any time you touch code, you can inadvertently introduce bugs. A server that does not want to monitor does not have to, but having the error codes makes it possible for interested parties to do so. |
c0a6acf
to
d3dec57
Compare
@@ -2858,173 +2858,74 @@ CONNECTION_CLOSE or RST_STREAM frame. Error codes share a common code space. | |||
Some error codes apply only to either streams or the entire connection and have | |||
no defined semantics in the other context. | |||
|
|||
QUIC_INTERNAL_ERROR (0x80000001): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you'll want this or some equivalent.
QUIC_INVALID_STREAM_DATA (0x8000002E): | ||
: STREAM frame data is malformed. | ||
|
||
QUIC_UNENCRYPTED_STREAM_DATA (0x8000003D): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We definitely want this.
QUIC_INVALID_CONNECTION_CLOSE_DATA (0x80000007): | ||
: CONNECTION_CLOSE frame data is malformed. | ||
|
||
QUIC_INVALID_GOAWAY_DATA (0x80000008): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do want an error code per frame type. It really helps narrow down issues.
draft-ietf-quic-transport.md
Outdated
|
||
QUIC_NETWORK_IDLE_TIMEOUT (0x80000019): | ||
: The connection timed out due to no network activity. | ||
ENHANCE_YOUR_CALM (0x80000002): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still confusing, particularly on a transport level.
And if we're going to add it, we're going to need to restate it's purpose, we can't just assume readers know what it means.
Discussed in Paris; we're going to add the frame-specific codes in addition to the generic one. We'll remove ENHANCE_YOUR_CALM from this one. |
@@ -1213,7 +1213,7 @@ server MUST check that it would have sent a version negotiation packet if it had | |||
received a packet with the indicated initial_version. If a server would have | |||
accepted the version included in the initial_version and the value differs from | |||
the value of negotiated_version, the server MUST terminate the connection with a | |||
QUIC_VERSION_NEGOTIATION_MISMATCH error. | |||
VERSION_NEGOTIATION_ERROR error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am having some trouble understanding how to send this error #597 - but it might be resolved with updated PUBLIC_RESET to replace CONNECTION_CLOSE.
This is just the error codes, not references to those codes.
One interesting change here is that we don't have an error code for running out of packet numbers. If you think about this for even a few seconds, you will realize that you can't send that error code because it can't go in a packet. And that's even assuming that you manage to keep a connection alive long enough to send that many packets. At 1 million packets a second, that's almost 600 thousand years. Even if we assume that you jump 2^31 packets every time you switch between networks, and change once a minute, that's 16 thousand years. I didn't remove the parenthetical mention of QUIC_NO_ERROR because that is being removed in other pull requests.
d3dec57
to
67ccacb
Compare
@janaiyengar, I've rebased and updated this with what we agreed at the interim. There is now a generic frame format error, and a set of frame-type-specific errors. I removed ENHANCE_YOUR_CALM. I went over the error codes that you added back in #521 and I'm going to suggest that we separate those out. Some depend on the conclusion of #608, and I think that the others aren't useful to signal. I'm sure that I'm wrong about at least one, but we should discuss them separately. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for rewriting this change. LGTM.
This prunes the transport error codes to the short list I shared on the mailing list.
@janaiyengar this rots quickly, I want to close on before we next chat.
Closes #96, #177, #184, #211.