Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prune transport error codes #467

Merged
merged 4 commits into from
Jun 27, 2017
Merged

Prune transport error codes #467

merged 4 commits into from
Jun 27, 2017

Conversation

martinthomson
Copy link
Member

@martinthomson martinthomson commented Apr 27, 2017

This prunes the transport error codes to the short list I shared on the mailing list.

@janaiyengar this rots quickly, I want to close on before we next chat.

Closes #96, #177, #184, #211.

@martinthomson martinthomson added design An issue that affects the design of the protocol; resolution requires consensus. -transport labels Apr 27, 2017
Copy link
Contributor

@janaiyengar janaiyengar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need a bit more time to go through the errors here. I'll get back by tomorrow.

Copy link
Contributor

@janaiyengar janaiyengar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should replace the finer-grained errors that are replaced with PROTOCOL_VIOLATION_ERROR or FRAME_FORMAT_ERROR. Having details is very useful to a peer that can monitor these errors and debug quickly. Otherwise, knowing there was a frame that wasn't formatted correctly is almost useless information.

I'll come up with a list of things that I think we should retain, and I have an idea for how to simplify the description. I'll give it a shot this evening.

@@ -2522,13 +2520,12 @@ An endpoint MUST NOT send data on a stream at or beyond the final offset.

Once a final offset for a stream is known, it cannot change. If a RST_STREAM or
STREAM frame causes the final offset to change for a stream, an endpoint SHOULD
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't part of your PR, but SHOULD --> MUST

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's fix/discuss that separately.

@@ -2565,6 +2562,10 @@ risks a peer missing the first such packet. The only mechanism available to an
endpoint that continues to receive data for a terminated connection is to send a
Public Reset packet.

An endpoint that receives an invalid error code in a CONNECTION_CLOSE frame MUST
NOT signal the existence of the error to its peer. It MAY treat the error as an
INTERNAL_ERROR on the basis that there is some fault with the peer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand what this para is trying to say... do you mean "existence of the error" or "absence of the error"? What's this trying to achieve?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this needs work. It's trying to say that you can't respond to a CONNECTION_CLOSE with a PROTOCOL_VIOLATION. Give me another shot at this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok, that makes sense.

@@ -2575,7 +2576,7 @@ affected stream.

Stream 1 is critical to the functioning of the entire connection. If stream 1
is closed with either a RST_STREAM or STREAM frame bearing the FIN flag, an
endpoint MUST generate a connection error of type QUIC_CLOSED_CRITICAL_STREAM.
endpoint MUST generate a connection error of type PROTOCOL_VIOLATION.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove this error? It's strictly more information, and is surely useful for debugging on the side that's likely messing up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment below.


QUIC_NETWORK_IDLE_TIMEOUT (0x80000019):
: The connection timed out due to no network activity.
ENHANCE_YOUR_CALM (0x80000002):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where's this intended to be used? It's not clear to me what exactly the endpoint wants the peer to reduce generation of -- Data? Acks? Window updates? -- and what the peer is expected to do when this error code is received. Is this a stream or connection error? It would be an odd time to say reduce generation of something if used as a connection error... the connection is now closed. Does this error code mean that the peer SHOULD NOT reconnect?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this does not seem helpful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the analogous code in h2. There isn't a whole lot of text in QUIC for this, but you can find the sorts of applications we use this for in h2.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's already in H2, should it be in the http mapping? I can't find a good link for H2's usage, the best I found was it's informal use by twitter: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

And why is it not error code 420?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that it is part of the transport in h2, not the application. If you disagree with that assessment, let me know.

It was 420, for a while, but then we decided that a contiguous error space was more valuable. However, it's not a status code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I said anything to warrant a rude response. Perhaps you didn't mean it to be rude, but it sure sounds it.

I know of the error code from HTTP/2, and I looked it up before my first comment on this PR. RFC 7540 is not helpful in answering the questions I'm asking.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I apologize, I just think that me trying to restate its purpose wouldn't really help because it's got a lot of context attached to it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

link?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still confusing, particularly on a transport level.

And if we're going to add it, we're going to need to restate it's purpose, we can't just assume readers know what it means.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From RFC 7540:

ENHANCE_YOUR_CALM (0xb):  The endpoint detected that its peer is
exhibiting a behavior that might be generating excessive load.

: QUIC timed out after too many RTOs.
: An endpoint received a frame that was badly formatted. For instance, an empty
STREAM frame that omitted the FIN flag, or an ACK frame that has more
acknowledgment ranges than the remainder of the packet could carry.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are very different errors in terms of what a peer would do with this information. I don't think collapsing these two examples is a good idea, since the peer can actually do something concrete with this information -- separating STREAM_FRAME_FORMAT_ERROR from ACK_FRAME_FORMAT_ERROR is a useful for monitoring and debugging.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@martinthomson
Copy link
Member Author

Gah, this is already rotten :(

I think that you might have misunderstood the approach that I've taken. The point is to identify classes of errors that can have operational practices associated with them. Finer grained errors are certainly useful during the development of a protocol, but I don't know how someone would deploy an operational practice to distinguish between a cases where it sent a badly formatted STREAM frame vs. a badly formatted ACK frame. Those errors are going to need someone to sit down with traces and work out what is going on.

I'm not suggesting that an implementation couldn't generate different error codes for local consumption. That's great for debugging. But we're talking protocol signals here. And when we use a different code point in a protocol, we have to have something in those signals that can be acted upon differently.

Copy link
Contributor

@ianswett ianswett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the removals seem good, but I think you went too far in removing some error codes.

(connection termination is described in {{termination}}.)
reaches 2^64 - 1, the sender MUST close the connection without sending a
CONNECTION_CLOSE frame or any further packets; the sender MAY send a Public
Reset packet in response to further packets that it receives.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it use the last packet number to send a connection close?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that is a possibility, but think about what it will take to reach this number. We could put lots of effort into dealing with this case, but you have to ask yourself if that's worth it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, fair enough. I'm happy handling that edge case in whatever way is easiest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree. I also realized that this max packet number is known to the receiver, and the receiver knows that the sender is near the max, so clearly it knows that the sender would close the connection.


QUIC_NETWORK_IDLE_TIMEOUT (0x80000019):
: The connection timed out due to no network activity.
ENHANCE_YOUR_CALM (0x80000002):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this does not seem helpful.

: QUIC timed out after too many RTOs.
: An endpoint received a frame that was badly formatted. For instance, an empty
STREAM frame that omitted the FIN flag, or an ACK frame that has more
acknowledgment ranges than the remainder of the packet could carry.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@ianswett
Copy link
Contributor

Generating different error codes for local consumption is not a direction I'd like to go in if I can avoid it.

We have a huge error code space available, so as long as it's clear when one should use an error code, I'm not sure why we wouldn't want a relatively complete set?

@janaiyengar
Copy link
Contributor

It is very helpful operationally to have these codes available, since a server, for instance, can monitor the different error codes that are received. A spike in FRAME_ERROR for a STREAM frame is much easier to diagnose than a general FRAME_ERROR. As to why this may be useful later and not just during development -- any time you touch code, you can inadvertently introduce bugs. A server that does not want to monitor does not have to, but having the error codes makes it possible for interested parties to do so.

@@ -2858,173 +2858,74 @@ CONNECTION_CLOSE or RST_STREAM frame. Error codes share a common code space.
Some error codes apply only to either streams or the entire connection and have
no defined semantics in the other context.

QUIC_INTERNAL_ERROR (0x80000001):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you'll want this or some equivalent.

QUIC_INVALID_STREAM_DATA (0x8000002E):
: STREAM frame data is malformed.

QUIC_UNENCRYPTED_STREAM_DATA (0x8000003D):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely want this.

QUIC_INVALID_CONNECTION_CLOSE_DATA (0x80000007):
: CONNECTION_CLOSE frame data is malformed.

QUIC_INVALID_GOAWAY_DATA (0x80000008):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do want an error code per frame type. It really helps narrow down issues.


QUIC_NETWORK_IDLE_TIMEOUT (0x80000019):
: The connection timed out due to no network activity.
ENHANCE_YOUR_CALM (0x80000002):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still confusing, particularly on a transport level.

And if we're going to add it, we're going to need to restate it's purpose, we can't just assume readers know what it means.

@martinthomson
Copy link
Member Author

Discussed in Paris; we're going to add the frame-specific codes in addition to the generic one.

We'll remove ENHANCE_YOUR_CALM from this one.

@@ -1213,7 +1213,7 @@ server MUST check that it would have sent a version negotiation packet if it had
received a packet with the indicated initial_version. If a server would have
accepted the version included in the initial_version and the value differs from
the value of negotiated_version, the server MUST terminate the connection with a
QUIC_VERSION_NEGOTIATION_MISMATCH error.
VERSION_NEGOTIATION_ERROR error.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am having some trouble understanding how to send this error #597 - but it might be resolved with updated PUBLIC_RESET to replace CONNECTION_CLOSE.

This is just the error codes, not references to those codes.
One interesting change here is that we don't have an error code for running out
of packet numbers.  If you think about this for even a few seconds, you will
realize that you can't send that error code because it can't go in a packet.

And that's even assuming that you manage to keep a connection alive long enough
to send that many packets.  At 1 million packets a second, that's almost 600
thousand years.  Even if we assume that you jump 2^31 packets every time you
switch between networks, and change once a minute, that's 16 thousand years.

I didn't remove the parenthetical mention of QUIC_NO_ERROR because that is
being removed in other pull requests.
@martinthomson
Copy link
Member Author

@janaiyengar, I've rebased and updated this with what we agreed at the interim. There is now a generic frame format error, and a set of frame-type-specific errors. I removed ENHANCE_YOUR_CALM.

I went over the error codes that you added back in #521 and I'm going to suggest that we separate those out. Some depend on the conclusion of #608, and I think that the others aren't useful to signal. I'm sure that I'm wrong about at least one, but we should discuss them separately.

Copy link
Contributor

@ianswett ianswett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for rewriting this change. LGTM.

@janaiyengar
Copy link
Contributor

I'm not sure which ones from #521 are not added in here. I wouldn't remove ones that we need discussion on... If you want to discuss ones that are in #521, please add them back in and have a separate issue on removing them.

@martinthomson martinthomson mentioned this pull request Jun 22, 2017
@martinthomson martinthomson merged commit 1b2446f into master Jun 27, 2017
@MikeBishop MikeBishop deleted the transport_errors branch September 23, 2017 00:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
-transport design An issue that affects the design of the protocol; resolution requires consensus.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants