Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Additional triggers and info for dropped packets #25

Closed
rmarx opened this issue Oct 9, 2019 · 8 comments · Fixed by #278
Closed

Additional triggers and info for dropped packets #25

rmarx opened this issue Oct 9, 2019 · 8 comments · Fixed by #278
Labels
future-versions issue will be tackled but not for the current iteration

Comments

@rmarx
Copy link
Contributor

rmarx commented Oct 9, 2019

The current text lists about 8 reasons for dropping packets.
Microsoft's implementation lists 60+ individual reasons (via @nibanks)

Some of those:

LogDrop("Retry sent to server");
LogDrop("Already received server response");
LogDrop("No room for ODCID");
LogDrop("Invalid ODCID");
LogDrop("InitialToken alloc failed");
LogDrop("OrigCID alloc failed");
LogDrop("Max deferred datagram count reached");
LogDrop("Key no longer accepted");
LogDrop("SH packet during version negotiation");
LogDrop("Too short for HP");
LogDrop("Packet number too big");
LogDrop("Payload length less than encryption tag");
LogDrop("Generate new packet keys");
LogDrop("Decryption failure");
LogDrop("Invalid SH Reserved bits values");
LogDrop("Invalid LH Reserved bits values");
LogDrop("Duplicate packet number");
LogDrop("Key no longer accepted (batch)");
LogDrop("Failed to compute HP mask");
LogDrop("Different remote address");
LogDrop("Too small for Packet->Invariant");
LogDrop("LH no room for DestCID");
LogDrop("Zero length DestCID");
LogDrop("LH no room for SourceCID");
LogDrop("SH no room for DestCID");
LogDrop("DestCID don't match");
LogDrop("SourceCID don't match");
LogDrop("Greater than allowed max CID length");
LogDropWithValue("Invalid client/server packet type", Packet->LH->Type);
LogDrop("Invalid LH FixedBit bits values");
LogDrop("Long header has invalid token length");
LogDropWithValue("Long header has token length larger than buffer length", TokenLengthVarInt);
LogDrop("Long header has invalid payload length");
LogDropWithValue("Long header has length larger than buffer length", LengthVarInt);
LogDropWithValue("Long Header doesn't have enough room for packet number",
LogDrop("Invalid SH FixedBit bits values");
LogDrop("Non-initial packet not matched with a Connection");
LogDrop("Retry received after initial");

I feel that we don't need to list things in this level of detail in the qlog spec (the "trigger" field allows any text anyway). However, maybe some guidance text on this would be helpful and maybe a few more suggested triggers would be interesting.

@rmarx rmarx added the future-versions issue will be tackled but not for the current iteration label Sep 8, 2020
@LPardue
Copy link
Member

LPardue commented Sep 7, 2022

Wonder how many reasons Microsoft has for dropping in packets in 2022. Any input @nibanks? What are your top 20 used reasons?

I think we need to be judiscious with minimizing scope in the qlog document. Unless the triggers are common and self explanatory, we'll just risk confounding people with options or tying ourselves in knots with text to explain them.

Our escape valve is that this is free form text and probably not commonly used for anything special. Being frugal in the base spec is an advantage.

@marten-seemann
Copy link
Collaborator

fwiw, this is what quic-go has: https://github.com/lucas-clemente/quic-go/blob/62b82789c0cc8a9acf48ee516ce2dd303253f395/logging/types.go#L36-L59

It seems like many of @nibanks' reasons are just variations of a generic "header parse error", and could be logged as details to that event.

@nibanks
Copy link
Member

nibanks commented Sep 8, 2022

In my experience, if you're debugging an incident, you need to know the exact line of code, the exact reason why a drop happened, because you need to figure out how to fix it right then. If you only log enough to narrow the scope down to a dozen different code places, that doesn't help.

I don't care if there is one field for all the info, or two, with one of them being the "detailed_info" or something, but all the details need to be captured to make it useful.

@LPardue
Copy link
Member

LPardue commented Sep 8, 2022

I agree deployments would want the ability to log to that level.

Let's look at HTTP status codes for example. Proxies can fail requests with high level 4xx or 5xx status codes and then use the proxy-status header to include additional data https://datatracker.ietf.org/doc/html/rfc9209; namely the error parameter. HTTP also has problem details https://www.ietf.org/archive/id/draft-ietf-httpapi-rfc7807bis-04.html.

This is effectively a 3 layer model of coarseness - limited top level, broader mid level, very specific low layer. Each of these layers is also well structured, which makes interoperability easier - I.e. grepping out specific cases for alerting or investigation.

To map to qlog we could have a limited set of top-level triggers, an optional error type, and a field for detailed info. The last field is likely to be text but in theory you could indicate the format of that text for interop. E.g., if I want to encode JSON in event and singal that to consumers explicitly, using a media type could help. But is suspect text is probably enough for a lot of folks.

@lnicco
Copy link
Collaborator

lnicco commented Sep 8, 2022

just adding a datapoint in favor for being as specific as possible about the reason a packet got dropped on the floor
in MVFST we have about 40 different reasons and as @nibanks says they are useful to identify the specific line of code that drops the packet
https://github.com/facebookincubator/mvfst/blob/main/quic/state/QuicTransportStatsCallback.h#L30-L73

That being said encoding this information with a generic "Packet Dropped" event and an optional free-form reason/detailed_info field seems like the best course of action to me since it's not up to qlog specs to define which errors are useful and which ones are less useful.

@rmarx
Copy link
Contributor Author

rmarx commented Sep 28, 2022

Looking at this, I would agree keeping the list in qlog itself quite short and providing a generic field where implementations can add any type of freeform info they desire.

For this, we need 2 things:

  1. Decide WHICH of these current triggers on packet_dropped we want to keep (or do we want to rename/exchange some maybe?). Probably best coming from someone with a large deployment (@LPardue @nibanks @lnicco):
        "key_unavailable" /
        "unknown_connection_id" /
        "header_parse_error" /
        "payload_decrypt_error" /
        "protocol_violation" /
        "dos_prevention" /
        "unsupported_version" /
        "unexpected_packet" /
        "unexpected_source_connection_id" /
        "unexpected_version" /
        "duplicate" /
        "invalid_initial"
  1. Add a details field to the event. I have it as a JSON object instead of a text but I don't mind either way.
TransportPacketDropped = {
    ; primarily packet_type should be filled here,
    ; as other fields might not be parseable
    ? header: PacketHeader

    ? raw: RawInfo
    ? datagram_id: uint32

    ? details: {* text => any}

    ; list of leftover triggers here:
    ; ...
}

I'd put the "error type" in the details field as well then @LPardue, feeling that the trigger can act as an "error category" indicator.

@nibanks
Copy link
Member

nibanks commented Sep 28, 2022

Looking at our (currently 79) drop reasons in MsQuic, I think the following top-level categories would be reasonable:

  • Internal Error (not initialized, OOM, etc.)
  • Rejected (internal limits reached, ddos protection, unwilling to track more paths, duplicate)
  • Unsupported (unknown/unsupported version)
  • Invalid Packet (packet parsing/validation failed)
  • Connection Unknown (CID doesn't match)
  • Decryption Failure (decrypt failed or key unavailable)

@LPardue
Copy link
Member

LPardue commented Sep 28, 2022

I like Nicks categories. All I might add is a "General" category for anything else, where details can be added or not at operator discretion

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
future-versions issue will be tackled but not for the current iteration
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants