Public Troubleshooting Flags #279

Open
martinduke opened this Issue Feb 9, 2017 · 18 comments

Comments

7 participants
@martinduke
Contributor

martinduke commented Feb 9, 2017

Something that came up in Tokyo was the need for service providers (i.e. middleboxes) to troubleshoot performance issues in the network. Packet Number Echo (#269) could partially address this problem, particularly for RTT measurement, but it is imperfect for measuring other problems.

As someone who has spent quite a bit of time debugging TCP packet traces from service providers, I can think of two other things that would greatly help answer the question "why is my QUIC slow?" without significantly compromising privacy. Doing both would take two bits in the public flags, which may be available depending on how some other issues turn out.

(1) QUIC endpoints MAY set the BLOCKED bit if they include a BLOCKED frame in the packet. If you want to get fancy, you could have separate bits for "stream blocked" and "connection blocked", but flow control issues are a common cause of traffic pauses with no other discernible symptoms.
(2) QUIC endpoints MUST (SHOULD?) set the RETRANS flag in any packet that contains retransmitted data. This would help operators identify that a connection is slowing down because it is responding to loss, and would allow for very reliable loss statistics for the downstream network.
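
To make the proposal concrete, here is a minimal sketch of how a sender might set the two bits while assembling a packet. The flag positions and names (flagBlocked, flagRetrans) are invented for illustration; neither bit exists in any draft, and the actual positions would depend on how the other public-flag issues turn out.

```go
package main

// Hypothetical masks for the two proposed troubleshooting bits. Neither bit
// exists in any draft; the positions below are placeholders.
const (
	flagBlocked uint8 = 0x10 // packet carries a BLOCKED frame
	flagRetrans uint8 = 0x20 // packet carries retransmitted data
)

// setTroubleshootingFlags folds the two signals into the public flags byte.
// hasBlocked and hasRetrans would be computed while the packet's frames are
// being assembled.
func setTroubleshootingFlags(flags uint8, hasBlocked, hasRetrans bool) uint8 {
	if hasBlocked {
		flags |= flagBlocked // (1) MAY: a BLOCKED frame is in this packet
	}
	if hasRetrans {
		flags |= flagRetrans // (2) MUST/SHOULD: packet carries retransmitted data
	}
	return flags
}

func main() {
	flags := setTroubleshootingFlags(0x00, true, false)
	_ = flags // 0x10: an observer sees "flow-control blocked" without decrypting
}
```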

@ianswett

Contributor

ianswett commented Feb 9, 2017

The blocked bit is an interesting idea.

The retrans bit has a problem: a QUIC connection may not retransmit the same thing, or may not retransmit anything at all. However, if we make it a LOSS_DETECTION bit and say it means "I detected a packet loss", I believe it conveys the exact signal you're hoping for, and it would typically end up being the retransmit bit anyway.

<Warning, bad idea ahead> If we were feeling clever, we could abuse the packet echo bit for the loss detection case by designating half the packet number space as valid for echo and the other half as a delta to the lost packet number. The only reason this would be useful is to let us create time-sequence diagrams without decrypting the payload.

The worst part about abusing the packet echo bit is that consuming extra bytes to indicate a packet number means we can fit less data into each packet, and it may cause a retransmission to be split into two packets. I suspect that may be enough hassle that implementations wouldn't bother.
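
To make the number-space split concrete, here is a rough sketch assuming, purely for illustration, a one-byte echo field whose top bit selects between "echo" and "delta back to a lost packet". Nothing here (field width, split point, function names) is in any draft.

```go
package main

import "fmt"

// One-byte echo field, split down the middle: values 0x00-0x7F echo the low
// seven bits of a received packet number; values 0x80-0xFF carry a delta back
// to a packet the sender has declared lost. Entirely hypothetical encoding.

func encodeEcho(receivedPN uint64) byte { return byte(receivedPN & 0x7F) }

// encodeLoss reports false if the lost packet is too far back for the field.
func encodeLoss(currentPN, lostPN uint64) (byte, bool) {
	delta := currentPN - lostPN
	if delta == 0 || delta > 0x7F {
		return 0, false
	}
	return 0x80 | byte(delta), true
}

// decode is what a passive observer building a time-sequence diagram would do.
func decode(field byte, currentPN uint64) string {
	if field&0x80 == 0 {
		return fmt.Sprintf("echo of packet number ending in %#x", field)
	}
	return fmt.Sprintf("packet %d was declared lost", currentPN-uint64(field&0x7F))
}

func main() {
	if f, ok := encodeLoss(1000, 993); ok {
		fmt.Println(decode(f, 1000)) // "packet 993 was declared lost"
	}
	fmt.Println(decode(encodeEcho(42), 1000))
}
```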

@martinduke

Contributor

martinduke commented Feb 9, 2017

LOSS_DETECTION sounds good, though we'd have to set the bit for each lost packet in a burst loss, or we have the same problem of incomplete data. So in effect, this is the same as the "RETRANS" bit, except that we have to find a packet to put it on when the lost data isn't retransmitted.

The "bad idea" sounds complicated, but I'm open to discussing it.

@mirjak

Contributor

mirjak commented Feb 9, 2017

Yes, I agree, the blocked bit is interesting. I hadn't thought about a blocked bit yet, but I'm also not sure I fully understand the use case (see below).

All in all, my first reaction, in line with Ian's comment about naming the retrans/loss bit, is that this exposes the wrong semantics. Rather than 'just' exposing whatever we do in the transport, we should consider what information the network actually wants/needs.

In the case of loss/retrans, this is probably in line with what ConEx is/was trying to do. We called this whole-path congestion, and it was based only on ECN/CE (not loss) because loss was already exposed in TCP by monitoring retransmissions. I've also been thinking about exposing the ECN counters and potentially a loss counter, but I wasn't sure the overhead is worth the win. So the signal the network actually needs is, I think, 'I saw congestion, and that's the reason I'm slowing down (congestion control)'; I guess this could be covered by a single bit for loss and ECN/CE together (if the packet rate is high enough to signal multiple losses in a timely manner).

Regarding the blocked bit, I think this is related. My understanding of the use case is that the network sees a flow slowing down (sending less data than before, or than possible) and wants to distinguish whether the flow is limited by the endpoint (flow control) or something is wrong in the network.
However, there are actually (at least?) three ways a connection can be limited:

  1. limited by the application itself: there is simply no more data to send
  2. limited by flow control in the endpoint's transport
  3. limited by a bottleneck in the network (and correspondingly the congestion window)

I guess what you'd like to expose is the difference between 3 and 1/2, right? Or would it also be helpful to expose the difference between 1 and 2, e.g. for Wireshark-like debugging at the endpoint? Then I guess we would need two bits...
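
To make the three cases concrete, here is a rough classification a sender could run at send time. The names, fields, and the order of the checks are illustrative assumptions, not from any draft:

```go
package troubleshoot

// sendLimit labels the three ways listed above that a connection can be limited.
type sendLimit int

const (
	notLimited  sendLimit = iota
	appLimited            // 1. no more data to send
	flowLimited           // 2. receiver's flow-control window exhausted
	cwndLimited           // 3. congestion window / network bottleneck
)

// classify inspects hypothetical sender state; a blocked bit would expose
// flowLimited, while distinguishing appLimited would need a second bit.
func classify(bytesQueued, flowWindow, cwnd, bytesInFlight int64) sendLimit {
	switch {
	case bytesQueued == 0:
		return appLimited
	case flowWindow <= 0:
		return flowLimited
	case bytesInFlight >= cwnd:
		return cwndLimited
	default:
		return notLimited
	}
}
```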

@ianswett

Contributor

ianswett commented Feb 9, 2017

I think the blocked bit is less useful than I originally thought. Connections are in one of two states:

  1. Not blocked
  2. Blocked, but you'd need access to the encrypted context to understand why.
  • e.g.: a stream is blocked because of a slow server backend?

Unlike TCP's, QUIC's flow control is truly end-to-end, so I think it needs to be debugged end to end. Also, the path has no control over flow control in QUIC, so I don't think it provides actionable data for a network operator.

@martinduke

Contributor

martinduke commented Feb 9, 2017

Also, the path has no control over flow control in QUIC, so I don't think it provides actionable data for a network operator.

I agree that there is no direct action that the operator can take, but it is a very clear signal that there is no action to take. It is useful as a "stop worrying" bit.

@martinduke

Contributor

martinduke commented Feb 9, 2017

We called this whole-path-congestion and it was only based on ECN/CE (not loss) because loss was already exposed in TCP when monitoring retransmissions

I think there is a case for putting the ECE/CWR bits in the public flags, rather than the ACK frame, and that discussion is (sort of) happening in the ECN issue, #68.

However, there are actually (at least?) three cases how a connection can be limited:

  1. limited by the application itself: there is simply no more data to send
  2. limited by the flow control in the endpoint's transport
  3. limited by a bottleneck in the network (and respectively congestion window)

#1 is hard to detect, even in TCP, and tends not to bring customer complaints. So I think it's really down to 2 vs. 3, and the BLOCKED bit would distinguish them.

@martinthomson

Member

martinthomson commented Mar 9, 2017

The flow control one might run afoul of attacks like HEIST which benefit from being able to observe flow control.

@ianswett

Contributor

ianswett commented Mar 9, 2017

Re 2: flow control issues are easy to monitor end-to-end, and in QUIC, flow control is not modifiable by middleboxes, so QUIC removes responsibility for flow control issues from network operators entirely. As a person who's debugged my share of flow control issues, I'm trying to imagine a case where you or a network operator would have to debug flow control issues with QUIC, and I can't come up with one.

Re 3: that's congestion control's domain, not flow control's (which uses BLOCKED), and it definitely feels like the sender's responsibility. In most applications, the sender is congestion-control-limited a large portion of the time, so I wouldn't expect this to be a useful signal, as most connections would see it.

I'm not excited about putting the ECE/CWR bits in the unencrypted portion of the packet, as I can't imagine a use case.

@martinduke

Contributor

martinduke commented Mar 9, 2017

I agree that there is no action a middlebox can take on flow control. As I said above, I think this is a "stop worrying" bit when service providers are trying to diagnose performance issues, which is logistically hard to do at endpoints.

ECE/CWR in the public flags would be sort of like the LOSSDETECT bit, if we wanted to separate those signals. But I don't think it would hurt much to encrypt them, as long as LOSSDETECT is there.

@ianswett

Contributor

ianswett commented Mar 9, 2017

I understand the "stop worrying" bit, but I'm hoping flow control is rarely the limiting factor for QUIC. TCP has suffered from a variety of flow control problems, but QUIC doesn't inherit any of those issues.

That's not to say all implementers of QUIC will implement flow control correctly, but it's more likely that website A's upload performance will be awful than that it will be a widespread problem. As such, it's incentive compatible for both the client and server to implement flow control well, because otherwise performance will suffer.

For example, I believe Chrome's flow control window is 8MB, which is vastly larger than the TCP defaults.

@martinthomson

Member

martinthomson commented Mar 9, 2017

To add to what @ianswett says, Firefox uses a different strategy for h2. While you might see flow control limits being hit on individual streams (push in particular), that doesn't mean that there is a problem.

@martinduke

Contributor

martinduke commented Mar 9, 2017

Obviously, if BLOCKED frames are exceedingly rare, then it's not worth the real estate in the public header. But then (different issue) perhaps the BLOCKED frame doesn't make sense either.

But if they're not exceedingly rare...

As such, it's incentive compatible for both the client and server to implement flow control well, because otherwise performance will suffer.

This is true of TCP, too. HOL blocking problems are much diminished in QUIC, but not quite eliminated in the case of large streams.

@martinthomson

Member

martinthomson commented Mar 9, 2017

@martinduke, Google also proposed BLOCKED for h2. You can see for yourself the results of that.

@ianswett

Contributor

ianswett commented Mar 9, 2017

BLOCKED frames are rare, but they tend to be isolated to a small number of services, which are either slow at consuming data or serve clients with much higher than average bandwidth. As such, they serve as an excellent monitoring and debugging tool if you understand the full context.

If this were available in QUIC, what use cases would you expect it to help with?

@martinduke

Contributor

martinduke commented Mar 9, 2017

BLOCKED frames are rare, but they tend to be isolated to a small number of services, which are either slow at consuming data or serve clients with much higher than average bandwidth. As such, they serve as an excellent monitoring and debugging tool if you understand the full context.

If this were available in QUIC, what use cases would you expect it to help with?

Are you asking me to defend BLOCKED frames, or the header bit? In the use case where BLOCKED would be helpful, the bit would be helpful for monitoring the performance of provider networks. The BLOCKED frame makes window-limitation more obvious, but if you have the decrypted payload, you can also glean this from WINDOW_UPDATE frames, as the sketch below illustrates.
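
For instance, a passive analyzer with the decrypted payload could reconstruct window-limitation roughly like this. This is illustrative only; the type and field names are invented, and it tracks connection-level offsets gleaned from WINDOW_UPDATE and STREAM frames:

```go
package troubleshoot

// windowWatcher infers window-limitation from a decrypted trace: maxOffset is
// the largest connection-level offset advertised via WINDOW_UPDATE frames,
// sentOffset the largest offset the sender's STREAM data has consumed.
type windowWatcher struct {
	maxOffset  uint64
	sentOffset uint64
}

func (w *windowWatcher) onWindowUpdate(offset uint64) {
	if offset > w.maxOffset {
		w.maxOffset = offset
	}
}

func (w *windowWatcher) onStreamData(endOffset uint64) {
	if endOffset > w.sentOffset {
		w.sentOffset = endOffset
	}
}

// windowLimited is the condition a BLOCKED frame (or the proposed bit) would
// announce explicitly, sparing the observer this bookkeeping.
func (w *windowWatcher) windowLimited() bool {
	return w.sentOffset >= w.maxOffset
}
```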

I can see three possibilities:

  • window-limited connections are a common occurrence in QUIC: in this case, the header bit is essential to distinguish a real network performance problem from client configuration. In some cases, it might lead to a mobile software update that fixes the config.
  • window-limited connections are extremely rare: then it is not worth taking a bit in the header. In the extreme limit, there is no need for flow control at all, because QUIC endpoints have ample memory resources and reduced HOL blocking.
  • window-limited connections are "sort of" rare: then we ought to have flow control and BLOCKED, but it's not worth a valuable public header bit. I think this is where you believe we are.

Obviously, I have no real-world QUIC deployment experience, so I can't state with authority where we are on this spectrum, or whether current experience is relevant to a future with IoT devices, etc. All I can say is that in TCP the flow control information is very valuable to have, and it's not hard to see how certain applications and implementations of QUIC might have the same problems.

@ianswett

Contributor

ianswett commented Mar 9, 2017

Yes, I believe we're in the third case, at least today. I may be able to produce some stats on what percentage of connections see blockage with existing settings, but it's very close to 0 today from server to client. It's low, but not that close to 0, from client to server, where the flow control windows are much smaller.

But yes, it will depend upon the application, so I can't foresee every possible use case.

I'm also a bit concerned about privacy leakage, as Martin pointed out, even though I wouldn't anticipate it being an information-rich channel.

I'm happy to defend BLOCKED frames. They've been amazingly useful in debugging and monitoring specific services and detecting implementation bugs.

@martinduke

Contributor

martinduke commented Mar 9, 2017

@martinthomson I have now taken a look at HEIST, and I won't pretend to fully understand it. But I'm having trouble seeing how a BLOCKED bit would introduce a larger problem in general.

To defeat the cwnd attack, I believe QUIC has to insert PADDING frames prior to the full delivery of the last STREAM data, to make the delivery of the resource consistent (in terms of RTTs) regardless of echoed requests. This is all orthogonal to BLOCKED, of course. Even then, it only works if PADDING counts against cwnd, which I don't believe is established.
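
As a sketch of what that padding might look like (the 16 KB bucket is an arbitrary illustrative choice, and again this assumes PADDING counts against cwnd):

```go
package troubleshoot

// padTail returns how many PADDING bytes to append so the response's total
// size lands on a fixed bucket boundary, making delivery time in RTTs less
// dependent on the true resource size. Purely illustrative.
func padTail(responseBytes int) int {
	const bucket = 16 * 1024
	if rem := responseBytes % bucket; rem != 0 {
		return bucket - rem
	}
	return 0
}
```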

If we can resolve that issue, the attacker still has to have access to the IFCW to cause a problem. The IFCW is in the clear, but I'm not sure it's visible to the JavaScript the attacker is running in the browser. If the attacker can overcome these hurdles, I can see a problem where a blocked bit specifically tied to connection flow control could indicate when the actual payload has exceeded the flow control limit. If we reach that point, I would probably abandon the idea of a blocked bit specifically tied to connection flow control and instead make it deliberately ambiguous between stream and connection flow control. That is probably ambiguous enough to defeat this attack.

Anyway, I'd have to define exactly what the blocked bit is to have a proposal that security folks could properly shoot at. The intent of this issue was to gauge general interest in a PR that would specify one or both concepts. The signal I'm getting is mixed, to say the least.

martinduke added a commit to martinduke/base-drafts that referenced this issue Mar 29, 2017

Troubleshooting flags in public header
To move along quicwg#279, this PR provides a straw-man proposal for both a Loss Detection flag and a Blocked flag. I emphasize that these changes are separable, if one is palatable and the other is not.
@martinduke

Contributor

martinduke commented Mar 29, 2017

See the PR in #418 so that there is a coherent proposal to shoot at.

@mnot mnot added this to Middleboxen in QUIC Apr 28, 2017

@mnot mnot changed the title from Public Flags to Aid Troubleshooting to Public Troubleshooting Flags Jun 21, 2017

@mnot mnot added the parked label Sep 4, 2017
