
Revert Spurious Congestion Event #4836

Closed
nibanks opened this issue Mar 12, 2021 · 10 comments
Labels
-bis An issue that is best addressed in a future revision of a published document -recovery proposal-ready An issue which has a proposal that is believed to be ready for a consensus call.

Comments

@nibanks (Member) commented Mar 12, 2021

This is not necessarily an "issue" but more of a "feature recommendation" and I'm writing it up after a Gather discussion with Jana. This could possibly be integrated into v2, though I don't know if there is any official process for that yet.

Problem space:

Currently, large enough packet reordering triggers the congestion event logic, possibly resulting in a significant degradation of performance. When the ACK eventually comes in to indicate to loss recovery that the "loss" was spurious, the damage has already been done to congestion control.

Proposed solution:

I propose to add a new interface to CC: "Spurious Congestion Event", which is simply triggered by loss recovery if/when all lost packets since the last congestion event end up being acknowledged. When this happens, CC should be able to "undo" the effect of the previous congestion event.

This should be quite trivial for existing QUIC implementations, since the spec already recommends tracking "lost" packets in order to detect spurious loss. Assuming that logic exists, determining when all of those packets have been acknowledged is a small additional step. On the CC side, the changes generally amount to saving a "backup" of the state whenever there is a congestion event, and then reverting to that state if the event is found to be spurious.
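As a rough sketch of the CC-side change (illustrative C; the type and field names are assumptions for this example, not taken from msquic or any other implementation), the idea is to snapshot the state at each congestion event and restore it when the new "spurious congestion event" signal fires:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical congestion-control state; field names are illustrative. */
typedef struct {
    uint64_t congestion_window;
    uint64_t slow_start_threshold;
} CC_STATE;

typedef struct {
    CC_STATE current;
    CC_STATE prior;    /* snapshot taken at the last congestion event */
    int has_prior;
} CC_CONTEXT;

/* On a congestion event: save a backup, then react as usual
 * (a generic multiplicative decrease is shown here). */
void cc_on_congestion_event(CC_CONTEXT *cc) {
    cc->prior = cc->current;
    cc->has_prior = 1;
    cc->current.slow_start_threshold = cc->current.congestion_window / 2;
    cc->current.congestion_window = cc->current.slow_start_threshold;
}

/* Invoked by loss recovery if/when every packet declared lost since
 * the last congestion event has since been acknowledged. */
void cc_on_spurious_congestion_event(CC_CONTEXT *cc) {
    if (cc->has_prior) {
        cc->current = cc->prior;   /* undo the reduction */
        cc->has_prior = 0;
    }
}
```

Only a single backup is needed, since the signal is defined relative to the most recent congestion event.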

We've implemented this in msquic and it effectively eliminates the effect of large reordering on performance (as measured in our simulated WAN perf testing). I believe picoquic also has similar logic and saw similar benefit (fyi @huitema to confirm).

@larseggert larseggert added this to Triage in Late Stage Processing via automation Mar 12, 2021
@LPardue (Member) commented Mar 12, 2021

With my chair hat on, this sounds like a substantial change to the recovery draft and we are too far along to consider it in v1. I also think that trying to triage issues for extensions or v2 should not be done on this repo.

That said, it seems like something with merit that could easily be done as an extension. There is a suite of other I-Ds that can improve recovery without defining a congestion control algorithm, so this could fall into the scope of the future charter.

My advice would be to write a proposal in I-D format and circulate it on the mailing list.

@nibanks (Member, Author) commented Mar 12, 2021

I am not proposing this as a change to v1. I also don't think this qualifies as an extension. There is no negotiation required here. It's purely a sender-side implementation detail (like much of loss recovery / congestion control). I also disagree that it's a substantial change (67 lines of C code in msquic). A whole doc/draft just for this seems like overkill. Waiting for v2 would be my preference (which may be sooner rather than later if we truly want to exercise VN soon).

@LPardue (Member) commented Mar 12, 2021

with chair hat off

also disagree that it's a substantial change (67 lines of C code to msquic).

My comment was about a substantial change to the document; I don't care about code :)

I also don't think this qualifies as an extension. There is no negotiation required here.

Thanks for the clarification. I guess your proposal sounds like an extension to the interface, hence my suggestion that it is an extension, even if it is not part of the wire image and is not negotiated. Either way it seems like something that would be in scope for the WG's new charter. I honestly don't know at this time what a v2 would look like. I might suspect that we would rev the transport and recovery docs independently.

It sounds useful, why couple yourself to an unknown v2 process timeline? I-Ds are cheap and even a document that has 1 page of prose can give people an anchor for implementation, deployment and discussion.

@huitema (Contributor) commented Mar 12, 2021

Picoquic has implemented this since 2019, and I can confirm that (a) this is useful and (b) this is a unilateral decision by the implementation. I don't think we need to change the recovery draft, because that draft is only meant to specify minimal rules for developers who don't know any better. Those who know better read the literature and react accordingly.

I documented the interaction between Cubic-based congestion control and spurious retransmission in a blog published in November 2019. The handling has since been added to the Cubic "bis" draft; see section 4.9. So we can say that this is now documented.

There is a missing piece for Cubic implementations. The Cubic draft does not explain how QUIC implementations actually detect spurious retransmission. The solution implemented in picoquic is rather simple: if a packet is marked lost, remember information about that packet for a while -- maybe 3*RTT. If an ACK frame acknowledging that packet number is received during that interval, declare the retransmission spurious. It would probably be nice to add something like that to a revision of the recovery specification, but I don't think developers will spend too much time figuring this one out by themselves. So I assume that we can wait for the bis edition of the recovery RFC.
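The detection approach described above might look something like this (illustrative C; the ring buffer, names, and sizes are assumptions for this sketch, not picoquic's actual code):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bookkeeping: remember packets declared lost, and if one
 * of them is acknowledged within a retention window (~3*RTT), report
 * the loss as spurious. Times are in microseconds. */
#define LOST_RING_SIZE 64

typedef struct {
    uint64_t packet_number;
    uint64_t declared_lost_time;
    int in_use;
} LOST_RECORD;

typedef struct {
    LOST_RECORD ring[LOST_RING_SIZE];
    unsigned next;
} LOST_HISTORY;

/* Record a packet at the moment loss recovery declares it lost. */
void remember_lost(LOST_HISTORY *h, uint64_t pn, uint64_t now) {
    h->ring[h->next].packet_number = pn;
    h->ring[h->next].declared_lost_time = now;
    h->ring[h->next].in_use = 1;
    h->next = (h->next + 1) % LOST_RING_SIZE;
}

/* Call for each newly acknowledged packet number; returns 1 if the
 * ACK proves a previously declared loss was spurious. */
int check_spurious(LOST_HISTORY *h, uint64_t acked_pn,
                   uint64_t now, uint64_t rtt) {
    for (unsigned i = 0; i < LOST_RING_SIZE; i++) {
        LOST_RECORD *r = &h->ring[i];
        if (r->in_use && r->packet_number == acked_pn &&
            now - r->declared_lost_time <= 3 * rtt) {
            r->in_use = 0;
            return 1;
        }
    }
    return 0;
}
```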

By the way, the November 2019 blog mentioned three problems. Two of those are addressed in the revised Cubic draft: specifying the units of the constants that Cubic uses, and specifying the handling of spurious retransmission. The third problem is not so much with Cubic as with Hystart. Hystart uses RTT measurements to detect queue build-up and exit slow start before starting to lose packets, which is generally great. However, on some links, the RTT measurements are affected by a lot of jitter. This can cause an early exit, which is not great. The solution implemented in picoquic is a kind of low-pass filter applied before using RTT measurements in Hystart -- I use the minimum of the last N measurements, with currently N=6. I also use the time stamp extension when available to isolate variations on the data path from those on the return path. I don't know if there is appetite in the WG to further document this problem.
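The min-of-last-N filter could be sketched as follows (illustrative C with N = 6 as mentioned above; structure and function names are invented for this example, and picoquic's actual code differs):

```c
#include <assert.h>
#include <stdint.h>

/* Low-pass filter for Hystart: feed the algorithm the minimum of the
 * last N RTT samples so a one-off jitter spike does not trigger a
 * premature slow-start exit. */
#define RTT_FILTER_DEPTH 6

typedef struct {
    uint64_t samples[RTT_FILTER_DEPTH];
    unsigned next;    /* ring-buffer write position */
    unsigned count;   /* number of valid samples so far */
} RTT_FILTER;

/* Add a sample and return the filtered (minimum-of-last-N) RTT. */
uint64_t rtt_filter_update(RTT_FILTER *f, uint64_t rtt_sample) {
    f->samples[f->next] = rtt_sample;
    f->next = (f->next + 1) % RTT_FILTER_DEPTH;
    if (f->count < RTT_FILTER_DEPTH) {
        f->count++;
    }
    uint64_t min = f->samples[0];
    for (unsigned i = 1; i < f->count; i++) {
        if (f->samples[i] < min) {
            min = f->samples[i];
        }
    }
    return min;
}
```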

@LPardue (Member) commented Mar 12, 2021

Thanks for the additional insight, Christian. Is the Hystart stuff something that could go into the Hystart++ draft, or is it something specific to QUIC?

@huitema (Contributor) commented Mar 12, 2021

It should go in the Hystart++ draft. Is there one?

Updated: I did find the hystart draft

@junhochoi (Contributor) commented Mar 13, 2021 via email

@ianswett ianswett added parked An issue that we can't immediately address; for future discussion. and removed parked An issue that we can't immediately address; for future discussion. labels Mar 14, 2021
@ianswett (Contributor) commented

I'd be supportive of an update to the recovery doc to incorporate detection of spurious retransmissions and any other changes which make sense (scope TBD).

I agree with @nibanks that this isn't well suited to an I-D, given it's unilateral.

I was going to put parked on this, indicating it might warrant discussion in v2, but I realize I really want the quicv2 label.

@LPardue LPardue added the -bis An issue that is best addressed in a future revision of a published document label Mar 23, 2021
@LPardue (Member) commented Mar 23, 2021

Marking as bis since it seems folks are happy to consider this in some future revision of the drafts.

@LPardue LPardue added the proposal-ready An issue which has a proposal that is believed to be ready for a consensus call. label Mar 23, 2021
@project-bot project-bot bot moved this from Triage to Consensus Emerging in Late Stage Processing Mar 23, 2021
@LPardue LPardue moved this from Consensus Emerging to Issue Handled in Late Stage Processing Jun 14, 2021
@LPardue (Member) commented Jun 14, 2021

The RFC shipped, closing this. We'll revisit the issue as part of any bis effort.

@LPardue LPardue closed this as completed Jun 14, 2021