
Revert Spurious Congestion Event #4836

Closed
nibanks opened this issue Mar 12, 2021 · 10 comments
Labels
-bis An issue that is best addressed in a future revision of a published document -recovery proposal-ready An issue which has a proposal that is believed to be ready for a consensus call.

Comments

@nibanks (Member) commented Mar 12, 2021

This is not necessarily an "issue" but more of a "feature recommendation" and I'm writing it up after a Gather discussion with Jana. This could possibly be integrated into v2, though I don't know if there is any official process for that yet.

Problem space:

Currently, large enough packet reordering triggers the congestion event logic, possibly resulting in a significant degradation of performance. When the ACK eventually comes in to indicate to loss recovery that the "loss" was spurious, the damage has already been done to congestion control.

Proposed solution:

I propose to add a new interface to CC: "Spurious Congestion Event", which is simply triggered by loss recovery if/when all lost packets since the last congestion event end up being acknowledged. When this happens, CC should be able to "undo" the effect of the previous congestion event.

This should be quite trivial for existing QUIC implementations, since the spec already recommends tracking "lost" packets in order to detect spurious loss. Assuming that logic exists, determining when all of those packets have been acknowledged is a small additional step. On the CC side, the changes generally amount to saving a "backup" of the state whenever there is a congestion event, and then reverting to that state if the event is found to be spurious.
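As a rough sketch of the CC-side change (illustrative C; the type and field names are assumptions for this example, not taken from msquic or any other implementation), the idea is to snapshot the state at each congestion event and restore it when the new "spurious congestion event" signal fires:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical congestion-control state; field names are illustrative. */
typedef struct {
    uint64_t congestion_window;
    uint64_t slow_start_threshold;
} CC_STATE;

typedef struct {
    CC_STATE current;
    CC_STATE prior;    /* snapshot taken at the last congestion event */
    int has_prior;
} CC_CONTEXT;

/* On a congestion event: save a backup, then react as usual
 * (a generic multiplicative decrease is shown here). */
void cc_on_congestion_event(CC_CONTEXT *cc) {
    cc->prior = cc->current;
    cc->has_prior = 1;
    cc->current.slow_start_threshold = cc->current.congestion_window / 2;
    cc->current.congestion_window = cc->current.slow_start_threshold;
}

/* Invoked by loss recovery if/when every packet declared lost since
 * the last congestion event has since been acknowledged. */
void cc_on_spurious_congestion_event(CC_CONTEXT *cc) {
    if (cc->has_prior) {
        cc->current = cc->prior;   /* undo the reduction */
        cc->has_prior = 0;
    }
}
```

Only a single backup is needed, since the signal is defined relative to the most recent congestion event.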

We've implemented this in msquic and it effectively eliminates the effect of large reordering on performance (as measured in our simulated WAN perf testing). I believe picoquic also has similar logic and saw similar benefit (fyi @huitema to confirm).

@larseggert larseggert added this to Triage in Late Stage Processing via automation Mar 12, 2021
@LPardue (Member) commented Mar 12, 2021

With my chair hat on, this sounds like a substantial change to the recovery draft and we are too far along to consider it in v1. I also think that trying to triage issues for extensions or v2 should not be done on this repo.

That said, it seems like something with merit that could easily be done as an extension. There is a suite of other I-Ds that can improve recovery without defining a congestion control algorithm, so this could fall into the scope of the future charter.

My advice would be to write a proposal in I-D format and circulate it on the mailing list.

@nibanks (Member, Author) commented Mar 12, 2021

I am not proposing this as a change to v1. I also don't think this qualifies as an extension. There is no negotiation required here. It's purely a sender-side implementation detail (like much of loss recovery / congestion control). I also disagree that it's a substantial change (67 lines of C code in msquic). A whole doc/draft just for this seems like overkill. Waiting for v2 would be my preference (which may be sooner rather than later if we truly want to exercise VN soon).

@LPardue (Member) commented Mar 12, 2021

with chair hat off

also disagree that it's a substantial change (67 lines of C code to msquic).

My comment was about a substantial change to the document; I don't care about code :)

I also don't think this qualifies as an extension. There is no negotiation required here.

Thanks for the clarification. I guess your proposal sounds like an extension to the interface, hence my suggestion that it is an extension, even if it is not part of the wire image and is not negotiated. Either way it seems like something that would be in scope for the WG's new charter. I honestly don't know at this time what a v2 would look like. I might suspect that we would rev the transport and recovery docs independently.

It sounds useful, why couple yourself to an unknown v2 process timeline? I-Ds are cheap and even a document that has 1 page of prose can give people an anchor for implementation, deployment and discussion.

@huitema (Contributor) commented Mar 12, 2021

Picoquic has implemented this since 2019, and I can confirm that (a) this is useful and (b) this is a unilateral decision by the implementation. I don't think we need to change the recovery draft, because that draft is only meant to specify minimal rules for developers who don't know any better. Those who know better read the literature and react accordingly.

I documented the interaction between Cubic-based congestion control and spurious retransmission in a blog published in November 2019. The handling has since been added to the Cubic "bis" draft; see section 4.9. So we can say that this is now documented.

There is a missing piece for Cubic implementations. The Cubic draft does not explain how QUIC implementations actually detect spurious retransmission. The solution implemented in picoquic is rather simple: if a packet is marked lost, remember information about that packet for a while -- maybe 3*RTT. If an ACK frame acknowledging that packet number is received during that interval, declare the retransmission spurious. It would probably be nice to add something like that to a revision of the recovery specification, but I don't think developers will spend too much time figuring this one out by themselves. So I assume that we can wait for the bis edition of the recovery RFC.
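The detection approach described above might look something like this (illustrative C; the ring buffer, names, and sizes are assumptions for this sketch, not picoquic's actual code):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bookkeeping: remember packets declared lost, and if one
 * of them is acknowledged within a retention window (~3*RTT), report
 * the loss as spurious. Times are in microseconds. */
#define LOST_RING_SIZE 64

typedef struct {
    uint64_t packet_number;
    uint64_t declared_lost_time;
    int in_use;
} LOST_RECORD;

typedef struct {
    LOST_RECORD ring[LOST_RING_SIZE];
    unsigned next;
} LOST_HISTORY;

/* Record a packet at the moment loss recovery declares it lost. */
void remember_lost(LOST_HISTORY *h, uint64_t pn, uint64_t now) {
    h->ring[h->next].packet_number = pn;
    h->ring[h->next].declared_lost_time = now;
    h->ring[h->next].in_use = 1;
    h->next = (h->next + 1) % LOST_RING_SIZE;
}

/* Call for each newly acknowledged packet number; returns 1 if the
 * ACK proves a previously declared loss was spurious. */
int check_spurious(LOST_HISTORY *h, uint64_t acked_pn,
                   uint64_t now, uint64_t rtt) {
    for (unsigned i = 0; i < LOST_RING_SIZE; i++) {
        LOST_RECORD *r = &h->ring[i];
        if (r->in_use && r->packet_number == acked_pn &&
            now - r->declared_lost_time <= 3 * rtt) {
            r->in_use = 0;
            return 1;
        }
    }
    return 0;
}
```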

By the way, the November 2019 blog mentioned three problems. Two of those are addressed in the revised Cubic draft: specifying the units of the constants that Cubic uses, and specifying the handling of spurious retransmission. The third problem is not so much with Cubic as with Hystart. Hystart uses RTT measurements to detect queue build-up and exit slow start before starting to lose packets, which is generally great. However, on some links, the RTT measurements are affected by a lot of jitter. This can cause an early exit, which is not great. The solution implemented in picoquic is a kind of low-pass filter applied before using RTT measurements in Hystart -- I use the minimum of the last N measurements, with currently N=6. I also use the time stamp extension when available to isolate variations on the data path from those on the return path. I don't know if there is appetite in the WG to further document this problem.
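The min-of-last-N filter could be sketched as follows (illustrative C with N = 6 as mentioned above; structure and function names are invented for this example, and picoquic's actual code differs):

```c
#include <assert.h>
#include <stdint.h>

/* Low-pass filter for Hystart: feed the algorithm the minimum of the
 * last N RTT samples so a one-off jitter spike does not trigger a
 * premature slow-start exit. */
#define RTT_FILTER_DEPTH 6

typedef struct {
    uint64_t samples[RTT_FILTER_DEPTH];
    unsigned next;    /* ring-buffer write position */
    unsigned count;   /* number of valid samples so far */
} RTT_FILTER;

/* Add a sample and return the filtered (minimum-of-last-N) RTT. */
uint64_t rtt_filter_update(RTT_FILTER *f, uint64_t rtt_sample) {
    f->samples[f->next] = rtt_sample;
    f->next = (f->next + 1) % RTT_FILTER_DEPTH;
    if (f->count < RTT_FILTER_DEPTH) {
        f->count++;
    }
    uint64_t min = f->samples[0];
    for (unsigned i = 1; i < f->count; i++) {
        if (f->samples[i] < min) {
            min = f->samples[i];
        }
    }
    return min;
}
```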

@LPardue (Member) commented Mar 12, 2021

Thanks for the additional insight, Christian. Is the Hystart stuff something that could go into the Hystart++ draft, or is it something specific to QUIC?

@huitema (Contributor) commented Mar 12, 2021

It should go in the Hystart++ draft. Is there one?

Updated: I did find the hystart draft

@junhochoi (Contributor) commented Mar 13, 2021 via email

@ianswett ianswett added parked An issue that we can't immediately address; for future discussion. and removed parked An issue that we can't immediately address; for future discussion. labels Mar 14, 2021
@ianswett (Contributor) commented

I'd be supportive of an update to the recovery doc to incorporate detection of spurious retransmissions and any other changes which make sense (scope TBD).

I agree with @nibanks that this isn't well suited to an I-D, given it's unilateral.

I was going to put parked on this, indicating it might warrant discussion in v2, but I realize I really want the quicv2 label.

@LPardue LPardue added the -bis An issue that is best addressed in a future revision of a published document label Mar 23, 2021
@LPardue (Member) commented Mar 23, 2021

Marking as bis since it seems folks are happy to consider this in some future revision of the drafts.

@LPardue LPardue added the proposal-ready An issue which has a proposal that is believed to be ready for a consensus call. label Mar 23, 2021
@project-bot project-bot bot moved this from Triage to Consensus Emerging in Late Stage Processing Mar 23, 2021
@LPardue LPardue moved this from Consensus Emerging to Issue Handled in Late Stage Processing Jun 14, 2021
@LPardue (Member) commented Jun 14, 2021

The RFC shipped, closing this. We'll revisit the issue as part of any bis effort.

@LPardue LPardue closed this as completed Jun 14, 2021