Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FORWARD-TSN for unreceived chunk results in ABORT #597

Open
boivie opened this issue Jul 15, 2021 · 17 comments
Open

FORWARD-TSN for unreceived chunk results in ABORT #597

boivie opened this issue Jul 15, 2021 · 17 comments

Comments

@boivie
Copy link

boivie commented Jul 15, 2021

This is similar to issue #592, which was found to be a dcSCTP issue. In this case, I don't exactly know why usrsctp sends an ABORT. I have a packetdrill script that reproduces it (I haven't rebuilt the FreeBSD kernel in 21 years, so that was fun - and it doesn't seem to have changed a bit!)

So: The scenario is as follows:

Peer A is sending a lot of data. Some of these packets are lost (representing TSN ranges [284, 297], and some of these packets had a limited retransmission timeout, which expired before they could be retransmitted. So when the chunks were supposed to be resent, a FORWARD-TSN was sent to "skip" some of these chunks, while also bundling some chunks that hasn't yet expired but that were scheduled for retransmission.

The packetdrill script:

--tolerance_usecs=25000
 0.000 `sysctl -w net.inet.sctp.ecn_enable=0`
+0.000 `sysctl -w net.inet.sctp.pr_enable=1`
+0.000 `sysctl -w net.inet.sctp.asconf_enable=0`
+0.000 `sysctl -w net.inet.sctp.auth_enable=0`
+0.000 `sysctl -w net.inet.sctp.reconfig_enable=1`
+0.000 `sysctl -w net.inet.sctp.nrsack_enable=0`
+0.000 `sysctl -w net.inet.sctp.pktdrop_enable=0`
+0.000 socket(..., SOCK_STREAM, IPPROTO_SCTP) = 3
+0.000 bind(3, ..., ...) = 0
+0.000 listen(3, 1) = 0
+0.000 < sctp: INIT[flgs=0, tag=1, a_rwnd=5242880, os=65535, is=65535, tsn=281,
                    FORWARD_TSN_SUPPORTED[],
                    SUPPORTED_EXTENSIONS[types=[RECONFIG, FORWARD_TSN]]]
+0.000 > sctp: INIT_ACK[flgs=0, tag=2, a_rwnd=..., os=..., is=..., tsn=10,
                        FORWARD_TSN_SUPPORTED[],
                        SUPPORTED_EXTENSIONS[types=[FORWARD_TSN, RECONFIG]],
                        STATE_COOKIE[len=..., val=...]]
+0.010 < sctp: COOKIE_ECHO[flgs=0, len=..., val=...]
+0.000 > sctp: COOKIE_ACK[flgs=0]
+0.000 accept(3, ..., ...) = 4
+0.000 close(3) = 0
+0.000 < sctp: DATA[flgs=UBE, len=356, tsn=281, sid=1, ssn=0, ppid=0];
               DATA[flgs=UBE, len=354, tsn=282, sid=1, ssn=0, ppid=0];
               DATA[flgs=UB, len=172, tsn=283, sid=1, ssn=0, ppid=0]

+0.000 read(4, ..., 2000) = 340

+0.000 > sctp: SACK[flgs=0, cum_tsn=283, a_rwnd=..., gaps=[], dups=[]]

// There are four packets lost, containing TSNs [284..297]:
// 284 (UE), 285 (UBE), 286 (UBE), 287 (UB), 288 (UE), 289 (UBE), 
// 290 (UBE), 291 (UB), 292 (UE), 293 (UBE), 294 (UBE), 295 (UB),
// 296 (UE), 297 (UB)

// The next packet that is actually received, contains TSNs 298 and forwards.

+0.040 < sctp: DATA[flgs=UE, len=135, tsn=298, sid=1, ssn=0, ppid=0];
               DATA[flgs=UBE, len=355, tsn=299, sid=1, ssn=0, ppid=0];
               DATA[flgs=UBE, len=353, tsn=300, sid=1, ssn=0, ppid=0];
               DATA[flgs=UB, len=264, tsn=301, sid=1, ssn=0, ppid=0]

+0.000 > sctp: SACK[flgs=0x00, cum_tsn=283, a_rwnd=1861474, gaps=[15:18], dups=[]]

// Some of these chunks that were never received, expire, which results
// in a FORWARD-TSN sent. That chunk is bundled with some DATA chunks that
// represent these being retransmitted. In this simple packetdrill, it's not
// clear why these would be retransmitted, but the script is made to be simple.
// In reality, numerous SACKs resulted in these being retransmitted.

+0.050 < sctp: FORWARD_TSN[cum_tsn=288, ids=[]];
               DATA[flgs=UBE, len=355, tsn=289, sid=1, ssn=0, ppid=0];
               DATA[flgs=UBE, len=356, tsn=290, sid=1, ssn=0, ppid=0];
               DATA[flgs=UB, len=260, tsn=291, sid=1, ssn=0, ppid=0];
               DATA[flgs=UE, len=96, tsn=292, sid=1, ssn=0, ppid=0]

+0.000 > sctp: ABORT[flgs=0, PROTOCOL_VIOLATION[info="Illegal message sequence, missing end for MID: 00000120"]]

+0.200 close(4) = 0

0x120 is 288, which is an end chunk, but the receiver doesn't even know that, as it has never seen that chunk.

I have the full PCAP of the sender side, if it would be of any help. And I have the most highest regard of usrsctp when it comes to protocol conformity, so I look forward to learning where my understanding of RFC3758 is lacking.

Thanks,
Victor

@tuexen
Copy link
Member

tuexen commented Jul 16, 2021

By looking at the packetdrill script, I would say that it is a bug in the FreeBSD stack... At least by looking at it at this point of time... Will look into in more detail...

@boivie
Copy link
Author

boivie commented Jul 17, 2021

Thanks Michael.

To help you pinpoint the issue, I have created a minimal reproducing sample, and learned some things while doing it. I'm essentially trying to find out if there is any way I can ensure that the packets I sent out will not provoke this issue. It's not a super-common issue, but if I can avoid it (as upgrading all clients will take some time), I'd prefer to do that.

--tolerance_usecs=25000
 0.000 `sysctl -w net.inet.sctp.ecn_enable=0`
+0.000 `sysctl -w net.inet.sctp.pr_enable=1`
+0.000 `sysctl -w net.inet.sctp.asconf_enable=0`
+0.000 `sysctl -w net.inet.sctp.auth_enable=0`
+0.000 `sysctl -w net.inet.sctp.reconfig_enable=1`
+0.000 `sysctl -w net.inet.sctp.nrsack_enable=0`
+0.000 `sysctl -w net.inet.sctp.pktdrop_enable=0`
+0.000 socket(..., SOCK_STREAM, IPPROTO_SCTP) = 3
+0.000 bind(3, ..., ...) = 0
+0.000 listen(3, 1) = 0
+0.000 < sctp: INIT[flgs=0, tag=1, a_rwnd=5242880, os=65535, is=65535, tsn=281,
                    FORWARD_TSN_SUPPORTED[],
                    SUPPORTED_EXTENSIONS[types=[RECONFIG, FORWARD_TSN]]]
+0.000 > sctp: INIT_ACK[flgs=0, tag=2, a_rwnd=..., os=..., is=..., tsn=10,
                        FORWARD_TSN_SUPPORTED[],
                        SUPPORTED_EXTENSIONS[types=[FORWARD_TSN, RECONFIG]],
                        STATE_COOKIE[len=..., val=...]]
+0.010 < sctp: COOKIE_ECHO[flgs=0, len=..., val=...]
+0.000 > sctp: COOKIE_ACK[flgs=0]
+0.000 accept(3, ..., ...) = 4
+0.000 close(3) = 0
+0.000 < sctp: DATA[flgs=UB, len=356, tsn=281, sid=1, ssn=0, ppid=0]

+0.000 > sctp: SACK[flgs=0, cum_tsn=281, a_rwnd=..., gaps=[], dups=[]]

// There are a few packets lost, TSNs [282..300],

+0.040 < sctp: DATA[flgs=UE, len=135, tsn=301, sid=1, ssn=0, ppid=0];
               DATA[flgs=UB, len=135, tsn=302, sid=1, ssn=0, ppid=0]

+0.000 > sctp: SACK[flgs=0x00, cum_tsn=281, a_rwnd=..., gaps=[20:21], dups=[]]

+0.050 < sctp: FORWARD_TSN[cum_tsn=288, ids=[]];
               DATA[flgs=UBE, len=355, tsn=289, sid=1, ssn=0, ppid=0]

+0.000 > sctp: ABORT[flgs=0, PROTOCOL_VIOLATION[info="Illegal message sequence, missing end for MID: 00000120"]]

+0.200 close(4) = 0

For this to trigger, the second packet (with a TSN much larger than what was previously seen) has to have at least two DATA chunks in it, with the first having the END bit set, and the last having the BEGINNING bit set. Any other combinations, and the ABORT will not trigger, and a normal SACK will be sent instead.

@boivie
Copy link
Author

boivie commented Aug 23, 2021

Hi again,

I'm wondering if you have any idea if it's possible to work around the problem - to ensure that usrsctp doesn't abort? With the packetdrill and the sequence above, I tried to always send FORWARD-TSN as the only chunk in its packet, as bundling it with a DATA chunk seemed to reproduce the issue more frequently, but when that workaround was evaluated in production, it still happened.

With the popularity of usrsctp as SCTP library in most browsers - and which will remain so for a long time, I really need to consider working around the issue even if it would be corrected.

Thanks!

@tuexen
Copy link
Member

tuexen commented Aug 23, 2021

Let me see if I can figure out what the bug is. Then it might be able to give a way to work around it.

I'm a bit surprised by the plan to work around the bug, even if it is fixed. I'm not familiar with the browser world, but when talking at the beginning about the usrsctp integration into browsers, the statement was that not much effort needs to go into testing, since the browser vendors can roll out bug fixes world wide very fast. So has this changed and code stays much longer in use?

@Orphis
Copy link
Contributor

Orphis commented Aug 23, 2021

While it is possible to release emergency fixes quickly or use our experiment framework to disable codepaths, I don't think that a fix for this issue would qualify, so it would have to go through the regular pipeline which takes a few months at least between the time it takes to merge the fix to the fix being promoted to stable. Some other browsers are also only updated with OS releases, which might be postponed indefinitely by the users.

There are also a lot of users with fixed older versions (typically enterprise users) and other releases integrating Chromium (such as Electron or CEF) that might take a while to be updated. You also have a lot more users using SaaS products based on older versions of libwebrtc (and thus usrsctp) trying to communicate with all kinds of browser versions.

Hence, a workaround is probably needed in dcSCTP and usrsctp might benefit from one as well.

@boivie
Copy link
Author

boivie commented Aug 24, 2021

Thanks Michael - looking forward to your findings. Let me know if I can be of any assistance.

Adding to what @Orphis said, while Chrome and Firefox on the consumer desktop are often up-to-date, mobile browsers and also mobile apps that use usrsctp, that are not strictly browsers. also have a lower uptake of new versions. Some mobile phones simply don't have enough storage to update, or are stuck on old OS versions that the new versions of the apps are incompatible with and will never be updated. Firefox says that it takes 4 weeks to reach 70% with the new version, and I don't know if the remaining 30% are ever updated...

@tuexen
Copy link
Member

tuexen commented Aug 24, 2021

While it is possible to release emergency fixes quickly or use our experiment framework to disable codepaths, I don't think that a fix for this issue would qualify, so it would have to go through the regular pipeline which takes a few months at least between the time it takes to merge the fix to the fix being promoted to stable. Some other browsers are also only updated with OS releases, which might be postponed indefinitely by the users.

There are also a lot of users with fixed older versions (typically enterprise users) and other releases integrating Chromium (such as Electron or CEF) that might take a while to be updated. You also have a lot more users using SaaS products based on older versions of libwebrtc (and thus usrsctp) trying to communicate with all kinds of browser versions.

Hence, a workaround is probably needed in dcSCTP and usrsctp might benefit from one as well.

Thanks for the explanation!

@tuexen
Copy link
Member

tuexen commented Aug 24, 2021

Thanks Michael - looking forward to your findings. Let me know if I can be of any assistance.

Will report here by the end of this week.

Adding to what @Orphis said, while Chrome and Firefox on the consumer desktop are often up-to-date, mobile browsers and also mobile apps that use usrsctp, that are not strictly browsers. also have a lower uptake of new versions. Some mobile phones simply don't have enough storage to update, or are stuck on old OS versions that the new versions of the apps are incompatible with and will never be updated. Firefox says that it takes 4 weeks to reach 70% with the new version, and I don't know if the remaining 30% are ever updated...

Thanks for the data point.

@boivie
Copy link
Author

boivie commented Aug 25, 2021

Just for reference, and perhaps it helps during troubleshooting, this is a packetdrill that reproduced the issue with FORWARD-TSN in its own packet, not bundled with any DATA. Another difference here (from previous example), is that there is only a single fragmented message in-flight, and the FORWARD-TSN is not moved beyond it - just skipping a few unfragmented messages that were sent prior.

+0.000 `sysctl -w net.inet.sctp.pktdrop_enable=0`
+0.000 socket(..., SOCK_STREAM, IPPROTO_SCTP) = 3
+0.000 bind(3, ..., ...) = 0
+0.000 listen(3, 1) = 0
+0.000 < sctp: INIT[flgs=0, tag=1, a_rwnd=5242880, os=65535, is=65535, tsn=0,
                    FORWARD_TSN_SUPPORTED[],
                    SUPPORTED_EXTENSIONS[types=[RECONFIG, FORWARD_TSN]]]
+0.000 > sctp: INIT_ACK[flgs=0, tag=2, a_rwnd=..., os=..., is=..., tsn=10,
                        FORWARD_TSN_SUPPORTED[],
                        SUPPORTED_EXTENSIONS[types=[FORWARD_TSN, RECONFIG]],
                        STATE_COOKIE[len=..., val=...]]
+0.010 < sctp: COOKIE_ECHO[flgs=0, len=..., val=...]
+0.000 > sctp: COOKIE_ACK[flgs=0]
+0.000 accept(3, ..., ...) = 4
+0.000 close(3) = 0
+0.000 < sctp: DATA[flgs=UBE, len=100, tsn=0, sid=1, ssn=0, ppid=0]

+0.000 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=..., gaps=[], dups=[]]

// There are a few packets lost, TSNs 1 (UBE), 2 (UBE), 3 (UB),

+0.040 < sctp: DATA[flgs=U, len=100, tsn=4, sid=1, ssn=0, ppid=0]

+0.000 > sctp: SACK[flgs=0x00, cum_tsn=0, a_rwnd=..., gaps=[4:4], dups=[]]

+0.000 < sctp: DATA[flgs=UE, len=100, tsn=5, sid=1, ssn=0, ppid=0]

+0.000 > sctp: SACK[flgs=0x00, cum_tsn=0, a_rwnd=..., gaps=[4:5], dups=[]]

// The SACK will abandon TSN 1, as it was sent with limited retransmissions.
+0.050 < sctp: FORWARD_TSN[cum_tsn=1, ids=[]]

+0.000 > sctp: SACK[flgs=0x00, cum_tsn=1, a_rwnd=..., gaps=[3:4], dups=[]]

// Retransmits TSN 2 and 3
+0.000 < sctp: DATA[flgs=UBE, len=100, tsn=2, sid=1, ssn=0, ppid=0]
+0.000 < sctp: DATA[flgs=UB, len=100, tsn=3, sid=1, ssn=0, ppid=0]

+0.000 > sctp: ABORT[flgs=0, PROTOCOL_VIOLATION[info="Illegal message sequence, missing end for MID: 00000001"]]

+0.200 close(4) = 0

@tuexen
Copy link
Member

tuexen commented Sep 1, 2021

Here is a minimal (at least according to my current thinking) reproducer of the issue:

--tolerance_usecs=25000
 0.000 `kldload -n sctp`
+0.000 `sysctl -w net.inet.sctp.ecn_enable=0`
+0.000 `sysctl -w net.inet.sctp.pr_enable=1`
+0.000 `sysctl -w net.inet.sctp.asconf_enable=0`
+0.000 `sysctl -w net.inet.sctp.auth_enable=0`
+0.000 `sysctl -w net.inet.sctp.reconfig_enable=0`
+0.000 `sysctl -w net.inet.sctp.nrsack_enable=0`
+0.000 `sysctl -w net.inet.sctp.pktdrop_enable=0`
+0.000 socket(..., SOCK_STREAM, IPPROTO_SCTP) = 3
+0.000 bind(3, ..., ...) = 0
+0.000 listen(3, 1) = 0
+0.000 < sctp: INIT[flgs=0, tag=1, a_rwnd=5242880, os=65535, is=65535, tsn=0,
                    FORWARD_TSN_SUPPORTED[],
                    SUPPORTED_EXTENSIONS[types=[FORWARD_TSN]]]
+0.000 > sctp: INIT_ACK[flgs=0, tag=2, a_rwnd=..., os=..., is=..., tsn=10,
                        FORWARD_TSN_SUPPORTED[],
                        SUPPORTED_EXTENSIONS[types=[FORWARD_TSN]],
                        STATE_COOKIE[len=..., val=...]]
+0.010 < sctp: COOKIE_ECHO[flgs=0, len=..., val=...]
+0.000 > sctp: COOKIE_ACK[flgs=0]
+0.000 accept(3, ..., ...) = 4
+0.000 close(3) = 0

//    <----- DATA[UBE, TSN=0] --------
//    ------ SACK[cumTSN=0]---------->
//     |-----DATA[UBE, TSN=1] --------  To be skipped
//     |-----DATA[UBE, TSN=2] --------  Trigger the problem on retransmission
//     |-----DATA[UB, TSN=3] ---------
//    <------DATA[U,  TSN=4] ---------
//    ------ SACK[cumTSN=0,[4]]------>
//    <------FORWARD_TSN[cumTSN=1]----
//    ------ SACK[cumTSN=1,[3]]------>
//    <------DATA[UBE, TSN=2] -------- Bug!

+0.000 < sctp: DATA[flgs=UBE, len=116, tsn=0, sid=0, ssn=0, ppid=0]
+0.000 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=..., gaps=[], dups=[]]
+0.000 recv(4, ..., 1024, 0) = 100
+0.040 < sctp: DATA[flgs=U, len=120, tsn=4, sid=0, ssn=0, ppid=0]
+0.000 > sctp: SACK[flgs=0x00, cum_tsn=0, a_rwnd=..., gaps=[4:4], dups=[]]
+0.050 < sctp: FORWARD_TSN[cum_tsn=1, ids=[]]
+0.000 > sctp: SACK[flgs=0x00, cum_tsn=1, a_rwnd=..., gaps=[3:3], dups=[]]
+0.000 < sctp: DATA[flgs=UBE, len=118, tsn=2, sid=0, ssn=0, ppid=0]
+0.000 > sctp: ABORT[flgs=0, PROTOCOL_VIOLATION[info="Illegal message sequence, missing end for MID: 00000001"]]
+0.200 close(4) = 0

Still contemplating about the best way to fix this...

@boivie
Copy link
Author

boivie commented Sep 2, 2021

Thanks for the update, Michael.

I agree - it's very similar to my previous comment which was an actual capture "from the wild", but even shorter.

@boivie
Copy link
Author

boivie commented Sep 17, 2021

Hi again,

Just wondering if you have made any progress on this issue?

Thanks, and have a great weekend!

@tuexen
Copy link
Member

tuexen commented Sep 17, 2021

I'm interrupted by other bugs. Need to look into more detail. Planned for next week.

@boivie
Copy link
Author

boivie commented Sep 17, 2021

Thanks, Michael.

@boivie
Copy link
Author

boivie commented Nov 4, 2021

Hi Michael,

I'm just checking in to see if you have had time to look more into this.

Thanks!

@tuexen
Copy link
Member

tuexen commented Nov 4, 2021

Hi Michael,

I'm just checking in to see if you have had time to look more into this.

Haven't nailed down it yet. The semester just started and the IETF meeting is next week... Will look into it.

Best regards
Michael

Thanks!

ibaoger pushed a commit to ibaoger/webrtc that referenced this issue Apr 20, 2022
This proved to be not very efficient unfortunately, so revert it and
keep bundling FORWARD-TSN with other packets to be more efficient.

sctplab/usrsctp#597 is still unresolved.

Note that this is not a clean revert; The logic to rate limit the
sending of FORWARD-TSN is kept, as it still makes sense.

This partly reverts commit 0ca62e3.

Bug: webrtc:12961
Change-Id: I42728434290e7ece19e9c23f24ef6f3d3b171315
Reviewed-on: https://webrtc-review.googlesource.com/c/src/+/259520
Reviewed-by: Harald Alvestrand <hta@webrtc.org>
Commit-Queue: Victor Boivie <boivie@webrtc.org>
Cr-Commit-Position: refs/heads/main@{#36584}
aethanyc pushed a commit to aethanyc/gecko-dev that referenced this issue Sep 2, 2022
Upstream commit: https://webrtc.googlesource.com/src/+/ddc2f334c49c5287dae7ade273a530f4817ee0a6
    Revert "dcsctp: Avoid bundling FORWARD-TSN and DATA chunks"

    This proved to be not very efficient unfortunately, so revert it and
    keep bundling FORWARD-TSN with other packets to be more efficient.

    sctplab/usrsctp#597 is still unresolved.

    Note that this is not a clean revert; The logic to rate limit the
    sending of FORWARD-TSN is kept, as it still makes sense.

    This partly reverts commit 0ca62e3752149ad37f73bf074db0a5f8fcaf6585.

    Bug: webrtc:12961
    Change-Id: I42728434290e7ece19e9c23f24ef6f3d3b171315
    Reviewed-on: https://webrtc-review.googlesource.com/c/src/+/259520
    Reviewed-by: Harald Alvestrand <hta@webrtc.org>
    Commit-Queue: Victor Boivie <boivie@webrtc.org>
    Cr-Commit-Position: refs/heads/main@{#36584}
@jesup
Copy link

jesup commented Oct 27, 2023

@tuexen Any chance of restarting work on this?

daemory pushed a commit to dangxiwang/webrtc that referenced this issue Apr 19, 2024
dcSCTP seems to be able to provoke usrsctp to send ABORT in some
situations, as described in
sctplab/usrsctp#597. Using a packetdrill
script, it seems as a contributing factor to this behavior is when a
FORWARD-TSN chunk is bundled with a DATA chunk. This is a valid and
recommended pattern in RFC3758:

  "F2) The data sender SHOULD always attempt to bundle an outgoing
       FORWARD TSN with outbound DATA chunks for efficiency."

However, that seems to be a rare event in usrsctp, which generally sends
each FORWARD-TSN in a separate packet.

By doing the same, the assumption is that this scenario will generally
be avoided.

With many browsers and other clients using usrsctp, and which will not
be upgraded for a long time, there is an advantage of avoiding the issue
even if it is according to specification.

Before this change, a FORWARD-TSN was bundled with outgoing DATA and due
to this, it wasn't rate limited as the overhead was very little. With
this change, a rate limiting behavior has been added to avoid sending
too many FORWARD-TSN in small packets. It will be sent every RTT, or
200 ms, whichever is smallest. This is also described in the RFC.

Bug: webrtc:12961
Change-Id: I3d8036a34f999f405958982534bfa0e99e330da3
daemory pushed a commit to dangxiwang/webrtc that referenced this issue Apr 19, 2024
This proved to be not very efficient unfortunately, so revert it and
keep bundling FORWARD-TSN with other packets to be more efficient.

sctplab/usrsctp#597 is still unresolved.

Note that this is not a clean revert; The logic to rate limit the
sending of FORWARD-TSN is kept, as it still makes sense.

This partly reverts commit 0ca62e3.

Bug: webrtc:12961
Change-Id: I42728434290e7ece19e9c23f24ef6f3d3b171315
daemory pushed a commit to dangxiwang/webrtc that referenced this issue Apr 19, 2024
This proved to be not very efficient unfortunately, so revert it and
keep bundling FORWARD-TSN with other packets to be more efficient.

sctplab/usrsctp#597 is still unresolved.

This reverts commit 0ca62e3.
Bug:

Change-Id: I42728434290e7ece19e9c23f24ef6f3d3b171315
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants