Aborted with sctp_timeout_handler: tmr->self corrupted, but tmr->self is null in the core dump #673

Open
JonathanLennox opened this issue Apr 27, 2023 · 6 comments

@JonathanLennox
Contributor

Hi - I'm debugging an assert crash inside the usrsctp library.

What I see is that it printed the debug message
sctp_timeout_handler: tmr->self corrupted
from netinet/sctputil.c:1820 just before it aborted, but when I look at the core dump, tmr->self is NULL, which the assertion allows:

(gdb) up
#6  0x00007f4c76993db2 in sctp_timeout_handler (t=0x7f4cf02a96e8) at netinet/sctputil.c:1820
1820		KASSERT(tmr->self == NULL || tmr->self == tmr,
(gdb) p tmr->self
$1 = (void *) 0x0
(gdb) p *tmr
$1 = {timer = {tqe = {tqe_next = 0x7f4c880d9fe0, tqe_prev = 0x7f4c9c066f00}, c_time = 11115660, c_arg = 0x7f4cf02a96e8, 
    c_func = 0x7f4c76993c9f <sctp_timeout_handler>, c_flags = 0}, type = 3, ep = 0x7f4cf02a9e70, tcb = 0x7f4cf02a9610, net = 0x0, self = 0x0, 
  ticks = 11115460, stopped_from = 2415919107}

So I assume this must be a race condition of some sort where the value of tmr->self isn't properly protected.
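
One way the log message and the core dump could disagree is sketched below as a deterministic, single-threaded interleaving (no real threads involved). The names `demo_timer`, `race_result`, and `simulate_interleaving` are hypothetical and not part of usrsctp; this only illustrates the suspected pattern, it does not claim to be what actually happens in the library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sctp_timer: only the field the KASSERT reads. */
struct demo_timer {
    void *self;
};

struct race_result {
    bool check_passed;   /* what the assertion concluded from the value it read */
    void *self_in_dump;  /* what a debugger would see in memory afterwards */
};

/*
 * Replays one possible ordering of events on a single thread:
 *   1. some writer leaves a transient value in tmr->self,
 *   2. the timeout handler reads tmr->self for its check,
 *   3. another writer clears the field before the process state is captured.
 */
static struct race_result simulate_interleaving(struct demo_timer *tmr, void *transient)
{
    struct race_result r;

    assert(tmr != NULL);
    tmr->self = transient;             /* step 1: unprotected write */
    void *observed = tmr->self;        /* step 2: the read behind the KASSERT */
    tmr->self = NULL;                  /* step 3: field cleared afterwards */

    /* Same condition as the assertion at sctputil.c:1820. */
    r.check_passed = (observed == NULL || observed == (void *)tmr);
    r.self_in_dump = tmr->self;
    return r;
}
```

Calling `simulate_interleaving(&t, &some_other_object)` reports a failed check while `self_in_dump` is NULL, which matches the symptom: the assertion fires on a value that is no longer in memory when the dump is inspected.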

I don't see any other threads running inside usrsctp in my core dump, but from my logs, it appears that usrsctp_conninput was called immediately before the crash (within the same millisecond).

This is usrsctp c1d6cb3 built for Linux/x86_64, as built in https://github.com/jitsi/jitsi-sctp, running as native code under an OpenJDK 11.0.18 Java VM running https://github.com/jitsi/jitsi-videobridge.

Let me know if there's any other information I can provide.

@JonathanLennox
Contributor Author

This has now happened three more times, with the same symptoms. Please let me know any further information you need.

@tuexen
Member

tuexen commented May 24, 2023

How often does it happen? Do you have a way to reproduce this?

@JonathanLennox
Contributor Author

Sadly, I don't have a way to reproduce it reliably. It happens once every few days across our fleet of meet.jit.si production servers.

I have the core dumps so if there's any other information I can share that would be useful let me know.

See also #676, a rarer crash (I've only seen it once so far) that I suspect has the same root cause and may be more revealing.

@JonathanLennox
Contributor Author

I've finally managed to extract a Java heap dump corresponding to this core dump, which lets me correlate my user-level objects and logs with the usrsctp objects. (For most of the crashes I'm running into a Java bug which is preventing this heap dump from being created.)

In this case it appears that the socket with the crashing timer received an SCTP packet just under 200 ms before the crash. Five other SCTP sockets received a packet in the interval between that packet receipt and the crash.

The crashing timer appears to be an SCTP_TIMER_TYPE_RECV timer.
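
The gdb dump in the first comment shows `type = 3` for this timer and also `stopped_from = 2415919107`, which is 0x90000003 in hex. A minimal sketch for splitting that value follows, assuming it is built as a source-file tag in the top four bits plus a small location number in the remaining bits. That layout is an assumption to verify against netinet/sctp_constants.h, and `stopped_from_split` is a hypothetical helper, not a usrsctp function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Splits a stopped_from value into an assumed file tag (top 4 bits)
 * and an assumed location number (low 28 bits).
 */
static void stopped_from_split(uint32_t value, uint32_t *tag, uint32_t *loc)
{
    assert(tag != NULL && loc != NULL);
    *tag = value & 0xF0000000u;  /* which source file stopped the timer (assumed) */
    *loc = value & 0x0FFFFFFFu;  /* which call site within that file (assumed) */
}
```

For the value in the dump this yields tag 0x90000000 and location 3. Looking up which file and call site that pair names would show where this timer was last stopped.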

@tuexen
Member

tuexen commented May 24, 2023

> I've finally managed to extract a Java heap dump corresponding to this core dump, which lets me correlate my user-level objects and logs with the usrsctp objects. (For most of the crashes I'm running into a Java bug which is preventing this heap dump from being created.)
>
> In this case it appears that the socket with the crashing timer received an SCTP packet just under 200 ms before the crash. Five other SCTP sockets received a packet in the interval between that packet receipt and the crash.
>
> The crashing timer appears to be a SCTP_TIMER_TYPE_RECV timer.

That is the delayed ACK timer, which normally expires after 200 ms. I'm looking at a timer-related problem where a socket is freed twice. Once I have committed a fix, you could try it to see whether it also fixes your issue. I'll let you know once I have solved it.
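
The 200 ms figure can be cross-checked against the gdb dump in the first comment, where `ticks = 11115460` and `c_time = 11115660`. The sketch below assumes that `ticks` records when the timer was started, that `c_time` is its expiry tick, and that the tick rate is 1000 per second. All three are assumptions to confirm against the build in question, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed tick rate (ticks per second); check the actual hz value of the build. */
#define ASSUMED_HZ 1000u

/* Converts a tick count to milliseconds for a given tick rate. */
static uint32_t ticks_to_ms(uint32_t ticks, uint32_t hz)
{
    assert(hz != 0);
    return (uint32_t)(((uint64_t)ticks * 1000u) / hz);
}

/*
 * Interval the timer was armed for, in milliseconds.
 * Unsigned subtraction keeps the result correct across tick wraparound.
 */
static uint32_t timer_interval_ms(uint32_t c_time, uint32_t start_ticks, uint32_t hz)
{
    return ticks_to_ms(c_time - start_ticks, hz);
}
```

With the values from the dump, `timer_interval_ms(11115660u, 11115460u, ASSUMED_HZ)` gives 200, consistent with a delayed ACK timer armed for 200 ms and with the packet received just under 200 ms before the crash.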

@JonathanLennox
Contributor Author

Any news on this?
