
actual crash fix this time #800

Merged: 2 commits into robotastic:master on May 5, 2023
Conversation

gisforgirard
Contributor

This is the actual crash fix I kept messing up and submitting pull requests for, for real this time.

Honestly, this is probably not the best way to solve it, but there is a case where amt_produce ends up going over 32768 and crashes the program reliably every time. Capping it has solved the problem for me without any obvious negative side effects. There is probably a better way to go about this, and it is pretty much a hack, but it's a hack that works.
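
Roughly, the idea is something like the following sketch (assumed names, not the exact diff in this PR): cap how much is produced in one pass at the 32768-sample buffer limit and leave the rest queued for the next pass.

```cpp
#include <algorithm>

// Hypothetical guard, not the exact change in this PR: never try to produce
// more than one buffer's worth of samples in a single pass; whatever is left
// over stays in the queue and is handled on the next call.
static const long kBufferLimit = 32768;

int clamp_amt_produce(long queued_samples) {
  return static_cast<int>(std::min(queued_samples, kBufferLimit));
}
```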

@Dygear
Contributor

Dygear commented May 2, 2023

When does amt_produce go over the 32768 limit? It sounds like there is another bug lurking somewhere. This feels a little like a band-aid over the problem. Will discarding everything past the 32767th byte cause significant data loss for the voice data, for example?

@robotastic
Owner

This is definitely interesting! Let me see if I can find some suspects; amt_produce should never get that large.

@tadscottsmith
Contributor

I'd say it's always a good idea to check for overflows, but I'm pretty sure you've identified the reason you're seeing this issue while most others aren't: #773 (comment).

@robotastic
Owner

Tracking some research:

- amt_produce is an int, which means it can hold a range from -2147483648 to 2147483647.
- It is set to the size of the queue holding the output, which is returned as a size_t.

The baffling thing is that amt_produce should never be that big; it is just however many samples have accumulated since the last time the output was processed.

Could you uncomment line 183 in that file? I am curious what values you are seeing for the output queue.
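
For reference, something along these lines would show the values I am after (an illustration only, not line 183 itself, and assuming the output queue is a deque of int16_t samples):

```cpp
#include <cstdint>
#include <cstdio>
#include <deque>

// Illustration only: log how deep the output queue has grown on each pass, so
// the size_t value being narrowed into the int amt_produce is visible. An int
// holds values far beyond 32768, so a plain overflow of amt_produce itself is
// not what crashes.
void log_queue_depth(const std::deque<int16_t> &output_queue) {
  size_t depth = output_queue.size();         // queue size comes back as size_t
  int amt_produce = static_cast<int>(depth);  // narrowed into the int used downstream
  std::fprintf(stderr, "output queue depth: %zu (amt_produce = %d)\n", depth, amt_produce);
}
```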

@robotastic
Owner

Oh, you may be right, @tadscottsmith. That would explain why so much gets built up. I am honestly surprised this fixes things, and I also wonder why that particular value; it seems like an int could hold a lot more. Maybe it is something further down the stream.

@faultywarrior
Sponsor Contributor

I suppose it is possible that the queue is getting built up that big because filesystem writes are being held up, but it does seem odd. Usually if you start to outstrip your storage throughput, you just start dropping traffic or get spammed with O's. FWIW, I capture about a dozen calls per second on my T-R instance to a NAS which, like @gisforgirard, I have mounted as a local folder. I've never seen this crash occur; my T-R instance has had months of uptime. However, my NAS is fully solid-state and has a 10Gb connection to the ESXi host where the T-R instance runs, so it's possible that despite the large write volume (my T-R logs are written to the same NAS), my latency is low enough that I never build the buffer up big enough to trip this bug.

@tadscottsmith
Contributor

If my math is right, and 160 samples = 20 ms, then 32768 samples is (32768 / 160 = 204.8; 204.8 × 20 ms = 4096 ms), or about 4.1 seconds behind.

The max value of an int16_t is 32767, and there are an awful lot of those in that area of the code, which makes me suspicious.

int16_t *out = (int16_t *)output_items[0];

This line stands out after a quick glance.
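
For context, the general shape of that code path looks something like this sketch (assumed names, not the actual trunk-recorder function). A GNU Radio work call is only supposed to produce up to noutput_items samples, so a backed-up queue that pushes the copy past that bound would write beyond the buffer the scheduler handed to the block:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch, not project code: copy queued samples into the output
// buffer GNU Radio provides. Without the bound check, a queue that has grown
// past noutput_items would make the memcpy run off the end of the buffer.
int produce_samples(void *output_items[], int noutput_items,
                    std::vector<int16_t> &queued) {
  int16_t *out = (int16_t *)output_items[0];          // the cast flagged above
  int amt_produce = static_cast<int>(queued.size());  // can exceed noutput_items when writes stall
  if (amt_produce > noutput_items)                    // the bound that keeps the copy in range
    amt_produce = noutput_items;
  std::memcpy(out, queued.data(), amt_produce * sizeof(int16_t));
  queued.erase(queued.begin(), queued.begin() + amt_produce);
  return amt_produce;
}
```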

@taclane
Contributor

taclane commented May 3, 2023

> If my math is right, and 160 samples = 20 ms, then 32768 samples is (32768 / 160 = 204.8; 204.8 × 20 ms = 4096 ms), or about 4.1 seconds behind.

That sounds about right. A couple of years back there was an issue where buffers would back up, and it led to call audio falling out of sequence: 60-90k samples in the queue time-shifted transmissions 8-10 seconds into the next call.

I'm a little curious about "// buffer limit is 32768, see gnuradio/gnuradio-runtime/lib/../include/gnuradio/buffer.h:186". Which version of GNU Radio is this from?

@Dygear
Contributor

Dygear commented May 3, 2023

Oh, not being able to flush to the network is a problem. I didn't think it would hold onto the data in this buffer; I assumed it would sit in a file buffer somewhere else and get saved to the filesystem (even a remote filesystem) once the device became ready for it.

Sounds like the real fix is a file-save function that holds a queue of data waiting to be written to disk somewhere in the main loop, so it doesn't backpressure into this buffer.
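
Something roughly like this sketch is what I mean (hypothetical names, not an existing trunk-recorder API): a background thread owns the slow writes and drains a queue of finished recordings, so a stalled NAS blocks only that thread instead of backpressuring into the GNU Radio buffer.

```cpp
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical sketch of a decoupled writer: enqueue() is cheap and never
// touches the disk; the worker thread performs the (possibly slow) writes.
struct PendingWrite {
  std::string path;
  std::vector<int16_t> samples;
};

class AsyncFileWriter {
public:
  AsyncFileWriter() : done_(false), worker_([this] { run(); }) {}

  ~AsyncFileWriter() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();  // finishes flushing whatever is still queued
  }

  void enqueue(PendingWrite write) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push_back(std::move(write));
    }
    cv_.notify_one();
  }

private:
  void run() {
    for (;;) {
      PendingWrite write;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
        if (queue_.empty())
          return;  // done_ was set and everything has been flushed
        write = std::move(queue_.front());
        queue_.pop_front();
      }
      // Slow I/O (e.g. a laggy NAS mount) happens here, off the flowgraph thread.
      std::ofstream out(write.path, std::ios::binary);
      out.write(reinterpret_cast<const char *>(write.samples.data()),
                static_cast<std::streamsize>(write.samples.size() * sizeof(int16_t)));
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::deque<PendingWrite> queue_;
  bool done_;
  std::thread worker_;
};
```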

@robotastic
Owner

I added a few more debug values to the message, and I am also checking with a friend to see if he has any ideas. Either way, this seems like a good fix, so I am merging it in.

@robotastic merged commit 8cabd46 into robotastic:master on May 5, 2023
1 check passed
@gisforgirard
Contributor Author

gisforgirard commented May 13, 2023

> However, my NAS is fully solid-state and has a 10Gb connection to the ESXi host where the T-R instance runs, so it's possible that despite the large write volume, my latency is low enough that I never build the buffer up big enough to trip this bug.

Yeah, I have a 10GBASE-T link straight to my NAS, but for whatever reason it has weird latency issues: packets won't drop, but the whole system freezes at the kernel level whenever it happens. I have tried different network cards, SFP+ adapters, ethernet cables, and kernel settings, and it drives me crazy trying to track down what is going on, but I still haven't been able to figure it out. It's probably some kind of problem with Synology's software, but I'm already about $3000 into this stupid thing, so maybe I'll go back to TrueNAS for the next build instead.
