Freeswitch dies (logs show either signal 6,9 or 11), resurrects but sometimes remains a zombie #252

Closed
marcellinog opened this issue Jan 21, 2020 · 1 comment

marcellinog commented Jan 21, 2020

We are seeing intermittent crashes in FreeSWITCH (fs) 1.8.7 running on Debian 9.
Even under very light load, the main process crashes or is killed.

We have ruled out the OOM killer as the source of the signal 9 that kills it from time to time, so that one is still a mystery. Any clue or hint on why this is happening would be awesome.

After compiling FreeSWITCH with tcmalloc or jemalloc we have not seen any more crashes with signal 6 (SIGABRT).
The segfaults (signal 11) seem to point to memory buffer corruption during media recording; any help would be gladly accepted. Note: the recording is from G.711 to WAV.
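For readers triaging similar logs, the three signal numbers above can be told apart roughly as in the sketch below. It is only an orientation aid: `signal_origin_hint` is a hypothetical helper, the hints describe the typical case rather than every case, and the numeric values 6, 9 and 11 are the Linux assignments for these signals.

```c
#include <signal.h>
#include <string.h>

/*
 * Hypothetical helper: returns a short hint about where a termination
 * signal typically originates. "Typically" matters here: any signal can
 * also be sent from outside with kill(2).
 */
const char *signal_origin_hint(int signo)
{
    switch (signo) {
    case SIGABRT: /* 6 on Linux */
        return "typically raised by the process itself via abort(), "
               "e.g. after an allocator consistency check fails";
    case SIGKILL: /* 9 on Linux */
        return "always delivered from outside the process; "
               "it cannot be caught, blocked or ignored";
    case SIGSEGV: /* 11 on Linux */
        return "typically an invalid memory access inside the process";
    default:
        return "not covered by this sketch";
    }
}
```

Because signal 9 cannot be raised by a fault inside the process, something else on the host must be sending it once the OOM killer is excluded.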

Example backtrace from one of the segfaults:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/local/freeswitch/bin/freeswitch -u root -g root -ncwait -nonat -rp -core'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __memmove_sse2_unaligned_erms () at ../sysdeps/x86_64/multiarch/../multiarch/memmove-vec-unaligned-erms.S:363
363 ../sysdeps/x86_64/multiarch/../multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
[Current thread is 1 (Thread 0x7f240d347700 (LWP 29527))]
(gdb) bt full
#0 __memmove_sse2_unaligned_erms () at ../sysdeps/x86_64/multiarch/../multiarch/memmove-vec-unaligned-erms.S:363
No locals.
#1 0x00007f243099c808 in switch_buffer_read (buffer=0x7f240af76250, data=, datalen=datalen@entry=65536) at src/switch_buffer.c:247
reading = 65536
#2 0x00007f24309bcace in switch_core_file_write (fh=0x7f240af151b8, data=data@entry=0x7f240943c028, len=len@entry=0x7f240d345518) at src/switch_core_file.c:656
datalen_adj = 65536
status = SWITCH_STATUS_SUCCESS
asis = 0
rlen =
blen = 32768
orig_len =
__PRETTY_FUNCTION__ = "switch_core_file_write"
__func__ = "switch_core_file_write"
#3 0x00007f2430a5e12e in recording_thread (thread=, obj=) at src/switch_ivr_async.c:1281
bug =
session = 0x7f24207676e8
channel = 0x7f2421659110
rh = 0x7f240af15408
bsize = 8192
samples = 192
inuse = 384
data = 0x7f240943c028 "\340\002\346\004V\v\223\bJ\v\227\r\321\n\024\v\305\f\250\016Y\v\255\022\006\024\033\025\340\024\032\025J\033 \030B\031\263\035.\032\215\037\372!,\036/ \235!\253 2"\334$\313#\230&\254)4'p$\177*\222(\271&\033*\313(K)\177'\273'\371(\373(\005'~(\240)\212!\261%\006%\223!\255"
\246 \004 ^\035\246\034\t\031#\026v\024\351\022\034\024(\020T\017\206\r\214\016\215\t\302\a-\bS\003\333\375\334\376w\367\310\364\327\366?\361\337\362\201\363)\364\375\356\272\357\351\356\361\357\350\354)\354$\361\343\356\325\351\004\355@\352v\341\313\345\301\341O\341\252\342\271\336W\335R\334\026\333\334", <incomplete sequence \327>...
channels = 1
read_impl = {codec_type = SWITCH_CODEC_TYPE_AUDIO, ianacode = 0 '\000', iananame = 0x0, fmtp = 0x0, samples_per_second = 0, actual_samples_per_second = 0,
bits_per_second = 0, microseconds_per_packet = 0, samples_per_packet = 0, decoded_bytes_per_packet = 0, encoded_bytes_per_packet = 0, number_of_channels = 0 '\000',
codec_frames_per_packet = 0, init = 0x0, encode = 0x0, decode = 0x0, encode_video = 0x0, decode_video = 0x0, codec_control = 0x0, destroy = 0x0, codec_id = 0,
impl_id = 0, modname = 0x0, next = 0x0}
__func__ = "recording_thread"
#4 0x00007f2430cfd3d7 in dummy_worker (opaque=0x7f2409cb9160) at threadproc/unix/thread.c:151
thread = 0x7f2409cb9160
#5 0x00007f242fee34a4 in start_thread (arg=0x7f240d347700) at pthread_create.c:456
__res =
pd = 0x7f240d347700
now =
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139792817092352, -3982925440384300703, 139792690191582, 139792690191583, 139792816844800, 3, 3959178655632394593,
3959105760450644321}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call =
pagesize_m1 =
sp =
freesize =
__PRETTY_FUNCTION__ = "start_thread"
#6 0x00007f242f520d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
No locals.

We use Lua (5.2.4 in this case) for call-control logic.

The main FreeSWITCH process is managed by systemd (via systemctl).
After a crash and restart, some of our FreeSWITCH instances get stuck in limbo:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/local/freeswitch/bin/freeswitch'.
#0 0x00007ff03291b603 in select () at ../sysdeps/unix/syscall-template.S:84
84 ../sysdeps/unix/syscall-template.S: No such file or directory.
[Current thread is 1 (Thread 0x7ff035397040 (LWP 28473))]
(gdb) bt full
#0 0x00007ff03291b603 in select () at ../sysdeps/unix/syscall-template.S:84
No locals.
#1 0x00007ff034100915 in apr_sleep (t=) at time/unix/time.c:246
tv = {tv_sec = 0, tv_usec = 383914}
#2 0x00007ff033e9571c in do_sleep (t=) at src/switch_time.c:164
No locals.
#3 0x00007ff033ddbf02 in switch_core_runtime_loop (bg=) at src/switch_core.c:1197
No locals.
#4 0x000055777b916de6 in main (argc=9, argv=) at src/switch.c:1208
pid_path = "/usr/local/freeswitch/run/freeswitch.pid", '\000' <repeats 4055 times>
pid_buffer = "28473", '\000' <repeats 26 times>
old_pid_buffer = '\000' <repeats 31 times>
pid_len = 5
old_pid_len = 0
err = 0x7ff03412c02d "Success"
nf =
do_wait =
runas_user =
runas_group =
reincarnate =
reincarnate_reexec =
fds = {3, -1}
nc = SWITCH_TRUE
pid =
i =
x =
opts =
opts_str = '\000' <repeats 1023 times>
local_argv = {0x7fff2e91de54 "/usr/local/freeswitch/bin/freeswitch", 0x7fff2e91de79 "-u", 0x7fff2e91de7c "root", 0x7fff2e91de81 "-g", 0x7fff2e91de84 "root",
0x7fff2e91de89 "-ncwait", 0x7fff2e91de91 "-nonat", 0x7fff2e91de98 "-rp", 0x7fff2e91de9c "-core", 0x0 <repeats 1015 times>}
local_argc =
arg_argv = {0x0 <repeats 128 times>}
alt_dirs =
alt_base =
log_set =
run_set =
do_kill =
priority =
flags =
ret = 0
destroy_status =
fd = 0x7ff02c04a0a0
pool = 0x7ff02c04a028

And fs_cli is of course unresponsive.

When this happens we have to manually 'systemctl restart freeswitch'.

We also have freeswitch configured to attempt recovery of the sofia stack. This might have an impact.

If anybody has suggestions or tips, that would be great.

UPDATE:
Debugging under gdb and valgrind shows the cause. When an attended transfer is done while the call is being recorded, and the recording is retained across the transfer, the media bug on channel A is moved to channel B. It still points at the same file, but through a different buffer handle that refers to the same memory buffer. When party C then gets the call (A transferred to C by B) and B hangs up, FreeSWITCH crashes with memory corruption because the memory buffer used by the bug has been destroyed.
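The failure pattern described in this update (two handles referring to one buffer, one of them destroying it while the other is still in use) can be sketched in isolation. The code below is not FreeSWITCH code: `shared_buf_t`, `rec_handle_t` and the `rec_*` functions are hypothetical names invented for illustration. It shows one common way to avoid the problem, reference counting, so that the storage is released only when the last handle lets go. Whether 1.10.x handles this case in a comparable way is not something this thread establishes.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a recording buffer shared by several handles. */
typedef struct {
    unsigned char *data;
    size_t size;
    int refcount;            /* number of handles pointing at this buffer */
} shared_buf_t;

/* Hypothetical per-channel handle (not switch_file_handle_t). */
typedef struct {
    shared_buf_t *buf;
} rec_handle_t;

/* Allocate a zeroed buffer owned by one handle. Returns 0 on success. */
int rec_open(rec_handle_t *h, size_t size)
{
    shared_buf_t *b = malloc(sizeof(*b));
    if (!b) return -1;
    b->data = calloc(1, size);
    if (!b->data) { free(b); return -1; }
    b->size = size;
    b->refcount = 1;
    h->buf = b;
    return 0;
}

/* A second handle takes a counted reference instead of a bare pointer copy. */
void rec_share(rec_handle_t *dst, const rec_handle_t *src)
{
    dst->buf = src->buf;
    dst->buf->refcount++;
}

/* Drop one reference. Returns 1 if the storage was actually freed. */
int rec_close(rec_handle_t *h)
{
    shared_buf_t *b = h->buf;
    if (!b) return 0;
    h->buf = NULL;
    if (--b->refcount > 0) return 0;   /* another handle still needs it */
    free(b->data);
    free(b);
    return 1;
}

/* Copy up to len bytes out of the buffer; returns 0 if the handle is closed. */
size_t rec_read(const rec_handle_t *h, void *out, size_t len)
{
    if (!h->buf) return 0;
    if (len > h->buf->size) len = h->buf->size;
    memmove(out, h->buf->data, len);
    return len;
}
```

In this sketch, closing the handle that belongs to the leg that hangs up only decrements the count, so a reader on the surviving leg never touches freed memory. Without the counter, the equivalent of `rec_close` on one leg would free `data` while `rec_read` on the other leg could still call `memmove` on it, which matches the shape of the backtrace above. A real multi-threaded implementation would also need to protect the counter with a mutex or atomics, which this sketch omits.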

@briankwest
Collaborator

Update to the latest release. 1.8 is no longer supported; please try again with 1.10.x.

@crienzo crienzo closed this as completed Mar 16, 2021