Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CPU utilization #25

Closed
skpy opened this issue Jun 15, 2016 · 13 comments
Closed

CPU utilization #25

skpy opened this issue Jun 15, 2016 · 13 comments

Comments

@skpy
Copy link

skpy commented Jun 15, 2016

We're running tmate-slave on a RHEL7 server. Load on this system is extremely high, and CPU is pegged in nice.

top - 09:45:40 up 30 days, 22:35,  1 user,  load average: 15.50, 15.51, 15.49
Tasks: 176 total,  17 running, 159 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.2 us,  0.3 sy, 98.5 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  1884384 total,   227100 free,   382580 used,  1274704 buff/cache
KiB Swap:  2097148 total,  2068184 free,    28964 used.  1254276 avail Mem

12163 nobody    21   1  162392   1916    340 R  14.3  0.1   1071:42 tmate-slave
 8998 nobody    21   1  161820   1352    352 R  14.3  0.1   1081:14 tmate-slave
 9125 nobody    21   1  161936   1428    352 R  14.0  0.1   1082:36 tmate-slave
12163 nobody    21   1  162392   1916    340 R  14.0  0.1   1071:53 tmate-slave
14908 nobody    21   1  162532   1768      0 R  14.0  0.1   1072:09 tmate-slave
15130 nobody    21   1  162532   1776      0 R  14.0  0.1   1067:50 tmate-slave
15475 nobody    21   1  162700   2012      0 R  13.6  0.1   1067:42 tmate-slave
32267 nobody    21   1  161936   1392    344 R  13.6  0.1   7681:25 tmate-slave
32258 nobody    21   1  161824   1360    344 R  12.6  0.1   7682:29 tmate-slave
32293 nobody    21   1  161936   1460    352 R  12.6  0.1   7673:53 tmate-slave
 9204 nobody    21   1  161936   1448    352 R  12.3  0.1   1078:23 tmate-slave
 9283 nobody    21   1  161936   1448    352 R  12.3  0.1   1083:18 tmate-slave
10259 nobody    21   1  162064   1624    352 R  12.3  0.1   1083:34 tmate-slave
10664 nobody    21   1  162172   1744    352 R  12.3  0.1   1075:29 tmate-slave
12819 nobody    21   1  162408   1924    340 R  12.3  0.1   1073:48 tmate-slave
15313 nobody    21   1  162700   2012      0 R  12.0  0.1   1072:00 tmate-slave

Is this normal behavior? Do we need to allocate more CPU to tmate-slave hosts?

@nviennot
Copy link
Member

definitely not normal, can you get a core dump? or stack trace? strace?

you can tmate me in if you'd like me to help on the machine

@skpy
Copy link
Author

skpy commented Jun 16, 2016

$ ps aux | grep tmate | grep -v grep
nobody    8998 13.0  0.0 161820  1352 ?        RN   Jun09 1173:04 tmate-slave [uK2ra5DcEFe51bfDzi2HPwUvn] (daemon) 10.1.73.216
nobody    9107  0.0  0.0 161936  1440 ?        SN   Jun09   4:54 tmate-slave [uK2ra5DcEFe51bfDzi2HPwUvn] (daemon) 10.1.73.216
nobody    9125 13.1  0.0 161936  1428 ?        RN   Jun09 1175:05 tmate-slave [uK2ra5DcEFe51bfDzi2HPwUvn] (daemon) 10.1.73.216
nobody    9204 13.0  0.0 161936  1448 ?        RN   Jun09 1171:56 tmate-slave [uK2ra5DcEFe51bfDzi2HPwUvn] (daemon) 10.1.73.216
nobody    9207  0.0  0.0 161936  1304 ?        SN   Jun09   4:39 tmate-slave [uK2ra5DcEFe51bfDzi2HPwUvn] (daemon) 10.1.73.216
nobody    9283 13.1  0.0 161936  1448 ?        RN   Jun09 1176:43 tmate-slave [uK2ra5DcEFe51bfDzi2HPwUvn] (daemon) 10.1.73.216
nobody    9922  0.0  0.0 162064  1560 ?        SN   Jun09   4:42 tmate-slave [uK2ra5DcEFe51bfDzi2HPwUvn] (daemon) 10.1.73.216
nobody   10259 13.1  0.0 162064  1624 ?        RN   Jun09 1175:41 tmate-slave [uK2ra5DcEFe51bfDzi2HPwUvn] (daemon) 10.1.73.216
nobody   10664 13.0  0.0 162172  1744 ?        RN   Jun09 1167:32 tmate-slave [uK2ra5DcEFe51bfDzi2HPwUvn] (daemon) 10.1.73.216
nobody   12163 13.1  0.1 162392  1916 ?        RN   Jun09 1166:13 tmate-slave [uK2ra5DcEFe51bfDzi2HPwUvn] (daemon) 10.1.73.216
nobody   12819 13.1  0.1 162408  1924 ?        RN   Jun09 1166:08 tmate-slave [uK2ra5DcEFe51bfDzi2HPwUvn] (daemon) 10.1.73.216
nobody   14908 13.1  0.0 162532  1768 ?        RN   Jun09 1167:03 tmate-slave [uK2ra5DcEFe51bfDzi2HPwUvn] (daemon) 10.1.73.216
nobody   15130 13.1  0.0 162532  1776 ?        RN   Jun09 1161:41 tmate-slave [uK2ra5DcEFe51bfDzi2HPwUvn] (daemon) 10.1.73.216
nobody   15313 13.2  0.1 162700  2012 ?        RN   Jun09 1166:40 tmate-slave [uK2ra5DcEFe51bfDzi2HPwUvn] (daemon) 10.1.73.216
root     15355  0.0  0.1 161336  2892 ?        Ss   May16   0:00 /usr/bin/tmate-slave -k /etc/tmate-slave/keys -p 22000
nobody   15475 13.1  0.1 162700  2012 ?        RN   Jun09 1160:18 tmate-slave [uK2ra5DcEFe51bfDzi2HPwUvn] (daemon) 10.1.73.216
nobody   32258 40.7  0.0 161824  1360 ?        RN   Jun02 7777:39 tmate-slave [Nnh9mflyXfAvrgPq2vPeqGBHl] (daemon) 10.1.74.31
nobody   32267 40.7  0.0 161936  1392 ?        RN   Jun02 7776:24 tmate-slave [Nnh9mflyXfAvrgPq2vPeqGBHl] (daemon) 10.1.74.31
nobody   32286  0.0  0.0 161936  1308 ?        SN   Jun02   8:36 tmate-slave [Nnh9mflyXfAvrgPq2vPeqGBHl] (daemon) 10.1.74.31
nobody   32293 40.6  0.0 161936  1460 ?        RN   Jun02 7768:08 tmate-slave [Nnh9mflyXfAvrgPq2vPeqGBHl] (daemon) 10.1.74.31
nobody   32607  0.0  0.0 162032  1360 ?        SN   Jun02   7:19 tmate-slave [Nnh9mflyXfAvrgPq2vPeqGBHl] (daemon) 10.1.74.31
$ sudo strace -f -p 8998 -p 9107 -p 9125 -p 9204 -p 9207 -p 9283 -p 9922 -p 10259 -p 10664 -p 12163 -p 12819 -p 14908 -p 15130 -p 15313 -p 15355 -p 15475 -p 32258 -p 32267 -p 32286 -p 32293 -p 32607
Process 8998 attached
Process 9107 attached
Process 9125 attached
Process 9204 attached
Process 9207 attached
Process 9283 attached
Process 9922 attached
Process 10259 attached
Process 10664 attached
Process 12163 attached
Process 12819 attached
Process 14908 attached
Process 15130 attached
Process 15313 attached
Process 15355 attached
Process 15475 attached
Process 32258 attached
Process 32267 attached
Process 32286 attached
Process 32293 attached
Process 32607 attached
[pid 15355] accept(5,  <unfinished ...>
[pid 32607] futex(0x7f42c694d760, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 32286] futex(0x7f42c694d760, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  9922] futex(0x7f42c694d760, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  9107] futex(0x7f42c694d760, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  9207] futex(0x7f42c694d760, FUTEX_WAIT_PRIVATE, 2, NULL

Is there an easy way to determine stale or hung tmate-slave services, such that we can know which can be safely terminated and which might be in use by a client?

@nviennot
Copy link
Member

I really don't understand your ps aux: you cannot have multiple daemon having the same session token (and tmate is single threaded).

to see what are the active sessions, you could netstat -tanpu | grep tmate and see what are processes have active connections

@skpy
Copy link
Author

skpy commented Jun 16, 2016

sudo netstat -tanpu | grep tmate
tcp        0      0 0.0.0.0:22000           0.0.0.0:*               LISTEN      15355/tmate-slave

I can fully shutdown the daemon and forcibly terminate any remaining tomorrow, after communicating the shutdown to everyone who might be using it. If after we restart tmate-slave we see the same situation, what do you suggest we try next?

@nviennot
Copy link
Member

give me the stack traces (or let me in that box through tmate so I can debug) (gdb -p , etc.)

@skpy
Copy link
Author

skpy commented Jun 16, 2016

Private box inside the firewall. I can certainly ask tomorrow, but I wouldn't expect to see permission granted for you to connect.

$ sudo gdb -p 15355
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 15355
Reading symbols from /usr/bin/tmate-slave...Reading symbols from /usr/bin/tmate-slave...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Reading symbols from /lib64/libutil.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libutil.so.1
Reading symbols from /lib64/libssh.so.4...Reading symbols from /lib64/libssh.so.4...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libssh.so.4
Reading symbols from /lib64/libmsgpack.so.4...Reading symbols from /lib64/libmsgpack.so.4...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libmsgpack.so.4
Reading symbols from /lib64/libncurses.so.5...Reading symbols from /lib64/libncurses.so.5...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libncurses.so.5
Reading symbols from /lib64/libtinfo.so.5...Reading symbols from /lib64/libtinfo.so.5...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libtinfo.so.5
Reading symbols from /lib64/libevent-2.0.so.5...Reading symbols from /lib64/libevent-2.0.so.5...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libevent-2.0.so.5
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libcrypto.so.10...Reading symbols from /lib64/libcrypto.so.10...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libcrypto.so.10
Reading symbols from /lib64/libz.so.1...Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /lib64/libgssapi_krb5.so.2...Reading symbols from /lib64/libgssapi_krb5.so.2...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libgssapi_krb5.so.2
Reading symbols from /lib64/libkrb5.so.3...bReading symbols from /lib64/libkrb5.so.3...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libkrb5.so.3
Reading symbols from /lib64/libk5crypto.so.3...Reading symbols from /lib64/libk5crypto.so.3...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libk5crypto.so.3
Reading symbols from /lib64/libcom_err.so.2...Reading symbols from /lib64/libcom_err.so.2...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libcom_err.so.2
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libkrb5support.so.0...Reading symbols from /lib64/libkrb5support.so.0...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libkrb5support.so.0
Reading symbols from /lib64/libkeyutils.so.1...Reading symbols from /lib64/libkeyutils.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libkeyutils.so.1
Reading symbols from /lib64/libselinux.so.1...Reading symbols from /lib64/libselinux.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libselinux.so.1
Reading symbols from /lib64/libpcre.so.1...Reading symbols from /lib64/libpcre.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libpcre.so.1
Reading symbols from /lib64/liblzma.so.5...Reading symbols from /lib64/liblzma.so.5...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/liblzma.so.5
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
0x00007f42c6689980 in __accept_nocancel () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install tmate-slave-20160426-1.el7.x86_64
(gdb) bt
#0  0x00007f42c6689980 in __accept_nocancel () from /lib64/libc.so.6
#1  0x00007f42c7448563 in ssh_bind_accept () from /lib64/libssh.so.4
#2  0x000000000044d7bd in tmate_ssh_server_main ()
#3  0x0000000000417a24 in main ()

Apologies if I'm not giving you all you need: I'm a sysadmin, not a developer. gdb is new and scary to me.

@nviennot
Copy link
Member

nviennot commented Jun 16, 2016

this process is fine (it's the listener) how about how of these process that is stuck?

@skpy
Copy link
Author

skpy commented Jun 16, 2016

Here's the backtrace output from a rather random selection of PIDs:
PID 8998

#0  0x00007f42c6618aa0 in strdup () from /lib64/libc.so.6
#1  0x000000000045e439 in xstrdup ()
#2  0x000000000042af55 in environ_push ()
#3  0x0000000000433af1 in job_run ()
#4  0x000000000042d6ba in format_job_get ()
#5  0x000000000042d51d in format_expand ()
#6  0x000000000042de6f in format_expand_time ()
#7  0x0000000000443db2 in status_replace ()
#8  0x000000000044421c in status_redraw ()
#9  0x000000000043f2ca in server_client_check_redraw ()
#10 0x000000000043f30a in server_client_loop ()
#11 0x0000000000440d49 in server_loop ()
#12 0x000000000043a12a in proc_loop ()
#13 0x00000000004412c2 in server_start ()
#14 0x000000000044c557 in tmate_spawn_slave ()
#15 0x000000000044da0a in tmate_ssh_server_main ()
#16 0x0000000000417a24 in main ()

PID 9107:

#0  0x00007f42c66966ac in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007f42c66151c7 in _L_lock_14687 () from /lib64/libc.so.6
#2  0x00007f42c66122c3 in malloc () from /lib64/libc.so.6
#3  0x00007f42c669ce0e in backtrace_symbols () from /lib64/libc.so.6
#4  0x0000000000446ed5 in tmate_print_stack_trace ()
#5  0x000000000044d2ca in handle_sigsegv ()
#6  <signal handler called>
#7  0x00007f42c660eb0e in _int_free () from /lib64/libc.so.6
#8  0x000000000045d6eb in winlink_remove ()
#9  0x0000000000442df7 in session_destroy ()
#10 0x000000000044103e in server_send_exit ()
#11 <signal handler called>
#12 0x00007f42c65cab96 in unsetenv () from /lib64/libc.so.6
#13 0x000000000042af96 in environ_push ()
#14 0x0000000000433af1 in job_run ()
#15 0x000000000042d6ba in format_job_get ()
#16 0x000000000042d51d in format_expand ()
#17 0x000000000042de6f in format_expand_time ()
#18 0x0000000000443db2 in status_replace ()
#19 0x000000000044421c in status_redraw ()
#20 0x000000000043f2ca in server_client_check_redraw ()
#21 0x000000000043f30a in server_client_loop ()
#22 0x0000000000440d49 in server_loop ()
#23 0x000000000043a12a in proc_loop ()
#24 0x00000000004412c2 in server_start ()
#25 0x000000000044c557 in tmate_spawn_slave ()
#26 0x000000000044da0a in tmate_ssh_server_main ()
#27 0x0000000000417a24 in main ()

PID 9125:

#0  0x00007f42c661a7b4 in __strncmp_sse2 () from /lib64/libc.so.6
#1  0x00007f42c65cab60 in unsetenv () from /lib64/libc.so.6
#2  0x000000000042af96 in environ_push ()
#3  0x0000000000433af1 in job_run ()
#4  0x000000000042d6ba in format_job_get ()
#5  0x000000000042d51d in format_expand ()
#6  0x000000000042de6f in format_expand_time ()
#7  0x0000000000443db2 in status_replace ()
#8  0x000000000044421c in status_redraw ()
#9  0x000000000043f2ca in server_client_check_redraw ()
#10 0x000000000043f30a in server_client_loop ()
#11 0x0000000000440d49 in server_loop ()
#12 0x000000000043a12a in proc_loop ()
#13 0x00000000004412c2 in server_start ()
#14 0x000000000044c557 in tmate_spawn_slave ()
#15 0x000000000044da0a in tmate_ssh_server_main ()
#16 0x0000000000417a24 in main ()

PID 9283:

#0  0x00007f42c65caba8 in unsetenv () from /lib64/libc.so.6
#1  0x000000000042af96 in environ_push ()
#2  0x0000000000433af1 in job_run ()
#3  0x000000000042d6ba in format_job_get ()
#4  0x000000000042d51d in format_expand ()
#5  0x000000000042de6f in format_expand_time ()
#6  0x0000000000443db2 in status_replace ()
#7  0x000000000044421c in status_redraw ()
#8  0x000000000043f2ca in server_client_check_redraw ()
#9  0x000000000043f30a in server_client_loop ()
#10 0x0000000000440d49 in server_loop ()
#11 0x000000000043a12a in proc_loop ()
#12 0x00000000004412c2 in server_start ()
#13 0x000000000044c557 in tmate_spawn_slave ()
#14 0x000000000044da0a in tmate_ssh_server_main ()
#15 0x0000000000417a24 in main ()

PID 32258:

#0  0x00007f42c6618e2f in __strlen_sse2 () from /lib64/libc.so.6
#1  0x00007f42c6618aae in strdup () from /lib64/libc.so.6
#2  0x000000000045e439 in xstrdup ()
#3  0x000000000042af55 in environ_push ()
#4  0x0000000000433af1 in job_run ()
#5  0x000000000042d6ba in format_job_get ()
#6  0x000000000042d51d in format_expand ()
#7  0x000000000042de6f in format_expand_time ()
#8  0x0000000000443db2 in status_replace ()
#9  0x000000000044421c in status_redraw ()
#10 0x000000000043f2ca in server_client_check_redraw ()
#11 0x000000000043f30a in server_client_loop ()
#12 0x0000000000440d49 in server_loop ()
#13 0x000000000043a12a in proc_loop ()
#14 0x00000000004412c2 in server_start ()
#15 0x000000000044c557 in tmate_spawn_slave ()
#16 0x000000000044da0a in tmate_ssh_server_main ()
#17 0x0000000000417a24 in main ()

PID 32607:

#0  0x00007f42c66966ac in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007f42c66152cc in _L_lock_16193 () from /lib64/libc.so.6
#2  0x00007f42c6612c89 in calloc () from /lib64/libc.so.6
#3  0x000000000045f3b8 in ibuf_open ()
#4  0x000000000045f40e in ibuf_dynamic ()
#5  0x000000000045efb5 in imsg_create ()
#6  0x000000000045f06e in imsg_compose ()
#7  0x0000000000439de5 in proc_send ()
#8  0x0000000000440ffb in server_send_exit ()
#9  <signal handler called>
#10 0x00007f42c6612264 in malloc () from /lib64/libc.so.6
#11 0x00007f42c6618aba in strdup () from /lib64/libc.so.6
#12 0x000000000045e439 in xstrdup ()
#13 0x000000000042af55 in environ_push ()
#14 0x0000000000433af1 in job_run ()
#15 0x000000000042d6ba in format_job_get ()
#16 0x000000000042d51d in format_expand ()
#17 0x000000000042de6f in format_expand_time ()
#18 0x0000000000443db2 in status_replace ()
#19 0x000000000044421c in status_redraw ()
#20 0x000000000043f2ca in server_client_check_redraw ()
#21 0x000000000043f30a in server_client_loop ()
#22 0x0000000000440d49 in server_loop ()
#23 0x000000000043a12a in proc_loop ()
#24 0x00000000004412c2 in server_start ()
#25 0x000000000044c557 in tmate_spawn_slave ()
#26 0x000000000044da0a in tmate_ssh_server_main ()
#27 0x0000000000417a24 in main ()

@nviennot
Copy link
Member

So there's two problems:

  1. tmate-slave should not be running jobs (e.g. running "uptime" to expand a status line), let me see how to fix this.
  2. the signal handler deadlocks with malloc(), this is a little harder to deal with, but low priority.

@nviennot
Copy link
Member

Try that

@nviennot
Copy link
Member

nviennot commented Jun 16, 2016

FYI: upgrading the tmate-slave daemon, and restarting it will not affect any current running sessions. The new connections will be served by the restarted daemon, while the other connections stay on the old code.

@skpy
Copy link
Author

skpy commented Jun 16, 2016

Thanks! Just to be safe, we shutdown all the existing tmate-slave services. We built and installed a new RPM from the latest code. Things look okay right now. If it goes south again, we'll open a new ticket.

👍

@nviennot
Copy link
Member

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants