Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tmux freezes when using it or trying to reattach #1352

Closed
qyx268 opened this issue May 21, 2018 · 19 comments
Closed

Tmux freezes when using it or trying to reattach #1352

qyx268 opened this issue May 21, 2018 · 19 comments

Comments

@qyx268
Copy link

qyx268 commented May 21, 2018

  • Tmux version: commit at 9fd9952
  • Platform: Linux farnarkle1 3.10.0-693.21.1.el7.x86_64 Fix typo: teminal -> terminal #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • $TERM: xterm-256color

Tmux freezes when using it or trying to reattach.
I just started using this new HPC facilitate recently. It happened to me frequently that tmux froze, about once or twice a week. When I was using tmux, it suddenly froze; or when I tried to reattach sessions the other day, it froze. However, the programs hosted by tmux are still running. Here is a screenshot of my top. You could see that the server has been running with ~100% CPU for a very long time, several clients have been created because I attempted to reattach the sessions, and some programs (e.g. PID=86339, 250019) are still running.

screen shot 2018-05-21 at 17 16 21

Also when this is happening, I couldn't start a new tmux session either by simply type tmux. And here is what I meant by freezing.
screen shot 2018-05-21 at 17 23 29

Here is a log file created by tmux -vvv:
screen shot 2018-05-21 at 17 20 22

@qyx268
Copy link
Author

qyx268 commented May 21, 2018

Dear tmux develop team,

Thanks for helping me with this issue. I did find some closed issues that are related to "tmux freezing". But most of them seem to be different from what I'm facing. If the issue I just opened has been discussed before, could you please foward the number to me? Thanks!

Unfortunately that this problem is beyond what our HPC admin could help. But they suspected that there could be a systemd /tmp timer that runs and wipes out old things it thinks are relevant any more. They also suggested that it might be related to the Linux version and told me that centos7 is very different to 6. primarily because of systemd getting its muddy claws in there and controlling everything and its pet pony. By the way, I'm not sure if this is helpful, they suggested me to try screen instead, which, from their experience, seems to be more stable on centos7.

Kind regards,
Yuxiang

@nicm
Copy link
Member

nicm commented May 21, 2018

systemd removing the socket would not cause tmux to freeze, you would not be able to attach at all.

Can you please attach both client and server log files (kill tmux entirely with "pkill tmux" then start it with -vvv and reproduce) to this issue? Screenshots of logs are not very useful.

@qyx268
Copy link
Author

qyx268 commented May 21, 2018

It might take a while for the problem to reappear. I will update this issue once it happens again. Here is my config file if it's helpful.
tmux.txt

@nicm
Copy link
Member

nicm commented May 21, 2018 via email

@nicm
Copy link
Member

nicm commented Jun 1, 2018

Is this still happening?

@kbeldan
Copy link

kbeldan commented Jun 1, 2018

Hi,

Same symptoms here. Here is where the problem occurs:

tmux/grid.c

Lines 1009 to 1025 in 54cd292

for (;;) {
/*
* If this is now the last line, there is nothing more to be
* done.
*/
if (yy + lines == gd->hsize + gd->sy)
break;
line = yy + 1 + lines;
/* If the next line is empty, skip it. */
if (~gd->linedata[line].flags & GRID_LINE_WRAPPED)
wrapped = 0;
if (gd->linedata[line].cellused == 0) {
if (!wrapped)
break;
continue;
}

When not looping indefinitely consuming 100% cpu and preventing any communication with the server (continue with no lines increment), because the last line may reach gd->hsize + gd->sy (hence be off-by-one out-of-bounds of gd->linedata) when exiting, the same code pitfall also sometimes leads to server crash instead and may be the issue lurking in #1313, which I guess should also be fixed by addressing said pitfall.

@nicm
Copy link
Member

nicm commented Jun 1, 2018

Ah yes, well spotted. This might be the fix:

--- grid.c      18 Apr 2018 14:31:42 -0000      1.81
+++ grid.c      1 Jun 2018 16:30:23 -0000
@@ -1011,7 +1011,7 @@ grid_reflow_join(struct grid *target, st
                 * If this is now the last line, there is nothing more to be
                 * done.
                 */
-               if (yy + lines == gd->hsize + gd->sy)
+               if (yy + 1 + lines == gd->hsize + gd->sy)
                        break;
                line = yy + 1 + lines;

@@ -1021,6 +1021,7 @@ grid_reflow_join(struct grid *target, st
                if (gd->linedata[line].cellused == 0) {
                        if (!wrapped)
                                break;
+                       lines++;
                        continue;
                }

@kbeldan
Copy link

kbeldan commented Jun 1, 2018

FYI, I have been using the same modification on my side for some days and haven't seen a single hang or server crash since.

@qyx268
Copy link
Author

qyx268 commented Jun 2, 2018

Sorry for not following this up. There was a system break down on the cluster, and my tmux sessions were terminated several days ago. I had to restart the sessions to see if I could reproduce the problem. But indeed the issue hasn't shown up for more than a week this time. I would make updates as soon as the problem occurs and then try the modification. Thanks!

@nicm
Copy link
Member

nicm commented Jun 3, 2018

I have applied this change now to OpenBSD, it will be GitHub later. Please let me know if you continue to see problems with master.

@nicm nicm closed this as completed Jun 4, 2018
@qyx268
Copy link
Author

qyx268 commented Jun 8, 2018

I've targeted the reason that my tmux sessions were freezing. It had nothing to do with the tmux code. It was due to significantly slow down on our clusters when some team was transferring a huge dataset to the Lustre filesystem. I now move tmux from Lustre to my home directory, and the issue hasn't happened to me again.

Sorry for making this confusion.

@kbeldan
Copy link

kbeldan commented Jun 8, 2018

@qyx268 You wouldn't have had 100% CPU if that was the case.

@qyx268
Copy link
Author

qyx268 commented Aug 7, 2018

Dear all,

I finally managed to create this issue with a log file stored. This time I can be sure that our cluster is running smoothly.

Since the log file is too big, I've attached the last ~160000 lines, which has a lot of duplications as indicated. I'm using the tmux provided by the cluster, hence I haven't included the fix mentioned before by @nicm . I will include it now and try again to see if the issue can be solved.

Could someone help me follow this up? Hopefully, the log can be helpful this time. Thanks!

Here is some information you might need:

  • Tmux version: 2.7

  • $TERM: xterm-256color

  • Platform:
    NAME="CentOS Linux"
    VERSION="7 (Core)"
    ID="centos"
    ID_LIKE="rhel fedora"
    VERSION_ID="7"
    PRETTY_NAME="CentOS Linux 7 (Core)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:centos:centos:7"
    HOME_URL="https://www.centos.org/"
    BUG_REPORT_URL="https://bugs.centos.org/"
    CENTOS_MANTISBT_PROJECT="CentOS-7"
    CENTOS_MANTISBT_PROJECT_VERSION="7"
    REDHAT_SUPPORT_PRODUCT="centos"
    REDHAT_SUPPORT_PRODUCT_VERSION="7"

  • logs:
    1533608531.373322 screen_write_linefeed: at 0,11 (region 0-16)
    1533608531.373330 input_csi_dispatch: 'm' "" ""
    1533608531.373338 input_c0_dispatch: '\017'
    1533608531.373347 screen_write_collect_end: 84 292607 yqin 20 0 476624 61844 524 S 0.3 0.0 0:23.57 jupyter-noteb+ (at 0,12)
    1533608531.373356 input_csi_dispatch: 'm' "" ""
    1533608531.373363 input_c0_dispatch: '\017'
    1533608531.373371 input_csi_dispatch: 'm' "" "39;49"
    1533608531.373379 parameter 0: number 39
    1533608531.373386 parameter 1: number 49
    1533608531.373394 input_csi_dispatch: 'K' "" ""
    1533608531.373401 input_c0_dispatch: '\r'
    1533608531.373408 input_c0_dispatch: '\n'
    1533608531.373416 screen_write_linefeed: at 0,12 (region 0-16)
    1533608531.373424 input_csi_dispatch: 'm' "" ""
    1533608531.373431 input_c0_dispatch: '\017'
    1533608531.373441 screen_write_collect_end: 84 430752 yqin 20 0 179400 1744 252 S 0.3 0.0 0:00.29 sshd (at 0,13)
    1533608531.373450 input_csi_dispatch: 'm' "" ""
    1533608531.373457 input_c0_dispatch: '\017'
    1533608531.373465 input_csi_dispatch: 'm' "" "39;49"
    1533608531.373473 parameter 0: number 39
    1533608531.373480 parameter 1: number 49
    1533608531.373488 input_csi_dispatch: 'K' "" ""
    1533608531.373495 input_c0_dispatch: '\r'
    1533608531.373503 input_c0_dispatch: '\n'
    1533608531.373510 screen_write_linefeed: at 0,13 (region 0-16)
    1533608531.373518 input_csi_dispatch: 'm' "" ""
    1533608531.373526 input_c0_dispatch: '\017'
    1533608531.373535 screen_write_collect_end: 84 21230 yqin 20 0 127432 2096 8 S 0.0 0.0 0:00.02 bash (at 0,14)
    1533608531.373544 input_csi_dispatch: 'm' "" ""
    1533608531.373552 input_c0_dispatch: '\017'
    1533608531.373560 input_csi_dispatch: 'm' "" "39;49"
    1533608531.373567 parameter 0: number 39
    1533608531.373574 parameter 1: number 49
    1533608531.373582 input_csi_dispatch: 'K' "" ""
    1533608531.373589 input_c0_dispatch: '\r'
    1533608531.373597 input_c0_dispatch: '\n'
    1533608531.373604 screen_write_linefeed: at 0,14 (region 0-16)
    1533608531.373612 input_csi_dispatch: 'm' "" ""
    1533608531.373619 input_c0_dispatch: '\017'
    1533608531.373629 screen_write_collect_end: 84 21272 yqin 20 0 127344 2012 8 S 0.0 0.0 0:00.02 bash (at 0,15)
    1533608531.373638 input_csi_dispatch: 'm' "" ""
    1533608531.373645 input_c0_dispatch: '\017'
    1533608531.373653 input_csi_dispatch: 'm' "" "39;49"
    1533608531.373661 parameter 0: number 39
    1533608531.373668 parameter 1: number 49
    1533608531.373675 input_csi_dispatch: 'K' "" ""
    1533608531.373683 input_c0_dispatch: '\r'
    1533608531.373690 input_c0_dispatch: '\n'
    1533608531.373698 screen_write_linefeed: at 0,15 (region 0-16)
    1533608531.373706 input_csi_dispatch: 'm' "" ""
    1533608531.373718 input_c0_dispatch: '\017'
    1533608531.373728 screen_write_collect_end: 84 21321 yqin 20 0 127432 2096 8 S 0.0 0.0 0:00.02 bash (at 0,16)
    1533608531.373737 input_csi_dispatch: 'm' "" ""
    1533608531.373744 input_c0_dispatch: '\017'
    1533608531.373752 input_csi_dispatch: 'm' "" "39;49"
    1533608531.373760 parameter 0: number 39
    1533608531.373767 parameter 1: number 49
    1533608531.373775 input_csi_dispatch: 'K' "" ""
    1533608531.373783 screen_write_collect_flush: flushed 19 items (915 bytes)
    1533608531.373791 screen_write_stop: 915 cells (915 written, 0 skipped)
    1533608531.373800 @4 alerts check, alerts 0x2
    1533608531.373807 window_remove_ref: @4 alerts_callback, now 1
    1533608531.373815 redraw timer fired
    1533608531.373822 cmdq_next : empty
    1533608531.373830 cmdq_next <0x19ccdb0>: empty
    1533608531.373837 cmdq_next <0xdd8c80>: empty
    1533608531.373845 /dev/pts/120: redraw deferred (1402 left)
    1533608531.373852 redraw timer started
    1533608531.374972 redraw timer fired
    1533608531.375015 cmdq_next : empty
    1533608531.375043 cmdq_next <0x19ccdb0>: empty
    1533608531.375070 cmdq_next <0xdd8c80>: empty
    1533608531.375097 /dev/pts/120: redraw deferred (1402 left)
    1533608531.375123 redraw timer started
    1533608531.376261 redraw timer fired
    1533608531.376292 cmdq_next : empty
    1533608531.376320 cmdq_next <0x19ccdb0>: empty
    1533608531.376347 cmdq_next <0xdd8c80>: empty
    1533608531.376374 /dev/pts/120: redraw deferred (1402 left)
    1533608531.376400 redraw timer started
    1533608531.377537 redraw timer fired
    1533608531.377568 cmdq_next : empty
    1533608531.377596 cmdq_next <0x19ccdb0>: empty
    1533608531.377623 cmdq_next <0xdd8c80>: empty
    1533608531.377650 /dev/pts/120: redraw deferred (1402 left)
    ... [DUPLICATION]
    1533608550.280641 cmdq_next <0x19ccdb0>: empty
    1533608550.280668 cmdq_next <0xdd8c80>: empty
    1533608550.280701 /dev/pts/120: redraw deferred (1570 left)
    1533608550.280708 redraw timer started
    1533608550.281811 redraw timer fired
    1533608550.281820 cmdq_next : empty
    1533608550.281828 cmdq_next <0x19ccdb0>: empty
    1533608550.281836 cmdq_next <0xdd8c80>: empty
    1533608550.281843 /dev/pts/120: redraw deferred (1570 left)
    1533608550.281851 redraw timer started
    1533608550.282987 redraw timer fired
    1533608550.283022 cmdq_next : empty
    1533608550.283051 cmdq_next <0x19ccdb0>: empty
    1533608550.283078 cmdq_next <0xdd8c80>: empty
    1533608550.283105 /dev/pts/120: redraw deferred (1570 left)
    1533608550.283132 redraw timer started
    1533608550.284270 redraw timer fired
    1533608550.284301 cmdq_next : empty
    1533608550.284329 cmdq_next <0x19ccdb0>: empty
    1533608550.284356 cmdq_next <0xdd8c80>: empty
    1533608550.284384 /dev/pts/120: redraw deferred (1570 left)
    1533608550.284410 redraw timer started
    1533608550.285547 redraw timer fired
    1533608550.285578 cmdq_next : empty
    1533608550.285606 cmdq_next <0x19ccdb0>: empty
    1533608550.285633 cmdq_next <0xdd8c80>: empty
    1533608550.285661 /dev/pts/120: redraw deferred (1570 left)
    1533608550.285687 redraw timer started
    1533608550.286838 redraw timer fired
    1533608550.286848 cmdq_next : empty
    1533608550.286857 cmdq_next <0x19ccdb0>: empty
    1533608550.286865 cmdq_next <0xdd8c80>: empty
    1533608550.286874 /dev/pts/120: redraw deferred (1570 left)
    1533608550.286881 redraw timer started
    1533608550.288018 redraw timer fired
    1533608550.288054 cmdq_next : empty
    1533608550.288082 cmdq_next <0x19ccdb0>: empty
    1533608550.288109 cmdq_next <0xdd8c80>: empty
    1533608550.288136 /dev/pts/120: redraw deferred (1570 left)
    1533608550.288163 redraw timer started
    1533608550.289304 redraw timer fired
    1533608550.289314 cmdq_next : empty
    1533608550.289322 cmdq_next <0x19ccdb0>: empty
    1533608550.289330 cmdq_next <0xdd8c80>: empty
    1533608550.289338 /dev/pts/120: redraw deferred (1570 left)
    1533608550.289346 redraw timer started
    1533608550.290462 redraw timer fired
    1533608550.290493 cmdq_next : empty
    1533608550.290520 cmdq_next <0x19ccdb0>: empty
    1533608550.290548 cmdq_next <0xdd8c80>: empty
    1533608550.290575 /dev/pts/120: redraw deferred (1570 left)
    1533608550.290602 redraw timer started
    1533608550.291749 redraw timer fired
    1533608550.291784 cmdq_next : empty
    1533608550.291812 cmdq_next <0x19ccdb0>: empty
    1533608550.291841 cmdq_next <0xdd8c80>: empty
    1533608550.291849 /dev/pts/120: redraw deferred (1570 left)
    1533608550.291857 redraw timer started
    1533608550.292994 redraw timer fired
    1533608550.293030 cmdq_next : empty
    1533608550.293058 cmdq_next <0x19ccdb0>: empty
    1533608550.293085 cmdq_next <0xdd8c80>: empty
    1533608550.293113 /dev/pts/120: redraw deferred (1570 left)
    1533608550.293139 redraw timer started
    1533608550.294283 redraw timer fired
    1533608550.294314 cmdq_next : empty
    1533608550.294342 cmdq_next <0x19ccdb0>: empty
    1533608550.294369 cmdq_next <0xdd8c80>: empty
    1533608550.294397 /dev/pts/120: redraw deferred (1570 left)
    1533608550.294423 redraw timer started
    1533608550.295565 redraw timer fired
    1533608550.295578 cmdq_next : empty
    1533608550.295586 cmdq_next <0x19ccdb0>: empty
    1533608550.295594 cmdq_next <0xdd8c80>: empty
    1533608550.295603 /dev/pts/120: redraw deferred (1570 left)
    1533608550.295611 redraw timer started
    1533608550.295731 /dev/pts/70: read 1 bytes (already 0)
    1533608550.295743 /dev/pts/70: keys are 1 (z)
    1533608550.295752 /dev/pts/70: next key is 1 (z) (expired=0)
    1533608550.295761 /dev/pts/70: complete key z 0x7a
    1533608550.295771 session DRAGONS activity 1533608550.295770 (last 1533608549.984858)
    1533608550.295780 cmd_find_from_session: s=$2
    1533608550.295789 cmd_find_from_session: wl=8 1 w=@10 halo_history
    1533608550.295797 cmd_find_from_session: wp=%98
    1533608550.295805 cmd_find_from_session: idx=none
    1533608550.295813 session DRAGONS not pasting
    1533608550.295829 key table prefix (pane %98)
    1533608550.295839 found in key table prefix
    1533608550.295848 cmdq_next : empty
    1533608550.295857 cmdq_next <0x19ccdb0>: empty
    1533608550.295865 cmdq_next <0xdd8c80>: enter
    1533608550.295874 cmdq_next <0xdd8c80>: command[resize-pane] (0), flags 0
    1533608550.295883 cmd_find_target: target none, type pane
    1533608550.295891 cmd_find_target: item 0x2529390, flags 0
    1533608550.295900 cmd_find_target: current is from queue
    1533608550.295908 cmd_find_target: s=$2
    1533608550.295916 cmd_find_target: wl=8 1 w=@10 halo_history
    1533608550.295924 cmd_find_target: wp=%98
    1533608550.295932 cmd_find_target: idx=none
    1533608550.295943 grid_reflow: 234 lines, new width 118

@nicm
Copy link
Member

nicm commented Aug 7, 2018

This is probably the bug fixed by ba31d3a

@nicm
Copy link
Member

nicm commented Aug 7, 2018

@qyx268 you will you need to rebuild tmux from master or apply this fix to 2.7. Or wait for 2.8.

@qyx268
Copy link
Author

qyx268 commented Aug 7, 2018

@nicm Thanks. I have rebuilt it from master. I will let you know if the issue happens again.

@achalddave
Copy link

achalddave commented Sep 23, 2018

I'm not sure if this is the right place to ask this question, but is there any way to recover a tmux session that seems to be frozen due to this bug? I understand the longer term fix is to update tmux, but if, completely hypothetically, someone were in a bit of a pickle where they've left an important, long-running command in a tmux session when it freezes with what seems to be the same symptoms*, is there a short term fix to save the session?

* i.e. running tmux 2.7, all programs within tmux are still running, the tmux session is frozen, the tmux process is using 100% cpu, and trying to create a new session or attach to the existing session fails

edit: I'm able to create a new tmux session by changing the socket path for now, so I can just leave the existing session running for as long as necessary (...but spinning at 100% cpu) and keep working.

@gerritholl
Copy link

I just ran into this problem using tmux 2.7, openSUSE 15.0.

@lock
Copy link

lock bot commented Feb 14, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Feb 14, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Development

No branches or pull requests

5 participants