sway 1.2 freezing (but not crashing) after monitors power back on #1857

jpeeler · 2019-10-01T19:55:36Z

This problem is not easily reproducible, happens about twice a week or so. After the monitors are put to sleep (either directly swaymsg "output * dpms off" or in this case indirectly via swayidle) occasionally after trying to resume sway freezes. I've checked and sway has not crashed, but has simply become unresponsive:

$ strace -p 3405
trace: Process 3405 attached
futex(0x55a29053b598, FUTEX_WAIT_PRIVATE, 2, NULL

Debug log gist: https://gist.github.com/jpeeler/d0c5a573b611a292bbaf8fb83c490bf5

The screens just remain powered on, but are just black.

wlroots has migrated to gitlab.freedesktop.org. This issue has been moved to:

https://gitlab.freedesktop.org/wlroots/wlroots/-/issues/1857

The text was updated successfully, but these errors were encountered:

Emantor · 2019-10-02T03:14:59Z

Which graphics card are you using? The debug log is truncated and does not contain the full information. Are the last lines truly the last log entries? This could indicate an infinite loop inside the DRM connector scanning code.

dinnymate · 2019-10-02T12:36:34Z

I'm experiencing this too, in 1.2-rc1. I have an intel hd graphics 620. I also have an AMD card if that helps (amdgpu driver).

cpdef · 2019-10-02T12:57:05Z

Me too, sometimes i can fix that unplugging a usb device and replugging it, idk whether it is related.
1.1-rc1-4-g8c8e6a6c

jpeeler · 2019-10-02T13:51:20Z

@Emantor Yes, that is the very last log entry. Sway becomes totally unresponsive.

I have an intel and nvidia card. Here's the output of drm_info: https://gist.github.com/jpeeler/fd4249f443618f0df168966f9a8d20c2.

This will be quite an interesting bug if a USB removal is a workaround. Will try it the next time the error occurs.

Emantor · 2019-10-02T14:39:44Z

Which kernel versions are these? The log indicates a legacy backend which I do not expect for modern intel or amdgpu cards.

jpeeler · 2019-10-02T14:56:07Z

Kernel version: 5.2.16-200.fc30.x86_64. Can you elaborate on what backend you are referring to? Just to clarify, there is no AMD in use here (I assume you meant nouveau).

Emantor · 2019-10-02T16:19:09Z

Inside wlroots the drm backend implements the atomic and legacy interface. The logfile you posted uses the legacy interface to interact with the kernel driver and from the small log snippet you posted its not clear which card does the rendering in your setup. For more info you can dive into the linux gpu driver docs on KMS. Also I just noticed that this is for 1.2, can you test with latest master to try and reproduce the issue?

dinnymate · 2019-10-06T20:53:34Z

Any progress on this? Sway is essentially unusable, i have to reboot every time i close my laptop lid. Can't even switch ttys.

ascent12 · 2019-10-06T22:15:42Z

A stacktrace of the sway process would be helpful. Being stuck on a futex call seems very odd.

jpeeler · 2019-10-14T18:02:40Z

Ok, apparently this issue doesn't occur as often as I thought, more like every 2-3 weeks. Anyway, here's the gdb output that's hopefully sufficient. I'm using the package in Fedora.

Thread 3 (Thread 0x7f002a920700 (LWP 2700)):
#0  0x00007f00390393c5 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
swaywm/sway#1  0x00007f002f89fb4b in util_queue_thread_func () at /usr/lib64/dri/nouveau_dri.so
swaywm/sway#2  0x00007f002f89f75b in impl_thrd_routine () at /usr/lib64/dri/nouveau_dri.so
swaywm/sway#3  0x00007f00390334c0 in start_thread () at /lib64/libpthread.so.0
swaywm/sway#4  0x00007f0039148553 in clone () at /lib64/libc.so.6

Thread 2 (Thread 0x7f00348db700 (LWP 2693)):
#0  0x00007f00390393c5 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
swaywm/sway#1  0x00007f00353ed6fb in util_queue_thread_func () at /usr/lib64/dri/i965_dri.so
swaywm/sway#2  0x00007f00353ed30b in impl_thrd_routine () at /usr/lib64/dri/i965_dri.so
swaywm/sway#3  0x00007f00390334c0 in start_thread () at /lib64/libpthread.so.0
swaywm/sway#4  0x00007f0039148553 in clone () at /lib64/libc.so.6

Thread 1 (Thread 0x7f0035f16ac0 (LWP 2689)):
#0  0x00007f003903c73d in __lll_lock_wait () at /lib64/libpthread.so.0
swaywm/sway#1  0x00007f0039035ce9 in pthread_mutex_lock () at /lib64/libpthread.so.0
swaywm/sway#2  0x00007f00340a4273 in eglMakeCurrent () at /lib64/libEGL_mesa.so.0
swaywm/sway#3  0x00007f0038bf0f30 in  () at /lib64/libEGL.so.1
swaywm/sway#4  0x00007f00392d9f95 in wlr_egl_make_current () at /lib64/libwlroots.so.3
swaywm/sway#5  0x00007f00392dc103 in gles2_texture_destroy () at /lib64/libwlroots.so.3
swaywm/sway#6  0x00007f0038bb5f27 in gbm_bo_destroy () at /lib64/libgbm.so.1
swaywm/sway#7  0x00007f00340b8ed8 in dri2_drm_destroy_surface () at /lib64/libEGL_mesa.so.0
swaywm/sway#8  0x00007f00340b4c76 in dri2_make_current () at /lib64/libEGL_mesa.so.0
swaywm/sway#9  0x00007f00340a436d in eglMakeCurrent () at /lib64/libEGL_mesa.so.0
swaywm/sway#10 0x00007f0038bf0f30 in  () at /lib64/libEGL.so.1
swaywm/sway#11 0x00007f00392d9f95 in wlr_egl_make_current () at /lib64/libwlroots.so.3
swaywm/sway#12 0x00007f00392dc103 in gles2_texture_destroy () at /lib64/libwlroots.so.3
swaywm/sway#13 0x00007f0038bb5f27 in gbm_bo_destroy () at /lib64/libgbm.so.1
swaywm/sway#14 0x00007f00340b8ed8 in dri2_drm_destroy_surface () at /lib64/libEGL_mesa.so.0
swaywm/sway#15 0x00007f00340a8cff in eglDestroySurface () at /lib64/libEGL_mesa.so.0
swaywm/sway#16 0x00007f00392da7c6 in wlr_egl_destroy_surface () at /lib64/libwlroots.so.3
swaywm/sway#17 0x00007f00392cb3f7 in finish_drm_surface () at /lib64/libwlroots.so.3
swaywm/sway#18 0x00007f00392c8457 in dealloc_crtc () at /lib64/libwlroots.so.3
swaywm/sway#19 0x00007f00392c85d9 in drm_connector_cleanup () at /lib64/libwlroots.so.3
swaywm/sway#20 0x00007f00392ca256 in scan_drm_connectors () at /lib64/libwlroots.so.3
swaywm/sway#21 0x00007f0039306cac in wlr_signal_emit_safe () at /lib64/libwlroots.so.3
swaywm/sway#22 0x00007f00392d19a7 in udev_event () at /lib64/libwlroots.so.3
swaywm/sway#23 0x00007f003934c36a in wl_event_loop_dispatch () at /lib64/libwayland-server.so.0
swaywm/sway#24 0x00007f003934af25 in wl_display_run () at /lib64/libwayland-server.so.0
swaywm/sway#25 0x000055714a9c246d in main (argc=1, argv=0x7ffcd0c895e8) at ../sway/main.c:400
        verbose = 0
        debug = 0
        validate = 0
        allow_unsupported_gpu = 0
        long_options = {{name = 0x55714aa084bd "help", has_arg = 0, flag = 0x0, val = 104}, {name = 0x55714aa0b2b9 "config", has_arg = 1, flag = 0x0, val = 99}, {name = 0x55714aa084c2 "validate", has_arg = 0, flag = 0x0, val = 67}, {name = 0x55714aa084cb "debug", has_arg = 0, flag = 0x0, val = 100}, {name = 0x55714aa08421 "version", has_arg = 0, flag = 0x0, val = 118}, {name = 0x55714aa075e3 "verbose", has_arg = 0, flag = 0x0, val = 86}, {name = 0x55714aa084d1 "get-socketpath", has_arg = 0, flag = 0x0, val = 112}, {name = 0x55714aa084e0 "unsupported-gpu", has_arg = 0, flag = 0x0, val = 117}, {name = 0x55714aa084f0 "my-next-gpu-wont-be-nvidia", has_arg = 0, flag = 0x0, val = 117}, {name = 0x0, has_arg = 0, flag = 0x0, val = 0}}
        config_path = 0x0
        usage = 0x55714aa08800 "Usage: sway [options] [command]\n\n  -h, --help", ' ' <repeats 13 times>, "Show help message and quit.\n  -c, --config <config>  Specify a config file.\n  -C, --validate         Check the validity of the config file, th"...
        c = <optimized out>
Detaching from program: /usr/bin/sway, process 2689
[Inferior 1 (process 2689) detached]

As far as reproducing on master, I will try to eventually do that.

dinnymate · 2019-10-15T02:06:28Z

Turns out the issue for me is quite the opposite of what @jpeeler describes. If i set up swayidle to set dpms to off before going to sleep the issue dissapears.

ascent12 · 2019-10-15T10:36:02Z

Transferred to wlroots, because I figured it's not directly a sway issue.

Apparently we're calling eglMakeCurrent within a callback triggered by eglMakeCurrent, which leads to a deadlock inside of Mesa.
I'll have to investigate this more.

emersion · 2019-10-15T10:38:09Z

Can probably fix this by checking if EGL is already current to prevent the infinite loop.

ascent12 · 2019-10-15T10:59:56Z

Probably. I might run it by #dri-devel at some point and see if they think it's a Mesa issue by somewhat leaking their internal state without documenting it.

emersion · 2019-10-22T16:54:04Z

Probably fixed by 6bb7639

jpeeler · 2019-10-30T17:48:12Z

I'm sorry to say that after running with the above mentioned patch the problem is still present.

Thread 3 (Thread 0x7ff5769a3700 (LWP 19438)):
#0  0x00007ff57ca963c5 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#1  0x00007ff577372b4b in util_queue_thread_func () at /usr/lib64/dri/nouveau_dri.so
#2  0x00007ff57737275b in impl_thrd_routine () at /usr/lib64/dri/nouveau_dri.so
#3  0x00007ff57ca904c0 in start_thread () at /lib64/libpthread.so.0
#4  0x00007ff57cba5553 in clone () at /lib64/libc.so.6

Thread 2 (Thread 0x7ff578338700 (LWP 19437)):
#0  0x00007ff57ca963c5 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#1  0x00007ff578e4a6fb in util_queue_thread_func () at /usr/lib64/dri/i965_dri.so
#2  0x00007ff578e4a30b in impl_thrd_routine () at /usr/lib64/dri/i965_dri.so
#3  0x00007ff57ca904c0 in start_thread () at /lib64/libpthread.so.0
#4  0x00007ff57cba5553 in clone () at /lib64/libc.so.6

Thread 1 (Thread 0x7ff579973ac0 (LWP 19435)):
#0  0x00007ff57ca9973d in __lll_lock_wait () at /lib64/libpthread.so.0
#1  0x00007ff57ca92ce9 in pthread_mutex_lock () at /lib64/libpthread.so.0
#2  0x00007ff577b01273 in eglMakeCurrent () at /lib64/libEGL_mesa.so.0
#3  0x00007ff57c64df30 in  () at /lib64/libEGL.so.1
#4  0x00007ff57cd382d5 in wlr_egl_make_current (egl=0x55902f42ea50, surface=surface@entry=0x0, buffer_age=buffer_age@entry=0x0) at ../render/egl.c:335
#5  0x00007ff57cd3a4bd in gles2_texture_destroy (wlr_texture=<optimized out>) at ../render/gles2/texture.c:119
#6  0x00007ff57cd3a4bd in gles2_texture_destroy (wlr_texture=<optimized out>) at ../render/gles2/texture.c:111
#7  0x00007ff57c612f27 in gbm_bo_destroy () at /lib64/libgbm.so.1
#8  0x00007ff577b15ed8 in dri2_drm_destroy_surface () at /lib64/libEGL_mesa.so.0
#9  0x00007ff577b11c76 in dri2_make_current () at /lib64/libEGL_mesa.so.0
#10 0x00007ff577b0136d in eglMakeCurrent () at /lib64/libEGL_mesa.so.0
#11 0x00007ff57c64df30 in  () at /lib64/libEGL.so.1
#12 0x00007ff57cd382d5 in wlr_egl_make_current (egl=0x55902f42ea50, surface=surface@entry=0x0, buffer_age=buffer_age@entry=0x0) at ../render/egl.c:335
#13 0x00007ff57cd3a4bd in gles2_texture_destroy (wlr_texture=<optimized out>) at ../render/gles2/texture.c:119
#14 0x00007ff57cd3a4bd in gles2_texture_destroy (wlr_texture=<optimized out>) at ../render/gles2/texture.c:111
#15 0x00007ff57c612f27 in gbm_bo_destroy () at /lib64/libgbm.so.1
#16 0x00007ff577b15ed8 in dri2_drm_destroy_surface () at /lib64/libEGL_mesa.so.0
#17 0x00007ff577b05cff in eglDestroySurface () at /lib64/libEGL_mesa.so.0
#18 0x00007ff57cd38b06 in wlr_egl_destroy_surface (egl=<optimized out>, surface=<optimized out>) at ../render/egl.c:646
#19 0x00007ff57cd28477 in finish_drm_surface (surf=0x55902f432258) at ../backend/drm/renderer.c:121
#20 0x00007ff57cd28477 in finish_drm_surface (surf=0x55902f432258) at ../backend/drm/renderer.c:109
#21 0x00007ff57cd25457 in dealloc_crtc (conn=conn@entry=0x55902f536d50) at ../backend/drm/drm.c:976
#22 0x00007ff57cd255d9 in drm_connector_cleanup (conn=conn@entry=0x55902f536d50) at ../backend/drm/drm.c:1521
#23 0x00007ff57cd2726d in scan_drm_connectors (drm=<optimized out>) at ../backend/drm/drm.c:1333
#24 0x00007ff57cd650fc in wlr_signal_emit_safe (signal=<optimized out>, data=data@entry=0x55902f2993e0) at ../util/signal.c:29
#25 0x00007ff57cd2ea57 in udev_event (fd=<optimized out>, mask=<optimized out>, data=0x55902f2993e0) at ../backend/session/session.c:52
#26 0x00007ff57cdac36a in wl_event_loop_dispatch () at /lib64/libwayland-server.so.0
#27 0x00007ff57cdaaf25 in wl_display_run () at /lib64/libwayland-server.so.0
#28 0x000055902d91e46d in main (argc=1, argv=0x7fffec401bc8) at ../sway/main.c:400
Detaching from program: /usr/bin/sway, process 19435
[Inferior 1 (process 19435) detached]

jpeeler · 2019-11-13T16:43:56Z

I did not know lock details could be extracted from gdb:
https://en.wikibooks.org/wiki/Linux_Applications_Debugging_Techniques/Deadlocks

It was not obvious to me from previous comments that this is the case of a lock within the same thread attempting to lock again.

Perhaps this is a mesa issue, mtx_init supports both normal and recursive initialization: https://github.com/mesa3d/mesa/blob/e101af8671a13a8eb8ce714e07294b73a99821cd/include/c11/threads_posix.h#L196

And I kid you not, while typing this report I've now seen kitty lock up and the backtrace is showing egl related calls...

[Switching to thread 1 (Thread 0x7f6050fe3ac0 (LWP 2377))]
#0  0x00007f60540f473d in __lll_lock_wait () from /lib64/libpthread.so.0
#2  0x00007f604f172253 in eglMakeCurrent () from /lib64/libEGL_mesa.so.0
#0  0x00007f60540f473d in __lll_lock_wait () at /lib64/libpthread.so.0
#1  0x00007f60540edce9 in pthread_mutex_lock () at /lib64/libpthread.so.0
#2  0x00007f604f172253 in eglMakeCurrent () at /lib64/libEGL_mesa.so.0
#3  0x00007f6053ca8f30 in  () at /lib64/libEGL.so.1
#4  0x00007f60543932d5 in wlr_egl_make_current (egl=0x56104bee4a00, surface=surface@entry=0x0, buffer_age=buffer_age@entry=0x0)
    at ../render/egl.c:335
#5  0x00007f60543954bd in gles2_texture_destroy (wlr_texture=<optimized out>) at ../render/gles2/texture.c:119
#6  0x00007f60543954bd in gles2_texture_destroy (wlr_texture=<optimized out>) at ../render/gles2/texture.c:111
#7  0x00007f6053c6df27 in gbm_bo_destroy () at /lib64/libgbm.so.1
#8  0x00007f604f186cb8 in dri2_drm_destroy_surface () at /lib64/libEGL_mesa.so.0
#9  0x00007f604f182c96 in dri2_make_current () at /lib64/libEGL_mesa.so.0
#10 0x00007f604f17234d in eglMakeCurrent () at /lib64/libEGL_mesa.so.0
#11 0x00007f6053ca8f30 in  () at /lib64/libEGL.so.1
#12 0x00007f60543932d5 in wlr_egl_make_current (egl=0x56104bee4a00, surface=surface@entry=0x0, buffer_age=buffer_age@entry=0x0)
    at ../render/egl.c:335
#13 0x00007f60543954bd in gles2_texture_destroy (wlr_texture=<optimized out>) at ../render/gles2/texture.c:119
#14 0x00007f60543954bd in gles2_texture_destroy (wlr_texture=<optimized out>) at ../render/gles2/texture.c:111
#15 0x00007f6053c6df27 in gbm_bo_destroy () at /lib64/libgbm.so.1
#16 0x00007f604f186cb8 in dri2_drm_destroy_surface () at /lib64/libEGL_mesa.so.0
#17 0x00007f604f176cef in eglDestroySurface () at /lib64/libEGL_mesa.so.0
#18 0x00007f6054393b06 in wlr_egl_destroy_surface (egl=<optimized out>, surface=<optimized out>) at ../render/egl.c:646
#19 0x00007f6054383477 in finish_drm_surface (surf=0x56104bee8208) at ../backend/drm/renderer.c:121
#20 0x00007f6054383477 in finish_drm_surface (surf=0x56104bee8208) at ../backend/drm/renderer.c:109
#21 0x00007f6054380457 in dealloc_crtc (conn=conn@entry=0x56104bff7700) at ../backend/drm/drm.c:976
#22 0x00007f60543805d9 in drm_connector_cleanup (conn=conn@entry=0x56104bff7700) at ../backend/drm/drm.c:1521
#23 0x00007f605438226d in scan_drm_connectors (drm=<optimized out>) at ../backend/drm/drm.c:1333
#24 0x00007f60543c00fc in wlr_signal_emit_safe (signal=<optimized out>, data=data@entry=0x56104bd4f3e0) at ../util/signal.c:29
#25 0x00007f6054389a57 in udev_event (fd=<optimized out>, mask=<optimized out>, data=0x56104bd4f3e0) at ../backend/session/session.c:52
#26 0x00007f605440736a in wl_event_loop_dispatch () at /lib64/libwayland-server.so.0
#27 0x00007f6054405f25 in wl_display_run () at /lib64/libwayland-server.so.0
#28 0x000056104b3db46d in main (argc=1, argv=0x7ffdeb7a3e28) at ../sway/main.c:400
#1  0x00007f60540edce9 in pthread_mutex_lock () from /lib64/libpthread.so.0
rax            0xfffffffffffffe00  -512
rbx            0x0                 0
rcx            0x7f60540f473d      140051703875389
rdx            0x0                 0
rsi            0x80                128
rdi            0x56104bf51c08      94627993820168
rbp            0x56104bee5440      0x56104bee5440
rsp            0x7ffdeb7a3500      0x7ffdeb7a3500
r8             0x56104bf51c08      94627993820168
r9             0x0                 0
r10            0x1                 1
r11            0x202               514
r12            0x0                 0
r13            0x0                 0
r14            0x56104bf51c00      94627993820160
r15            0x56104bf51c08      94627993820168
rip            0x7f60540edce9      0x7f60540edce9 <pthread_mutex_lock+89>
eflags         0x202               [ IF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
The directory where separate debug symbols are searched for is "/usr/lib/debug".
$1 = 2
$2 = 0
$3 = 2377
$4 = 2377
Detaching from program: /usr/bin/sway, process 2377
[Inferior 1 (process 2377) detached]

Apparently the gdb commands aren't captured here and I already rebooted after the last deadlock, but I think everyone gets the idea.

jpeeler · 2019-11-18T18:05:57Z

Is it possible that wlroots actually is an issue here? This is some backtrace output from a stuck kitty:

(gdb) bt
#0  0x00007fae94d393c5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fae85982e73 in wl_display_read_events () from /lib64/libwayland-client.so.0
#2  0x00007fae85983649 in wl_display_dispatch_queue () from /lib64/libwayland-client.so.0
#3  0x00007fae84ee2b1c in dri2_wl_swap_buffers_with_damage () from /lib64/libEGL_mesa.so.0
#4  0x00007fae84ed1ef6 in eglSwapBuffers () from /lib64/libEGL_mesa.so.0
#5  0x00007fae859dabe7 in swapBuffersEGL (window=0x379e420) at glfw/egl_context.c:232
#6  0x00007fae859c92b3 in glfwSwapBuffers (handle=<error reading variable: DW_FORM_strp pointing outside of .debug_str section [in module /usr/lib/debug/usr/lib64/kitty/kitty/fast_data_types.so-0.14.6-12.fc30.x86_64.debug]>) at glfw/context.c:657

ascent12 transferred this issue from swaywm/sway Oct 15, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sway 1.2 freezing (but not crashing) after monitors power back on #1857

sway 1.2 freezing (but not crashing) after monitors power back on #1857

jpeeler commented Oct 1, 2019 •

edited by emersion

Loading

Emantor commented Oct 2, 2019

dinnymate commented Oct 2, 2019

cpdef commented Oct 2, 2019

jpeeler commented Oct 2, 2019

Emantor commented Oct 2, 2019

jpeeler commented Oct 2, 2019

Emantor commented Oct 2, 2019

dinnymate commented Oct 6, 2019

ascent12 commented Oct 6, 2019

jpeeler commented Oct 14, 2019

dinnymate commented Oct 15, 2019 •

edited

Loading

ascent12 commented Oct 15, 2019

emersion commented Oct 15, 2019

ascent12 commented Oct 15, 2019

emersion commented Oct 22, 2019

jpeeler commented Oct 30, 2019

jpeeler commented Nov 13, 2019

jpeeler commented Nov 18, 2019 •

edited

Loading

sway 1.2 freezing (but not crashing) after monitors power back on #1857

sway 1.2 freezing (but not crashing) after monitors power back on #1857

Comments

jpeeler commented Oct 1, 2019 • edited by emersion Loading

Emantor commented Oct 2, 2019

dinnymate commented Oct 2, 2019

cpdef commented Oct 2, 2019

jpeeler commented Oct 2, 2019

Emantor commented Oct 2, 2019

jpeeler commented Oct 2, 2019

Emantor commented Oct 2, 2019

dinnymate commented Oct 6, 2019

ascent12 commented Oct 6, 2019

jpeeler commented Oct 14, 2019

dinnymate commented Oct 15, 2019 • edited Loading

ascent12 commented Oct 15, 2019

emersion commented Oct 15, 2019

ascent12 commented Oct 15, 2019

emersion commented Oct 22, 2019

jpeeler commented Oct 30, 2019

jpeeler commented Nov 13, 2019

jpeeler commented Nov 18, 2019 • edited Loading

jpeeler commented Oct 1, 2019 •

edited by emersion

Loading

dinnymate commented Oct 15, 2019 •

edited

Loading

jpeeler commented Nov 18, 2019 •

edited

Loading