Skip to content
This repository has been archived by the owner on Nov 1, 2021. It is now read-only.

backend/drm: handle unplug event #2575

Merged
merged 4 commits into from
Apr 27, 2021
Merged

Conversation

emersion
Copy link
Member

@emersion emersion commented Dec 27, 2020

This is half of the GPU hotplug work, see #2423.

To test:

# MINOR=1 triggers removal for card1
udevadm trigger --verbose --type=devices --action=remove --subsystem-match=drm --property-match="MINOR=1"

@neon64
Copy link
Contributor

neon64 commented Jan 9, 2021

Is there a good way of debugging why a kernel module is still in use? Unfortunately with this patch after manually sending a remove udev event, and ls /dev/dri/card* showing only /dev/dri/card0 (my iGPU), I get

rmmod: ERROR: Module nouveau is in use

which is weird because it doesn't look like this patch is doing much different compared to my last PR which allowed rmmod fine (although that PR was admittedly against wlroots from several months ago)

@emersion
Copy link
Member Author

emersion commented Jan 9, 2021

lsmod will tell you know a kernel module is still in use. For instance, this indicates drm_kms_helper is still used by amdgpu:

Module                  Size  Used by
drm_kms_helper        274432  1 amdgpu

In my testing, rmmod -f was sometimes necessary because the "Used by" column indicated nothing and lsof /dev/dri/card0 revealed nothing.

@sochotnicky
Copy link

sochotnicky commented Jan 15, 2021

FWIW, I figured I'd see if I can help test this and I hit a segfault when triggering remove.

BT:

#0  xdg_toplevel_handle_set_title (client=<optimized out>, resource=<optimized out>, title=0x56044988513c "root@hostname:~") at ../wlroots-9999/types/xdg_shell/wlr_xdg_toplevel.c:247
        surface = 0x0
        tmp = 0x5604497f89c0 "root@hostname:~"
#1  0x00007f7a0ece7d3d in  () at /usr/lib64/libffi.so.7
#2  0x00007f7a0ece72a4 in  () at /usr/lib64/libffi.so.7
#3  0x00007f7a0f3f093d in  () at /usr/lib64/libwayland-server.so.0
#4  0x00007f7a0f3ecbb7 in  () at /usr/lib64/libwayland-server.so.0
#5  0x00007f7a0f3ee9c2 in wl_event_loop_dispatch () at /usr/lib64/libwayland-server.so.0
#6  0x00007f7a0f3ecda5 in wl_display_run () at /usr/lib64/libwayland-server.so.0
#7  0x00005604477ad1d0 in server_run (server=<optimized out>) at ../sway-9999/sway/server.c:247
#8  0x00005604477a19e6 in main (argc=3, argv=0x7ffe9fd3d378) at ../sway-9999/sway/main.c:431
        verbose = 0
        debug = 1
        validate = 0
        allow_unsupported_gpu = 1
        long_options =
            {{name = 0x5604477f1484 "help", has_arg = 0, flag = 0x0, val = 104}, {name = 0x5604477f4d69 "config", has_arg = 1, flag = 0x0, val = 99}, {name = 0x5604477f1489 "validate", has_arg = 0, flag = 0x0, val = 67}, {name = 0x5604477f1492 "debug", has_arg = 0, flag = 0x0, val = 100}, {name = 0x5604477f13e8 "version", has_arg = 0, flag = 0x0, val = 118}, {name = 0x5604477f0557 "verbose", has_arg = 0, flag = 0x0, val = 86}, {name = 0x5604477f1498 "get-socketpath", has_arg = 0, flag = 0x0, val = 112}, {name = 0x5604477f14a7 "unsupported-gpu", has_arg = 0, flag = 0x0, val = 117}, {name = 0x5604477f14b7 "my-next-gpu-wont-be-nvidia", has_arg = 0, flag = 0x0, val = 117}, {name = 0x0, has_arg = 0, flag = 0x0, val = 0}}
        config_path = 0x0
        usage = 0x5604477f1828 "Usage: sway [options] [command]\n\n  -h, --help", ' ' <repeats 13 times>, "Show help message and quit.\n  -c, --config <config>  Specify a config file.\n  -C, --validate         Check the validity of the config file, th"...
        c = <optimized out>

I ran udevadm trigger in root shell inside alacritty. I have set alacritty so that it updates title depending on command being run. Guessing some racecondition happened.

This is with current git repo HEAD of both wlroots & sway. This is with i915 + nouveau (libglvnd patch from https://gitlab.freedesktop.org/glvnd/libglvnd/-/merge_requests/235 included to make sway start). If needed I can add debuginfos for the libffi etc or other information that would help

EDIT: NVM, PEBKAC - I triggered with MINOR=0 instead of MINOR=1 to remove the secondary GPU. That actually works .. So kudos from my end anyway.

@emersion
Copy link
Member Author

This sounds unrelated. Please open a separate issue.

@neon64
Copy link
Contributor

neon64 commented Jan 17, 2021

so I've implemented the other half of this here: https://github.com/neon64/wlroots/tree/drm-hotplug . Should I make a separate PR to wlroots which includes/depends on the changes in this one?

@emersion
Copy link
Member Author

Should I make a separate PR to wlroots which includes/depends on the changes in this one?

Nice! Sure, feel free to open a new PR.

Have you run into any bug with this PR?

@neon64
Copy link
Contributor

neon64 commented Jan 17, 2021

Have you run into any bug with this PR?

I still haven't fully diagnosed why nouveau is still in-use unfortunately. lsof /dev/dri* shows only Intel stuff, and lsmod shows no modules depending on nouveau, but also the refcount is 5, not 0, so I think rmmod -f isn't safe to do.

nouveau              2359296  5
mxm_wmi                16384  1 nouveau
ttm                   114688  3 drm_vram_helper,drm_ttm_helper,nouveau
wmi                    36864  5 intel_wmi_thunderbolt,thinkpad_wmi,wmi_bmof,mxm_wmi,nouveau
video                  53248  3 thinkpad_acpi,i915,nouveau
i2c_algo_bit           16384  2 i915,nouveau
drm_kms_helper        274432  4 drm_vram_helper,vboxvideo,i915,nouveau
drm                   569344  30 drm_kms_helper,drm_vram_helper,vboxvideo,drm_ttm_helper,i915,ttm,nouveau
agpgart                53248  4 intel_gtt,ttm,nouveau,drm

After some more testing, I've worked out that if, before trying to destroy the card1 backend, I pull out the HDMI cable, then the refcount drops from 5 to 1, and then after running sudo udevadm trigger --verbose --action=remove --subsystem-match=drm --property-match="MINOR=1", the refcount drops from 1 to 0, and rmmod works.

If I don't disconnect the physical cable before using udevadm, then the refcount stays at 5 as long as sway is open. So my theory is that there's something missing in the cleanup code for outputs / drm-connectors, if you don't disconnect the output before the backend destroy.

May or may not be relevant from sway.log:

00:00:21.104 [DEBUG] [wlr] [backend/drm/drm.c:1064] connector HDMI-A-1: De-allocating CRTC 3
00:00:21.104 [ERROR] [wlr] [backend/drm/legacy.c:42] connector HDMI-A-1: Failed to set DPMS property: Permission denied
00:00:21.154 [ERROR] [wlr] [backend/session/logind.c:120] Failed to release device '9': Device not taken

@emersion
Copy link
Member Author

If I don't disconnect the physical cable before using udevadm, then the refcount stays at 5 as long as sway is open. So my theory is that there's something missing in the cleanup code for outputs / drm-connectors, if you don't disconnect the output before the backend destroy.

Hmm. Yeah, if the commit fails, we don't destroy the FBs in dealloc_crtc (it stops at drm_crtc_commit). Does it work if you comment the return for drm_crtc_commit failures?

@neon64
Copy link
Contributor

neon64 commented Jan 26, 2021

Does it work if you comment the return for drm_crtc_commit failures?

Yep that appears to work - refcount drops to zero and rmmod nouveau works.
Is there a downside to continuing through the rest of the function if the commit fails?

@emersion
Copy link
Member Author

The main thing I'm worried about is if this happens prior to shutdown and we try to re-use the CRTC for another connector later on. The kernel would return an error in this case (EINVAL for atomic).

We could only ignore errors when we're shutting down the backend, but not sure it's worth it. In all cases, something will be printed to the logs.

In any case, feel free to submit a pull request.

@emersion
Copy link
Member Author

Pushed the fix for EPERM.

@emersion emersion merged commit 4839664 into swaywm:master Apr 27, 2021
@emersion emersion deleted the drm-unplug branch April 27, 2021 07:11
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants