[227 regression] rfkill socket job hangs and blocks other jobs (still present in 228) #1579
Comments
FTR, as I'm also looking at issue #1505, and as some issues that, while not as exposed as the recent one, have affected me since v225, I decided to revert bbc2908 as well. So far so good - I haven't experienced the lockups or any stalled jobs. Will see how long this lasts and report back. Could just be that I'm seeing another effect of the overall bug in #1505.
For reference, the issues we've seen with the "Looping too fast" problem have been present since v225, which ties in with the bbc2908 commit. Could be coincidental of course, but it seems to tie in so far. There's a downstream Mageia bug which is closed, but I suspect it's not actually "fixed", just hard to reproduce reliably.
The looping too fast thing most likely was fixed by 8046c45. Do you already have that in your test system?
This output of yours:
is really weird. There should always be at least one job that is running if there's one that is waiting. After all, jobs wait for other jobs that currently run or wait, and if there are no others to wait for then they should be running... I really wonder how you managed to get the system into this state. The only reasonable explanation I could come up with is that your system did a reload while something was queued. Hence the deps changed, and what was previously dep-cycle-free no longer is. We currently don't deal with this case nicely (but we really should). Do you have a "systemctl daemon-reload" (or equivalent) anywhere in your normal codepaths that could trigger this?
8046c45 is definitely part of my v227 build, so it's not fully fixed. Re the list-jobs output, I'm not 100% sure about reload events, but I do remember seeing serialisation issues related to fds, so perhaps you're on the money here... Can't look up the journal just now to get more info, but I don't see that error on the latest boot (which did have similar list-jobs output until I masked various units to get a half-working system for now). The multiple jobs in the waiting state are very much tied to the "Looping too fast" error state, however.
So looking further, there are times when a reload has occurred.
The first looping too fast seemed to occur for me when coming back from suspend. The rfkill device unit was obviously still waiting when we went to sleep, and then immediately timed out after coming back from suspend and cascaded to the other rfkill units:
Although I cannot recall fully, it was about now that I think I switched to tty9 and cancelled the rfkill.socket job:
As you can see it took me about 20s to list the jobs and cancel the appropriate one; then the fprintd service could continue happily along with the rest of the queue.
Is there any more debugging I can do? I can probably spend some time looking into it this evening if you can hint at where to sprinkle some additional debug.
If you manage to reproduce the "looping too fast" thing, try connecting with gdb to PID 1 and see which event causes it to wake up all the time. For that, set a breakpoint, and if that doesn't work, trace through the function and see which callback is ultimately invoked by source_dispatch...
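For anyone else hitting this, a sketch of what such a gdb session might look like (the parameter name s comes from sd-event's source_dispatch(); this assumes your systemd build has debug symbols, and exact symbols vary by build):

```
# from a root shell (e.g. the debug shell), attach to systemd
gdb -p 1
(gdb) break source_dispatch
(gdb) continue
# when the breakpoint hits, inspect the event source being dispatched
(gdb) print s->description
(gdb) print s->type
(gdb) continue
```

Note that while gdb holds PID 1 stopped, the system is effectively frozen, so run this from a tty you can still reach.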
I tried to reproduce this at the hackfest with latest git, but failed. Today I returned to my office and up the issue pops. I suspect it might be related to the fact that I plug a USB network interface in and then disable Wifi (via software). Not sure why that would trigger it, but as it refused to do it at the hackfest, this is the only major difference. I'll see what debug I can extract now.
I'm not really a massive gdb expert, but the loop seems to happen quickly.

This kinda makes sense, as it's the systemd-rfkill.socket unit that seems to be clogging up the queue. In terms of actually dispatching, the only event that seems to get dispatched is a "bus-input" one. I'm guessing that some kind of information about the socket unit is being updated on the bus in some capacity, so the actual dispatching part may not be too important.

Sadly my X session died when debugging at this point, so I'm not sure what this event was the result of.

Does this help in any way? I'm not really fully clued up here but obviously want to kill off this problem, so any further debugging tips for when I get the system into this state again would be greatly appreciated. For the avoidance of doubt, I'm not running kdbus.
My machine is currently in this state but I cannot keep it there forever. Is there any further debug I can extract? This state has only occurred twice since returning from the systemd conference, so it's quite hard to reproduce (it was more frequent before, so I could perhaps roll back a version or two to debug it further if needed).
Still present in v228. I was able to reproduce on a fresh boot by doing:
That was all I needed to get things into this state, it seems (tho' I may have been lucky).
Looking at ohsix's Gist regarding perf usage, I've been trying to record values using that (which is likely better(?) than gdb for this kind of tracing). Most of the s-> values in source_dispatch seem empty/blank (s->description is an empty string, s->type is 0x0, s->pending is 0x0). Some exceptions are s->n_ref, which is 0x4, and s->event, which is 0xffffffff. The latter is the most interesting. However, what I really don't understand is that the assert in source_dispatch() should catch this:
SOURCE_EXIT is not 0x0, and if s->pending is 0x0 and s->type is 0x0 then this assert condition should be met and it should trigger, but it seems it is not. So I'm not 100% sure whether perf is reading the enum values correctly, or if there is just something fundamentally weird here :s Anyway, help is very much appreciated!
Hmm, when things are working normally it's kinda hard to get the values out properly too. I suspect perf is not operating as expected here... :( The last comment can likely be disregarded.
You might get some ideas about sd-event debugging from this gdb script, which was used in the context of 8046c45.
Thanks, that is incredibly useful. I was able to get some decent debug out of this. Unsurprisingly (and as per my comment on November 9th), it's the socket-port-io that's going mental:
Now that I know (a little) about gdb scripting, I'll try to extract more info about what sockets are actually being monitored here and why the events are constantly triggering. Thanks again, that was super useful!
OK, so debugging further into socket_dispatch_io(), the unit in question is indeed systemd-rfkill.socket. The loop ultimately calls socket_enter_running() every time and continues. Printing out p->socket->meta, I get this in GDB:
The most interesting bit at first glance here is:
This seems invalid to me. I'm somewhat stuck now, so further hints as to where to look would be really appreciated.
Hmm, so, does your /dev/rfkill device disappear when you hit the rfkill buttons? i.e. does "udevadm monitor -u" show them go away or come back if you hit the keys?

I figure the sd-event busy looping is another instance of #2467. So my educated guess about what happens is this: the /dev/rfkill disappears on your system, which results in a POLLERR or so on the rfkill device, which means we activate systemd-rfkill.service, but that fails since it binds to the device and the device just disappeared. Hence we try to activate it again, which busy loops as per #2467.

Question is what to do about this, and whether this really is what happens. Please check if the device really disappears (with the udev command mentioned above). Also it would be good if you could strace PID 1 to see if the fd on the rfkill device really results in POLLERR or so. It's not easy to ensure that; try attaching gdb to PID 1 right before issuing the rfkill key press, then set a breakpoint on socket_dispatch_io() and check if that call gets made with an interesting "revents" flag set (which is a combination of POLLIN, POLLERR, POLLOUT, POLLHUP and a couple of others)...
So with the various systemd units masked (just so I can use my machine!), here's what udevadm monitor -u shows.
Starting with it enabled:
Then I disabled it via buttons:
Then enable it via buttons:
Disable via buttons again:
Enable via buttons again:
If I use GNOME to disable the wifi part:
Enable wifi again via GNOME:
Disable wifi again via GNOME:
Enable wifi again via GNOME:
If I use GNOME to disable the bluetooth part:
Enable bluetooth again via GNOME:
Disable bluetooth again via GNOME:
Enable bluetooth again via GNOME:
So the h/w switches do seem to create a new rfkill device for the bluetooth component, but software switches do not. The rfkill index for BT started at 2, which makes sense as this was after a fresh boot yesterday plus one suspend/resume cycle, which I presume has much the same effect. Now none of this refers to /dev/rfkill, so I'm not sure if the above is really out of the ordinary? Can you clarify if this is what you meant, and whether or not it would still be useful to do the strace/gdb you suggested?
Well, all rfkill devices are multiplexed through one /dev/rfkill device, and my guess is that that's failing somehow... Can you try this: run "cat /dev/rfkill" in a terminal. That should spit out some binary noise immediately, and then more each time you hit the rfkill buttons. Does that ever abort or so if you do that continuously? Also, when all rfkill stuff is off, what does "udevadm info /dev/rfkill" show? And if all is on, what does it show then? Please paste the output here...
It never aborts. It just stays running and spews as you fiddle. So it does look like the /dev/rfkill device sticks around.
With the h/w switch disabled (note that the three systemd units for it are still masked):
With the h/w switch enabled:
Only difference I can see is that the TAGS are in a different order. Let me know if you want me to unmask the units again and reproduce the problem before doing any of these commands! Like I said above it does seem to require a daemon-reload to get systemd into the weird state. Also I didn't have IRC connected, but feel free to message me there in ~1h (need to take the dog out!) Cheers! |
I am pretty sure this got fixed by #3760 btw. If not please reopen. |
Not sure if this is related to the reworked rfkill behaviour in v227 or not. Some similar symptoms seem to date back to v225.
From v227 release notes:
I enable and disable my wifi fairly regularly (whenever I'm "docked" to my USB hub which has an ethernet adapter in it).
I've found on several occasions, I would return to my machine and be unable to unlock my session (screen shield fails to rise - just frozen).
Switching to tty2 to login via text also fails.
Thankfully I have the debug shell enabled for now, and can use it to do some commands.
What is typically happening there is that there are a few jobs in the queue, and that is somehow blocking D-Bus activation of fprintd (the fingerprint daemon for auth - not used on my system, but it would be pulled in by gdm/pam for unlocking my session).
If I cancel the rfkill.socket job, the rest of the queue seems to progress OK.
If I then list the jobs again, I invariably get this kind of state:
This will presumably time out at some point and then I get:
If I were to lock my screen now, I'd likely get this problem back.
I'm not 100% sure how long this has been going on, but I've had problems with my /boot (ESP partition) automounting for a while. It would eventually timeout and jobs would get stuck in the queue. This possibly dates back to v225 or earlier.
Not really sure what's causing the "Looping too fast" thing, but I think it's very much related. My current boot was fine for several days (booted on the 12th with wifi disabled and kept it disabled until the 14th). After enabling it, the above errors were printed and the "Looping too fast" thing has been present ever since.