[227 regression] rfkill socket job hangs and blocks other jobs (still present in 228) #1579
Comments
FTR, as I'm also looking at issue #1505, and as some issues that, while not as exposed as the recent one, have affected me since v225, I decided to revert bbc2908 as well. So far so good - I haven't experienced the lockups or any stalled jobs. Will see how long this lasts and report back. Could just be that I'm seeing another effect of the overall bug in #1505.
For reference, the issues we've seen with the "Looping too fast" problem have been present since v225, which ties in with the bbc2908 commit. Could be coincidental of course, but it seems to tie in so far. There's a downstream Mageia bug which is closed, but I suspect it's not actually "fixed", just hard to reproduce reliably.
The looping too fast thing most likely was fixed by 8046c45. Do you already have that in your test system?
This output of yours:
is really weird. There should always be at least one job that is running if there's one that is waiting. After all, jobs wait for other jobs that currently run or wait, and if there are no others to wait for then they should be running... I really wonder how you managed to get the system into this state. The only reasonable explanation I could come up with is that your system did a reload while something was queued. Hence the deps changed, and what was previously dep-cycle-free no longer is. We currently don't deal with this case nicely (but we really should). Do you have a "systemctl daemon-reload" (or equivalent) anywhere in your normal codepaths that could trigger this?
8046c45 is definitely part of my v227 build, so it's not fully fixed. Re the list-jobs output, I'm not 100% sure about reload events, but I do remember seeing serialisation issues related to fds, so perhaps you're on the money here... Can't look up the journal just now to get more info, but I don't see that error on the latest boot (which did have similar list-jobs output until I masked various units to get a half-working system for now). The multiple jobs in the waiting state are very much tied to the "Looping too fast" error state, however.
So looking further, there are times when a reload has occurred.
The first looping too fast seemed to occur for me when coming back from suspend. The rfkill device unit was obviously still waiting when we went to sleep, and then immediately timed out after coming back from suspend and cascaded to the other rfkill units:
Although I cannot recall fully, it was about now that I think I switched to tty9 and cancelled the rfkill.socket job:
As you can see it took me about 20s to list the jobs and cancel the appropriate one; then the fprintd service could continue happily along with the rest of the queue.
Is there any more debugging I can do? I can probably spend some time looking into it this evening if you can hint at where to sprinkle some additional debug.
If you manage to reproduce the "looping too fast" thing, try connecting with gdb to PID 1 and see which event causes it to wake up all the time. For that, set a breakpoint, and if that doesn't work, trace through the function and see which callback is ultimately invoked by source_dispatch...
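For anyone else hitting this, a sketch of what such a gdb session might look like (the parameter name s comes from sd-event's source_dispatch(); this assumes your systemd build has debug symbols, and exact symbols vary by build):

```
# from a root shell (e.g. the debug shell), attach to systemd
gdb -p 1
(gdb) break source_dispatch
(gdb) continue
# when the breakpoint hits, inspect the event source being dispatched
(gdb) print s->description
(gdb) print s->type
(gdb) continue
```

Note that while gdb holds PID 1 stopped, the system is effectively frozen, so run this from a tty you can still reach.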
I tried to reproduce this at the hackfest with latest git, but failed. Today I returned to my office and up the issue pops. I suspect it might be related to the fact that I plug a USB network interface in and then disable Wifi (via software). Not sure why that would trigger it, but as it refused to do it at the hackfest, this is the only major difference. I'll see what debug I can extract now.
I'm not really a massive gdb expert, but the loop seems to happen quickly.

This kinda makes sense, as it's the systemd-rfkill.socket unit that seems to be clogging up the queue. In terms of actually dispatching, the only event that seems to get dispatched is a "bus-input" one. I'm guessing that some kind of information about the socket unit is being updated on the bus in some capacity, so the actual dispatching part may not be too important.

Sadly my X session died when debugging at this point, so I'm not sure what this event was the result of.

Does this help in any way? I'm not really fully clued up here but obviously want to kill off this problem, so any further debugging tips for when I get the system into this state again would be greatly appreciated. For the avoidance of doubt, I'm not running kdbus.
My machine is currently in this state but I cannot keep it there forever. Is there any further debug I can extract? This state has only occurred twice since returning from the systemd conference, so it's quite hard to reproduce (it was more frequent before, so I could perhaps roll back a version or two to debug it further if needed).
Still present in v228. I was able to reproduce on a fresh boot by doing:
That was all I needed to get things into this state, it seems (tho' I may have been lucky).
Looking at ohsix's Gist regarding perf usage, I've been trying to record values using that (which is likely better(?) than gdb for this kind of tracing). Most of the s-> values in source_dispatch seem empty/blank (s->description is an empty string, s->type is 0x0, s->pending is 0x0). Some exceptions are s->n_ref, which is 0x4, and s->event, which is 0xffffffff. The latter is the most interesting. However, what I really don't understand is that the assert in source_dispatch() should catch this:
SOURCE_EXIT is not 0x0, and if s->pending is 0x0 and s->type is 0x0 then this assert condition should be met and it should trigger, but it seems it is not. So I'm not 100% sure whether perf is reading the enum values correctly, or if there is just something fundamentally weird here :s Anyway, help is very much appreciated!
Hmm, when things are working normally it's kinda hard to get the values out properly too. I suspect perf is not operating as expected here... :( The last comment can likely be disregarded.
You might get some ideas about sd-event debugging from this gdb script, which was used in the context of 8046c45.
Thanks, that is incredibly useful. I was able to get some decent debug out of this. Unsurprisingly (and as per my comment on November 9th), it's the socket-port-io that's going mental:
Now that I know (a little) about gdb scripting, I'll try to extract more info about what sockets are actually being monitored here and why the events are constantly triggering. Thanks again, that was super useful!
OK, so debugging further into socket_dispatch_io(), the unit in question is indeed systemd-rfkill.socket. The loop ultimately calls socket_enter_running() every time and continues. Printing out p->socket->meta, I get this in GDB:
The most interesting bit at first glance here is:
This seems invalid to me. I'm somewhat stuck now, so further hints as to where to look would be really appreciated.
Hmm, so, does your /dev/rfkill device disappear when you hit the rfkill buttons? i.e. does "udevadm monitor -u" show them go away or come back if you hit the keys?

I figure the sd-event busy looping is another instance of #2467. So my educated guess about what happens is this: the /dev/rfkill disappears on your system, which results in a POLLERR or so on the rfkill device, which means we activate systemd-rfkill.service, but that fails since it binds to the device and the device just disappeared. Hence we try to activate it again, which busy loops as per #2467.

Question is what to do about this, and whether this really is what happens. Please check if the device really disappears (with the udev command mentioned above). Also it would be good if you could strace PID 1 to see if the fd on the rfkill device really results in POLLERR or so. It's not easy to ensure that; try attaching gdb to PID 1 right before issuing the rfkill key press, then set a breakpoint on socket_dispatch_io() and check if that call gets made with an interesting "revents" flag set (which is a combination of POLLIN, POLLERR, POLLOUT, POLLHUP and a couple of others)...
So with the various systemd units masked (just so I can use my machine!), here's what udevadm monitor -u shows.
Starting with it enabled:
Then I disabled it via buttons:
Then enable it via buttons:
Disable via buttons again:
Enable via buttons again:
If I use GNOME to disable the wifi part:
Enable wifi again via GNOME:
Disable wifi again via GNOME:
Enable wifi again via GNOME:
If I use GNOME to disable the bluetooth part:
Enable bluetooth again via GNOME:
Disable bluetooth again via GNOME:
Enable bluetooth again via GNOME:
So the h/w switches do seem to create a new rfkill device for the bluetooth component, but software switches do not. The rfkill index for BT started at 2, which makes sense as this was after a fresh boot yesterday plus one suspend/resume cycle, which I presume has much the same effect. Now none of this refers to /dev/rfkill, so I'm not sure if the above is really out of the ordinary? Can you clarify if this is what you meant, and whether or not it would still be useful to do the strace/gdb you suggested?
Well, all rfkill devices are multiplexed through one /dev/rfkill device, and my guess is that that's failing somehow... Can you try this: run "cat /dev/rfkill" in a terminal. That should spit out some binary noise immediately, and then more each time you hit the rfkill buttons. Does that ever abort or so if you do that continuously? Also, when all rfkill stuff is off, what does "udevadm info /dev/rfkill" show? And if all is on, what does it show then? Please paste the output here...
It never aborts. It just stays running and spews as you fiddle. So it does look like the /dev/rfkill device sticks around.
With the h/w switch disabled (note that the three systemd units for it are still masked):
With the h/w switch enabled:
Only difference I can see is that the TAGS are in a different order. Let me know if you want me to unmask the units again and reproduce the problem before doing any of these commands! Like I said above it does seem to require a daemon-reload to get systemd into the weird state. Also I didn't have IRC connected, but feel free to message me there in ~1h (need to take the dog out!) Cheers! |
I am pretty sure this got fixed by #3760 btw. If not please reopen. |
Not sure if this is related to the reworked rfkill behaviour in v227 or not. Some similar symptoms seem to date back to v225.
From v227 release notes:
I enable and disable my wifi fairly regularly (whenever I'm "docked" to my USB hub which has an ethernet adapter in it).
I've found on several occasions, I would return to my machine and be unable to unlock my session (screen shield fails to rise - just frozen).
Switching to tty2 to login via text also fails.
Thankfully I have the debug shell enabled for now, and can use it to do some commands.
What is typically happening there is that there are a few jobs in the queue, and that is somehow blocking D-Bus activation of fprintd (the fingerprint daemon for auth - not used on my system, but it would be pulled in by gdm/pam for unlocking my session).
If I cancel the rfkill.socket job, the rest of the queue seems to progress OK.
If I then list the jobs again, I invariably get this kind of state:
This will presumably time out at some point and then I get:
If I were to lock my screen now, I'd likely get this problem back.
I'm not 100% sure how long this has been going on, but I've had problems with my /boot (ESP partition) automounting for a while. It would eventually timeout and jobs would get stuck in the queue. This possibly dates back to v225 or earlier.
Not really sure what's causing the "Looping too fast" thing, but I think it's very much related. My current boot was fine for several days (booted on the 12th with wifi disabled and kept it disabled until the 14th). After enabling it, the above errors were printed and the "Looping too fast" thing has been present ever since.