AHCI runtime_pm issues with ata disks on Linux 4.15+ #123
Comments
@alonbl @copyninja |
Because of this issue my load average started spiking and no command would return, so I had to force the laptop off. The issue persisted even when I booted into recovery mode. I finally removed laptop-mode-tools with great difficulty. The system is unusable with this issue; forget about opening a file to blacklist the module. |
Isn't this https://www.mail-archive.com/search?l=linux-block@vger.kernel.org&q=subject:%22Re%5C%3A+linux%5C-next+scsi%5C-mq+hang+in+suspend%5C-resume%22&o=newest&f=1 all over again? I saw this a few months ago when I played with block multiqueue in some then new kernel, but found out someone forgot to implement power management into it and their fix (https://patchwork.kernel.org/patch/9891133/) wasn't enough to prevent LMT from hanging my machine. I resorted to "dm_mod.use_blk_mq=0 scsi_mod.use_blk_mq=0". |
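For anyone wanting to check whether a machine was booted with the workaround quoted above, here is a minimal sketch that pulls the `use_blk_mq` overrides out of a kernel command line. The function name `blk_mq_overrides` is made up for illustration, and the parameters themselves only have an effect on kernels that still ship the legacy (non-mq) block path.

```python
def blk_mq_overrides(cmdline):
    """Return the *.use_blk_mq=<value> options found in a kernel command line.

    Hypothetical helper for illustration; it only parses text and does not
    verify that the running kernel honours these parameters.
    """
    overrides = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key.endswith(".use_blk_mq"):
            overrides[key] = value
    return overrides


# Typical use on a live system (commented out so the sketch has no side effects):
# with open("/proc/cmdline") as fh:
#     print(blk_mq_overrides(fh.read()))
```

An empty result means no override was passed at boot, so the kernel default applies.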
I don't think the issues are the same. Our runtime-pm module wasn't executing completely. This got fixed in the last release with PR #118, which then uncovered this issue. The issue is that 'sd' type devices get offlined completely when runtime pm is applied, so it should be reproducible even in normal use, as it was in my tests and in what @copyninja reported. Similar reports came from a Debian user on the Debian BTS too. The two issues could very well be related, but I can't conclude that from the mentioned thread. As you can see from the logs I've shared, the disks are stopped and never come back. |
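To see which disks currently have runtime PM applied, the sysfs `power/control` and `power/runtime_status` attributes of each block device's backing device can be read directly. The sketch below is not part of laptop-mode-tools; `runtime_pm_states` and its `sysfs_block` parameter are names assumed here for illustration.

```python
from pathlib import Path


def runtime_pm_states(sysfs_block="/sys/block"):
    """Map each disk name to (power/control, power/runtime_status).

    Hypothetical helper: devices without a backing device/power/control
    file (loop devices, device-mapper nodes) are skipped.
    """
    root = Path(sysfs_block)
    states = {}
    if not root.is_dir():
        return states
    for dev in sorted(root.iterdir()):
        power = dev / "device" / "power"
        control = power / "control"
        if not control.is_file():
            continue
        status = power / "runtime_status"
        states[dev.name] = (
            control.read_text().strip(),
            status.read_text().strip() if status.is_file() else "unknown",
        )
    return states


if __name__ == "__main__":
    for name, (control, status) in runtime_pm_states().items():
        print(f"{name}: control={control} runtime_status={status}")
```

A disk showing `control=auto` is eligible for runtime suspend; `control=on` keeps it powered.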
So, by the way, does using |
Well it seems this isn't really related to runtime-pm, sorry for the noise, I commented in the middle of debugging. |
You can have
|
You obviously haven't tried unstable's 4.15.4 yet. :-) |
I am on it, since yesterday.
|
And do you see "Stopping disk" immediately followed by "Starting disk" in dmesg every time you pull out the charger? Like 3 or 4 times we experienced it not being followed by "Starting disk". :-( |
This exact same thing happened to me, but I don't remember the kernel
version. It goes on until finally everything freezes: the terminal looks OK but
none of the commands return.
|
I have seen that a couple of times since yesterday. My initial thought was that my hardware was going bad. But now that you've reported the same, I have some relief. I tried a couple of [un]plugs today and couldn't reproduce it. The issue is intermittent, and it is still unclear what chain of events uncovers this bug.
|
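Since the failure is intermittent, it may help to scan a saved dmesg for disks whose last event was "Stopping disk" with no later "Starting disk". This is a rough sketch that assumes the usual sd driver message shape (`sd 0:0:0:0: [sda] Stopping disk`); `disks_left_stopped` is a name made up for this example.

```python
import re

# Assumed message shape: "sd 0:0:0:0: [sda] Stopping disk" / "... Starting disk".
# The dmesg timestamp prefix "[  123.456]" does not match because of the "sd" prefix.
_DISK_EVENT = re.compile(r"\[(sd[a-z]+)\] (Stopping|Starting) disk")


def disks_left_stopped(lines):
    """Return disks whose most recent logged event is 'Stopping disk'."""
    last_event = {}
    for line in lines:
        match = _DISK_EVENT.search(line)
        if match:
            last_event[match.group(1)] = match.group(2)
    return sorted(d for d, event in last_event.items() if event == "Stopping")


# Typical use (commented out; reading dmesg may need root on some systems):
# import subprocess
# out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
# print(disks_left_stopped(out.splitlines()))
```

A disk that was stopped on purpose (for example before suspend) will also show up here, so the output is only a hint about where to look.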
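@copyninja What you saw was devices being offlined and never recovering, followed by all disk related processes getting to DI state. That issue was resolved by adding sd to the blacklist, fixed in 1.72.2 (or in Debian 1.72-2). |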
Hm, @rickysarraf you have a Stopping followed by remount followed by Starting. But with an SSD, I thought remounting for commit=600 is unnecessary, so I configured that bit out. Not sure if that's what keeps your disks on and my wife's off, it's just an idea. |
(and just for completeness, my own laptop only has NVMe disks, so that's why I can no longer experiment with any of this locally) |
Well. Holding off the I/O at the file system layer can be a good thing. I know SSDs are fast and don't have the drawbacks of rotational disks, but a large chunk of sequential writes is still much better for an SSD. I think I documented it in the FAQs. |
But I've seen them recovering. I could literally hear the disk switching off
and on even when it froze completely.
…On 22-Feb-2018 16:03, "Ritesh Raj Sarraf" ***@***.***> wrote:
@copyninja <https://github.com/copyninja> What you saw was devices being
offlined and never recovering, followed by all disk related processes
getting to DI state. That issue was resolved by adding sd to the
blacklist, fixed in 1.72.2 (or in Debian 1.72-2).
|
@liskin The ordering of the messages about Start/Stop/Remount must be mixed. I have 2 disks, one ssd and the other sshd. |
I hit the bug again and it is pretty annoying. I'm sure there must be data loss involved too, when used in combination with LMT. For now I'm going to revert back to 4.14. @liskin Do you want to report it to the Debian Kernel Team ? I'm curious what best |
well I haven't managed to reproduce it controllably, the only reproducer I have is "let her run around the house with a laptop for half a day" :-( |
This is the best I've been able to capture so far. Since the disk (or the file system) is inaccessible at this point, the full trace never gets logged to the journal/syslog. To make things worse, the machine freezes at this point, so I can't even scroll back through the buffer. But it does respond to SysRq, so the kernel is alive and blocked. Where? Probably the disk is gone, or the file system went haywire. |
Try ssh. I was able to ssh into a machine with nonresponsive GNOME just fine and even sudo su -. |
I have an SSD on Thinkpad X230 on 1.72.2, and sometimes when i plug the power cable in, i observe in dmesg |
With |
4.19.0-rc6, also observed with earlier releases; Gentoo. |
In that case I'm not sure. It could be specific to your device. Can you enable The problem was inherited by |
Also, https://patchwork.kernel.org/patch/10548975/ has been merged some time ago so blacklisting of |
Hi there, I'm probably experiencing the same or a similar issue. Recently (after installing LMT) I've noticed that my laptop hangs upon resume when on battery. Through trial and error I found that it takes one switch from AC to battery to make this happen. After that even stopping LMT does not help. Turns out, LMT do set I have added another check for Let me know if this issue should be filed as a separate bug. |
You must be having a different issue altogether. What kernel are you on ? The fixes on the kernel side have all been upstream lately. On my machine, I suspend/resume more than 10 times in a day, without issues. The best in your case, since you have exact steps to reproduce/trigger the bug, is to enable debug logs and then share them here. |
Do I open a separate issue then? |
Yes, Please. I am inclined to close this issue as the issue on the kernel side seems to have been fixed. But I've kept this open so far, so that others, who face the same issue, can find the information here relevant. |
Hi! I have this string in runtime-pm.conf: My bug on gentoo bugzilla: https://bugs.gentoo.org/689970 -- -- |
Yes. It is a kernel side issue. Please refer to the link that @liskin has mentioned. |
I'm having an issue like this on kernel 5.9.x. My separate, unmounted hard drive is being power managed by laptop_mode_tools and won't stop clicking constantly. I see that this may not be an issue with this module/code in particular but I just wanted to add a comment saying that I'm getting these dmesg messages when laptop_mode_tools is set to manage my hard drive along with constant clicking. Setting |
@BeanieBen9990 Not sure why you are seeing this issue. On one of my other machine, where I have While on AC:
While on BATT:
|
These are serious errors, occurring because our devices aren't really capable of LPM. Device blacklisting is an issue here because the same device is referenced through multiple persistent names.
We need a better blacklist mechanism. I've had the following in my runtime-pm blacklist, but it still picks the devices.
because
I only uncovered this issue after applying #118. Thanks @yardenac
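One possible direction for a better blacklist, sketched in Python rather than the module's own shell: resolve every candidate path to its canonical location before matching, so that a device reached through any of its persistent names is compared against the same string. Everything here (`is_blacklisted`, the pattern syntax) is hypothetical and is not how the runtime-pm module currently behaves.

```python
import fnmatch
import os


def is_blacklisted(device_path, blacklist):
    """Check a device path against glob patterns after resolving symlinks.

    Hypothetical helper: patterns are fnmatch-style globs, tested against
    both the path as given and its canonical (symlink-free) form.
    """
    canonical = os.path.realpath(device_path)
    for pattern in blacklist:
        if fnmatch.fnmatch(device_path, pattern) or fnmatch.fnmatch(canonical, pattern):
            return True
    return False
```

With this approach a pattern written against the canonical device path would also catch the same device when it is reached through a symlinked name. Patterns written against one alias still would not match a different alias, so the canonical form is the safer thing to blacklist.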