Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Failed to start DSP on Resume (Failure to boot DSP Firmware after resume) #3609

Closed
dblaber opened this issue Apr 21, 2022 · 20 comments
Closed
Labels
DMA memory alloc suspend resume Issues related to suspend resume (e.g. rtcwake)

Comments

@dblaber
Copy link

dblaber commented Apr 21, 2022

Hello all, please help me out if I am not provided right information (any docs, or additional debugging information I can provide). I am having this issue after multiple Sleeps (Suspend to ram) and resumes, where my audio devices are completely gone. It is intermittent, does not happen every resume, but happens often. I see this in the kernel logs when the issue occurs.

4/21/22 7:09 AM kernel sof-audio-pci-intel-cnl 0000:00:1f.3: error: memory alloc failed: -12
4/21/22 7:09 AM kernel sof-audio-pci-intel-cnl 0000:00:1f.3: error: dma prepare for fw loading failed
4/21/22 7:09 AM kernel sof-audio-pci-intel-cnl 0000:00:1f.3: ------------[ DSP dump start ]------------
4/21/22 7:09 AM kernel sof-audio-pci-intel-cnl 0000:00:1f.3: Failed to start DSP
4/21/22 7:09 AM kernel sof-audio-pci-intel-cnl 0000:00:1f.3: fw_state: SOF_FW_BOOT_IN_PROGRESS (2)
4/21/22 7:09 AM kernel sof-audio-pci-intel-cnl 0000:00:1f.3: extended rom status: 0xffffffff 0xffffffff 0xffffffff 0xffffffff 0xffffffff 0xffffffff 0xffffffff 0xffffffff
4/21/22 7:09 AM kernel sof-audio-pci-intel-cnl 0000:00:1f.3: ------------[ DSP dump end ]------------
4/21/22 7:09 AM kernel sof-audio-pci-intel-cnl 0000:00:1f.3: error: failed to boot DSP firmware after resume -12
4/21/22 7:09 AM kernel PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -12
4/21/22 7:09 AM kernel sof-audio-pci-intel-cnl 0000:00:1f.3: PM: failed to resume async: error -12
4/21/22 7:09 AM kernel usb 1-8: reset high-speed USB device number 3 using xhci_hcd
4/21/22 7:09 AM kernel OOM killer enabled.
4/21/22 7:09 AM kernel Restarting tasks ...
4/21/22 7:09 AM kernel pci_bus 0000:05: Allocating resources
4/21/22 7:09 AM kernel pci_bus 0000:2b: Allocating resources
4/21/22 7:09 AM kernel sof-audio-pci-intel-cnl 0000:00:1f.3: error: failed to load widget HDA0.OUT
4/21/22 7:09 AM kernel sof-audio-pci-intel-cnl 0000:00:1f.3: error: failed setting up DAI widget HDA0.OUT
4/21/22 7:09 AM kernel sof-audio-pci-intel-cnl 0000:00:1f.3: ASoC: error at snd_soc_pcm_dai_prepare on Analog CPU DAI: -19
4/21/22 7:09 AM kernel Analog Playback and Capture: ASoC: __soc_pcm_prepare() failed (-19)
4/21/22 7:09 AM kernel HDA Analog: ASoC: dpcm_be_dai_prepare() failed (-19)
4/21/22 7:09 AM kernel HDA Analog: ASoC: dpcm_fe_dai_prepare() failed (-19)

When this occurs, nothing can seem to restore the audio, reloading kmod sof-audio-pci-intel-cnl does not seem to help. I am not sure how to interpret the error (is it related to memory alloca failed, something with firmware etc).

Here are some additional details:
dmb@thinkpad> uname -a ~
Linux thinkpad.da4.org 5.17.3-arch1-1 thesofproject/sof#1 SMP PREEMPT Thu, 14 Apr 2022 01:18:36 +0000 x86_64 GNU/Linux

[1] dmb@thinkpad> lspci|grep Audio ~
00:1f.3 Audio device: Intel Corporation Comet Lake PCH-LP cAVS

Laptop info (from lshw):
description: Notebook
product: 20S4CTO1WW (LENOVO_MT_20S4_BU_Think_FM_ThinkPad P14s Gen 1)
vendor: LENOVO
version: ThinkPad P14s Gen 1

Here is what initialization looks like when freshly starting:
[130] dmb@thinkpad> sudo dmesg|grep sof-audio-pci-intel-cnl ~
[ 15.744016] sof-audio-pci-intel-cnl 0000:00:1f.3: DSP detected with PCI class/subclass/prog-if info 0x040380
[ 15.746741] sof-audio-pci-intel-cnl 0000:00:1f.3: Digital mics found on Skylake+ platform, using SOF driver
[ 15.746753] sof-audio-pci-intel-cnl 0000:00:1f.3: enabling device (0004 -> 0006)
[ 15.748115] sof-audio-pci-intel-cnl 0000:00:1f.3: DSP detected with PCI class/subclass/prog-if 0x040380
[ 17.121006] sof-audio-pci-intel-cnl 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])
[ 17.240874] sof-audio-pci-intel-cnl 0000:00:1f.3: use msi interrupt mode
[ 17.276673] sof-audio-pci-intel-cnl 0000:00:1f.3: hda codecs found, mask 5
[ 17.276676] sof-audio-pci-intel-cnl 0000:00:1f.3: using HDA machine driver skl_hda_dsp_generic now
[ 17.276680] sof-audio-pci-intel-cnl 0000:00:1f.3: DMICs detected in NHLT tables: 2
[ 17.279078] sof-audio-pci-intel-cnl 0000:00:1f.3: Firmware info: version 2:1:1-3964a
[ 17.279080] sof-audio-pci-intel-cnl 0000:00:1f.3: Firmware: ABI 3:21:0 Kernel ABI 3:18:0
[ 17.279082] sof-audio-pci-intel-cnl 0000:00:1f.3: warn: FW ABI is more recent than kernel
[ 17.279088] sof-audio-pci-intel-cnl 0000:00:1f.3: unknown sof_ext_man header type 3 size 0x30
[ 17.412434] sof-audio-pci-intel-cnl 0000:00:1f.3: Firmware info: version 2:1:1-3964a
[ 17.412442] sof-audio-pci-intel-cnl 0000:00:1f.3: Firmware: ABI 3:21:0 Kernel ABI 3:18:0
[ 17.412446] sof-audio-pci-intel-cnl 0000:00:1f.3: warn: FW ABI is more recent than kernel
[ 17.452639] sof-audio-pci-intel-cnl 0000:00:1f.3: Topology: ABI 3:21:0 Kernel ABI 3:18:0
[ 17.452646] sof-audio-pci-intel-cnl 0000:00:1f.3: warn: topology ABI is more recent than kernel

thesofproject/sof#5168 seems to be a very similar issue, however it is marked closed with newer kernerl version, however I am using 5.17.x, whereas that bug is referring to 5.16.

Let me know if there is anything else I can provide.

@lgirdwood lgirdwood transferred this issue from thesofproject/sof Apr 22, 2022
@lgirdwood
Copy link
Member

4/21/22 7:09 AM kernel sof-audio-pci-intel-cnl 0000:00:1f.3: error: memory alloc failed: -12
4/21/22 7:09 AM kernel sof-audio-pci-intel-cnl 0000:00:1f.3: error: dma prepare for fw loading failed

@ujfalusi @plbossart fyi - looks like we fail to alloc the code loader DMA buffer ?

@ujfalusi
Copy link
Collaborator

Indeed this looks like a memory allocation error.
@dblaber, can you estimate how many suspend/resume cycles it usually takes to hit this failure?
Can you also check with free -h next time when you hit this to see the memory status?

Also: is this happened before 5.17.3 or it is started with this kernel version?

@dblaber
Copy link
Author

dblaber commented Apr 22, 2022

I am going to see if I can have some script called that logs free -h before or after standby, to get a good idea of state. If I remember correctly though, I remember seeing like 3 or 4 gb of buff/cache(which should mean available) around that time, but will find out for sure.

As far as suspend/resume cycles, I would say maybe around 4 or 5, will find out for sure when I get the memory log.

Not 100% sure about kernel versions, I don't believe I saw issue a month ago, but wasn't actively keeping track of the versions. After getting the memory logs, I can try downgrading to 2.16.x possibly.

@ujfalusi
Copy link
Collaborator

@dblaber, thanks.
fwiw, I'm over 520 iteration of suspend/resume (rtcwake -m mem -s 5) on a laptop here with 5.17.4-artix1-1 and world/sof-firmware 2.1.1-1.
It is a tgl laptop with 32GB RAM, so unlikely to run out of memory in this century, but so far there is no sign of anything eating it up (running kde-plasma).

@plbossart
Copy link
Member

@ujfalusi @lgirdwood We've had a number of 'out of memory issues' recently.

see #3584 and #3530

That's starting to be a pattern I am afraid, even with newer kernels.

@ujfalusi
Copy link
Collaborator

@plbossart, I see. I will build a 5.17. stable on Monday with memory leaking debug (arch/artix kernels do not have this enabled, but they are close to upstream) and see if it finds anything. #3584 looks like tgl, but the other two are cnl (which I don't have around).
I'm only aware of one memleak, but it is introduced for 5.18 and it is already fixed and it was not related to DMA page alloc.

@dblaber
Copy link
Author

dblaber commented Apr 23, 2022

Just reproduced it now.

BEFORE SLEEP Fri Apr 22 04:58:11 PM EDT 2022 --
total used free shared buff/cache available
Mem: 15655 11455 276 2725 3923 1025
Swap: 8191 4242 3949

AFTER SLEEP Fri Apr 22 04:58:20 PM EDT 2022 --
total used free shared buff/cache available
Mem: 15655 11273 720 2538 3661 1522
Swap: 8191 4770 3421

Sorry, columns don't align and no fixed width fonts/mode. This was 11 sleep/resumes before this happened, uptime of 13 hours and 15 minutes.

Although not much free, I imagine over 1 gb should not cause issues (a lot is swapped, buff cache 3.6 gb which I assume is usable).

Ironically, I also see a iwlwifi microcode error right after the sof-audio error:

[15560.989477] iwlwifi 0000:00:14.3: Microcode SW error detected. Restarting 0x0.
[15560.989565] iwlwifi 0000:00:14.3: Start IWL Error Log Dump:
[15560.989569] iwlwifi 0000:00:14.3: Transport status: 0x0000004A, valid: 6
[15560.989574] iwlwifi 0000:00:14.3: Loaded firmware version: 68.01d30b0c.0 QuZ-a0-hr-b0-68.ucode

Seems like iwlwifi somehow recovered itself eventually though.

@dblaber
Copy link
Author

dblaber commented Apr 23, 2022

Attached full kernel log
kern_log.txt

@ujfalusi
Copy link
Collaborator

@dblaber, thanks for the report!
You can tell github to keep the log intact by:
```
text to present as is
```
I have spent the weekend by trying to reproduce this or at least to get some pointers which can hint what might happening.
Let's start with the straight forward runs.
Log in to KDE, start firefox, open youtube and start playing a long 'video' (some 6 hours long blues compilation), 5.17.4-artix1-1,
piperwire
The script:

#!/bin/bash

test="rtcwake -m mem -s5"
iteration=0

while :
do
        $test || break
        ((iteration++))

        echo "Iteration: ${iteration}"
        sleep 2
        free | grep -B1 Mem
        sleep 3
done

echo "Failed at interation ${iteration}"
  1. audio modules are loaded
Iteration: 1
               total        used        free      shared  buff/cache   available
Mem:        32594584     1251384    30224688      276516     1118512    30666828

Iteration: 2438
               total        used        free      shared  buff/cache   available
Mem:        32594584     1301184    29245220      279556     2048180    30597552 (loss: 69276)

Iteration: 2987
               total        used        free      shared  buff/cache   available
Mem:        32594584     1414240    28931592      279600     2248752    30480844 (loss: 185984)
  1. audio modules removed before logging in, pw is using dummy output
Iteration: 1
               total        used        free      shared  buff/cache   available
Mem:        32594584     1184072    30349768      250052     1060744    30758408

Iteration: 1365
               total        used        free      shared  buff/cache   available
Mem:        32594584     1321852    29923752      258396     1348980    30606448 (loss: 151960)

These run overnight, while I was not around, so I don't have exact matches, but 2987/1365 suspend resume cycle is quite massive. Unfortunately the distro kernel does not have debugging modules, so I went to custom kernel.
kmemleak, kasan, DMA API debug kept silent for the runs, not a single report.
When I killed sddm and run the tests it was really a different story:
w/ audio modules after 3729 iteration: loss 179604
w/o audio modules after 2473 iteration: loss 38288
Initially (after 120 runs) it looked like that 5.17-rc1 was not loosing memory at the rate of how 5.17.4 does and narrowed it to 5.17-rc6 where it starts, bisected between rc1 and rc6, but it sounded fishy already. It do pointed to a bogus commit (how would you say loss of 6000 is good or bad?), anyways, re-run things on 5.17-rc1 and it lost memory in a same rate as 5.17.4, so the weekend was wasted for nothing.
Just a note: 5.16.20 after 377 iterations reported a loss of 12664, so not good either.

Still, we are looking at snd_dma_alloc_pages() which is allocating memory for DMA use, it is likely calling dma_alloc_noncontiguous() (it does on my laptop), if this allocation fails there is still a fallback via snd_dma_sg_fallback_alloc() if you have SND_DMA_SGBUF (can you run zcat /proc/config.gz | grep SND_DMA_SGBUF to check?) which should also fail to make the alloc_pages() fail.
There is one thing I can see in there which is not quite right: if we indeed took the fallback for allocation we are still going to use the non fallback free and based on the code, the fallback must use it's own implementation of fallback free (snd_dma_sg_fallback_free).

Long story short: I'm not closer to find the reason for these allocation errors :(

@ujfalusi
Copy link
Collaborator

FYI: https://lore.kernel.org/alsa-devel/20220425122814.751-1-peter.ujfalusi@linux.intel.com/
this might help in case a fallback is taken for allocations, but I can not get that one triggered on my system(s) so far, but as I say in the commit message, the fallback is added for a reason, it must have been happening...

@ujfalusi
Copy link
Collaborator

@dblaber, can you see if 5.17.4 is fixing your issue? a fallback patch got backported from mainline, I think that will fix it.

@dblaber
Copy link
Author

dblaber commented Apr 25, 2022 via email

@AVAtarMod
Copy link

Hi, dear SOF project team. I have this issue too (after resume from suspend to disk audio driver is stopped).
Kernel info: Linux version 5.17.1-1-default on openSUSE Tumbleweed
Kernel log: kernel_err.log

@plbossart
Copy link
Member

@AVAtarMod you need to upgrade to 5.17.4 per the comments above and check if this is fixed.

@AVAtarMod
Copy link

@plbossart I'll check later due kernel acpi bug (kernel version is 5.17.4), because of this I have kernel panic

@dblaber
Copy link
Author

dblaber commented Apr 29, 2022

So far, have not reproduced this yet on 5.17.4, will give another week of testing, due to the nature of intermittent issues.

@ujfalusi
Copy link
Collaborator

@dblaber, thank you for the information! Let's wait a bit more time to make sure it is gone.

@AVAtarMod
Copy link

I upgrade kernel to 5.17.4, and issue was gone. Thanks!

@dblaber
Copy link
Author

dblaber commented Jun 18, 2022

I am not sure if this is a seperate issue, a regression, or if i should be opening a new bug report for this, but I am now on kernel 5.18.3-arch1-1 and am seeing this again:
[78173.348423] sof-audio-pci-intel-cnl 0000:00:1f.3: ------------[ DSP dump start ]------------
[78173.348432] sof-audio-pci-intel-cnl 0000:00:1f.3: Firmware boot failure due to timeout
[78173.348437] sof-audio-pci-intel-cnl 0000:00:1f.3: fw_state: SOF_FW_BOOT_IN_PROGRESS (2)
[78173.348513] sof-audio-pci-intel-cnl 0000:00:1f.3: invalid header size 0xffffffff. FW oops is bogus
[78173.348518] sof-audio-pci-intel-cnl 0000:00:1f.3: unexpected fault 0x00000000 trace 0x00000000
[78173.348523] sof-audio-pci-intel-cnl 0000:00:1f.3: ------------[ DSP dump end ]------------
[78173.348526] sof-audio-pci-intel-cnl 0000:00:1f.3: error: failed to boot DSP firmware after resume -5
[78173.348531] sof-audio-pci-intel-cnl 0000:00:1f.3: PM: dpm_run_callback(): pci_pm_restore+0x0/0xe0 returns -5
[78173.348569] sof-audio-pci-intel-cnl 0000:00:1f.3: PM: failed to restore async: error -5

Not 100% sure, I think this was after a suspend to disk resume.

@ujfalusi
Copy link
Collaborator

@dblaber, it is likely due to IMR boot failure when coming out from hibernate (suspend to disk), like this:
thesofproject/sof#5892

@marc-hb marc-hb added the suspend resume Issues related to suspend resume (e.g. rtcwake) label Jul 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DMA memory alloc suspend resume Issues related to suspend resume (e.g. rtcwake)
Projects
None yet
Development

No branches or pull requests

6 participants