SoundWire: TGL: IO error in rt711_jack_detect_handler #3459

plbossart · 2022-02-25T16:49:10Z

@bardliao @shumingfan FYI, that's a new one.

https://sof-ci.01.org/linuxpr/PR3454/build7233/devicetest/?model=TGLU_RVP_SDW&testcase=check-suspend-resume-with-capture-5

[   29.522674] kernel: soundwire_intel soundwire_intel.link.0: IO transfer timed out, cmd 3 device 1 addr b921 len 1
[   29.522680] kernel: soundwire sdw-master-0: trf on Slave 1 failed:-110 write addr b921 count 0
[   29.522741] kernel: IO error in rt711_jack_detect_handler, ret -110

dmesg.log

plbossart · 2022-02-25T16:50:24Z

not sure if this is related to #3401

plbossart · 2022-02-28T22:49:35Z

Not able to reproduce with short tests on Dell SKU 0A3E that uses RT711 on link1

TPLG=/lib/firmware/intel/sof-tplg/sof-tgl-rt715-rt711-rt1308-mono.tplg  ~/sof-test/test-case/check-suspend-resume-with-audio.sh -l 5 -m capture

@bardliao can you try to see if this bug can be reproduced on your TGL devices with IPC3?
Not sure why we see this consistently on TGLU_RVP_SDW now.

bardliao · 2022-03-01T12:13:22Z

> Not able to reproduce with short tests on Dell SKU 0A3E that uses RT711 on link1
> TPLG=/lib/firmware/intel/sof-tplg/sof-tgl-rt715-rt711-rt1308-mono.tplg  ~/sof-test/test-case/check-suspend-resume-with-audio.sh -l 5 -m capture
> @bardliao can you try to see if this bug can be reproduced on your TGL devices with IPC3? Not sure why we see this consistently on TGLU_RVP_SDW now.

@plbossart I tested it on my Dell SKU 0A3E and it passed. I ran the test twice and both iterations passed.

A race condition can happen when the peripheral devices are already suspended but a hardware peripheral attaches to the link before we turn it off. In that case, the rt711 driver schedules workqueues, which can lead to transactions happening when the hardware is turned off.

This patch makes sure the workqueues are scheduled only when the peripheral devices have completed their resume routine, or are already at full power.

BugLink: thesofproject#3459
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

marc-hb · 2022-03-08T17:55:44Z

P1 because it's happening consistently:

https://sof-ci.01.org/sofpr/PR5485/build12238/devicetest/?model=TGLU_RVP_SDW&testcase=check-suspend-resume-with-playback-5
https://sof-ci.01.org/softestpr/PR871/build1005/devicetest/?model=TGLU_RVP_SDW&testcase=check-suspend-resume-with-playback-5
https://sof-ci.01.org/sofpr/PR5475/build12242/devicetest/?model=TGLU_RVP_SDW&testcase=check-suspend-resume-with-playback-5

These test failures happened on different units.

What changed recently? Big kernel merge?

Interesting that searching for something as specific as rt711_jack_detect_handler returns 11 entries no less (8 closed)
https://github.com/thesofproject/linux/issues?q=rt711_jack_detect_handler

plbossart · 2022-03-09T01:31:25Z

@marc-hb can you help find the first report of a failure? I am down to a spurious pm_runtime resume after a system suspend, not sure what is happening.

marc-hb · 2022-03-10T04:55:08Z

I think this came with 5.17, maybe with

Merge/sound upstream 20220228 #3461 ?

According to the TCR database, the very first check-suspend-playback/capture failure on TGLU_RVP_SDW happened on January 29th:

This was with

linux/for-next/74cc53cf/linux-image-5.17.0-rc1-daily-default-20220128-0-13653-g74cc53cf59b6_5.17.0-rc1-linux-for-next-74cc53cf_amd64.deb
sof/main/40d2f6688367/fw/debug/sof-tgl.ri

[2022-01-29T20:15:10.418Z] 2022-01-29 20:15:09 UTC Sub-Test: [REMOTE_COMMAND] Run the command: rtcwake -m mem -s 5
[2022-01-29T20:15:10.418Z] rtcwake: wakeup from "mem" using /dev/rtc0 at Sat Jan 29 20:15:15 2022
[2022-01-29T20:15:22.710Z] 2022-01-29 20:15:16 UTC Sub-Test: [REMOTE_COMMAND] sleep for 5
[2022-01-29T20:15:22.710Z] 2022-01-29 20:15:21 UTC Sub-Test: [REMOTE_INFO] Check for the kernel log status
[2022-01-29T20:15:22.710Z] declare -- cmd="journalctl_cmd --since=@1643487301"
[2022-01-29T20:15:22.710Z] 2022-01-29 20:15:21 UTC [ERROR] Caught kernel log error
[2022-01-29T20:15:22.710Z] ===========================>>
[2022-01-29T20:15:22.710Z] [ 8113.494143] kernel: soundwire_intel soundwire_intel.link.0: IO transfer timed out, cmd 3 device 1 addr b921 len 1
[2022-01-29T20:15:22.710Z] [ 8113.494149] kernel: soundwire sdw-master-0: trf on Slave 1 failed:-110 write addr b921 count 0
[2022-01-29T20:15:22.710Z] [ 8113.494221] kernel: IO error in rt711_jack_detect_handler, ret -110
[2022-01-29T20:15:22.710Z] <<===========================
[2022-01-29T20:15:22.710Z] 2022-01-29 20:15:21 UTC Sub-Test: [REMOTE_ERROR] Caught error in kernel log
[2022-01-29T20:15:22.710Z] 2022-01-29 20:15:21 UTC Sub-Test: [REMOTE_ERROR] Starting func_exit_handler(), exit status=1, FUNCNAME stack:
[2022-01-29T20:15:22.710Z] 2022-01-29 20:15:21 UTC Sub-Test: [REMOTE_ERROR]  die()  @  /home/ubuntu/sof-test/test-case/../case-lib/lib.sh
[2022-01-29T20:15:22.710Z] 2022-01-29 20:15:21 UTC Sub-Test: [REMOTE_ERROR]  main()  @  /home/ubuntu/sof-test/test-case/check-suspend-resume.sh:91

Starting from January 29th I see one of two daily-tests failing: PASS, FAIL, PASS, FAIL,... There's no branch information in the TCR database but it looks like the topic/sof-dev daily was the one passing and the for-next daily was the one failing.

From February 2nd to February 24th, daily tests are consistently passing again.

After February 24th, daily tests are consistently failing except this one (which was on 5.16):

topic/sof-dev/96221091-2/linux-image-5.16.0-rc1-daily-default-20220227-0_5.16.0-rc1-topic-sof-dev-96221091-2_amd64.deb
ab715d8/fw/zephyr_debug_key/sof-tgl.ri

The first SOF Pull Requests started failing these two tests on TGLU_RVP_SDW on March 1st and then they almost never passed. #3461 was merged on Feb 28th

ujfalusi · 2022-03-10T08:59:40Z

@marc-hb, @bardliao, @plbossart, I think the reason for the error is:
https://sof-ci.01.org/linuxpr/PR3461/build7243/devicetest/?model=TGLU_RVP_SDW&testcase=check-suspend-resume-with-playback-5
When it works the log looks like this:

[ 1398.567313] kernel: rt711 sdw:0:025d:0711:00: rt711_calibration calibration complete, ret=0
[ 1398.568250] kernel: rt711 sdw:0:025d:0711:00: in rt711_jack_init enable
[ 1398.568332] kernel: rt711 sdw:0:025d:0711:00: [rt711_sdw_write] 3501 <= 0003
[ 1398.568335] kernel: rt711 sdw:0:025d:0711:00: rt711_io_init hw_init complete
[ 1398.568337] kernel: rt711 sdw:0:025d:0711:00: sdw_handle_slave_status: signaling initialization completion for Slave 1
...
[ 1398.577647] kernel: rt711 sdw:0:025d:0711:00: rt711_interrupt_callback control_port_stat=4
[ 1398.577977] kernel: soundwire_intel soundwire_intel.link.0: Slave status change: 0x20
[ 1398.818037] kernel: soundwire_intel soundwire_intel.link.0: Slave status change: 0x40
[ 1398.818252] kernel: rt711 sdw:0:025d:0711:00: Slave impl defined interrupt
[ 1398.818254] kernel: rt711 sdw:0:025d:0711:00: rt711_interrupt_callback control_port_stat=4
[ 1398.818658] kernel: soundwire_intel soundwire_intel.link.0: Slave status change: 0x20
[ 1399.072288] kernel: rt711 sdw:0:025d:0711:00: [rt711_sdw_read] b921 0000 => 80000000
[ 1399.073103] kernel: rt711 sdw:0:025d:0711:00: [rt711_sdw_read] 7520 85a0 9c20 aca0 => 00000800
[ 1399.073916] kernel: rt711 sdw:0:025d:0711:00: [rt711_sdw_read] 7520 85a0 9c20 aca0 => 00000030
[ 1399.073919] kernel: rt711 sdw:0:025d:0711:00: in rt711_jack_detect_handler, jack_type=0x3
[ 1399.073922] kernel: rt711 sdw:0:025d:0711:00: in rt711_jack_detect_handler, btn_type=0x0

[ 1401.551250] kernel: soundwire sdw-master-1: clock stop prep/de-prep done slave:15
[ 1401.966736] kernel: soundwire sdw-master-0: clock stop prep/de-prep done slave:15
[ 1401.966958] kernel: soundwire_intel soundwire_intel.link.0: intel_link_power_down: powering down all links

When it fails:

[ 1404.335718] kernel: rt711 sdw:0:025d:0711:00: in rt711_jack_init enable
[ 1404.335858] kernel: rt711 sdw:0:025d:0711:00: [rt711_sdw_write] 3501 <= 0003
[ 1404.335863] kernel: rt711 sdw:0:025d:0711:00: rt711_io_init hw_init complete
[ 1404.335865] kernel: rt711 sdw:0:025d:0711:00: sdw_handle_slave_status: signaling initialization completion for Slave 1
[ 1404.336071] kernel: soundwire_intel soundwire_intel.link.0: intel_link_power_down: powering down all links
...
[ 1404.785030] kernel: ACPI: EC: interrupt blocked
[ 1409.162488] kernel: ACPI: EC: interrupt unblocked
[ 1409.316458] kernel: soundwire_intel soundwire_intel.link.0: IO transfer timed out, cmd 3 device 1 addr b921 len 1
[ 1409.316466] kernel: soundwire sdw-master-0: trf on Slave 1 failed:-110 write addr b921 count 0
[ 1409.316538] kernel: IO error in rt711_jack_detect_handler, ret -110

...

[ 1410.107908] kernel: soundwire_intel soundwire_intel.link.0: intel_link_power_up: powering up all links
[ 1410.107914] kernel: soundwire_intel soundwire_intel.link.0: intel_link_power_up: first link up, programming SYNCPRD
[ 1410.108260] kernel: rt711 sdw:0:025d:0711:00: sdw_modify_slave_status: initializing enumeration and init completion for Slave 1
[ 1410.109150] kernel: soundwire_intel soundwire_intel.link.0: Slave status change: 0x2
[ 1410.109156] kernel: soundwire sdw-master-0: Slave attached, programming device number
[ 1410.109384] kernel: soundwire sdw-master-0: SDW Slave Addr: 20025d071100
[ 1410.109385] kernel: soundwire sdw-master-0: SDW Slave class_id 0x00, mfg_id 0x025d, part_id 0x0711, unique_id 0x0, version 0x2
[ 1410.109387] kernel: soundwire sdw-master-0: Slave already registered, reusing dev_num:1
...
[ 1410.576437] kernel: rt711 sdw:0:025d:0711:00: in rt711_jack_init enable
[ 1410.576540] kernel: rt711 sdw:0:025d:0711:00: [rt711_sdw_write] 3501 <= 0003
[ 1410.576542] kernel: rt711 sdw:0:025d:0711:00: rt711_io_init hw_init complete
[ 1410.576544] kernel: rt711 sdw:0:025d:0711:00: sdw_handle_slave_status: signaling initialization completion for Slave 1
...
[ 1410.586529] kernel: rt711 sdw:0:025d:0711:00: rt711_interrupt_callback control_port_stat=4
[ 1410.586872] kernel: soundwire_intel soundwire_intel.link.0: Slave status change: 0x20
[ 1410.606525] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1410.606581] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1410.814363] kernel: rt711 sdw:0:025d:0711:00: Slave impl defined interrupt
[ 1410.814366] kernel: rt711 sdw:0:025d:0711:00: rt711_interrupt_callback control_port_stat=4
[ 1411.069309] kernel: rt711 sdw:0:025d:0711:00: [rt711_sdw_read] b921 0000 => 80000000
[ 1411.070401] kernel: rt711 sdw:0:025d:0711:00: [rt711_sdw_read] 7520 85a0 9c20 aca0 => 00000800
[ 1411.071528] kernel: rt711 sdw:0:025d:0711:00: [rt711_sdw_read] 7520 85a0 9c20 aca0 => 00000030
[ 1411.071534] kernel: rt711 sdw:0:025d:0711:00: in rt711_jack_detect_handler, jack_type=0x3
[ 1411.071538] kernel: rt711 sdw:0:025d:0711:00: in rt711_jack_detect_handler, btn_type=0x0
[ 1414.140735] kernel: soundwire sdw-master-0: clock stop prep/de-prep done slave:15
[ 1414.140987] kernel: soundwire sdw-master-1: clock stop prep/de-prep done slave:15
[ 1414.141163] kernel: soundwire_intel soundwire_intel.link.1: intel_link_power_down: powering down all links

So in the failing case, the "in rt711_jack_init enable" print means that jack_detect_work is scheduled to run in ~250 ms. The sdw core is then powered off, and rt711_jack_detect_handler() tries to read register 0xb921. That register is volatile, so the read must go out on the bus, but the bus is down, so we get -ETIMEDOUT back as the error code.
Note that later on jack detection works again when the sequence is correct (the sdw links are not powered down right away).

When there is no error we also receive an interrupt; I don't know whether that is expected.

We might be missing a cancel-work call when sdw disables the link, or do we need a check for whether the link is up at all?
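
To make that ordering easier to spot in other CI logs, here is a small sketch of a dmesg scanner. It flags every case where "intel_link_power_down" is logged sooner after "in rt711_jack_init enable" than the jack detection work could have run. The 250 ms constant is only the estimate from the analysis above, not a value read from the driver, and the function name is made up for this example.

```python
import re

# Assumption: ~250 ms between rt711_jack_init and the jack detection work,
# taken from the estimate in this comment, not from the driver source.
JACK_WORK_DELAY_S = 0.250

# Matches lines such as "[ 1404.335718] kernel: <message>".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+kernel:\s+(.*)$")


def find_racy_power_downs(dmesg_text, delay_s=JACK_WORK_DELAY_S):
    """Return (jack_init_ts, power_down_ts) pairs where the link was powered
    down before the delayed jack detection work could have run."""
    racy = []
    last_jack_init = None
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        ts, msg = float(match.group(1)), match.group(2)
        if "in rt711_jack_init enable" in msg:
            last_jack_init = ts
        elif "intel_link_power_down" in msg and last_jack_init is not None:
            if ts - last_jack_init < delay_s:
                racy.append((last_jack_init, ts))
            last_jack_init = None
    return racy


# Two-line excerpts copied from the failing and passing logs above.
FAILING = """\
[ 1404.335718] kernel: rt711 sdw:0:025d:0711:00: in rt711_jack_init enable
[ 1404.336071] kernel: soundwire_intel soundwire_intel.link.0: intel_link_power_down: powering down all links
"""
PASSING = """\
[ 1398.568250] kernel: rt711 sdw:0:025d:0711:00: in rt711_jack_init enable
[ 1401.966958] kernel: soundwire_intel soundwire_intel.link.0: intel_link_power_down: powering down all links
"""

print(find_racy_power_downs(FAILING))
print(find_racy_power_downs(PASSING))
```

The failing excerpt has the power down only about 0.35 ms after the jack init, while the passing one has more than 3 seconds between the two.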

keqiaozhang · 2022-03-10T13:24:42Z

@plbossart @marc-hb @ujfalusi It took me more than 6 hours to bisect this issue, but luckily I found the bad commit:
e38f9ff

diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index 2c80765670bc..5991dddbc9ce 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -1340,11 +1340,11 @@ static void acpi_set_pnp_ids(acpi_handle handle, struct acpi_device_pnp *pnp,
                if (info->valid & ACPI_VALID_HID) {
                        acpi_add_id(pnp, info->hardware_id.string);
                        pnp->type.platform_id = 1;
-               }
-               if (info->valid & ACPI_VALID_CID) {
-                       cid_list = &info->compatible_id_list;
-                       for (i = 0; i < cid_list->count; i++)
-                               acpi_add_id(pnp, cid_list->ids[i].string);
+                       if (info->valid & ACPI_VALID_CID) {
+                               cid_list = &info->compatible_id_list;
+                               for (i = 0; i < cid_list->count; i++)
+                                       acpi_add_id(pnp, cid_list->ids[i].string);

I reverted this commit on the tip of sof-dev and tested again: 3*30 iterations passed, with no reproduction.

ujfalusi · 2022-03-10T13:49:57Z

@keqiaozhang, wow!

What I have found is that the issue is really easy to reproduce on tglu-rvp-sdw:

aplay -Dplughw:0,2 -fdat /dev/urandom &
sleep 1
sudo rtcwake -m mem -s 5

Don't forget to killall aplay

# aplay -l
**** List of PLAYBACK Hardware Devices ****
card 0: sofsoundwire [sof-soundwire], device 0: Jack Out (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 0: sofsoundwire [sof-soundwire], device 2: Speaker (*) []
  Subdevices: 0/1
  Subdevice #0: subdevice #0
card 0: sofsoundwire [sof-soundwire], device 5: HDMI 1 (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 0: sofsoundwire [sof-soundwire], device 6: HDMI 2 (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 0: sofsoundwire [sof-soundwire], device 7: HDMI 3 (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 0: sofsoundwire [sof-soundwire], device 8: HDMI 4 (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 1: Device [USB Audio Device], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0

With plughw:0,0 it never happens; it only happens if you use a PCM which does not have the rt711 in its path.

Can you check whether this reproduces for you as well? For me the reproduction rate is 100%.

keqiaozhang · 2022-03-10T14:12:59Z

@ujfalusi, I changed the test script to only test hw:0,2 with suspend/resume. I ran a lot of tests, with no reproduction after reverting e38f9ff. I'm pretty sure about the result.

ujfalusi · 2022-03-10T14:14:56Z

@keqiaozhang, I have checked as well and indeed the kernel with e38f9ff reverted is fine.

Let me attach the two dmesg logs, which were recorded before executing rtcwake.
sof-dev: sof-dev.log
with revert: e38f9ff63e6d403f8e52302d223e3c5c110872ee_reverted.log

@plbossart, @bardliao FYI.

Thanks @keqiaozhang !!

plbossart · 2022-03-10T14:49:53Z

yes, the issue is when we are playing on the speakers while the jack out path controlled by rt711 is suspended.
And the problem is that we resume the jack path which is already suspended - still no idea why it's related to this CID patch.

marc-hb · 2022-03-10T17:45:43Z

> still no idea why it's related to this CID patch.

https://en.wikipedia.org/wiki/Butterfly_effect
The double pendulum video is amazing.

ujfalusi · 2022-03-10T19:53:51Z

@plbossart, #3509 will show what is skipped after that patch. When the results are in, just run:
dmesg | grep acpi_set_pnp_ids

The patch should exclude some CIDs in case the HID is not set for the device, right? So those CIDs are not going to be passed to acpi_add_id().

The issue, I think, is that the rt711 got resumed without an sdw resume. Later on I think it is a bit different, since there sdw resumes first and then the rt711. Or not.

Let's wait for the results and check the dmesg, I hope it will give some hints.

Btw: is this only happening with the RVP? Do we have other devices with rt711 (not the rt711-sdca)? Could it be a broken ACPI table?

marc-hb · 2022-03-10T20:15:06Z

> still no idea why it's related to this CID patch.

Is it worth comparing the output of acpidump with and without e38f9ff maybe? I diff-ed two acpidumps recently for a totally different reason and that wasn't too hard even with my ACPI ignorance.

https://wiki.ubuntu.com/Kernel/Reference/ACPITricksAndTips

ujfalusi · 2022-03-10T20:22:23Z

We have a list from the TGLU_RVP_SDW device:

[    0.578019] kernel: ACPI: acpi_set_pnp_ids: CID without HID for hardware_id.string: (null), count: 2
[    0.578020] kernel: ACPI: acpi_set_pnp_ids: CID #0: string: PRP00001
[    0.578021] kernel: ACPI: acpi_set_pnp_ids: CID #1: string: PNP0A05
[    0.585528] kernel: ACPI: acpi_set_pnp_ids: CID without HID for hardware_id.string: (null), count: 2
[    0.585529] kernel: ACPI: acpi_set_pnp_ids: CID #0: string: PRP00001
[    0.585530] kernel: ACPI: acpi_set_pnp_ids: CID #1: string: PNP0A05

These are not acpi_add_id()-ed, but still not explaining what is going on as the same set is excluded on CML_RVP_SDW and there things are fine. We should also print out what is added for clarity?
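
For clarity, the effect of the hunk from e38f9ff on such a device can be modeled in a few lines. This is only a sketch of the two branches visible in the diff above, not of the whole acpi_set_pnp_ids() function. The flag class and function names are made up for the example, and the flag values are not the real ACPI_VALID_* bits.

```python
from enum import Flag, auto


class Valid(Flag):
    # Illustrative stand-ins for ACPI_VALID_HID / ACPI_VALID_CID.
    # The numeric values are NOT the ones ACPICA uses.
    HID = auto()
    CID = auto()


def ids_before_patch(valid, hid, cids):
    """IDs added by the hunk before e38f9ff: _CID handled on its own."""
    ids = []
    if valid & Valid.HID:
        ids.append(hid)
    if valid & Valid.CID:
        ids.extend(cids)
    return ids


def ids_after_patch(valid, hid, cids):
    """IDs added by the hunk after e38f9ff: _CID used only with a valid _HID."""
    ids = []
    if valid & Valid.HID:
        ids.append(hid)
        if valid & Valid.CID:
            ids.extend(cids)
    return ids


# Device with a _CID list but no _HID, as reported in the log above.
cids = ["PRP00001", "PNP0A05"]
print(ids_before_patch(Valid.CID, None, cids))
print(ids_after_patch(Valid.CID, None, cids))
```

For a device with both _HID and _CID the two versions return the same list, so only the devices without _HID change behavior.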

plbossart · 2022-03-10T20:46:00Z

I think this is the problem, the only two devices on RVP that don't have an _HID are audio things (SoundWire and USB Audio offload).
I think the patch to reject those devices is invalid or needs to grand-father in such exceptions to the rule.

            Device (SNDW)
            {
                Name (_ADR, 0x40000000)  // _ADR: Address
                Name (_CID, Package (0x02)  // _CID: Compatible ID
                {
                    "PRP00001",
                    "PNP0A05" /* Generic Container Device */
                })

            Device (UAOL)
            {
                Name (_ADR, 0x50000000)  // _ADR: Address
                Name (_CID, Package (0x02)  // _CID: Compatible ID
                {
                    "PRP00001",
                    "PNP0A05" /* Generic Container Device */
                })

In theory it's not valid to have both _ADR and _CID, but that has been the practice for many moons already.

plbossart · 2022-03-10T20:51:17Z

> These are not acpi_add_id()-ed, but still not explaining what is going on as the same set is excluded on CML_RVP_SDW and there things are fine.

@ujfalusi things are fine because CML_RVP_SDW uses a different RT700 codec where we don't schedule a jack detection workqueue. It's a combination of

a) doing a pm_runtime_resume before suspend (which is really not necessary if the codec is already pm_runtime suspended)
AND
b) the codec initialization scheduling a jack detection workqueue. This part is ONLY done for the RT711.

plbossart · 2022-03-10T21:00:48Z

Here's the TGL_RVP_SDW DSDT for reference.

dsdt.dsl.txt

marc-hb · 2022-03-11T05:53:18Z

This failure has been impacting every single pull request and daily run for about 10 days. It's happening only for a specific test on a very specific product. While it seems far from resolution, its reproduction seems to be under control, it even has been bisected. Can we SKIP testing suspend/resume only on that particular product until this is fixed? We already know it keeps failing every time.

ujfalusi · 2022-03-11T06:49:45Z

> > These are not acpi_add_id()-ed, but still not explaining what is going on as the same set is excluded on CML_RVP_SDW and there things are fine.

> @ujfalusi things are fine because CML_RVP_SDW uses a different RT700 codec where we don't schedule a jack detection workqueue.

Right, but there is ADLP_RVP_SDW, which has the exact same rt711 and schedules jack detection, yet it passes.

> It's a combination of
>
> a) doing a pm_runtime_resume before suspend (which is really not necessary if the codec is already pm_runtime suspended)

Hmm, but the driver cancels the delayed works on suspend.
Hmm x2:
the driver cancels the works on suspend if rt711->hw_init is true.
rt711_update_status() sets rt711->hw_init to false if the status changes to SDW_SLAVE_UNATTACHED.

What we see in the logs is that the hardware got initialized, jack detection was scheduled, and then all sdw links got disabled. There could be a runtime suspend after this (it is not visible in the logs), but it would leave the works in flight.
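
The two "Hmm x2" points can be put together in a toy model. It is a sketch of the hypothesis only: the class and method names are made up, and it assumes that an UNATTACHED status is reported between the initialization and the suspend, which the logs do not confirm.

```python
class Rt711Model:
    """Toy model of the hw_init handling described above; not driver code."""

    def __init__(self):
        self.hw_init = False
        self.jack_work_pending = False

    def io_init(self):
        # Initialization completes and jack detection work is scheduled.
        self.hw_init = True
        self.jack_work_pending = True

    def update_status_unattached(self):
        # A status change to SDW_SLAVE_UNATTACHED clears hw_init,
        # but does not touch the scheduled work.
        self.hw_init = False

    def suspend(self):
        # The works are cancelled only while hw_init is still set.
        if not self.hw_init:
            return
        self.jack_work_pending = False


# Case 1: init followed directly by suspend, the work is cancelled.
normal = Rt711Model()
normal.io_init()
normal.suspend()
print("normal sequence, work pending:", normal.jack_work_pending)

# Case 2: UNATTACHED is seen before suspend, the work stays in flight.
racy = Rt711Model()
racy.io_init()
racy.update_status_unattached()
racy.suspend()
print("unattached before suspend, work pending:", racy.jack_work_pending)
```

In the second case nothing cancels the work, so it would run later against a link that is already down.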

> AND
> b) the codec initialization scheduling a jack detection workqueue. This part is ONLY done for the RT711.

It is not that uncommon to try to detect the jack state after resume (the codec was off and the user could have inserted / removed / changed the connected head*).

[ Upstream commit 6d9f2da ]

commit e38f9ff ("ACPI: scan: Do not add device IDs from _CID if _HID is not valid") exposes a race condition on a TGL RVP device leading to a timeout.

The detailed analysis shows the RT711 codec driver scheduling a jack detection workqueue while attaching during a spurious pm_runtime resume, and the work function happens to be scheduled after the manager device is suspended.

The direct link between this ACPI patch and a spurious pm_runtime resume is not obvious; the most likely explanation is that a change in the ACPI device linked list management modifies the order in which the pm_runtime device status is checked and exposes a race condition that was probably present for a very long time, but was not identified.

We already have a check in the .prepare stage, where we will resume to full power from specific clock-stop modes. In all other cases, we don't need to resume to full power by default. Adding the SMART_SUSPEND flag prevents the spurious resume from happening.

BugLink: thesofproject/linux#3459
Fixes: 029bfd1 ("soundwire: intel: conditionally exit clock stop mode on system suspend")
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Reviewed-by: Rander Wang <rander.wang@intel.com>
Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Link: https://lore.kernel.org/r/20220420023241.14335-2-yung-chuan.liao@linux.intel.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>

[ Upstream commit e557bca ]

In typical use cases, the peripheral becomes pm_runtime active as a result of the ALSA/ASoC framework starting up a DAI. The parent/child hierarchy guarantees that the manager device will be fully resumed beforehand.

There is however a corner case where the manager device may become pm_runtime active, but without ALSA/ASoC requesting any functionality from the peripherals. In this case, the hardware peripheral device will report as ATTACHED and its initialization routine will be executed. If this initialization routine initiates any sort of deferred processing, there is a possibility that the manager could suspend without the peripheral suspend sequence being invoked: from the pm_runtime framework perspective, the peripheral is *already* suspended.

To avoid such disconnects between hardware state and pm_runtime state, this patch adds an asynchronous pm_request_resume() upon successful attach/initialization which will result in the proper resume/suspend sequence to be followed on the peripheral side.

BugLink: thesofproject/linux#3459
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Reviewed-by: Rander Wang <rander.wang@intel.com>
Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Link: https://lore.kernel.org/r/20220420023241.14335-4-yung-chuan.liao@linux.intel.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
Stable-dep-of: c40d6b3 ("soundwire: fix enumeration completion")
Signed-off-by: Sasha Levin <sashal@kernel.org>
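
The corner case in this commit message can be illustrated with a toy model of the mismatch between hardware state and pm_runtime state. This is a sketch under simplifying assumptions, not kernel code: all class and function names are made up, and pm_request_resume() is modeled as taking effect immediately, although the real call is asynchronous.

```python
ETIMEDOUT = 110  # Linux errno value, seen as "ret -110" in the logs above


class Peripheral:
    """Toy peripheral; runtime_active is what the PM framework believes."""

    def __init__(self):
        self.runtime_active = False
        self.work_pending = False

    def attach(self, request_resume):
        # Hardware reports ATTACHED and init queues deferred processing.
        self.work_pending = True
        if request_resume:
            # Models the effect of pm_request_resume(): the framework now
            # tracks the peripheral as active (simplified to be immediate).
            self.runtime_active = True

    def suspend(self):
        # The peripheral suspend sequence cancels the deferred work.
        self.work_pending = False
        self.runtime_active = False


def manager_suspend_then_run_work(peripheral):
    """Suspend the manager, then let any leftover work touch the bus."""
    if peripheral.runtime_active:
        # Children that the framework considers active are suspended first.
        peripheral.suspend()
    # The manager powers the link down here; leftover work now times out.
    if peripheral.work_pending:
        return -ETIMEDOUT
    return 0


def run(request_resume):
    peripheral = Peripheral()
    peripheral.attach(request_resume)
    return manager_suspend_then_run_work(peripheral)


print("without resume request:", run(False))
print("with resume request:", run(True))
```

Without the resume request the framework skips the peripheral suspend, because it believes the peripheral is already suspended, and the deferred work later hits a powered-down link.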

[ Upstream commit e557bca49b812908f380c56b5b4b2f273848b676 ] In typical use cases, the peripheral becomes pm_runtime active as a result of the ALSA/ASoC framework starting up a DAI. The parent/child hierarchy guarantees that the manager device will be fully resumed beforehand. There is however a corner case where the manager device may become pm_runtime active, but without ALSA/ASoC requesting any functionality from the peripherals. In this case, the hardware peripheral device will report as ATTACHED and its initialization routine will be executed. If this initialization routine initiates any sort of deferred processing, there is a possibility that the manager could suspend without the peripheral suspend sequence being invoked: from the pm_runtime framework perspective, the peripheral is *already* suspended. To avoid such disconnects between hardware state and pm_runtime state, this patch adds an asynchronous pm_request_resume() upon successful attach/initialization which will result in the proper resume/suspend sequence to be followed on the peripheral side. BugLink: thesofproject/linux#3459 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com> Reviewed-by: Rander Wang <rander.wang@intel.com> Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com> Link: https://lore.kernel.org/r/20220420023241.14335-4-yung-chuan.liao@linux.intel.com Signed-off-by: Vinod Koul <vkoul@kernel.org> Stable-dep-of: c40d6b3249b1 ("soundwire: fix enumeration completion") Signed-off-by: Sasha Levin <sashal@kernel.org>

Source: Kernel.org MR: 127843 Type: Integration Disposition: Backport from git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable linux-5.10.y ChangeID: 7d949774e7c1a93980db4961b898313ec09744c3 Description: [ Upstream commit e557bca ] In typical use cases, the peripheral becomes pm_runtime active as a result of the ALSA/ASoC framework starting up a DAI. The parent/child hierarchy guarantees that the manager device will be fully resumed beforehand. There is however a corner case where the manager device may become pm_runtime active, but without ALSA/ASoC requesting any functionality from the peripherals. In this case, the hardware peripheral device will report as ATTACHED and its initialization routine will be executed. If this initialization routine initiates any sort of deferred processing, there is a possibility that the manager could suspend without the peripheral suspend sequence being invoked: from the pm_runtime framework perspective, the peripheral is *already* suspended. To avoid such disconnects between hardware state and pm_runtime state, this patch adds an asynchronous pm_request_resume() upon successful attach/initialization which will result in the proper resume/suspend sequence to be followed on the peripheral side. BugLink: thesofproject/linux#3459 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com> Reviewed-by: Rander Wang <rander.wang@intel.com> Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com> Link: https://lore.kernel.org/r/20220420023241.14335-4-yung-chuan.liao@linux.intel.com Signed-off-by: Vinod Koul <vkoul@kernel.org> Stable-dep-of: c40d6b3 ("soundwire: fix enumeration completion") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Armin Kuster <akuster@mvista.com>

[ Upstream commit e557bca ] In typical use cases, the peripheral becomes pm_runtime active as a result of the ALSA/ASoC framework starting up a DAI. The parent/child hierarchy guarantees that the manager device will be fully resumed beforehand. There is however a corner case where the manager device may become pm_runtime active, but without ALSA/ASoC requesting any functionality from the peripherals. In this case, the hardware peripheral device will report as ATTACHED and its initialization routine will be executed. If this initialization routine initiates any sort of deferred processing, there is a possibility that the manager could suspend without the peripheral suspend sequence being invoked: from the pm_runtime framework perspective, the peripheral is *already* suspended. To avoid such disconnects between hardware state and pm_runtime state, this patch adds an asynchronous pm_request_resume() upon successful attach/initialization which will result in the proper resume/suspend sequence to be followed on the peripheral side. BugLink: thesofproject/linux#3459 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com> Reviewed-by: Rander Wang <rander.wang@intel.com> Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com> Link: https://lore.kernel.org/r/20220420023241.14335-4-yung-chuan.liao@linux.intel.com Signed-off-by: Vinod Koul <vkoul@kernel.org> Stable-dep-of: c40d6b3 ("soundwire: fix enumeration completion") Signed-off-by: Sasha Levin <sashal@kernel.org> (cherry picked from commit 7996facaf0ee3829386aa388183f6fffab95e1af) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>

stable inclusion
from stable-v5.10.190
commit 7d949774e7c1a93980db4961b898313ec09744c3
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I928UI
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=7d949774e7c1a93980db4961b898313ec09744c3

--------------------------------

[ Upstream commit e557bca ]

In typical use cases, the peripheral becomes pm_runtime active as a result of the ALSA/ASoC framework starting up a DAI. The parent/child hierarchy guarantees that the manager device will be fully resumed beforehand.

There is however a corner case where the manager device may become pm_runtime active, but without ALSA/ASoC requesting any functionality from the peripherals. In this case, the hardware peripheral device will report as ATTACHED and its initialization routine will be executed.

If this initialization routine initiates any sort of deferred processing, there is a possibility that the manager could suspend without the peripheral suspend sequence being invoked: from the pm_runtime framework perspective, the peripheral is *already* suspended.

To avoid such disconnects between hardware state and pm_runtime state, this patch adds an asynchronous pm_request_resume() upon successful attach/initialization which will result in the proper resume/suspend sequence to be followed on the peripheral side.

BugLink: thesofproject/linux#3459
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Reviewed-by: Rander Wang <rander.wang@intel.com>
Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Link: https://lore.kernel.org/r/20220420023241.14335-4-yung-chuan.liao@linux.intel.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
Stable-dep-of: c40d6b3 ("soundwire: fix enumeration completion")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: sanglipeng <sanglipeng1@jd.com>

stable inclusion
from stable-5.10.190
commit 7d949774e7c1a93980db4961b898313ec09744c3
category: bugfix
issue: #I9D31L
CVE: NA

Signed-off-by: wanxiaoqing <wanxiaoqing@huawei.com>

---------------------------------------

[ Upstream commit e557bca49b812908f380c56b5b4b2f273848b676 ]

In typical use cases, the peripheral becomes pm_runtime active as a result of the ALSA/ASoC framework starting up a DAI. The parent/child hierarchy guarantees that the manager device will be fully resumed beforehand.

There is however a corner case where the manager device may become pm_runtime active, but without ALSA/ASoC requesting any functionality from the peripherals. In this case, the hardware peripheral device will report as ATTACHED and its initialization routine will be executed.

If this initialization routine initiates any sort of deferred processing, there is a possibility that the manager could suspend without the peripheral suspend sequence being invoked: from the pm_runtime framework perspective, the peripheral is *already* suspended.

To avoid such disconnects between hardware state and pm_runtime state, this patch adds an asynchronous pm_request_resume() upon successful attach/initialization which will result in the proper resume/suspend sequence to be followed on the peripheral side.

BugLink: thesofproject/linux#3459
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Reviewed-by: Rander Wang <rander.wang@intel.com>
Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Link: https://lore.kernel.org/r/20220420023241.14335-4-yung-chuan.liao@linux.intel.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
Stable-dep-of: c40d6b3249b1 ("soundwire: fix enumeration completion")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: wanxiaoqing <wanxiaoqing@huawei.com>
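The corner case described in the commit message can be pictured with a small toy model. This is only a sketch in Python, not kernel code: the `Peripheral` and `Manager` classes, `run_scenario()`, and the `request_resume` flag are hypothetical names invented for illustration, and the model assumes that the peripheral's suspend callback is the only place where deferred work gets cancelled. It also collapses the asynchronous `pm_request_resume()` into an immediate state change.

```python
from dataclasses import dataclass, field

ETIMEDOUT = 110  # errno behind the "-110" seen in the issue's dmesg output


@dataclass
class Peripheral:
    """Toy stand-in for a codec driver (hypothetical, not the rt711 code)."""
    runtime_active: bool = False  # what the pm_runtime core believes
    pending_work: list = field(default_factory=list)

    def runtime_resume(self):
        self.runtime_active = True

    def runtime_suspend(self):
        # Assumption: the suspend callback cancels any deferred work.
        self.pending_work.clear()
        self.runtime_active = False


@dataclass
class Manager:
    """Toy stand-in for the SoundWire manager/link (hypothetical)."""
    peripheral: Peripheral
    link_on: bool = False

    def resume(self):
        self.link_on = True

    def attach(self, request_resume: bool):
        # Hardware reports ATTACHED; initialization schedules deferred work.
        self.peripheral.pending_work.append(self.io_transfer)
        if request_resume:
            # Models the effect of pm_request_resume() on the peripheral.
            self.peripheral.runtime_resume()

    def suspend(self):
        # A runtime-active child is suspended before its parent; a child
        # that is already marked suspended gets no callback at all.
        if self.peripheral.runtime_active:
            self.peripheral.runtime_suspend()
        self.link_on = False

    def io_transfer(self):
        # A transfer on a powered-off link times out.
        return 0 if self.link_on else -ETIMEDOUT


def run_scenario(request_resume: bool):
    """Resume the manager, attach the peripheral, suspend, then run leftovers."""
    peripheral = Peripheral()
    manager = Manager(peripheral)
    manager.resume()
    manager.attach(request_resume)
    manager.suspend()
    # Whatever work survived the suspend now runs against a link that is off.
    return [work() for work in peripheral.pending_work]
```

Calling `run_scenario(False)` leaves one work item behind that fails with the timeout error, while `run_scenario(True)` leaves nothing to run, because the peripheral's suspend path was exercised before the link went down.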

plbossart added the "SDW" label (Applies to SoundWire bus for codec connection) on Feb 25, 2022

plbossart added the "TGL" label (Applies to Tiger Lake platform) on Feb 25, 2022

bardliao mentioned this issue on Mar 7, 2022: Audio: volume: Optimize source and sink buffers usage (thesofproject/sof#5452, merged)

marc-hb added the "P1" label (Blocker bugs or important features) on Mar 8, 2022

marc-hb mentioned this issue Mar 9, 2022

Revert "check-sof-logger: disable dma_nudge() workaround for stuck DMA issue #4333 thesofproject/sof-test#876

Merged

marc-hb mentioned this issue on Mar 10, 2022: Merge/sound upstream 20220228 (#3461, merged)

marc-hb mentioned this issue on Mar 11, 2022: xtensa-build-zephyr.py: un-hardcode execute_command() wrapper (thesofproject/sof#5516, merged)

marc-hb added the "suspend resume" label (Issues related to suspend resume, e.g. rtcwake) on Jul 19, 2022


