[BUG] I/O error when testing multiple pause/resume, XRUN nothing to copy #7759

keqiaozhang · 2023-06-07T08:27:23Z

Describe the bug
Observed this issue in CI tests, it happens on ADL/TGL IPC4 platform. But there're no errors in dmesg or mtrace. The reproduce rate is about 50%.

To Reproduce
~/sof-test/test-case/multiple-pause-resume.sh -r 50

Reproduction Rate
50%

mtrace

[  111.337111] <err> dma_dw_common: dw_dma_get_status: xrun detected
[  111.337125] <wrn> dai_comp: dai_common_copy: comp:1 0x40001 dai_zephyr_copy(): nothing to copy
[  111.338111] <err> dma_dw_common: dw_dma_get_status: xrun detected
[  111.338146] <wrn> dai_comp: dai_common_copy: comp:1 0x40001 dai_zephyr_copy(): nothing to copy
[  111.339111] <err> dma_dw_common: dw_dma_get_status: xrun detected
[  111.339133] <wrn> dai_comp: dai_common_copy: comp:1 0x40001 dai_zephyr_copy(): nothing to copy
[  111.340113] <err> dma_dw_common: dw_dma_get_status: xrun detected
[  111.340128] <wrn> dai_comp: dai_common_copy: comp:1 0x40001 dai_zephyr_copy(): nothing to copy
[  111.341111] <err> dma_dw_common: dw_dma_get_status: xrun detected
[  111.341146] <wrn> dai_comp: dai_common_copy: comp:1 0x40001 dai_zephyr_copy(): nothing to copy
[  111.342111] <err> dma_dw_common: dw_dma_get_status: xrun detected
[  111.342133] <wrn> dai_comp: dai_common_copy: comp:1 0x40001 dai_zephyr_copy(): nothing to copy
[  111.343111] <err> dma_dw_common: dw_dma_get_status: xrun detected
[  111.343125] <wrn> dai_comp: dai_common_copy: comp:1 0x40001 dai_zephyr_copy(): nothing to copy

Environment

Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver).
- Kernel: {17b5d3eee2c6}
- SOF: {1dd1939f05ef}
Name of the platform(s) on which the bug is observed.
- Platform:TGLU_UP_HDA_IPC4ZPH TGLU_RVP_SDW_IPC4ZPH ADLP_SKU0B00_SDCA_IPC4ZPH

Screenshots or console output

#+                                                 | 00%
(49/50) pcm'HDMI3' cmd'aplay' id'2': Wait for 43 ms before pause

=== PAUSE ===                                                             
(49/50) pcm'HDMI3' cmd'aplay' id'2': Wait for 25 ms before resume
                                                                           

#+                                                 | 00%
(50/50) pcm'HDMI3' cmd'aplay' id'2': Wait for 25 ms before pause
 
#+                                                 | 00%
(50/50) pcm'HDMI3' cmd'aplay' id'2': Wait for 29 ms before pause

=== PAUSE ===                                                             
(50/50) pcm'HDMI3' cmd'aplay' id'2': Wait for 22 ms before resume

2023-06-05 22:45:44 UTC [REMOTE_INFO] pipeline: HDMI1 with aplay
2023-06-05 22:45:44 UTC [REMOTE_INFO] pipeline: HDMI3 with aplay
2023-06-05 22:45:44 UTC [REMOTE_INFO] Check expect exit status
declare -- cmd="journalctl_cmd --since=@1686005122"
2023-06-05 22:45:45 UTC [REMOTE_ERROR] pause resume PID 37566 had non-zero exit status
2023-06-05 22:45:45 UTC [REMOTE_ERROR] Starting func_exit_handler(), exit status=1, FUNCNAME stack:
2023-06-05 22:45:45 UTC [REMOTE_ERROR]  die()  @  /home/ubuntu/sof-test/test-case/../case-lib/lib.sh
2023-06-05 22:45:45 UTC [REMOTE_ERROR]  main()  @  /home/ubuntu/sof-test/test-case/multiple-pause-resume.sh:191
2023-06-05 22:45:46 UTC [REMOTE_INFO] pkill -TERM -f mtrace-reader.py
2023-06-05 22:45:46 UTC [REMOTE_INFO] ktime=4293 sof-test PID=36343: ending
2023-06-05 22:45:46 UTC [REMOTE_INFO] Test Result: FAIL!

dmesg.txt

mtrace.txt

The text was updated successfully, but these errors were encountered:

fredoh9 · 2023-06-27T17:42:47Z

For today, TGLU_RVP_SDW_IPC4ZPH and TGLU_RVP_SDW_IPC4ZPH reproduced this issue. Truely P1.
Intel internal daily test: planresultdetail/28251

marc-hb · 2023-06-29T23:09:03Z

Still happening with a very high reproduction rate, I see it in about 1/2 or 1/3 of the test results I look at. For instance this one: https://sof-ci.01.org/sofpr/PR7878/build10245/devicetest

fredoh9 · 2023-07-25T17:58:02Z

This is not 100% but It happens with really high reproduction rate, almost daily

kv2019i · 2023-08-09T09:22:29Z

Reproduced this locally, debug ongoing.

kv2019i · 2023-08-14T18:08:17Z

The error is raised in Linux pcm_lib.c:wait_for_avail() with hw_ptr not being updated for 100ms. So far no idea what is causing this.

kv2019i · 2023-08-15T14:27:02Z

This log entry is now at the focal point:
[ 111.337111] dma_dw_common: dw_dma_get_status: xrun detected

I've added debugging and the sequence of events seems to be:

two pipelines running (e.g. SDW capture or playback, but can be DMIC as well, or some other combo)
PCM1 is released from pause
PCM2 is stopped and closed very soon after previous step
DMA for PCM1 stops getting data from DAI
kernel driver times out in wait_for_avail()
- the wait-delay is randomized in multiple-pause-resume test so the error is not always reported as a test failure as -EIO error is not always reported to ALSA level (ALSA closed PCM1 just before the timeout elapses)

Still no clue what causes PCM1 DMA to stop. There's seemingly no change to the hardware programming when this occurs.

fredoh9 · 2023-08-15T18:10:35Z

Very similar failure found today with check-suspend-resume-with-playback-5.sh in ADLP_SKU0B00_SDCA_IPC4ZPH
There are tons of "nothing to copy" but there is NO, "dma_dw_common: dw_dma_get_status: xrun detected"

Intel Internal Daily test link: planresultdetail/30115?model=ADLP_SKU0B00_SDCA_IPC4ZPH&testcase=check-suspend-resume-with-playback-5

kv2019i · 2023-08-16T07:32:48Z

Ok thanks @fredoh9 that is interesting. The "nothing to copy" are not necessarily a sign of problem. At least with SDW, it takes time after DMA has started and before the codec starts emitting data, so there's a period time (especially after paused->released transition) when FW is complaining about no data, but this seems to be normal (still need to check kernel side codec handling).

I noticed this bug does not happen if I relaxed the pause/resume timing:
~/sof-test/test-case/multiple-pause-resume.sh -r 50 -i 200 -a 400

... I'm starting to think the delay in getting data after paused->released with SDW, is somehow causing this bug to appear. The test case usually has a very fast pace of pause/release cycles, and this somehow causes the failure to happen.

kv2019i · 2023-08-16T18:17:07Z

I'm no longer sure this is a SOF issue (or a kernel issue). It seems the SDW capture stream is unable to recover from paused state, it a playback SDW stream has been closed during this time. FYI @bardliao @plbossart @ranj063

I can now reproduce this with following simple recipe:

sh> arecord -Dplughw:0,1 -vv -r48000 -c2 -fS16_LE -i /dev/null
pause the stream with space and switch to another terminal
sh> aplay -vv -Dplughw:0,0 -d 3 -traw /dev/zero
switch back to terminal where arecord is running, unpause

This hits following error 100% of the time:

=== PAUSE ===
#+                                                 00%
arecord: pcm_read:2153: read error: Input/output error

With instrumented FW build, I can see FW starts the released stream normally, but no data is received from the codec.

fredoh9 · 2023-08-16T18:22:14Z

With multiple-pause-resume-50.sh, the problem is reproduced in almost daily.

And similar failure is found today again with check-suspend-resume-with-playback-5.sh in ADLP_SKU0B00_SDCA_IPC4ZPH.

Intel internal daily test link:
planresultdetail/30191?model=ADLP_SKU0B00_SDCA_IPC4ZPH&testcase=check-suspend-resume-with-playback-5

plbossart · 2023-08-16T18:39:46Z

wait I am confused, I thought there were non-SDW platforms affected? Even earlier this bug made references to TGLU_UP_HDA_IPC4ZPH

Are we now saying this is a SoundWire only issue? That'd be an interesting point indeed.

fredoh9 · 2023-08-16T18:54:01Z

This is NOT SoundWire only problem, but looks SDW platform is easier to reproduce.

3 days ago, TGLU_UP_HDA_IPC4ZPH had the problem.
planresultdetail/30040?model=TGLU_UP_HDA_IPC4ZPH&testcase=multiple-pause-resume-50

kv2019i · 2023-08-17T08:18:27Z

@fredoh9 @plbossart I'd say we have to separate bugs here and the case that has a VERY high reproduction rate is specific to SDW. This can be identified by user-space (arecord) getting -EIO error, and FW mtrace showing "xrun detected".

This case on HDA (plan 30040) seem to be different case. Test case is same, but here the error happens when executing pause
and the error is:
aplay: do_pause:1567: pause push error: File descriptor in bad state

So both the action where error happen, and resulting error message, are different. So I'd suggest we tag this as SDW only and file a different ticket to track the HDA issue.

kv2019i · 2023-08-17T14:56:50Z

Even simpler test case to cause the fail (with sof-tgl-rt711-4ch.tplg):

bug7759.sh

--cut--
set -x
arecord -Dplughw:0,1 -r48000 -c2 -fS16_LE -i /dev/null -d15 || echo arecord failed with $? &
aplay -Dplughw:0,0 -d2 -r48000 -c2 -fS16_LE -traw /dev/zero
sleep 3
--cut--

Output when running with sof-tgl-rt711-4ch.tplg:

$ ./bug7759.sh
++ aplay -Dplughw:0,0 -d2 -r48000 -c2 -fS16_LE -traw /dev/zero
++ arecord -Dplughw:0,1 -r48000 -c2 -fS16_LE -i /dev/null -d15
Playing raw data '/dev/zero' : Signed 16 bit Little Endian, Rate 48000 Hz, Recording WAVE '/dev/null' : StereoSigned 16 bit Little Endian, 
Rate 48000 Hz, Stereo
++ sleep 3
arecord: pcm_read:2153: read error: Input/output error
++ echo arecord failed with 1
arecord failed with 1

@bardliao @plbossart do you see the same? It seems just stopping playback to codec will kill capture.

kv2019i · 2023-08-17T17:55:22Z

Some bisect testing:

test with IPC3 (same kernel version) and SOF2.2 -> error not seen, so kernel soundwire stack seems fine
test IPC4 and SOF2.6.1 -> error not seen!

Will try to do a bisect next based on the SOF FW version.

kv2019i · 2023-08-17T18:22:40Z

Bisected to Zephyr update in commit bf6ae6d . Will proceed to bisect the change on Zephyr side.

plbossart · 2023-08-17T21:34:16Z

@kv2019i is this a TGL-only issue? I can't reproduce the problem on MTL

ubuntu@jf-mtlp-rvp-sdw-2:~$ bash bug7759.sh 
+ aplay -Dplughw:0,0 -d2 -r48000 -c2 -fS16_LE -traw /dev/zero
+ arecord -Dplughw:0,1 -r48000 -c2 -fS16_LE -i /dev/null -d15
Playing raw data '/dev/zero' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
Recording WAVE '/dev/null' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
+ sleep 3

[ 66.398274] sof-audio-pci-intel-mtl 0000:00:1f.3: Booted firmware version: 2.6.99.1

plbossart · 2023-08-17T21:41:34Z

I can reproduce this on ADL - looks like a cAVS 2.5 issue then?

kv2019i · 2023-08-18T08:47:09Z

@plbossart Now found the rootcause and have a fix in review. The problem is in code common to all Intel platforms, but for some reason this has triggered only on cAVS2.5 platforms. This is related to configuring SDW I/O pin ownership -> explains why data flow stopped when one ALH stream was closed.

Let's see if the fix passes testing.

Refcounting is used to track ALH block usage and to call alh_claim_ownership()/alh_release_ownership() accordingly. This is however incorrectly done on ALH instance basis, which means when one instance is released, ownership can be released even though one ALH instance is still active. Fix the logic by tracking ALH usage as a global property which matches the alh_claim_ownership/alh_release_ownership semantics. Link: thesofproject/sof#7759 Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>

Refcounting is used to track ALH block usage and to call alh_claim_ownership()/alh_release_ownership() accordingly. This is however incorrectly done on ALH instance basis, which means when one instance is released, ownership can be released even though one ALH instance is still active. Fix the logic by tracking ALH usage as a global property which matches the alh_claim_ownership/alh_release_ownership semantics. (cherry picked from commit f764e7e) Original-Link: thesofproject/sof#7759 Original-Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> GitOrigin-RevId: f764e7e Change-Id: I92fcb23c199a6a7e74c412836a4cda4f1c647796 Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/zephyr/+/4806730 Tested-by: Fabio Baltieri <fabiobaltieri@google.com> Tested-by: ChromeOS Prod (Robot) <chromeos-ci-prod@chromeos-bot.iam.gserviceaccount.com> Commit-Queue: Fabio Baltieri <fabiobaltieri@google.com> Reviewed-by: Fabio Baltieri <fabiobaltieri@google.com>

kv2019i · 2023-08-23T16:35:16Z

@keqiaozhang @fredoh9 Should be fixed via #8085 now. I've done two test runs with the fix included and both were fine. But please check and close if good.

keqiaozhang · 2023-08-25T06:03:07Z

Confirmed that this issue has been fixed by #8085. Closing this bug.

Refcounting is used to track ALH block usage and to call alh_claim_ownership()/alh_release_ownership() accordingly. This is however incorrectly done on ALH instance basis, which means when one instance is released, ownership can be released even though one ALH instance is still active. Fix the logic by tracking ALH usage as a global property which matches the alh_claim_ownership/alh_release_ownership semantics. Link: thesofproject/sof#7759 Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>

keqiaozhang added bug Something isn't working as expected P2 Critical bugs or normal features TGL Applies to Tiger Lake ADL Applies to Alder Lake platform IPC4 Issues observed with IPC4 (same IPC as Windows) labels Jun 7, 2023

fredoh9 mentioned this issue Jun 14, 2023

[BUG] multiple-pause-resume.sh failed input/output error, mtrace has many xrun in ADLP_SKU0B00_SDCA #7804

Closed

fredoh9 changed the title ~~[BUG] I/O error when testing multiple pause/resume~~ [BUG] I/O error when testing multiple pause/resume, XRUN nothing to copy Jun 23, 2023

marc-hb added P1 Blocker bugs or important features Intel Linux Daily tests This issue can be found in internal Linux daily tests and removed P2 Critical bugs or normal features labels Jun 23, 2023

marc-hb mentioned this issue Jun 30, 2023

Refine logger thesofproject/sof-test#1065

Closed

lgirdwood added this to the v2.7 milestone Jul 3, 2023

kv2019i self-assigned this Aug 8, 2023

kv2019i mentioned this issue Aug 14, 2023

ipc4: set_dx: ignoring logger status notifications #8029

Merged

kv2019i mentioned this issue Aug 18, 2023

drivers: dai: alh: fix refcount logic for ALH ownership zephyrproject-rtos/zephyr#61632

Merged

mengdonglin added the Fix Submitted Fix has been submitted, to be fully verified label Aug 22, 2023

kv2019i mentioned this issue Aug 23, 2023

west.yml: update Zephyr to b92e749c3e5f #8085

Merged

keqiaozhang closed this as completed Aug 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] I/O error when testing multiple pause/resume, XRUN nothing to copy #7759

[BUG] I/O error when testing multiple pause/resume, XRUN nothing to copy #7759

keqiaozhang commented Jun 7, 2023 •

edited

fredoh9 commented Jun 27, 2023 •

edited

marc-hb commented Jun 29, 2023

fredoh9 commented Jul 25, 2023

kv2019i commented Aug 9, 2023

kv2019i commented Aug 14, 2023

kv2019i commented Aug 15, 2023

fredoh9 commented Aug 15, 2023

kv2019i commented Aug 16, 2023

kv2019i commented Aug 16, 2023 •

edited

fredoh9 commented Aug 16, 2023

plbossart commented Aug 16, 2023

fredoh9 commented Aug 16, 2023 •

edited

kv2019i commented Aug 17, 2023

kv2019i commented Aug 17, 2023 •

edited

kv2019i commented Aug 17, 2023

kv2019i commented Aug 17, 2023

plbossart commented Aug 17, 2023 •

edited

plbossart commented Aug 17, 2023

kv2019i commented Aug 18, 2023

kv2019i commented Aug 23, 2023

keqiaozhang commented Aug 25, 2023

[BUG] I/O error when testing multiple pause/resume, XRUN nothing to copy #7759

[BUG] I/O error when testing multiple pause/resume, XRUN nothing to copy #7759

Comments

keqiaozhang commented Jun 7, 2023 • edited

fredoh9 commented Jun 27, 2023 • edited

marc-hb commented Jun 29, 2023

fredoh9 commented Jul 25, 2023

kv2019i commented Aug 9, 2023

kv2019i commented Aug 14, 2023

kv2019i commented Aug 15, 2023

fredoh9 commented Aug 15, 2023

kv2019i commented Aug 16, 2023

kv2019i commented Aug 16, 2023 • edited

fredoh9 commented Aug 16, 2023

plbossart commented Aug 16, 2023

fredoh9 commented Aug 16, 2023 • edited

kv2019i commented Aug 17, 2023

kv2019i commented Aug 17, 2023 • edited

kv2019i commented Aug 17, 2023

kv2019i commented Aug 17, 2023

plbossart commented Aug 17, 2023 • edited

plbossart commented Aug 17, 2023

kv2019i commented Aug 18, 2023

kv2019i commented Aug 23, 2023

keqiaozhang commented Aug 25, 2023

keqiaozhang commented Jun 7, 2023 •

edited

fredoh9 commented Jun 27, 2023 •

edited

kv2019i commented Aug 16, 2023 •

edited

fredoh9 commented Aug 16, 2023 •

edited

kv2019i commented Aug 17, 2023 •

edited

plbossart commented Aug 17, 2023 •

edited