Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] I/O error when testing multiple pause/resume, XRUN nothing to copy #7759

Closed
keqiaozhang opened this issue Jun 7, 2023 · 21 comments
Closed
Assignees
Labels
ADL Applies to Alder Lake platform bug Something isn't working as expected Fix Submitted Fix has been submitted, to be fully verified Intel Linux Daily tests This issue can be found in internal Linux daily tests IPC4 Issues observed with IPC4 (same IPC as Windows) P1 Blocker bugs or important features TGL Applies to Tiger Lake
Milestone

Comments

@keqiaozhang
Copy link
Collaborator

keqiaozhang commented Jun 7, 2023

Describe the bug
Observed this issue in CI tests, it happens on ADL/TGL IPC4 platform. But there're no errors in dmesg or mtrace. The reproduce rate is about 50%.

To Reproduce
~/sof-test/test-case/multiple-pause-resume.sh -r 50

Reproduction Rate
50%

mtrace

[  111.337111] <err> dma_dw_common: dw_dma_get_status: xrun detected
[  111.337125] <wrn> dai_comp: dai_common_copy: comp:1 0x40001 dai_zephyr_copy(): nothing to copy
[  111.338111] <err> dma_dw_common: dw_dma_get_status: xrun detected
[  111.338146] <wrn> dai_comp: dai_common_copy: comp:1 0x40001 dai_zephyr_copy(): nothing to copy
[  111.339111] <err> dma_dw_common: dw_dma_get_status: xrun detected
[  111.339133] <wrn> dai_comp: dai_common_copy: comp:1 0x40001 dai_zephyr_copy(): nothing to copy
[  111.340113] <err> dma_dw_common: dw_dma_get_status: xrun detected
[  111.340128] <wrn> dai_comp: dai_common_copy: comp:1 0x40001 dai_zephyr_copy(): nothing to copy
[  111.341111] <err> dma_dw_common: dw_dma_get_status: xrun detected
[  111.341146] <wrn> dai_comp: dai_common_copy: comp:1 0x40001 dai_zephyr_copy(): nothing to copy
[  111.342111] <err> dma_dw_common: dw_dma_get_status: xrun detected
[  111.342133] <wrn> dai_comp: dai_common_copy: comp:1 0x40001 dai_zephyr_copy(): nothing to copy
[  111.343111] <err> dma_dw_common: dw_dma_get_status: xrun detected
[  111.343125] <wrn> dai_comp: dai_common_copy: comp:1 0x40001 dai_zephyr_copy(): nothing to copy

Environment

  1. Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver).

  2. Name of the platform(s) on which the bug is observed.

    • Platform:TGLU_UP_HDA_IPC4ZPH TGLU_RVP_SDW_IPC4ZPH ADLP_SKU0B00_SDCA_IPC4ZPH

Screenshots or console output

#+                                                 | 00%
(49/50) pcm'HDMI3' cmd'aplay' id'2': Wait for 43 ms before pause

=== PAUSE ===                                                             
(49/50) pcm'HDMI3' cmd'aplay' id'2': Wait for 25 ms before resume
                                                                           

#+                                                 | 00%
(50/50) pcm'HDMI3' cmd'aplay' id'2': Wait for 25 ms before pause
 
#+                                                 | 00%
(50/50) pcm'HDMI3' cmd'aplay' id'2': Wait for 29 ms before pause

=== PAUSE ===                                                             
(50/50) pcm'HDMI3' cmd'aplay' id'2': Wait for 22 ms before resume

2023-06-05 22:45:44 UTC [REMOTE_INFO] pipeline: HDMI1 with aplay
2023-06-05 22:45:44 UTC [REMOTE_INFO] pipeline: HDMI3 with aplay
2023-06-05 22:45:44 UTC [REMOTE_INFO] Check expect exit status
declare -- cmd="journalctl_cmd --since=@1686005122"
2023-06-05 22:45:45 UTC [REMOTE_ERROR] pause resume PID 37566 had non-zero exit status
2023-06-05 22:45:45 UTC [REMOTE_ERROR] Starting func_exit_handler(), exit status=1, FUNCNAME stack:
2023-06-05 22:45:45 UTC [REMOTE_ERROR]  die()  @  /home/ubuntu/sof-test/test-case/../case-lib/lib.sh
2023-06-05 22:45:45 UTC [REMOTE_ERROR]  main()  @  /home/ubuntu/sof-test/test-case/multiple-pause-resume.sh:191
2023-06-05 22:45:46 UTC [REMOTE_INFO] pkill -TERM -f mtrace-reader.py
2023-06-05 22:45:46 UTC [REMOTE_INFO] ktime=4293 sof-test PID=36343: ending
2023-06-05 22:45:46 UTC [REMOTE_INFO] Test Result: FAIL!

dmesg.txt

mtrace.txt

@keqiaozhang keqiaozhang added bug Something isn't working as expected P2 Critical bugs or normal features TGL Applies to Tiger Lake ADL Applies to Alder Lake platform IPC4 Issues observed with IPC4 (same IPC as Windows) labels Jun 7, 2023
@fredoh9 fredoh9 changed the title [BUG] I/O error when testing multiple pause/resume [BUG] I/O error when testing multiple pause/resume, XRUN nothing to copy Jun 23, 2023
@marc-hb marc-hb added P1 Blocker bugs or important features Intel Linux Daily tests This issue can be found in internal Linux daily tests and removed P2 Critical bugs or normal features labels Jun 23, 2023
@fredoh9
Copy link
Contributor

fredoh9 commented Jun 27, 2023

For today, TGLU_RVP_SDW_IPC4ZPH and TGLU_RVP_SDW_IPC4ZPH reproduced this issue. Truely P1.
Intel internal daily test: planresultdetail/28251

@marc-hb
Copy link
Collaborator

marc-hb commented Jun 29, 2023

Still happening with a very high reproduction rate, I see it in about 1/2 or 1/3 of the test results I look at. For instance this one: https://sof-ci.01.org/sofpr/PR7878/build10245/devicetest

@lgirdwood lgirdwood added this to the v2.7 milestone Jul 3, 2023
@fredoh9
Copy link
Contributor

fredoh9 commented Jul 25, 2023

This is not 100% but It happens with really high reproduction rate, almost daily

@kv2019i kv2019i self-assigned this Aug 8, 2023
@kv2019i
Copy link
Collaborator

kv2019i commented Aug 9, 2023

Reproduced this locally, debug ongoing.

@kv2019i
Copy link
Collaborator

kv2019i commented Aug 14, 2023

The error is raised in Linux pcm_lib.c:wait_for_avail() with hw_ptr not being updated for 100ms. So far no idea what is causing this.

@kv2019i
Copy link
Collaborator

kv2019i commented Aug 15, 2023

This log entry is now at the focal point:
[ 111.337111] dma_dw_common: dw_dma_get_status: xrun detected

I've added debugging and the sequence of events seems to be:

  • two pipelines running (e.g. SDW capture or playback, but can be DMIC as well, or some other combo)
  • PCM1 is released from pause
  • PCM2 is stopped and closed very soon after previous step
  • DMA for PCM1 stops getting data from DAI
  • kernel driver times out in wait_for_avail()
    - the wait-delay is randomized in multiple-pause-resume test so the error is not always reported as a test failure as -EIO error is not always reported to ALSA level (ALSA closed PCM1 just before the timeout elapses)

Still no clue what causes PCM1 DMA to stop. There's seemingly no change to the hardware programming when this occurs.

@fredoh9
Copy link
Contributor

fredoh9 commented Aug 15, 2023

Very similar failure found today with check-suspend-resume-with-playback-5.sh in ADLP_SKU0B00_SDCA_IPC4ZPH
There are tons of "nothing to copy" but there is NO, "dma_dw_common: dw_dma_get_status: xrun detected"

Intel Internal Daily test link: planresultdetail/30115?model=ADLP_SKU0B00_SDCA_IPC4ZPH&testcase=check-suspend-resume-with-playback-5

@kv2019i
Copy link
Collaborator

kv2019i commented Aug 16, 2023

Ok thanks @fredoh9 that is interesting. The "nothing to copy" are not necessarily a sign of problem. At least with SDW, it takes time after DMA has started and before the codec starts emitting data, so there's a period time (especially after paused->released transition) when FW is complaining about no data, but this seems to be normal (still need to check kernel side codec handling).

I noticed this bug does not happen if I relaxed the pause/resume timing:
~/sof-test/test-case/multiple-pause-resume.sh -r 50 -i 200 -a 400

... I'm starting to think the delay in getting data after paused->released with SDW, is somehow causing this bug to appear. The test case usually has a very fast pace of pause/release cycles, and this somehow causes the failure to happen.

@kv2019i
Copy link
Collaborator

kv2019i commented Aug 16, 2023

I'm no longer sure this is a SOF issue (or a kernel issue). It seems the SDW capture stream is unable to recover from paused state, it a playback SDW stream has been closed during this time. FYI @bardliao @plbossart @ranj063

I can now reproduce this with following simple recipe:

sh> arecord -Dplughw:0,1 -vv -r48000 -c2 -fS16_LE -i /dev/null
pause the stream with space and switch to another terminal
sh> aplay -vv -Dplughw:0,0 -d 3 -traw /dev/zero
switch back to terminal where arecord is running, unpause

This hits following error 100% of the time:

=== PAUSE ===
#+                                                 00%
arecord: pcm_read:2153: read error: Input/output error

With instrumented FW build, I can see FW starts the released stream normally, but no data is received from the codec.

@fredoh9
Copy link
Contributor

fredoh9 commented Aug 16, 2023

With multiple-pause-resume-50.sh, the problem is reproduced in almost daily.

And similar failure is found today again with check-suspend-resume-with-playback-5.sh in ADLP_SKU0B00_SDCA_IPC4ZPH.

Intel internal daily test link:
planresultdetail/30191?model=ADLP_SKU0B00_SDCA_IPC4ZPH&testcase=check-suspend-resume-with-playback-5

@plbossart
Copy link
Member

wait I am confused, I thought there were non-SDW platforms affected? Even earlier this bug made references to TGLU_UP_HDA_IPC4ZPH

Are we now saying this is a SoundWire only issue? That'd be an interesting point indeed.

@fredoh9
Copy link
Contributor

fredoh9 commented Aug 16, 2023

This is NOT SoundWire only problem, but looks SDW platform is easier to reproduce.

3 days ago, TGLU_UP_HDA_IPC4ZPH had the problem.
planresultdetail/30040?model=TGLU_UP_HDA_IPC4ZPH&testcase=multiple-pause-resume-50

@kv2019i
Copy link
Collaborator

kv2019i commented Aug 17, 2023

@fredoh9 @plbossart I'd say we have to separate bugs here and the case that has a VERY high reproduction rate is specific to SDW. This can be identified by user-space (arecord) getting -EIO error, and FW mtrace showing "xrun detected".

This case on HDA (plan 30040) seem to be different case. Test case is same, but here the error happens when executing pause
and the error is:
aplay: do_pause:1567: pause push error: File descriptor in bad state

So both the action where error happen, and resulting error message, are different. So I'd suggest we tag this as SDW only and file a different ticket to track the HDA issue.

@kv2019i
Copy link
Collaborator

kv2019i commented Aug 17, 2023

Even simpler test case to cause the fail (with sof-tgl-rt711-4ch.tplg):

bug7759.sh

--cut--
set -x
arecord -Dplughw:0,1 -r48000 -c2 -fS16_LE -i /dev/null -d15 || echo arecord failed with $? &
aplay -Dplughw:0,0 -d2 -r48000 -c2 -fS16_LE -traw /dev/zero
sleep 3
--cut--

Output when running with sof-tgl-rt711-4ch.tplg:

$ ./bug7759.sh
++ aplay -Dplughw:0,0 -d2 -r48000 -c2 -fS16_LE -traw /dev/zero
++ arecord -Dplughw:0,1 -r48000 -c2 -fS16_LE -i /dev/null -d15
Playing raw data '/dev/zero' : Signed 16 bit Little Endian, Rate 48000 Hz, Recording WAVE '/dev/null' : StereoSigned 16 bit Little Endian, 
Rate 48000 Hz, Stereo
++ sleep 3
arecord: pcm_read:2153: read error: Input/output error
++ echo arecord failed with 1
arecord failed with 1

@bardliao @plbossart do you see the same? It seems just stopping playback to codec will kill capture.

@kv2019i
Copy link
Collaborator

kv2019i commented Aug 17, 2023

Some bisect testing:

  • test with IPC3 (same kernel version) and SOF2.2 -> error not seen, so kernel soundwire stack seems fine
  • test IPC4 and SOF2.6.1 -> error not seen!

Will try to do a bisect next based on the SOF FW version.

@kv2019i
Copy link
Collaborator

kv2019i commented Aug 17, 2023

Bisected to Zephyr update in commit bf6ae6d . Will proceed to bisect the change on Zephyr side.

@plbossart
Copy link
Member

plbossart commented Aug 17, 2023

@kv2019i is this a TGL-only issue? I can't reproduce the problem on MTL

ubuntu@jf-mtlp-rvp-sdw-2:~$ bash bug7759.sh 
+ aplay -Dplughw:0,0 -d2 -r48000 -c2 -fS16_LE -traw /dev/zero
+ arecord -Dplughw:0,1 -r48000 -c2 -fS16_LE -i /dev/null -d15
Playing raw data '/dev/zero' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
Recording WAVE '/dev/null' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
+ sleep 3

[ 66.398274] sof-audio-pci-intel-mtl 0000:00:1f.3: Booted firmware version: 2.6.99.1

@plbossart
Copy link
Member

I can reproduce this on ADL - looks like a cAVS 2.5 issue then?

@kv2019i
Copy link
Collaborator

kv2019i commented Aug 18, 2023

@plbossart Now found the rootcause and have a fix in review. The problem is in code common to all Intel platforms, but for some reason this has triggered only on cAVS2.5 platforms. This is related to configuring SDW I/O pin ownership -> explains why data flow stopped when one ALH stream was closed.

Let's see if the fix passes testing.

@mengdonglin mengdonglin added the Fix Submitted Fix has been submitted, to be fully verified label Aug 22, 2023
carlescufi pushed a commit to zephyrproject-rtos/zephyr that referenced this issue Aug 23, 2023
Refcounting is used to track ALH block usage and to
call alh_claim_ownership()/alh_release_ownership() accordingly.
This is however incorrectly done on ALH instance basis, which
means when one instance is released, ownership can be released
even though one ALH instance is still active.

Fix the logic by tracking ALH usage as a global property
which matches the alh_claim_ownership/alh_release_ownership
semantics.

Link: thesofproject/sof#7759
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
coreboot-org-bot pushed a commit to coreboot/zephyr-cros that referenced this issue Aug 23, 2023
Refcounting is used to track ALH block usage and to
call alh_claim_ownership()/alh_release_ownership() accordingly.
This is however incorrectly done on ALH instance basis, which
means when one instance is released, ownership can be released
even though one ALH instance is still active.

Fix the logic by tracking ALH usage as a global property
which matches the alh_claim_ownership/alh_release_ownership
semantics.

(cherry picked from commit f764e7e)

Original-Link: thesofproject/sof#7759
Original-Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
GitOrigin-RevId: f764e7e
Change-Id: I92fcb23c199a6a7e74c412836a4cda4f1c647796
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/zephyr/+/4806730
Tested-by: Fabio Baltieri <fabiobaltieri@google.com>
Tested-by: ChromeOS Prod (Robot) <chromeos-ci-prod@chromeos-bot.iam.gserviceaccount.com>
Commit-Queue: Fabio Baltieri <fabiobaltieri@google.com>
Reviewed-by: Fabio Baltieri <fabiobaltieri@google.com>
@kv2019i
Copy link
Collaborator

kv2019i commented Aug 23, 2023

@keqiaozhang @fredoh9 Should be fixed via #8085 now. I've done two test runs with the fix included and both were fine. But please check and close if good.

@keqiaozhang
Copy link
Collaborator Author

Confirmed that this issue has been fixed by #8085. Closing this bug.

piotrnarajowski pushed a commit to piotrnarajowski/zephyr that referenced this issue Aug 31, 2023
Refcounting is used to track ALH block usage and to
call alh_claim_ownership()/alh_release_ownership() accordingly.
This is however incorrectly done on ALH instance basis, which
means when one instance is released, ownership can be released
even though one ALH instance is still active.

Fix the logic by tracking ALH usage as a global property
which matches the alh_claim_ownership/alh_release_ownership
semantics.

Link: thesofproject/sof#7759
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
dztang pushed a commit to dztang/zephyr that referenced this issue Apr 23, 2024
Refcounting is used to track ALH block usage and to
call alh_claim_ownership()/alh_release_ownership() accordingly.
This is however incorrectly done on ALH instance basis, which
means when one instance is released, ownership can be released
even though one ALH instance is still active.

Fix the logic by tracking ALH usage as a global property
which matches the alh_claim_ownership/alh_release_ownership
semantics.

Link: thesofproject/sof#7759
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ADL Applies to Alder Lake platform bug Something isn't working as expected Fix Submitted Fix has been submitted, to be fully verified Intel Linux Daily tests This issue can be found in internal Linux daily tests IPC4 Issues observed with IPC4 (same IPC as Windows) P1 Blocker bugs or important features TGL Applies to Tiger Lake
Projects
None yet
Development

No branches or pull requests

7 participants