[BUG] I/O error when testing multiple pause/resume, XRUN nothing to copy #7759
Comments
Today, TGLU_RVP_SDW_IPC4ZPH and TGLU_RVP_SDW_IPC4ZPH reproduced this issue. Truly P1. |
Still happening with a very high reproduction rate, I see it in about 1/2 or 1/3 of the test results I look at. For instance this one: https://sof-ci.01.org/sofpr/PR7878/build10245/devicetest |
This is not 100%, but it happens with a really high reproduction rate, almost daily. |
Reproduced this locally, debug ongoing. |
The error is raised in Linux pcm_lib.c:wait_for_avail() with hw_ptr not being updated for 100ms. So far no idea what is causing this. |
This log entry is now at the focal point: I've added debugging and the sequence of events seems to be:
Still no clue what causes PCM1 DMA to stop. There's seemingly no change to the hardware programming when this occurs. |
Very similar failure found today with check-suspend-resume-with-playback-5.sh in ADLP_SKU0B00_SDCA_IPC4ZPH Intel Internal Daily test link: planresultdetail/30115?model=ADLP_SKU0B00_SDCA_IPC4ZPH&testcase=check-suspend-resume-with-playback-5 |
Ok thanks @fredoh9, that is interesting. The "nothing to copy" messages are not necessarily a sign of a problem. At least with SDW, it takes time after the DMA has started before the codec starts emitting data, so there's a period of time (especially after a paused->released transition) when the FW complains about no data, but this seems to be normal (still need to check kernel side codec handling). I noticed this bug does not happen if I relax the pause/resume timing: ... I'm starting to think the delay in getting data after paused->released with SDW is somehow causing this bug to appear. The test case usually has a very fast pace of pause/release cycles, and this somehow causes the failure to happen. |
I'm no longer sure this is a SOF issue (or a kernel issue). It seems the SDW capture stream is unable to recover from the paused state if a playback SDW stream has been closed during this time. FYI @bardliao @plbossart @ranj063 I can now reproduce this with the following simple recipe: sh> arecord -Dplughw:0,1 -vv -r48000 -c2 -fS16_LE -i /dev/null This hits the following error 100% of the time:
With instrumented FW build, I can see FW starts the released stream normally, but no data is received from the codec. |
With multiple-pause-resume-50.sh, the problem is reproduced almost daily. A similar failure was found again today with check-suspend-resume-with-playback-5.sh on ADLP_SKU0B00_SDCA_IPC4ZPH. Intel internal daily test link: |
Wait, I am confused. I thought there were non-SDW platforms affected? Even earlier, this bug made references to TGLU_UP_HDA_IPC4ZPH. Are we now saying this is a SoundWire-only issue? That'd be an interesting point indeed. |
This is NOT a SoundWire-only problem, but it looks like SDW platforms reproduce it more easily. 3 days ago, TGLU_UP_HDA_IPC4ZPH had the problem. |
@fredoh9 @plbossart I'd say we have to separate the bugs here, and the case that has a VERY high reproduction rate is specific to SDW. It can be identified by user-space (arecord) getting an -EIO error and the FW mtrace showing "xrun detected". The case on HDA (plan 30040) seems to be a different one. The test case is the same, but there the error happens when executing pause. So both the action where the error happens and the resulting error message are different. I'd suggest we tag this as SDW only and file a different ticket to track the HDA issue. |
Even simpler test case to cause the fail (with sof-tgl-rt711-4ch.tplg): bug7759.sh
Output when running with sof-tgl-rt711-4ch.tplg:
@bardliao @plbossart do you see the same? It seems just stopping playback to codec will kill capture. |
Some bisect testing:
Will try to do a bisect next based on the SOF FW version. |
Bisected to Zephyr update in commit bf6ae6d . Will proceed to bisect the change on Zephyr side. |
@kv2019i is this a TGL-only issue? I can't reproduce the problem on MTL
[ 66.398274] sof-audio-pci-intel-mtl 0000:00:1f.3: Booted firmware version: 2.6.99.1 |
I can reproduce this on ADL - looks like a cAVS 2.5 issue then? |
@plbossart Now found the root cause and have a fix in review. The problem is in code common to all Intel platforms, but for some reason it has triggered only on cAVS2.5 platforms. It is related to configuring SDW I/O pin ownership, which explains why data flow stopped when one ALH stream was closed. Let's see if the fix passes testing. |
Refcounting is used to track ALH block usage and to call alh_claim_ownership()/alh_release_ownership() accordingly. This is however incorrectly done on ALH instance basis, which means when one instance is released, ownership can be released even though one ALH instance is still active. Fix the logic by tracking ALH usage as a global property which matches the alh_claim_ownership/alh_release_ownership semantics. Link: thesofproject/sof#7759 Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
Refcounting is used to track ALH block usage and to call alh_claim_ownership()/alh_release_ownership() accordingly. This is however incorrectly done on ALH instance basis, which means when one instance is released, ownership can be released even though one ALH instance is still active. Fix the logic by tracking ALH usage as a global property which matches the alh_claim_ownership/alh_release_ownership semantics. (cherry picked from commit f764e7e) Original-Link: thesofproject/sof#7759 Original-Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> GitOrigin-RevId: f764e7e Change-Id: I92fcb23c199a6a7e74c412836a4cda4f1c647796 Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/zephyr/+/4806730 Tested-by: Fabio Baltieri <fabiobaltieri@google.com> Tested-by: ChromeOS Prod (Robot) <chromeos-ci-prod@chromeos-bot.iam.gserviceaccount.com> Commit-Queue: Fabio Baltieri <fabiobaltieri@google.com> Reviewed-by: Fabio Baltieri <fabiobaltieri@google.com>
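The difference between per-instance and global refcounting described in the commit message can be sketched roughly as below. This is a simplified illustration only, not the actual Zephyr driver code: `claim_ownership()`, `release_ownership()`, the `buggy_*`/`fixed_*` helpers and the two-instance setup are hypothetical stand-ins that merely record whether ownership is currently held.

```c
#include <stdbool.h>
#include <string.h>

#define NUM_INSTANCES 2

/* Hypothetical stand-in for the hardware state: true while the DSP
 * owns the ALH block (illustration only, not the real driver state). */
static bool alh_owned;

static void claim_ownership(void)   { alh_owned = true; }
static void release_ownership(void) { alh_owned = false; }

/* Buggy variant: one counter per ALH instance. */
static int per_instance_ref[NUM_INSTANCES];

static void buggy_get(int i)
{
	if (per_instance_ref[i]++ == 0)
		claim_ownership();
}

static void buggy_put(int i)
{
	/* Releases ownership as soon as THIS instance goes idle,
	 * even if another instance is still streaming. */
	if (--per_instance_ref[i] == 0)
		release_ownership();
}

/* Fixed variant: one counter for the whole ALH block. */
static int global_ref;

static void fixed_get(void)
{
	if (global_ref++ == 0)
		claim_ownership();
}

static void fixed_put(void)
{
	/* Releases ownership only when the last user is gone. */
	if (--global_ref == 0)
		release_ownership();
}

/* Open capture (instance 0) and playback (instance 1), then close
 * playback. Returns whether ownership is still held afterwards. */
bool scenario_per_instance(void)
{
	alh_owned = false;
	memset(per_instance_ref, 0, sizeof(per_instance_ref));
	buggy_get(0);
	buggy_get(1);
	buggy_put(1);
	return alh_owned;
}

bool scenario_global(void)
{
	alh_owned = false;
	global_ref = 0;
	fixed_get();
	fixed_get();
	fixed_put();
	return alh_owned;
}
```

In this sketch, `scenario_per_instance()` ends with ownership released while instance 0 is still active, which matches the symptom of capture dying when playback is closed; `scenario_global()` keeps ownership until the last user is released.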
@keqiaozhang @fredoh9 Should be fixed via #8085 now. I've done two test runs with the fix included and both were fine. But please check and close if good. |
Confirmed that this issue has been fixed by #8085. Closing this bug. |
Describe the bug
Observed this issue in CI tests; it happens on ADL/TGL IPC4 platforms, but there are no errors in dmesg or mtrace. The reproduction rate is about 50%.
To Reproduce
~/sof-test/test-case/multiple-pause-resume.sh -r 50
Reproduction Rate
50%
mtrace
Environment
Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver).
Name of the platform(s) on which the bug is observed.
Screenshots or console output
dmesg.txt
mtrace.txt