[SoundWire] intel_resume: MCP_CONTROL_HW_RST not cleared at iteration (dev_err logs) #3012
Comments
Weekly stress test (test result 4894) found the same problem with CML_RVP_SDW.
|
same warnings seen on Intel daily test 5356 (started on July 19). The next daily tests after that will rely on #3006 which was merged on July 20. If we see this error again then my theory of a clock stop issue was wrong... Let's see! |
Seems my theory was wrong: this error still happens with the latest kernel. However, this is the first time I managed to see it on Dell SKU 09c6. |
Seen by @bardliao (c.f. #3098 (comment)) |
Observed this issue again when running suspend/resume stress testing on CML_SKU0983_SDW. Test ID:8054.
|
Linux: 8c3ae47
|
Not sure how we are going to diagnose this further, this problem seems to happen only on CML and only on a monthly/bimonthly basis - and it's not critical anyways. |
Reproduction of the month has arrived
|
Again in https://sof-ci.01.org/sofpr/PR5379/build12076/devicetest/?model=TGLU_RVP_SDW&testcase=check-suspend-resume-with-playback-5 , this time on TGL. Is this becoming more frequent? EDIT again https://sof-ci.01.org/sofpr/PR5485/build12238/devicetest
|
Found the same issue in the internal daily test on 2022/2/18. It was reproduced in suspend/resume test with Sorry for the late report, I was looking at one of the devices for errors and then found these errors too. |
@keqiaozhang @XiaoyunWu6666 @marc-hb @greg-intel I haven't seen this report in a long time. Is there a way to scan CI/test reports to see when we last saw this string: "MCP_CONTROL_HW_RST is not cleared"? |
The most recent was on March 28 daily 11433?model=CML_SKU0983_SDW&testcase=check-suspend-resume-50
Start Time: 2022-03-28 21:29:00 UTC Complete list (I searched only
|
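The kind of scan asked about above could look roughly like the sketch below. It assumes the CI reports are available locally as plain-text `*.log` files under one directory, which is a hypothetical layout and not how the Intel CI necessarily stores them; `find_reports` and `NEEDLE` are names made up for this illustration. The issue title words the message slightly differently, so a shorter needle such as `MCP_CONTROL_HW_RST` may be the safer choice.

```python
from pathlib import Path

# String quoted in the question above; a shorter needle may match more variants.
NEEDLE = "MCP_CONTROL_HW_RST is not cleared"


def find_reports(log_dir, needle=NEEDLE):
    """Return (file name, line number, line) for every log line containing needle.

    Assumption: logs are plain-text files ending in .log somewhere below log_dir.
    """
    hits = []
    for path in sorted(Path(log_dir).rglob("*.log")):
        # errors="replace" keeps the scan going on logs with stray binary bytes.
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line:
                    hits.append((path.name, number, line.rstrip("\n")))
    return hits
```

Sorting the hits by file modification time, or by a date embedded in the file name, would then give the most recent occurrence.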
Can this be downgraded to a warning? |
if it's a warning we never see a notice from CI. That's why we explicitly added a dev_err to force such reports to show with a FAIL. |
I'd love CI to report warnings and not just errors at some point in the future. It does not seem impossible, but it unfortunately seems to require a lot of non-standard customizations. @greg-intel and others please correct me, but Jenkins has a narrow, hardcoded and binary vision of the world: PASS or FAIL and that's it. Same for Github Actions, Quickbuild etc. The only successful workaround that I found (and that I've implemented in a couple places like thesofproject/sof@05dded7b7b0f#diff-b71166ed0f585913318ed46933ff9b12901e211de3ac88c40de03f0a944c0ae0R42) is to build or test twice: once with Anyway we don't even fail tests on firmware ERRORs yet, so we have some way to go: thesofproject/sof-test#799 |
Although github has a handful of statuses, they don't seem to be very customizable. I'm sure there's a plugin, but I'm anti-plugin. I am curious, how many people check Jenkins multiple times every day? Marc is basically right. With one infrequently used option to add a state:
|
Thanks @greg-intel
Extremely few I bet but that wasn't my point. My point was: to transmit a new, 3rd, immediate WARN status along our very long error reporting chain then all the links in the chain probably need to be aware of that new state and Jenkins is one pretty big link in the chain. Same for the SKIP state. Off topic sorry. |
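To make the "every link in the chain" point concrete, here is a minimal sketch, not taken from Jenkins or sof-test: `Verdict`, `worst` and `to_binary` are hypothetical names. A link that only understands PASS or FAIL has to collapse WARN one way or the other, which is why the kernel message was made a `dev_err`: a warning would otherwise be folded into PASS and never be noticed.

```python
from enum import Enum


class Verdict(Enum):
    """Hypothetical three-state test result, ordered by severity."""
    PASS = 0
    WARN = 1
    FAIL = 2


def worst(verdicts):
    """Combine several results by keeping the most severe one."""
    return max(verdicts, key=lambda v: v.value, default=Verdict.PASS)


def to_binary(verdict, warn_is_failure):
    """Collapse a verdict for a link that only knows PASS and FAIL."""
    if verdict is Verdict.WARN:
        return Verdict.FAIL if warn_is_failure else Verdict.PASS
    return verdict
```

With `warn_is_failure=False` the warning vanishes at the first binary link; with `True` it becomes indistinguishable from a real failure. Neither choice preserves the third state downstream.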
Observed this issue on ADLP_SKU0B00_SDCA. Test Report ID:13588
|
Still happening in today's daily 13945?model=CML_SKU0983_SDW&testcase=check-suspend-resume-50 ( |
Found new one in CML_RVP_SDW |
Found ADLP_BRYA_SDW with stable-v2.2. |
PR #4225 changes the way the reset is done, so it's super interesting if we see this error after April 17, 2023 in our test results. |
Seen again in Intel daily tests planresultdetail/39159?model=TGLU_SKU0A32_SDCA-ipc3&testcase=check-runtime-pm-status-15
|
the report from last week is due to other problems and fixed by #4896 |
Just seen again, re-opening to be on the safe side.
|
Humm, just one report on CML_RVP_SDW in over 14 months, that is going to be tricky to solve.... |
Another one: It's coming back! |
From https://sof-ci.01.org/softestpr/PR1220/build635/devicetest/index.html?model=CML_RVP_SDW-ipc3&testcase=check-runtime-pm-status, it looks like the first error started with the AMD machine driver rework. I am not sure how this commit might have interfered with the low-level reset sequence on CML. @bardliao @marc-hb @ssavati can you double-check if this bisect information seems correct? |
I tried to reproduce this manually on one of the test devices jf-cml-rvp-sdw-3. So far no luck. I am already at 150 cycles of multiple-pipelines + pm_runtime suspend/resume and nothing logged. |
I will stop the tests for now, unless we can reproduce it with a simple sequence we are not going to make progress or be able to bisect. Gah. |
This always had a low reproduction rate, didn't it? So I doubt this can be bisected. |
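A rough way to see why a low reproduction rate makes bisection impractical: if each suspend/resume cycle fails independently with probability `p_fail`, a run of clean cycles only shows a commit is good with limited confidence. The sketch below assumes independence between cycles, and the per-cycle probability is purely hypothetical because the thread never measures it.

```python
import math


def prob_all_clean(p_fail, cycles):
    """Probability that `cycles` independent cycles all pass on a bad kernel."""
    return (1.0 - p_fail) ** cycles


def cycles_needed(p_fail, confidence):
    """Clean cycles required before calling a commit good with this confidence.

    Assumption: cycles are independent and p_fail is constant, which real
    hardware may well violate.
    """
    if not (0.0 < p_fail < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("p_fail and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))
```

Even with a hypothetical rate as high as 1% per cycle, 150 clean cycles still leave roughly a one-in-five chance of having missed the bug, and every bisection step would need a comparable run. A rate matching one report every month or two would push the required cycle count far higher.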
description
sdw_cdns_clock_stop failed: MCP_CONTROL_HW_RST is not cleared when running check-suspend-resume-50 on CML_SKU0983_SDW
Happened in internal daily 4778, model=CML_SKU0983_SDW testcase=check-suspend-resume-50
[console log]
This is not a new issue, the dev_err logs above were added on purpose to help root-cause #2606 which has occurred randomly for the last 6 months, so far only on CML_RVP.
EDIT: later seen on TGL too.