-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG][TGL] max98373 sdw:1:019f:8373:00:7 initialization not complete #3638
Comments
I think this is the same issue as #3063
|
Really unclear what causes the second device to remain unattached - but clearly there's already a hardware problem earlier since devices are not supposed to fall off the bus. |
Observed this issue again on ADLP_BRYA_SDW. |
|
The pattern seems to be that the device '7' (Slave 5) falls off the bus first, and then never shows up again. This device '7' is not a hardware issue, it's likely due to the SoundWire enumeration arbitration. When there are identical devices, the one with the largest unique_id wins and will be handled first. Why it loses sync is really unclear. This should not happen! |
Same error in https://sof-ci.01.org/linuxpr/PR3708/build250/devicetest/?model=TGLU_RVP_SDW&testcase=check-suspend-resume-5, but this time with a different codec.
|
This helper should help identify cases where devices fall off the bus and don't resync. BugLink: thesofproject#3638 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
This helper should help identify cases where devices fall off the bus and don't resync. BugLink: thesofproject#3638 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
This helper should help identify cases where devices fall off the bus and don't resync. BugLink: thesofproject#3638 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
This helper should help identify cases where devices fall off the bus and don't resync. BugLink: #3638 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
This helper should help identify cases where devices fall off the bus and don't resync. BugLink: #3638 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Reviewed-by: Rander Wang <rander.wang@intel.com>
This helper should help identify cases where devices fall off the bus and don't resync. BugLink: #3638 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Reviewed-by: Rander Wang <rander.wang@intel.com>
This helper should help identify cases where devices fall off the bus and don't resync. BugLink: thesofproject#3638 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Rander Wang <rander.wang@intel.com> Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
This helper should help identify cases where devices fall off the bus and don't resync. BugLink: #3638 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Reviewed-by: Rander Wang <rander.wang@intel.com>
This helper should help identify cases where devices fall off the bus and don't resync. BugLink: #3638 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Reviewed-by: Rander Wang <rander.wang@intel.com>
Ah, that's super interesting. Now that we added the PING status, we can check what the real status is.
This means bits 8 and 12 are set. That means Slave4 and Slave6 are attached. This suggests that the device reports as UNATTACHED but does not transition to the soft reset stage, and keeps reporting as ATTACHED with the original device number. In other words, either
That is of course not a logic that we thought was even possible. What we may need to do is that if a device reports as UNATTACHED we double-check the ping status and ignore that status change. @bardliao what do you think? |
In configurations with two amplifiers on the same link, the Intel CI reports occasional enumeration/initialization timeouts during suspend-resume stress tests, where one of the two amplifiers becomes UNATTACHED immediately after being enumerated. This problem was reported both with Maxim and Realtek codecs, which pointed initially to a problem with status handling on the Intel side. The Cadence IP integrated on Intel platforms throws an interrupt when the status changes, and the information is kept with sticky bits until cleared. We initially added more checks to make sure the edge detection did not miss any transition, but that did not improve the results significantly. With the recent addition of the read_ping_status() callback, we were able to show that the status in sticky bits is shown as UNATTACHED even though the PING frames show the problematic device as ATTACHED. That completely breaks the entire logic where we assumed that a peripheral would always re-attach as device0. The resume timeouts make sense in that in those cases, the enumeration/initialization never happens a second time. One possible explanation is that this problem typically happens when a bus clash is reported, so it could very well be that the detection is fooled by a transient electrical issue or conflict between two peripherals. This patch conditionally double-checks the status reported in the sticky bits with the actual PING frame status. If the peripheral reports as attached in PING frames, the early detection based on sticky bits is discarded. Note that this patch only corrects issues of false positives on the manager side. If the peripheral lost and regain sync, then it would report as attached on Device0. A peripheral that would not reset its dev_num would not be compliant with the MIPI specification. BugLink: thesofproject#3638 BugLink: thesofproject#3325 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
|
@bardliao sdw_handle_slave_status() is called when there is an interrupt signalling a change of status, which is detected when the PING frame status. If the device didn't update its PING frame status there would be no interrupt and no status handling. That said, we could add a check that the device is indeed attach. In other words both for attach and detach events we would check the PING frame status to make sure the hardware detection didn't report a bad event. |
The interrupt could be triggered by max98373 sdw:2:019f:8373:00:3 (Slave 6), right?
I agree the unattached status is a false positive. But I am not sure if checking ping status at the same place will fix the issue or not. It seems to be some kind of timing issue to me.
|
This helper should help identify cases where devices fall off the bus and don't resync. BugLink: #3638 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Reviewed-by: Rander Wang <rander.wang@intel.com>
In configurations with two amplifiers on the same link, the Intel CI reports occasional enumeration/initialization timeouts during suspend-resume stress tests, where one of the two amplifiers becomes UNATTACHED immediately after being enumerated. This problem was reported both with Maxim and Realtek codecs, which pointed initially to a problem with status handling on the Intel side. The Cadence IP integrated on Intel platforms throws an interrupt when the status changes, and the information is kept with sticky bits until cleared. We initially added more checks to make sure the edge detection did not miss any transition, but that did not improve the results significantly. With the recent addition of the read_ping_status() callback, we were able to show that the status in sticky bits is shown as UNATTACHED even though the PING frames show the problematic device as ATTACHED. That completely breaks the entire logic where we assumed that a peripheral would always re-attach as device0. The resume timeouts make sense in that in those cases, the enumeration/initialization never happens a second time. One possible explanation is that this problem typically happens when a bus clash is reported, so it could very well be that the detection is fooled by a transient electrical issue or conflict between two peripherals. This patch conditionally double-checks the status reported in the sticky bits with the actual PING frame status. If the peripheral reports as attached in PING frames, the early detection based on sticky bits is discarded. Note that this patch only corrects issues of false positives on the manager side. If the peripheral lost and regain sync, then it would report as attached on Device0. A peripheral that would not reset its dev_num would not be compliant with the MIPI specification. BugLink: #3638 BugLink: #3325 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
This helper should help identify cases where devices fall off the bus and don't resync. BugLink: thesofproject#3638 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Rander Wang <rander.wang@intel.com> Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com> Acked-By: Vinod Koul <vkoul@kernel.org> Link: https://lore.kernel.org/r/20220714011043.46059-5-yung-chuan.liao@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
Merge series from Bard Liao <yung-chuan.liao@linux.intel.com>: we've been stuck with problems in the dual-amplifier configurations where one of the two devices seems to become UNATTACHED and never regains sync, see thesofproject#3638. This is a rather infrequent issue that may happen once or twice per month, but still it remains a concern. One possibility is that the device does lose sync but somehow our hardware detection fails to see it resync. This series just adds a basic read directly from the PING frames to help confirm if yes/no the device regain sync.
In configurations with two amplifiers on the same link, the Intel CI reports occasional enumeration/initialization timeouts during suspend-resume stress tests, where one of the two amplifiers becomes UNATTACHED immediately after being enumerated. This problem was reported both with Maxim and Realtek codecs, which pointed initially to a problem with status handling on the Intel side. The Cadence IP integrated on Intel platforms throws an interrupt when the status changes, and the information is kept with sticky bits until cleared. We initially added more checks to make sure the edge detection did not miss any transition, but that did not improve the results significantly. With the recent addition of the read_ping_status() callback, we were able to show that the status in sticky bits is shown as UNATTACHED even though the PING frames show the problematic device as ATTACHED. That completely breaks the entire logic where we assumed that a peripheral would always re-attach as device0. The resume timeouts make sense in that in those cases, the enumeration/initialization never happens a second time. One possible explanation is that this problem typically happens when a bus clash is reported, so it could very well be that the detection is fooled by a transient electrical issue or conflict between two peripherals. This patch conditionally double-checks the status reported in the sticky bits with the actual PING frame status. If the peripheral reports as attached in PING frames, the early detection based on sticky bits is discarded. Note that this patch only corrects issues of false positives on the manager side. If the peripheral lost and regain sync, then it would report as attached on Device0. A peripheral that would not reset its dev_num would not be compliant with the MIPI specification. BugLink: thesofproject#3638 BugLink: thesofproject#3325 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Rander Wang <rander.wang@intel.com> Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Observed this issue again in CI stress test:http://sof-ci.sh.intel.com/#/result/planresultdetail/14917?model=ADLP_BRYA_SDW_ZEPHYR&testcase=check-suspend-resume-with-capture-100 |
In configurations with two amplifiers on the same link, the Intel CI reports occasional enumeration/initialization timeouts during suspend-resume stress tests, where one of the two amplifiers becomes UNATTACHED immediately after being enumerated. This problem was reported both with Maxim and Realtek codecs, which pointed initially to a problem with status handling on the Intel side. The Cadence IP integrated on Intel platforms throws an interrupt when the status changes, and the information is kept with sticky bits until cleared. We initially added more checks to make sure the edge detection did not miss any transition, but that did not improve the results significantly. With the recent addition of the read_ping_status() callback, we were able to show that the status in sticky bits is shown as UNATTACHED even though the PING frames show the problematic device as ATTACHED. That completely breaks the entire logic where we assumed that a peripheral would always re-attach as device0. The resume timeouts make sense in that in those cases, the enumeration/initialization never happens a second time. One possible explanation is that this problem typically happens when a bus clash is reported, so it could very well be that the detection is fooled by a transient electrical issue or conflict between two peripherals. This patch conditionally double-checks the status reported in the sticky bits with the actual PING frame status. If the peripheral reports as attached in PING frames, the early detection based on sticky bits is discarded. Note that this patch only corrects issues of false positives on the manager side. If the peripheral lost and regain sync, then it would report as attached on Device0. A peripheral that would not reset its dev_num would not be compliant with the MIPI specification. BugLink: #3638 BugLink: #3325 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Rander Wang <rander.wang@intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com>
In configurations with two amplifiers on the same link, the Intel CI reports occasional enumeration/initialization timeouts during suspend-resume stress tests, where one of the two amplifiers becomes UNATTACHED immediately after being enumerated. This problem was reported both with Maxim and Realtek codecs, which pointed initially to a problem with status handling on the Intel side. The Cadence IP integrated on Intel platforms throws an interrupt when the status changes, and the information is kept with sticky bits until cleared. We initially added more checks to make sure the edge detection did not miss any transition, but that did not improve the results significantly. With the recent addition of the read_ping_status() callback, we were able to show that the status in sticky bits is shown as UNATTACHED even though the PING frames show the problematic device as ATTACHED. That completely breaks the entire logic where we assumed that a peripheral would always re-attach as device0. The resume timeouts make sense in that in those cases, the enumeration/initialization never happens a second time. One possible explanation is that this problem typically happens when a bus clash is reported, so it could very well be that the detection is fooled by a transient electrical issue or conflict between two peripherals. This patch conditionally double-checks the status reported in the sticky bits with the actual PING frame status. If the peripheral reports as attached in PING frames, the early detection based on sticky bits is discarded. Note that this patch only corrects issues of false positives on the manager side. If the peripheral lost and regain sync, then it would report as attached on Device0. A peripheral that would not reset its dev_num would not be compliant with the MIPI specification. BugLink: #3638 BugLink: #3325 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Rander Wang <rander.wang@intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com>
I am going to close this bug since we have fixes from Cirrus Logic that should solve the issue with more than one device per link. Of course if we see this problem again we can reopen as needed. |
Describe the bug
max98373 codec driver initialization not completed when testing
check-suspend-resume-with-capture-5
.To Reproduce
~/sof-test/test-case/check-suspend-resume-with-audio.sh -l 5 -m capture
Reproduction Rate
Hard to tell now, CI observed this issue for the first time and no reproductions by manual check so far..
Expected behavior
sof can resume form S3
Impact
codec driver initialization failed.
Environment
Screenshots or console output
dmesg
dmesg.txt
logger.txt
The text was updated successfully, but these errors were encountered: