[BUG] [rt700] SDW Alert happens before codec is enumerated #2344

bardliao · 2020-08-05T09:44:00Z

Describe the bug
We see SoundWire IO transfer timeouts when we run the suspend test. The timeout occurs because an alert is raised before the codec is enumerated.

To Reproduce

Run sudo rtcwake -m mem -s 5.

Reproduce rate
more than 50%

Expected result
No issue on the suspend test.

Actual result
See IO transfer timed out errors in dmesg.

jf-cml-rvp-sdw-2 kernel: [ 7295.502075] rt700 sdw:1:25d:700:0: sdw_modify_slave_status: initializing completion for Slave 1
...
jf-cml-rvp-sdw-2 kernel: [ 7296.545696] intel-sdw intel-sdw.1: IO transfer timed out, cmd 2 device 1 addr 40 len 1
jf-cml-rvp-sdw-2 kernel: [ 7296.545702] soundwire sdw-master-0: trf on Slave 1 failed:-110
jf-cml-rvp-sdw-2 kernel: [ 7296.545706] soundwire sdw-master-0: SDW_SCP_INT1 read failed:-110
jf-cml-rvp-sdw-2 kernel: [ 7296.545708] soundwire sdw-master-0: Slave 1 alert handling failed: -110
jf-cml-rvp-sdw-2 kernel: [ 7296.545730] intel-sdw intel-sdw.1: Slave status change
jf-cml-rvp-sdw-2 kernel: [ 7296.545764] soundwire sdw-master-0: Slave attached, programming device number
...
jf-cml-rvp-sdw-2 kernel: [ 7296.546367] rt700 sdw:1:25d:700:0: sdw_modify_slave_status: signaling completion for Slave 1

dmesg_cml_rvp.txt

bardliao · 2020-08-05T09:49:55Z

@plbossart Should we ignore alerts raised before the codec is enumerated?
Does the change below make sense?

diff --git a/drivers/soundwire/bus.c b/drivers/soundwire/bus.c
index d0aecf995c4f..219592cfee6f 100644
--- a/drivers/soundwire/bus.c
+++ b/drivers/soundwire/bus.c
@@ -1711,11 +1711,20 @@ int sdw_handle_slave_status(struct sdw_bus *bus,
                        break;

                case SDW_SLAVE_ALERT:
+                       if (!completion_done(&slave->initialization_complete))
+                               break;
                        ret = sdw_handle_slave_alerts(slave);

plbossart · 2020-08-05T13:21:45Z

@bardliao it's not possible to have an alert if the DeviceNumber is zero and the interrupt masks are not initialized. I wonder if this is a codec problem where on hard reset the interrupt masks are not reset.

Edit: the master driver should first go through sdw_clear_slave_status() to mark the status as UNATTACHED on resume, so that enumeration is dealt with first. I think we have a race condition here where the codec resumes before the link is reset and its status is marked as UNATTACHED.
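
The ordering described here, combined with the guard from the earlier diff, can be modelled in a few lines of user-space C. This is only a sketch: the `model_*` names and the three-value status enum are hypothetical simplifications, not the kernel's `sdw_slave` structures, and the `enumerated` flag merely stands in for the `initialization_complete` completion.

```c
#include <stdbool.h>

/* Hypothetical, simplified status values (not the kernel's enum). */
enum model_status { MODEL_UNATTACHED, MODEL_ATTACHED, MODEL_ALERT };

struct model_slave {
    enum model_status status;
    bool enumerated;      /* stands in for initialization_complete */
    int alerts_handled;   /* alerts that would trigger a bus read */
    int alerts_deferred;  /* alerts skipped because enumeration is pending */
};

/* On resume: forget the pre-suspend state so enumeration runs first. */
void model_clear_status(struct model_slave *s)
{
    s->status = MODEL_UNATTACHED;
    s->enumerated = false;
}

/* Process one status report from the bus. */
void model_handle_status(struct model_slave *s, enum model_status reported)
{
    switch (reported) {
    case MODEL_UNATTACHED:
        model_clear_status(s);
        break;
    case MODEL_ATTACHED:
        s->status = MODEL_ATTACHED;
        s->enumerated = true;
        break;
    case MODEL_ALERT:
        /* Guard: do not touch the device before it is enumerated. */
        if (!s->enumerated) {
            s->alerts_deferred++;
            break;
        }
        s->status = MODEL_ALERT;
        s->alerts_handled++;
        break;
    }
}
```

In this model, an alert that arrives after `model_clear_status()` but before the device re-attaches is counted as deferred instead of triggering a bus read.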

bardliao · 2020-08-11T04:28:24Z

@plbossart I think I now know the reason for the IO transfer timeout. We get an interrupt immediately after rt700 is enumerated, but rt700 goes into suspend right after that. See the log below.

[  820.942039] soundwire sdw-master-0: bard: sdw_clear_slave_status 1
[  820.942043] rt700 sdw:1:25d:700:0: sdw_modify_slave_status: initializing completion for Slave 1
[  820.942274] rt700 sdw:1:25d:700:0: bard: rt700_dev_resume
[  820.961605] intel-sdw intel-sdw.1: Slave status change
[  820.961622] soundwire sdw-master-0: Slave attached, programming device number
[  820.961924] soundwire sdw-master-0: SDW Slave Addr: 10025d070000
[  820.961929] soundwire sdw-master-0: SDW Slave class_id 0, part_id 700, mfg_id 25d, unique_id 0, version 1
[  820.961931] soundwire sdw-master-0: Slave already registered, reusing dev_num:1
[  820.962166] intel-sdw intel-sdw.1: Msg Ack not received
[  820.962167] intel-sdw intel-sdw.1: Msg Ack not received
[  820.962168] intel-sdw intel-sdw.1: Msg Ack not received
[  820.962170] intel-sdw intel-sdw.1: Msg Ack not received
[  820.962171] intel-sdw intel-sdw.1: Msg Ack not received
[  820.962172] intel-sdw intel-sdw.1: Msg Ack not received
[  820.962174] intel-sdw intel-sdw.1: Msg ignored for Slave 0
[  820.962176] soundwire sdw-master-0: No more devices to enumerate
[  820.962204] intel-sdw intel-sdw.1: Slave status change
[  820.962212] rt700 sdw:1:25d:700:0: sdw_modify_slave_status: signaling completion for Slave 1
[  820.962450] intel-sdw intel-sdw.1: Slave status change
[  820.962458] soundwire sdw-master-0: bard: Slave 1 status 1
[  820.962485] rt700 sdw:1:25d:700:0: bard: rt700_dev_resume cache sync start
[  820.967985] rt700 sdw:1:25d:700:0: bard: rt700_dev_resume cache sync end
[  820.967992] rt700 sdw:1:25d:700:0: bard: rt700_dev_suspend

Now look at sdw_handle_slave_alerts(). We call pm_runtime_get_sync() right after rt700 has been suspended.
I don't think rt700 has actually resumed when pm_runtime_get_sync() returns, so sdw_read(slave, SDW_SCP_INT1) is called while rt700 is still suspended, and the IO transfer times out.
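
One way to picture this failure, as a hedged user-space model rather than kernel code: runtime PM still considers the device active, so the get call returns 1 (the documented "already active" return value of pm_runtime_get_sync(), matching the `ret 1` in the log), while a suspend callback has already quiesced the device, so the following register read gets no response. All `model_*` names are hypothetical; `-ETIMEDOUT` is the errno behind the `-110` in the Linux log.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the codec's power state; not the kernel's struct device. */
struct model_codec {
    bool rpm_active;      /* what runtime PM believes */
    bool suspend_cb_ran;  /* a suspend callback has already quiesced the device */
};

/* Mimics the return convention of pm_runtime_get_sync():
 * 1 if the device was already active, 0 if it had to be resumed. */
int model_get_sync(struct model_codec *c)
{
    if (c->rpm_active)
        return 1;
    c->rpm_active = true;
    c->suspend_cb_ran = false;  /* a real resume would reinitialize the device */
    return 0;
}

/* A read on a quiesced device is never acknowledged, so it times out. */
int model_read_int1(const struct model_codec *c)
{
    return c->suspend_cb_ran ? -ETIMEDOUT : 0;
}

/* Alert handling: take a PM reference, then read the interrupt register. */
int model_handle_alert(struct model_codec *c)
{
    model_get_sync(c);  /* reports success in both cases */
    return model_read_int1(c);
}
```

The point of the model is that a successful return from the get call says nothing about whether the device can still answer on the bus.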

[  820.962450] intel-sdw intel-sdw.1: Slave status change
[  820.962458] soundwire sdw-master-0: bard: Slave 1 status 1
[  820.962485] rt700 sdw:1:25d:700:0: bard: rt700_dev_resume cache sync start
[  820.967985] rt700 sdw:1:25d:700:0: bard: rt700_dev_resume cache sync end
[  820.967992] rt700 sdw:1:25d:700:0: bard: rt700_dev_suspend
[  820.967998] rt700 sdw:1:25d:700:0: bard: sdw_handle_slave_alerts pm_runtime_get_sync ret 1
...
[  821.933210] intel-sdw intel-sdw.1: bard: intel_resume
[  821.933211] intel-sdw intel-sdw.1: intel_link_power_up: powering up all links
[  821.933212] intel-sdw intel-sdw.1: intel_link_power_up: first link up, programming SYNCPRD
[  821.933466] soundwire sdw-master-0: bard: sdw_clear_slave_status 1
[  821.933468] rt700 sdw:1:25d:700:0: sdw_modify_slave_status: initializing completion for Slave 1
[  821.933811] rt700 sdw:1:25d:700:0: bard: rt700_dev_resume
...
[  822.991662] intel-sdw intel-sdw.1: IO transfer timed out, cmd 2 device 1 addr 40 len 1
[  822.991669] soundwire sdw-master-0: trf on Slave 1 failed:-110
[  822.991673] rt700 sdw:1:25d:700:0: bard: sdw_handle_slave_alerts sdw_read ret -110
[  822.991675] soundwire sdw-master-0: SDW_SCP_INT1 read failed:-110
[  822.991678] soundwire sdw-master-0: Slave 1 alert handling failed: -110

dmesg_cml_rvp_2.txt

log_diff.txt

plbossart · 2020-08-11T16:18:35Z

Great work @bardliao

[  820.962485] rt700 sdw:1:25d:700:0: bard: rt700_dev_resume cache sync start
[  820.967985] rt700 sdw:1:25d:700:0: bard: rt700_dev_resume cache sync end
[  820.967992] rt700 sdw:1:25d:700:0: bard: rt700_dev_suspend

I think the last suspend is a system suspend.

It looks like we have a race condition: a system suspend happens immediately, while we are still dealing with a device interrupt and have a pending transaction. On the next system resume there is a timeout and an error, but that doesn't seem to be a real problem. The error happens in a workqueue.

I think we should use cancel_work_sync() on the master level to make sure current transactions are completed, and no alert can be handled while the master suspends.
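
As a rough illustration of what cancelling the work before suspend achieves, here is a single-threaded user-space model. The `model_*` names are hypothetical, and the model only captures the "drop pending work" half of cancel_work_sync(); the real call additionally waits for a handler that is already executing.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the master's deferred alert handling. */
struct model_master {
    bool work_pending;  /* alert work queued but not yet run */
    bool suspended;     /* master has been suspended */
    int bus_reads;      /* transactions issued by the worker */
};

/* Interrupt path: queue the alert work. */
void model_queue_alert(struct model_master *m)
{
    m->work_pending = true;
}

/* Worker: issues one bus transaction per queued alert. */
void model_run_work(struct model_master *m)
{
    if (!m->work_pending)
        return;
    m->work_pending = false;
    m->bus_reads++;
}

/* Suspend path: drop any queued work before powering down.
 * Returns true if work was pending, like cancel_work_sync(). */
bool model_suspend(struct model_master *m)
{
    bool was_pending = m->work_pending;

    m->work_pending = false;
    m->suspended = true;
    return was_pending;
}
```

With the cancel in place, a worker that runs after `model_suspend()` finds nothing to do and issues no transaction on a suspended link.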

…ce alerts

In system suspend stress cases, the SOF CI reports timeouts. The root cause is that an alert is generated while the system suspends, so on resume the previous transaction times out.

This error doesn't seem too problematic since it happens in a work queue, and the system recovers without issues.

Nevertheless, this should not happen. When doing a system suspend, or when disabling interrupts, we should make sure the current transaction can complete, and prevent new work from being queued.

BugLink: thesofproject#2344
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

plbossart · 2020-08-11T16:25:47Z

@bardliao can you try #2354, not sure if this helps?

bardliao · 2020-08-12T03:17:46Z

@bardliao can you try #2354, not sure if this helps?

Thanks @plbossart, it works. But the condition below should be if (cdns->interrupt_enable): we need to schedule the work when interrupts are enabled, not when they are disabled.

/* if the interrupt disable is in process, don't schedule a workqueue */
		if (!cdns->interrupt_enable)
			schedule_work(&cdns->work);
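
The corrected condition can be sketched in user space as follows. This is a hedged model, not the driver code: `struct model_cdns` and the `model_*` functions are hypothetical, and a counter replaces the real schedule_work() call.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the Cadence master state used in the snippet above. */
struct model_cdns {
    bool interrupt_enable;  /* cleared while interrupts are being disabled */
    int work_scheduled;     /* counts what would be schedule_work() calls */
};

/* Interrupt handler with the corrected test: schedule only while enabled. */
void model_irq_handler(struct model_cdns *cdns)
{
    /* if the interrupt disable is in process, don't schedule a workqueue */
    if (cdns->interrupt_enable)
        cdns->work_scheduled++;
}

/* Suspend or interrupt-disable path: block any further scheduling. */
void model_disable_interrupts(struct model_cdns *cdns)
{
    cdns->interrupt_enable = false;
}
```

With the negated test quoted above, the behavior would be reversed: alerts would be dropped in normal operation and queued only while interrupts are being disabled.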

…ce alerts

In system suspend stress cases, the SOF CI reports timeouts. The root cause is that an alert is generated while the system suspends. The interrupt handling generates transactions on the bus that will never be handled because the interrupts are disabled in parallel.

As a result, the transaction never completes and times out on resume. This error doesn't seem too problematic since it happens in a work queue, and the system recovers without issues.

Nevertheless, this race condition should not happen. When doing a system suspend, or when disabling interrupts, we should make sure the current transaction can complete, and prevent new work from being queued.

BugLink: thesofproject#2344
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

…ce alerts

In system suspend stress cases, the SOF CI reports timeouts. The root cause is that an alert is generated while the system suspends. The interrupt handling generates transactions on the bus that will never be handled because the interrupts are disabled in parallel.

As a result, the transaction never completes and times out on resume. This error doesn't seem too problematic since it happens in a work queue, and the system recovers without issues.

Nevertheless, this race condition should not happen. When doing a system suspend, or when disabling interrupts, we should make sure the current transaction can complete, and prevent new work from being queued.

BugLink: thesofproject#2344
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Reviewed-by: Rander Wang <rander.wang@linux.intel.com>
Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com>

…ce alerts

In system suspend stress cases, the SOF CI reports timeouts. The root cause is that an alert is generated while the system suspends. The interrupt handling generates transactions on the bus that will never be handled because the interrupts are disabled in parallel.

As a result, the transaction never completes and times out on resume. This error doesn't seem too problematic since it happens in a work queue, and the system recovers without issues.

Nevertheless, this race condition should not happen. When doing a system suspend, or when disabling interrupts, we should make sure the current transaction can complete, and prevent new work from being queued.

BugLink: thesofproject#2344
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Reviewed-by: Rander Wang <rander.wang@linux.intel.com>
Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Link: https://lore.kernel.org/alsa-devel/20200817222340.18042-1-yung-chuan.liao@linux.intel.com
Signed-off-by: Jaroslav Kysela <jkysela@redhat.com>

…ce alerts

In system suspend stress cases, the SOF CI reports timeouts. The root cause is that an alert is generated while the system suspends. The interrupt handling generates transactions on the bus that will never be handled because the interrupts are disabled in parallel.

As a result, the transaction never completes and times out on resume. This error doesn't seem too problematic since it happens in a work queue, and the system recovers without issues.

Nevertheless, this race condition should not happen. When doing a system suspend, or when disabling interrupts, we should make sure the current transaction can complete, and prevent new work from being queued.

BugLink: thesofproject#2344
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Reviewed-by: Rander Wang <rander.wang@linux.intel.com>
Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Acked-by: Jaroslav Kysela <perex@perex.cz>
Link: https://lore.kernel.org/r/20200817222340.18042-1-yung-chuan.liao@linux.intel.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>

…ce alerts

[ Upstream commit d2068da ]

In system suspend stress cases, the SOF CI reports timeouts. The root cause is that an alert is generated while the system suspends. The interrupt handling generates transactions on the bus that will never be handled because the interrupts are disabled in parallel.

As a result, the transaction never completes and times out on resume. This error doesn't seem too problematic since it happens in a work queue, and the system recovers without issues.

Nevertheless, this race condition should not happen. When doing a system suspend, or when disabling interrupts, we should make sure the current transaction can complete, and prevent new work from being queued.

BugLink: thesofproject/linux#2344
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Reviewed-by: Rander Wang <rander.wang@linux.intel.com>
Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Acked-by: Jaroslav Kysela <perex@perex.cz>
Link: https://lore.kernel.org/r/20200817222340.18042-1-yung-chuan.liao@linux.intel.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>

bardliao added the P2 Critical bugs or normal features label Aug 5, 2020

bardliao self-assigned this Aug 5, 2020

mengdonglin added bug Something isn't working CML Applies to Comet Lake platform labels Aug 6, 2020

plbossart mentioned this issue Aug 11, 2020

soundwire: cadence: fix race condition between suspend and Slave devi… #2354

Merged

plbossart added the SDW Applies to SoundWire bus for codec connection label Aug 11, 2020

aiChaoSONG mentioned this issue Aug 12, 2020

[CI][CML][SDW] IO transfer timed out in suspend/resume test #2355

Closed

bardliao closed this as completed Aug 16, 2020

marc-hb added the suspend resume Issues related to suspend resume (e.g. rtcwake) label Jul 19, 2022
