[BUG][TGL] max98373 sdw:1:019f:8373:00:7 initialization not complete #3638

keqiaozhang · 2022-05-10T02:51:35Z

Describe the bug

max98373 codec driver initialization not completed when testing check-suspend-resume-with-capture-5.

2022-05-10 00:57:56 UTC [ERROR] Caught kernel log error
===========================>>
[ 8050.114557] kernel: max98373 sdw:1:019f:8373:00:7: Initialization not complete, timed out
[ 8050.114566] kernel: max98373 sdw:1:019f:8373:00:7: PM: dpm_run_callback(): acpi_subsys_resume+0x0/0x70 returns -110
[ 8050.114582] kernel: max98373 sdw:1:019f:8373:00:7: PM: failed to resume: error -110
<<===========================

To Reproduce
~/sof-test/test-case/check-suspend-resume-with-audio.sh -l 5 -m capture

Reproduction Rate
Hard to tell now, CI observed this issue for the first time and no reproductions by manual check so far..

Expected behavior
sof can resume form S3

Impact
codec driver initialization failed.

Environment

Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver).
- Kernel: sof-dev/e89036145299
- SOF: main/2fbbdb032c12
Name of the topology file
- Topology: sof-tgl-sdw-max98373-rt5682.tplg
Name of the platform(s) on which the bug is observed.
- Platform: TGL Chrome device Volteer

Screenshots or console output
dmesg

[ 8044.996720] kernel: max98373 sdw:1:019f:8373:00:3: sdw_modify_slave_status: signaling enumeration completion for Slave 2
[ 8045.007934] kernel: max98373 sdw:1:019f:8373:00:3: Reset completed (retry:0)
[ 8045.009226] kernel: max98373 sdw:1:019f:8373:00:3: sdw_handle_slave_status: signaling initialization completion for Slave 2
[ 8045.274836] kernel: rt5682 sdw:0:025d:5682:00: rt5682_io_init hw_init complete: 0
[ 8045.274844] kernel: rt5682 sdw:0:025d:5682:00: sdw_handle_slave_status: signaling initialization completion for Slave 1
[ 8045.274894] kernel: soundwire_intel soundwire_intel.link.0: Slave status change: 0x40
[ 8045.275291] kernel: soundwire_intel soundwire_intel.link.0: Slave status change: 0x20
[ 8045.467420] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 8045.467456] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 8045.967381] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 8045.967420] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 8047.467417] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 8047.467458] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 8047.854408] kernel: ax88179_178a 4-1:1.0 enx000ec66a790d: ax88179 - Link status is: 1
[ 8048.467419] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 8048.467465] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 8049.467163] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 8049.467200] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 8050.114557] kernel: max98373 sdw:1:019f:8373:00:7: Initialization not complete, timed out
[ 8050.114566] kernel: max98373 sdw:1:019f:8373:00:7: PM: dpm_run_callback(): acpi_subsys_resume+0x0/0x70 returns -110
[ 8050.114582] kernel: max98373 sdw:1:019f:8373:00:7: PM: failed to resume: error -110
[ 8050.116705] kernel: OOM killer enabled.
[ 8050.116710] kernel: Restarting tasks ... done.
[ 8050.132377] kernel: PM: suspend exit

dmesg.txt
logger.txt

The text was updated successfully, but these errors were encountered:

plbossart · 2022-05-10T14:25:33Z

I think this is the same issue as #3063

[ 8044.971968] kernel: soundwire_intel soundwire_intel.link.0: intel_link_power_up: powering up all links
[ 8044.971974] kernel: soundwire_intel soundwire_intel.link.0: intel_link_power_up: first link up, programming SYNCPRD
...
[ 8044.973918] kernel: max98373 sdw:1:019f:8373:00:7: sdw_modify_slave_status: initializing enumeration and init completion for Slave 1
[ 8044.973921] kernel: max98373 sdw:1:019f:8373:00:3: sdw_modify_slave_status: initializing enumeration and init completion for Slave 2
[ 8044.974711] kernel: soundwire_intel soundwire_intel.link.1: Slave status change: 0x2
[ 8044.974715] kernel: soundwire sdw-master-1: Slave attached, programming device number
[ 8044.974902] kernel: soundwire sdw-master-1: SDW Slave Addr: 27019f837300
[ 8044.974903] kernel: soundwire sdw-master-1: SDW Slave class_id 0x00, mfg_id 0x019f, part_id 0x8373, unique_id 0x7, version 0x2
[ 8044.974904] kernel: soundwire sdw-master-1: Slave already registered, reusing dev_num:1
[ 8044.975145] kernel: soundwire_intel soundwire_intel.link.1: Msg ignored for Slave 0
[ 8044.975146] kernel: soundwire sdw-master-1: No more devices to enumerate
[ 8044.975173] kernel: soundwire_intel soundwire_intel.link.1: Slave status change: 0x21
[ 8044.975176] kernel: max98373 sdw:1:019f:8373:00:7: sdw_modify_slave_status: signaling enumeration completion for Slave 1
[ 8044.987010] kernel: max98373 sdw:1:019f:8373:00:7: Reset completed (retry:0)
[ 8044.988289] kernel: max98373 sdw:1:019f:8373:00:7: sdw_handle_slave_status: signaling initialization completion for Slave 1
[ 8044.996187] kernel: soundwire_intel soundwire_intel.link.1: Slave status change: 0x2
[ 8044.996198] kernel: max98373 sdw:1:019f:8373:00:7: Slave 1 state check1: UNATTACHED, status was 1 
<<< THIS IS NOT EXPECTED!
[ 8044.996201] kernel: max98373 sdw:1:019f:8373:00:7: sdw_modify_slave_status: initializing enumeration and init completion for Slave 1
[ 8044.996210] kernel: soundwire sdw-master-1: Slave attached, programming device number
[ 8044.996405] kernel: soundwire sdw-master-1: SDW Slave Addr: 23019f837300
[ 8044.996408] kernel: soundwire sdw-master-1: SDW Slave class_id 0x00, mfg_id 0x019f, part_id 0x8373, unique_id 0x3, version 0x2
[ 8044.996412] kernel: soundwire sdw-master-1: Slave already registered, reusing dev_num:2
[ 8044.996655] kernel: soundwire_intel soundwire_intel.link.1: Msg ignored for Slave 0
[ 8044.996657] kernel: soundwire sdw-master-1: No more devices to enumerate 
<<< END OF ENUMERATION
[ 8044.996713] kernel: soundwire_intel soundwire_intel.link.1: Slave status change: 0x201 
~~<<<< THIS MEANS THE STATUS CHANGES AFTER THE END OF ENUMERATION!~~
<<<< THIS MEANS DEVICE0 REMAINS UNATTACHED
[ 8044.996720] kernel: max98373 sdw:1:019f:8373:00:3: sdw_modify_slave_status: signaling enumeration completion for Slave 2
[ 8045.007934] kernel: max98373 sdw:1:019f:8373:00:3: Reset completed (retry:0)
[ 8045.009226] kernel: max98373 sdw:1:019f:8373:00:3: sdw_handle_slave_status: signaling initialization completion for Slave 2
....
[ 8050.114557] kernel: max98373 sdw:1:019f:8373:00:7: Initialization not complete, timed out
[ 8050.114566] kernel: max98373 sdw:1:019f:8373:00:7: PM: dpm_run_callback(): acpi_subsys_resume+0x0/0x70 returns -110
[ 8050.114582] kernel: max98373 sdw:1:019f:8373:00:7: PM: failed to resume: error -110

plbossart · 2022-06-08T17:36:09Z

Really unclear what causes the second device to remain unattached - but clearly there's already a hardware problem earlier since devices are not supposed to fall off the bus.

keqiaozhang · 2022-06-15T00:56:59Z

Observed this issue again on ADLP_BRYA_SDW.
https://sof-ci.sh.intel.com/#/result/planresultdetail/13336?model=ADLP_BRYA_SDW&testcase=check-suspend-resume-50

plbossart · 2022-06-15T01:33:24Z

[ 8217.255528] kernel: soundwire_bus:sdw_modify_slave_status: max98373 sdw:2:019f:8373:00:7: sdw_modify_slave_status: initializing enumeration and init completion for Slave 5
[ 8217.255563] kernel: soundwire_bus:sdw_modify_slave_status: max98373 sdw:2:019f:8373:00:3: sdw_modify_slave_status: initializing enumeration and init completion for Slave 6
[ 8217.255611] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x20000
[ 8217.256424] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.2: Slave status change: 0x2
[ 8217.256431] kernel: soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-1: Slave attached, programming device number
[ 8217.256659] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-1: SDW Slave Addr: 27019f837300
[ 8217.256661] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-1: SDW Slave class_id 0x00, mfg_id 0x019f, part_id 0x8373, unique_id 0x7, version 0x2
[ 8217.256665] kernel: soundwire_bus:sdw_assign_device_num: soundwire sdw-master-1: Slave already registered, reusing dev_num:5
[ 8217.256926] kernel: soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.2: Msg ignored for Slave 0
[ 8217.256928] kernel: soundwire_bus:sdw_program_device_num: soundwire sdw-master-1: No more devices to enumerate
[ 8217.256978] kernel: soundwire_bus:sdw_modify_slave_status: max98373 sdw:2:019f:8373:00:7: sdw_modify_slave_status: signaling enumeration completion for Slave 5
[ 8217.257047] kernel: max98373 sdw:2:019f:8373:00:7: Bus clash detected before INT mask is enabled
<< BUS CLASH?
[ 8217.268855] kernel: max98373 sdw:2:019f:8373:00:7: Reset completed (retry:0)
[ 8217.270631] kernel: soundwire_bus:sdw_handle_slave_status: max98373 sdw:2:019f:8373:00:7: sdw_handle_slave_status: signaling initialization completion for Slave 5
[ 8217.278179] kernel: max98373 sdw:2:019f:8373:00:7: Slave 5 state check1: UNATTACHED, status was 1
<<< LOST SYNC
[ 8217.278187] kernel: soundwire_bus:sdw_modify_slave_status: max98373 sdw:2:019f:8373:00:7: sdw_modify_slave_status: initializing enumeration and init completion for Slave 5
<<< Slave 5 never comes back
[ 8217.278203] kernel: soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-1: Slave attached, programming device number
[ 8217.278516] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-1: SDW Slave Addr: 23019f837300
[ 8217.278522] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-1: SDW Slave class_id 0x00, mfg_id 0x019f, part_id 0x8373, unique_id 0x3, version 0x2
[ 8217.278528] kernel: soundwire_bus:sdw_assign_device_num: soundwire sdw-master-1: Slave already registered, reusing dev_num:6
[ 8217.278932] kernel: soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.2: Msg ignored for Slave 0
[ 8217.278939] kernel: soundwire_bus:sdw_program_device_num: soundwire sdw-master-1: No more devices to enumerate
[ 8217.279064] kernel: soundwire_bus:sdw_modify_slave_status: max98373 sdw:2:019f:8373:00:3: sdw_modify_slave_status: signaling enumeration completion for Slave 6
[ 8217.279190] kernel: max98373 sdw:2:019f:8373:00:3: Bus clash detected before INT mask is enabled
[ 8217.291380] kernel: max98373 sdw:2:019f:8373:00:3: Reset completed (retry:0)

<< BUS CLASH AGAIN?

[ 8217.294048] kernel: soundwire_bus:sdw_handle_slave_status: max98373 sdw:2:019f:8373:00:3: sdw_handle_slave_status: signaling initialization completion for Slave 6

<< Slave 6 complete init

[ 8217.306311] kernel: snd_soc_rt5682:rt5682_headset_detect: rt5682 sdw:0:025d:5682:00: jack_type = 0
[ 8217.487426] kernel: asix 3-9.1:1.0 enx000ec67071e8: Link is Down
[ 8219.550794] kernel: asix 3-9.1:1.0 enx000ec67071e8: Link is Up - 100Mbps/Full - flow control off
[ 8222.615538] kernel: max98373 sdw:2:019f:8373:00:7: Initialization not complete, timed out
[ 8222.615551] kernel: max98373 sdw:2:019f:8373:00:7: PM: dpm_run_callback(): acpi_subsys_resume+0x0/0x70 returns -110
[ 8222.615589] kernel: max98373 sdw:2:019f:8373:00:7: PM: failed to resume: error -110

plbossart · 2022-06-15T01:36:37Z

The pattern seems to be that the device '7' (Slave 5) falls off the bus first, and then never shows up again.

This device '7' is not a hardware issue, it's likely due to the SoundWire enumeration arbitration. When there are identical devices, the one with the largest unique_id wins and will be handled first.

Why it loses sync is really unclear. This should not happen!

plbossart · 2022-06-21T14:47:54Z

Same error in https://sof-ci.01.org/linuxpr/PR3708/build250/devicetest/?model=TGLU_RVP_SDW&testcase=check-suspend-resume-5, but this time with a different codec.

[ 1352.736159] kernel: soundwire_bus:sdw_modify_slave_status: rt1308 sdw:1:025d:1308:00:2: initializing enumeration and init completion for Slave 5
[ 1352.736165] kernel: soundwire_bus:sdw_modify_slave_status: rt1308 sdw:1:025d:1308:00:0: initializing enumeration and init completion for Slave 6
[ 1352.737049] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.1: Slave status change: 0x2
[ 1352.737054] kernel: soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-1: Slave attached, programming device number
[ 1352.737239] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-1: SDW Slave Addr: 20025d130800
[ 1352.737240] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-1: SDW Slave class_id 0x00, mfg_id 0x025d, part_id 0x1308, unique_id 0x0, version 0x2
[ 1352.737242] kernel: soundwire_bus:sdw_assign_device_num: soundwire sdw-master-1: Slave already registered, reusing dev_num:6
[ 1352.737553] kernel: soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.1: Msg ignored for Slave 0
[ 1352.737554] kernel: soundwire_bus:sdw_program_device_num: soundwire sdw-master-1: No more devices to enumerate
[ 1352.737589] kernel: soundwire_bus:sdw_modify_slave_status: rt1308 sdw:1:025d:1308:00:0: signaling enumeration completion for Slave 6
[ 1352.745819] kernel: snd_soc_rt1308_sdw:rt1308_io_init: rt1308 sdw:1:025d:1308:00:0: rt1308_io_init hw_init complete
[ 1352.745820] kernel: soundwire_bus:sdw_handle_slave_status: rt1308 sdw:1:025d:1308:00:0: signaling initialization completion for Slave 6
[ 1352.760641] kernel: rt1308 sdw:1:025d:1308:00:0: Slave 6 state check1: UNATTACHED, status was 1
[ 1352.760648] kernel: soundwire_bus:sdw_modify_slave_status: rt1308 sdw:1:025d:1308:00:0: initializing enumeration and init completion for Slave 6
[ 1352.760658] kernel: soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-1: Slave attached, programming device number
[ 1352.760891] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-1: SDW Slave Addr: 22025d130800
[ 1352.760894] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-1: SDW Slave class_id 0x00, mfg_id 0x025d, part_id 0x1308, unique_id 0x2, version 0x2
[ 1352.760899] kernel: soundwire_bus:sdw_assign_device_num: soundwire sdw-master-1: Slave already registered, reusing dev_num:5
[ 1352.761227] kernel: soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.1: Msg ignored for Slave 0
[ 1352.761230] kernel: soundwire_bus:sdw_program_device_num: soundwire sdw-master-1: No more devices to enumerate
[ 1352.761330] kernel: soundwire_bus:sdw_modify_slave_status: rt1308 sdw:1:025d:1308:00:2: signaling enumeration completion for Slave 5
[ 1352.770175] kernel: snd_soc_rt1308_sdw:rt1308_io_init: rt1308 sdw:1:025d:1308:00:2: rt1308_io_init hw_init complete
[ 1352.770176] kernel: soundwire_bus:sdw_handle_slave_status: rt1308 sdw:1:025d:1308:00:2: signaling initialization completion for Slave 5
[ 1352.974900] kernel: soundwire_bus:sdw_handle_slave_alerts: rt711 sdw:0:025d:0711:00: Slave impl defined interrupt
[ 1352.974906] kernel: snd_soc_rt711:rt711_interrupt_callback: rt711 sdw:0:025d:0711:00: rt711_interrupt_callback control_port_stat=4
[ 1353.233082] kernel: snd_soc_rt711:rt711_sdw_read: rt711 sdw:0:025d:0711:00: [rt711_sdw_read] b921 0000 => 80000000
[ 1353.234310] kernel: snd_soc_rt711:rt711_sdw_read: rt711 sdw:0:025d:0711:00: [rt711_sdw_read] 7520 85a0 9c20 aca0 => 00000800
[ 1353.235346] kernel: snd_soc_rt711:rt711_sdw_read: rt711 sdw:0:025d:0711:00: [rt711_sdw_read] 7520 85a0 9c20 aca0 => 00000030
[ 1353.235350] kernel: snd_soc_rt711:rt711_jack_detect_handler: rt711 sdw:0:025d:0711:00: in rt711_jack_detect_handler, jack_type=0x3
[ 1353.235353] kernel: snd_soc_rt711:rt711_jack_detect_handler: rt711 sdw:0:025d:0711:00: in rt711_jack_detect_handler, btn_type=0x0
[ 1354.790460] kernel: asix 3-3.1:1.0 enx000ec66547b6: Link is Up - 100Mbps/Full - flow control off
[ 1358.112232] kernel: rt1308 sdw:1:025d:1308:00:0: Initialization not complete, timed out
[ 1358.112237] kernel: rt1308 sdw:1:025d:1308:00:0: PM: dpm_run_callback(): acpi_subsys_resume+0x0/0x70 returns -110
[ 1358.112249] kernel: rt1308 sdw:1:025d:1308:00:0: PM: failed to resume: error -110

This helper should help identify cases where devices fall off the bus and don't resync. BugLink: thesofproject#3638 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

This helper should help identify cases where devices fall off the bus and don't resync. BugLink: #3638 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

This helper should help identify cases where devices fall off the bus and don't resync. BugLink: #3638 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Reviewed-by: Rander Wang <rander.wang@intel.com>

This helper should help identify cases where devices fall off the bus and don't resync. BugLink: thesofproject#3638 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Rander Wang <rander.wang@intel.com> Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>

This helper should help identify cases where devices fall off the bus and don't resync. BugLink: #3638 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Reviewed-by: Rander Wang <rander.wang@intel.com>

keqiaozhang · 2022-07-26T00:51:31Z

Observed this issue again:http://sof-ci.sh.intel.com/#/result/planresultdetail/14218?model=ADLP_BRYA_SDW&testcase=check-suspend-resume-50

plbossart · 2022-07-26T15:02:05Z

Ah, that's super interesting. Now that we added the PING status, we can check what the real status is.

[ 7660.952588] kernel: max98373 sdw:2:019f:8373:00:7: Initialization not complete, timed out
[ 7660.967828] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-1: PING status: 0x1100

This means bits 8 and 12 are set. That means Slave4 and Slave6 are attached.

This suggests that the device reports as UNATTACHED but does not transition to the soft reset stage, and keeps reporting as ATTACHED with the original device number. In other words, either

the detection of the UNATTACHED device is incorrect.
the implementation of the device is incorrect.

That is of course not a logic that we thought was even possible. What we may need to do is that if a device reports as UNATTACHED we double-check the ping status and ignore that status change.

@bardliao what do you think?

In configurations with two amplifiers on the same link, the Intel CI reports occasional enumeration/initialization timeouts during suspend-resume stress tests, where one of the two amplifiers becomes UNATTACHED immediately after being enumerated. This problem was reported both with Maxim and Realtek codecs, which pointed initially to a problem with status handling on the Intel side. The Cadence IP integrated on Intel platforms throws an interrupt when the status changes, and the information is kept with sticky bits until cleared. We initially added more checks to make sure the edge detection did not miss any transition, but that did not improve the results significantly. With the recent addition of the read_ping_status() callback, we were able to show that the status in sticky bits is shown as UNATTACHED even though the PING frames show the problematic device as ATTACHED. That completely breaks the entire logic where we assumed that a peripheral would always re-attach as device0. The resume timeouts make sense in that in those cases, the enumeration/initialization never happens a second time. One possible explanation is that this problem typically happens when a bus clash is reported, so it could very well be that the detection is fooled by a transient electrical issue or conflict between two peripherals. This patch conditionally double-checks the status reported in the sticky bits with the actual PING frame status. If the peripheral reports as attached in PING frames, the early detection based on sticky bits is discarded. Note that this patch only corrects issues of false positives on the manager side. If the peripheral lost and regain sync, then it would report as attached on Device0. A peripheral that would not reset its dev_num would not be compliant with the MIPI specification. BugLink: thesofproject#3638 BugLink: thesofproject#3325 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

bardliao · 2022-07-27T08:13:20Z

@plbossart

[ 7655.624671] kernel: soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-1: Slave attached, programming device number
[ 7655.624863] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-1: SDW Slave Addr: 27019f837300
[ 7655.624864] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-1: SDW Slave class_id 0x00, mfg_id 0x019f, part_id 0x8373, unique_id 0x7, version 0x2
[ 7655.624866] kernel: soundwire_bus:sdw_assign_device_num: soundwire sdw-master-1: Slave already registered, reusing dev_num:4
[ 7655.625113] kernel: soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.2: Msg ignored for Slave 0
[ 7655.625114] kernel: soundwire_bus:sdw_program_device_num: soundwire sdw-master-1: No more devices to enumerate
[ 7655.625165] kernel: soundwire_bus:sdw_modify_slave_status: max98373 sdw:2:019f:8373:00:7: signaling enumeration completion for Slave 4
[ 7655.625268] kernel: max98373 sdw:2:019f:8373:00:7: Bus clash detected before INT mask is enabled
[ 7655.637076] kernel: max98373 sdw:2:019f:8373:00:7: Reset completed (retry:0)
[ 7655.639310] kernel: soundwire_bus:sdw_handle_slave_status: max98373 sdw:2:019f:8373:00:7: signaling initialization completion for Slave 4
[ 7655.646506] kernel: max98373 sdw:2:019f:8373:00:7: Slave 4 state check1: UNATTACHED, status was 1

sdw_handle_slave_status checks status[i] right after programming device number. Is it possible that HW need some time to update the slave status from unattached to attached? That's why the status[] we read doesn't match the slave->status. I am wondering what is the ping status at that moment. Can we read the status after programing device number to make sure the slave is really attached?

plbossart · 2022-07-27T13:44:40Z

@bardliao sdw_handle_slave_status() is called when there is an interrupt signalling a change of status, which is detected when the PING frame status. If the device didn't update its PING frame status there would be no interrupt and no status handling.

That said, we could add a check that the device is indeed attach. In other words both for attach and detach events we would check the PING frame status to make sure the hardware detection didn't report a bad event.

bardliao · 2022-07-28T07:38:08Z

@bardliao sdw_handle_slave_status() is called when there is an interrupt signalling a change of status, which is detected when the PING frame status. If the device didn't update its PING frame status there would be no interrupt and no status handling.

The interrupt could be triggered by max98373 sdw:2:019f:8373:00:3 (Slave 6), right?
The scenario could be:

max98373 sdw:2:019f:8373:00:7 (Slave 4) and sdw:2:019f:8373:00:3 (Slave 6) are attached at the same time
sdw_handle_slave_status is called for Slave 4 first, program the device number and set Slave->status etc.
sdw_handle_slave_status is called for Slave 6 and check status for all devices and find Slave 4's status is still unattached.

I agree the unattached status is a false positive. But I am not sure if checking ping status at the same place will fix the issue or not. It seems to be some kind of timing issue to me.

That said, we could add a check that the device is indeed attach. In other words both for attach and detach events we would check the PING frame status to make sure the hardware detection didn't report a bad event.

This helper should help identify cases where devices fall off the bus and don't resync. BugLink: #3638 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Reviewed-by: Rander Wang <rander.wang@intel.com>

In configurations with two amplifiers on the same link, the Intel CI reports occasional enumeration/initialization timeouts during suspend-resume stress tests, where one of the two amplifiers becomes UNATTACHED immediately after being enumerated. This problem was reported both with Maxim and Realtek codecs, which pointed initially to a problem with status handling on the Intel side. The Cadence IP integrated on Intel platforms throws an interrupt when the status changes, and the information is kept with sticky bits until cleared. We initially added more checks to make sure the edge detection did not miss any transition, but that did not improve the results significantly. With the recent addition of the read_ping_status() callback, we were able to show that the status in sticky bits is shown as UNATTACHED even though the PING frames show the problematic device as ATTACHED. That completely breaks the entire logic where we assumed that a peripheral would always re-attach as device0. The resume timeouts make sense in that in those cases, the enumeration/initialization never happens a second time. One possible explanation is that this problem typically happens when a bus clash is reported, so it could very well be that the detection is fooled by a transient electrical issue or conflict between two peripherals. This patch conditionally double-checks the status reported in the sticky bits with the actual PING frame status. If the peripheral reports as attached in PING frames, the early detection based on sticky bits is discarded. Note that this patch only corrects issues of false positives on the manager side. If the peripheral lost and regain sync, then it would report as attached on Device0. A peripheral that would not reset its dev_num would not be compliant with the MIPI specification. BugLink: #3638 BugLink: #3325 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

This helper should help identify cases where devices fall off the bus and don't resync. BugLink: thesofproject#3638 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Rander Wang <rander.wang@intel.com> Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com> Acked-By: Vinod Koul <vkoul@kernel.org> Link: https://lore.kernel.org/r/20220714011043.46059-5-yung-chuan.liao@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>

Merge series from Bard Liao <yung-chuan.liao@linux.intel.com>: we've been stuck with problems in the dual-amplifier configurations where one of the two devices seems to become UNATTACHED and never regains sync, see thesofproject#3638. This is a rather infrequent issue that may happen once or twice per month, but still it remains a concern. One possibility is that the device does lose sync but somehow our hardware detection fails to see it resync. This series just adds a basic read directly from the PING frames to help confirm if yes/no the device regain sync.

In configurations with two amplifiers on the same link, the Intel CI reports occasional enumeration/initialization timeouts during suspend-resume stress tests, where one of the two amplifiers becomes UNATTACHED immediately after being enumerated. This problem was reported both with Maxim and Realtek codecs, which pointed initially to a problem with status handling on the Intel side. The Cadence IP integrated on Intel platforms throws an interrupt when the status changes, and the information is kept with sticky bits until cleared. We initially added more checks to make sure the edge detection did not miss any transition, but that did not improve the results significantly. With the recent addition of the read_ping_status() callback, we were able to show that the status in sticky bits is shown as UNATTACHED even though the PING frames show the problematic device as ATTACHED. That completely breaks the entire logic where we assumed that a peripheral would always re-attach as device0. The resume timeouts make sense in that in those cases, the enumeration/initialization never happens a second time. One possible explanation is that this problem typically happens when a bus clash is reported, so it could very well be that the detection is fooled by a transient electrical issue or conflict between two peripherals. This patch conditionally double-checks the status reported in the sticky bits with the actual PING frame status. If the peripheral reports as attached in PING frames, the early detection based on sticky bits is discarded. Note that this patch only corrects issues of false positives on the manager side. If the peripheral lost and regain sync, then it would report as attached on Device0. A peripheral that would not reset its dev_num would not be compliant with the MIPI specification. BugLink: thesofproject#3638 BugLink: thesofproject#3325 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Rander Wang <rander.wang@intel.com> Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>

keqiaozhang · 2022-08-29T01:12:36Z

Observed this issue again in CI stress test:http://sof-ci.sh.intel.com/#/result/planresultdetail/14917?model=ADLP_BRYA_SDW_ZEPHYR&testcase=check-suspend-resume-with-capture-100

In configurations with two amplifiers on the same link, the Intel CI reports occasional enumeration/initialization timeouts during suspend-resume stress tests, where one of the two amplifiers becomes UNATTACHED immediately after being enumerated. This problem was reported both with Maxim and Realtek codecs, which pointed initially to a problem with status handling on the Intel side. The Cadence IP integrated on Intel platforms throws an interrupt when the status changes, and the information is kept with sticky bits until cleared. We initially added more checks to make sure the edge detection did not miss any transition, but that did not improve the results significantly. With the recent addition of the read_ping_status() callback, we were able to show that the status in sticky bits is shown as UNATTACHED even though the PING frames show the problematic device as ATTACHED. That completely breaks the entire logic where we assumed that a peripheral would always re-attach as device0. The resume timeouts make sense in that in those cases, the enumeration/initialization never happens a second time. One possible explanation is that this problem typically happens when a bus clash is reported, so it could very well be that the detection is fooled by a transient electrical issue or conflict between two peripherals. This patch conditionally double-checks the status reported in the sticky bits with the actual PING frame status. If the peripheral reports as attached in PING frames, the early detection based on sticky bits is discarded. Note that this patch only corrects issues of false positives on the manager side. If the peripheral lost and regain sync, then it would report as attached on Device0. A peripheral that would not reset its dev_num would not be compliant with the MIPI specification. BugLink: #3638 BugLink: #3325 Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Rander Wang <rander.wang@intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com>

plbossart · 2022-09-28T13:26:19Z

I am going to close this bug since we have fixes from Cirrus Logic that should solve the issue with more than one device per link. Of course if we see this problem again we can reopen as needed.

keqiaozhang added bug Something isn't working codec Codec HW or driver restriction TGL Applies to Tiger Lake platform SDW Applies to SoundWire bus for codec connection labels May 10, 2022

keqiaozhang added the P2 Critical bugs or normal features label May 23, 2022

plbossart mentioned this issue Jun 27, 2022

SoundWire: log actual PING status on resume issues #3724

Merged

plbossart closed this as completed Sep 28, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG][TGL] max98373 sdw:1:019f:8373:00:7 initialization not complete #3638

[BUG][TGL] max98373 sdw:1:019f:8373:00:7 initialization not complete #3638

keqiaozhang commented May 10, 2022

plbossart commented May 10, 2022 •

edited

Loading

plbossart commented Jun 8, 2022 •

edited

Loading

keqiaozhang commented Jun 15, 2022

plbossart commented Jun 15, 2022

plbossart commented Jun 15, 2022

plbossart commented Jun 21, 2022 •

edited

Loading

keqiaozhang commented Jul 26, 2022

plbossart commented Jul 26, 2022

bardliao commented Jul 27, 2022

plbossart commented Jul 27, 2022

bardliao commented Jul 28, 2022

keqiaozhang commented Aug 29, 2022

plbossart commented Sep 28, 2022

[BUG][TGL] max98373 sdw:1:019f:8373:00:7 initialization not complete #3638

[BUG][TGL] max98373 sdw:1:019f:8373:00:7 initialization not complete #3638

Comments

keqiaozhang commented May 10, 2022

plbossart commented May 10, 2022 • edited Loading

plbossart commented Jun 8, 2022 • edited Loading

keqiaozhang commented Jun 15, 2022

plbossart commented Jun 15, 2022

plbossart commented Jun 15, 2022

plbossart commented Jun 21, 2022 • edited Loading

keqiaozhang commented Jul 26, 2022

plbossart commented Jul 26, 2022

bardliao commented Jul 27, 2022

plbossart commented Jul 27, 2022

bardliao commented Jul 28, 2022

keqiaozhang commented Aug 29, 2022

plbossart commented Sep 28, 2022

plbossart commented May 10, 2022 •

edited

Loading

plbossart commented Jun 8, 2022 •

edited

Loading

plbossart commented Jun 21, 2022 •

edited

Loading