Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG][TGL] max98373 sdw:1:019f:8373:00:7 initialization not complete #3638

Closed
keqiaozhang opened this issue May 10, 2022 · 13 comments
Closed
Labels
bug Something isn't working codec Codec HW or driver restriction P2 Critical bugs or normal features SDW Applies to SoundWire bus for codec connection TGL Applies to Tiger Lake platform

Comments

@keqiaozhang
Copy link
Collaborator

Describe the bug

max98373 codec driver initialization not completed when testing check-suspend-resume-with-capture-5.

2022-05-10 00:57:56 UTC [ERROR] Caught kernel log error
===========================>>
[ 8050.114557] kernel: max98373 sdw:1:019f:8373:00:7: Initialization not complete, timed out
[ 8050.114566] kernel: max98373 sdw:1:019f:8373:00:7: PM: dpm_run_callback(): acpi_subsys_resume+0x0/0x70 returns -110
[ 8050.114582] kernel: max98373 sdw:1:019f:8373:00:7: PM: failed to resume: error -110
<<===========================

To Reproduce
~/sof-test/test-case/check-suspend-resume-with-audio.sh -l 5 -m capture

Reproduction Rate
Hard to tell now, CI observed this issue for the first time and no reproductions by manual check so far..

Expected behavior
sof can resume form S3

Impact
codec driver initialization failed.

Environment

  1. Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver).
    • Kernel: sof-dev/e89036145299
    • SOF: main/2fbbdb032c12
  2. Name of the topology file
    • Topology: sof-tgl-sdw-max98373-rt5682.tplg
  3. Name of the platform(s) on which the bug is observed.
    • Platform: TGL Chrome device Volteer

Screenshots or console output
dmesg

[ 8044.996720] kernel: max98373 sdw:1:019f:8373:00:3: sdw_modify_slave_status: signaling enumeration completion for Slave 2
[ 8045.007934] kernel: max98373 sdw:1:019f:8373:00:3: Reset completed (retry:0)
[ 8045.009226] kernel: max98373 sdw:1:019f:8373:00:3: sdw_handle_slave_status: signaling initialization completion for Slave 2
[ 8045.274836] kernel: rt5682 sdw:0:025d:5682:00: rt5682_io_init hw_init complete: 0
[ 8045.274844] kernel: rt5682 sdw:0:025d:5682:00: sdw_handle_slave_status: signaling initialization completion for Slave 1
[ 8045.274894] kernel: soundwire_intel soundwire_intel.link.0: Slave status change: 0x40
[ 8045.275291] kernel: soundwire_intel soundwire_intel.link.0: Slave status change: 0x20
[ 8045.467420] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 8045.467456] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 8045.967381] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 8045.967420] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 8047.467417] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 8047.467458] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 8047.854408] kernel: ax88179_178a 4-1:1.0 enx000ec66a790d: ax88179 - Link status is: 1
[ 8048.467419] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 8048.467465] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 8049.467163] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 8049.467200] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 8050.114557] kernel: max98373 sdw:1:019f:8373:00:7: Initialization not complete, timed out
[ 8050.114566] kernel: max98373 sdw:1:019f:8373:00:7: PM: dpm_run_callback(): acpi_subsys_resume+0x0/0x70 returns -110
[ 8050.114582] kernel: max98373 sdw:1:019f:8373:00:7: PM: failed to resume: error -110
[ 8050.116705] kernel: OOM killer enabled.
[ 8050.116710] kernel: Restarting tasks ... done.
[ 8050.132377] kernel: PM: suspend exit

dmesg.txt
logger.txt

@keqiaozhang keqiaozhang added bug Something isn't working codec Codec HW or driver restriction TGL Applies to Tiger Lake platform SDW Applies to SoundWire bus for codec connection labels May 10, 2022
@plbossart
Copy link
Member

plbossart commented May 10, 2022

I think this is the same issue as #3063

[ 8044.971968] kernel: soundwire_intel soundwire_intel.link.0: intel_link_power_up: powering up all links
[ 8044.971974] kernel: soundwire_intel soundwire_intel.link.0: intel_link_power_up: first link up, programming SYNCPRD
...
[ 8044.973918] kernel: max98373 sdw:1:019f:8373:00:7: sdw_modify_slave_status: initializing enumeration and init completion for Slave 1
[ 8044.973921] kernel: max98373 sdw:1:019f:8373:00:3: sdw_modify_slave_status: initializing enumeration and init completion for Slave 2
[ 8044.974711] kernel: soundwire_intel soundwire_intel.link.1: Slave status change: 0x2
[ 8044.974715] kernel: soundwire sdw-master-1: Slave attached, programming device number
[ 8044.974902] kernel: soundwire sdw-master-1: SDW Slave Addr: 27019f837300
[ 8044.974903] kernel: soundwire sdw-master-1: SDW Slave class_id 0x00, mfg_id 0x019f, part_id 0x8373, unique_id 0x7, version 0x2
[ 8044.974904] kernel: soundwire sdw-master-1: Slave already registered, reusing dev_num:1
[ 8044.975145] kernel: soundwire_intel soundwire_intel.link.1: Msg ignored for Slave 0
[ 8044.975146] kernel: soundwire sdw-master-1: No more devices to enumerate
[ 8044.975173] kernel: soundwire_intel soundwire_intel.link.1: Slave status change: 0x21
[ 8044.975176] kernel: max98373 sdw:1:019f:8373:00:7: sdw_modify_slave_status: signaling enumeration completion for Slave 1
[ 8044.987010] kernel: max98373 sdw:1:019f:8373:00:7: Reset completed (retry:0)
[ 8044.988289] kernel: max98373 sdw:1:019f:8373:00:7: sdw_handle_slave_status: signaling initialization completion for Slave 1
[ 8044.996187] kernel: soundwire_intel soundwire_intel.link.1: Slave status change: 0x2
[ 8044.996198] kernel: max98373 sdw:1:019f:8373:00:7: Slave 1 state check1: UNATTACHED, status was 1 
<<< THIS IS NOT EXPECTED!
[ 8044.996201] kernel: max98373 sdw:1:019f:8373:00:7: sdw_modify_slave_status: initializing enumeration and init completion for Slave 1
[ 8044.996210] kernel: soundwire sdw-master-1: Slave attached, programming device number
[ 8044.996405] kernel: soundwire sdw-master-1: SDW Slave Addr: 23019f837300
[ 8044.996408] kernel: soundwire sdw-master-1: SDW Slave class_id 0x00, mfg_id 0x019f, part_id 0x8373, unique_id 0x3, version 0x2
[ 8044.996412] kernel: soundwire sdw-master-1: Slave already registered, reusing dev_num:2
[ 8044.996655] kernel: soundwire_intel soundwire_intel.link.1: Msg ignored for Slave 0
[ 8044.996657] kernel: soundwire sdw-master-1: No more devices to enumerate 
<<< END OF ENUMERATION
[ 8044.996713] kernel: soundwire_intel soundwire_intel.link.1: Slave status change: 0x201 
~~<<<< THIS MEANS THE STATUS CHANGES AFTER THE END OF ENUMERATION!~~
<<<< THIS MEANS DEVICE0 REMAINS UNATTACHED
[ 8044.996720] kernel: max98373 sdw:1:019f:8373:00:3: sdw_modify_slave_status: signaling enumeration completion for Slave 2
[ 8045.007934] kernel: max98373 sdw:1:019f:8373:00:3: Reset completed (retry:0)
[ 8045.009226] kernel: max98373 sdw:1:019f:8373:00:3: sdw_handle_slave_status: signaling initialization completion for Slave 2
....
[ 8050.114557] kernel: max98373 sdw:1:019f:8373:00:7: Initialization not complete, timed out
[ 8050.114566] kernel: max98373 sdw:1:019f:8373:00:7: PM: dpm_run_callback(): acpi_subsys_resume+0x0/0x70 returns -110
[ 8050.114582] kernel: max98373 sdw:1:019f:8373:00:7: PM: failed to resume: error -110

@keqiaozhang keqiaozhang added the P2 Critical bugs or normal features label May 23, 2022
@plbossart
Copy link
Member

plbossart commented Jun 8, 2022

Really unclear what causes the second device to remain unattached - but clearly there's already a hardware problem earlier since devices are not supposed to fall off the bus.

@keqiaozhang
Copy link
Collaborator Author

@plbossart
Copy link
Member

[ 8217.255528] kernel: soundwire_bus:sdw_modify_slave_status: max98373 sdw:2:019f:8373:00:7: sdw_modify_slave_status: initializing enumeration and init completion for Slave 5
[ 8217.255563] kernel: soundwire_bus:sdw_modify_slave_status: max98373 sdw:2:019f:8373:00:3: sdw_modify_slave_status: initializing enumeration and init completion for Slave 6
[ 8217.255611] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x20000
[ 8217.256424] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.2: Slave status change: 0x2
[ 8217.256431] kernel: soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-1: Slave attached, programming device number
[ 8217.256659] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-1: SDW Slave Addr: 27019f837300
[ 8217.256661] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-1: SDW Slave class_id 0x00, mfg_id 0x019f, part_id 0x8373, unique_id 0x7, version 0x2
[ 8217.256665] kernel: soundwire_bus:sdw_assign_device_num: soundwire sdw-master-1: Slave already registered, reusing dev_num:5
[ 8217.256926] kernel: soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.2: Msg ignored for Slave 0
[ 8217.256928] kernel: soundwire_bus:sdw_program_device_num: soundwire sdw-master-1: No more devices to enumerate
[ 8217.256978] kernel: soundwire_bus:sdw_modify_slave_status: max98373 sdw:2:019f:8373:00:7: sdw_modify_slave_status: signaling enumeration completion for Slave 5
[ 8217.257047] kernel: max98373 sdw:2:019f:8373:00:7: Bus clash detected before INT mask is enabled
<< BUS CLASH?
[ 8217.268855] kernel: max98373 sdw:2:019f:8373:00:7: Reset completed (retry:0)
[ 8217.270631] kernel: soundwire_bus:sdw_handle_slave_status: max98373 sdw:2:019f:8373:00:7: sdw_handle_slave_status: signaling initialization completion for Slave 5
[ 8217.278179] kernel: max98373 sdw:2:019f:8373:00:7: Slave 5 state check1: UNATTACHED, status was 1
<<< LOST SYNC
[ 8217.278187] kernel: soundwire_bus:sdw_modify_slave_status: max98373 sdw:2:019f:8373:00:7: sdw_modify_slave_status: initializing enumeration and init completion for Slave 5
<<< Slave 5 never comes back
[ 8217.278203] kernel: soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-1: Slave attached, programming device number
[ 8217.278516] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-1: SDW Slave Addr: 23019f837300
[ 8217.278522] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-1: SDW Slave class_id 0x00, mfg_id 0x019f, part_id 0x8373, unique_id 0x3, version 0x2
[ 8217.278528] kernel: soundwire_bus:sdw_assign_device_num: soundwire sdw-master-1: Slave already registered, reusing dev_num:6
[ 8217.278932] kernel: soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.2: Msg ignored for Slave 0
[ 8217.278939] kernel: soundwire_bus:sdw_program_device_num: soundwire sdw-master-1: No more devices to enumerate
[ 8217.279064] kernel: soundwire_bus:sdw_modify_slave_status: max98373 sdw:2:019f:8373:00:3: sdw_modify_slave_status: signaling enumeration completion for Slave 6
[ 8217.279190] kernel: max98373 sdw:2:019f:8373:00:3: Bus clash detected before INT mask is enabled
[ 8217.291380] kernel: max98373 sdw:2:019f:8373:00:3: Reset completed (retry:0)

<< BUS CLASH AGAIN?

[ 8217.294048] kernel: soundwire_bus:sdw_handle_slave_status: max98373 sdw:2:019f:8373:00:3: sdw_handle_slave_status: signaling initialization completion for Slave 6

<< Slave 6 complete init

[ 8217.306311] kernel: snd_soc_rt5682:rt5682_headset_detect: rt5682 sdw:0:025d:5682:00: jack_type = 0
[ 8217.487426] kernel: asix 3-9.1:1.0 enx000ec67071e8: Link is Down
[ 8219.550794] kernel: asix 3-9.1:1.0 enx000ec67071e8: Link is Up - 100Mbps/Full - flow control off
[ 8222.615538] kernel: max98373 sdw:2:019f:8373:00:7: Initialization not complete, timed out
[ 8222.615551] kernel: max98373 sdw:2:019f:8373:00:7: PM: dpm_run_callback(): acpi_subsys_resume+0x0/0x70 returns -110
[ 8222.615589] kernel: max98373 sdw:2:019f:8373:00:7: PM: failed to resume: error -110

@plbossart
Copy link
Member

The pattern seems to be that the device '7' (Slave 5) falls off the bus first, and then never shows up again.

This device '7' is not a hardware issue, it's likely due to the SoundWire enumeration arbitration. When there are identical devices, the one with the largest unique_id wins and will be handled first.

Why it loses sync is really unclear. This should not happen!

@plbossart
Copy link
Member

plbossart commented Jun 21, 2022

Same error in https://sof-ci.01.org/linuxpr/PR3708/build250/devicetest/?model=TGLU_RVP_SDW&testcase=check-suspend-resume-5, but this time with a different codec.

[ 1352.736159] kernel: soundwire_bus:sdw_modify_slave_status: rt1308 sdw:1:025d:1308:00:2: initializing enumeration and init completion for Slave 5
[ 1352.736165] kernel: soundwire_bus:sdw_modify_slave_status: rt1308 sdw:1:025d:1308:00:0: initializing enumeration and init completion for Slave 6
[ 1352.737049] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.1: Slave status change: 0x2
[ 1352.737054] kernel: soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-1: Slave attached, programming device number
[ 1352.737239] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-1: SDW Slave Addr: 20025d130800
[ 1352.737240] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-1: SDW Slave class_id 0x00, mfg_id 0x025d, part_id 0x1308, unique_id 0x0, version 0x2
[ 1352.737242] kernel: soundwire_bus:sdw_assign_device_num: soundwire sdw-master-1: Slave already registered, reusing dev_num:6
[ 1352.737553] kernel: soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.1: Msg ignored for Slave 0
[ 1352.737554] kernel: soundwire_bus:sdw_program_device_num: soundwire sdw-master-1: No more devices to enumerate
[ 1352.737589] kernel: soundwire_bus:sdw_modify_slave_status: rt1308 sdw:1:025d:1308:00:0: signaling enumeration completion for Slave 6
[ 1352.745819] kernel: snd_soc_rt1308_sdw:rt1308_io_init: rt1308 sdw:1:025d:1308:00:0: rt1308_io_init hw_init complete
[ 1352.745820] kernel: soundwire_bus:sdw_handle_slave_status: rt1308 sdw:1:025d:1308:00:0: signaling initialization completion for Slave 6
[ 1352.760641] kernel: rt1308 sdw:1:025d:1308:00:0: Slave 6 state check1: UNATTACHED, status was 1
[ 1352.760648] kernel: soundwire_bus:sdw_modify_slave_status: rt1308 sdw:1:025d:1308:00:0: initializing enumeration and init completion for Slave 6
[ 1352.760658] kernel: soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-1: Slave attached, programming device number
[ 1352.760891] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-1: SDW Slave Addr: 22025d130800
[ 1352.760894] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-1: SDW Slave class_id 0x00, mfg_id 0x025d, part_id 0x1308, unique_id 0x2, version 0x2
[ 1352.760899] kernel: soundwire_bus:sdw_assign_device_num: soundwire sdw-master-1: Slave already registered, reusing dev_num:5
[ 1352.761227] kernel: soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.1: Msg ignored for Slave 0
[ 1352.761230] kernel: soundwire_bus:sdw_program_device_num: soundwire sdw-master-1: No more devices to enumerate
[ 1352.761330] kernel: soundwire_bus:sdw_modify_slave_status: rt1308 sdw:1:025d:1308:00:2: signaling enumeration completion for Slave 5
[ 1352.770175] kernel: snd_soc_rt1308_sdw:rt1308_io_init: rt1308 sdw:1:025d:1308:00:2: rt1308_io_init hw_init complete
[ 1352.770176] kernel: soundwire_bus:sdw_handle_slave_status: rt1308 sdw:1:025d:1308:00:2: signaling initialization completion for Slave 5
[ 1352.974900] kernel: soundwire_bus:sdw_handle_slave_alerts: rt711 sdw:0:025d:0711:00: Slave impl defined interrupt
[ 1352.974906] kernel: snd_soc_rt711:rt711_interrupt_callback: rt711 sdw:0:025d:0711:00: rt711_interrupt_callback control_port_stat=4
[ 1353.233082] kernel: snd_soc_rt711:rt711_sdw_read: rt711 sdw:0:025d:0711:00: [rt711_sdw_read] b921 0000 => 80000000
[ 1353.234310] kernel: snd_soc_rt711:rt711_sdw_read: rt711 sdw:0:025d:0711:00: [rt711_sdw_read] 7520 85a0 9c20 aca0 => 00000800
[ 1353.235346] kernel: snd_soc_rt711:rt711_sdw_read: rt711 sdw:0:025d:0711:00: [rt711_sdw_read] 7520 85a0 9c20 aca0 => 00000030
[ 1353.235350] kernel: snd_soc_rt711:rt711_jack_detect_handler: rt711 sdw:0:025d:0711:00: in rt711_jack_detect_handler, jack_type=0x3
[ 1353.235353] kernel: snd_soc_rt711:rt711_jack_detect_handler: rt711 sdw:0:025d:0711:00: in rt711_jack_detect_handler, btn_type=0x0
[ 1354.790460] kernel: asix 3-3.1:1.0 enx000ec66547b6: Link is Up - 100Mbps/Full - flow control off
[ 1358.112232] kernel: rt1308 sdw:1:025d:1308:00:0: Initialization not complete, timed out
[ 1358.112237] kernel: rt1308 sdw:1:025d:1308:00:0: PM: dpm_run_callback(): acpi_subsys_resume+0x0/0x70 returns -110
[ 1358.112249] kernel: rt1308 sdw:1:025d:1308:00:0: PM: failed to resume: error -110

plbossart added a commit to plbossart/sound that referenced this issue Jun 27, 2022
This helper should help identify cases where devices fall off the bus
and don't resync.

BugLink: thesofproject#3638
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
plbossart added a commit to plbossart/sound that referenced this issue Jun 27, 2022
This helper should help identify cases where devices fall off the bus
and don't resync.

BugLink: thesofproject#3638
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
plbossart added a commit to plbossart/sound that referenced this issue Jun 28, 2022
This helper should help identify cases where devices fall off the bus
and don't resync.

BugLink: thesofproject#3638
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
plbossart added a commit that referenced this issue Jun 29, 2022
This helper should help identify cases where devices fall off the bus
and don't resync.

BugLink: #3638
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
plbossart added a commit that referenced this issue Jul 6, 2022
This helper should help identify cases where devices fall off the bus
and don't resync.

BugLink: #3638
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Reviewed-by: Rander Wang <rander.wang@intel.com>
plbossart added a commit that referenced this issue Jul 7, 2022
This helper should help identify cases where devices fall off the bus
and don't resync.

BugLink: #3638
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Reviewed-by: Rander Wang <rander.wang@intel.com>
bardliao pushed a commit to bardliao/linux that referenced this issue Jul 13, 2022
This helper should help identify cases where devices fall off the bus
and don't resync.

BugLink: thesofproject#3638
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Rander Wang <rander.wang@intel.com>
Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
plbossart added a commit that referenced this issue Jul 15, 2022
This helper should help identify cases where devices fall off the bus
and don't resync.

BugLink: #3638
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Reviewed-by: Rander Wang <rander.wang@intel.com>
plbossart added a commit that referenced this issue Jul 25, 2022
This helper should help identify cases where devices fall off the bus
and don't resync.

BugLink: #3638
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Reviewed-by: Rander Wang <rander.wang@intel.com>
@keqiaozhang
Copy link
Collaborator Author

@plbossart
Copy link
Member

Ah, that's super interesting. Now that we added the PING status, we can check what the real status is.

[ 7660.952588] kernel: max98373 sdw:2:019f:8373:00:7: Initialization not complete, timed out
[ 7660.967828] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-1: PING status: 0x1100

This means bits 8 and 12 are set. That means Slave4 and Slave6 are attached.

This suggests that the device reports as UNATTACHED but does not transition to the soft reset stage, and keeps reporting as ATTACHED with the original device number. In other words, either

  • the detection of the UNATTACHED device is incorrect.
  • the implementation of the device is incorrect.

That is of course not a logic that we thought was even possible. What we may need to do is that if a device reports as UNATTACHED we double-check the ping status and ignore that status change.

@bardliao what do you think?

plbossart added a commit to plbossart/sound that referenced this issue Jul 26, 2022
In configurations with two amplifiers on the same link, the Intel CI
reports occasional enumeration/initialization timeouts during
suspend-resume stress tests, where one of the two amplifiers becomes
UNATTACHED immediately after being enumerated. This problem was
reported both with Maxim and Realtek codecs, which pointed initially
to a problem with status handling on the Intel side.

The Cadence IP integrated on Intel platforms throws an interrupt when
the status changes, and the information is kept with sticky bits until
cleared. We initially added more checks to make sure the edge
detection did not miss any transition, but that did not improve the
results significantly.

With the recent addition of the read_ping_status() callback, we were
able to show that the status in sticky bits is shown as UNATTACHED
even though the PING frames show the problematic device as
ATTACHED. That completely breaks the entire logic where we assumed
that a peripheral would always re-attach as device0. The resume
timeouts make sense in that in those cases, the
enumeration/initialization never happens a second time.

One possible explanation is that this problem typically happens when a
bus clash is reported, so it could very well be that the detection is
fooled by a transient electrical issue or conflict between two
peripherals.

This patch conditionally double-checks the status reported in the
sticky bits with the actual PING frame status. If the peripheral
reports as attached in PING frames, the early detection based on
sticky bits is discarded.

Note that this patch only corrects issues of false positives on the
manager side.

If the peripheral lost and regain sync, then it would report as
attached on Device0. A peripheral that would not reset its dev_num
would not be compliant with the MIPI specification.

BugLink: thesofproject#3638
BugLink: thesofproject#3325
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
@bardliao
Copy link
Collaborator

@plbossart

[ 7655.624671] kernel: soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-1: Slave attached, programming device number
[ 7655.624863] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-1: SDW Slave Addr: 27019f837300
[ 7655.624864] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-1: SDW Slave class_id 0x00, mfg_id 0x019f, part_id 0x8373, unique_id 0x7, version 0x2
[ 7655.624866] kernel: soundwire_bus:sdw_assign_device_num: soundwire sdw-master-1: Slave already registered, reusing dev_num:4
[ 7655.625113] kernel: soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.2: Msg ignored for Slave 0
[ 7655.625114] kernel: soundwire_bus:sdw_program_device_num: soundwire sdw-master-1: No more devices to enumerate
[ 7655.625165] kernel: soundwire_bus:sdw_modify_slave_status: max98373 sdw:2:019f:8373:00:7: signaling enumeration completion for Slave 4
[ 7655.625268] kernel: max98373 sdw:2:019f:8373:00:7: Bus clash detected before INT mask is enabled
[ 7655.637076] kernel: max98373 sdw:2:019f:8373:00:7: Reset completed (retry:0)
[ 7655.639310] kernel: soundwire_bus:sdw_handle_slave_status: max98373 sdw:2:019f:8373:00:7: signaling initialization completion for Slave 4
[ 7655.646506] kernel: max98373 sdw:2:019f:8373:00:7: Slave 4 state check1: UNATTACHED, status was 1

sdw_handle_slave_status checks status[i] right after programming device number. Is it possible that HW need some time to update the slave status from unattached to attached? That's why the status[] we read doesn't match the slave->status. I am wondering what is the ping status at that moment. Can we read the status after programing device number to make sure the slave is really attached?

@plbossart
Copy link
Member

@bardliao sdw_handle_slave_status() is called when there is an interrupt signalling a change of status, which is detected when the PING frame status. If the device didn't update its PING frame status there would be no interrupt and no status handling.

That said, we could add a check that the device is indeed attach. In other words both for attach and detach events we would check the PING frame status to make sure the hardware detection didn't report a bad event.

@bardliao
Copy link
Collaborator

@bardliao sdw_handle_slave_status() is called when there is an interrupt signalling a change of status, which is detected when the PING frame status. If the device didn't update its PING frame status there would be no interrupt and no status handling.

The interrupt could be triggered by max98373 sdw:2:019f:8373:00:3 (Slave 6), right?
The scenario could be:

  1. max98373 sdw:2:019f:8373:00:7 (Slave 4) and sdw:2:019f:8373:00:3 (Slave 6) are attached at the same time
  2. sdw_handle_slave_status is called for Slave 4 first, program the device number and set Slave->status etc.
  3. sdw_handle_slave_status is called for Slave 6 and check status for all devices and find Slave 4's status is still unattached.

I agree the unattached status is a false positive. But I am not sure if checking ping status at the same place will fix the issue or not. It seems to be some kind of timing issue to me.

That said, we could add a check that the device is indeed attach. In other words both for attach and detach events we would check the PING frame status to make sure the hardware detection didn't report a bad event.

plbossart added a commit that referenced this issue Aug 1, 2022
This helper should help identify cases where devices fall off the bus
and don't resync.

BugLink: #3638
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Reviewed-by: Rander Wang <rander.wang@intel.com>
plbossart added a commit that referenced this issue Aug 2, 2022
In configurations with two amplifiers on the same link, the Intel CI
reports occasional enumeration/initialization timeouts during
suspend-resume stress tests, where one of the two amplifiers becomes
UNATTACHED immediately after being enumerated. This problem was
reported both with Maxim and Realtek codecs, which pointed initially
to a problem with status handling on the Intel side.

The Cadence IP integrated on Intel platforms throws an interrupt when
the status changes, and the information is kept with sticky bits until
cleared. We initially added more checks to make sure the edge
detection did not miss any transition, but that did not improve the
results significantly.

With the recent addition of the read_ping_status() callback, we were
able to show that the status in sticky bits is shown as UNATTACHED
even though the PING frames show the problematic device as
ATTACHED. That completely breaks the entire logic where we assumed
that a peripheral would always re-attach as device0. The resume
timeouts make sense in that in those cases, the
enumeration/initialization never happens a second time.

One possible explanation is that this problem typically happens when a
bus clash is reported, so it could very well be that the detection is
fooled by a transient electrical issue or conflict between two
peripherals.

This patch conditionally double-checks the status reported in the
sticky bits with the actual PING frame status. If the peripheral
reports as attached in PING frames, the early detection based on
sticky bits is discarded.

Note that this patch only corrects issues of false positives on the
manager side.

If the peripheral lost and regain sync, then it would report as
attached on Device0. A peripheral that would not reset its dev_num
would not be compliant with the MIPI specification.

BugLink: #3638
BugLink: #3325
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this issue Aug 18, 2022
This helper should help identify cases where devices fall off the bus
and don't resync.

BugLink: thesofproject#3638
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Rander Wang <rander.wang@intel.com>
Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Acked-By: Vinod Koul <vkoul@kernel.org>
Link: https://lore.kernel.org/r/20220714011043.46059-5-yung-chuan.liao@linux.intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this issue Aug 18, 2022
Merge series from Bard Liao <yung-chuan.liao@linux.intel.com>:

we've been stuck with problems in the dual-amplifier configurations where
one of the two devices seems to become UNATTACHED and never regains sync,
see thesofproject#3638.

This is a rather infrequent issue that may happen once or twice per month,
but still it remains a concern.

One possibility is that the device does lose sync but somehow our hardware
detection fails to see it resync.

This series just adds a basic read directly from the PING frames to help
confirm if yes/no the device regain sync.
bardliao pushed a commit to bardliao/linux that referenced this issue Aug 24, 2022
In configurations with two amplifiers on the same link, the Intel CI
reports occasional enumeration/initialization timeouts during
suspend-resume stress tests, where one of the two amplifiers becomes
UNATTACHED immediately after being enumerated. This problem was
reported both with Maxim and Realtek codecs, which pointed initially
to a problem with status handling on the Intel side.

The Cadence IP integrated on Intel platforms throws an interrupt when
the status changes, and the information is kept with sticky bits until
cleared. We initially added more checks to make sure the edge
detection did not miss any transition, but that did not improve the
results significantly.

With the recent addition of the read_ping_status() callback, we were
able to show that the status in sticky bits is shown as UNATTACHED
even though the PING frames show the problematic device as
ATTACHED. That completely breaks the entire logic where we assumed
that a peripheral would always re-attach as device0. The resume
timeouts make sense in that in those cases, the
enumeration/initialization never happens a second time.

One possible explanation is that this problem typically happens when a
bus clash is reported, so it could very well be that the detection is
fooled by a transient electrical issue or conflict between two
peripherals.

This patch conditionally double-checks the status reported in the
sticky bits with the actual PING frame status. If the peripheral
reports as attached in PING frames, the early detection based on
sticky bits is discarded.

Note that this patch only corrects issues of false positives on the
manager side.

If the peripheral lost and regain sync, then it would report as
attached on Device0. A peripheral that would not reset its dev_num
would not be compliant with the MIPI specification.

BugLink: thesofproject#3638
BugLink: thesofproject#3325
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Rander Wang <rander.wang@intel.com>
Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
@keqiaozhang
Copy link
Collaborator Author

ujfalusi pushed a commit that referenced this issue Sep 19, 2022
In configurations with two amplifiers on the same link, the Intel CI
reports occasional enumeration/initialization timeouts during
suspend-resume stress tests, where one of the two amplifiers becomes
UNATTACHED immediately after being enumerated. This problem was
reported both with Maxim and Realtek codecs, which pointed initially
to a problem with status handling on the Intel side.

The Cadence IP integrated on Intel platforms throws an interrupt when
the status changes, and the information is kept with sticky bits until
cleared. We initially added more checks to make sure the edge
detection did not miss any transition, but that did not improve the
results significantly.

With the recent addition of the read_ping_status() callback, we were
able to show that the status in sticky bits is shown as UNATTACHED
even though the PING frames show the problematic device as
ATTACHED. That completely breaks the entire logic where we assumed
that a peripheral would always re-attach as device0. The resume
timeouts make sense in that in those cases, the
enumeration/initialization never happens a second time.

One possible explanation is that this problem typically happens when a
bus clash is reported, so it could very well be that the detection is
fooled by a transient electrical issue or conflict between two
peripherals.

This patch conditionally double-checks the status reported in the
sticky bits with the actual PING frame status. If the peripheral
reports as attached in PING frames, the early detection based on
sticky bits is discarded.

Note that this patch only corrects issues of false positives on the
manager side.

If the peripheral lost and regain sync, then it would report as
attached on Device0. A peripheral that would not reset its dev_num
would not be compliant with the MIPI specification.

BugLink: #3638
BugLink: #3325
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Rander Wang <rander.wang@intel.com>
Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com>
ujfalusi pushed a commit that referenced this issue Sep 19, 2022
In configurations with two amplifiers on the same link, the Intel CI
reports occasional enumeration/initialization timeouts during
suspend-resume stress tests, where one of the two amplifiers becomes
UNATTACHED immediately after being enumerated. This problem was
reported both with Maxim and Realtek codecs, which pointed initially
to a problem with status handling on the Intel side.

The Cadence IP integrated on Intel platforms throws an interrupt when
the status changes, and the information is kept with sticky bits until
cleared. We initially added more checks to make sure the edge
detection did not miss any transition, but that did not improve the
results significantly.

With the recent addition of the read_ping_status() callback, we were
able to show that the status in sticky bits is shown as UNATTACHED
even though the PING frames show the problematic device as
ATTACHED. That completely breaks the entire logic where we assumed
that a peripheral would always re-attach as device0. The resume
timeouts make sense in that in those cases, the
enumeration/initialization never happens a second time.

One possible explanation is that this problem typically happens when a
bus clash is reported, so it could very well be that the detection is
fooled by a transient electrical issue or conflict between two
peripherals.

This patch conditionally double-checks the status reported in the
sticky bits with the actual PING frame status. If the peripheral
reports as attached in PING frames, the early detection based on
sticky bits is discarded.

Note that this patch only corrects issues of false positives on the
manager side.

If the peripheral lost and regain sync, then it would report as
attached on Device0. A peripheral that would not reset its dev_num
would not be compliant with the MIPI specification.

BugLink: #3638
BugLink: #3325
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Rander Wang <rander.wang@intel.com>
Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com>
@plbossart
Copy link
Member

I am going to close this bug since we have fixes from Cirrus Logic that should solve the issue with more than one device per link. Of course if we see this problem again we can reopen as needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working codec Codec HW or driver restriction P2 Critical bugs or normal features SDW Applies to SoundWire bus for codec connection TGL Applies to Tiger Lake platform
Projects
None yet
Development

No branches or pull requests

3 participants