Skip to content

Fw boot refine and DSP init retry fixes#507

Merged
plbossart merged 2 commits intothesofproject:topic/sof-devfrom
keyonjie:fw_boot_retry
Jan 9, 2019
Merged

Fw boot refine and DSP init retry fixes#507
plbossart merged 2 commits intothesofproject:topic/sof-devfrom
keyonjie:fw_boot_retry

Conversation

@keyonjie
Copy link
Copy Markdown

@keyonjie keyonjie commented Jan 7, 2019

We met DSP boot failed issues at GLK suspend/resume stress test(>1000 iterations), the dmesg error logs looks like this:

[17960.915106] sof-audio sof-audio: error: iteration 0 of load fw failed err: -62
[17961.592271] sof-audio sof-audio: error: status = 0x00000000 panic = 0x00000000
[17961.614226] sof-audio sof-audio: Error code=0xffffffff: FW status=0xffffffff
[17961.614232] sof-audio sof-audio: iteration 1 of Core En/ROM load fail:-5
[17961.659305] sof-audio sof-audio: error: rx list empty but received 0x4200
[17961.659309] sof-audio sof-audio: error: can't find message header 0x1004200
[17962.269225] sof-audio sof-audio: error: status = 0x00000000 panic = 0x00000000
[17962.291209] sof-audio sof-audio: Error code=0xffffffff: FW status=0xffffffff
[17962.291214] sof-audio sof-audio: iteration 2 of Core En/ROM load fail:-5
[17962.291233] sof-audio sof-audio: error: status = 0xffffffff panic = 0xffffffff
[17962.291253] sof-audio sof-audio: error: load fw failed after 3 attempts with err: -5
[17962.291256] sof-audio sof-audio: error: failed to reset DSP
[17962.291259] sof-audio sof-audio: error: failed to boot DSP firmware after resume -5

After analyzing and debugging, we root cause it to several gaps here:

  1. There is a possibility of CSE auth failing during stress tests due to a race condition between when the audio driver requests fw authentication and then the CSE is ready, that means, if driver request fw authentication before CSE is ready, we may fail.
  2. We should only retry to check until the ROM init(CSE ready) is done, but neither stream prepare nor FW copy need the retry.
  3. We need cleanup to free/put code loader stream(once we have get that successfully), no matter the whole cl_boot_firmware success or not, otherwise, there might be stream 'leak' in suspend/resume stress test failures.

Here aligning Polling with cAVS driver, refining dsp init retry, and do cleanup to avoid HDA stream 'leak'(stream not put, can't be used anymore).

With these changes, I have verified(used hack below) that our retry mechanism works on APL UP^2.

I used hack showed in below link to verify that the retrying works:
keyonjie/linux@7ceae80

Copy link
Copy Markdown
Member

@plbossart plbossart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return status and management of tag/ret values needs more work.

Comment thread sound/soc/sof/ops.c Outdated
Comment thread sound/soc/sof/ops.c Outdated
Comment thread sound/soc/sof/intel/hda-loader.c Outdated
Comment thread sound/soc/sof/intel/hda-loader.c Outdated
Comment thread sound/soc/sof/intel/hda-loader.c Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very odd. You could return 0 as an error but the calling layers in core.c expects errors to be <0.
Again mixing tags and return values doesn't seem to be very wise.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to change too much in this small series, shall we keep cl_stream_prepare() what it is now, and refine it later if needed?

Today, it returns allocated stream_tag(>=1) at success and minus value(<0) at fail.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so zero is what then?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so zero is what then?

There won't be zero returned, so will you feel better if I change it like this:

--- if (tag <= 0) {
+++ if (tag < 0) {

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@plbossart changed.

@keyonjie
Copy link
Copy Markdown
Author

keyonjie commented Jan 8, 2019

@plbossart updated.

Copy link
Copy Markdown
Member

@plbossart plbossart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the second patch is very hard to review, I had to go to the branch to figure it out myself...

Comment thread sound/soc/sof/ops.c Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it'd be nicer if this was timeout_ms. we will get this feedback from Andy so let's anticipate :-)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let me change it.

Comment thread sound/soc/sof/ops.c Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make those constants device-specific, e.g. with fields that are set in sdev? I don't see any reason why every hardware on the planet would need to use those constants?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is for register polling, it poll until we get some specific value or timeout, we sleep some time between every 2 polling to save CPU bandwidth, we will get the target value within different time polling for different devices/registers, no any document staged this for specific devices, here it is a quite generic algorithm about the trade-off between responsiveness(about register update) and bandwidth occupying.

I think the logic here is:

  1. for those registers which its value is changed to target one rapidly, let's get it ASAP, that is phase 1(0.5~1.0 ms sleep only).
  2. after tried enough(e.g. 10 here) times with failing, we deem that it might need wait more time, let's take it easy, slow down our cadence.

We can define macros for those numbers(e.g. PHASE1_TRY=10, PHASE1_MIN/MAX, PHASE2_MIN/MAX, ...), but I don't think we can find any theoretical basis about what numbers are ideal ones(let alone to specific devices), if we won't change them frequently, I think using direct numbers(not macros) is fine.

@plbossart what's your opinion?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested constant fields that are provided in the SOC-dependent structures, not generic macros which indeed are not better than magic values.

Comment thread sound/soc/sof/intel/hda-loader.c Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverse xmas tree please, we will get that feedback when upstreaming

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let me do it as already at here.

Comment thread sound/soc/sof/intel/hda-loader.c Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the error sequence doesn't seem to deal with more than 2 cores?

Copy link
Copy Markdown
Author

@keyonjie keyonjie Jan 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, it doesn't, @ranj063 mentioned to do that so I thought it already done, let me change it now.

ranj063 and others added 2 commits January 9, 2019 11:00
Optimize DSP register polling to poll in phases starting with finer
sleep value and moving on to larger sleep values(tips from what we
do in sst driver). This is helpful to reduce the extra waiting time
(e.g. from 20ms to 1ms) after hw status/registers updated.

Signed-off-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
… retries

There is a possibility of CSE auth failing during stress tests due to a
race condition between when the audio driver requests fw authentication
and then the CSE is ready. Reduce the timeout to check for ROM init and
retry this step a few times to improve the chances of FW download being
successful.

For the retrying, we only need retry cl_dsp_init() which with race
condition existed described above, here moves cl_stream_prepare() and
cl_copy_fw() out of the retry loop.

Here we change to do cleanup(put stream, free DMA buffer, ...) at
xx_cl_boot_firmware() done, no matter it fail or success there, this
will fix the stream 'leak' issue at boot_firmware() fail during
suspend/resume test.

Signed-off-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
@keyonjie
Copy link
Copy Markdown
Author

keyonjie commented Jan 9, 2019

@plbossart addressed and updated.

Copy link
Copy Markdown
Member

@plbossart plbossart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will merge but log an improvement to see if we can have platform-dependent values instead of hard-coded 20/5/10 magic values

@plbossart
Copy link
Copy Markdown
Member

@xiulipan can you please fix the compiler used by the Travis CI?

@plbossart plbossart merged commit e3e4d0f into thesofproject:topic/sof-dev Jan 9, 2019
@ranj063 ranj063 mentioned this pull request Jan 10, 2019
@xiulipan
Copy link
Copy Markdown

@plbossart
Sorry about the delay, tried some ENV way to switch gcc but failed.
Replace the default compiler with #520

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants