Fw boot refine and DSP init retry fixes by keyonjie · Pull Request #507 · thesofproject/linux

keyonjie · 2019-01-07T05:40:24Z

We hit DSP boot failures during the GLK suspend/resume stress test (>1000 iterations). The dmesg error log looks like this:

[17960.915106] sof-audio sof-audio: error: iteration 0 of load fw failed err: -62
[17961.592271] sof-audio sof-audio: error: status = 0x00000000 panic = 0x00000000
[17961.614226] sof-audio sof-audio: Error code=0xffffffff: FW status=0xffffffff
[17961.614232] sof-audio sof-audio: iteration 1 of Core En/ROM load fail:-5
[17961.659305] sof-audio sof-audio: error: rx list empty but received 0x4200
[17961.659309] sof-audio sof-audio: error: can't find message header 0x1004200
[17962.269225] sof-audio sof-audio: error: status = 0x00000000 panic = 0x00000000
[17962.291209] sof-audio sof-audio: Error code=0xffffffff: FW status=0xffffffff
[17962.291214] sof-audio sof-audio: iteration 2 of Core En/ROM load fail:-5
[17962.291233] sof-audio sof-audio: error: status = 0xffffffff panic = 0xffffffff
[17962.291253] sof-audio sof-audio: error: load fw failed after 3 attempts with err: -5
[17962.291256] sof-audio sof-audio: error: failed to reset DSP
[17962.291259] sof-audio sof-audio: error: failed to boot DSP firmware after resume -5

After analysis and debugging, we root-caused it to several gaps:

CSE authentication can fail during stress tests because of a race between the audio driver requesting fw authentication and the CSE becoming ready: if the driver requests fw authentication before the CSE is ready, the boot may fail.
We should only retry the check for ROM init (CSE ready) being done; neither stream prepare nor FW copy needs the retry.
We need to clean up by freeing/putting the code loader stream once we have acquired it, whether or not cl_boot_firmware as a whole succeeds; otherwise a stream may 'leak' when the suspend/resume stress test fails.

This series aligns register polling with the cAVS driver, refines the DSP init retry, and adds cleanup to avoid the HDA stream 'leak' (a stream that is never put cannot be used again).
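A minimal userspace sketch of the boot flow described above, not the driver code: the stream is prepared once, only DSP init is retried, the firmware is copied once, and cleanup runs on every exit path. The `fake_hw` struct, the function names, and the stubbed return values are all hypothetical; the attempt count of 3 and the -EIO (-5) error are taken from the log above.

```c
#include <errno.h>

#define INIT_ATTEMPTS 3 /* matches "after 3 attempts" in the log above */

/* Fake hardware state standing in for the real DSP; every name here is hypothetical. */
struct fake_hw {
	int init_failures_left; /* how many times DSP init should fail first */
	int prepare_calls;
	int init_calls;
	int copy_calls;
	int cleanup_calls;
};

/* Returns a stream tag (>= 1) on success, a negative errno on failure. */
static int stream_prepare(struct fake_hw *hw)
{
	hw->prepare_calls++;
	return 1;
}

/* The step that races with CSE readiness, so it is the only one retried. */
static int dsp_init(struct fake_hw *hw, int tag)
{
	(void)tag;
	hw->init_calls++;
	if (hw->init_failures_left > 0) {
		hw->init_failures_left--;
		return -EIO;
	}
	return 0;
}

static int copy_fw(struct fake_hw *hw, int tag)
{
	(void)tag;
	hw->copy_calls++;
	return 0;
}

/* Puts the stream and frees buffers; must run on every exit path. */
static void stream_cleanup(struct fake_hw *hw, int tag)
{
	(void)tag;
	hw->cleanup_calls++;
}

static int boot_firmware(struct fake_hw *hw)
{
	int ret = -EIO;
	int tag;
	int i;

	tag = stream_prepare(hw); /* done once, outside the retry loop */
	if (tag < 0)
		return tag;

	for (i = 0; i < INIT_ATTEMPTS; i++) {
		ret = dsp_init(hw, tag);
		if (ret == 0)
			break;
	}
	if (ret < 0)
		goto cleanup;

	ret = copy_fw(hw, tag); /* done once, after init succeeded */

cleanup:
	stream_cleanup(hw, tag); /* runs whether boot succeeded or failed */
	return ret;
}
```

With `init_failures_left = 2` the sketch boots on the third init attempt; with more failures than attempts it returns -EIO, and in both cases the stream is cleaned up exactly once.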

With these changes, I have verified that our retry mechanism works on APL UP^2, using the hack in the commit linked below:
keyonjie/linux@7ceae80

plbossart

The return status and management of tag/ret values needs more work.

plbossart · 2019-01-07T21:27:52Z

This looks very odd. You could return 0 as an error but the calling layers in core.c expect errors to be <0.
Again mixing tags and return values doesn't seem to be very wise.

I don't want to change too much in this small series. Shall we keep cl_stream_prepare() as it is now and refine it later if needed?

Today it returns the allocated stream_tag (>= 1) on success and a negative value (< 0) on failure.

so zero is what then?

Zero is never returned, so would you feel better if I changed it like this:

--- if (tag <= 0) {
+++ if (tag < 0) {

@plbossart changed.
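For illustration only, one way to avoid mixing tags and return values is to report status through the return value and hand the tag back through an out-parameter. This is a hypothetical userspace sketch of that alternative, not what the PR does: `stream_prepare()`, `stream_put()`, the pool size, and the choice of -EBUSY and -EINVAL are all assumptions.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_STREAMS 4 /* hypothetical pool size */

static int stream_busy[MAX_STREAMS];

/*
 * Hypothetical variant: returns 0 on success and stores the tag (>= 1)
 * in *tag; returns a negative errno on failure and leaves *tag untouched.
 */
static int stream_prepare(int *tag)
{
	int i;

	if (!tag)
		return -EINVAL;

	for (i = 0; i < MAX_STREAMS; i++) {
		if (!stream_busy[i]) {
			stream_busy[i] = 1;
			*tag = i + 1; /* tags start at 1, as in the thread above */
			return 0;
		}
	}
	return -EBUSY; /* no free stream; error code chosen for the sketch */
}

/* Releases a tag so it can be handed out again. */
static void stream_put(int tag)
{
	if (tag >= 1 && tag <= MAX_STREAMS)
		stream_busy[tag - 1] = 0;
}
```

With this shape the caller only ever checks `ret < 0`, so the question of what zero means does not arise.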

keyonjie · 2019-01-08T02:26:40Z

@plbossart updated.

plbossart

the second patch is very hard to review, I had to go to the branch to figure it out myself...

plbossart · 2019-01-08T14:32:36Z

it'd be nicer if this was timeout_ms. we will get this feedback from Andy so let's anticipate :-)

OK, let me change it.

plbossart · 2019-01-08T14:34:04Z

Could we make those constants device-specific, e.g. with fields that are set in sdev? I don't see any reason why every hardware on the planet would need to use those constants?

This function is for register polling: it polls until it reads a specific value or times out, sleeping between consecutive polls to save CPU bandwidth. The target value shows up after a different amount of time for different devices and registers, and no document states the timing for specific devices, so this is a fairly generic algorithm that trades off responsiveness (to register updates) against bandwidth usage.

I think the logic here is:

For registers whose value changes to the target quickly, let's catch it ASAP; that is phase 1 (sleeping only 0.5~1.0 ms).

After enough failed tries (e.g. 10 here), we assume it needs more time, so we take it easy and slow down the cadence.
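The two-phase logic above can be sketched in userspace C as follows. This is an illustrative sketch, not the driver's implementation: the register read and the sleep are injected as callbacks so the example runs without hardware, and `poll_reg()`, `fake_reg`, the 5 ms phase 2 sleep, and the -ETIMEDOUT return code are assumptions. Only the 10 quick tries and the roughly 0.5 ms phase 1 sleep come from the comment above.

```c
#include <errno.h>
#include <stdint.h>

#define PHASE1_TRIES    10   /* quick polls before slowing down */
#define PHASE1_SLEEP_US 500  /* lower bound of the 0.5~1.0 ms range above */
#define PHASE2_SLEEP_US 5000 /* assumed value for the slower cadence */

typedef uint32_t (*read_reg_fn)(void *ctx);
typedef void (*sleep_us_fn)(void *ctx, unsigned int us);

/* Poll until (reg & mask) == target or the time budget is used up. */
static int poll_reg(read_reg_fn read_reg, sleep_us_fn sleep_us, void *ctx,
		    uint32_t mask, uint32_t target, unsigned int timeout_ms)
{
	unsigned long long budget_us = (unsigned long long)timeout_ms * 1000;
	unsigned long long waited_us = 0;
	unsigned int tries = 0;
	unsigned int step_us;

	for (;;) {
		if ((read_reg(ctx) & mask) == target)
			return 0;
		if (waited_us >= budget_us)
			return -ETIMEDOUT;

		/* Phase 1: short sleeps; phase 2: longer sleeps. */
		step_us = tries < PHASE1_TRIES ? PHASE1_SLEEP_US : PHASE2_SLEEP_US;
		sleep_us(ctx, step_us);
		waited_us += step_us;
		tries++;
	}
}

/* Fake register that reads 0 until it has been read a given number of times. */
struct fake_reg {
	unsigned int reads_until_ready;
	unsigned int reads;
	unsigned long long slept_us;
};

static uint32_t fake_read(void *ctx)
{
	struct fake_reg *r = ctx;

	return r->reads++ >= r->reads_until_ready ? 0x1 : 0x0;
}

/* Records the requested sleep instead of actually sleeping. */
static void fake_sleep(void *ctx, unsigned int us)
{
	struct fake_reg *r = ctx;

	r->slept_us += us;
}
```

A register that becomes ready after three failed reads costs three short sleeps here, while one that stays stuck uses up the fast tries and then falls back to the slower cadence until the budget runs out.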

We can define macros for those numbers (e.g. PHASE1_TRY=10, PHASE1_MIN/MAX, PHASE2_MIN/MAX, ...), but I don't think we can find any theoretical basis for which numbers are ideal (let alone for specific devices). If we won't change them frequently, I think using the numbers directly (not macros) is fine.

@plbossart what's your opinion?

I suggested constant fields that are provided in the SOC-dependent structures, not generic macros which indeed are not better than magic values.
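A rough sketch of what such per-SoC constant fields could look like, under stated assumptions: the `poll_params` and `chip_desc` structs, the `poll_sleep_us()` helper, the chip names, and all the numbers are hypothetical and do not come from the driver or from any hardware documentation.

```c
/* Hypothetical per-device polling parameters. */
struct poll_params {
	unsigned int fast_tries;    /* polls done at the fast cadence */
	unsigned int fast_sleep_us; /* sleep between fast polls */
	unsigned int slow_sleep_us; /* sleep between slow polls */
};

/* Hypothetical SoC descriptor carrying the parameters as constant data. */
struct chip_desc {
	const char *name;
	struct poll_params poll;
};

/* Illustrative values only; not measured on any real platform. */
static const struct chip_desc chip_a = {
	.name = "chip-a",
	.poll = { .fast_tries = 10, .fast_sleep_us = 500, .slow_sleep_us = 5000 },
};

static const struct chip_desc chip_b = {
	.name = "chip-b",
	.poll = { .fast_tries = 20, .fast_sleep_us = 250, .slow_sleep_us = 2000 },
};

/* Pick the sleep for a given attempt from the chip's own parameters. */
static unsigned int poll_sleep_us(const struct chip_desc *chip,
				  unsigned int attempt)
{
	const struct poll_params *p = &chip->poll;

	return attempt < p->fast_tries ? p->fast_sleep_us : p->slow_sleep_us;
}
```

The polling loop then stays generic and reads its cadence from the descriptor, so tuning one platform does not touch the others.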

plbossart · 2019-01-08T14:36:12Z

reverse xmas tree please, we will get that feedback when upstreaming

Sure, let me do it while I am already here.

plbossart · 2019-01-08T14:44:27Z

the error sequence doesn't seem to deal with more than 2 cores?

Ah, it doesn't. @ranj063 mentioned doing that, so I thought it was already done. Let me change it now.

Optimize DSP register polling to poll in phases, starting with a finer sleep value and moving on to larger sleep values (a tip from what we do in the sst driver). This helps reduce the extra waiting time (e.g. from 20ms to 1ms) after the hw status/registers are updated.

Signed-off-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>

… retries

There is a possibility of CSE auth failing during stress tests due to a race condition between the audio driver requesting fw authentication and the CSE becoming ready. Reduce the timeout to check for ROM init and retry this step a few times to improve the chances of the FW download being successful.

Only cl_dsp_init() needs retrying, since that is where the race condition described above occurs, so move cl_stream_prepare() and cl_copy_fw() out of the retry loop.

Also do the cleanup (put stream, free DMA buffer, ...) at the end of xx_cl_boot_firmware() whether it fails or succeeds; this fixes the stream 'leak' when boot_firmware() fails during the suspend/resume test.

Signed-off-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>

keyonjie · 2019-01-09T03:38:31Z

@plbossart addressed and updated.

plbossart

Ok, I will merge but log an improvement to see if we can have platform-dependent values instead of hard-coded 20/5/10 magic values

plbossart · 2019-01-09T16:06:43Z

@xiulipan can you please fix the compiler used by the Travis CI?

xiulipan · 2019-01-10T09:09:06Z

@plbossart
Sorry about the delay; I tried switching gcc through environment variables but that failed.
The default compiler is replaced in #520.

keyonjie requested review from lyakh, mengdonglin, plbossart and ranj063 January 7, 2019 05:40

keyonjie force-pushed the fw_boot_retry branch from 8155b43 to 2670e3a Compare January 7, 2019 05:46

plbossart requested changes Jan 7, 2019

keyonjie force-pushed the fw_boot_retry branch from 2670e3a to e01e821 Compare January 8, 2019 02:20

keyonjie force-pushed the fw_boot_retry branch from e01e821 to a01cdfc Compare January 8, 2019 06:35

plbossart reviewed Jan 8, 2019

ranj063 and others added 2 commits January 9, 2019 11:00

keyonjie force-pushed the fw_boot_retry branch from a01cdfc to 53d6049 Compare January 9, 2019 03:34

plbossart approved these changes Jan 9, 2019

plbossart merged commit e3e4d0f into thesofproject:topic/sof-dev Jan 9, 2019

ranj063 mentioned this pull request Jan 10, 2019

Fix fw boot retry - V3 #519

Merged

wenqingfu mentioned this pull request Jan 10, 2019

[Stress] Random FW loading failure thesofproject/sof#785

Closed
