Fw boot refine and DSP init retry fixes #507
plbossart merged 2 commits into thesofproject:topic/sof-dev
8155b43 to 2670e3a
plbossart
left a comment
The return status and the management of tag/ret values need more work.
This looks very odd. You could return 0 as an error, but the calling layers in core.c expect errors to be <0.
Again mixing tags and return values doesn't seem to be very wise.
I don't want to change too much in this small series. Shall we keep cl_stream_prepare() as it is now, and refine it later if needed?
Today, it returns the allocated stream_tag (>= 1) on success and a negative value (< 0) on failure.
so zero is what then?
Zero will never be returned, so would it be better if I change it like this:
--- if (tag <= 0) {
+++ if (tag < 0) {
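For reference, the convention under discussion (tag >= 1 on success, negative errno on failure, zero never returned) can be sketched in plain userspace C. This is only an illustration: `demo_stream_prepare()`, `demo_boot()` and `MAX_STREAMS` are hypothetical stand-ins, not the driver's real code.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

#define MAX_STREAMS 4	/* hypothetical number of host DMA streams */

static int stream_busy[MAX_STREAMS];

/*
 * Hypothetical stand-in for cl_stream_prepare(): returns the allocated
 * stream tag (>= 1) on success or a negative errno on failure.
 * Zero is never returned.
 */
static int demo_stream_prepare(void)
{
	int i;

	for (i = 0; i < MAX_STREAMS; i++) {
		if (!stream_busy[i]) {
			stream_busy[i] = 1;
			return i + 1;	/* tags start at 1 */
		}
	}

	return -ENODEV;	/* no free stream */
}

/* Caller: only a negative value is an error, so test with "< 0". */
static int demo_boot(void)
{
	int tag = demo_stream_prepare();

	if (tag < 0) {
		fprintf(stderr, "stream prepare failed: %d\n", tag);
		return tag;
	}

	/* By contract a successful tag is never zero. */
	assert(tag >= 1);

	/* ... use the tag here, then release the stream ... */
	stream_busy[tag - 1] = 0;

	return 0;
}
```

With this contract the caller never has to decide what a zero tag means, which is the ambiguity raised in the review.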
2670e3a to e01e821
@plbossart updated.
e01e821 to a01cdfc
plbossart
left a comment
the second patch is very hard to review, I had to go to the branch to figure it out myself...
it'd be nicer if this was timeout_ms. we will get this feedback from Andy so let's anticipate :-)
Could we make those constants device-specific, e.g. with fields that are set in sdev? I don't see any reason why every hardware on the planet would need to use those constants?
This function is for register polling: it polls until we read a specific value or time out, and sleeps between consecutive polls to save CPU bandwidth. The target value shows up after different delays for different devices/registers, and no document specifies this for particular devices, so this is a fairly generic algorithm that trades responsiveness (to register updates) against bandwidth usage.
I think the logic here is:
- for registers whose value changes to the target one quickly, let's get it ASAP; that is phase 1 (0.5~1.0 ms sleep only).
- after trying enough times (e.g. 10 here) without success, we assume it might need more time, so let's take it easy and slow down our cadence.
We can define macros for those numbers (e.g. PHASE1_TRY=10, PHASE1_MIN/MAX, PHASE2_MIN/MAX, ...), but I don't think we can find any theoretical basis for which numbers are ideal (let alone for specific devices). If we won't change them frequently, I think using direct numbers (not macros) is fine.
@plbossart what's your opinion?
I suggested constant fields that are provided in the SOC-dependent structures, not generic macros which indeed are not better than magic values.
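As a rough illustration of that suggestion, the polling parameters could live in a per-SoC descriptor instead of being hard-coded. Everything below is hypothetical: `struct demo_poll_cfg`, `struct demo_chip_info`, the two chip entries and all numeric values are placeholders for illustration, not the real SOF structures or tuned numbers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-SoC polling parameters (placeholder values only). */
struct demo_poll_cfg {
	unsigned int fast_tries;	/* number of polls in the fast phase */
	unsigned int fast_sleep_us;	/* sleep between polls in the fast phase */
	unsigned int slow_sleep_us;	/* sleep between polls afterwards */
};

/* Hypothetical SoC-dependent descriptor carrying the polling config. */
struct demo_chip_info {
	const char *name;
	struct demo_poll_cfg poll;
};

static const struct demo_chip_info demo_chip_a = {
	.name = "chip-a",
	.poll = { .fast_tries = 10, .fast_sleep_us = 500, .slow_sleep_us = 5000 },
};

static const struct demo_chip_info demo_chip_b = {
	.name = "chip-b",
	.poll = { .fast_tries = 4, .fast_sleep_us = 1000, .slow_sleep_us = 20000 },
};

/* Pick the sleep for a given attempt from the chip's own parameters. */
static unsigned int demo_sleep_us(const struct demo_chip_info *chip,
				  unsigned int attempt)
{
	assert(chip != NULL);

	return attempt < chip->poll.fast_tries ?
		chip->poll.fast_sleep_us : chip->poll.slow_sleep_us;
}
```

The generic polling helper would then read these fields through the device's descriptor, so a platform that needs a different cadence only changes its own table entry.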
reverse xmas tree please, we will get that feedback when upstreaming
Sure, let me do it as is already done here.
the error sequence doesn't seem to deal with more than 2 cores?
Ah, it doesn't. @ranj063 mentioned doing that, so I thought it was already done; let me change it now.
Optimize DSP register polling to poll in phases, starting with a finer sleep value and moving on to larger sleep values (following what we do in the sst driver). This helps reduce the extra waiting time (e.g. from 20ms to 1ms) after hw status/registers are updated.

Signed-off-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
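A minimal userspace sketch of phased polling as described in this commit message, assuming a simulated register and a simulated clock so it runs deterministically. The `demo_*` names and the 10 / 500 us / 5000 us values are placeholders, not the merged code; in the kernel the sleep would be a real one such as `usleep_range()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Placeholder tuning values for illustration only. */
#define DEMO_FAST_TRIES		10
#define DEMO_FAST_SLEEP_US	500
#define DEMO_SLOW_SLEEP_US	5000

/* Simulated device: the register reads as `value` once `ready_at_us` passes. */
struct demo_dev {
	uint64_t now_us;	/* simulated clock */
	uint64_t ready_at_us;	/* time at which the register updates */
	uint32_t value;		/* register value after the update */
};

static uint32_t demo_read_reg(const struct demo_dev *dev)
{
	return dev->now_us >= dev->ready_at_us ? dev->value : 0;
}

/* Stand-in for a real sleep: only advances the simulated clock. */
static void demo_sleep(struct demo_dev *dev, unsigned int us)
{
	dev->now_us += us;
}

/*
 * Poll until (reg & mask) == target or timeout_us elapses.
 * Returns 0 on success, -ETIMEDOUT otherwise.
 */
static int demo_poll(struct demo_dev *dev, uint32_t mask, uint32_t target,
		     unsigned int timeout_us)
{
	uint64_t deadline = dev->now_us + timeout_us;
	unsigned int tries = 0;

	assert(timeout_us > 0);

	for (;;) {
		if ((demo_read_reg(dev) & mask) == target)
			return 0;

		if (dev->now_us >= deadline)
			return -ETIMEDOUT;

		/* Phase 1: short sleeps for fast registers; phase 2: back off. */
		demo_sleep(dev, tries < DEMO_FAST_TRIES ?
			   DEMO_FAST_SLEEP_US : DEMO_SLOW_SLEEP_US);
		tries++;
	}
}
```

In this model a register that updates after 1 ms is seen at the 1 ms mark, while a register that never updates costs only a handful of extra polls before the timeout.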
… retries

There is a possibility of CSE auth failing during stress tests due to a race condition between when the audio driver requests fw authentication and when the CSE is ready. Reduce the timeout to check for ROM init and retry this step a few times to improve the chances of the FW download being successful.

For the retry, we only need to retry cl_dsp_init(), which is where the race condition described above exists, so move cl_stream_prepare() and cl_copy_fw() out of the retry loop.

Also do the cleanup (put stream, free DMA buffer, ...) when xx_cl_boot_firmware() is done, whether it fails or succeeds. This fixes the stream 'leak' on boot_firmware() failure during the suspend/resume test.

Signed-off-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
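The control flow described in this commit message can be sketched as follows. This is a simplified userspace model under stated assumptions: the `fw_demo_*` functions, the state struct and the retry count of 3 are hypothetical stand-ins for cl_stream_prepare(), cl_dsp_init(), cl_copy_fw() and the cleanup step, not the driver code itself.

```c
#include <assert.h>
#include <errno.h>

#define FW_DEMO_INIT_TRIES 3	/* hypothetical retry count */

/* Simulated driver state used only by this sketch. */
struct fw_demo_state {
	int init_failures_left;	/* DSP init fails this many times first */
	int init_calls;		/* how often DSP init was attempted */
	int stream_held;	/* 1 while the stream is allocated */
};

static int fw_demo_stream_prepare(struct fw_demo_state *s)
{
	s->stream_held = 1;
	return 1;	/* stream tag, >= 1 on success */
}

static void fw_demo_cleanup(struct fw_demo_state *s)
{
	s->stream_held = 0;	/* put stream, free DMA buffer, ... */
}

static int fw_demo_dsp_init(struct fw_demo_state *s)
{
	s->init_calls++;
	if (s->init_failures_left > 0) {
		s->init_failures_left--;
		return -EIO;	/* models the CSE auth race */
	}
	return 0;
}

static int fw_demo_copy_fw(struct fw_demo_state *s)
{
	(void)s;
	return 0;
}

static int fw_demo_boot_firmware(struct fw_demo_state *s)
{
	int ret = -EIO;
	int tag, i;

	/* Prepared once, outside the retry loop. */
	tag = fw_demo_stream_prepare(s);
	if (tag < 0)
		return tag;

	/* Only the DSP init step is retried. */
	for (i = 0; i < FW_DEMO_INIT_TRIES; i++) {
		ret = fw_demo_dsp_init(s);
		if (!ret)
			break;
	}

	if (!ret)
		ret = fw_demo_copy_fw(s);

	/* Cleanup on success and on failure, so the stream never leaks. */
	fw_demo_cleanup(s);
	assert(!s->stream_held);

	return ret;
}
```

Keeping the cleanup on a single exit path is what prevents the stream from staying allocated when every retry fails.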
a01cdfc to 53d6049
@plbossart addressed and updated.
plbossart
left a comment
Ok, I will merge but log an improvement to see if we can have platform-dependent values instead of hard-coded 20/5/10 magic values
@xiulipan can you please fix the compiler used by the Travis CI?
@plbossart
We hit DSP boot failures in the GLK suspend/resume stress test (>1000 iterations); the dmesg error log looks like this:
After analysis and debugging, we root-caused it to several gaps here:
This series aligns polling with the cAVS driver, refines the DSP init retry, and does cleanup to avoid an HDA stream 'leak' (a stream that is not put can't be used anymore).
With these changes, I have verified (using the hack below) that our retry mechanism works on APL UP^2.
I used the hack shown in the link below to verify that the retry works:
keyonjie/linux@7ceae80