Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

byt-ipc: refine the IPC DONE mask #2119

Closed
wants to merge 2 commits into from

Conversation

keyonjie
Copy link
Contributor

Refine the IPCD DONE mask sequence:

  1. After initialized, mask the IPCD DONE.
  2. Enable/unmask IPCD DONE at new message sending to host.
  3. Disable/mask IPCD DONE once the FW initiated IPC handling is done.

Signed-off-by: Keyon Jie yang.jie@linux.intel.com

Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why, please explain what you are fixing.

@keyonjie
Copy link
Contributor Author

Why, please explain what you are fixing.

This is aim to harden the robustness to avoid IPC IRQ storm, we only expect to receive ACK/reply IPC from host after we have initiated an IPC to host.

@lgirdwood imagine that the host can send ACK IPC(or even storm ACKIPCs) by intentional or accident, we hit similar issue on host side and are doing similar fix on Linux side.

thesofproject/linux#1491

@lgirdwood
Copy link
Member

Why, please explain what you are fixing.

This is aim to harden the robustness to avoid IPC IRQ storm, we only expect to receive ACK/reply IPC from host after we have initiated an IPC to host.

@lgirdwood imagine that the host can send ACK IPC(or even storm ACKIPCs) by intentional or accident, we hit similar issue on host side and are doing similar fix on Linux side.

thesofproject/linux#1491

Ok, please include all this data in your commit message. Do we need to do the same for BDW given the similarities in IPC IP.

@keyonjie
Copy link
Contributor Author

Why, please explain what you are fixing.

This is aim to harden the robustness to avoid IPC IRQ storm, we only expect to receive ACK/reply IPC from host after we have initiated an IPC to host.
@lgirdwood imagine that the host can send ACK IPC(or even storm ACKIPCs) by intentional or accident, we hit similar issue on host side and are doing similar fix on Linux side.
thesofproject/linux#1491

Ok, please include all this data in your commit message. Do we need to do the same for BDW given the similarities in IPC IP.

Sure, let me update the commit message, and yes, we need the similar fixes for BDW also, let me add it together.

Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tlauda @mmaka1 do we need to do similar on CAVS ?

@RanderWang
Copy link
Collaborator

@keyonjie "To fix that, reefine the IPCD DONE mask sequence as below:" reefine typo ?

Copy link
Member

@plbossart plbossart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @keyonjie you keep sending IPC related patches without clear explanations.

This is critical code, we should not change this without extreme caution.

I am not aware of any rain storm or issues with Baytrail, so let's please follow rational engineering steps

@keyonjie
Copy link
Contributor Author

keyonjie commented Nov 22, 2019

Sorry @keyonjie you keep sending IPC related patches without clear explanations.

This is critical code, we should not change this without extreme caution.

I am not aware of any rain storm or issues with Baytrail, so let's please follow rational engineering steps

@plbossart I understand and do follow the extreme caution rule here, we have logic explanation in the commit message, experiment explanation in the filed issue thesofproject/linux#1492, we have also run stress test on BYT/CHT/BSW.

For the interrupt storm issue, if you run module unload/reload in BSW, you will see it randomly, if the last ack IPC is not handled in the driver before the module is unloaded, the HW will keep generating this interrupt and hang the whole system.

Vice versa, in the FW side, my PR here implement the same logic to avoid this error.

@keyonjie
Copy link
Contributor Author

@tlauda @mmaka1 do we need to do similar on CAVS ?

I think it's better to add similar to cAVS also.

@keyonjie
Copy link
Contributor Author

@keyonjie "To fix that, reefine the IPCD DONE mask sequence as below:" reefine typo ?

good catch, fixed now.

This is aim to harden the robustness to avoid IPC IRQ storm, we only
expect to receive ACK/reply IPC from host after we have initiated an IPC
to host.

Imagine that the host can keep sending ACK IPC(or even storm ACK IPCs)
intentiionally or by accident, with today's code the FW will keep
receiving this ACK IPC interrupts and looks hang, we hit this similar
issue on host side and are doing similar fix on Linux side.

To fix that, refine the IPCD DONE mask sequence as below:
1. After initialized, mask the IPCD DONE.
2. Enable/unmask IPCD DONE at new message sending to host.
3. Disable/mask IPCD DONE once the FW initiated IPC handling is done.

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
This is aim to harden the robustness to avoid IPC IRQ storm, we only
expect to receive ACK/reply IPC from host after we have initiated an IPC
to host.

Imagine that the host can keep sending ACK IPC(or even storm ACK IPCs)
intentiionally or by accident, with today's code the FW will keep
receiving this ACK IPC interrupts and looks hang, we hit this similar
issue on host side and are doing similar fix on Linux side.

To fix that, refine the IPCD DONE mask sequence as below:
1. After initialized, mask the IPCD DONE.
2. Enable/unmask IPCD DONE at new message sending to host.
3. Disable/mask IPCD DONE once the FW initiated IPC handling is done.

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
@plbossart
Copy link
Member

Sorry @keyonjie you keep sending IPC related patches without clear explanations.
This is critical code, we should not change this without extreme caution.
I am not aware of any rain storm or issues with Baytrail, so let's please follow rational engineering steps

@plbossart I understand and do follow the extreme caution rule here, we have logic explanation in the commit message, experiment explanation in the filed issue thesofproject/linux#1492, we have also run stress test on BYT/CHT/BSW.

For the interrupt storm issue, if you run module unload/reload in BSW, you will see it randomly, if the last ack IPC is not handled in the driver before the module is unloaded, the HW will keep generating this interrupt and hang the whole system.

Vice versa, in the FW side, my PR here implement the same logic to avoid this error.

I stopped here:
"Imagine that the host can keep sending ACK IPC(or even storm ACK IPCs)
intentiionally or by accident,"

you didn't describe how this can happen in the first place...

@ranj063
Copy link
Collaborator

ranj063 commented Nov 22, 2019

@keyonjie I am not also not very convinced about this change too. What is the driver "ACK"ing when the module is removed?

@keyonjie
Copy link
Contributor Author

keyonjie commented Nov 22, 2019

@keyonjie I am not also not very convinced about this change too. What is the driver "ACK"ing when the module is removed?

@ranj063 not the driver "ACK"ing in the module removing, the issue happens with module removing is on host side, where is the FW "ACK"ing to the host side. Here is only a symmetric fix in FW side.

@lgirdwood
Copy link
Member

@keyonjie so this is curent real world problem when we remove modules. Can you detail the real problem in commit messages (adding logs shwoing the issue would be good).

@ranj063
Copy link
Collaborator

ranj063 commented Nov 23, 2019

@lgirdwood @keyonjie I think the problem is that on BYT/CHT the interrupts arent getting disabled properly when the module is removed, which is what is probably leading to the problem that keyon is talking about. This PR will just hide the real problem IMO

@keyonjie
Copy link
Contributor Author

keyonjie commented Nov 23, 2019

@lgirdwood @ranj063 yes, we still need to disable the interrupt properly, this PR won't cover all the cases, what here it changes only is to enable the specific DONE interrupt when necessary, I think at least there is no harm to do so.

@keyonjie
Copy link
Contributor Author

keyonjie commented Nov 23, 2019

@ranj063 @plbossart with this change, we can eliminate those "no reply expected" ACK IPCs also. If I am not asking for ACK, all your ACKs(or attacks) will not be passed to me.

@lgirdwood
Copy link
Member

@lgirdwood @keyonjie I think the problem is that on BYT/CHT the interrupts arent getting disabled properly when the module is removed, which is what is probably leading to the problem that keyon is talking about. This PR will just hide the real problem IMO

@keyonjie can you fix this in the driver. It seems your are trying to fix a driver issue in the FW here.

@keyonjie
Copy link
Contributor Author

@lgirdwood @keyonjie I think the problem is that on BYT/CHT the interrupts arent getting disabled properly when the module is removed, which is what is probably leading to the problem that keyon is talking about. This PR will just hide the real problem IMO

@keyonjie can you fix this in the driver. It seems your are trying to fix a driver issue in the FW here.

@lgirdwood no actually, the symmetric PR in the driver side is here: thesofproject/linux#1491

let me know if you have any specific concerns about the FW PR here.

@plbossart
Copy link
Member

@lgirdwood @keyonjie I think the problem is that on BYT/CHT the interrupts arent getting disabled properly when the module is removed, which is what is probably leading to the problem that keyon is talking about. This PR will just hide the real problem IMO

There is no problem on all UEFI machines, the issue with module load/unload only happens on the Cyan model, and we have only tested with the legacy BIOS.
To me that's not enough evidence that the interrupt handling is the root cause.

@lgirdwood
Copy link
Member

@keyonjie any update, can this be closed ?

@keyonjie
Copy link
Contributor Author

keyonjie commented Jan 7, 2020

@lgirdwood yes, let me close it.

@keyonjie keyonjie closed this Jan 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants