Bug in lib.ctable resizing? #1491
ping @alexandergall
I can confirm this issue also pops up in a production environment. I don't think the problem is in the What happens next is that a bogus record may look like an idle flow and is removed by a call to I was only able to reproduce the problem in a multi-process setup. I also have strong evidence that it is related to memory-mapped huge pages: the effect seems to disappear if
in |
The corruption occurs in chunks of 64 bytes. For example
I have identified this as a "completion queue entry" (CQE) as described in section 7.12.1.1 of the PRM. For example, the bytes It is not yet clear why this error is raised or why the CQE is posted to an address far outside the CQ. Also interesting is that this error is usually followed by more errors with error code 0x22, which is a generic "Abort error", e.g.
The It also seems that the corruption either occurs very early on, and then only once, or not at all.
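To illustrate the 64-byte granularity mentioned above, here is a minimal sketch of how one might locate corrupted chunks by comparing a known-good snapshot of a buffer against its current contents. This is not code from Snabb; the function name `corrupted_chunks` and the sample buffers are hypothetical, and only the 64-byte chunk size comes from the observation in this thread.

```python
CHUNK = 64  # chunk size taken from the observation above; matches the CQE size

def corrupted_chunks(expected, actual, chunk=CHUNK):
    """Return the offsets of all chunk-aligned blocks that differ."""
    if len(expected) != len(actual):
        raise ValueError("snapshots must have the same length")
    return [
        off
        for off in range(0, len(expected), chunk)
        if expected[off:off + chunk] != actual[off:off + chunk]
    ]

# Hypothetical data: a zeroed 256-byte buffer with one overwritten 64-byte chunk.
expected = bytes(256)
actual = bytearray(expected)
actual[128:192] = b"\xff" * 64
offsets = corrupted_chunks(expected, bytes(actual))
```

If every reported offset is a multiple of 64 and each damaged region is exactly one chunk long, that is consistent with whole CQEs being written into the buffer rather than with a stray software write.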
The source of the problem is that a non-clean shutdown of a Snabb process does not properly shut down the NIC. The NIC continues to receive packets and writes the CQEs to the physical memory pages that were assigned to it by the process that has exited. The same pages can be remapped by a new process, which leads to the corruption. The generic shutdown mechanism has a provision to unset the bus master even for a non-orderly shutdown of a worker process in
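For readers unfamiliar with what "unsetting the bus master" means: a PCI device may only initiate DMA while the Bus Master Enable bit (bit 2) of its 16-bit Command register (offset 0x04 in configuration space) is set. The following is a minimal sketch of clearing that bit, not Snabb's actual implementation; the function name `clear_bus_master` is hypothetical, and the demonstration runs on an in-memory stand-in rather than a real device.

```python
import io
import struct

PCI_COMMAND_OFFSET = 0x04      # 16-bit Command register in PCI config space
PCI_COMMAND_BUS_MASTER = 0x04  # bit 2: Bus Master Enable

def clear_bus_master(config):
    """Clear the Bus Master Enable bit; return (old, new) Command values.

    `config` is any seekable binary file-like object positioned over a
    PCI configuration space (assumption: little-endian layout, as on PCI).
    """
    config.seek(PCI_COMMAND_OFFSET)
    (old,) = struct.unpack("<H", config.read(2))
    new = old & ~PCI_COMMAND_BUS_MASTER & 0xFFFF
    config.seek(PCI_COMMAND_OFFSET)
    config.write(struct.pack("<H", new))
    return old, new

# Demonstration on a fake 64-byte config space with memory space (bit 1)
# and bus mastering (bit 2) enabled.
fake_config = io.BytesIO(bytearray(64))
fake_config.seek(PCI_COMMAND_OFFSET)
fake_config.write(struct.pack("<H", 0x0006))
old, new = clear_bus_master(fake_config)
```

On Linux the configuration space of a device is exposed as `/sys/bus/pci/devices/<address>/config`, so the same function could in principle be applied to that file opened in binary read/write mode with root privileges. Once the bit is cleared, the NIC can no longer write CQEs into pages that have since been handed to another process.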
Hit this non-deterministic failure while hacking. This issue is a note in case it pops up again.