Address ABA vulnerability in LocklessTaskQueue. #32
Conversation
I also wonder if it might be worth having the some of steering committee (particularly Audiokinetic/Epic/Unity) have a look. I would not be surprised if they have their own solutions to this. Might lead us to having the best solution pushed to the resonance repository. |
Thanks for bringing this up! +1 on having additional reviews (in addition to @jkammerl from Google), to keep everyone in the loop. I personally wasn't aware of any specific issues caused by LocklessTaskQueue, as long as there is one consumer thread for executing the tasks (i.e., the audio thread). Do you mind providing more context where/how you encountered this issue on FMOD side? Cheers |
Sure. So even with the change, one consumer thread is the maximum allowed. FMOD uses only one consumer thread. We encountered this issue with our users reporting crashes when using resonance audio with FMOD after an extended period of time. Here is one such user case - https://qa.fmod.com/t/fmod-unity-integration-crash-with-resonanceaudio/15710 - and we were provided with crash dumps with callstacks crashed inside LocklessTaskQueue. We were unable to reproduce it ourselves, so looked to solve the problem based on available information on Treiber stack c++ implementations (as this problem does not exist, for example, in Java implementations due to language guarantees). It's easiest to conceptualise how problems can arise looking at a three producer (P1 and P2&3) one consumer (C) scenario. P1 enters PopNodeFromList(&free_list_head_) via Post(). It gets list_head_ptr = list_head.load(); (P1H) and list_head_next_ptr = list_head_ptr->next.load(); (P1N). P1 gets interrupted. P2&3 and C perform a number of operations such that free_list_head_ == P1H and free_list_head_->next != P1N. P1 resumes, and the compare and swap in PopNodeFromList succeeds as *list_head == list_head_ptr (P1H == free_list_head_). But as P1N no longer == list_head_next_ptr (which could theoretically could be anywhere at this point), this swap should be invalid. Hopefully my explanation is clear enough - it takes me a little while to get my head around these types of problems. |
Oh, I probably should have asked before - is there a best way to reach the other steering committee members? |
Thanks for the PR! I'll take a close look on Monday and try to reproduce the issue. |
Good catch! I agree the described case could be an issue. I refactored your patch a bit and will send it out for an internal review on Monday. Once I get approval, I'll update this github repo accordingly. Thanks a lot for reaching out! |
Hi, I refactored your approach and just submitted another pull request (I'll add some feedback on your PR in a second which motivated me to rewrite some parts). |
|
||
// Union for swap operations | ||
union NodeAndTag { | ||
std::atomic<uint64_t> both_atomic; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks dangerous to me given that the size of the std::atomic is not guaranteed to be the exact size of its templated type, see also discussion here: https://stackoverflow.com/questions/18278611/can-two-stdatomics-be-part-of-one-union
This is a solution to address the ABA limitation of the LocklessTaskQueue implementation. Problem is outlined here https://domino.research.ibm.com/library/cyberdig.nsf/papers/58319A2ED2B1078985257003004617EF/$File/rj5118.pdf on page 19 in figure 10.