New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX: problematic use of shared_ptr on droid #4039
Conversation
This makes me real nervous, we moved to a shared_ptr for access to player class for a reason, I bet that reason still exists and is waiting to bite us in the ass again. |
Yeah, I wanted to build a test and somehow had in mind I had to create a PR for it, while I don't. |
I'm thinking that ClosePlayer is perhaps where it goes wrong (although it shouldn't). Currently we get a copy of the internal shared_ptr (rc +1), then we reset the internal shared_ptr (rc -1), finally the copy goes out of scope (again rc -1).
|
I suggest a modification to:
We don't need to hold the player lock (nor do we want to) while calling CloseFile, and we also don't need to hold the player pointer across that function call as it'll do it internally. |
Throwing in a pair of cents: Especially ::ClosePlayer() seems to have some (locking) relation to the Python interface. With the locking ("CSingleLock lock(m_player_lock);") at the beginning of the method, there was a very high chance of a deadlock when playback got stopped and some Python addon was doing something player-related - #3200 + #3206 (fixup). |
Odds are the deadlock was actually caused by holding the lock over CloseFile(). |
Welcome, puzzle-lovers ;) Note that:
I've tried JM's ClosePlayer() patch but it doesn't solve this. |
eh, what is python doing in the background ? looks like you are playing something via smb. |
The python threads are services. trakt & skin.widgets IIRC |
do you suspect them in this issue ? otherwise, why log this ? |
I don't suspect the addons, but this only happens with multi-threaded python calls. |
I'm not sure you can infer that from the log, you can't see the bt. All you know is ~CDVDPlayer() occurred after that particular XBMCAddon::xbmc::Player::isPlayingVideo() I would setup a trap, check use_count in ~CDVDPlayer() on that shared pointer and when it hits zero, try and dump a backtrace to see who did the dirty deed. |
Well, it's in the same thread and we see Pretty convincing to me |
humm, then boost<shared_ptr> is failing. I'd still setup a trap |
if shared_ptr is failing then we have big, big problems. |
Has anyone seen strange crashes like this on other platforms? |
never seen it on my droid boxes and I abuse the hell out of them. |
It's not obvious to me whether the deletion is due to the use_count dropping to zero, but I'm guessing that's what it is as you haven't said otherwise :) I'd suggest you also need logging in CApplicationPlayer to see exactly which calls are being performed - it's not obvious to me what the process is from the massive amount of logging in the python stuff. I'd remove that logging and instead add logging to GetInternal() and Close/CreatePlayer, where you log the use_count of the shared_ptr. You could add logging to all the other functions in there if you want as well - it might help give context to the log. I can't really think of anything that your change to use raw pointers would achieve - essentially all you're doing is holding an unsafe pointer instead of a safe one by dereferencing the smart ptr earlier (smart_ptr held only for the duration of the get()) during the interaction with the player (i.e. that pointer could well be deleted out from under you while you still hold it). The issue with ClosePlayer is not a problem in current master I don't think - rather it was fixing an issue that would be introduced if @Voyager1's suggestion was taken up (deadlock). The only thing I can think of is some weird issue with a shared_ptr not being copied (use_count not incremented properly) which seems unlikely. |
Replacing all
with
also solves the issue. http://forum.xbmc.org/showthread.php?tid=183467 seems also related. |
I don't get how this is different from current code... You're copying the smart pointer through the assignment operator, while the old GetInternal() code just return the copy through the call stack... |
Neither do I, which points to a shared_ptr issue, as the title says ;) |
wait a minute... I see the difference. The reason it works now is because every CApplicationPlayer method now locks the player for the entire duration of the method execution. So you've essentially serialized access to the player. This potentially creates other deadlock issues. if you try this instead, you'll know whether it's a shared_ptr issue or a locking one.
|
The difference is that the current code only increments the refcount atomically while the decrement happens at will. This holds the lock throughout the duration. The shared_ptr isn't even needed in this case. |
Ah, @Voyager1 had the lightbulb at the same time. :) Yes, we do step back into deadlock territory now. Perhaps we should just hold our own refcount to m_pPlayer and manage it through an Acquire()/Release() interface? |
@t-nelson we're saying the same thing (although the new code also increases refcount.) |
Why would it deadlock? is ApplicationPlayer reentrant? |
I tend to agree with @t-nelson . I assume we need atomic ref-counting for some reason, but we don't need a shared_ptr for a pointer which is not actually shared, as it is contained in the class. |
@koying We lock while incrementing to be sure we get a valid player reference, which stays valid for as long as we use it. So there's no need to lock while decrementing. That's maybe where some of the problems arise - maybe boost is not thread safe and the decrement could be incorrect? (sorry if this is an ignorant assumption). |
operations on a shared_ptr are non-atomic |
Our problem doesn't occur on acquisition, it's on deletion. Presumably, since decrement isn't atomic, two threads zero the refcount at once and BANG! |
"Different shared_ptr instances can be "written to" (accessed using mutable operations such as operator= or reset) simultaneosly by multiple threads (even when these instances are copies, and share the same reference count underneath.)" |
@t-nelson I'm not sure how 2 threads can 0 a shared_ptr at the same time unless the ref count is messed up. The only problem I can think of is if one thread is 0'ing it an another is trying to fetch it at the same time. Then you can end up with the one trying to fetch it getting a deleted instance. |
Do we know if the player is getting deleted and then used? Is there a log in the destructor? |
You guys sure about that? I thought #define BOOST_SP_DISABLE_THREADS is how you have non-atomic refcounts. Otherwise, they should be atomic. Regarding the player getting deleted: ClosePlayer() and CreatePlayer() call reset() on m_pPlayer - each causing a decrement of the refcount of the (previously) held pointer. |
Another one: http://forum.xbmc.org/showthread.php?tid=183785 |
EDIT @Voyager1. While I can't say I'm sure what's happening, I understand what the docs say. It's not thread safe if you're writing the same instance of a shared_ptr from two different threads. If you want to verify if this is what's happening then simply avoid direct use of the shared_ptr and see if the problem goes away. If you want I can provide you with a commit that will do this but I can't verify it. Let me know. It would take me about 10 minutes. |
@jimfcarroll But we never write to the same instance. We always use copies of the shared_ptr to decrement, and the increment is locked. |
Nope. There's at least 3 places the same instance can be used from multiple threads. |
well ... actually, those places require the destructor to be the thing that's called from the "other" thread. In the current master they are protected with a lock. |
@jimfcarroll thanks for clarifying. I get it now. |
@koying, hey ... if what I suggested is what's happening, add a destructor and use the lock while resetting the m_pPlayer. If that fixes it then that was the problem. If not, the universe is inconsistent. :-) |
The only root cause I can see, considering that doing locked decrements solves the issue, is that the boost::shared_ptr we are using on android is not thread-safe when sharing the same counter, even if using different instances and contrary to the doc. |
So we're 100% sure the destructor on CApplicationPlayer isn't being called? If so, then I tend to agree with you. |
CApplicationPlayer is basically a singleton, created and destructed together with CApplication |
I don't want to keep spaming everyone ... can you jump on IRC? |
Surely adding a log line in GetInternal() that outputs the use_count() will assist in figuring out where things are going bang and verify where the problem is, rather than guessing ;) The destructor should not be being called, and every other write (or read) or the original shared_ptr is under a lock. Thus, either the locks are broken, or shared_ptr is broken or the copy mechanism of shared_ptr is being optimised away in some way, or something else we're not thinking of. I suspect the first three are unlikely. The locked decrements are fundamentally changing the code as they're also serialising access to the player (in addition to the shared_ptr) - that change is definitely wrong and could well cause deadlocks as some of those functions take quite some time to process (and potentially fire off a bunch of callbacks on other threads). I doubt replacing shared_ptr with a custom ref-counting scheme will do anything - all we're using shared_ptr here for is for the reference counting. |
Could there be a bug with getinternal getting optimized away? Did anybody try putting a shared ptr on stack inside getinternal before But I agree with Jonathan here, a debug line would help.
|
it's worth figuring out if there is some issue with shared_ptr, we use it all over the place. |
Gents, please provide me with the debug patches you want and I'll happily If it's a threading issue inside shared_ptr, I tend to think logging might |
From http://www.boost.org/doc/libs/1_55_0/libs/smart_ptr/shared_ptr.htm#ThreadSafety
Have we tried this? Maybe the boost atomics are busted under android. |
@t-nelson So far so good. Probably not an actual good news, but at least we seem to have a viable workaround. |
Could you check also what happens if you change the internal m_pPlayer declaration from: Only the member variable should need changing. I'm thinking some optimizations could break stuff. The shared_ptr is accessed from multiple thread, even thou it's properly protected. But then again.. It's arm here. I wouldn't be too surprised if the CAS implementations in boost are broken somewhere. |
tools/depends/target/boost/add-arm-mem-barrier.patch |
@davilla Are they saying that we need BOOST_SP_USE_SPINLOCK, i.e. that we are applying the patch but not actually using it? |
possible |
There was a shitload of nice info on lockfree stuff on http://preshing.com/ . Stuff like double checked lock pattern being broken. I'm pretty sure we use that in places.
Without memory barriers that is b0rk, and more likely to crash on arm than x86. Unrelated to issue, but still food for thought and education. Sorry for the spam, just found it interesting. |
Superseeded by #4062 |
See http://forum.xbmc.org/showthread.php?tid=177737
I bet http://forum.xbmc.org/showthread.php?tid=183367 is related.