FIX: problematic use of shared_ptr on droid #4039

koying · 2014-01-17T15:46:35Z

See http://forum.xbmc.org/showthread.php?tid=177737

I bet http://forum.xbmc.org/showthread.php?tid=183367 is related.

davilla · 2014-01-17T16:40:04Z

This makes me real nervous, we moved to a shared_ptr for access to player class for a reason, I bet that reason still exists and is waiting to bite us in the ass again.

koying · 2014-01-17T17:14:28Z

Yeah, I wanted to build a test and somehow had in mind I had to create a PR for it, while I don't.
Anyway, fact is this workarounds the issue, I hope someone could help finding out the root cause.

Voyager1 · 2014-01-17T20:04:56Z

I'm thinking that ClosePlayer is perhaps where it goes wrong (although it shouldn't). Currently we get a copy of the internal shared_ptr (rc +1), then we reset the internal shared_ptr (rc -1), finally the copy goes out of scope (again rc -1).
One thing to try is to modify it as follows, should be just as nice, but without any double refcount reduction...

void CApplicationPlayer::ClosePlayer()
{
  CSingleLock lock(m_player_lock);
  if (m_pPlayer != NULL)
  {
    CloseFile();
    // we need to do this directly on the member
    m_pPlayer.reset();
  }
}

jmarshallnz · 2014-01-17T20:37:03Z

I suggest a modification to:

void CApplicationPlayer::ClosePlayer()
{
  CloseFile();
  { // release the player
    CSingleLock lock(m_player_lock);
    m_pPlayer.reset();
  }
}

We don't need to hold the player lock (nor do we want to) while calling CloseFile, and we also don't need to hold the player pointer across that function call as it'll do it internally.

herrnst · 2014-01-17T22:47:28Z

Throwing in a pair of cents: Especially ::ClosePlayer() seems to have some (locking) relation to the Python interface. With the locking ("CSingleLock lock(m_player_lock);") at the beginning of the method, there was a very high chance of a deadlock when playback got stopped and some Python addon was doing something player-related - #3200 + #3206 (fixup).

t-nelson · 2014-01-17T23:10:58Z

Odds are the deadlock was actually caused by holding the lock over CloseFile().

koying · 2014-01-18T11:25:03Z

Welcome, puzzle-lovers ;)
See https://app.box.com/s/5r4afo9og1bjx5ye9zy3 for a log of the issue.
Hell breaks loose around line 669284, where the CDVDPlayer gets deleted from apparently nowhere.

Note that:

It's crashing, not deadlocking. I think the crash is just a secondary symptom of the deletion, though.
It's 5 miin into the video. That might lead to a random thread-safety issue.
The first delete seems to originate from the main thread. Could be simultaneous with the second one, though.

I've tried JM's ClosePlayer() patch but it doesn't solve this.

davilla · 2014-01-18T16:23:32Z

eh, what is python doing in the background ? looks like you are playing something via smb.

koying · 2014-01-18T16:25:33Z

The python threads are services. trakt & skin.widgets IIRC

davilla · 2014-01-18T16:27:46Z

do you suspect them in this issue ? otherwise, why log this ?

koying · 2014-01-18T16:31:15Z

I don't suspect the addons, but this only happens with multi-threaded python calls.
In line 669368, see that ~CDVDPlayer() is called inside XBMCAddon::xbmc::Player::isPlayingVideo()

davilla · 2014-01-18T16:53:13Z

I'm not sure you can infer that from the log, you can't see the bt. All you know is ~CDVDPlayer() occurred after that particular XBMCAddon::xbmc::Player::isPlayingVideo()

I would setup a trap, check use_count in ~CDVDPlayer() on that shared pointer and when it hits zero, try and dump a backtrace to see who did the dirty deed.

koying · 2014-01-18T16:59:45Z

Well, it's in the same thread and we see
NEWADDON Entering bool XBMCAddon::xbmc::Player::isPlayingVideo()
but no
NEWADDON Leaving bool XBMCAddon::xbmc::Player::isPlayingVideo()

Pretty convincing to me

davilla · 2014-01-18T17:02:41Z

humm, then boost<shared_ptr> is failing. I'd still setup a trap

davilla · 2014-01-18T17:03:09Z

if shared_ptr is failing then we have big, big problems.

koying · 2014-01-18T17:09:56Z

Has anyone seen strange crashes like this on other platforms?
That'd tell us whether it's a boost issue or a logic issue on our side...

davilla · 2014-01-18T17:31:11Z

never seen it on my droid boxes and I abuse the hell out of them.

jmarshallnz · 2014-01-18T22:08:49Z

It's not obvious to me whether the deletion is due to the use_count dropping to zero, but I'm guessing that's what it is as you haven't said otherwise :)

I'd suggest you also need logging in CApplicationPlayer to see exactly which calls are being performed - it's not obvious to me what the process is from the massive amount of logging in the python stuff. I'd remove that logging and instead add logging to GetInternal() and Close/CreatePlayer, where you log the use_count of the shared_ptr. You could add logging to all the other functions in there if you want as well - it might help give context to the log.

I can't really think of anything that your change to use raw pointers would achieve - essentially all you're doing is holding an unsafe pointer instead of a safe one by dereferencing the smart ptr earlier (smart_ptr held only for the duration of the get()) during the interaction with the player (i.e. that pointer could well be deleted out from under you while you still hold it).

The issue with ClosePlayer is not a problem in current master I don't think - rather it was fixing an issue that would be introduced if @Voyager1's suggestion was taken up (deadlock).

The only thing I can think of is some weird issue with a shared_ptr not being copied (use_count not incremented properly) which seems unlikely.

koying · 2014-01-19T17:13:05Z

Replacing all

boost::shared_ptr<IPlayer> player = GetInternal();

with

  CSingleLock lock(m_player_lock);
  boost::shared_ptr<IPlayer> player = m_pPlayer;

also solves the issue.
Would that be acceptable?

http://forum.xbmc.org/showthread.php?tid=183467 seems also related.

Voyager1 · 2014-01-19T18:08:20Z

I don't get how this is different from current code... You're copying the smart pointer through the assignment operator, while the old GetInternal() code just return the copy through the call stack...

koying · 2014-01-19T18:12:37Z

Neither do I, which points to a shared_ptr issue, as the title says ;)

Voyager1 · 2014-01-19T18:30:37Z

wait a minute... I see the difference. The reason it works now is because every CApplicationPlayer method now locks the player for the entire duration of the method execution. So you've essentially serialized access to the player. This potentially creates other deadlock issues.

if you try this instead, you'll know whether it's a shared_ptr issue or a locking one.

boost::shared_ptr<IPlayer> player;
{
  CSingleLock lock(m_player_lock);
  player = m_pPlayer;
}

t-nelson · 2014-01-19T18:31:22Z

The difference is that the current code only increments the refcount atomically while the decrement happens at will. This holds the lock throughout the duration. The shared_ptr isn't even needed in this case.

t-nelson · 2014-01-19T18:33:58Z

Ah, @Voyager1 had the lightbulb at the same time. :)

Yes, we do step back into deadlock territory now. Perhaps we should just hold our own refcount to m_pPlayer and manage it through an Acquire()/Release() interface?

Voyager1 · 2014-01-19T18:36:32Z

@t-nelson we're saying the same thing (although the new code also increases refcount.)

koying · 2014-01-19T18:45:29Z

Why would it deadlock? is ApplicationPlayer reentrant?
Why do we currently lock while incrementing and not while decrementing?
BTW, are we sure we do? Is it guaranteed somehow that the returned copy is done before the dtor of the lock?

koying · 2014-01-19T18:55:53Z

I tend to agree with @t-nelson . I assume we need atomic ref-counting for some reason, but we don't need a shared_ptr for a pointer which is not actually shared, as it is contained in the class.

Voyager1 · 2014-01-19T18:58:30Z

@koying We lock while incrementing to be sure we get a valid player reference, which stays valid for as long as we use it. So there's no need to lock while decrementing. That's maybe where some of the problems arise - maybe boost is not thread safe and the decrement could be incorrect? (sorry if this is an ignorant assumption).
For your last question : the copy constructor is called before any destruction of locals (which would be the "unlock").
I don't know about AP being reentrant, but I just highlighted the risk.

t-nelson · 2014-01-19T18:59:46Z

operations on a shared_ptr are non-atomic

t-nelson · 2014-01-19T19:06:19Z

Our problem doesn't occur on acquisition, it's on deletion. Presumably, since decrement isn't atomic, two threads zero the refcount at once and BANG!

koying · 2014-01-19T19:09:04Z

"Different shared_ptr instances can be "written to" (accessed using mutable operations such as operator= or reset) simultaneosly by multiple threads (even when these instances are copies, and share the same reference count underneath.)"
As we only decrement m_pPlayer copies, that should be fine, although it seems it is not.

jimfcarroll · 2014-01-19T19:15:46Z

@t-nelson I'm not sure how 2 threads can 0 a shared_ptr at the same time unless the ref count is messed up. The only problem I can think of is if one thread is 0'ing it an another is trying to fetch it at the same time. Then you can end up with the one trying to fetch it getting a deleted instance.

jimfcarroll · 2014-01-19T19:16:39Z

Do we know if the player is getting deleted and then used? Is there a log in the destructor?

Voyager1 · 2014-01-19T19:23:45Z

You guys sure about that? I thought #define BOOST_SP_DISABLE_THREADS is how you have non-atomic refcounts. Otherwise, they should be atomic.

Regarding the player getting deleted: ClosePlayer() and CreatePlayer() call reset() on m_pPlayer - each causing a decrement of the refcount of the (previously) held pointer.

koying · 2014-01-19T19:31:11Z

Another one: http://forum.xbmc.org/showthread.php?tid=183785
Just so that everyone is convinced there is a problem ;)

jimfcarroll · 2014-01-19T19:38:29Z

EDIT @Voyager1. While I can't say I'm sure what's happening, I understand what the docs say. It's not thread safe if you're writing the same instance of a shared_ptr from two different threads. If you want to verify if this is what's happening then simply avoid direct use of the shared_ptr and see if the problem goes away. If you want I can provide you with a commit that will do this but I can't verify it.

Let me know. It would take me about 10 minutes.

koying · 2014-01-19T19:41:15Z

@jimfcarroll But we never write to the same instance. We always use copies of the shared_ptr to decrement, and the increment is locked.

jimfcarroll · 2014-01-19T19:44:15Z

Nope. There's at least 3 places the same instance can be used from multiple threads.

jimfcarroll · 2014-01-19T19:46:42Z

well ... actually, those places require the destructor to be the thing that's called from the "other" thread. In the current master they are protected with a lock.

Voyager1 · 2014-01-19T19:48:31Z

@jimfcarroll thanks for clarifying. I get it now.

jimfcarroll · 2014-01-19T19:52:22Z

@koying, hey ... if what I suggested is what's happening, add a destructor and use the lock while resetting the m_pPlayer. If that fixes it then that was the problem. If not, the universe is inconsistent. :-)

koying · 2014-01-19T19:59:10Z

The only root cause I can see, considering that doing locked decrements solves the issue, is that the boost::shared_ptr we are using on android is not thread-safe when sharing the same counter, even if using different instances and contrary to the doc.

jimfcarroll · 2014-01-19T20:02:23Z

So we're 100% sure the destructor on CApplicationPlayer isn't being called? If so, then I tend to agree with you.

koying · 2014-01-19T20:05:44Z

CApplicationPlayer is basically a singleton, created and destructed together with CApplication

jimfcarroll · 2014-01-19T20:11:27Z

I don't want to keep spaming everyone ... can you jump on IRC?

jmarshallnz · 2014-01-19T21:20:42Z

Surely adding a log line in GetInternal() that outputs the use_count() will assist in figuring out where things are going bang and verify where the problem is, rather than guessing ;)

The destructor should not be being called, and every other write (or read) or the original shared_ptr is under a lock. Thus, either the locks are broken, or shared_ptr is broken or the copy mechanism of shared_ptr is being optimised away in some way, or something else we're not thinking of. I suspect the first three are unlikely.

The locked decrements are fundamentally changing the code as they're also serialising access to the player (in addition to the shared_ptr) - that change is definitely wrong and could well cause deadlocks as some of those functions take quite some time to process (and potentially fire off a bunch of callbacks on other threads).

I doubt replacing shared_ptr with a custom ref-counting scheme will do anything - all we're using shared_ptr here for is for the reference counting.

elupus · 2014-01-19T22:02:57Z

Could there be a bug with getinternal getting optimized away?

Did anybody try putting a shared ptr on stack inside getinternal before
returning it?

But I agree with Jonathan here, a debug line would help.
On Jan 19, 2014 10:20 PM, "jmarshallnz" notifications@github.com wrote:

Surely adding a log line in GetInternal() that outputs the use_count()
will assist in figuring out where things are going bang and verify where
the problem is, rather than guessing ;)

The destructor should not be being called, and every other write (or read)
or the original shared_ptr is under a lock. Thus, either the locks are
broken, or shared_ptr is broken or the copy mechanism of shared_ptr is
being optimised away in some way, or something else we're not thinking of.
I suspect the first three are unlikely.

The locked decrements are fundamentally changing the code as they're also
serialising access to the player (in addition to the shared_ptr) - that
change is definitely wrong and could well cause deadlocks as some of those
functions take quite some time to process (and potentially fire off a bunch
of callbacks on other threads).

I doubt replacing shared_ptr with a custom ref-counting scheme will do
anything - all we're using shared_ptr here for is for the reference
counting.

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/4039#issuecomment-32720255
.

davilla · 2014-01-20T00:03:26Z

it's worth figuring out if there is some issue with shared_ptr, we use it all over the place.

koying · 2014-01-20T00:48:25Z

Gents, please provide me with the debug patches you want and I'll happily
run them.

If it's a threading issue inside shared_ptr, I tend to think logging might
prove inconclusive though

t-nelson · 2014-01-20T01:17:31Z

From http://www.boost.org/doc/libs/1_55_0/libs/smart_ptr/shared_ptr.htm#ThreadSafety

You can define the macro BOOST_SP_USE_PTHREADS to turn off the lock-free platform-specific
implementation and fall back to the generic pthread_mutex_t-based code.

Have we tried this? Maybe the boost atomics are busted under android.

koying · 2014-01-20T18:29:17Z

@t-nelson So far so good. Probably not an actual good news, but at least we seem to have a viable workaround.
I'll update the PR tomorrow if the fix is confirmed after the "night session" ;)

elupus · 2014-01-20T18:41:18Z

Could you check also what happens if you change the internal m_pPlayer declaration from:
"shared_ptr" to "volatile shared_ptr".

Only the member variable should need changing. I'm thinking some optimizations could break stuff. The shared_ptr is accessed from multiple thread, even thou it's properly protected.

But then again.. It's arm here. I wouldn't be too surprised if the CAS implementations in boost are broken somewhere.

davilla · 2014-01-20T19:05:22Z

http://stackoverflow.com/questions/14692802/boost-threads-in-ios-thread-info-object-is-being-destructed-before-the-thread

tools/depends/target/boost/add-arm-mem-barrier.patch

koying · 2014-01-20T19:12:34Z

@davilla Are they saying that we need BOOST_SP_USE_SPINLOCK, i.e. that we are applying the patch but not actually using it?

davilla · 2014-01-20T19:43:28Z

possible

elupus · 2014-01-20T21:21:41Z

There was a shitload of nice info on lockfree stuff on http://preshing.com/ . Stuff like double checked lock pattern being broken. I'm pretty sure we use that in places.

getSingleton() {
  if(a == null) {
    lock();
    if(a == null)
      a = new foo();
    release();
  }
  return a;
}

Without memory barriers that is b0rk, and more likely to crash on arm than x86. Unrelated to issue, but still food for thought and education.

Sorry for the spam, just found it interesting.

koying · 2014-01-21T11:52:47Z

Superseeded by #4062
Thanks for your help

FIX: problematic use of shared_ptr on droid

672da20

koying closed this Jan 21, 2014

koying deleted the fixsharedptr branch January 21, 2014 11:53

FIX: problematic use of shared_ptr on droid #4039

FIX: problematic use of shared_ptr on droid #4039

Conversation

koying commented Jan 17, 2014

davilla commented Jan 17, 2014

koying commented Jan 17, 2014

Voyager1 commented Jan 17, 2014

jmarshallnz commented Jan 17, 2014

herrnst commented Jan 17, 2014

t-nelson commented Jan 17, 2014

koying commented Jan 18, 2014

davilla commented Jan 18, 2014

koying commented Jan 18, 2014

davilla commented Jan 18, 2014

koying commented Jan 18, 2014

davilla commented Jan 18, 2014

koying commented Jan 18, 2014

davilla commented Jan 18, 2014

davilla commented Jan 18, 2014

koying commented Jan 18, 2014

davilla commented Jan 18, 2014

jmarshallnz commented Jan 18, 2014

koying commented Jan 19, 2014

Voyager1 commented Jan 19, 2014

koying commented Jan 19, 2014

Voyager1 commented Jan 19, 2014

t-nelson commented Jan 19, 2014

t-nelson commented Jan 19, 2014

Voyager1 commented Jan 19, 2014

koying commented Jan 19, 2014

koying commented Jan 19, 2014

Voyager1 commented Jan 19, 2014

t-nelson commented Jan 19, 2014

t-nelson commented Jan 19, 2014

koying commented Jan 19, 2014

jimfcarroll commented Jan 19, 2014

jimfcarroll commented Jan 19, 2014

Voyager1 commented Jan 19, 2014

koying commented Jan 19, 2014

jimfcarroll commented Jan 19, 2014

koying commented Jan 19, 2014

jimfcarroll commented Jan 19, 2014

jimfcarroll commented Jan 19, 2014

Voyager1 commented Jan 19, 2014

jimfcarroll commented Jan 19, 2014

koying commented Jan 19, 2014

jimfcarroll commented Jan 19, 2014

koying commented Jan 19, 2014

jimfcarroll commented Jan 19, 2014

jmarshallnz commented Jan 19, 2014

elupus commented Jan 19, 2014

davilla commented Jan 20, 2014

koying commented Jan 20, 2014

t-nelson commented Jan 20, 2014

koying commented Jan 20, 2014

elupus commented Jan 20, 2014

davilla commented Jan 20, 2014

koying commented Jan 20, 2014

davilla commented Jan 20, 2014

elupus commented Jan 20, 2014

koying commented Jan 21, 2014