Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: problematic use of shared_ptr on droid #4039

Closed
wants to merge 1 commit into from

Conversation

koying
Copy link
Contributor

@koying koying commented Jan 17, 2014

@davilla
Copy link
Contributor

davilla commented Jan 17, 2014

This makes me real nervous, we moved to a shared_ptr for access to player class for a reason, I bet that reason still exists and is waiting to bite us in the ass again.

@koying
Copy link
Contributor Author

koying commented Jan 17, 2014

Yeah, I wanted to build a test and somehow had in mind I had to create a PR for it, while I don't.
Anyway, fact is this workarounds the issue, I hope someone could help finding out the root cause.

@Voyager1
Copy link

I'm thinking that ClosePlayer is perhaps where it goes wrong (although it shouldn't). Currently we get a copy of the internal shared_ptr (rc +1), then we reset the internal shared_ptr (rc -1), finally the copy goes out of scope (again rc -1).
One thing to try is to modify it as follows, should be just as nice, but without any double refcount reduction...

void CApplicationPlayer::ClosePlayer()
{
  CSingleLock lock(m_player_lock);
  if (m_pPlayer != NULL)
  {
    CloseFile();
    // we need to do this directly on the member
    m_pPlayer.reset();
  }
}

@jmarshallnz
Copy link
Contributor

I suggest a modification to:

void CApplicationPlayer::ClosePlayer()
{
  CloseFile();
  { // release the player
    CSingleLock lock(m_player_lock);
    m_pPlayer.reset();
  }
}

We don't need to hold the player lock (nor do we want to) while calling CloseFile, and we also don't need to hold the player pointer across that function call as it'll do it internally.

@herrnst
Copy link
Member

herrnst commented Jan 17, 2014

Throwing in a pair of cents: Especially ::ClosePlayer() seems to have some (locking) relation to the Python interface. With the locking ("CSingleLock lock(m_player_lock);") at the beginning of the method, there was a very high chance of a deadlock when playback got stopped and some Python addon was doing something player-related - #3200 + #3206 (fixup).

@t-nelson
Copy link
Contributor

Odds are the deadlock was actually caused by holding the lock over CloseFile().

@koying
Copy link
Contributor Author

koying commented Jan 18, 2014

Welcome, puzzle-lovers ;)
See https://app.box.com/s/5r4afo9og1bjx5ye9zy3 for a log of the issue.
Hell breaks loose around line 669284, where the CDVDPlayer gets deleted from apparently nowhere.

Note that:

  • It's crashing, not deadlocking. I think the crash is just a secondary symptom of the deletion, though.
  • It's 5 miin into the video. That might lead to a random thread-safety issue.
  • The first delete seems to originate from the main thread. Could be simultaneous with the second one, though.

I've tried JM's ClosePlayer() patch but it doesn't solve this.

@davilla
Copy link
Contributor

davilla commented Jan 18, 2014

eh, what is python doing in the background ? looks like you are playing something via smb.

@koying
Copy link
Contributor Author

koying commented Jan 18, 2014

The python threads are services. trakt & skin.widgets IIRC

@davilla
Copy link
Contributor

davilla commented Jan 18, 2014

do you suspect them in this issue ? otherwise, why log this ?

@koying
Copy link
Contributor Author

koying commented Jan 18, 2014

I don't suspect the addons, but this only happens with multi-threaded python calls.
In line 669368, see that ~CDVDPlayer() is called inside XBMCAddon::xbmc::Player::isPlayingVideo()

@davilla
Copy link
Contributor

davilla commented Jan 18, 2014

I'm not sure you can infer that from the log, you can't see the bt. All you know is ~CDVDPlayer() occurred after that particular XBMCAddon::xbmc::Player::isPlayingVideo()

I would setup a trap, check use_count in ~CDVDPlayer() on that shared pointer and when it hits zero, try and dump a backtrace to see who did the dirty deed.

@koying
Copy link
Contributor Author

koying commented Jan 18, 2014

Well, it's in the same thread and we see
NEWADDON Entering bool XBMCAddon::xbmc::Player::isPlayingVideo()
but no
NEWADDON Leaving bool XBMCAddon::xbmc::Player::isPlayingVideo()

Pretty convincing to me

@davilla
Copy link
Contributor

davilla commented Jan 18, 2014

humm, then boost<shared_ptr> is failing. I'd still setup a trap

@davilla
Copy link
Contributor

davilla commented Jan 18, 2014

if shared_ptr is failing then we have big, big problems.

@koying
Copy link
Contributor Author

koying commented Jan 18, 2014

Has anyone seen strange crashes like this on other platforms?
That'd tell us whether it's a boost issue or a logic issue on our side...

@davilla
Copy link
Contributor

davilla commented Jan 18, 2014

never seen it on my droid boxes and I abuse the hell out of them.

@jmarshallnz
Copy link
Contributor

It's not obvious to me whether the deletion is due to the use_count dropping to zero, but I'm guessing that's what it is as you haven't said otherwise :)

I'd suggest you also need logging in CApplicationPlayer to see exactly which calls are being performed - it's not obvious to me what the process is from the massive amount of logging in the python stuff. I'd remove that logging and instead add logging to GetInternal() and Close/CreatePlayer, where you log the use_count of the shared_ptr. You could add logging to all the other functions in there if you want as well - it might help give context to the log.

I can't really think of anything that your change to use raw pointers would achieve - essentially all you're doing is holding an unsafe pointer instead of a safe one by dereferencing the smart ptr earlier (smart_ptr held only for the duration of the get()) during the interaction with the player (i.e. that pointer could well be deleted out from under you while you still hold it).

The issue with ClosePlayer is not a problem in current master I don't think - rather it was fixing an issue that would be introduced if @Voyager1's suggestion was taken up (deadlock).

The only thing I can think of is some weird issue with a shared_ptr not being copied (use_count not incremented properly) which seems unlikely.

@koying
Copy link
Contributor Author

koying commented Jan 19, 2014

Replacing all

boost::shared_ptr<IPlayer> player = GetInternal();

with

  CSingleLock lock(m_player_lock);
  boost::shared_ptr<IPlayer> player = m_pPlayer;

also solves the issue.
Would that be acceptable?

http://forum.xbmc.org/showthread.php?tid=183467 seems also related.

@Voyager1
Copy link

I don't get how this is different from current code... You're copying the smart pointer through the assignment operator, while the old GetInternal() code just return the copy through the call stack...

@koying
Copy link
Contributor Author

koying commented Jan 19, 2014

Neither do I, which points to a shared_ptr issue, as the title says ;)

@Voyager1
Copy link

wait a minute... I see the difference. The reason it works now is because every CApplicationPlayer method now locks the player for the entire duration of the method execution. So you've essentially serialized access to the player. This potentially creates other deadlock issues.

if you try this instead, you'll know whether it's a shared_ptr issue or a locking one.

boost::shared_ptr<IPlayer> player;
{
  CSingleLock lock(m_player_lock);
  player = m_pPlayer;
}

@t-nelson
Copy link
Contributor

The difference is that the current code only increments the refcount atomically while the decrement happens at will. This holds the lock throughout the duration. The shared_ptr isn't even needed in this case.

@t-nelson
Copy link
Contributor

Ah, @Voyager1 had the lightbulb at the same time. :)

Yes, we do step back into deadlock territory now. Perhaps we should just hold our own refcount to m_pPlayer and manage it through an Acquire()/Release() interface?

@Voyager1
Copy link

@t-nelson we're saying the same thing (although the new code also increases refcount.)

@koying
Copy link
Contributor Author

koying commented Jan 19, 2014

Why would it deadlock? is ApplicationPlayer reentrant?
Why do we currently lock while incrementing and not while decrementing?
BTW, are we sure we do? Is it guaranteed somehow that the returned copy is done before the dtor of the lock?

@koying
Copy link
Contributor Author

koying commented Jan 19, 2014

I tend to agree with @t-nelson . I assume we need atomic ref-counting for some reason, but we don't need a shared_ptr for a pointer which is not actually shared, as it is contained in the class.

@Voyager1
Copy link

@koying We lock while incrementing to be sure we get a valid player reference, which stays valid for as long as we use it. So there's no need to lock while decrementing. That's maybe where some of the problems arise - maybe boost is not thread safe and the decrement could be incorrect? (sorry if this is an ignorant assumption).
For your last question : the copy constructor is called before any destruction of locals (which would be the "unlock").
I don't know about AP being reentrant, but I just highlighted the risk.

@t-nelson
Copy link
Contributor

operations on a shared_ptr are non-atomic

@t-nelson
Copy link
Contributor

Our problem doesn't occur on acquisition, it's on deletion. Presumably, since decrement isn't atomic, two threads zero the refcount at once and BANG!

@koying
Copy link
Contributor Author

koying commented Jan 19, 2014

"Different shared_ptr instances can be "written to" (accessed using mutable operations such as operator= or reset) simultaneosly by multiple threads (even when these instances are copies, and share the same reference count underneath.)"
As we only decrement m_pPlayer copies, that should be fine, although it seems it is not.

@jimfcarroll
Copy link
Member

@t-nelson I'm not sure how 2 threads can 0 a shared_ptr at the same time unless the ref count is messed up. The only problem I can think of is if one thread is 0'ing it an another is trying to fetch it at the same time. Then you can end up with the one trying to fetch it getting a deleted instance.

@jimfcarroll
Copy link
Member

Do we know if the player is getting deleted and then used? Is there a log in the destructor?

@Voyager1
Copy link

You guys sure about that? I thought #define BOOST_SP_DISABLE_THREADS is how you have non-atomic refcounts. Otherwise, they should be atomic.

Regarding the player getting deleted: ClosePlayer() and CreatePlayer() call reset() on m_pPlayer - each causing a decrement of the refcount of the (previously) held pointer.

@koying
Copy link
Contributor Author

koying commented Jan 19, 2014

Another one: http://forum.xbmc.org/showthread.php?tid=183785
Just so that everyone is convinced there is a problem ;)

@jimfcarroll
Copy link
Member

EDIT @Voyager1. While I can't say I'm sure what's happening, I understand what the docs say. It's not thread safe if you're writing the same instance of a shared_ptr from two different threads. If you want to verify if this is what's happening then simply avoid direct use of the shared_ptr and see if the problem goes away. If you want I can provide you with a commit that will do this but I can't verify it.

Let me know. It would take me about 10 minutes.

@koying
Copy link
Contributor Author

koying commented Jan 19, 2014

@jimfcarroll But we never write to the same instance. We always use copies of the shared_ptr to decrement, and the increment is locked.

@jimfcarroll
Copy link
Member

Nope. There's at least 3 places the same instance can be used from multiple threads.

@jimfcarroll
Copy link
Member

well ... actually, those places require the destructor to be the thing that's called from the "other" thread. In the current master they are protected with a lock.

@Voyager1
Copy link

@jimfcarroll thanks for clarifying. I get it now.

@jimfcarroll
Copy link
Member

@koying, hey ... if what I suggested is what's happening, add a destructor and use the lock while resetting the m_pPlayer. If that fixes it then that was the problem. If not, the universe is inconsistent. :-)

@koying
Copy link
Contributor Author

koying commented Jan 19, 2014

The only root cause I can see, considering that doing locked decrements solves the issue, is that the boost::shared_ptr we are using on android is not thread-safe when sharing the same counter, even if using different instances and contrary to the doc.

@jimfcarroll
Copy link
Member

So we're 100% sure the destructor on CApplicationPlayer isn't being called? If so, then I tend to agree with you.

@koying
Copy link
Contributor Author

koying commented Jan 19, 2014

CApplicationPlayer is basically a singleton, created and destructed together with CApplication

@jimfcarroll
Copy link
Member

I don't want to keep spaming everyone ... can you jump on IRC?

@jmarshallnz
Copy link
Contributor

Surely adding a log line in GetInternal() that outputs the use_count() will assist in figuring out where things are going bang and verify where the problem is, rather than guessing ;)

The destructor should not be being called, and every other write (or read) or the original shared_ptr is under a lock. Thus, either the locks are broken, or shared_ptr is broken or the copy mechanism of shared_ptr is being optimised away in some way, or something else we're not thinking of. I suspect the first three are unlikely.

The locked decrements are fundamentally changing the code as they're also serialising access to the player (in addition to the shared_ptr) - that change is definitely wrong and could well cause deadlocks as some of those functions take quite some time to process (and potentially fire off a bunch of callbacks on other threads).

I doubt replacing shared_ptr with a custom ref-counting scheme will do anything - all we're using shared_ptr here for is for the reference counting.

@elupus
Copy link
Contributor

elupus commented Jan 19, 2014

Could there be a bug with getinternal getting optimized away?

Did anybody try putting a shared ptr on stack inside getinternal before
returning it?

But I agree with Jonathan here, a debug line would help.
On Jan 19, 2014 10:20 PM, "jmarshallnz" notifications@github.com wrote:

Surely adding a log line in GetInternal() that outputs the use_count()
will assist in figuring out where things are going bang and verify where
the problem is, rather than guessing ;)

The destructor should not be being called, and every other write (or read)
or the original shared_ptr is under a lock. Thus, either the locks are
broken, or shared_ptr is broken or the copy mechanism of shared_ptr is
being optimised away in some way, or something else we're not thinking of.
I suspect the first three are unlikely.

The locked decrements are fundamentally changing the code as they're also
serialising access to the player (in addition to the shared_ptr) - that
change is definitely wrong and could well cause deadlocks as some of those
functions take quite some time to process (and potentially fire off a bunch
of callbacks on other threads).

I doubt replacing shared_ptr with a custom ref-counting scheme will do
anything - all we're using shared_ptr here for is for the reference
counting.


Reply to this email directly or view it on GitHubhttps://github.com//pull/4039#issuecomment-32720255
.

@davilla
Copy link
Contributor

davilla commented Jan 20, 2014

it's worth figuring out if there is some issue with shared_ptr, we use it all over the place.

@koying
Copy link
Contributor Author

koying commented Jan 20, 2014

Gents, please provide me with the debug patches you want and I'll happily
run them.

If it's a threading issue inside shared_ptr, I tend to think logging might
prove inconclusive though

@t-nelson
Copy link
Contributor

From http://www.boost.org/doc/libs/1_55_0/libs/smart_ptr/shared_ptr.htm#ThreadSafety

You can define the macro BOOST_SP_USE_PTHREADS to turn off the lock-free platform-specific
implementation and fall back to the generic pthread_mutex_t-based code.

Have we tried this? Maybe the boost atomics are busted under android.

@koying
Copy link
Contributor Author

koying commented Jan 20, 2014

@t-nelson So far so good. Probably not an actual good news, but at least we seem to have a viable workaround.
I'll update the PR tomorrow if the fix is confirmed after the "night session" ;)

@elupus
Copy link
Contributor

elupus commented Jan 20, 2014

Could you check also what happens if you change the internal m_pPlayer declaration from:
"shared_ptr" to "volatile shared_ptr".

Only the member variable should need changing. I'm thinking some optimizations could break stuff. The shared_ptr is accessed from multiple thread, even thou it's properly protected.

But then again.. It's arm here. I wouldn't be too surprised if the CAS implementations in boost are broken somewhere.

@davilla
Copy link
Contributor

davilla commented Jan 20, 2014

@koying
Copy link
Contributor Author

koying commented Jan 20, 2014

@davilla Are they saying that we need BOOST_SP_USE_SPINLOCK, i.e. that we are applying the patch but not actually using it?

@davilla
Copy link
Contributor

davilla commented Jan 20, 2014

possible

@elupus
Copy link
Contributor

elupus commented Jan 20, 2014

There was a shitload of nice info on lockfree stuff on http://preshing.com/ . Stuff like double checked lock pattern being broken. I'm pretty sure we use that in places.

getSingleton() {
  if(a == null) {
    lock();
    if(a == null)
      a = new foo();
    release();
  }
  return a;
}

Without memory barriers that is b0rk, and more likely to crash on arm than x86. Unrelated to issue, but still food for thought and education.

Sorry for the spam, just found it interesting.

@koying
Copy link
Contributor Author

koying commented Jan 21, 2014

Superseeded by #4062
Thanks for your help

@koying koying closed this Jan 21, 2014
@koying koying deleted the fixsharedptr branch January 21, 2014 11:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants