
libnfs crashes XBMC resuming - segfault rpc_allocate_pdu #28

Closed
Cyberwizzard opened this issue Nov 16, 2012 · 12 comments

@Cyberwizzard

The problem: I am running the XBMC nightly builds and often when I pause and then resume playback, XBMC crashes. With help from one of the XBMC devs I narrowed the problem down to a crash within libnfs, see also: http://trac.xbmc.org/ticket/13505

I started with libnfs 1.2.x from Ubuntu 11.10 (I think), upgraded to 1.3.0 from the nightly XBMC PPA and now pulled the master from here.

I have experienced this issue for a while now and it seems to get worse. My guess is that I keep running into a race condition.

Relevant parts of the backtrace as posted at http://xbmclogs.com/show.php?id=14577 :
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ff8d1f44700 (LWP 30968)]
0x00007ff8c931064e in rpc_allocate_pdu () from /usr/lib/x86_64-linux-gnu/libnfs.so.1
(gdb) thread apply all bt
snip
Thread 7 (Thread 0x7ff8d1f44700 (LWP 30968)):
#0 0x00007ff8c931064e in rpc_allocate_pdu () from /usr/lib/x86_64-linux-gnu/libnfs.so.1
#1 0x00007ff8c9317936 in rpc_nfs_read_async () from /usr/lib/x86_64-linux-gnu/libnfs.so.1

#2 0x00007ff8c93069b3 in nfs_pread_async () from /usr/lib/x86_64-linux-gnu/libnfs.so.1
#3 0x00007ff8c930e58d in nfs_pread () from /usr/lib/x86_64-linux-gnu/libnfs.so.1
#4 0x00007ff8c930e60a in nfs_read () from /usr/lib/x86_64-linux-gnu/libnfs.so.1
#5 0x0000000000c945b6 in XFILE::CNFSFile::Read(void*, long) ()
#6 0x0000000000bda796 in XFILE::CFile::Read(void*, long) ()
#7 0x0000000000a276d8 in CDVDInputStreamFile::Read(unsigned char*, int) ()

snip

I compiled libnfs from source and tried debugging the issue myself. I tried multiple things, checked all pointers in 'rpc_allocate_pdu' for sane values and concluded that the mallocs are checked properly and the memset was not causing the crash either. I then spotted that at the locations where it crashes, values from the 'rpc' struct pointer are read.

Perhaps the rpc struct is freed somewhere while 'rpc_allocate_pdu' is using it or something similar?

@sahlberg
Owner

Can you build with debugging so that you get the arguments to rpc_allocate_pdu()?
But you say that with gdb you have inspected 'rpc' and it is non-NULL and it
looks sane?

You can try adding printouts to
rpc_init_context() and rpc_destroy_context()

These should be the only places where rpc is malloced and freed, so if rpc
is freed somewhere this should show where that happens.

Can you also try running under valgrind? If it is accessing freed memory,
valgrind should show immediately where and what happens.

Do you by any chance compile libnfs with --enable-tirpc?
I have just added a workaround for tirpc, which seems to cause memory
corruption or crashes when using the accessors for 64-bit types.
So could you try my latest git tree as well and re-test?


@Cyberwizzard
Author

I added print statements to the function and simply dumped values to the console; so I did not use gdb to step through the program.

I will add some logging to init and destroy context to see if they clear the memory of 'rpc'.

@Cyberwizzard
Author

I numbered the calls to quickly see what was printed where and finally only left the print statements that showed the addresses from the rpc pointer.

Interestingly, you can see in the end that it suddenly uses a different pointer from the one that was in use before.

snip
1.1: rpc_allocate_pdu rpc @ 0x3f9a520
1.1: rpc_allocate_pdu rpc @ 0x3f9a520
1.1: rpc_allocate_pdu rpc @ 0x3f9a520
1.1: rpc_allocate_pdu rpc @ 0x3f9a520
2000: rpc_destroy_context destroyed rpc @ 0x3f9a520
1000: rpc_init_context rpc @ 0x4390060
1.1: rpc_allocate_pdu rpc @ 0x4390060
1.1: rpc_allocate_pdu rpc @ 0x4390060
1.1: rpc_allocate_pdu rpc @ 0x4390060
1.1: rpc_allocate_pdu rpc @ 0x4390060
1.1: rpc_allocate_pdu rpc @ 0x4390060
1.1: rpc_allocate_pdu rpc @ 0x4390060
1.1: rpc_allocate_pdu rpc @ 0x4390060
1.1: rpc_allocate_pdu rpc @ 0x4390060
1.1: rpc_allocate_pdu rpc @ 0x4390060
1.1: rpc_allocate_pdu rpc @ 0x4390060
1000: rpc_init_context rpc @ 0x5001cd0
1.1: rpc_allocate_pdu rpc @ 0x5001cd0
1.1: rpc_allocate_pdu rpc @ 0x5001cd0
1.1: rpc_allocate_pdu rpc @ 0x5001cd0
2000: rpc_destroy_context destroyed rpc @ 0x5001cd0
1.1: rpc_allocate_pdu rpc @ 0x4390060
1.1: rpc_allocate_pdu rpc @ 0x4390060
1.1: rpc_allocate_pdu rpc @ 0x4390060
1.1: rpc_allocate_pdu rpc @ 0x4390060
1.1: rpc_allocate_pdu rpc @ 0x4390060
1.1: rpc_allocate_pdu rpc @ 0x4390060
1.1: rpc_allocate_pdu rpc @ 0x4390060
1.1: rpc_allocate_pdu rpc @ 0x4390060
1.1: rpc_allocate_pdu rpc @ 0x4390060
1.1: rpc_allocate_pdu rpc @ 0x4390060
1.1: rpc_allocate_pdu rpc @ 0x4390060
1.1: rpc_allocate_pdu rpc @ 0x50eeb30
segfault

What can I do next? Would building it with debug support help?

@sahlberg
Owner

Hmmmm.

1.1: rpc_allocate_pdu rpc @ 0x50eeb30
segfault

This context where it crashes: was that context ever created?
Or is this a case of the rpc context pointer being overwritten by some
other random pointer? In which case 0x50eeb30 would just be some random
piece of memory.

Try to build XBMC with debugging so it is easier for you to look at the
stack backtraces. Also build it with -O0 and -g.

And when it crashes, look in the stack backtrace to try to see where XBMC
got this rpc context pointer from.

If this pointer 0x50eeb30 was never created through a call to
rpc_init_context, then it looks like XBMC might be storing the pointer to
the rpc_context in some structure, and just before this crash the memory
area for that object got overwritten with some other data.

But first, try to find out where XBMC got that pointer from.


@Cyberwizzard
Author

I built XBMC and libnfs with -O0 and -g, and got a log at http://xbmclogs.com/show.php?id=15024 showing the crash. The problem is that, according to Memphiz from XBMC, they keep one context per file and as such never touch the rpc directly.

I tried adding debug statements to the libnfs wrapper to see if it discards the nfs_context prematurely. But so far, it seems to behave as it should...

I saw the xdr commits in master, is that something I could try? As in: do I need to do something to use the new xdr routines except for 'git pull && git update && make clean all' ?

@sahlberg
Owner

You could try the xdr commit.

The change is for an issue where the 64-bit accessor on 64-bit platforms
seems to write to unallocated memory or something.
So the workaround is to use a new builtin function to read/write 64-bit
quantities instead of the ones coming from libc/tirpc.

It could have an impact.

Try latest git and see if it helps.


@Cyberwizzard
Author

Since I am not using tirpc (as far as I know), it did not seem to make a difference.

But the debug build of both libnfs and xbmc (with loads of debug output) did help me figure out the problem: xbmc had a bug where it considered an active context to be expired and destroyed the context. The subsequent call to nfs_read would then be given this invalid context, resulting in the crash; this was an issue with xbmc.

Thank you for helping!

@sahlberg
Owner

Good stuff!

I need to do a new release soon to handle the XDR 64 bit issue anyway.

I am thinking about adding an rpc_context->magic variable that is set on
context creation and cleared on context destruction, and then checked on
every call that takes an rpc context as an argument.
That would make it a lot easier to find similar issues if they reoccur.

Unfortunately, if the rpc_context is garbage, then I can't set the rpc_error
string.
So the only option would be to just do an assert,
but at least if that assert triggers it would be a lot easier to track down
the root cause.

If I add this tracking/check to libnfs over the next few days, would you be
willing to add the bug back to XBMC as a test, to see if the check triggers
and makes the issue easier to spot?

regards
ronnie sahlberg


@Cyberwizzard
Author

Sure, I have the test builds still on my system so testing the asserts should be no problem.

@sahlberg
Owner

Thanks.

I have pushed a change that adds a new rpc->magic field.
It is set when the context is created and cleared when the context is
destroyed.
All functions that take an rpc_context as an argument have an assert that
will trigger if called after the rpc was destroyed.

Please test that this assert triggers, so we know that this kind of defect
will be a lot easier to find in the future.

regards
ronnie sahlberg


@Cyberwizzard
Author

The skin in XBMC causing the issue was updated, and because of this I could not get it to crash like it used to.

I did alter XBMC to explicitly destroy a context just before it was used, which of course crashed the program.

Using gdb on XBMC yielded the following:


Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdd79a700 (LWP 27416)]
0x00007fffd3a94114 in nfs_get_rpc_context (nfs=0x7fffe62562b0) at libnfs.c:3624
3624            assert(nfs->rpc->magic == RPC_CONTEXT_MAGIC);

So it seems to be working.

@sahlberg
Owner

Good stuff.

OK, so at least we can trap this kind of issue more easily.

Thanks for testing.
I will release a new version of libnfs with these changes in a few days.

regards
ronnie sahlberg

