Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libnfs-related SegFault/Abort with recent master builds of Kodi #124

Closed
MilhouseVH opened this Issue Aug 7, 2015 · 19 comments

Comments

Projects
None yet
2 participants
@MilhouseVH
Copy link
Contributor

MilhouseVH commented Aug 7, 2015

I've had a report of crashes in Kodi with RPi2 test builds after 2 August - this would coincide with the inclusion of 2288339 (1.9.8).

Prior to this, my test builds had been including libnfs 6894180 and no reported issues.

Since the 1.9.8 bump on 2 August appears fairly benign, I'm not sure what else might be responsible for these crashes.

Below is a debug-enabled crashlog from build #0803 - would appreciate if you could give it the once-over:

http://sprunge.us/HKOD

PS. please ignore the references to 1.9.7, this is just the OpenELEC internal version that I haven't changed as it simplifies the process of updating source tarballs - be assured I'm using the source from libnfs 2288339.

@MilhouseVH MilhouseVH changed the title Seg Fault with recent master in Kodi libnfs-related SegFault/Abort with recent master builds of Kodi Aug 7, 2015

@sahlberg

This comment has been minimized.

Copy link
Owner

sahlberg commented Aug 7, 2015

Hm.

So, as far as you know, up to and including 6894180 all was well ?

I will try to see what could be happening. But all changes post 6894180
looks quite bening.

On Fri, Aug 7, 2015 at 9:24 AM, MilhouseVH notifications@github.com wrote:

I've had a report of crashes
http://forum.kodi.tv/showthread.php?tid=231092&pid=2073258#pid2073258
in Kodi with RPi2 test builds after 2 August - this would coincide with the
inclusion of 2288339
2288339
(1.9.8).

Prior to this, my test builds had been including libnfs 6894180
6894180
and no reported issues.

Since the 1.9.8 bump on 2 August appears fairly benign, I'm not sure what
else might be responsible for these crashes.

Below is a debug-enabled crashlog from build #0803
http://forum.kodi.tv/showthread.php?tid=231092&pid=2070514#pid2070514 -
would appreciate if you could give it the once-over:

http://sprunge.us/HKOD

PS. please ignore the references to 1.9.7, this is just the OpenELEC
internal version that I haven't changed as it simplifies the process of
updating source tarballs - be assured I'm using the source from libnfs
2288339
2288339
.


Reply to this email directly or view it on GitHub
#124.

@MilhouseVH

This comment has been minimized.

Copy link
Contributor Author

MilhouseVH commented Aug 8, 2015

So, as far as you know, up to and including 6894180 all was well ?

Yes, no reports of problems with 6894180.

According to the user experiencing the crashes this only started with the #0803 build[1], which to be honest doesn't really make a great deal of sense as the #120 and 1.9.8 commits were added to libnfs on 2 Aug and included in build #0802, so #0802 should also have been crashing but apparently did not.

Although my suspicion would be that if this is in any way related to the most recent 2288339 commits, then it may just be that a crash has not yet been observed in #0802 as the crash obviously doesn't happen often, takes a while and I don't think we know what is triggering the crash, so it could just be that #0802 hasn't been tested long enough to confirm the problem.

So far there is just the one user reporting this issue.

  1. http://forum.kodi.tv/showthread.php?tid=231092&pid=2073647#pid2073647
@sahlberg

This comment has been minimized.

Copy link
Owner

sahlberg commented Sep 6, 2015

The assert means that we have a data corruption, or use after free, issue.

I think one of my merges today to address an issue with the reconnect logic where we might, at best, deadlock in some situations when we reconnect a tcp connection could be related.

Lets see if the user can reproduce using 5c7a0f0 or later.

@MilhouseVH

This comment has been minimized.

Copy link
Contributor Author

MilhouseVH commented Sep 6, 2015

OK many thanks, I did ask the user if this issue had repeated but got no response it had not (doh, missed this original reply in the forum) - I'll ping them again after tonight's build which will include the latest updates.

@MilhouseVH

This comment has been minimized.

Copy link
Contributor Author

MilhouseVH commented Oct 14, 2015

Hi, there's been another couple of libnfs segfault reports when using latest libnfs master 82d2a22 with Kodi master:

http://forum.kodi.tv/showthread.php?tid=231092&pid=2133289#pid2133289 (with debug-enabled crashlog)
http://forum.kodi.tv/showthread.php?tid=231092&pid=2133038#pid2133038 (regular crashlog)

@sahlberg

This comment has been minimized.

Copy link
Owner

sahlberg commented Oct 15, 2015

Looking at the first one.
This is actually not a segfault but an assert in the code. I think we have
memory corruption here.

The backtrace :

#0 GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:56
#1 0x75519a68 in __GI_abort () at abort.c:89
#2 0x75511530 in __assert_fail_base (fmt=0x7560018c "%s%s%s:%u:
%s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x75620088
"", file=0x69cf94f9
"/home/neil/projects/OpenELEC.tv/build.OpenELEC-RPi2.arm-6.0-devel-debug/libnfs-1.9.8/lib/pdu.c",
file@entry=0x662bf400 "\001", line=1, line@entry=2,
function=function@entry=0x69cfe65a <__PRETTY_FUNCTION
.8121>
"rpc_allocate_pdu") at assert.c:92
#3 0x7551160c in GI___assert_fail (assertion=0x75620088 "",
file=0x662bf400 "\001", line=2, function=0x69cfe65a
<__PRETTY_FUNCTION
.8121> "rpc_allocate_pdu") at assert.c:101
#4 0x69cf2d30 in rpc_allocate_pdu (rpc=0x662bf400, program=, version=1775211769, procedure=1, cb=0x69cf4394
<nfs_pread_mcb.lto_priv.18>, private_data=0x6414c070,
zdr_decode_fn=0x69ce6088 <zdr_READ3res>, zdr_decode_bufsize=120) at
/home/neil/projects/OpenELEC.tv/build.OpenELEC-RPi2.arm-6.0-devel-debug/libnfs-1.9.8/lib/pdu.c:92
#5 0x69ced728 in rpc_nfs3_read_async (rpc=0x35d3038, cb=, args=0x662be6a8, private_data=) at
/home/neil/projects/OpenELEC.tv/build.OpenELEC-RPi2.arm-6.0-devel-debug/libnfs-1.9.8/nfs/nfs.c:283
#6 0x69cf6fcc in nfs_pread_async_internal (nfs=0x3f5bed0,
nfsfh=0x5039b6e0, offset=561266688, count=7624436665724438864,
cb=0x69ce2048 <pread_cb>, cb@entry=0x3f5bed0, private_data=0x662be720,
private_data@entry=0x69ce2d90 <nfs_read+64>,
update_pos=update_pos@entry=1) at
/home/neil/projects/OpenELEC.tv/build.OpenELEC-RPi2.arm-6.0-devel-debug/libnfs-1.9.8/lib/libnfs.c:2387
#7 0x69cf70c0 in nfs_read_async (nfs=,
nfsfh=, count=, cb=0x69ce2048
<pread_cb>, private_data=0x662be720) at
/home/neil/projects/OpenELEC.tv/build.OpenELEC-RPi2.arm-6.0-devel-debug/libnfs-1.9.8/lib/libnfs.c:2416
#8 0x69ce2d90 in nfs_read (nfs=0x3f5bed0, nfsfh=,
count=, buffer=0x5077cd70 "\342") at
/home/neil/projects/OpenELEC.tv/build.OpenELEC-RPi2.arm-6.0-devel-debug/libnfs-1.9.8/lib/libnfs-sync.c:413
#9 0x005b3060 in XFILE::CNFSFile::Read (this=0x69787ac8,
lpBuf=0x5077cd70, uiBufSize=32768) at NFSFile.cpp:646
#10 0x0057e5c4 in XFILE::CFile::Read (this=0x6106c858,
lpBuf=0x5077cd70, uiBufSize=32768) at File.cpp:540

struct rpc_pdu *rpc_allocate_pdu(struct rpc_context *rpc, int program,
int version, int procedure, rpc_cb cb, void *private_data, zdrproc_t
zdr_decode_fn, int zdr_decode_bufsize)
{
struct rpc_pdu *pdu;
struct rpc_msg msg;

assert(rpc->magic == RPC_CONTEXT_MAGIC);   <=== Here is where it

asserted. The RPC context is bogus.

=== It is called from :

int rpc_nfs3_read_async(struct rpc_context *rpc, rpc_cb cb, struct
READ3args *args, void *private_data)
{
struct rpc_pdu *pdu;

pdu = rpc_allocate_pdu(rpc, NFS_PROGRAM, NFS_V3, NFS3_READ, cb,

private_data, (zdrproc_t)zdr_READ3res, sizeof(READ3res));

=== which is called from :

static int nfs_pread_async_internal(struct nfs_context *nfs, struct
nfsfh *nfsfh, uint64_t offset, uint64_t count, nfs_cb cb, void
*private_data, int update_pos)
{

...

    nfs_fill_READ3args(&args, nfsfh, offset, readcount);

    if (rpc_nfs3_read_async(nfs->rpc, nfs_pread_mcb, &args, mdata) != 0) {

=== which is called from :

int nfs_read_async(struct nfs_context *nfs, struct nfsfh *nfsfh,
uint64_t count, nfs_cb cb, void *private_data)
{
return nfs_pread_async_internal(nfs, nfsfh, nfsfh->offset, count, cb,
private_data, 1);
}

=== which is called from :

int nfs_read(struct nfs_context *nfs, struct nfsfh *nfsfh, uint64_t
count, char *buffer)
{
struct sync_cb_data cb_data;

cb_data.is_finished = 0;
cb_data.return_data = buffer;
cb_data.call = "read";

if (nfs_read_async(nfs, nfsfh, count, pread_cb, &cb_data) != 0) {
    nfs_set_error(nfs, "nfs_read_async failed");
    return -1;
}

which is called from kodi.

So the assert itself is triggered because we do not have a valid RPC
context. The magic field is wrong in this field.

This could happen if

1, we passed a garbage nfs context pointer from kodi->libnfs. Unlikely.

2, if the nfs context is good, and nfs->rpc context is good too, but
you have done a rpc_destroy_context(nfs->rpc). Unliklely too since I
dont think ever call this function.

This is because we clear rpc->magic when we destroy the context, to
catch "use after destroyed".

3, You do call nfs_destroy_context() which in turn calls
rpc_destroy_context(). So this COULD happen if you called
nfs_destroy_context() (there are two places in NFSFile.cpp) and you
would later call nfs_read() on this contecxt that has already been
destroyed/freed. That scenario would trigger this.

But, then, that does not explain why this only happens for this
specific version of libnfs.

(Could you add logging and print when you create a nfs_context, and
when you destroy the context and the pointer value for the context?)

4, there is memory corruption that has caused the rpc context to be
overwritten and thus rpc->magic

is not longer valid.

I think 4 is most likely here, followed by unlikely 3, and 1 and 2
are completely unlikely.

Something that is REALLY interesting though is this part of the stack trace :

#6 0x69cf6fcc in nfs_pread_async_internal (nfs=0x3f5bed0,
nfsfh=0x5039b6e0, offset=561266688, count=7624436665724438864,
cb=0x69ce2048 <pread_cb>, cb@entry=0x3f5bed0, private_data=0x662be720,
private_data@entry=0x69ce2d90 <nfs_read+64>,
update_pos=update_pos@entry=1) at
/home/neil/projects/OpenELEC.tv/build.OpenELEC-RPi2.arm-6.0-devel-debug/libnfs-1.9.8/lib/libnfs.c:2387

...

#9 0x005b3060 in XFILE::CNFSFile::Read (this=0x69787ac8,
lpBuf=0x5077cd70, uiBufSize=32768) at NFSFile.cpp:646

In stackframe #6 count is suddenly a HUGE random value, while it
should be just 32768.

Unfortunately, in the intermediate frames we can not see what this
argument is due to the optimizations (could you disable the
optimizations when compiling libnfs?)

But as both the functions in frames #7 and #8 simply just pass the
value on to the next call. I can not imagine that #7/#8 are part of
the issue anyway.

Very strange. Looks memory memory corruption. Could the stack have
been corrupted during the transit from frame #6 to frame #9 ? And that
could have corrupted the value of count ?

Is it possible to get libnfs built without optimization? so the stack
trace would show the value of count in the intermediate frames ?

I will look further.

On Wed, Oct 14, 2015 at 11:12 AM, MilhouseVH notifications@github.com
wrote:

Hi, there's been another couple of libnfs segfault reports when using
latest libnfs master 82d2a22
82d2a22
with Kodi master:

http://forum.kodi.tv/showthread.php?tid=231092&pid=2133289#pid2133289
(with debug-enabled crashlog)
http://forum.kodi.tv/showthread.php?tid=231092&pid=2133038#pid2133038
(regular crashlog)


Reply to this email directly or view it on GitHub
#124 (comment).

@MilhouseVH

This comment has been minimized.

Copy link
Contributor Author

MilhouseVH commented Oct 16, 2015

(Could you add logging and print when you create a nfs_context, and when you destroy the context and the pointer value for the context?)

If you have a patch with extra logging (or anything else you think might be worth adding) I'll be more than happy to apply it in a test build and get you updated logs.

Is it possible to get libnfs built without optimization? so the stack trace would show the value of count in the intermediate frames ?

I've uploaded a new debug-enabled build (#1015x) which uses -O0 instead of -Os when building libnfs - I've requested updated crashlogs from this build.

@MilhouseVH

This comment has been minimized.

Copy link
Contributor Author

MilhouseVH commented Oct 26, 2015

Hi. Nobody has been able to reproduce the crash problem with build #1015x (optimisations disabled, debug-enabled) in a week of testing.

However after switching back to a regular build, #1022, an NFS-related crash has occurred. The new crash is a little different from before, but does seem to NFS related, but could also be a Kodi core issue (is the nfs_service failed significant)?

Could the fact that the original crash problem didn't happen with #1015x suggest that a compiler optimisation is responsible (gcc4.9 on OpenELEC)?

@sahlberg

This comment has been minimized.

Copy link
Owner

sahlberg commented Oct 27, 2015

I don't think there is a bug in the compiler optimization. It is more
likely a bug in libnfs, but a bug that is easier to trigger when
optimizations are on.

The nfs_service error might is interesting.
It suggests that the issue may be related to connection failures/reconnect.
Pull request 118 contained a bunch of changes to reconnect and socket logic
and I strongly suspect now that this is where the change that causes these
issues come from. It also changes some memory allocation logic.

Since this merge contains both changes to how to deal with socket
disconnects as well as memory allocations, and the crashes also seem to
relate to the same areas makes the merge highly suspect.

I will try to revert these patches over the next few days.

On Sun, Oct 25, 2015 at 7:48 PM, MilhouseVH notifications@github.com
wrote:

Hi. Nobody has been able to reproduce the crash problem with build #1015x
(optimisations disabled, debug-enabled) in a week of testing.

However after switching back to a regular build, #1022, an NFS-related
crash has occurred. The new crash
http://forum.kodi.tv/showthread.php?tid=231092&pid=2143138#pid2143138
is a little different from before, but does seem to NFS related, but could
also be a Kodi core issue (is the nfs_service failed significant)?

Could the fact that the original crash problem didn't happen with #1015x
suggest that a compiler optimisation is responsible (gcc4.9 on OpenELEC)?


Reply to this email directly or view it on GitHub
#124 (comment).

@MilhouseVH

This comment has been minimized.

Copy link
Contributor Author

MilhouseVH commented Oct 28, 2015

I'd be happy to include any patches in the builds that might help narrow down or confirm the issue, for example additional logging - the patches wouldn't have to be committed so if you'd like something tested before committing, let me know.

@MilhouseVH

This comment has been minimized.

Copy link
Contributor Author

MilhouseVH commented Feb 5, 2016

Hi. I've just experienced a SIGSEGV (most likely an assert, due to corruption), this time with x86 Kodi 17 master and latest libnfs master (build details).

Crash log: http://sprunge.us/PNZe (unfortunately not debug-enabled but maybe enough to give a clue?)

The crash happened after I'd paused a video for quite a while (13h16m!) and when I unpaused the video to resume playback Kodi immediately crashed.

Kodi crashing "after pausing a video for a while then unpausing" seems to crop up fairly often on the forums, but is hard to reproduce on demand.

Could Kodi be corrupting something libnfs-related while libnfs keeps the connection alive? Or maybe after a certain amount of time (several hours?) the NFS context is invalidated by the server? My NFS server is a FreeNAS 8.3.x box.

What seems a little odd is this:

07:49:30 17235.697266 T:139814325196928  NOTICE: NFS: sending keep alive after 180 s.
21:05:46 65011.500000 T:139814325196928  NOTICE: Previous line repeats 97 times.

That's 13h16m elapsed, but 97 x 180 is only 17460 seconds or 4h51m - shouldn't there have been more like 265 keep alive attempts during 13h16m?

I'll see if I can get lucky with a debug-enabled build to reproduce and get you a better stacktrace (no promises though - this is a pretty random/hard to reproduce issue).

@sahlberg

This comment has been minimized.

Copy link
Owner

sahlberg commented Feb 5, 2016

On Thu, Feb 4, 2016 at 8:50 PM, MilhouseVH notifications@github.com wrote:

Hi. I've just experienced a SIGSEGV (most likely an assert, due to corruption), this time with x86 Kodi 17 master and latest libnfs master (build details).

Crash log: http://sprunge.us/PNZe (unfortunately not debug-enabled but maybe enough to give a clue?)

The crash happened after I'd paused a video for quite a while (13h15m!) and when I unpaused the video to resume playback Kodi immediately crashed.

Kodi crashing "after pausing a video for a while then unpausing" seems to be crop up fairly often on the forums, but is hard to reproduce on demand.

Could Kodi be corrupting something libnfs-related while libnfs keeps the connection alive? Or maybe after a certain amount of time (several hours?) the NFS context is invalidated by the server? My NFS server is a FreeNAS 8.3.x box.

libnfs itself does not try to keep the connection alive. Instead, IF
the server has decided to tear down the connection after x amount of
inactivity,
on next call through libnfs it will detect this and try to reconnect
the connection.
You can see this in action by for example restarting the nfs server
while you are doing i/o.

I think that kodi even tries to prevent this from happening (server
tearing down connection due to inactivity) by issuing a read() for 32
bytes every 3 minutes or so.
This should prevent the teardown from happening and is probably where
kodi is logging the "sending keep alive" message.

What seems a little odd is this:

07:49:30 17235.697266 T:139814325196928 NOTICE: NFS: sending keep alive after 180 s.
21:05:46 65011.500000 T:139814325196928 NOTICE: Previous line repeats 97 times.

That's 13h16m, but 97 x 180 is only 17460 seconds, or 4h51m - shouldn't there have been more like 265 keep alive attempts in 13h16m?

I think it should. There should definitely be 265 of them.
This is clearly an indication that something has gone quite wrong.

This is what kodi does for the keepalives:

void CNfsConnection::keepAlive(std::string _exportPath, struct nfsfh
__pFileHandle)
{
uint64_t offset = 0;
char buffer[32];
// this also refreshs the last accessed time for the context
// true forces a cachehit regardless the context is timedout
// on this call we are sure its not timedout even if the last accessed
// time suggests it.
struct nfs_context *pContext = getContextFromMap(_exportPath, true);
if (!pContext)// this should normally never happen - paranoia
pContext = m_pNfsContext;
CLog::Log(LOGNOTICE, "NFS: sending keep alive after %i
s.",KEEP_ALIVE_TIMEOUT/2);
CSingleLock lock(_this);
m_pLibNfs->nfs_lseek(pContext, _pFileHandle, 0, SEEK_CUR, &offset);
m_pLibNfs->nfs_read(pContext, _pFileHandle, 32, buffer);
m_pLibNfs->nfs_lseek(pContext, _pFileHandle, offset, SEEK_SET, &offset);
}

Note that while this function does call out to nfs_read() too, it does
so after it has printed the log message.
We did however not crash in the call to nfs_read() here ccording to
the backtrace.

This looks like kodi started writing these log entries as it should at 07:49
then every three minutes for the next 4h51 it did another keepalive
and logged another message.
8-9 hours later when kodi crashed, the logger would then collapse the
97 messages from ~8-9 hours ago that was yet to be written to disk
into a single line and emitted
21:05:46 65011.500000 T:139814325196928 NOTICE: Previous line repeats 97 times.

I.e. the line
21:05:46 65011.500000 T:139814325196928 NOTICE: Previous line repeats 97 times.
comes from the logger flushing all the remaining 97 entries, that are
all 8-9 hours old, to disk as part of the crash handling.

The crash then happened as since you unpaused the video. Again hit a
corrupted nfs context which then caused a segfault.
Thread 1 (Thread 0x7f288bfff700 (LWP 4230)):

#0 0x00007f290ed52ce2 in ?? () from /usr/lib/libnfs.so.8
#1 0x00007f290ed52fd9 in nfs_read_async () from /usr/lib/libnfs.so.8
#2 0x00007f290ed56ab7 in nfs_read () from /usr/lib/libnfs.so.8
#3 0x00000000009f5d08 in XFILE::CNFSFile::Read(void*, unsigned long) ()

We dont have any asserts to check for the magic value in the
nfs_read() codepath, so instead it here
blew up in nfspread_async_internal() at the point where we first
dereference nfs :
if (nfs->rpc->readahead

I.e. just like last time, the nfs structure has been corrupted. But we
segv instead of triggering an assert.

My hypothesis is that everything was running fine for 4h51m.
Then at that point some part of kodi decided that we no longer needed
the context.
At that point it ceased to run the keepalive loop and this is why the
log messages stopped coming.
It also at that point valled nfs_destroy_context() which will free()
the memory used for the context.
This memory was then re-used sometime later and overwritten.

Other parts of kodi still held a pointer to the now free()d context
and 8-9 hours later, when you tried to unpause it
it blew up immediately since the context was bad and the nfs->rpc->
dereference segfaulted.

I think the missing keepalive log messages could be a smoking gun.

Now, what to test.
If something like this is the case, you could test by simply
removing/commenting any and all calls in kodi to nfs_destroy_context()
This will leak a little amount of memory everytime we genuinely would
want to release the context, but is should avoid any existing contexts
from ever beeing free()d. If so, it should manifest as no longer any segfaults.
That doesnt fix the real issue, but would provide a good datapoint
which would either confirm or reject the theory.

What you also could try could be to run kodi.bin under gdb. Then if
there is another segv you can inspect all the data structures.

If you can build and statically link a version of kodi.bin that
contains all debug symbols, I can try running it under gdb and see if
I can reproduce too.

I'll see if I can get lucky with a debug-enabled build to reproduce and get you a better stacktrace (no promises though - this is a pretty random/hard to reproduce issue).


Reply to this email directly or view it on GitHub.

@sahlberg

This comment has been minimized.

Copy link
Owner

sahlberg commented Feb 5, 2016

In kodi, there are 4 places where nfs_destroy_context() is called.
None of which logs at INFO or higher.

Can you add log entries for these 4 places to see if it is triggered
before any crashes ?

On Thu, Feb 4, 2016 at 11:04 PM, ronnie sahlberg
ronniesahlberg@gmail.com wrote:

On Thu, Feb 4, 2016 at 8:50 PM, MilhouseVH notifications@github.com wrote:

Hi. I've just experienced a SIGSEGV (most likely an assert, due to corruption), this time with x86 Kodi 17 master and latest libnfs master (build details).

Crash log: http://sprunge.us/PNZe (unfortunately not debug-enabled but maybe enough to give a clue?)

The crash happened after I'd paused a video for quite a while (13h15m!) and when I unpaused the video to resume playback Kodi immediately crashed.

Kodi crashing "after pausing a video for a while then unpausing" seems to be crop up fairly often on the forums, but is hard to reproduce on demand.

Could Kodi be corrupting something libnfs-related while libnfs keeps the connection alive? Or maybe after a certain amount of time (several hours?) the NFS context is invalidated by the server? My NFS server is a FreeNAS 8.3.x box.

libnfs itself does not try to keep the connection alive. Instead, IF
the server has decided to tear down the connection after x amount of
inactivity,
on next call through libnfs it will detect this and try to reconnect
the connection.
You can see this in action by for example restarting the nfs server
while you are doing i/o.

I think that kodi even tries to prevent this from happening (server
tearing down connection due to inactivity) by issuing a read() for 32
bytes every 3 minutes or so.
This should prevent the teardown from happening and is probably where
kodi is logging the "sending keep alive" message.

What seems a little odd is this:

07:49:30 17235.697266 T:139814325196928 NOTICE: NFS: sending keep alive after 180 s.
21:05:46 65011.500000 T:139814325196928 NOTICE: Previous line repeats 97 times.

That's 13h16m, but 97 x 180 is only 17460 seconds, or 4h51m - shouldn't there have been more like 265 keep alive attempts in 13h16m?

I think it should. There should definitely be 265 of them.
This is clearly an indication that something has gone quite wrong.

This is what kodi does for the keepalives:

void CNfsConnection::keepAlive(std::string _exportPath, struct nfsfh
__pFileHandle)
{
uint64_t offset = 0;
char buffer[32];
// this also refreshs the last accessed time for the context
// true forces a cachehit regardless the context is timedout
// on this call we are sure its not timedout even if the last accessed
// time suggests it.
struct nfs_context *pContext = getContextFromMap(_exportPath, true);
if (!pContext)// this should normally never happen - paranoia
pContext = m_pNfsContext;
CLog::Log(LOGNOTICE, "NFS: sending keep alive after %i
s.",KEEP_ALIVE_TIMEOUT/2);
CSingleLock lock(_this);
m_pLibNfs->nfs_lseek(pContext, _pFileHandle, 0, SEEK_CUR, &offset);
m_pLibNfs->nfs_read(pContext, _pFileHandle, 32, buffer);
m_pLibNfs->nfs_lseek(pContext, _pFileHandle, offset, SEEK_SET, &offset);
}

Note that while this function does call out to nfs_read() too, it does
so after it has printed the log message.
We did however not crash in the call to nfs_read() here ccording to
the backtrace.

This looks like kodi started writing these log entries as it should at 07:49
then every three minutes for the next 4h51 it did another keepalive
and logged another message.
8-9 hours later when kodi crashed, the logger would then collapse the
97 messages from ~8-9 hours ago that was yet to be written to disk
into a single line and emitted
21:05:46 65011.500000 T:139814325196928 NOTICE: Previous line repeats 97 times.

I.e. the line
21:05:46 65011.500000 T:139814325196928 NOTICE: Previous line repeats 97 times.
comes from the logger flushing all the remaining 97 entries, that are
all 8-9 hours old, to disk as part of the crash handling.

The crash then happened as since you unpaused the video. Again hit a
corrupted nfs context which then caused a segfault.
Thread 1 (Thread 0x7f288bfff700 (LWP 4230)):

#0 0x00007f290ed52ce2 in ?? () from /usr/lib/libnfs.so.8
#1 0x00007f290ed52fd9 in nfs_read_async () from /usr/lib/libnfs.so.8
#2 0x00007f290ed56ab7 in nfs_read () from /usr/lib/libnfs.so.8
#3 0x00000000009f5d08 in XFILE::CNFSFile::Read(void*, unsigned long) ()

We dont have any asserts to check for the magic value in the
nfs_read() codepath, so instead it here
blew up in nfspread_async_internal() at the point where we first
dereference nfs :
if (nfs->rpc->readahead

I.e. just like last time, the nfs structure has been corrupted. But we
segv instead of triggering an assert.

My hypothesis is that everything was running fine for 4h51m.
Then at that point some part of kodi decided that we no longer needed
the context.
At that point it ceased to run the keepalive loop and this is why the
log messages stopped coming.
It also at that point valled nfs_destroy_context() which will free()
the memory used for the context.
This memory was then re-used sometime later and overwritten.

Other parts of kodi still held a pointer to the now free()d context
and 8-9 hours later, when you tried to unpause it
it blew up immediately since the context was bad and the nfs->rpc->
dereference segfaulted.

I think the missing keepalive log messages could be a smoking gun.

Now, what to test.
If something like this is the case, you could test by simply
removing/commenting any and all calls in kodi to nfs_destroy_context()
This will leak a little amount of memory everytime we genuinely would
want to release the context, but is should avoid any existing contexts
from ever beeing free()d. If so, it should manifest as no longer any segfaults.
That doesnt fix the real issue, but would provide a good datapoint
which would either confirm or reject the theory.

What you also could try could be to run kodi.bin under gdb. Then if
there is another segv you can inspect all the data structures.

If you can build and statically link a version of kodi.bin that
contains all debug symbols, I can try running it under gdb and see if
I can reproduce too.

I'll see if I can get lucky with a debug-enabled build to reproduce and get you a better stacktrace (no promises though - this is a pretty random/hard to reproduce issue).


Reply to this email directly or view it on GitHub.

@MilhouseVH

This comment has been minimized.

Copy link
Contributor Author

MilhouseVH commented Feb 5, 2016

Many thanks for the insight. I'll add the logging for nfs_destroy_context(), that seems to be the most likely way to confirm what might be happening - removing/disabling the nfs_destroy_context() calls might stop the issue but would then make reproducing not just hard but impossible!

Do you have a Raspberry Pi 1 or 2, or x86 box on which you can run OpenELEC - I could upload a debug-enabled OpenELEC build with gdb and all symbols.

@MilhouseVH

This comment has been minimized.

Copy link
Contributor Author

MilhouseVH commented Feb 7, 2016

Cracked it, I think.

I say "I think" as I've seen both a SIGSEGV and SIGABRT while reproducing this issue - I suspect it's the same cause but perhaps slightly different paths that ultimately result in the crash.

Debug-enabled crash logs:
SIGABRT: http://sprunge.us/RTLh
SIGSEGV: http://sprunge.us/FSfN

Note that this build includes the following patch with additional nfs_destroy_context() logging: http://sprunge.us/PHEM

Basically, the issue involves pausing an nfs:// video, then while the video is paused performing any JSON RPC query that fiddles with the NFS context (ie. refreshing and eventually destroying then creating a new context), then finally unpausing the paused video and... boom.

Steps to reproduce (log details from SIGABRT log):

  1. Play a video from nfs:// source, eg. nfs://192.168.0.3/mnt/share/media/Video/TVShows/Arrow/Season 4/Arrow S04E12.mkv
  2. Pause it, and leave it paused
  3. Wait for the NFS: sending keep alive after 180 s message to appear in the log
17:25:37 248.389450 T:140285740464256  NOTICE: NFS: sending keep alive after 180 s.
  1. Wait 2-3 minutes - I think during this period the NFS connection is marked as idle... maybe?
  2. Initiate a JSON-RPC query that accesses the same nfs:// source, eg. a video library scan:
curl -s --data-binary '{"jsonrpc": "2.0", "method": "VideoLibrary.Scan", "params": {"showdialogs": false}, "id": "libRescan"}' -H 'content-type: application/json;' http://localhost:8080/jsonrpc

produces:

17:29:08 459.145508 T:140283614320384  NOTICE: Previous line repeats 1 times.
17:29:08 459.145599 T:140283614320384   DEBUG: webserver: request received for /jsonrpc
17:29:08 459.182068 T:140284723705600   DEBUG: Thread JobWorker start, auto delete: true
17:29:08 459.365417 T:140284723705600  NOTICE: VideoInfoScanner: Starting scan ..
17:29:08 459.365509 T:140284723705600   DEBUG: CAnnouncementManager - Announcement: OnScanStarted from xbmc
17:29:08 459.365601 T:140284723705600   DEBUG: GOT ANNOUNCEMENT, type: 16, from xbmc, message OnScanStarted
17:29:08 459.367401 T:140284723705600   DEBUG: NFS: Refreshing context for 192.168.0.3/mnt/share, old: 414447, new: 436237
17:29:08 459.367462 T:140284723705600   DEBUG: NFS: Using cached context.
17:29:10 460.841339 T:140284723705600   DEBUG: VideoInfoScanner: Skipping dir 'nfs://192.168.0.3/mnt/share/media/Video-Private/Comedy/' due to no change (fasthash)
...
17:29:58 509.610870 T:140284723705600   DEBUG: Thread JobWorker 140284723705600 terminating (autodelete)
17:31:03 574.402771 T:140285740464256  NOTICE: Samba is idle. Closing the remaining connections
  1. After step 5, the NFS context is refreshed however the NFS "keep alive" messages for the paused video will cease
  2. Wait a few minutes (10 minutes to be sure) for the NFS context to time out, then repeat the JSON query from step 5 - this should now create a new NFS context:
17:40:11 1122.138306 T:140284723705600   DEBUG: webserver: request received for /jsonrpc
17:40:11 1122.164062 T:140283614320384   DEBUG: Thread JobWorker start, auto delete: true
17:40:11 1122.248535 T:140283614320384  NOTICE: VideoInfoScanner: Starting scan ..
17:40:11 1122.248657 T:140283614320384   DEBUG: CAnnouncementManager - Announcement: OnScanStarted from xbmc
17:40:11 1122.248779 T:140283614320384   DEBUG: GOT ANNOUNCEMENT, type: 16, from xbmc, message OnScanStarted
17:40:11 1122.250854 T:140283614320384   DEBUG: NFS: Old context timed out - destroying it
17:40:11 1122.250854 T:140283614320384    INFO: NFS: nfs_destroy_context() #3
17:40:11 1122.255005 T:140283614320384   DEBUG: NFS: Context for 192.168.0.3/mnt/share not open - get a new context.
17:40:11 1122.325684 T:140283614320384   DEBUG: NFS: Connected to server 192.168.0.3 and export /mnt/share
17:40:11 1122.325684 T:140283614320384   DEBUG: NFS: chunks: r/w 32768/32768
17:40:11 1122.431641 T:140283614320384   DEBUG: VideoInfoScanner: Skipping dir 'nfs://192.168.0.3/mnt/share/media/Video-Private/Comedy/' due to no change (fasthash)
  1. Unpause the video from step 2 - kodi.bin will immediately crash

If you want me to open a Kodi trac ticket I can do so, as it's really a Kodi issue rather than libnfs, but thought I'd run this past you first in case you wanted to handle it differently.

@sahlberg

This comment has been minimized.

Copy link
Owner

sahlberg commented Feb 7, 2016

That is good stuff!

Please open a kodi track ticket so this can be fixed.
For now, since you confirmed that it is a use-after-free issue,
just comment out all the calls to nfs_destroy_context in your code.

A very slow memory leak (of < 100 bytes at a time) is pretty harmless
compared to a segfault.

Since kodi will open and destroy multiple contexts at a time.
Please update the log messages to print the pointer to the context with %p
and log this at INFO level
every time a context is created or destroyed.
This will make it much easier to track next time a bug like this surfaces.

On Sun, Feb 7, 2016 at 9:47 AM, MilhouseVH notifications@github.com wrote:

Cracked it, I think.

I say "I think" as I've seen both a SIGSEGV and SIGABRT while reproducing
this issue - I suspect it's the same cause but perhaps slightly different
paths that ultimately result in the crash.

Debug-enabled crash logs:
SIGABRT: http://sprunge.us/RTLh
SIGSEGV: http://sprunge.us/FSfN

Note that this build includes the following patch with additional
nfs_destroy_context() logging: http://sprunge.us/PHEM

Basically, the issue involves pausing a video, then while the video is
paused performing any JSON RPC query that fiddles with the NFS context
(refreshing, destroying/creating), then eventually unpausing the paused
video and... boom.

Steps to reproduce (log details from SIGABRT log):

  1. Play a video from nfs:// source, eg. nfs://
    192.168.0.3/mnt/share/media/Video/TVShows/Arrow/Season 4/Arrow S04E12.mkv
  2. Pause it, and leave it paused
  3. Wait for the NFS: sending keep alive after 180 s message to appear in
    the log

17:25:37 248.389450 T:140285740464256 NOTICE: NFS: sending keep alive after 180 s.

  1. Wait 2-3 minutes - I think during this period the NFS connection is
    marked as idle... maybe?
  2. Initiate a JSON-RPC query that accesses the same nfs:// source, eg. a
    video library scan:

curl -s --data-binary '{"jsonrpc": "2.0", "method": "VideoLibrary.Scan", "params": {"showdialogs": false}, "id": "libRescan"}' -H 'content-type: application/json;' http://localhost:8080/jsonrpc

produces:

17:29:08 459.145508 T:140283614320384 NOTICE: Previous line repeats 1 times.
17:29:08 459.145599 T:140283614320384 DEBUG: webserver: request received for /jsonrpc
17:29:08 459.182068 T:140284723705600 DEBUG: Thread JobWorker start, auto delete: true
17:29:08 459.365417 T:140284723705600 NOTICE: VideoInfoScanner: Starting scan ..
17:29:08 459.365509 T:140284723705600 DEBUG: CAnnouncementManager - Announcement: OnScanStarted from xbmc
17:29:08 459.365601 T:140284723705600 DEBUG: GOT ANNOUNCEMENT, type: 16, from xbmc, message OnScanStarted
17:29:08 459.367401 T:140284723705600 DEBUG: NFS: Refreshing context for 192.168.0.3/mnt/share, old: 414447, new: 436237
17:29:08 459.367462 T:140284723705600 DEBUG: NFS: Using cached context.
17:29:10 460.841339 T:140284723705600 DEBUG: VideoInfoScanner: Skipping dir 'nfs://192.168.0.3/mnt/share/media/Video-Private/Comedy/' due to no change (fasthash)
...
17:29:58 509.610870 T:140284723705600 DEBUG: Thread JobWorker 140284723705600 terminating (autodelete)
17:31:03 574.402771 T:140285740464256 NOTICE: Samba is idle. Closing the remaining connections

  1. After step 5, the NFS context is refreshed however the NFS "keep alive"
    messages for the paused video will cease
  2. Wait a few minutes (10 minutes to be sure) for the NFS context to time
    out, then repeat the JSON query from step 5 - this should now create a new
    NFS context:

17:40:11 1122.138306 T:140284723705600 DEBUG: webserver: request received for /jsonrpc
17:40:11 1122.164062 T:140283614320384 DEBUG: Thread JobWorker start, auto delete: true
17:40:11 1122.248535 T:140283614320384 NOTICE: VideoInfoScanner: Starting scan ..
17:40:11 1122.248657 T:140283614320384 DEBUG: CAnnouncementManager - Announcement: OnScanStarted from xbmc
17:40:11 1122.248779 T:140283614320384 DEBUG: GOT ANNOUNCEMENT, type: 16, from xbmc, message OnScanStarted
17:40:11 1122.250854 T:140283614320384 DEBUG: NFS: Old context timed out - destroying it
17:40:11 1122.250854 T:140283614320384 INFO: NFS: nfs_destroy_context() #3
17:40:11 1122.255005 T:140283614320384 DEBUG: NFS: Context for 192.168.0.3/mnt/share not open - get a new context.
17:40:11 1122.325684 T:140283614320384 DEBUG: NFS: Connected to server 192.168.0.3 and export /mnt/share
17:40:11 1122.325684 T:140283614320384 DEBUG: NFS: chunks: r/w 32768/32768
17:40:11 1122.431641 T:140283614320384 DEBUG: VideoInfoScanner: Skipping dir 'nfs://192.168.0.3/mnt/share/media/Video-Private/Comedy/' due to no change (fasthash)

  1. Unpause the video from step 2 - kodi.bin will immediately crash

If you want me to open a Kodi trac ticket I can do so, as it's really a
Kodi issue rather than libnfs, but thought I'd run this past you first in
case you wanted to handle it differently.


Reply to this email directly or view it on GitHub
#124 (comment).

@MilhouseVH

This comment has been minimized.

Copy link
Contributor Author

MilhouseVH commented Feb 8, 2016

Opened trac #16576.

In terms of the improved logging, is this what you had in mind?

diff --git a/xbmc/filesystem/NFSFile.cpp b/xbmc/filesystem/NFSFile.cpp
index 3a13369..f2d3ea0 100644
--- a/xbmc/filesystem/NFSFile.cpp
+++ b/xbmc/filesystem/NFSFile.cpp
@@ -138,6 +138,7 @@ void CNfsConnection::destroyOpenContexts()
   CSingleLock lock(openContextLock);
   for(tOpenContextMap::iterator it = m_openContextMap.begin();it!=m_openContextMap.end();++it)
   {
+    CLog::Log(LOGINFO, "NFS: Destroying context %p (%s)", it->second.pContext, __FUNCTION__);
     m_pLibNfs->nfs_destroy_context(it->second.pContext);
   }
   m_openContextMap.clear();
@@ -149,6 +150,7 @@ void CNfsConnection::destroyContext(const std::string &exportName)
   tOpenContextMap::iterator it = m_openContextMap.find(exportName.c_str());
   if (it != m_openContextMap.end()) 
   {
+      CLog::Log(LOGINFO, "NFS: Destroying context %p (%s)", it->second.pContext, __FUNCTION__);
       m_pLibNfs->nfs_destroy_context(it->second.pContext);
       m_openContextMap.erase(it);
   }
@@ -179,6 +181,7 @@ struct nfs_context *CNfsConnection::getContextFromMap(const std::string &exportn
       //context is timed out
       //destroy it and return NULL
       CLog::Log(LOGDEBUG, "NFS: Old context timed out - destroying it");
+      CLog::Log(LOGINFO, "NFS: Destroying context %p (%s)", it->second.pContext, __FUNCTION__);
       m_pLibNfs->nfs_destroy_context(it->second.pContext);
       m_openContextMap.erase(it);
     }
@@ -463,6 +466,7 @@ int CNfsConnection::stat(const CURL &url, NFSSTAT *statbuff)
         CLog::Log(LOGERROR,"NFS: Failed to mount nfs share: %s (%s)\n", exportPath.c_str(), m_pLibNfs->nfs_get_error(m_pNfsContext));
       }

+      CLog::Log(LOGINFO, "NFS: Destroying context %p (%s)", pTmpContext, __FUNCTION__);
       m_pLibNfs->nfs_destroy_context(pTmpContext);
       CLog::Log(LOGDEBUG,"NFS: Connected to server %s and export %s in tmpContext\n", url.GetHostName().c_str(), exportPath.c_str());
     }
@sahlberg

This comment has been minimized.

Copy link
Owner

sahlberg commented Feb 8, 2016

That looks like much better logging!

On Sun, Feb 7, 2016 at 4:51 PM, MilhouseVH notifications@github.com wrote:

Opened trac #16576 http://trac.kodi.tv/ticket/16576.

In terms of the improved logging, is this what you had in mind?

diff --git a/xbmc/filesystem/NFSFile.cpp b/xbmc/filesystem/NFSFile.cpp
index 3a13369..f2d3ea0 100644
--- a/xbmc/filesystem/NFSFile.cpp
+++ b/xbmc/filesystem/NFSFile.cpp
@@ -138,6 +138,7 @@ void CNfsConnection::destroyOpenContexts()
CSingleLock lock(openContextLock);
for(tOpenContextMap::iterator it = m_openContextMap.begin();it!=m_openContextMap.end();++it)
{

  • CLog::Log(LOGINFO, "NFS: Destroying context %p (%s)", it->second.pContext, FUNCTION);
    m_pLibNfs->nfs_destroy_context(it->second.pContext);
    }
    m_openContextMap.clear();
    @@ -149,6 +150,7 @@ void CNfsConnection::destroyContext(const std::string &exportName)
    tOpenContextMap::iterator it = m_openContextMap.find(exportName.c_str());
    if (it != m_openContextMap.end())
    {

  •  CLog::Log(LOGINFO, "NFS: Destroying context %p (%s)", it->second.pContext, **FUNCTION**);
    

    m_pLibNfs->nfs_destroy_context(it->second.pContext);
    m_openContextMap.erase(it);
    }
    @@ -179,6 +181,7 @@ struct nfs_context *CNfsConnection::getContextFromMap(const std::string &exportn
    //context is timed out
    //destroy it and return NULL
    CLog::Log(LOGDEBUG, "NFS: Old context timed out - destroying it");

  •  CLog::Log(LOGINFO, "NFS: Destroying context %p (%s)", it->second.pContext, __FUNCTION__);
    

    m_pLibNfs->nfs_destroy_context(it->second.pContext);
    m_openContextMap.erase(it);
    }
    @@ -463,6 +466,7 @@ int CNfsConnection::stat(const CURL &url, NFSSTAT *statbuff)
    CLog::Log(LOGERROR,"NFS: Failed to mount nfs share: %s (%s)\n", exportPath.c_str(), m_pLibNfs->nfs_get_error(m_pNfsContext));
    }

  •  CLog::Log(LOGINFO, "NFS: Destroying context %p (%s)", pTmpContext, __FUNCTION__);
    

    m_pLibNfs->nfs_destroy_context(pTmpContext);
    CLog::Log(LOGDEBUG,"NFS: Connected to server %s and export %s in tmpContext\n", url.GetHostName().c_str(), exportPath.c_str());
    }


Reply to this email directly or view it on GitHub
#124 (comment).

@sahlberg

This comment has been minimized.

Copy link
Owner

sahlberg commented Mar 18, 2016

Closing this since we have pretty much toot caused the issues to a use-after-free

@sahlberg sahlberg closed this Mar 18, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.