It seems StashCache clients are keeping long-lived persistent HTTP connections. For each GET request on a file, xrootd increments the use counter for the file, but the counter is not decremented until the HTTP connection is closed. If there's a popular file (we're seeing it happen with BLAST databases in particular) with enough clients requesting various ranges inside the file, the counter eventually reaches 65536 and xrootd crashes.
Backtrace of a crash on hcc-stash.unl.edu (xrootd-4.7.1-1.osg33.el7.x86_64):
Core was generated by `/usr/bin/xrootd -l /var/log/xrootd/xrootd.log -c /etc/xrootd/xrootd-stashcache-'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fd5b464b327 in XrdFileCache::IOEntireFile::FSize (this=0x7fd59000a0b0) at /usr/src/debug/xrootd-4.7.1/src/XrdFileCache/XrdFileCacheIOEntireFile.cc:76
76 return m_file->GetFileSize();
(gdb) bt
#0 0x00007fd5b464b327 in XrdFileCache::IOEntireFile::FSize (this=0x7fd59000a0b0) at /usr/src/debug/xrootd-4.7.1/src/XrdFileCache/XrdFileCacheIOEntireFile.cc:76
#1 0x00007fd5b464b584 in XrdFileCache::IOEntireFile::Read (this=0x7fd59000a0b0,
buff=0x7fd4dc718000 "\217\377t\351N\270QG\\Q\375\004oU\234Wy\226\256\345\267\335N\031\320[<\022\337xQO\035\355\235\037\337Q\251\026\256Tƥ\317{w\364Lh\021\354\265\177\260\257\317\362\001\274\070\337\177\064]W\353\365\275w\273r\317\363<_\340\371\177\376\337\035\204\004L\361\263\367\312\317\361\363O\240~\363\\\312\311\345\337", <incomplete sequence \337>, off=246530048, size=1048576) at /usr/src/debug/xrootd-4.7.1/src/XrdFileCache/XrdFileCacheIOEntireFile.cc:174
#2 0x00007fd5b541dd3b in XrdPosixXrootd::Pread (fildes=<optimized out>, buf=0x7fd4dc718000, nbyte=1048576, offset=246530048) at /usr/src/debug/xrootd-4.7.1/src/XrdPosix/XrdPosixXrootd.cc:677
#3 0x00007fd5b5848079 in XrdPssFile::Read (this=<optimized out>, buff=<optimized out>, offset=<optimized out>, blen=<optimized out>) at /usr/src/debug/xrootd-4.7.1/src/XrdPss/XrdPss.cc:758
#4 0x00007fd5bef58c83 in XrdOfsFile::read (this=0x7fd3790ae960, offset=246530048,
buff=0x7fd4dc718000 "\217\377t\351N\270QG\\Q\375\004oU\234Wy\226\256\345\267\335N\031\320[<\022\337xQO\035\355\235\037\337Q\251\026\256Tƥ\317{w\364Lh\021\354\265\177\260\257\317\362\001\274\070\337\177\064]W\353\365\275w\273r\317\363<_\340\371\177\376\337\035\204\004L\361\263\367\312\317\361\363O\240~\363\\\312\311\345\337", <incomplete sequence \337>, blen=1048576) at /usr/src/debug/xrootd-4.7.1/src/XrdOfs/XrdOfs.cc:891
#5 0x00007fd5bef4e2be in XrdXrootdProtocol::do_ReadAll (this=this@entry=0x7fd58c5adac8, asyncOK=asyncOK@entry=1) at /usr/src/debug/xrootd-4.7.1/src/XrdXrootd/XrdXrootdXeq.cc:2001
#6 0x00007fd5bef4e6e6 in XrdXrootdProtocol::do_Read (this=this@entry=0x7fd58c5adac8) at /usr/src/debug/xrootd-4.7.1/src/XrdXrootd/XrdXrootdXeq.cc:1938
#7 0x00007fd5bef44f80 in XrdXrootdProtocol::Process2 (this=this@entry=0x7fd58c5adac8) at /usr/src/debug/xrootd-4.7.1/src/XrdXrootd/XrdXrootdProtocol.cc:463
#8 0x00007fd5bef48820 in XrdXrootdTransit::Process (this=0x7fd58c5adac0, lp=0x7fd5903cf7b8) at /usr/src/debug/xrootd-4.7.1/src/XrdXrootd/XrdXrootdTransit.cc:369
#9 0x00007fd5beccc719 in XrdLink::DoIt (this=0x7fd5903cf7b8) at /usr/src/debug/xrootd-4.7.1/src/Xrd/XrdLink.cc:426
#10 0x00007fd5beccfcff in XrdScheduler::Run (this=0x610e98 <XrdMain::Config+440>) at /usr/src/debug/xrootd-4.7.1/src/Xrd/XrdScheduler.cc:357
#11 0x00007fd5beccfe49 in XrdStartWorking (carg=<optimized out>) at /usr/src/debug/xrootd-4.7.1/src/Xrd/XrdScheduler.cc:87
#12 0x00007fd5bec8c4d7 in XrdSysThread_Xeq (myargs=0x7fd37c0d3810) at /usr/src/debug/xrootd-4.7.1/src/XrdSys/XrdSysPthread.cc:86
#13 0x00007fd5be848e25 in start_thread (arg=0x7fd3196ba700) at pthread_create.c:308
#14 0x00007fd5bdb4e34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) print m_file
$1 = (XrdFileCache::File *) 0x0
It can be reproduced with a small Python script making 65536 GET requests over a single persistent connection (several can be run in parallel to speed things up):
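The script itself wasn't captured in this copy of the issue; a minimal sketch of such a reproducer, with the server host, port, and file path as placeholders, might look like:

```python
# Hypothetical reproducer sketch -- the script attached to the original
# issue was not captured here. HOST, PORT, and PATH are placeholders.
import http.client

HOST = "stash-server.example.org"  # placeholder cache endpoint
PORT = 8000                        # placeholder HTTP port
PATH = "/public/blastdb"           # placeholder popular file
REQUESTS = 65536  # enough GETs to overflow a 16-bit use counter

def hammer(host=HOST, path=PATH, n=REQUESTS, port=PORT):
    """Issue n range-GETs over one persistent connection; return the
    number of successful (200/206) responses."""
    conn = http.client.HTTPConnection(host, port)
    ok = 0
    for _ in range(n):
        # Each GET bumps the server-side use counter for the file; the
        # counter is only decremented when this connection finally closes.
        conn.request("GET", path, headers={"Range": "bytes=0-0"})
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection can be reused
        if resp.status in (200, 206):
            ok += 1
    conn.close()
    return ok

# hammer()  # point at a real endpoint; run several copies in parallel
```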
Indeed, even though there is code to handle a counter overflow, that code is wrong and can lead to an eventual SEGV. I think the solution is to widen the counter to an unsigned int.
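As a toy model of the arithmetic (pure Python, not xrootd's actual code): a 16-bit unsigned use count wraps back to zero on the 65536th increment, so a heavily shared handle suddenly looks unused, while a 32-bit counter has ample headroom:

```python
def bump(counter, bits):
    """Increment a counter stored in an unsigned field of the given width,
    wrapping modulo 2**bits the way fixed-width C arithmetic does."""
    return (counter + 1) & ((1 << bits) - 1)

use16 = use32 = 0
for _ in range(65536):
    use16 = bump(use16, 16)  # unsigned short, as in XrdOfsHandle.hh
    use32 = bump(use32, 32)  # the proposed wider counter

print(use16, use32)  # -> 0 65536: the short wrapped, the int did not
```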
We've been seeing segfaults in XrdFileCache on a StashCache server. They appear to be related to persistent HTTP connections and heavily used files. There's an unsigned short `use` counter for the active links: https://github.com/xrootd/xrootd/blob/v4.7.1/src/XrdOfs/XrdOfsHandle.hh#L53

When the counter reaches its maximum value, it appears to trigger a segfault, as shown in the backtrace above.