segfault in XrdFileCache #623

Closed
jthiltges opened this issue Nov 14, 2017 · 2 comments

@jthiltges
Contributor

We've been seeing segfaults in XrdFileCache on a StashCache server. It appears to be related to persistent HTTP connections and heavily used files.

There's an unsigned short use counter for the active links: https://github.com/xrootd/xrootd/blob/v4.7.1/src/XrdOfs/XrdOfsHandle.hh#L53

When the counter reaches maximum value, it appears to trigger a segfault:

171113 04:11:36 3933800 unknown.343:254@cms-h006.rcac.purdue.edu ofs_open: attach use=65533 fn=/user/eharstad/public/blast_database/nt.fa.nsq
171113 04:11:39 3812035 unknown.463:314@cms-h007.rcac.purdue.edu ofs_open: attach use=65534 fn=/user/eharstad/public/blast_database/nt.fa.nsq
171113 04:11:40 3919073 unknown.343:254@cms-h006.rcac.purdue.edu ofs_open: attach use=65535 fn=/user/eharstad/public/blast_database/nt.fa.nsq
*segfault*
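
As a rough illustration of why 65535 is the last value the log can show, the sketch below models the counter as a C unsigned short using Python's ctypes. It only demonstrates the arithmetic limit of a 16-bit unsigned value; it is an assumption-laden model and does not reproduce what XrdOfsHandle actually does when the limit is hit. The name next_use is hypothetical.

```python
import ctypes

# Assumption: the counter is modeled as a C unsigned short,
# which is 16 bits wide on x86_64 Linux.
USHORT_MAX = 2 ** (8 * ctypes.sizeof(ctypes.c_ushort)) - 1


def next_use(use):
    """Return use + 1 truncated to the width of unsigned short."""
    return ctypes.c_ushort(use + 1).value


# Walk the counter across its maximum, as in the log excerpt.
use = USHORT_MAX - 3
for _ in range(4):
    use = next_use(use)
    print("use=%d" % use)
```

With a 16-bit unsigned short, the last step has nowhere to go but back to 0, so a plain increment can no longer represent the real number of attachments.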

It seems StashCache clients are keeping long-lived persistent HTTP connections. For each GET request on a file, xrootd increments the use counter for the file, but the counter is not decremented until the HTTP connection is closed. If a popular file (we're seeing this with BLAST databases in particular) has enough clients requesting various ranges inside it, the counter eventually reaches the unsigned short maximum of 65535 and xrootd crashes.
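
The accounting described above can be sketched with a toy model. ToyHandle, ToyConnection, and run_until_limit are hypothetical names invented for this illustration, not xrootd classes; the model assumes one attach per GET and a release only when the connection closes, as observed.

```python
LIMIT = 65535  # highest "use" value seen in the log above


class ToyHandle:
    """Hypothetical per-file handle holding a shared use counter."""

    def __init__(self):
        self.use = 0


class ToyConnection:
    """Hypothetical persistent HTTP connection."""

    def __init__(self, handle):
        self.handle = handle
        self.attached = 0

    def get(self):
        # Each GET attaches to the file again; nothing is released here.
        self.handle.use += 1
        self.attached += 1

    def close(self):
        # Only closing the connection gives the attachments back.
        self.handle.use -= self.attached
        self.attached = 0


def run_until_limit(num_connections):
    """Issue GETs round-robin until the shared counter reaches LIMIT."""
    handle = ToyHandle()
    conns = [ToyConnection(handle) for _ in range(num_connections)]
    total = 0
    while handle.use < LIMIT:
        conns[total % num_connections].get()
        total += 1
    return handle, conns, total


handle, conns, total = run_until_limit(4)
print("GETs issued:", total)
print("GETs per connection:", [c.attached for c in conns])
for c in conns:
    c.close()
print("use after closing all connections:", handle.use)
```

In this model the number of GETs needed is the same no matter how many connections share the file, which is why several clients on one popular file get there faster than any single client would.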

Backtrace of a crash on hcc-stash.unl.edu (xrootd-4.7.1-1.osg33.el7.x86_64):

Core was generated by `/usr/bin/xrootd -l /var/log/xrootd/xrootd.log -c /etc/xrootd/xrootd-stashcache-'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fd5b464b327 in XrdFileCache::IOEntireFile::FSize (this=0x7fd59000a0b0) at /usr/src/debug/xrootd-4.7.1/src/XrdFileCache/XrdFileCacheIOEntireFile.cc:76
76       return m_file->GetFileSize();
(gdb) bt
#0  0x00007fd5b464b327 in XrdFileCache::IOEntireFile::FSize (this=0x7fd59000a0b0) at /usr/src/debug/xrootd-4.7.1/src/XrdFileCache/XrdFileCacheIOEntireFile.cc:76
#1  0x00007fd5b464b584 in XrdFileCache::IOEntireFile::Read (this=0x7fd59000a0b0,
    buff=0x7fd4dc718000 "\217\377t\351N\270QG\\Q\375\004oU\234Wy\226\256\345\267\335N\031\320[<\022\337xQO\035\355\235\037\337Q\251\026\256Tƥ\317{w\364Lh\021\354\265\177\260\257\317\362\001\274\070\337\177\064]W\353\365\275w\273r\317\363<_\340\371\177\376\337\035\204\004L\361\263\367\312\317\361\363O\240~\363\\\312\311\345\337", <incomplete sequence \337>, off=246530048, size=1048576) at /usr/src/debug/xrootd-4.7.1/src/XrdFileCache/XrdFileCacheIOEntireFile.cc:174
#2  0x00007fd5b541dd3b in XrdPosixXrootd::Pread (fildes=<optimized out>, buf=0x7fd4dc718000, nbyte=1048576, offset=246530048) at /usr/src/debug/xrootd-4.7.1/src/XrdPosix/XrdPosixXrootd.cc:677
#3  0x00007fd5b5848079 in XrdPssFile::Read (this=<optimized out>, buff=<optimized out>, offset=<optimized out>, blen=<optimized out>) at /usr/src/debug/xrootd-4.7.1/src/XrdPss/XrdPss.cc:758
#4  0x00007fd5bef58c83 in XrdOfsFile::read (this=0x7fd3790ae960, offset=246530048,
    buff=0x7fd4dc718000 "\217\377t\351N\270QG\\Q\375\004oU\234Wy\226\256\345\267\335N\031\320[<\022\337xQO\035\355\235\037\337Q\251\026\256Tƥ\317{w\364Lh\021\354\265\177\260\257\317\362\001\274\070\337\177\064]W\353\365\275w\273r\317\363<_\340\371\177\376\337\035\204\004L\361\263\367\312\317\361\363O\240~\363\\\312\311\345\337", <incomplete sequence \337>, blen=1048576) at /usr/src/debug/xrootd-4.7.1/src/XrdOfs/XrdOfs.cc:891
#5  0x00007fd5bef4e2be in XrdXrootdProtocol::do_ReadAll (this=this@entry=0x7fd58c5adac8, asyncOK=asyncOK@entry=1) at /usr/src/debug/xrootd-4.7.1/src/XrdXrootd/XrdXrootdXeq.cc:2001
#6  0x00007fd5bef4e6e6 in XrdXrootdProtocol::do_Read (this=this@entry=0x7fd58c5adac8) at /usr/src/debug/xrootd-4.7.1/src/XrdXrootd/XrdXrootdXeq.cc:1938
#7  0x00007fd5bef44f80 in XrdXrootdProtocol::Process2 (this=this@entry=0x7fd58c5adac8) at /usr/src/debug/xrootd-4.7.1/src/XrdXrootd/XrdXrootdProtocol.cc:463
#8  0x00007fd5bef48820 in XrdXrootdTransit::Process (this=0x7fd58c5adac0, lp=0x7fd5903cf7b8) at /usr/src/debug/xrootd-4.7.1/src/XrdXrootd/XrdXrootdTransit.cc:369
#9  0x00007fd5beccc719 in XrdLink::DoIt (this=0x7fd5903cf7b8) at /usr/src/debug/xrootd-4.7.1/src/Xrd/XrdLink.cc:426
#10 0x00007fd5beccfcff in XrdScheduler::Run (this=0x610e98 <XrdMain::Config+440>) at /usr/src/debug/xrootd-4.7.1/src/Xrd/XrdScheduler.cc:357
#11 0x00007fd5beccfe49 in XrdStartWorking (carg=<optimized out>) at /usr/src/debug/xrootd-4.7.1/src/Xrd/XrdScheduler.cc:87
#12 0x00007fd5bec8c4d7 in XrdSysThread_Xeq (myargs=0x7fd37c0d3810) at /usr/src/debug/xrootd-4.7.1/src/XrdSys/XrdSysPthread.cc:86
#13 0x00007fd5be848e25 in start_thread (arg=0x7fd3196ba700) at pthread_create.c:308
#14 0x00007fd5bdb4e34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) print m_file
$1 = (XrdFileCache::File *) 0x0

It can be reproduced with a small Python script that issues more than 65535 requests on a persistent connection (several can be run in parallel to speed things along):

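#!/usr/bin/env python

import logging

import requests

# Log at DEBUG level so the connection activity of requests is visible.
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)

# A Session keeps the HTTP connection alive across requests.
s = requests.Session()
headers = {"Range": "bytes=0-100"}

url = 'http://stash.example.edu:8000/testfile'

# 65537 iterations, enough to push the use counter past 65535.
for i in range(0, 65537):
    print("Request #%d" % i)
    r = s.get(url, headers=headers)
    print("Status = %d\n" % r.status_code)
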
@abh3
Member

abh3 commented Nov 15, 2017

Thanks for making this reproducible. I will look into this and hopefully have a fix by Monday.

@abh3
Member

abh3 commented Nov 15, 2017

Indeed, even though there is code to handle a counter overflow, that code is just wrong and can lead to an eventual SEGV. I think the solution is to increase the size of the counter to an unsigned int.
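
To get a feel for how much headroom a wider counter gives, here is a small back-of-the-envelope sketch. It assumes the platform's C type widths as reported by ctypes (16-bit unsigned short and 32-bit unsigned int on x86_64 Linux); the helper names and the rate of 1000 unreleased attaches per second are made up for illustration, not measurements.

```python
import ctypes


def max_unsigned(ctype):
    """Largest value an unsigned C integer type can hold on this platform."""
    return 2 ** (8 * ctypes.sizeof(ctype)) - 1


def days_to_exhaust(max_value, attaches_per_second):
    """Days of continuous attaches with no release before the limit is hit."""
    return max_value / float(attaches_per_second) / 86400.0


old_max = max_unsigned(ctypes.c_ushort)
new_max = max_unsigned(ctypes.c_uint)

print("unsigned short max:", old_max)
print("unsigned int max:  ", new_max)
print("headroom factor:   ", new_max // old_max)

# Hypothetical load: 1000 attaches per second, none ever released.
print("days at 1000/s:    ", round(days_to_exhaust(new_max, 1000), 1))
```

A wider counter does not change the accounting, it only moves the limit far enough out that connections should close long before it is reached.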

abh3 closed this as completed in e1388ed on Nov 19, 2017