XRootDFS was aborted due to XrdFfsWCache_destroy function. #1116
Could you tell us the version you are running and the Linux release?
…On Wed, 15 Jan 2020, Geonmo Ryu wrote:
Hello, XRootD Developers.
I found that our xrootdfs mount point disappears frequently.
I obtained a core file from the aborting program.
Here is the gdb output for it.
Could you check this?
```
(gdb) where
#0 0x0000003741c324f5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x0000003741c33cd5 in abort () at abort.c:92
#2 0x0000003741c70417 in __libc_message (do_abort=2, fmt=0x3741d58c00 "*** glibc detected *** %s: %s: 0x%s ***\n")
at ../sysdeps/unix/sysv/linux/libc_fatal.c:198
#3 0x0000003741c75e5e in malloc_printerr (action=3, str=0x3741d56bbe "corrupted double-linked list", ptr=<value optimized out>,
ar_ptr=<value optimized out>) at malloc.c:6360
#4 0x0000003741c79066 in _int_free (av=0x7f5578000020, p=0x7f5578026d90, have_lock=0) at malloc.c:5030
#5 0x0000003407a093f1 in XrdFfsWcache_destroy (fd=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdFfs/XrdFfsWcache.cc:139
#6 0x0000003407a094ca in XrdFfsWcache_create (fd=0) at /usr/src/debug/xrootd/xrootd/src/XrdFfs/XrdFfsWcache.cc:114
#7 0x0000000000403ed1 in xrootdfs_create (
path=0x7f53d00008c0 "/store/unmerged/SAM/testSRM/SAM-cms-se.sdfarm.kr/lcg-util/testfile-put-nospacetoken-1576995075-0026e73184df.txt",
mode=<value optimized out>, fi=0x7f53debfcde0) at /usr/src/debug/xrootd/xrootd/src/XrdFfs/XrdFfsXrootdfs.cc:444
#8 0x0000003743c08c2f in fuse_fs_create (fs=0x1ff1a10,
path=0x7f53d00008c0 "/store/unmerged/SAM/testSRM/SAM-cms-se.sdfarm.kr/lcg-util/testfile-put-nospacetoken-1576995075-0026e73184df.txt", mode=33188,
fi=0x7f53debfcde0) at fuse.c:1428
#9 0x0000003743c1090a in fuse_lib_create (req=0x7f53d00013e0, parent=289472, name=0x7f53d40009c8 "testfile-put-nospacetoken-1576995075-0026e73184df.txt",
mode=33188, fi=0x7f53debfcde0) at fuse.c:2407
#10 0x0000003743c146f3 in do_create (req=<value optimized out>, nodeid=<value optimized out>, inarg=<value optimized out>) at fuse_lowlevel.c:705
#11 0x0000003743c120ef in fuse_do_work (data=0x7f53d40008c0) at fuse_loop_mt.c:107
#12 0x0000003742007aa1 in start_thread (arg=0x7f53debfd700) at pthread_create.c:301
#13 0x0000003741ce8c4d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
(gdb) fram 5
#5 0x0000003407a093f1 in XrdFfsWcache_destroy (fd=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdFfs/XrdFfsWcache.cc:139
139 free(XrdFfsWcacheFbufs[fd].buf);
```
|
@abh3 The server runs kernel 2.6.32-754.23.1.el6.x86_64 on Scientific Linux release 6.10 (Carbon). |
I suspect that XrdFfsWcache_pwrite can corrupt the contents of XrdFfsWcacheFbufs[fd].buf because of xrootd/src/XrdFfs/XrdFfsWcache.cc, line 193 in cfdc6ac
|
Hi Geonmo,
I suppose that is possible if offset were not zero but len was zero, as the
offset would not be set to zero in the _flush() function. However, I don't
immediately see why that would ever happen. Is that what you mean here?
Andy
…On Wed, 15 Jan 2020, Geonmo Ryu wrote:
I suspected that **XrdFfsWcache_pwrite** can corrupt **XrdFfsWcacheFbufs[fd].buf** contents due to https://github.com/xrootd/xrootd/blob/cfdc6aca0bde74f7b65000987f846512c42a87ab/src/XrdFfs/XrdFfsWcache.cc#L193 if **XrdFfsWcacheFbufs[fd].offset** is not 0.
|
Just to check this, could you display the contents of XrdFfsWcacheFbufs[fd]
in the core file (I assume you have one)?
Andy
|
Unfortunately, I cannot see the values because the compiler optimized them out: (gdb) p XrdFfsWcacheFbufs[fd] reports "value has been optimized out". |
Ah, typically it's not optimized out everywhere. The trick is to
find a frame where you still have access to the structure and get the fd
value. For instance, in frame 6:
#6 0x0000003407a094ca in XrdFfsWcache_create (fd=0) at /usr/src/debug/xrootd/xrootd/src/XrdFfs/XrdFfsWcache.cc:114
we know fd is zero, and in that frame we can look at what
XrdFfsWcacheFbufs[0] looks like. The reason it's optimized out in other
frames is that the compiler eliminates stack frames where it thinks it
doesn't need them and simply keeps the value in a register.
Andy
|
@abh3 I checked the value of that variable, but the result is confusing. The buf value is NULL in the previous line, and it is strange that the value is 0x0. This may be down to my lack of GDB experience.
|
It is very strange to see #6 XrdFfsWcache_create (fd=0). Usually the fd number is fairly large; in my case, the first file has fd = 10240.
Note the fd here: although the array XrdFfsWcacheFbufs[i] is indexed by i = 0 to maxfs ("maxfs" is a parameter you can set when starting xrootdfs), the fd is baseFD + i. See lines 127-130 in XrdFfsXrootdfs.cc.
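The fd-to-index relationship described here can be sketched roughly as follows. This is only an illustration, not the XrdFfs source: `kBaseFD`, `kMaxSlots`, `slotForFd`, and `fdForSlot` are hypothetical names, the base of 10240 is borrowed from the example above, and the table size is an arbitrary assumption.

```cpp
#include <cassert>

// Hypothetical stand-ins; the real values live in the XrdFfs sources.
constexpr int kBaseFD = 10240;   // assumed base, borrowed from the example above
constexpr int kMaxSlots = 1024;  // assumed table size (a "maxfs"-like setting)

// Map a descriptor handed to callers back to its table index,
// or return -1 when the descriptor lies outside the managed range.
constexpr int slotForFd(int fd) {
    return (fd >= kBaseFD && fd < kBaseFD + kMaxSlots) ? fd - kBaseFD : -1;
}

// Map a table index to the descriptor handed to callers (fd = base + i).
int fdForSlot(int i) {
    assert(i >= 0 && i < kMaxSlots);  // precondition: index must be in range
    return kBaseFD + i;
}

// Compile-time checks of the mapping under the assumptions above.
static_assert(slotForFd(10240) == 0, "first descriptor maps to slot 0");
static_assert(slotForFd(0) == -1, "a raw fd of 0 is below the base");
```

Under this assumed mapping, a raw descriptor of 0 would not correspond to any slot, whereas an index of 0 would: that is one way to read why fd=0 in the backtrace looks odd unless the base has already been subtracted at that point.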
--
Wei Yang | yangw@slac.stanford.edu | 650-926-3338(O)
|
@wyang007 Well, the fd value seems to be 0 because the XrdFfsPosix_baseFD value has already been subtracted.
|
@wyang007 Ah, I misunderstood your answer. I am not sure, due to my lack of gdb experience. |
@wyang007 As you pointed out, the fd value at frame 6 should be 20000, so it is hard to understand why it is shown as 0. |
Hi Geonmo,
I was able to attach gdb to my xrootdfs instance and check the fd passed to XrdFfsWcache_create(fd). To do so, I had to uninstall all xrootd rpms from EPEL and install them again from the xrootd-stable repo (https://xrootd.slac.stanford.edu/binaries/xrootd-stable-slc7.repo), along with the xrootd-debuginfo rpm. This is because I couldn't find the xrootd-debuginfo rpm in EPEL (maybe I didn't look hard enough). You probably need the xrootd-debuginfo rpm in order to debug. I noticed that one can't mix xrootd rpms from one repo with xrootd-debuginfo from another repo.
Regards,
Wei Yang
|
@wyang007 Hello, I already installed "xrootd-debuginfo-4.11.0-1.osg34.el6.x86_64" from osg-release, so I guess it is not a version problem. However, I will reinstall the package in case it is corrupted. |
In that case, you can still reinstall from OSG, to rule out corruption, but don’t have to change your normal yum repo sources.
|
@wyang007 Hello, I found that the glibc and glibc debuginfo versions were different. Here is the gdb output after reinstalling:
(gdb) where
#0 XrdFfsWcache_destroy (fd=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdFfs/XrdFfsWcache.cc:138
#1 0x0000003de3c094ca in XrdFfsWcache_create (fd=0) at /usr/src/debug/xrootd/xrootd/src/XrdFfs/XrdFfsWcache.cc:114
#2 0x0000000000403ed1 in xrootdfs_create (
path=0x7fb4a0000b20 "/store/user/jhchoi/Latino/HWWNano/Fall2017_102X_nAODv5_Full2017v6/MCl1loose2017v6__MCCorr2017v6/nanoLatino_TTToSemiLeptonic_PSweights__part34.root", mode=<value optimized out>, fi=0x7fb5c99bede0) at /usr/src/debug/xrootd/xrootd/src/XrdFfs/XrdFfsXrootdfs.cc:444
#3 0x0000003743c08c2f in fuse_fs_create (fs=0x1227a10,
path=0x7fb4a0000b20 "/store/user/jhchoi/Latino/HWWNano/Fall2017_102X_nAODv5_Full2017v6/MCl1loose2017v6__MCCorr2017v6/nanoLatino_TTToSemiLeptonic_PSweights__part34.root", mode=33188, fi=0x7fb5c99bede0) at fuse.c:1428
#4 0x0000003743c1090a in fuse_lib_create (req=0x7fb4a0006ee0, parent=46516, name=0x7fb53804b0c8 "nanoLatino_TTToSemiLeptonic_PSweights__part34.root",
mode=33188, fi=0x7fb5c99bede0) at fuse.c:2407
#5 0x0000003743c146f3 in do_create (req=<value optimized out>, nodeid=<value optimized out>, inarg=<value optimized out>) at fuse_lowlevel.c:705
#6 0x0000003743c120ef in fuse_do_work (data=0x7fb538071c40) at fuse_loop_mt.c:107
#7 0x0000003742007aa1 in start_thread (arg=0x7fb5c99bf700) at pthread_create.c:301
#8 0x0000003741ce8c4d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
I think that the xrootdfs_create function in XrdFfsXrootdfs.cc has a bug: it does not check a return value. On https://github.com/xrootd/xrootd/blob/master/src/XrdFfs/XrdFfsXrootdfs.cc#L442, |
@geonmo, you are probably right that xrootdfs_create() should check the return code of res = xrootdfs_do_create(), and return res if res < 0 (and errno = -res). Two questions:
|
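The return-code check suggested above could look roughly like the sketch below. It is a minimal model under stated assumptions, not the actual patch: `doCreate`, `createWriteCache`, `createChecked`, and `g_cachesCreated` are hypothetical stand-ins for the real calls. It relies only on the FUSE convention that a failed operation returns a negative errno value.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical stand-ins for the real calls; names and behavior are assumptions.
static int g_cachesCreated = 0;

// Pretend backend create: returns a descriptor >= 0, or -errno on failure.
int doCreate(bool backendHealthy) {
    return backendHealthy ? 10240 : -EIO;
}

// Pretend write-cache setup; it must only ever see a valid descriptor.
void createWriteCache(int fd) {
    assert(fd >= 0);
    ++g_cachesCreated;
}

// FUSE-style create handler: propagate the failure instead of handing a
// negative value on to the cache layer.
int createChecked(bool backendHealthy, int *fdOut) {
    int res = doCreate(backendHealthy);
    if (res < 0) {
        return res;  // FUSE operations report errors as negative errno values
    }
    *fdOut = res;
    createWriteCache(res);
    return 0;
}
```

With the early return in place, a failed create (for example while a backend mount hangs) never reaches the cache setup, so no cache slot is touched for an invalid descriptor.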
@wyang007 Our xrootd system has the following characteristics: It's hard to tell when the problem first happened, but if my guess is correct, it probably started when the NAS mount directory began to hang occasionally. If the NAS hangs, the xrootd server does not seem to notice and does not reduce its reported capacity immediately, so if a write to that space is executed before the loss is reported, the xrootdfs_do_create() function can fail. If you can provide the patch as an RPM, I can apply it to that server. Of course, testing can take quite a while, since the NAS mount must hang to reproduce the problem. However, if you give me a patch, I will continue the test. |
Added a commit to fix this issue: 8ed3d69 |
@wyang007 Thanks for the commit. I will close this ticket. |