XRootDFS was aborted due to XrdFfsWCache_destroy function. #1116

Closed
geonmo opened this issue Jan 16, 2020 · 20 comments

geonmo commented Jan 16, 2020

Hello, XRootD Developers.

Our xrootdfs mount point frequently disappears.

I obtained a core file from the aborted process.

Here is the gdb backtrace.

Could you take a look?

(gdb) where
#0  0x0000003741c324f5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003741c33cd5 in abort () at abort.c:92
#2  0x0000003741c70417 in __libc_message (do_abort=2, fmt=0x3741d58c00 "*** glibc detected *** %s: %s: 0x%s ***\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:198
#3  0x0000003741c75e5e in malloc_printerr (action=3, str=0x3741d56bbe "corrupted double-linked list", ptr=<value optimized out>,
    ar_ptr=<value optimized out>) at malloc.c:6360
#4  0x0000003741c79066 in _int_free (av=0x7f5578000020, p=0x7f5578026d90, have_lock=0) at malloc.c:5030
#5  0x0000003407a093f1 in XrdFfsWcache_destroy (fd=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdFfs/XrdFfsWcache.cc:139
#6  0x0000003407a094ca in XrdFfsWcache_create (fd=0) at /usr/src/debug/xrootd/xrootd/src/XrdFfs/XrdFfsWcache.cc:114
#7  0x0000000000403ed1 in xrootdfs_create (
    path=0x7f53d00008c0 "/store/unmerged/SAM/testSRM/SAM-cms-se.sdfarm.kr/lcg-util/testfile-put-nospacetoken-1576995075-0026e73184df.txt",
    mode=<value optimized out>, fi=0x7f53debfcde0) at /usr/src/debug/xrootd/xrootd/src/XrdFfs/XrdFfsXrootdfs.cc:444
#8  0x0000003743c08c2f in fuse_fs_create (fs=0x1ff1a10,
    path=0x7f53d00008c0 "/store/unmerged/SAM/testSRM/SAM-cms-se.sdfarm.kr/lcg-util/testfile-put-nospacetoken-1576995075-0026e73184df.txt", mode=33188,
    fi=0x7f53debfcde0) at fuse.c:1428
#9  0x0000003743c1090a in fuse_lib_create (req=0x7f53d00013e0, parent=289472, name=0x7f53d40009c8 "testfile-put-nospacetoken-1576995075-0026e73184df.txt",
    mode=33188, fi=0x7f53debfcde0) at fuse.c:2407
#10 0x0000003743c146f3 in do_create (req=<value optimized out>, nodeid=<value optimized out>, inarg=<value optimized out>) at fuse_lowlevel.c:705
#11 0x0000003743c120ef in fuse_do_work (data=0x7f53d40008c0) at fuse_loop_mt.c:107
#12 0x0000003742007aa1 in start_thread (arg=0x7f53debfd700) at pthread_create.c:301
#13 0x0000003741ce8c4d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
(gdb) fram 5
#5  0x0000003407a093f1 in XrdFfsWcache_destroy (fd=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdFfs/XrdFfsWcache.cc:139
139             free(XrdFfsWcacheFbufs[fd].buf);
Member

abh3 commented Jan 16, 2020 via email

Author

geonmo commented Jan 16, 2020

@abh3 The server runs kernel 2.6.32-754.23.1.el6.x86_64 on Scientific Linux release 6.10 (Carbon).
The xrootd-fuse version is 4.11.0-1.osg34.el6.x86_64.

Author

geonmo commented Jan 16, 2020

I suspect that XrdFfsWcache_pwrite can corrupt the contents of XrdFfsWcacheFbufs[fd].buf because of the condition

(off_t)(offset + len) > (XrdFfsWcacheFbufs[fd].offset + XrdFfsWcacheBufsize))
when XrdFfsWcacheFbufs[fd].offset is not 0.

Member

abh3 commented Jan 16, 2020 via email

Member

abh3 commented Jan 16, 2020 via email

Author

geonmo commented Jan 16, 2020

Unfortunately, I cannot see the values because the compiler optimized them out.

(gdb) p XrdFfsWcacheFbufs[fd]
value has been optimized out

Member

abh3 commented Jan 16, 2020 via email

Author

geonmo commented Jan 16, 2020

@abh3 I checked the value of that variable, but the result is confusing. Line 138 checks that buf is not NULL before free() is called, yet gdb shows buf as 0x0, which is very strange. I may be misreading it because of my lack of GDB experience.

(gdb) f 5
#5  0x0000003407a093f1 in XrdFfsWcache_destroy (fd=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdFfs/XrdFfsWcache.cc:139
139             free(XrdFfsWcacheFbufs[fd].buf);
(gdb) p XrdFfsWcacheFbufs[0]
$1 = {offset = 0, len = 0, buf = 0x0, mlock = 0x0}
(gdb) list
134         fd -= XrdFfsPosix_baseFD;
135
136         XrdFfsWcacheFbufs[fd].offset = 0;
137         XrdFfsWcacheFbufs[fd].len = 0;
138         if (XrdFfsWcacheFbufs[fd].buf != NULL) 
139             free(XrdFfsWcacheFbufs[fd].buf);
140         XrdFfsWcacheFbufs[fd].buf = NULL;
141         if (XrdFfsWcacheFbufs[fd].mlock != NULL)
142         {
143             pthread_mutex_destroy(XrdFfsWcacheFbufs[fd].mlock);
(gdb) f 6
#6  0x0000003407a094ca in XrdFfsWcache_create (fd=0) at /usr/src/debug/xrootd/xrootd/src/XrdFfs/XrdFfsWcache.cc:114
114         XrdFfsWcache_destroy(fd);
(gdb) p XrdFfsWcacheFbufs[fd]
$4 = {offset = 0, len = 0, buf = 0x0, mlock = 0x0}

Member

wyang007 commented Jan 16, 2020 via email

Author

geonmo commented Jan 16, 2020

@wyang007 Well, fd seems to be 0 because XrdFfsPosix_baseFD has already been subtracted from it.

(gdb) p XrdFfsPosix_baseFD
$1 = 20000

Author

geonmo commented Jan 16, 2020

@wyang007 Ah, I misunderstood your answer. I am not sure, because of my lack of gdb experience.

Author

geonmo commented Jan 16, 2020

@wyang007 As you noted, the fd value at frame 6 should be 20000, so it is hard to understand why it is shown as 0.


xrootd-dev commented Jan 16, 2020 via email

Author

geonmo commented Jan 17, 2020

@wyang007 Hello, I have already installed "xrootd-debuginfo-4.11.0-1.osg34.el6.x86_64" from osg-release, so I guess it is not a version problem. However, I will reinstall the package in case it is corrupted.

Member

wyang007 commented Jan 17, 2020 via email

Author

geonmo commented Jan 29, 2020

@wyang007 Hello, I found that the glibc and glibc debuginfo versions were different. Here is the gdb output after installing the matching debuginfo.

(gdb) where
#0  XrdFfsWcache_destroy (fd=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdFfs/XrdFfsWcache.cc:138
#1  0x0000003de3c094ca in XrdFfsWcache_create (fd=0) at /usr/src/debug/xrootd/xrootd/src/XrdFfs/XrdFfsWcache.cc:114
#2  0x0000000000403ed1 in xrootdfs_create (
    path=0x7fb4a0000b20 "/store/user/jhchoi/Latino/HWWNano/Fall2017_102X_nAODv5_Full2017v6/MCl1loose2017v6__MCCorr2017v6/nanoLatino_TTToSemiLeptonic_PSweights__part34.root", mode=<value optimized out>, fi=0x7fb5c99bede0) at /usr/src/debug/xrootd/xrootd/src/XrdFfs/XrdFfsXrootdfs.cc:444
#3  0x0000003743c08c2f in fuse_fs_create (fs=0x1227a10, 
    path=0x7fb4a0000b20 "/store/user/jhchoi/Latino/HWWNano/Fall2017_102X_nAODv5_Full2017v6/MCl1loose2017v6__MCCorr2017v6/nanoLatino_TTToSemiLeptonic_PSweights__part34.root", mode=33188, fi=0x7fb5c99bede0) at fuse.c:1428
#4  0x0000003743c1090a in fuse_lib_create (req=0x7fb4a0006ee0, parent=46516, name=0x7fb53804b0c8 "nanoLatino_TTToSemiLeptonic_PSweights__part34.root", 
    mode=33188, fi=0x7fb5c99bede0) at fuse.c:2407
#5  0x0000003743c146f3 in do_create (req=<value optimized out>, nodeid=<value optimized out>, inarg=<value optimized out>) at fuse_lowlevel.c:705
#6  0x0000003743c120ef in fuse_do_work (data=0x7fb538071c40) at fuse_loop_mt.c:107
#7  0x0000003742007aa1 in start_thread (arg=0x7fb5c99bf700) at pthread_create.c:301
#8  0x0000003741ce8c4d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

I think the xrootdfs_create function in XrdFfsXrootdfs.cc has a bug: it does not check a return value.

At https://github.com/xrootd/xrootd/blob/master/src/XrdFfs/XrdFfsXrootdfs.cc#L442,
res is not checked after the call to xrootdfs_do_create. If xrootdfs_do_create fails, fd can be 0.

Member

wyang007 commented Feb 2, 2020

@geonmo, you are probably right that xrootdfs_create() should check the return code of res = xrootdfs_do_create() and return res if res < 0 (with errno = -res). Two questions:

  1. Do you know why xrootdfs_do_create() (essentially XrdFfsPosix_open()) fails?
  2. Can I give you a patch to see if that helps? The patch will check the return code of xrootdfs_do_create() and return res if res < 0.

Author

geonmo commented Feb 3, 2020

@wyang007 Our xrootd system has the following characteristics:
There are five XRootD servers in total, some of which use a mix of SAN and NAS storage. The servers run two XRootD disk server daemons, each managing files under a different localroot directory.

It is hard to tell when the problem began, but if my guess is correct, it probably started when the NAS mount directory occasionally hung. When the NAS hangs, the xrootd server does not seem to notice and does not reduce its reported capacity immediately, so a write to that space issued before the loss is reported can make xrootdfs_do_create() fail.

If you can provide the patch in RPM format, I can apply it to that server. It may take quite a while, since the NAS mount has to hang to reproduce the problem, but if you give me a patch I will keep testing.

Member

wyang007 commented Mar 4, 2020

Added a commit to fix this issue: 8ed3d69

Author

geonmo commented Mar 17, 2020

@wyang007 Thanks for the commit.

I will close this ticket.

geonmo closed this as completed Mar 17, 2020
4 participants