XRootD 5.4.3 memory corruption for pgRead #1743

Closed
esindril opened this issue Jul 18, 2022 · 4 comments · Fixed by #1748

@esindril
Contributor

We are using a custom-built version of XRootD 5.4.3 with 3 extra commits to address some bugs that were affecting some of the more demanding EOS instances at CERN. The 3 extra commits are the following:
4df4cda
624daad
50da3f0

Unfortunately, with this XRootD 5.4.3++ version we see crashes (SEGV) in "random" places of the code which don't make much sense. Therefore, we deployed an ASAN-enabled version of EOS on some of the diskservers that were crashing, and it detected a memory corruption when handling pgRead operations. These operations most likely come from new xrdcp clients, which are probably the only ones that trigger the pgRead functionality on the server side.
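For context: pgRead serves data in 4 KiB pages, each protected by a CRC32C checksum, so a request whose offset is not a multiple of 4096 starts with a partial page. The sketch below only illustrates that layout arithmetic; the helper is hypothetical and not XRootD code.

    #include <cstdint>
    #include <cstdio>

    static constexpr uint64_t kPageSize = 4096; // protocol page size assumed here

    // Number of 4 KiB pages (and per-page checksums) a request touches.
    // An unaligned offset adds a leading partial page.
    uint64_t PagesCovered(uint64_t offset, uint64_t length)
    {
      if (length == 0) return 0;
      uint64_t firstPage = offset / kPageSize;
      uint64_t lastPage  = (offset + length - 1) / kPageSize;
      return lastPage - firstPage + 1;
    }

    int main()
    {
      std::printf("%llu\n", (unsigned long long) PagesCovered(0, 4u << 20)); // aligned: 1024 pages
      std::printf("%llu\n", (unsigned long long) PagesCovered(1, 4u << 20)); // offset 1: 1025 pages
      return 0;
    }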

Below is a sample of the ASAN report:

=================================================================
==33310==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7f3ba0b0e000 at pc 0x7f3d1ffe1f5d bp 0x7f3b6e775da0 sp 0x7f3b6e775548
    #0 0x7f3d1ffe1f5c  (/usr/lib64/libasan.so.5+0x57f5c)
    #1 0x7f3d015b8a2d in eos::fst::XrdFstOssFile::Read(void*, long, unsigned long) /root/rpmbuild/BUILD/eos-5.0.26-1/fst/XrdFstOssFile.cc:267
    #2 0x7f3d1f9c349c in XrdOfsFile::read(long long, char*, int) (/opt/eos/xrootd/lib64/libXrdServer.so.3+0x13949c)
    #3 0x7f3d15f5768e in eos::fst::XrdFstOfsFile::readofs(long long, char*, int) /root/rpmbuild/BUILD/eos-5.0.26-1/fst/XrdFstOfsFile.cc:2235
    #4 0x7f3d1618e94a in eos::fst::LocalIo::fileRead(long long, char*, int, unsigned short) /root/rpmbuild/BUILD/eos-5.0.26-1/fst/io/local/LocalIo.cc:114
    #5 0x7f3d14951ae3 in eos::fst::ReplicaParLayout::Read(long long, char*, int, bool) /root/rpmbuild/BUILD/eos-5.0.26-1/fst/layout/ReplicaParLayout.cc:211
    #6 0x7f3d15f5ca5f in eos::fst::XrdFstOfsFile::read(long long, char*, int) /root/rpmbuild/BUILD/eos-5.0.26-1/fst/XrdFstOfsFile.cc:787
    #7 0x7f3d15f4ed0d in eos::fst::XrdFstOfsFile::pgRead(long long, char*, int, unsigned int*, unsigned long) /root/rpmbuild/BUILD/eos-5.0.26-1/fst/XrdFstOfsFile.cc:858
    #8 0x7f3d1f9b9561 in XrdXrootdProtocol::do_PgRIO() (/opt/eos/xrootd/lib64/libXrdServer.so.3+0x12f561)
    #9 0x7f3d1f9ba65f in XrdXrootdProtocol::do_PgRead() (/opt/eos/xrootd/lib64/libXrdServer.so.3+0x13065f)
    #10 0x7f3d1f9735f7 in XrdXrootdProtocol::Process2() (/opt/eos/xrootd/lib64/libXrdServer.so.3+0xe95f7)
    #11 0x7f3d1f58de76 in XrdLinkXeq::DoIt() (/opt/eos/xrootd/lib64/libXrdUtils.so.3+0x216e76)
    #12 0x7f3d1f585319 in XrdLink::setProtocol(XrdProtocol*, bool, bool) (/opt/eos/xrootd/lib64/libXrdUtils.so.3+0x20e319)
    #13 0x7f3d1f595bf5 in XrdScheduler::Run() (/opt/eos/xrootd/lib64/libXrdUtils.so.3+0x21ebf5)
    #14 0x7f3d1f595f28 in XrdStartWorking(void*) (/opt/eos/xrootd/lib64/libXrdUtils.so.3+0x21ef28)
    #15 0x7f3d1f432199 in XrdSysThread_Xeq (/opt/eos/xrootd/lib64/libXrdUtils.so.3+0xbb199)
    #16 0x7f3d1e536ea4 in start_thread (/lib64/libpthread.so.0+0x7ea4)
    #17 0x7f3d1e25fb0c in clone (/lib64/libc.so.6+0xfeb0c)

0x7f3ba0b0e000 is located 0 bytes to the right of 2097152-byte region [0x7f3ba090e000,0x7f3ba0b0e000)
allocated by thread T589561 here:
    #0 0x7f3d200974fd in posix_memalign (/usr/lib64/libasan.so.5+0x10d4fd)
    #1 0x7f3d1f580e2a in XrdBuffManager::Obtain(int) (/opt/eos/xrootd/lib64/libXrdUtils.so.3+0x209e2a)

SUMMARY: AddressSanitizer: heap-buffer-overflow (/usr/lib64/libasan.so.5+0x57f5c)
Shadow bytes around the buggy address:
  0x0fe7f4159bb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0fe7f4159bc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0fe7f4159bd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0fe7f4159be0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0fe7f4159bf0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0fe7f4159c00:[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0fe7f4159c10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0fe7f4159c20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0fe7f4159c30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0fe7f4159c40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0fe7f4159c50: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==33310==ABORTING

Has anyone experienced similar crashes with the latest XRootD 5.4.3? We assume this is not a side effect of any of the 3 extra commits that we are using.

Thanks,
Elvin

@abh3
Member

abh3 commented Jul 18, 2022

Hi Elvin,

Most of the pgread traffic comes via xcache, and no one has reported crashes except when a read timeout occurs in a particular part of the code (that is being addressed as we speak). Judging from the trace, this does not look like anything that has been reported so far. I did reply via a separate thread to Michal, who brought this up first. The traceback itself looks fine; the issue is that we need to see who decided to read more than 2 MB into a properly allocated 2 MB buffer. Can you provide access to a core file I can look at?
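To make that concrete: the 2097152-byte region in the ASAN report comes from XrdBuffManager::Obtain(), and the faulting access starts exactly at the first byte past it. The sketch below shows that pattern with hypothetical names and a hypothetical length computation; it is not the actual do_PgRIO() code, just an illustration of a page-count-derived I/O length that is never clamped to the buffer capacity.

    #include <cstdio>
    #include <cstdlib>

    int main()
    {
      constexpr size_t kPageSize = 4096;
      constexpr size_t kBuffSize = 2 * 1024 * 1024;        // 2097152 bytes, as in the report
      void *buff = nullptr;
      if (posix_memalign(&buff, kPageSize, kBuffSize) != 0) // XrdBuffManager also hands out posix_memalign'd buffers
        return 1;

      // Hypothetical faulty computation: derive the I/O size from a whole-page
      // count (one extra page for the leading partial page of an unaligned
      // request) instead of clamping it to the buffer capacity.
      size_t offset      = 1;                              // unaligned request offset
      size_t pagesInBuff = kBuffSize / kPageSize;          // 512 full pages fit in the buffer
      size_t ioLength    = (pagesInBuff + 1) * kPageSize - (offset % kPageSize);

      // A read of ioLength bytes into buff would touch the first byte
      // "0 bytes to the right" of the region, exactly as ASAN reports.
      std::printf("buffer=%zu computed length=%zu\n", kBuffSize, ioLength);
      if (ioLength > kBuffSize)
        std::printf("BUG: would overrun the 2 MiB buffer by %zu bytes\n", ioLength - kBuffSize);

      std::free(buff);
      return 0;
    }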

@esindril
Contributor Author

Hi Andy,

Let me have another look at this, since I might have rushed a bit - too much enthusiasm after the holidays. The problem might actually come from the EOS code. Let's put this aside for the moment until I confirm everything looks OK on the EOS side.

Thanks,
Elvin

esindril added a commit to esindril/xrootd that referenced this issue Jul 19, 2022
@esindril
Contributor Author

Hi @abh3 ,

Could you please review the linked pull request? The problem is very simple to reproduce, at least inside EOS, by issuing a pgRead request with an offset that is not page-aligned. For example, a pgRead with offset 1 and length 4 MB triggers this memory corruption. After applying this patch, all the tests in EOS pass.
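For reference, a rough client-side sketch of such a reproducing request (offset 1, length 4 MiB), assuming the synchronous XrdCl::File::PgRead overload that also returns the per-page checksums; the URL is a placeholder and the exact signature should be checked against the XrdClFile.hh header of the build in use.

    #include <XrdCl/XrdClFile.hh>

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    int main()
    {
      XrdCl::File file;
      // Placeholder URL pointing at any file of at least 4 MiB on the diskserver.
      XrdCl::XRootDStatus st =
          file.Open("root://diskserver.example//eos/test/bigfile", XrdCl::OpenFlags::Read);
      if (!st.IsOK()) {
        std::fprintf(stderr, "open failed: %s\n", st.ToString().c_str());
        return 1;
      }

      const uint32_t length = 4 * 1024 * 1024;    // 4 MiB
      std::vector<char> buffer(length);
      std::vector<uint32_t> cksums;               // one CRC32C per 4 KiB page
      uint32_t bytesRead = 0;

      // Offset 1 is deliberately not page-aligned; this is the kind of request
      // that overran the server-side 2 MiB buffer.
      st = file.PgRead(1, length, buffer.data(), cksums, bytesRead);
      std::fprintf(stderr, "pgread: %s, bytes=%u, pages=%zu\n",
                   st.ToString().c_str(), bytesRead, cksums.size());

      file.Close();
      return 0;
    }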

Thanks,
Elvin

esindril added a commit to esindril/xrootd that referenced this issue Jul 19, 2022
@abh3
Member

abh3 commented Jul 19, 2022

Thank you for catching that! I will merge it as soon as all the checks complete.

esindril added a commit to esindril/xrootd that referenced this issue Aug 5, 2022