
Memory leak / hoard with XRootD 5.0.2 (in checksumming code?) #1291

Closed
olifre opened this issue Sep 25, 2020 · 10 comments

olifre commented Sep 25, 2020

Since upgrading from 4.12.4 to 5.0.2, we observe huge memory usage for XRootD processes even after just a few hours of runtime. Sadly, this is not easily reproducible in our test setup; it only shows up under the heavy rate of incoming requests we see in production.

It seems to affect only the data transfer nodes, not the redirector. On the transfer nodes, I see RSS of up to 27 GB after 4-6 hours of heavy transfers (thousands of connections). A less extreme example:

VmPeak: 10657644 kB
VmSize: 10657644 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:   8422532 kB
VmRSS:   8422532 kB
RssAnon:         8413632 kB
RssFile:            8900 kB
RssShmem:              0 kB
VmData: 10496300 kB
VmStk:       132 kB
VmExe:        76 kB
VmLib:     18944 kB
VmPTE:     17140 kB
VmSwap:    18024 kB

I first wanted to ask whether there is a recommended way to debug these matters. Of course I know my way around valgrind and gdb, but attaching them to a production xrootd instance is not really an option. I will try to reproduce this in our test setup when a sufficient time slot opens up, but if there is a best practice, or something like a dump function to report the currently allocated memory segments and their use in XRootD, please let me know.


osschar commented Sep 25, 2020

I'd use tcmalloc / gperftools and dump a memory profile every time the total grows by 1 GB (HEAP_PROFILE_INUSE_INTERVAL=1073741824), then look at the difference. If it's that bad, it should be obvious ;)


osschar commented Sep 25, 2020

https://gperftools.github.io/gperftools/heapprofile.html

You can LD_PRELOAD libtcmalloc.so.
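Putting the two suggestions together, the setup could look roughly like this (a sketch; the library path, profile prefix, config path, and dump numbers are illustrative, and `HEAPPROFILE` / `HEAP_PROFILE_INUSE_INTERVAL` are the standard gperftools environment variables):

```shell
# Preload tcmalloc so its heap profiler wraps all allocations,
# and dump a profile each time in-use memory grows by 1 GiB.
export LD_PRELOAD=/usr/lib64/libtcmalloc.so
export HEAPPROFILE=/tmp/xrootd.hprof          # prefix for dump files
export HEAP_PROFILE_INUSE_INTERVAL=1073741824 # 1 GiB in bytes
xrootd -c /etc/xrootd/xrootd.cfg &

# Later: diff an early dump against a later one with pprof
pprof --text --base=/tmp/xrootd.hprof.0001.heap \
      "$(command -v xrootd)" /tmp/xrootd.hprof.0010.heap
```

The `--base` option subtracts the early profile from the later one, so only the growth between the two dumps is shown.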


olifre commented Sep 25, 2020

Thanks, that looks like something I can even try in production :-). Will do that tomorrow.

@olifre olifre changed the title Memory leak / hoard with XRootD 5.0.2 Memory leak / hoard with XRootD 5.0.2 (in checksumming code?) Sep 26, 2020

olifre commented Sep 26, 2020

I actually did a quick run just now, dumping every 100 MB and diffing an early dump against a later one. I quite quickly find:

Total: 1496.2 MB
  1470.0  98.2%  98.2%   1470.0  98.2% XrdCksManOss::Calc
    23.7   1.6%  99.8%     23.7   1.6% XrdBuffManager::Obtain (inline)
     3.4   0.2% 100.1%      6.4   0.4% C_GetFunctionList
     2.8   0.2% 100.2%      2.8   0.2% PL_ArenaAllocate
     0.4   0.0% 100.3%      7.6   0.5% curl_formget
     0.2   0.0% 100.3%      0.3   0.0% curl_easy_send

Indeed, I do not see where this buffer:

// Compute read size and allocate a buffer
//
ioSize = (fileSize < (off_t)rdSz ? fileSize : rdSz); rc = 0;
buffP = (char *)malloc(ioSize);
if (!buffP) return -ENOMEM;

is ever freed.
But I wonder even more why this never bit us before XRootD 5...

Indeed, we sometimes see (when a lot of transfers happen) that the other end claims there was a timeout during checksumming on our side. Maybe this leads to repeated checksumming, which would only now amplify the problem.


osschar commented Sep 28, 2020

I'm asking this for Andy, as I don't know this part of the code :)

Where is this being called from? If you produce graphical output, pprof draws a graph where the boxes show the different stack-trace locations.


olifre commented Sep 28, 2020

I reproduced it and collected a call graph from the difference of two heap traces:
callgraph.pdf
So it seems this is called via XrdXrootdProtocol::CheckSum => XrdOfs::chksum.
Does that help?


osschar commented Sep 28, 2020

Yup, perfect ... thanks!

@abh3 abh3 closed this as completed in 062cd42 Sep 28, 2020

olifre commented Sep 28, 2020

@abh3 and @osschar Many thanks for the quick fix! Indeed, the fix seems as trivial as expected; I'm just really astonished that I never hit this (or maybe just never hit it this hard) before upgrading to 5.x, since the code was there before...
But let's not worry about the past: this plugs a significant leak (if checksumming is used) and should help a lot of users :-).


xrootd-dev commented Sep 28, 2020 via email


olifre commented Sep 28, 2020

Hi Andy,

ah, thanks for the explanation! And I learned a bit of pprof in the process, which never hurts (I only knew valgrind / callgrind before, which would have been more detailed, but too much overhead here). 😄
