-
Notifications
You must be signed in to change notification settings - Fork 149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OOM issue with XRootD 4.12.6 (OSG version) #1430
Comments
Are you running a lot of checksums through that server? If so, there is a
known memory leak issue in that release.
…On Wed, 17 Mar 2021, Kenyi Hurtado wrote:
When using XRootD 4.12.6-1 (osg version) for our XRootD data servers, we notice OOM issues. This doesn't happen when we use 4.11 (we reverted back to that version)
```
(primeradiant06)
[Mon Mar 8 13:16:45 2021][316012.240505] xrootd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[Mon Mar 8 13:16:45 2021][316012.248957] xrootd cpuset=/ mems_allowed=0-1
[Mon Mar 8 13:16:45 2021][316012.253811] CPU: 4 PID: 2890 Comm: xrootd Kdump: loaded Tainted: P OE ------------ 3.10.0-1160.15.2.el7.x86_64 #1
[Mon Mar 8 13:16:45 2021][316012.266621] Hardware name: Dell Inc. PowerEdge R730xd/0599V5, BIOS 2.11.0 11/02/2019
[Mon Mar 8 13:16:45 2021][316012.275358] Call Trace:
[Mon Mar 8 13:16:45 2021][316012.278184] [<ffffffffac381fba>] dump_stack+0x19/0x1b
[Mon Mar 8 13:16:45 2021][316012.284014] [<ffffffffac37c8da>] dump_header+0x90/0x229
[Mon Mar 8 13:16:45 2021][316012.290042] [<ffffffffabd065a2>] ? ktime_get_ts64+0x52/0xf0
[Mon Mar 8 13:16:45 2021][316012.296457] [<ffffffffabd5dd7f>] ? delayacct_end+0x8f/0xb0
[Mon Mar 8 13:16:45 2021][316012.302773] [<ffffffffabdc227d>] oom_kill_process+0x2cd/0x490
[Mon Mar 8 13:16:45 2021][316012.309380] [<ffffffffabdc1c6d>] ? oom_unkillable_task+0xcd/0x120
[Mon Mar 8 13:16:45 2021][316012.316375] [<ffffffffabdc296a>] out_of_memory+0x31a/0x500
[Mon Mar 8 13:16:45 2021][316012.322691] [<ffffffffac37d3f7>] __alloc_pages_slowpath+0x5db/0x729
[Mon Mar 8 13:16:45 2021][316012.329878] [<ffffffffabdc8ee6>] __alloc_pages_nodemask+0x436/0x450
[Mon Mar 8 13:16:45 2021][316012.337066] [<ffffffffabe18bb8>] alloc_pages_current+0x98/0x110
[Mon Mar 8 13:16:45 2021][316012.343866] [<ffffffffabdbdd37>] __page_cache_alloc+0x97/0xb0
[Mon Mar 8 13:16:45 2021][316012.350475] [<ffffffffabdc0cd0>] filemap_fault+0x270/0x420
[Mon Mar 8 13:16:45 2021][316012.356799] [<ffffffffc05c491e>] __xfs_filemap_fault+0x7e/0x1d0 [xfs]
[Mon Mar 8 13:16:45 2021][316012.364192] [<ffffffffc05c4b1c>] xfs_filemap_fault+0x2c/0x30 [xfs]
[Mon Mar 8 13:16:45 2021][316012.371288] [<ffffffffabdedfba>] __do_fault.isra.61+0x8a/0x100
[Mon Mar 8 13:16:45 2021][316012.377991] [<ffffffffabdee56c>] do_read_fault.isra.63+0x4c/0x1b0
[Mon Mar 8 13:16:45 2021][316012.384985] [<ffffffffabdf5db0>] handle_mm_fault+0xa20/0xfb0
[Mon Mar 8 13:16:45 2021][316012.391497] [<ffffffffabe4efde>] ? do_readv_writev+0x19e/0x260
[Mon Mar 8 13:16:45 2021][316012.398191] [<ffffffffac38f653>] __do_page_fault+0x213/0x500
[Mon Mar 8 13:16:45 2021][316012.404699] [<ffffffffac38f975>] do_page_fault+0x35/0x90
[Mon Mar 8 13:16:45 2021][316012.410819] [<ffffffffac38b778>] page_fault+0x28/0x30
[Mon Mar 8 13:16:45 2021][316012.416649] Mem-Info:
[Mon Mar 8 13:16:45 2021][316012.419282] active_anon:15536411 inactive_anon:5562 isolated_anon:0
[Mon Mar 8 13:16:45 2021][316012.419282] active_file:0 inactive_file:0 isolated_file:0
[Mon Mar 8 13:16:45 2021][316012.419282] unevictable:0 dirty:0 writeback:0 unstable:0
[Mon Mar 8 13:16:45 2021][316012.419282] slab_reclaimable:217834 slab_unreclaimable:33046
[Mon Mar 8 13:16:45 2021][316012.419282] mapped:4831 shmem:6745 pagetables:34002 bounce:0
[Mon Mar 8 13:16:45 2021][316012.419282] free:41893 free_pcp:0 free_cma:0
```
The client side messages are the following:
```
cmd: 08-Mar-2021 13:39:24 EST Initiating request to open file root://deepthought.crc.nd.edu//store/user/kmohrman/postLHE_step/FullR2Studies/ULChecks/ttXJet-tXq_testFullULWFonCRC_ULCheck_UL17/v5/gen_step_tHq4f_testUpdateGenproddim6TopMay20GST_run2/GEN-00000_242.root
>> cmd: 08-Mar-2021 13:39:29 EST Successfully opened file root://deepthought.crc.nd.edu//store/user/kmohrman/postLHE_step/FullR2Studies/ULChecks/ttXJet-tXq_testFullULWFonCRC_ULCheck_UL17/v5/gen_step_tHq4f_testUpdateGenproddim6TopMay20GST_run2/GEN-00000_242.root
>> cmd: Begin processing the 1st record. Run 1, Event 1, LumiSection 93 on stream 4 at 08-Mar-2021 13:41:25.583 EST
>> cmd: Begin processing the 2nd record. Run 1, Event 2, LumiSection 93 on stream 3 at 08-Mar-2021 13:41:25.586 EST
>> cmd: Begin processing the 3rd record. Run 1, Event 3, LumiSection 93 on stream 0 at 08-Mar-2021 13:41:25.589 EST
>> cmd: Begin processing the 4th record. Run 1, Event 4, LumiSection 93 on stream 2 at 08-Mar-2021 13:41:25.592 EST
>> cmd: Begin processing the 5th record. Run 1, Event 5, LumiSection 93 on stream 5 at 08-Mar-2021 13:41:25.596 EST
>> cmd: Begin processing the 6th record. Run 1, Event 6, LumiSection 93 on stream 1 at 08-Mar-2021 13:41:25.599 EST
>> cmd: Begin processing the 7th record. Run 1, Event 7, LumiSection 93 on stream 2 at 08-Mar-2021 13:42:14.144 EST
>> cmd: Begin processing the 8th record. Run 1, Event 8, LumiSection 93 on stream 1 at 08-Mar-2021 13:42:18.072 EST
>> cmd: Begin processing the 9th record. Run 1, Event 9, LumiSection 93 on stream 4 at 08-Mar-2021 13:42:26.414 EST
>> cmd: Begin processing the 10th record. Run 1, Event 10, LumiSection 93 on stream 0 at 08-Mar-2021 13:42:48.198 EST
>> cmd: Begin processing the 11th record. Run 1, Event 11, LumiSection 93 on stream 2 at 08-Mar-2021 13:43:03.165 EST
>> cmd: Begin processing the 12th record. Run 1, Event 12, LumiSection 93 on stream 3 at 08-Mar-2021 13:43:05.578 EST
>> cmd: Begin processing the 13th record. Run 1, Event 13, LumiSection 93 on stream 5 at 08-Mar-2021 13:43:07.977 EST
>> cmd: Begin processing the 14th record. Run 1, Event 14, LumiSection 93 on stream 4 at 08-Mar-2021 13:43:16.565 EST
>> cmd: Begin processing the 15th record. Run 1, Event 15, LumiSection 93 on stream 1 at 08-Mar-2021 13:43:19.416 EST
>> cmd: Begin processing the 16th record. Run 1, Event 16, LumiSection 93 on stream 0 at 08-Mar-2021 13:43:43.391 EST
>> cmd: Begin processing the 17th record. Run 1, Event 17, LumiSection 93 on stream 2 at 08-Mar-2021 13:43:44.128 EST
>> cmd: Begin processing the 18th record. Run 1, Event 18, LumiSection 93 on stream 4 at 08-Mar-2021 13:43:53.116 EST
>> cmd: Begin processing the 19th record. Run 1, Event 19, LumiSection 93 on stream 3 at 08-Mar-2021 13:44:02.181 EST
>> cmd: Begin processing the 20th record. Run 1, Event 20, LumiSection 93 on stream 5 at 08-Mar-2021 13:44:20.865 EST
>> cmd: Begin processing the 21st record. Run 1, Event 21, LumiSection 93 on stream 2 at 08-Mar-2021 13:44:30.397 EST
>> cmd: [2021-03-08 13:47:41.409148 -0500][Error ][XRootD ] [primeradiant06.crc.nd.edu:1096] Unable to get the response to request kXR_readv (handle: 0x00000000, chunks: 39, total size: 18360925)
>> cmd: [2021-03-08 13:47:41.410099 -0500][Error ][File ] ***@***.***://deepthought.crc.nd.edu:1094//store/user/kmohrman/postLHE_step/FullR2Studies/ULChecks/ttXJet-tXq_testFullULWFonCRC_ULCheck_UL17/v5/gen_step_tHq4f_testUpdateGenproddim6TopMay20GST_run2/GEN-00000_242.root] Fatal file state error. Message kXR_readv (handle: 0x00000000, chunks: 39, total size: 18360925) returned with [ERROR] Operation expired
>> cmd: %MSG-w XrdAdaptorInternal: (NoModuleName) 08-Mar-2021 13:47:41 EST pre-events
>> cmd: XrdRequestManager::handle(name='root://deepthought.crc.nd.edu//store/user/kmohrman/postLHE_step/FullR2Studies/ULChecks/ttXJet-tXq_testFullULWFonCRC_ULCheck_UL17/v5/gen_step_tHq4f_testUpdateGenproddim6TopMay20GST_run2/GEN-00000_242.root) failure when reading from primeradiant06.crc.nd.edu:1096 (site T3_US_NotreDame); failed with error '[ERROR] Operation expired' (errno=0, code=206).
>> cmd: %MSG```
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
#1430
|
Do you mean at the client level?
|
@khurtado - is this a XCache setup or a "plain old data server"? |
@bbockelm This happened on our "plain old data server". |
Hm. Nothing is ringing a bell here. Does it only happen under production load? Is it reproducible if you isolate the server and just run the same download in a loop (i.e., memory leak or just increased memory usage?)? |
Unfortunately, neither the user that found the issue nor I were able to reproduce this via e.g.: xrdcp, so it's something that happened under production load only. We got back to 4.11 before attempting to debug the issue further with 4.12 because it was affecting other users too. |
I suspect it's 3rd party copy coupled with checksum calculation. There is
a known memory leak in that combination. That needs to be verified and can
be done by looking at the log.
…On Wed, 17 Mar 2021, Kenyi Hurtado wrote:
Unfortunately, neither the user that found the issue nor I were able to reproduce this via e.g.: xrdcp, so it's something that happened under production load only. We got back to 4.11 before attempting to debug the issue further with 4.12 because it was affecting other users too.
--
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
#1430 (comment)
|
It likely would be easier if you just make the log file available via
download or send it privately to me.
Andy
…On Wed, 17 Mar 2021, Kenyi Hurtado wrote:
> I suspect it's 3rd party copy coupled with checksum calculation. There is a known memory leak in that combination. That needs to be verified and can be done by looking at the log.
> [?](#)
> On Wed, 17 Mar 2021, Kenyi Hurtado wrote: Unfortunately, neither the user that found the issue nor I were able to reproduce this via e.g.: xrdcp, so it's something that happened under production load only. We got back to 4.11 before attempting to debug the issue further with 4.12 because it was affecting other users too. -- You are receiving this because you commented. Reply to this email directly or view it on GitHub: [#1430 (comment)](#1430 (comment))
We can do that! What kind of message should we be looking at? I looked for "checksum" on the server side xrootd logfile and didn't find anything.
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
#1430 (comment)
########################################################################
Use REPLY-ALL to reply to list
To unsubscribe from the XROOTD-DEV list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1
|
It this problem still occurring? If so, please send a link to the log file. |
I assume you have upgraded to 5.x and no longer see the issue, yes? |
This release is no longer supported. |
When using XRootD 4.12.6-1 (osg version) for our XRootD data servers, we notice OOM issues. This doesn't happen when we use 4.11 (we reverted back to that version)
The client side messages are the following:
The text was updated successfully, but these errors were encountered: