Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OOM issue with XRootD 4.12.6 (OSG version) #1430

Closed
khurtado opened this issue Mar 17, 2021 · 11 comments
Closed

OOM issue with XRootD 4.12.6 (OSG version) #1430

khurtado opened this issue Mar 17, 2021 · 11 comments
Assignees

Comments

@khurtado
Copy link

When using XRootD 4.12.6-1 (osg version) for our XRootD data servers, we notice OOM issues. This doesn't happen when we use 4.11 (we reverted back to that version)

(primeradiant06)
[Mon Mar  8 13:16:45 2021][316012.240505] xrootd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0

[Mon Mar  8 13:16:45 2021][316012.248957] xrootd cpuset=/ mems_allowed=0-1

[Mon Mar  8 13:16:45 2021][316012.253811] CPU: 4 PID: 2890 Comm: xrootd Kdump: loaded Tainted: P           OE  ------------   3.10.0-1160.15.2.el7.x86_64 #1

[Mon Mar  8 13:16:45 2021][316012.266621] Hardware name: Dell Inc. PowerEdge R730xd/0599V5, BIOS 2.11.0 11/02/2019
[Mon Mar  8 13:16:45 2021][316012.275358] Call Trace:
[Mon Mar  8 13:16:45 2021][316012.278184]  [<ffffffffac381fba>] dump_stack+0x19/0x1b
[Mon Mar  8 13:16:45 2021][316012.284014]  [<ffffffffac37c8da>] dump_header+0x90/0x229
[Mon Mar  8 13:16:45 2021][316012.290042]  [<ffffffffabd065a2>] ? ktime_get_ts64+0x52/0xf0
[Mon Mar  8 13:16:45 2021][316012.296457]  [<ffffffffabd5dd7f>] ? delayacct_end+0x8f/0xb0
[Mon Mar  8 13:16:45 2021][316012.302773]  [<ffffffffabdc227d>] oom_kill_process+0x2cd/0x490
[Mon Mar  8 13:16:45 2021][316012.309380]  [<ffffffffabdc1c6d>] ? oom_unkillable_task+0xcd/0x120
[Mon Mar  8 13:16:45 2021][316012.316375]  [<ffffffffabdc296a>] out_of_memory+0x31a/0x500
[Mon Mar  8 13:16:45 2021][316012.322691]  [<ffffffffac37d3f7>] __alloc_pages_slowpath+0x5db/0x729
[Mon Mar  8 13:16:45 2021][316012.329878]  [<ffffffffabdc8ee6>] __alloc_pages_nodemask+0x436/0x450
[Mon Mar  8 13:16:45 2021][316012.337066]  [<ffffffffabe18bb8>] alloc_pages_current+0x98/0x110
[Mon Mar  8 13:16:45 2021][316012.343866]  [<ffffffffabdbdd37>] __page_cache_alloc+0x97/0xb0
[Mon Mar  8 13:16:45 2021][316012.350475]  [<ffffffffabdc0cd0>] filemap_fault+0x270/0x420
[Mon Mar  8 13:16:45 2021][316012.356799]  [<ffffffffc05c491e>] __xfs_filemap_fault+0x7e/0x1d0 [xfs]
[Mon Mar  8 13:16:45 2021][316012.364192]  [<ffffffffc05c4b1c>] xfs_filemap_fault+0x2c/0x30 [xfs]
[Mon Mar  8 13:16:45 2021][316012.371288]  [<ffffffffabdedfba>] __do_fault.isra.61+0x8a/0x100
[Mon Mar  8 13:16:45 2021][316012.377991]  [<ffffffffabdee56c>] do_read_fault.isra.63+0x4c/0x1b0
[Mon Mar  8 13:16:45 2021][316012.384985]  [<ffffffffabdf5db0>] handle_mm_fault+0xa20/0xfb0
[Mon Mar  8 13:16:45 2021][316012.391497]  [<ffffffffabe4efde>] ? do_readv_writev+0x19e/0x260
[Mon Mar  8 13:16:45 2021][316012.398191]  [<ffffffffac38f653>] __do_page_fault+0x213/0x500
[Mon Mar  8 13:16:45 2021][316012.404699]  [<ffffffffac38f975>] do_page_fault+0x35/0x90
[Mon Mar  8 13:16:45 2021][316012.410819]  [<ffffffffac38b778>] page_fault+0x28/0x30
[Mon Mar  8 13:16:45 2021][316012.416649] Mem-Info:
[Mon Mar  8 13:16:45 2021][316012.419282] active_anon:15536411 inactive_anon:5562 isolated_anon:0
[Mon Mar  8 13:16:45 2021][316012.419282]  active_file:0 inactive_file:0 isolated_file:0
[Mon Mar  8 13:16:45 2021][316012.419282]  unevictable:0 dirty:0 writeback:0 unstable:0
[Mon Mar  8 13:16:45 2021][316012.419282]  slab_reclaimable:217834 slab_unreclaimable:33046
[Mon Mar  8 13:16:45 2021][316012.419282]  mapped:4831 shmem:6745 pagetables:34002 bounce:0
[Mon Mar  8 13:16:45 2021][316012.419282]  free:41893 free_pcp:0 free_cma:0

The client side messages are the following:

cmd: 08-Mar-2021 13:39:24 EST  Initiating request to open file root://deepthought.crc.nd.edu//store/user/kmohrman/postLHE_step/FullR2Studies/ULChecks/ttXJet-tXq_testFullULWFonCRC_ULCheck_UL17/v5/gen_step_tHq4f_testUpdateGenproddim6TopMay20GST_run2/GEN-00000_242.root
>> cmd: 08-Mar-2021 13:39:29 EST  Successfully opened file root://deepthought.crc.nd.edu//store/user/kmohrman/postLHE_step/FullR2Studies/ULChecks/ttXJet-tXq_testFullULWFonCRC_ULCheck_UL17/v5/gen_step_tHq4f_testUpdateGenproddim6TopMay20GST_run2/GEN-00000_242.root
>> cmd: Begin processing the 1st record. Run 1, Event 1, LumiSection 93 on stream 4 at 08-Mar-2021 13:41:25.583 EST
>> cmd: Begin processing the 2nd record. Run 1, Event 2, LumiSection 93 on stream 3 at 08-Mar-2021 13:41:25.586 EST
>> cmd: Begin processing the 3rd record. Run 1, Event 3, LumiSection 93 on stream 0 at 08-Mar-2021 13:41:25.589 EST
>> cmd: Begin processing the 4th record. Run 1, Event 4, LumiSection 93 on stream 2 at 08-Mar-2021 13:41:25.592 EST
>> cmd: Begin processing the 5th record. Run 1, Event 5, LumiSection 93 on stream 5 at 08-Mar-2021 13:41:25.596 EST
>> cmd: Begin processing the 6th record. Run 1, Event 6, LumiSection 93 on stream 1 at 08-Mar-2021 13:41:25.599 EST
>> cmd: Begin processing the 7th record. Run 1, Event 7, LumiSection 93 on stream 2 at 08-Mar-2021 13:42:14.144 EST
>> cmd: Begin processing the 8th record. Run 1, Event 8, LumiSection 93 on stream 1 at 08-Mar-2021 13:42:18.072 EST
>> cmd: Begin processing the 9th record. Run 1, Event 9, LumiSection 93 on stream 4 at 08-Mar-2021 13:42:26.414 EST
>> cmd: Begin processing the 10th record. Run 1, Event 10, LumiSection 93 on stream 0 at 08-Mar-2021 13:42:48.198 EST
>> cmd: Begin processing the 11th record. Run 1, Event 11, LumiSection 93 on stream 2 at 08-Mar-2021 13:43:03.165 EST
>> cmd: Begin processing the 12th record. Run 1, Event 12, LumiSection 93 on stream 3 at 08-Mar-2021 13:43:05.578 EST
>> cmd: Begin processing the 13th record. Run 1, Event 13, LumiSection 93 on stream 5 at 08-Mar-2021 13:43:07.977 EST
>> cmd: Begin processing the 14th record. Run 1, Event 14, LumiSection 93 on stream 4 at 08-Mar-2021 13:43:16.565 EST
>> cmd: Begin processing the 15th record. Run 1, Event 15, LumiSection 93 on stream 1 at 08-Mar-2021 13:43:19.416 EST
>> cmd: Begin processing the 16th record. Run 1, Event 16, LumiSection 93 on stream 0 at 08-Mar-2021 13:43:43.391 EST
>> cmd: Begin processing the 17th record. Run 1, Event 17, LumiSection 93 on stream 2 at 08-Mar-2021 13:43:44.128 EST
>> cmd: Begin processing the 18th record. Run 1, Event 18, LumiSection 93 on stream 4 at 08-Mar-2021 13:43:53.116 EST
>> cmd: Begin processing the 19th record. Run 1, Event 19, LumiSection 93 on stream 3 at 08-Mar-2021 13:44:02.181 EST
>> cmd: Begin processing the 20th record. Run 1, Event 20, LumiSection 93 on stream 5 at 08-Mar-2021 13:44:20.865 EST
>> cmd: Begin processing the 21st record. Run 1, Event 21, LumiSection 93 on stream 2 at 08-Mar-2021 13:44:30.397 EST
>> cmd: [2021-03-08 13:47:41.409148 -0500][Error  ][XRootD            ] [primeradiant06.crc.nd.edu:1096] Unable to get the response to request kXR_readv (handle: 0x00000000, chunks: 39, total size: 18360925)
>> cmd: [2021-03-08 13:47:41.410099 -0500][Error  ][File              ] [0x91851600@root://deepthought.crc.nd.edu:1094//store/user/kmohrman/postLHE_step/FullR2Studies/ULChecks/ttXJet-tXq_testFullULWFonCRC_ULCheck_UL17/v5/gen_step_tHq4f_testUpdateGenproddim6TopMay20GST_run2/GEN-00000_242.root] Fatal file state error. Message kXR_readv (handle: 0x00000000, chunks: 39, total size: 18360925) returned with [ERROR] Operation expired
>> cmd: %MSG-w XrdAdaptorInternal:  (NoModuleName) 08-Mar-2021 13:47:41 EST pre-events
>> cmd: XrdRequestManager::handle(name='root://deepthought.crc.nd.edu//store/user/kmohrman/postLHE_step/FullR2Studies/ULChecks/ttXJet-tXq_testFullULWFonCRC_ULCheck_UL17/v5/gen_step_tHq4f_testUpdateGenproddim6TopMay20GST_run2/GEN-00000_242.root) failure when reading from primeradiant06.crc.nd.edu:1096 (site T3_US_NotreDame); failed with error '[ERROR] Operation expired' (errno=0, code=206).
>> cmd: %MSG```
@abh3
Copy link
Member

abh3 commented Mar 17, 2021 via email

@khurtado
Copy link
Author

Do you mean at the client level?
We use a lot of xrdfs <> stat <> before streaming the data to the executable, but not checksum.

>>> executing 'env XRD_LOGLEVEL=Debug timeout 300 xrdfs deepthought.crc.nd.edu stat /store/user/kmohrman/postLHE_step/FullR2Studies/ULChecks/ttXJet-tXq_testFullULWFonCRC_ULCheck_UL17/v4/digi_step_ttHJet_testUpdateGenproddim6TopMay20GST_run1/DIGI-00000_5775.root' @ Thu Mar  4 19:45:38 2021

@bbockelm
Copy link
Contributor

@khurtado - is this a XCache setup or a "plain old data server"?

@khurtado
Copy link
Author

@bbockelm This happened on our "plain old data server".

@bbockelm
Copy link
Contributor

Hm. Nothing is ringing a bell here.

Does it only happen under production load? Is it reproducible if you isolate the server and just run the same download in a loop (i.e., memory leak or just increased memory usage?)?

@khurtado
Copy link
Author

Unfortunately, neither the user that found the issue nor I were able to reproduce this via e.g.: xrdcp, so it's something that happened under production load only. We got back to 4.11 before attempting to debug the issue further with 4.12 because it was affecting other users too.

@abh3
Copy link
Member

abh3 commented Mar 17, 2021 via email

@xrootd-dev
Copy link

xrootd-dev commented Mar 17, 2021 via email

@abh3
Copy link
Member

abh3 commented Jul 29, 2021

It this problem still occurring? If so, please send a link to the log file.

@abh3 abh3 self-assigned this Jul 29, 2021
@abh3
Copy link
Member

abh3 commented Sep 21, 2021

I assume you have upgraded to 5.x and no longer see the issue, yes?

@abh3
Copy link
Member

abh3 commented Dec 13, 2021

This release is no longer supported.

@abh3 abh3 closed this as completed Dec 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants