Large transfers failing with XCache (#2) #2168

Open
RHofsaess opened this issue Jan 22, 2024 · 5 comments
@RHofsaess

Hi all,
I am currently observing issue #1893 again when transferring many CMS PREMIX files (~20 GB) via XCache.
E.g.:

5304679:[2024-01-22 16:31:02.121581 +0100][Error  ][PostMaster        ] [u78@cmsxrootd-1.gridka.de:1094] Forcing error on disconnect: [ERROR] Operation interrupted.
5393494:[2024-01-22 16:31:09.021942 +0100][Error  ][PostMaster        ] [u76@eoscms-ns-01.cern.ch:1098] Forcing error on disconnect: [ERROR] Operation interrupted.

On top of that, I sometimes see the following error; I am not sure whether it is connected:

43795917:[2024-01-22 17:19:58.495355 +0100][Error  ][File              ] [0xa6508b0@root://u351@xrootd-cms.infn.it:1094//store/mc/Run3Summer21PrePremix/Neutrino_E-10_gun/PREMIX/Summer22_124X_mcRun3_2022_realistic_v11-v2/30002/4f020071-9cf2-4bdd-bdb4-ddc476d1575f.root?tried=172.26.19.197&triedrc=resel&xrdcl.requuid=f367fa06-539e-4e7b-8591-5f91ee8fcf86] Fatal file state error. Message kXR_readv (handle: 0x00000000, chunks: [(offset: 11020533760, size: 2097152); ], total size: 2097152) returned with [ERROR] Server responded with an error: [3008] Single readv transfer is too large

43796142:240122 17:19:58 432 XrdPfc_File: error Read(), direct read finished with error 12 cannot allocate memory /store/mc/Run3Summer21PrePremix/Neutrino_E-10_gun/PREMIX/Summer22_124X_mcRun3_2022_realistic_v11-v2/30002/4f020071-9cf2-4bdd-bdb4-ddc476d1575f.root

I am using v5.6.4 on CentOS 7.
Does anyone have an idea what could cause this or how to fix it?
Thanks
Robin

@abh3
Member

abh3 commented Jan 22, 2024

What are you using to transfer these files?

@RHofsaess
Author

Well, good question. Those are CMS MC production jobs; I assume they are just opening and streaming the files within CMSSW.
But I need to ask around about how exactly they behave.

@abh3 abh3 self-assigned this Jan 25, 2024
@amadio
Member

amadio commented Feb 6, 2024

At least the "Operation interrupted" type of error might go away with 5.6.6 or later (see #2169). I would recommend trying again with XRootD 5.6.7, and if this is still an issue, we need more information on how to reproduce the problem so that we can investigate the underlying cause further (a crash dump and/or full debug logs of a failed operation).
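
For the client-side debug logs, something along these lines should do (these are the standard XrdCl environment variables; the log path is just an example):

```sh
# Capture verbose XrdCl client logs for a failing job.
# Levels, from least to most verbose: Error, Warning, Info, Debug, Dump.
export XRD_LOGLEVEL=Dump
export XRD_LOGFILE=/tmp/xrdcl-debug.log  # write to a file instead of stderr
# ...then run the job/transfer as usual and attach the resulting log.
```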

@amadio amadio added the "Pending Info" (waiting on additional information from issue reporter) label and removed the "Under Investigation" label Feb 6, 2024
@RHofsaess
Author

Sure, sorry, I was a little busy lately 😅
Okay thanks, I will give it a try.

Some more info:
The jobs are just doing a TFile::Open(root://...) and streaming the data.
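Roughly, the access pattern amounts to something like this (a minimal sketch only; the URL and tree name below are placeholders, and the real CMSSW I/O is more involved):

```cpp
// Minimal illustration of "open via root:// and stream the data".
// The server URL, path, and tree name are placeholders, not the
// actual production code.
#include <TFile.h>
#include <TTree.h>

int main() {
    // TFile::Open dispatches root:// URLs to the XRootD client plugin.
    TFile *f = TFile::Open("root://xcache.example.org:1094//store/mc/SOME/PATH/file.root");
    if (!f || f->IsZombie()) return 1;

    TTree *t = nullptr;
    f->GetObject("Events", t);  // "Events" is the usual CMS tree name
    if (t) {
        for (Long64_t i = 0; i < t->GetEntries(); ++i)
            t->GetEntry(i);     // sequential streaming reads through the cache
    }
    f->Close();
    return 0;
}
```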
I have now increased pfc.blocksize from 128k to 4m, and the error happens much less regularly (which could also be due to a different job mix at the site; it is quite hard to test and debug these things in production...).
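For reference, that change corresponds to a single line in the cache configuration (illustrative excerpt only):

```
# before: pfc.blocksize 128k
pfc.blocksize 4m
```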
On top of that, I deactivated IPoIB, which may have been contributing to the problems.

I will keep an eye on it and keep you updated!

@abh3
Member

abh3 commented Feb 6, 2024

After talking with you yesterday, it would appear you are using a 2MB page size in Xcache. The problem is that a 2MB page size is incompatible with readv(), since the maximum element size is actually 2MB-16. I'd suggest a somewhat smaller page size. You might still get an error if Xcache wants to read too many large pages at once, as you would then exceed the total transfer limit for a readv. So, make sure the read-ahead count is not excessive. Of course, we never discussed what is actually driving such a large page size in the first place. We can do that.
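
Something along these lines in the cache configuration, for example (the values are illustrative; pfc.blocksize sets the page size and, if I have the directive right, pfc.prefetch caps the per-file read-ahead block count):

```
# keep the page size safely below the 2MB-16 readv element limit
pfc.blocksize 1m
# keep read-ahead modest so a single readv does not aggregate
# too many large pages
pfc.prefetch 4
```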
