
Transfers to Xcache via memcache proxy errors with "numerical argument out of domain" in 5.3.X #1507

Closed
snafus opened this issue Sep 6, 2021 · 4 comments


@snafus
Contributor

snafus commented Sep 6, 2021

Hi,
We are observing transfer failures to an Xcache endpoint, with the error returned as "Error response: numerical argument out of domain".
Oxford runs an Xcache (5.3.1) pointing at the RAL endpoint. When RAL ran 5.2.0 these errors did not appear, but in testing with 5.3.1 we see them.
I've managed to reproduce the problem with a local Xcache instance, and also with an xrootd server using local storage (e.g. oss.localroot /xroot-test, and not using the XrdCeph plugin).
It seems to occur on the last read of the file, and only for files whose size is not a power of two: e.g. a 64 MiB file transfers ok, but 64 MiB + 4 bytes fails.

The flow is Xcache <- memcache proxy <- server.
If I remove the memcache proxy (i.e. Xcache <- server), then the problem disappears!
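To make the size dependence concrete, here is a small sketch (my own illustration, not project code) of the chunked-read arithmetic implied by the 33554432-byte (32 MiB) reads that appear in the logs. The assumption is that the client fetches the file in 32 MiB chunks, so a 64 MiB file ends on an exact chunk boundary while 64 MiB + 4 bytes leaves a 4-byte short read at offset 67108864:

```python
# Illustrative only: models the read pattern seen in the logs
# (kXR_read ... offset: 67108864, size: 33554432).
MiB = 1024**2
CHUNK = 32 * MiB  # assumed client read size, matching the logged requests

def last_read(file_size):
    """Return (offset, bytes_remaining) for the final chunked read."""
    if file_size % CHUNK:
        offset = (file_size // CHUNK) * CHUNK
    else:
        offset = file_size - CHUNK
    return offset, file_size - offset

print(last_read(64 * MiB))      # (33554432, 33554432) -- full chunk, works
print(last_read(64 * MiB + 4))  # (67108864, 4) -- 4-byte short read, fails
```

The failing request in the client log (size 33554432 at offset 67108864) is exactly this final short read: the full chunk is requested, but only 4 bytes of the file remain.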

The client sees:

[2021-09-06 14:47:37.062940 +0000][Dump   ][XRootD            ] [ceph-dev-gw1.gridpp.rl.ac.uk:1094] Got a kXR_error response to request kXR_read (handle: 0x00000000, offset: 67108864, size: 33554432) [3019] Unable to read /root:/ceph-dev-gw1.gridpp.rl.ac.uk:1096/xrootd-test/test/jwalder/test_64MiBp1; numerical argument out of domain
[2021-09-06 14:47:37.063005 +0000][Debug  ][XRootD            ] [ceph-dev-gw1.gridpp.rl.ac.uk:1094] Handling error while processing kXR_read (handle: 0x00000000, offset: 67108864, size: 33554432): [ERROR] Error response: numerical argument out of domain.
[2021-09-06 14:47:37.063043 +0000][Debug  ][ExDbgMsg          ] [ceph-dev-gw1.gridpp.rl.ac.uk:1094] Calling MsgHandler: 0xef5370 (message: kXR_read (handle: 0x00000000, offset: 67108864, size: 33554432) ) with status: [ERROR] Error response: numerical argument out of domain.
[2021-09-06 14:47:37.063073 +0000][Dump   ][File              ] [0xef0720@root://ceph-dev-gw1.gridpp.rl.ac.uk:1094//root://ceph-dev-gw1.gridpp.rl.ac.uk:1096//xrootd-test/test/jwalder/test_64MiBp1?xrdcl.requuid=41b966fc-6a6f-4b9f-a428-56c838d2d9e6] File state error encountered. Message kXR_read (handle: 0x00000000, offset: 67108864, size: 33554432) returned with [ERROR] Server responded with an error: [3019] Unable to read /root:/ceph-dev-gw1.gridpp.rl.ac.uk:1096/xrootd-test/test/jwalder/test_64MiBp1; numerical argument out of domain
[2021-09-06 14:47:37.063102 +0000][Error  ][File              ] [0xef0720@root://ceph-dev-gw1.gridpp.rl.ac.uk:1094//root://ceph-dev-gw1.gridpp.rl.ac.uk:1096//xrootd-test/test/jwalder/test_64MiBp1?xrdcl.requuid=41b966fc-6a6f-4b9f-a428-56c838d2d9e6] Fatal file state error. Message kXR_read (handle: 0x00000000, offset: 67108864, size: 33554432) returned with [ERROR] Server responded with an error: [3019] Unable to read /root:/ceph-dev-gw1.gridpp.rl.ac.uk:1096/xrootd-test/test/jwalder/test_64MiBp1; numerical argument out of domain
[2021-09-06 14:47:37.063143 +0000][Dump   ][File              ] [0xef0720@root://ceph-dev-gw1.gridpp.rl.ac.uk:1094//root://ceph-dev-gw1.gridpp.rl.ac.uk:1096//xrootd-test/test/jwalder/test_64MiBp1?xrdcl.requuid=41b966fc-6a6f-4b9f-a428-56c838d2d9e6] Failing message kXR_read (handle: 0x00000000, offset: 67108864, size: 33554432) with [ERROR] Server responded with an error: [3019] Unable to read /root:/ceph-dev-gw1.gridpp.rl.ac.uk:1096/xrootd-test/test/jwalder/test_64MiBp1; numerical argument out of domain

The Xcache reports

210906 15:47:37 962677 orl67423.21467:23@host-172-16-112-239.nubes.stfc.ac.uk XrootdProtocol: 0100 req=read dlen=8
210906 15:47:37 962677 orl67423.21467:23@host-172-16-112-239.nubes.stfc.ac.uk XrootdProtocol: 0100 0 fh=0 read 33554432@67108864
210906 15:47:37 962706 XrdSched: running  inq=0
210906 15:47:37 962706 XrdPfc_File: error ProcessBlockResponse block 0x7f5de8026d20, idx=2, off=67108864 error=-33 /xrootd-test/test/jwalder/test_64MiBp1
210906 15:47:37 962677 XrdPfc_File: error Read() io 0x7f5de8015f40, block 2 finished with error 33 numerical argument out of domain /xrootd-test/test/jwalder/test_64MiBp1
210906 15:47:37 962677 XrdPfc_IO: warning Read() error in File::Read(), exit status=-33, error=numerical argument out of domain root://u23@ceph-dev-gw1.gridpp.rl.ac.uk:1096//xrootd-test/test/jwalder/test_64MiBp1
210906 15:47:37 962677 ofs_read: orl67423.21467:23@host-172-16-112-239.nubes.stfc.ac.uk Unable to read /root:/ceph-dev-gw1.gridpp.rl.ac.uk:1096/xrootd-test/test/jwalder/test_64MiBp1; numerical argument out of domain
210906 15:47:37 962677 orl67423.21467:23@host-172-16-112-239.nubes.stfc.ac.uk XrootdResponse: 0100 sending err 3019: Unable to read /root:/ceph-dev-gw1.gridpp.rl.ac.uk:1096/xrootd-test/test/jwalder/test_64MiBp1; numerical argument out of domain
210906 15:47:37 962677 orl67423.21467:23@host-172-16-112-239.nubes.stfc.ac.uk XrootdProtocol: 0100 req=close dlen=0
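For reference, the "error=-33" / "error 33" values in these log lines are errno 33, which is EDOM on Linux, i.e. the "numerical argument out of domain" string that the proxy chain surfaces in the kXR_error response:

```python
import errno
import os

# errno 33 is EDOM on Linux/glibc (the value is platform-defined, but
# 33 is what this Linux-based setup reports).
print(errno.EDOM)                 # 33
print(os.strerror(errno.EDOM))    # on Linux: "Numerical argument out of domain"
```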

There are no obvious errors reported by the memcache proxy or server instances.
The last read from the proxy appears to be a 4 KiB request (note that only 4 bytes are needed at this point):

210906 15:56:01 962700 u23.962666:30@ceph-dev-gw1 XrootdResponse: 0100 sending final 8 info and 8 data bytes
210906 15:56:01 962700 u23.962666:30@ceph-dev-gw1 XrootdProtocol: 0100 req=pgread dlen=2
210906 15:56:01 962700 u23.962666:30@ceph-dev-gw1 XrootdProtocol: 0100 0 pgread 4096@67108864 fn=/xrootd-test/test/jwalder/test_64MiBp1
210906 15:56:01 962700 u23.962666:30@ceph-dev-gw1 ofs_pgRead: 4096@67108864 fn=/xrootd-test/test/jwalder/test_64MiBp1
Rdr: 4096@67108864 pr=0
Cache: Hit slot 8 sz 4 nio 1 uc -1
Cache: Ref 300000008 slot 8 sz 4 uc 0
Rdr: ret 4 hits 1 pr 0
210906 15:56:01 962700 u23.962666:30@ceph-dev-gw1 XrootdResponse: 0100 sending final 8 info and 8 data bytes
210906 15:56:01 962700 u23.962666:30@ceph-dev-gw1 XrootdProtocol: 0100 req=close dlen=0
210906 15:56:01 962700 u23.962666:30@ceph-dev-gw1 ofs_close: use=1 fn=/xrootd-test/test/jwalder/test_64MiBp1
Cache: 0 att; rel 1 slots; 0 Faults; 100 root://u30@ceph-dev-gw1.gridpp.rl.ac.uk:1095/xrootd-test/test/jwalder/test_64MiBp1
Cache: Stats: 4 Read; 16 Get; 0 Pass; 0 Write; 0 Put; 3 Hits; 1 Miss; 0 pead; 0 HitsPR; 0 MissPR; Path root://u30@ceph-dev-gw1.gridpp.rl.ac.uk:1095/xrootd-test/test/jwalder/test_64MiBp1
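The 4096-byte request in the pgread line above is consistent with pgread working in 4 KiB pages: the proxy asks for a whole page at the last page boundary even though only 4 bytes of the file remain. A quick sketch of that arithmetic (my own illustration, using the values from the log):

```python
# Values taken from the "pgread 4096@67108864" log line above.
PAGE = 4096
file_size = 64 * 1024**2 + 4   # 67108868 bytes (64 MiB + 4)
offset = 67108864              # last 4 KiB page boundary in the file

requested = PAGE                        # the proxy asks for a full page
valid = min(PAGE, file_size - offset)   # bytes the file can actually supply
print(requested, valid)                 # 4096 4
```

It is this full-page request against a 4-byte tail that the proxy has to resolve as a short read; mishandling of that case would explain why only non-power-of-two sizes fail.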

I'm not quite sure where to look now. There are a few places where new EDOM errors were introduced between 5.2.0 and 5.3.0, but it wasn't obvious to me where the best place to start looking was, or whether any config settings alter the behaviour.

If there are some details of the configs that are relevant to look at please let me know.

Cheers,
James

@smithdh
Contributor

smithdh commented Sep 7, 2021

I think I found the problem; at least, I could reproduce the issue described and verify that the proposed PR #1510 avoids the EDOM.

@snafus
Contributor Author

snafus commented Sep 7, 2021

Hi @smithdh ,
Many thanks. On my test setup I also see this working.
Is this a fix on the XCache side or the server side? (I tested with all instances on the same host, so with the same patched version of the code, and I don't know which machine will require the fix.)

Thanks again,
James

@smithdh
Contributor

smithdh commented Sep 7, 2021

I think it would be needed on the memory caching proxy (in this case).

@abh3
Member

abh3 commented Jan 4, 2022

I believe this has been fixed.

@abh3 abh3 closed this as completed Jan 4, 2022