
Large transfers failing with XCache #1893

Closed
bbockelm opened this issue Feb 2, 2023 · 11 comments

bbockelm (Contributor) commented Feb 2, 2023

We are observing that large transfers (≫10 GB) through an XCache are consistently failing.

The symptoms are that, somewhere in the transfer, pgread fails with a checksum error. This triggers a retry of every page in the block (strange, as I'd assume that only a single page would be corrupted, not all 128 each time). The retries result in either minutes-long stalls (causing a timeout in the client accessing the XCache) or, because the cache reopens the file internally as part of the recovery, a failure due to an expired token. Basically, we were never able to get a 200 GB file to transfer completely (though I think it would have eventually succeeded after a few more hours in the retry loop, as it at least made it further on each attempt).
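To make the page-level recovery concrete, here is a minimal sketch of how a client can localize corruption to individual 4 KiB pages. This is purely illustrative: the real kXR_pgread protocol attaches a CRC32C to each 4 KiB page, whereas `zlib.crc32` stands in here because CRC32C is not in the Python standard library, and `verify_pages` is a hypothetical helper, not XRootD code.

```python
import zlib

PAGE_SIZE = 4096  # pgread verifies data in 4 KiB pages

def verify_pages(data: bytes, expected: list[int]) -> list[int]:
    """Return indices of pages whose checksum does not match.

    Illustrative only: real pgread uses CRC32C (Castagnoli);
    zlib.crc32 is used here for stdlib convenience.
    """
    bad = []
    for i in range(0, len(data), PAGE_SIZE):
        if zlib.crc32(data[i:i + PAGE_SIZE]) != expected[i // PAGE_SIZE]:
            bad.append(i // PAGE_SIZE)
    return bad

# A 512 KiB block holds 128 pages; corrupting one byte should flag
# only that single page, not all 128.
block = bytes(512 * 1024)
cks = [zlib.crc32(block[i:i + PAGE_SIZE])
       for i in range(0, len(block), PAGE_SIZE)]
corrupted = bytearray(block)
corrupted[5 * PAGE_SIZE] ^= 0xFF  # flip one byte in page 5
bad_pages = verify_pages(bytes(corrupted), cks)
print(bad_pages)  # [5] — only the corrupted page needs re-requesting
```

The point of the per-page checksums is exactly this localization, which is why retrying all 128 pages at once is surprising.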

This changed, however, when I switched the origin to require TLS. Instead of a corrupted page + recovery, every few minutes I'd get a disconnect along the lines of:

[2023-02-02 19:20:07.214508 +0000][Error  ][PostMaster        ] [u80@example.com:1095] Forcing error on disconnect: [ERROR] Operation interrupted.

(the underlying error is unclear to me here). XCache cleanly recovered from this error and I've not had any failures since enabling TLS (compared to no successes without TLS).

At this point, I said "ah-ha! I have discovered a problematic network and we are seeing TCP packet corruptions!"

Unfortunately, this also doesn't appear to be true. The TCP rates are reasonable (60 MB/s) for a transfer going over 2,000 miles, suggesting the TCP corruption rate can't be that bad. Further, none of the corruptions seem to occur when I utilize read instead of pgread, or if I do a HTTPS-based transfer. It seems unlikely that there's a TCP issue only when pgread is used!

This is with XRootD 5.5.1 on both the origin and cache.

I'm stumped about what could be going on here and would appreciate it if someone else could try a similar setup to see if it trivially duplicates. I think I've got a reasonably simple configuration, with the only somewhat-unique thing here being the size of the file (~200 GB).

amadio (Member) commented Feb 6, 2023

This looks like it could be a duplicate of a problem reported by Michał here: #1864. Hopefully it will be fixed in 5.5.2, which will be released soon.

bbockelm (Contributor, Author) commented Feb 6, 2023

That's definitely a part of it! There may be one more bug in the recovery logic, as periodically the data transfer rate will drop to zero during these recovery periods.

However, I'm not sure it solves my root issue: it's unclear why the checksums fail only when using the pgread API (and not when transferring with other protocols).

abh3 (Member) commented Feb 6, 2023 via email

bbockelm (Contributor, Author) commented Feb 7, 2023

> When using TLS, if a block cannot be decrypted (i.e. the internal checksum doesn't match), then TLS simply closes the connection

Correct. The mystery is that, when we're doing HTTPS (or xrdcp without pgread over ROOTS), there are no TLS errors. Why are there TLS errors only when using pgread?

> is expected that the transfer rate will slow as corrections occur in single 4k page units

Particularly, there are 128 4 KB pages plus a handful of outstanding "normal"-size reads from the XCache (I believe there were 5 × 512 KB). So, 6 × 512 KB = 3 MB of outstanding data.

The origin was running at around 60 MB/s at a round-trip latency of about 50 ms. So, if the requests were handled serially (I don't think they are?), this would take about 2 seconds to recover. Yet the timeouts were hit at around 1 minute of zero throughput.
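For scale, here is a back-of-envelope sketch of that recovery window under two assumed extremes: fully bandwidth-bound recovery versus one 4 KB page per round trip, serially. The serial per-page worst case is an assumption about the recovery logic, not a measured behavior; but even that pessimistic bound stays under the ~1 minute of observed zero throughput.

```python
# Figures quoted above: 3 MB outstanding, 60 MB/s, ~50 ms RTT.
OUTSTANDING = 6 * 512 * 1024      # 3 MB of outstanding data
PAGE = 4096                       # pgread recovery unit (4 KB)
RATE = 60 * 1024 * 1024           # ~60 MB/s observed throughput
RTT = 0.050                       # ~50 ms round trip

# Best case: recovery limited only by bandwidth.
bw_limited = OUTSTANDING / RATE

# Worst case: every 4 KB page re-requested serially, one per RTT.
pages = OUTSTANDING // PAGE
latency_limited = pages * (RTT + PAGE / RATE)

print(f"best ~{bw_limited:.2f}s, worst ~{latency_limited:.0f}s")
```

Either bound is shorter than the timeout window, so the minute-long stalls suggest something other than simple retransmission cost.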

Anyhow -- the basic question remains: does anyone else observe a ~0% success rate when handling files >>10GB with pgread?

xrootd-dev commented Feb 7, 2023 via email

amadio (Member) commented Feb 15, 2023

Could you please try again with 5.5.2 and let us know if the problem has been fixed? Thank you.

amadio (Member) commented Feb 22, 2023

@bbockelm Is this still happening with the latest release?

bbockelm (Contributor, Author) commented

Please keep it open for now -- waiting on a container rebuild from the team that rolls in the various XCache bugfixes from the last week.

amadio (Member) commented Feb 22, 2023

Ok, thanks for the update.

abh3 (Member) commented Oct 12, 2023

The last action was February 22nd of this year. Has anything been resolved here? Should this stay open?

amadio (Member) commented Oct 27, 2023

I'm closing this due to inactivity. Please open a new ticket (possibly linking back to this one) if you notice this is still a problem with the latest release.
