Large transfers failing with XCache #1893
Comments
This looks like it could be a duplicate of a problem reported by Michał here: #1864. Hopefully it will be fixed in 5.5.2, which will be released soon.
That's definitely a part of it! There may be one more bug in the recovery logic, as periodically the data transfer rate will drop to zero during these recovery periods. However, I'm not sure it solves my root issue: it's unclear why the checksums fail *only* when using the `pgread` API (and not when transferring with other protocols).
There are no checksums per se in other protocols. When using TLS, if a
block cannot be decrypted (i.e. the internal checksum doesn't match) then
TLS simply closes the connection. Non-TLS transfers always succeed as long
as the length isn't corrupted. So, I suppose none of this should be
surprising. As for the retry logic error, indeed, that was a bug that got
corrected in 5.5.2. When correction occurs, it is expected that the
transfer rate will slow as corrections occur in single 4k page units. That
bug causes a slew of corrections that definitely were not needed.
Andy
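To illustrate the page-level correction described above, here is a minimal sketch in Python. It is not XRootD's implementation: the function names are hypothetical, and `zlib.crc32` is only a stand-in for whatever per-page checksum `pgread` actually carries. The point is that, with one checksum per 4k page, a single corrupted page can be identified and re-read without touching the other pages in the block.

```python
import zlib

PAGE_SIZE = 4096  # the 4k page unit mentioned in the thread


def page_checksums(data: bytes) -> list[int]:
    """Compute one checksum per 4k page (CRC32 used as a stand-in)."""
    return [zlib.crc32(data[off:off + PAGE_SIZE])
            for off in range(0, len(data), PAGE_SIZE)]


def corrupted_pages(data: bytes, expected: list[int]) -> list[int]:
    """Return the indices of pages whose checksum does not match."""
    actual = page_checksums(data)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# A 512KB block is 128 pages; flip one byte in page 37 (arbitrary choice).
block = bytes(range(256)) * (128 * PAGE_SIZE // 256)
sums = page_checksums(block)
damaged = bytearray(block)
damaged[37 * PAGE_SIZE + 5] ^= 0xFF
bad = corrupted_pages(bytes(damaged), sums)
print(bad)  # indices of the pages that would need to be re-read
```

Under these assumptions, one flipped byte marks exactly one page for correction, which is why a retry of all 128 pages in a block looks like more work than the corruption itself would require.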
…On Mon, 6 Feb 2023, Brian P Bockelman wrote:
That's definitely a part of it! There may be one more bug in the recovery logic, as periodically the data transfer rate will drop to zero during these recovery periods.
However, I'm not sure it solves my root issue: it's unclear why the checksums fail *only* when using the `pgread` API (and not when transferring with other protocols).
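> When using TLS, if a block cannot be decrypted (i.e. the internal checksum doesn't match) then TLS simply closes the connection
Correct. The mystery is that, when we're doing HTTPS (or xrdcp without `pgread` over ROOTS) there are no TLS errors. Why are there only TLS errors when using `pgread`?
> is expected that the transfer rate will slow as corrections occur in single 4k page units
Particularly, there are 128 4k pages and a handful of outstanding "normal" size reads from XCache (I believe there were 5 x 512KB). So, 6 x 512KB = 3MB of outstanding data. The origin was running at around 60MB/s at a distance of about 50ms. So, if the requests were handled serially (I don't think they are?), this would take about 2 seconds to recover. Yet the timeouts were hit at around 1 minute of zero bytes of throughput.
Anyhow -- the basic question remains: does anyone else observe a ~0% success rate when handling files >>10GB with pgread?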
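A rough version of this estimate, as a sketch. The sizes and rates are the ones quoted above; treating the 50ms as a full round trip paid once per request, and treating the requests as strictly serial, are assumptions made here, so the result is an order-of-magnitude bound rather than a measurement.

```python
KB = 1024
MB = 1024 * KB

page_reads = 128        # 4k page re-reads during recovery
page_size = 4 * KB
block_reads = 5         # outstanding "normal" reads from XCache
block_size = 512 * KB

bandwidth = 60 * MB     # bytes per second from the origin
round_trip = 0.050      # seconds per request (assumed interpretation of "50ms")

outstanding = page_reads * page_size + block_reads * block_size
transfer_time = outstanding / bandwidth
serial_latency = (page_reads + block_reads) * round_trip
total = transfer_time + serial_latency

print(f"outstanding data: {outstanding / MB:.1f} MB")
print(f"serial recovery estimate: {total:.1f} s")
```

The exact figure depends on how the 50ms is interpreted, but even under the pessimistic serial assumption the recovery should take seconds, not the roughly one minute of zero throughput seen before the timeouts.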
There is nothing particularly special about pgread other than it has two
checksums full duplex and none of the other reads have any. As for the
basic question, I think you are the first to notice something amiss. Mind
you that handling pgread is rather complicated because of interleaved
checksums. So, what is the OS on each end (type and version)?
…On Mon, 6 Feb 2023, Brian P Bockelman wrote:
> When using TLS, if a block cannot be decrypted (i.e. the internal checksum doesn't match) then TLS simply closes the connection
Correct. The mystery is that, when we're doing HTTPS (or xrdcp without `pgread` over ROOTS) there are no TLS errors. Why are there only TLS errors when using `pgread`?
> is expected that the transfer rate will slow as corrections occur in single 4k page units
Particularly, there are 128 4k pages and a handful of outstanding "normal" size reads from XCache (I believe there were 5 x 512KB). So, 6 x 512KB = 3MB of outstanding data.
The origin was running at around 60MB/s at a distance of about 50ms. So, if the requests were handled serially (I don't think they are?), this would take about 2 seconds to recover. Yet the timeouts were hit at around 1 minute of zero bytes of throughput.
Anyhow -- the basic question remains: does anyone else observe a ~0% success rate when handling files >>10GB with pgread?
Could you please try again with 5.5.2 and let us know if the problem has been fixed? Thank you.
@bbockelm Is this still happening with the latest release?
Please keep it open for now -- waiting on a container rebuild from the team that rolls in the various XCache bugfixes from the last week.
Ok, thanks for the update.
The last action was February 22nd of this year. Has anything been resolved here? Should this stay open?
I'm closing this due to inactivity. Please open a new ticket (possibly linking back to this one) if you notice this is still a problem with the latest release.
We are observing that large transfers (`>>10GB`) through an XCache are consistently failing.
The symptoms are that, somewhere in the transfer, `pgread` fails with a complaint about a checksum error. This triggers a retry of all pages in the block (strange, as I'd assume that only a single page would be corrupted, not all 128 each time). The retries result in either minutes-long stalls (causing a timeout in the client accessing XCache) or, as it will reopen the file internally as part of the recovery, a failure due to an expired token. Basically, we were never able to get a 200GB file to completely transfer (though I think it would have eventually succeeded after a few more hours of a retry loop, as it at least made it further on each transfer).
This changed, however, when I switched the origin to require TLS. Instead of a corrupted page + recovery, every few minutes I'd get a disconnect along the lines of:
(the underlying error is unclear to me here). XCache cleanly recovered from this error and I've not had any failures since enabling TLS (compared to no successes without TLS).
At this point, I said "ah-ha! I have discovered a problematic network and we are seeing TCP packet corruptions!"
Unfortunately, this also doesn't appear to be true. The TCP rates are reasonable (60MB/s) for a transfer going over 2,000 miles, suggesting the TCP corruption rate can't be that bad. Further, none of the corruptions seem to occur when I utilize `read` instead of `pgread` or if I do an HTTPS-based transfer. It seems unlikely that there's a TCP issue only when `pgread` is used!
This is with XRootD 5.5.1 on both the origin and cache.
I'm stumped on what could be going on here and would appreciate it if someone else could try a similar setup to see if it trivially duplicates. I think I've got a reasonably simple configuration; the only somewhat-unique thing here is the size of the file (~200GB).