
Large transfers failing with XCache #1893

Closed
bbockelm opened this issue Feb 2, 2023 · 11 comments

bbockelm (Contributor) commented Feb 2, 2023

We are observing that large transfers (≫10 GB) through an XCache are consistently failing.

The symptoms are that, somewhere in the transfer, pgread fails with a checksum error. This triggers a retry of every page in the block (strange, as I'd assume that only a single page would be corrupted, not all 128 each time). The retries result in either minutes-long stalls (causing a timeout in the client accessing the XCache) or, because the cache reopens the file internally as part of the recovery, a failure due to an expired token. Basically, we were never able to get a 200 GB file to transfer completely (though I think it would have eventually succeeded after a few more hours in the retry loop, as it at least made it further on each attempt).
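To make the page-level recovery concrete, here is a minimal sketch of how a client can localize corruption to individual 4 KiB pages. This is purely illustrative: the real kXR_pgread protocol attaches a CRC32C to each 4 KiB page, whereas `zlib.crc32` stands in here because CRC32C is not in the Python standard library, and `verify_pages` is a hypothetical helper, not XRootD code.

```python
import zlib

PAGE_SIZE = 4096  # pgread verifies data in 4 KiB pages

def verify_pages(data: bytes, expected: list[int]) -> list[int]:
    """Return indices of pages whose checksum does not match.

    Illustrative only: real pgread uses CRC32C (Castagnoli);
    zlib.crc32 is used here for stdlib convenience.
    """
    bad = []
    for i in range(0, len(data), PAGE_SIZE):
        if zlib.crc32(data[i:i + PAGE_SIZE]) != expected[i // PAGE_SIZE]:
            bad.append(i // PAGE_SIZE)
    return bad

# A 512 KiB block holds 128 pages; corrupting one byte should flag
# only that single page, not all 128.
block = bytes(512 * 1024)
cks = [zlib.crc32(block[i:i + PAGE_SIZE])
       for i in range(0, len(block), PAGE_SIZE)]
corrupted = bytearray(block)
corrupted[5 * PAGE_SIZE] ^= 0xFF  # flip one byte in page 5
bad_pages = verify_pages(bytes(corrupted), cks)
print(bad_pages)  # [5] — only the corrupted page needs re-requesting
```

The point of the per-page checksums is exactly this localization, which is why retrying all 128 pages at once is surprising.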

This changed, however, when I switched the origin to require TLS. Instead of a corrupted page + recovery, every few minutes I'd get a disconnect along the lines of:

[2023-02-02 19:20:07.214508 +0000][Error  ][PostMaster        ] [u80@example.com:1095] Forcing error on disconnect: [ERROR] Operation interrupted.

(the underlying error is unclear to me here). XCache cleanly recovered from this error and I've not had any failures since enabling TLS (compared to no successes without TLS).

At this point, I said "ah-ha! I have discovered a problematic network and we are seeing TCP packet corruptions!"

Unfortunately, this also doesn't appear to be true. The TCP rates are reasonable (60 MB/s) for a transfer going over 2,000 miles, suggesting the TCP corruption rate can't be that bad. Further, none of the corruptions seem to occur when I utilize read instead of pgread, or if I do a HTTPS-based transfer. It seems unlikely that there's a TCP issue only when pgread is used!

This is with XRootD 5.5.1 on both the origin and cache.

I'm stumped about what could be going on here and would appreciate it if someone else could try a similar setup to see if it trivially duplicates. I think I've got a reasonably simple configuration, with the only somewhat-unique thing here being the size of the file (~200 GB).

amadio (Member) commented Feb 6, 2023

This looks like it could be a duplicate of a problem reported by Michał here: #1864. Hopefully it will be fixed in 5.5.2, which will be released soon.

bbockelm (Contributor, Author) commented Feb 6, 2023

That's definitely a part of it! There may be one more bug in the recovery logic, as periodically the data transfer rate will drop to zero during these recovery periods.

However, I'm not sure it solves my root issue: it's unclear why the checksums fail only when using the pgread API (and not when transferring with other protocols).

abh3 (Member) commented Feb 6, 2023 via email

bbockelm (Contributor, Author) commented Feb 7, 2023

> When using TLS, if a block cannot be decrypted (i.e. the internal checksum doesn't match), then TLS simply closes the connection

Correct. The mystery is that, when we're doing HTTPS (or xrdcp without pgread over ROOTS), there are no TLS errors. Why are there TLS errors only when using pgread?

> is expected that the transfer rate will slow as corrections occur in single 4k page units

Particularly, there are 128 4 KB pages plus a handful of outstanding "normal"-size reads from the XCache (I believe there were 5 × 512 KB). So, 6 × 512 KB = 3 MB of outstanding data.

The origin was running at around 60 MB/s at a round-trip latency of about 50 ms. So, if the requests were handled serially (I don't think they are?), this would take about 2 seconds to recover. Yet the timeouts were hit at around 1 minute of zero throughput.
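For scale, here is a back-of-envelope sketch of that recovery window under two assumed extremes: fully bandwidth-bound recovery versus one 4 KB page per round trip, serially. The serial per-page worst case is an assumption about the recovery logic, not a measured behavior; but even that pessimistic bound stays under the ~1 minute of observed zero throughput.

```python
# Figures quoted above: 3 MB outstanding, 60 MB/s, ~50 ms RTT.
OUTSTANDING = 6 * 512 * 1024      # 3 MB of outstanding data
PAGE = 4096                       # pgread recovery unit (4 KB)
RATE = 60 * 1024 * 1024           # ~60 MB/s observed throughput
RTT = 0.050                       # ~50 ms round trip

# Best case: recovery limited only by bandwidth.
bw_limited = OUTSTANDING / RATE

# Worst case: every 4 KB page re-requested serially, one per RTT.
pages = OUTSTANDING // PAGE
latency_limited = pages * (RTT + PAGE / RATE)

print(f"best ~{bw_limited:.2f}s, worst ~{latency_limited:.0f}s")
```

Either bound is shorter than the timeout window, so the minute-long stalls suggest something other than simple retransmission cost.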

Anyhow -- the basic question remains: does anyone else observe a ~0% success rate when handling files >>10GB with pgread?

xrootd-dev commented Feb 7, 2023 via email

amadio (Member) commented Feb 15, 2023

Could you please try again with 5.5.2 and let us know if the problem has been fixed? Thank you.

amadio (Member) commented Feb 22, 2023

@bbockelm Is this still happening with the latest release?

bbockelm (Contributor, Author) commented

Please keep it open for now -- waiting on a container rebuild from the team that rolls in the various XCache bugfixes from the last week.

amadio (Member) commented Feb 22, 2023

Ok, thanks for the update.

abh3 (Member) commented Oct 12, 2023

The last action was February 22nd of this year. Has anything been resolved here? Should this stay open?

amadio (Member) commented Oct 27, 2023

I'm closing this due to inactivity. Please open a new ticket (possibly linking back to this one) if you notice this is still a problem with the latest release.
