
Xrootd not passing some blocks to the underlying storage plugin, leading to file corruption (block full of 0s) #1305

Closed
ericcano opened this issue Oct 14, 2020 · 13 comments


@ericcano
Contributor

We have a low rate of file corruptions in CASTOR when writing through the xrootd-ceph plugin. I instrumented the xrootd-ceph layer to keep track of the async writes (which is what xrootd uses most of the time). This instrumentation indicates that the async writes had all completed at close time, yet the last part of the file is full of 0s. We also saw blocks of 0s inside the file during the investigation.
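
For illustration, here is a minimal sketch of the kind of per-file accounting described above. The structure and names are mine, not the actual xrootd-ceph instrumentation: atomic counters are bumped when an async write is issued and when its completion callback fires, and they are dumped at close, which is roughly what the ceph_close log line further down reports.

#include <atomic>
#include <cstdint>
#include <cstdio>

// Hypothetical per-file accounting, in the spirit of the instrumentation
// described above; not the actual XrdCeph code.
struct AsyncWriteStats {
  std::atomic<uint64_t> issued{0};       // async writes started
  std::atomic<uint64_t> completed{0};    // completion callbacks that fired
  std::atomic<uint64_t> bytesWritten{0};
  std::atomic<uint64_t> maxOffset{0};    // highest byte offset written

  void onIssue(uint64_t offset, uint64_t len) {
    issued++;
    bytesWritten += len;
    uint64_t end = offset + len - 1;
    uint64_t prev = maxOffset.load();
    while (end > prev && !maxOffset.compare_exchange_weak(prev, end)) {}
  }
  void onComplete() { completed++; }

  void logAtClose(const char *path) const {
    std::printf("close %s: async write ops %llu/%llu, "
                "bytes written/max offset %llu/%llu\n",
                path,
                (unsigned long long)completed.load(),
                (unsigned long long)issued.load(),
                (unsigned long long)bytesWritten.load(),
                (unsigned long long)maxOffset.load());
  }
};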

We finally caught a corrupted file with the instrumentation in place, and made the following observations:

  • The file size is 8444089 bytes. The checksum stored by xrootd in the namespace is 6d769179. When reading the file back, we get a different checksum: 38c5fcaa. When truncating the file to 8 MB and computing the checksum, we find yet another one: a8aefcaa (see the checksum sketch below).
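
Since the three values above are compared by hand, here is a minimal sketch of that check, assuming the 8-hex-digit checksums are adler32 (as CASTOR/xrootd store them) and reusing the corrupt.root file name from the hexdump below; it recomputes the checksum over the whole file and over the first 8 MB only (link against zlib):

#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>
#include <zlib.h>   // adler32()

// Recompute the checksum of a possibly corrupted file, over its full length
// and over the first 8 MB only, to compare against the namespace value.
static uLong adlerOf(const std::vector<char> &buf, size_t len) {
  uLong a = adler32(0L, Z_NULL, 0);
  return adler32(a, reinterpret_cast<const Bytef *>(buf.data()),
                 static_cast<uInt>(len));
}

int main() {
  std::ifstream in("corrupt.root", std::ios::binary);
  std::vector<char> buf((std::istreambuf_iterator<char>(in)),
                        std::istreambuf_iterator<char>());
  const size_t eightMB = 8UL * 1024 * 1024;   // 8388608
  std::printf("full file  (%zu bytes): %08lx\n", buf.size(),
              adlerOf(buf, buf.size()));
  if (buf.size() > eightMB)
    std::printf("first 8 MB (%zu bytes): %08lx\n", eightMB,
                adlerOf(buf, eightMB));
  return 0;
}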

Investigating the file, we can see the block of 0s after the 8MB boundary:

$ hexdump -C corrupt.root | grep -A 2 '00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00'
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000060  00 00 00 00 00 00 00 70  00 04 00 00 00 46 66 94  |.......p.....Ff.|
--
00800000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
0080d8b0  00 00 00 00 00 00 00 00  00                       |.........|

The logs of the xrootd-ceph layer are as follows:

   ceph_close: closed fd 35583 for file 1815889915@castorns.47439017245, read ops count 0, write ops count 0, async write ops 2/2, async pending write bytes 0, async read ops 0/0, bytes written/max offset 8388608/8388607, last async op age 0.102200

The interpretation is:

  • 2 blocks were written asynchronously and the writes had completed at the time of close.
  • 8MB were written, and the maximum offset written (8388607, i.e. bytes written - 1) shows that the write was not sparse.
  • The close was issued 102ms after the last async write was started (which seems OK, ruling out timeout effects).

So it seems that the xrootd layer received the complete file (matching the checksum stored in the namespace), yet one block failed to be passed to the underlying layer. The Ceph rados striper layer supports sparse files, so this does not generate any error.
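
To make the non-sparseness argument explicit: if the per-write records (offset, length) are kept, "bytes written == max offset + 1" together with non-overlapping writes means the range [0, max offset] was fully covered. A minimal sketch of that coverage check follows; the names and the 4 MB block sizes in the example are illustrative only (the log above only tells us there were 2 async writes totalling 8388608 bytes).

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

struct WriteRec { uint64_t offset, length; };   // one (a)synchronous write

// True if the writes cover [0, max offset] exactly once, i.e. the byte
// stream handed to the storage plugin was neither sparse nor overlapping.
static bool coversContiguously(std::vector<WriteRec> w) {
  std::sort(w.begin(), w.end(),
            [](const WriteRec &a, const WriteRec &b) { return a.offset < b.offset; });
  uint64_t next = 0;
  for (const auto &r : w) {
    if (r.offset != next) return false;   // gap (sparse) or overlap
    next = r.offset + r.length;
  }
  return true;
}

int main() {
  // Illustrative split of the corrupted example into two 4 MB blocks.
  std::vector<WriteRec> w = {{0, 4194304}, {4194304, 4194304}};
  std::printf("contiguous: %s\n", coversContiguously(w) ? "yes" : "no");
  return 0;
}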

@ericcano
Contributor Author

ericcano commented Oct 15, 2020

We have another example from a few days ago which, unfortunately, went through a gateway without the new instrumentation. Nevertheless, it has the usual 8MB-aligned block-of-0s signature:

$ hexdump -C corrupt2.root | grep -A 2 "00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00"
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000060  00 00 00 00 00 00 00 c6  00 04 00 00 00 71 65 10  |.............qe.|
--
07800000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
0780daa0  00 00                      
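
As an aside, this signature (a trailing run of 0s starting exactly on an 8MB boundary) is easy to scan for. A minimal sketch, with the 8MB block size hard-coded as an assumption taken from the examples in this thread:

#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>

// Report whether the trailing run of zero bytes in a file starts exactly
// on an 8 MB boundary, as in the corrupted files shown above.
int main(int argc, char **argv) {
  if (argc != 2) { std::fprintf(stderr, "usage: %s <file>\n", argv[0]); return 2; }
  std::ifstream in(argv[1], std::ios::binary);
  std::vector<char> buf((std::istreambuf_iterator<char>(in)),
                        std::istreambuf_iterator<char>());
  const size_t block = 8UL * 1024 * 1024;

  size_t zerosFrom = buf.size();   // start of the trailing zero run
  while (zerosFrom > 0 && buf[zerosFrom - 1] == 0) --zerosFrom;

  if (zerosFrom < buf.size() && zerosFrom % block == 0)
    std::printf("%s: zeros from 0x%zx (8 MB aligned) to EOF (size %zu)\n",
                argv[1], zerosFrom, buf.size());
  else
    std::printf("%s: no 8 MB-aligned trailing block of zeros\n", argv[1]);
  return 0;
}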

@ericcano
Contributor Author

We got yet another example, this time with the full, latest logs (more timings):

ceph_close: closed fd 2462 for file 1816654013@castorns.47465071984, read ops count 0, write ops count 0, async write ops 74/74, async pending write bytes 0, async read ops 0/0, bytes written/max offset 310378496/310378495, longest async write 0.686074, longest callback invocation 0.000081, last async op age 0.371806

Timings are perfect, so the issue is not with them. Again, the last block was not passed to the storage plugin: the full file size is 310381850 bytes, but only 310378496 bytes (exactly 37 × 8388608) made it through, so the cutoff address is again 8MB aligned and the missing tail is the final 3354 bytes.

@abh3
Member

abh3 commented Oct 16, 2020 via email

So, is it correct to assume async was set to off this time? If so, we will need to turn on tracing. It's hard to believe that the last block was not sent to at least the oss plugin (a.k.a. the Ceph plugin). Somehow, I think it's getting lost past that point, but only a trace record will show if that is the case. Let me get the right option for you.

Andy

@ericcano
Contributor Author

No, this one was also transferred async. We only detect the failed transfers after a few days, when the tape system fails to write the file to tape because the checksum of the actual data does not match the one in the namespace. This file was written Oct 13, 04:04 CEST.

I have set a subset (>50%) of the machines with async off. The transfer logs confirm this. The last async op age is a bug, fixed upstream but not in this version; it's just cosmetic. Synchronous writes also log the max offset/byte count, so we will be able to detect missing ends of files or sparse writes.

Here is an example (not problematic):

ceph_close: closed fd 0 for file 1817314031@castorns.47488487468, read ops count 0, write ops count 16, async write ops 0/0, async pending write bytes 0, async read ops 0/0, bytes written/max offset 130619785/130619784, longest async write 0.000000, longest callback invocation 0.000000, last async op age 1602876988.999397

Eric


@bbockelm
Contributor

@abh3 - is it possible something strange is occurring with write retries? It does seem quite strange that the expected write doesn't make it to the xrootd-ceph layer. I'm having trouble coming up with a possible explanation.

@abh3
Member

abh3 commented Oct 20, 2020 via email

@ericcano
Contributor Author

ericcano commented Nov 4, 2020

I was away for 2 weeks, during which we had the transfers set to sync and no issue arose. So the problem happens only with async enabled. We will stick to this setup for the time being.

To be complete, we also used to restart the xrootd daemon automatically every day (to minimize the odds of accumulating pending transfers), so this could have played a role. This mechanism has now been removed. I will report if new issues pop up.

@abh3
Member

abh3 commented Nov 4, 2020 via email

@abh3
Member

abh3 commented Nov 13, 2020

I think the last time we surmised it was because the blocks were being written out of order and that is something that upsets Ceph. As far as I can see, all writes have been issued, albeit not in sequential order. I am ready to close this ticket unless someone complains.

@ericcano
Contributor Author

Hi Andy,

We still did not have any new case, but from my instrumentation the last block did not make it to Ceph at all: the logs show it was never passed to the plugin.

That said, I think the workaround we have is sufficient, and we will eventually decommission CASTOR, so you can close the ticket.

@glpatcern
Contributor

Hi all,

If I can add a comment: please make sure that at RAL they're aware of this for ECHO, which is not going to be decommissioned any time soon as far as I know. Also, the workaround may not be feasible for them.

@abh3
Member

abh3 commented Nov 13, 2020

Closed by request...


@abh3 abh3 closed this as completed Nov 13, 2020
@bbockelm
Contributor

For posterity -

Ceph is just fine with out-of-order writes. However, there are some configurations of the system (such as RAL's use of erasure coding) that require aligned writes.
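
For readers hitting this with erasure-coded pools, a minimal illustration of what "aligned writes" means here; the 4 MB stripe unit is hypothetical and the real constraint depends on the pool and striper configuration (the sizes in the example are taken from the third corrupted file above, purely as numbers):

#include <cstdint>
#include <cstdio>

// Illustration only: some EC-backed setups require each write to start on a
// stripe-unit boundary and, except for the final fragment, to be a whole
// multiple of it. The 4 MB stripe unit below is an assumed value.
static bool alignedWrite(uint64_t offset, uint64_t length, uint64_t fileSize,
                         uint64_t stripeUnit = 4UL * 1024 * 1024) {
  const bool startOk = (offset % stripeUnit) == 0;
  const bool lenOk = (length % stripeUnit) == 0 || offset + length == fileSize;
  return startOk && lenOk;
}

int main() {
  const uint64_t fileSize = 310381850;
  std::printf("%d %d %d\n",
              alignedWrite(0, 8388608, fileSize),        // full block: ok
              alignedWrite(310378496, 3354, fileSize),   // final fragment: ok
              alignedWrite(1024, 4096, fileSize));       // mid-stripe start: not ok
  return 0;
}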
