
Xrootd not passing some blocks to the underlying storage plugin, leading to file corruption (block full of 0s) #1305

Closed
ericcano opened this issue Oct 14, 2020 · 13 comments


@ericcano
Contributor

We have a low rate of file corruptions in CASTOR when writing through the xrootd-ceph plugin. I instrumented the xrootd-ceph layer to keep track of the async writes (which is what xrootd uses most of the time). This instrumentation indicates that the async writes had all completed at close time, yet the last part of the file is full of 0s. We also saw blocks of 0s inside the file during the investigation.
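
For illustration, here is a minimal sketch of the kind of per-file accounting described above. The structure and names are mine, not the actual xrootd-ceph instrumentation: atomic counters are bumped when an async write is issued and when its completion callback fires, and they are dumped at close, which is roughly what the ceph_close log line further down reports.

#include <atomic>
#include <cstdint>
#include <cstdio>

// Hypothetical per-file accounting, in the spirit of the instrumentation
// described above; not the actual XrdCeph code.
struct AsyncWriteStats {
  std::atomic<uint64_t> issued{0};       // async writes started
  std::atomic<uint64_t> completed{0};    // completion callbacks that fired
  std::atomic<uint64_t> bytesWritten{0};
  std::atomic<uint64_t> maxOffset{0};    // highest byte offset written

  void onIssue(uint64_t offset, uint64_t len) {
    issued++;
    bytesWritten += len;
    uint64_t end = offset + len - 1;
    uint64_t prev = maxOffset.load();
    while (end > prev && !maxOffset.compare_exchange_weak(prev, end)) {}
  }
  void onComplete() { completed++; }

  void logAtClose(const char *path) const {
    std::printf("close %s: async write ops %llu/%llu, "
                "bytes written/max offset %llu/%llu\n",
                path,
                (unsigned long long)completed.load(),
                (unsigned long long)issued.load(),
                (unsigned long long)bytesWritten.load(),
                (unsigned long long)maxOffset.load());
  }
};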

We finally caught a corrupted file with the instrumentation in place, and made the following observations:

  • The file size is 8444089 bytes. The checksum stored by xrootd in the namespace is 6d769179. When reading the file back, we get a different checksum: 38c5fcaa. When truncating the file to 8 MB and computing the checksum, we find yet another one: a8aefcaa (see the checksum sketch below).
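
Since the three values above are compared by hand, here is a minimal sketch of that check, assuming the 8-hex-digit checksums are adler32 (as CASTOR/xrootd store them) and reusing the corrupt.root file name from the hexdump below; it recomputes the checksum over the whole file and over the first 8 MB only (link against zlib):

#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>
#include <zlib.h>   // adler32()

// Recompute the checksum of a possibly corrupted file, over its full length
// and over the first 8 MB only, to compare against the namespace value.
static uLong adlerOf(const std::vector<char> &buf, size_t len) {
  uLong a = adler32(0L, Z_NULL, 0);
  return adler32(a, reinterpret_cast<const Bytef *>(buf.data()),
                 static_cast<uInt>(len));
}

int main() {
  std::ifstream in("corrupt.root", std::ios::binary);
  std::vector<char> buf((std::istreambuf_iterator<char>(in)),
                        std::istreambuf_iterator<char>());
  const size_t eightMB = 8UL * 1024 * 1024;   // 8388608
  std::printf("full file  (%zu bytes): %08lx\n", buf.size(),
              adlerOf(buf, buf.size()));
  if (buf.size() > eightMB)
    std::printf("first 8 MB (%zu bytes): %08lx\n", eightMB,
                adlerOf(buf, eightMB));
  return 0;
}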

Investigating the file, we can see the block of 0s after the 8MB boundary:

$ hexdump -C corrupt.root | grep -A 2 '00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00'
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000060  00 00 00 00 00 00 00 70  00 04 00 00 00 46 66 94  |.......p.....Ff.|
--
00800000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
0080d8b0  00 00 00 00 00 00 00 00  00                       |.........|

The logs of the xrootd-ceph layer are as follows:

   ceph_close: closed fd 35583 for file 1815889915@castorns.47439017245, read ops count 0, write ops count 0, async write ops 2/2, async pending write bytes 0, async read ops 0/0, bytes written/max offset 8388608/8388607, last async op age 0.102200

The interpretation is:

  • 2 blocks were written asynchronously and the writes had completed at the time of close.
  • 8MB were written, and the maximum offset written (8388607, i.e. bytes written - 1) shows that the write was not sparse.
  • The close was issued 102ms after the last async write was started (which seems OK, ruling out timeout effects).

So it seems that the xrootd layer received the complete file (matching the checksum stored in the namespace), yet one block failed to be passed to the underlying layer. The Ceph rados striper layer supports sparse files, so this does not generate any error.
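
To make the non-sparseness argument explicit: if the per-write records (offset, length) are kept, "bytes written == max offset + 1" together with non-overlapping writes means the range [0, max offset] was fully covered. A minimal sketch of that coverage check follows; the names and the 4 MB block sizes in the example are illustrative only (the log above only tells us there were 2 async writes totalling 8388608 bytes).

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

struct WriteRec { uint64_t offset, length; };   // one (a)synchronous write

// True if the writes cover [0, max offset] exactly once, i.e. the byte
// stream handed to the storage plugin was neither sparse nor overlapping.
static bool coversContiguously(std::vector<WriteRec> w) {
  std::sort(w.begin(), w.end(),
            [](const WriteRec &a, const WriteRec &b) { return a.offset < b.offset; });
  uint64_t next = 0;
  for (const auto &r : w) {
    if (r.offset != next) return false;   // gap (sparse) or overlap
    next = r.offset + r.length;
  }
  return true;
}

int main() {
  // Illustrative split of the corrupted example into two 4 MB blocks.
  std::vector<WriteRec> w = {{0, 4194304}, {4194304, 4194304}};
  std::printf("contiguous: %s\n", coversContiguously(w) ? "yes" : "no");
  return 0;
}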

@ericcano
Contributor Author

ericcano commented Oct 15, 2020

We have another example from a few days ago which, unfortunately, went through a gateway without the new instrumentation. Nevertheless, it has the usual 8MB-aligned block-of-0s signature:

$ hexdump -C corrupt2.root | grep -A 2 "00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00"
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000060  00 00 00 00 00 00 00 c6  00 04 00 00 00 71 65 10  |.............qe.|
--
07800000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
0780daa0  00 00                      
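
As an aside, this signature (a trailing run of 0s starting exactly on an 8MB boundary) is easy to scan for. A minimal sketch, with the 8MB block size hard-coded as an assumption taken from the examples in this thread:

#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>

// Report whether the trailing run of zero bytes in a file starts exactly
// on an 8 MB boundary, as in the corrupted files shown above.
int main(int argc, char **argv) {
  if (argc != 2) { std::fprintf(stderr, "usage: %s <file>\n", argv[0]); return 2; }
  std::ifstream in(argv[1], std::ios::binary);
  std::vector<char> buf((std::istreambuf_iterator<char>(in)),
                        std::istreambuf_iterator<char>());
  const size_t block = 8UL * 1024 * 1024;

  size_t zerosFrom = buf.size();   // start of the trailing zero run
  while (zerosFrom > 0 && buf[zerosFrom - 1] == 0) --zerosFrom;

  if (zerosFrom < buf.size() && zerosFrom % block == 0)
    std::printf("%s: zeros from 0x%zx (8 MB aligned) to EOF (size %zu)\n",
                argv[1], zerosFrom, buf.size());
  else
    std::printf("%s: no 8 MB-aligned trailing block of zeros\n", argv[1]);
  return 0;
}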

@ericcano
Contributor Author

We got yet another example, this time with the full, latest logs (more timings):

ceph_close: closed fd 2462 for file 1816654013@castorns.47465071984, read ops count 0, write ops count 0, async write ops 74/74, async pending write bytes 0, async read ops 0/0, bytes written/max offset 310378496/310378495, longest async write 0.686074, longest callback invocation 0.000081, last async op age 0.371806

Timings are perfect, so the issue is not with them. Again, the last block was not passed to the storage plugin: the full file size is 310381850 bytes, but only 310378496 bytes (exactly 37 × 8388608) made it through, so the cutoff address is again 8MB aligned and the missing tail is the final 3354 bytes.

@abh3
Member

abh3 commented Oct 16, 2020 via email

So, is it correct to assume async was set to off this time? If so, we will need to turn on tracing. It's hard to believe that the last block was not sent to at least the oss plugin (a.k.a. the Ceph plugin). Somehow, I think it's getting lost past that point, but only a trace record will show if that is the case. Let me get the right option for you.

Andy

@ericcano
Contributor Author

No, this one was also transferred async. We only detect the failed transfers after a few days, when the tape system fails to write the file to tape because the checksum of the actual data does not match the one in the namespace. This file was written Oct 13, 04:04 CEST.

I have set a subset (>50%) of the machines with async off. The transfer logs confirm this. The last async op age is a bug, fixed upstream but not in this version; it's just cosmetic. Synchronous writes also log the max offset/byte count, so we will be able to detect missing ends of files or sparse writes.

Here is an example (not problematic):

ceph_close: closed fd 0 for file 1817314031@castorns.47488487468, read ops count 0, write ops count 16, async write ops 0/0, async pending write bytes 0, async read ops 0/0, bytes written/max offset 130619785/130619784, longest async write 0.000000, longest callback invocation 0.000000, last async op age 1602876988.999397

Eric


@bbockelm
Contributor

@abh3 - is it possible something strange is occurring with write retries? It does seem quite strange that the expected write doesn't make it to the xrootd-ceph layer. I'm having trouble coming up with a possible explanation.

@abh3
Member

abh3 commented Oct 20, 2020 via email

@ericcano
Contributor Author

ericcano commented Nov 4, 2020

I was away for 2 weeks, during which we had the transfers set to sync and no issue arose. So the problem happens only with async enabled. We will stick to this setup for the time being.

To be complete, we also used to restart the xrootd daemon automatically every day (to minimize the odds of accumulating pending transfers), so this could have played a role. This mechanism has now been removed. I will report if new issues pop up.

@abh3
Member

abh3 commented Nov 4, 2020 via email

@abh3
Member

abh3 commented Nov 13, 2020

I think the last time we surmised it was because the blocks were being written out of order and that is something that upsets Ceph. As far as I can see, all writes have been issued, albeit not in sequential order. I am ready to close this ticket unless someone complains.

@ericcano
Contributor Author

Hi Andy,

We still did not have any new case, but from my instrumentation the last block did not make it to Ceph at all: the logs show it was never passed to the plugin.

That said, I think the workaround we have is sufficient, and we will eventually decommission CASTOR, so you can close the ticket.

@glpatcern
Contributor

Hi all,

If I can add a comment: please make sure that at RAL they're aware of this for ECHO, which is not going to be decommissioned any time soon as far as I know. Also, the workaround may not be feasible for them.

@abh3
Member

abh3 commented Nov 13, 2020

Closed by request...


@abh3 abh3 closed this as completed Nov 13, 2020
@bbockelm
Contributor

For posterity -

Ceph is just fine with out-of-order writes. However, there are some configurations of the system (such as RAL's use of erasure coding) that require aligned writes.
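
For readers hitting this with erasure-coded pools, a minimal illustration of what "aligned writes" means here; the 4 MB stripe unit is hypothetical and the real constraint depends on the pool and striper configuration (the sizes in the example are taken from the third corrupted file above, purely as numbers):

#include <cstdint>
#include <cstdio>

// Illustration only: some EC-backed setups require each write to start on a
// stripe-unit boundary and, except for the final fragment, to be a whole
// multiple of it. The 4 MB stripe unit below is an assumed value.
static bool alignedWrite(uint64_t offset, uint64_t length, uint64_t fileSize,
                         uint64_t stripeUnit = 4UL * 1024 * 1024) {
  const bool startOk = (offset % stripeUnit) == 0;
  const bool lenOk = (length % stripeUnit) == 0 || offset + length == fileSize;
  return startOk && lenOk;
}

int main() {
  const uint64_t fileSize = 310381850;
  std::printf("%d %d %d\n",
              alignedWrite(0, 8388608, fileSize),        // full block: ok
              alignedWrite(310378496, 3354, fileSize),   // final fragment: ok
              alignedWrite(1024, 4096, fileSize));       // mid-stripe start: not ok
  return 0;
}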
