Xrootd not passing some blocks to the underlying storage plugin, leading to file corruption (block full of 0s) #1305
We have another example from a few days ago which, unfortunately, went through a gateway without the new instrumentation. Nevertheless, it has the usual 8M aligned block of 0s signature:
|
We got another example again, this time with full, latest logs (more timings):
Timings are perfect, so the issue is not with them. Again, the last block was not passed to the storage plugin (the full file size is 310381850), and the cutoff address is again 8MB aligned. |
So, is it correct to assume async was set to off this time? If so, we will need to turn on tracing. It's hard to believe that the last block was not sent to at least the oss plugin (a.k.a. the Ceph plugin). Somehow, I think it's getting lost past that point, but only a trace record will show if that is the case. Let me get the right option for you.
Andy
On Fri, 16 Oct 2020, Eric Cano wrote:
We got another example again, this time with full, latest logs (more timings):
```
ceph_close: closed fd 2462 for file ***@***.***, read ops count 0, write ops count 0, async write ops 74/74, async pending write bytes 0, async read ops 0/0, bytes written/max offset 310378496/310378495, longest async write 0.686074, longest callback invocation 0.000081, last async op age 0.371806
```
Timings are perfect, so the issue is not with them. Again, the last block was not passed to the storage plugin (the full file size is 310381850), and the cutoff address is again 8MB aligned.
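As a quick sanity check of the figures in the log quoted above (just arithmetic on the numbers already reported, nothing new): the byte count that reached the plugin is exactly 37 × 8 MiB, and the declared file size is 3354 bytes larger, which is the missing tail.

```cpp
// Check of the figures quoted in the ceph_close log above: the byte count that
// reached the plugin is exactly 8 MiB-aligned, and the missing tail is the
// difference between the declared file size and the bytes written.
#include <cstdint>
#include <iostream>

int main() {
  const std::uint64_t bytesWritten = 310378496;        // "bytes written" at close
  const std::uint64_t fileSize     = 310381850;        // full file size
  const std::uint64_t block        = 8 * 1024 * 1024;  // 8 MiB

  std::cout << "aligned to 8 MiB: " << (bytesWritten % block == 0) << "\n"    // 1
            << "blocks written  : " << bytesWritten / block << "\n"           // 37
            << "missing tail    : " << fileSize - bytesWritten << " bytes\n"; // 3354
  return 0;
}
```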
|
No, this was also transferred async. We only detect the failed transfers after a few days, when the tape system fails to write the file to tape because the checksum of the actual data does not match the one in the namespace. This file was written Oct 13, 04:04 CEST. I set a subset (>50%) of the machines to async off, and the transfer logs confirm this. The last async op age is a bug, fixed upstream but not in this version; it is just cosmetic. Synchronous writes also log the max offset/byte count, so we will be able to detect missing ends of files or sparse writes. Here is an example (not problematic):
Eric
|
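As an aside, here is a minimal sketch of the kind of close-time accounting Eric describes above. The names (`WriteAccounting`, `record`, `checkAtClose`) are hypothetical and are not the actual xrootd-ceph instrumentation; the idea is only to show how tracking the bytes written and the highest offset reached makes a missing tail or a sparse write visible at close.

```cpp
// Minimal sketch (hypothetical names, not the actual xrootd-ceph code) of
// close-time accounting: track the bytes written and the highest offset
// reached, then compare both against the expected file size at close.
#include <cstdint>
#include <cstdio>

struct WriteAccounting {
  std::uint64_t bytesWritten = 0;  // sum of all write lengths
  std::uint64_t maxOffset    = 0;  // highest offset + length seen

  void record(std::uint64_t offset, std::uint64_t length) {
    bytesWritten += length;
    if (offset + length > maxOffset) maxOffset = offset + length;
  }

  // At close time: a max offset below the expected size means a missing tail;
  // bytesWritten below maxOffset means a hole (sparse region) somewhere inside.
  void checkAtClose(std::uint64_t expectedSize) const {
    if (maxOffset < expectedSize)
      std::fprintf(stderr, "missing tail: %llu < %llu\n",
                   (unsigned long long)maxOffset, (unsigned long long)expectedSize);
    if (bytesWritten < maxOffset)
      std::fprintf(stderr, "sparse write: only %llu of %llu bytes written\n",
                   (unsigned long long)bytesWritten, (unsigned long long)maxOffset);
  }
};

int main() {
  WriteAccounting acc;
  acc.record(0, 8 * 1024 * 1024);  // first 8 MiB block
  // ... last block never recorded ...
  acc.checkAtClose(310381850);     // reports a missing tail
  return 0;
}
```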
@abh3 - is it possible something strange is occurring with write retries? It does seem quite strange that the expected write doesn't make it to the `xrootd-ceph` layer. Having trouble coming up with a possible explanation. |
I agree that is strange, which is why we need to turn on write tracing server-side in the ofs layer. I can't see why the write would never make it to Ceph, but I can imagine that it may not be happy with an over-write that may occur on a retry. I also pointed out that when async is on, the writes may occur out of order, which is another thing that Ceph may be sensitive to.
On Tue, 20 Oct 2020, Brian P Bockelman wrote:
@abh3 - is it possible something strange is occurring with write retries? It does seem quite strange that the expected write doesn't make it to the `xrootd-ceph` layer. Having trouble coming up with a possible explanation.
|
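To illustrate Andy's point above about async writes landing out of order: a generic sketch using plain POSIX `pwrite` (not the xrootd or Ceph code path). Each positional write carries its own offset, so completion order does not change the final content; a block only ends up as zeros if it is never issued at all.

```cpp
// Generic illustration (plain POSIX, not the xrootd/Ceph code path): positional
// writes carry their own offset, so issuing them out of order still yields the
// same final file -- as long as every chunk is actually issued.
#include <fcntl.h>
#include <unistd.h>

int main() {
  int fd = open("ooo-demo.dat", O_CREAT | O_RDWR | O_TRUNC, 0644);
  if (fd < 0) return 1;

  char a[4] = {'A', 'A', 'A', 'A'};
  char b[4] = {'B', 'B', 'B', 'B'};

  pwrite(fd, b, sizeof(b), 4);  // second chunk written first (offset 4)
  pwrite(fd, a, sizeof(a), 0);  // first chunk written last (offset 0)

  char buf[8] = {0};
  pread(fd, buf, sizeof(buf), 0);
  // buf now holds "AAAABBBB": the order of issue did not matter.

  close(fd);
  return 0;
}
```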
I was away for 2 weeks, during which we had the transfers set to sync and no issue arose. So the problem happens only with async enabled. We will stick to this setup for the time being. To be complete, we also automatically restarted the xrootd daemon every day (to minimize the odds of accumulating pending transfers), so this could play a role. This mechanism has now been removed. I will report if new issues pop up. |
Thank you for the update, Eric. If async is the issue, then it would indicate that, perhaps, it is a problem when a particular block of data is written out of order. It would be nice if we could catch that somehow. Of course, that would mean bringing back the error, which may be too disruptive.
Andy
On Wed, 4 Nov 2020, Eric Cano wrote:
I was away for 2 weeks, during which we had the transfers set to sync and no issue arose. So the problem happens only with async enabled. We will stick to this setup for the time being.
To be complete, we also automatically restarted the xrootd daemon every day (to minimize the odds of accumulating pending transfers), so this could play a role. This mechanism has now been removed. I will report if new issues pop up.
|
I think the last time we surmised that it was because the blocks were being written out of order, and that is something that upsets Ceph. As far as I can see, all writes have been issued, albeit not in sequential order. I am ready to close this ticket unless someone complains. |
Hi Andy, We still did not have any new case, but from my instrumentation the last block did not make it to Ceph at all: the logs show it was never handed to the storage plugin. That said, I think the workaround we have is sufficient, and we will eventually decommission CASTOR, so you can close the ticket. |
Hi all, If I can add a comment, just make sure at RAL they're aware of this for ECHO, which is not going to be decommissioned any time soon as far as I know. Also, the workaround may not be feasible for them. |
Closed by request. |
For posterity - Ceph is just fine with out-of-order writes. However, there are some configurations of the system (such as RAL's use of erasure coding) that require aligned writes. |
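If the erasure-coding constraint is relevant, here is a hedged sketch of what an alignment check on incoming writes could look like. The 8 MiB stripe unit and all names are assumptions for illustration only, not the actual xrootd-ceph or ECHO configuration.

```cpp
// Hypothetical sketch: flag writes whose offset or length is not aligned to the
// backend's stripe unit. The 8 MiB value is an assumption for illustration; the
// real figure comes from the pool / striper configuration.
#include <cstdint>
#include <cstdio>

constexpr std::uint64_t kStripeUnit = 8 * 1024 * 1024;  // assumed, not authoritative

bool isAligned(std::uint64_t offset, std::uint64_t length, std::uint64_t fileSize) {
  // The final chunk of a file is allowed to be short.
  const bool lastChunk = (offset + length == fileSize);
  return offset % kStripeUnit == 0 && (lastChunk || length % kStripeUnit == 0);
}

int main() {
  std::printf("%d\n", isAligned(310378496, 3354, 310381850));  // 1: aligned tail write
  std::printf("%d\n", isAligned(310378497, 3353, 310381850));  // 0: misaligned offset
  return 0;
}
```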
We have a low rate of file corruption in CASTOR when writing through the xrootd-ceph plugin. I instrumented the xrootd-ceph layer to keep track of the async writes (which is what xrootd uses most of the time). This instrumentation indicates that the async writes all completed at close time, yet the last part of the file is full of 0s. We also saw blocks of 0s inside the file during the investigation.
We finally caught a corrupted file with the instrumentation in place, and made the following observations:
Investigating the file, we can see the structure with 0s after the 8 MB boundary:
The logs of the xrootd-ceph layer are as follows:
The interpretation is:
So it seems that the xrootd layer received the complete file (its checksum matches the one stored in the namespace), yet one block failed to be passed to the underlying layer. The Ceph rados striper layer supports sparse files, so this does not generate an error.
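A generic illustration of that last point, using a plain POSIX sparse file as a stand-in for the rados striper: a region that was never written simply reads back as zeros, with no error reported, which matches the corruption signature described in this issue.

```cpp
// Generic illustration (POSIX sparse file standing in for the rados striper):
// a region that was never written reads back as zeros, with no error reported.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
  int fd = open("sparse-demo.dat", O_CREAT | O_RDWR | O_TRUNC, 0644);
  if (fd < 0) return 1;

  char tail = 'X';
  pwrite(fd, &tail, 1, 1024);  // write only at offset 1024: bytes 0-1023 never written

  char buf[16];
  ssize_t n = pread(fd, buf, sizeof(buf), 0);
  std::printf("read %zd bytes, first byte = %d\n", n, buf[0]);  // 16 bytes, all zeros

  close(fd);
  return 0;
}
```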