metalink:: errOperationInterrupted is not IsRecoverable #2169
Comments
The "operation interrupted" error was due to a stream timeout, i.e. the operation took too long, likely while opening a file. The error was with the mgm redirector eos-mgm.grid.pub.ro. Did the connection to the mgm use a metalink file? Clearly, the same thing happened with the connection to eos-fst-storage7.grid.pub.r, but that one appeared to recover, likely because the error was merely "socket timeout". The recovery appears to last far longer than it should, though, as the connection continuously times out afterwards. I don't quite understand that, so I'm not sure this was a good fix. I guess the bottom line is: which connection was using a metalink file?
Yes, I was using a metalink to access the file (I can send the file to you over email).
Hi Adrian,
You are correct. Using a metalink file should simply iterate to the next entry regardless of the failure. I suspect it does for practically all errors except errOperationInterrupted for some strange reason, likely an oversight. Please cut a ticket explaining that metalinks should be immune from failure modes otherwise their value is diminished.
Andy
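To make the expected behaviour concrete, here is a minimal client-side sketch of "iterate to the next entry regardless of the failure". It is not XRootD code: `fetch_with_fallback` and the `fetch` callable are hypothetical names invented for illustration, and the replica URLs in the usage note are placeholders.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def fetch_with_fallback(
    sources: Iterable[str],
    fetch: Callable[[str], bytes],
) -> Tuple[Optional[bytes], List[Tuple[str, str]]]:
    """Try each replica URL in order and return the first successful payload.

    `fetch` is a hypothetical callable that raises an exception on failure.
    Returns (payload, failures); payload is None if every source failed.
    """
    failures: List[Tuple[str, str]] = []
    for url in sources:
        try:
            return fetch(url), failures
        except Exception as exc:
            # Any failure mode, including a timeout or an interrupted
            # operation, only records the error and moves to the next replica.
            failures.append((url, repr(exc)))
    return None, failures
```

With two placeholder replicas where the first one times out, the call returns the payload of the second one together with a one-entry failure list, which is the fallback this issue expects from a metalink.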
From: Adrian Sevcenco
Sent: Monday, January 22, 2024 9:16 PM
To: xrootd/xrootd
Cc: Andrew Hanushevsky; Comment
Subject: Re: [xrootd/xrootd] metalink:: errOperationInterrupted is not IsRecoverable (Issue #2169)
Yes, I was using a metalink to access the file (I can send the file to you over email).
Actually, this is how the ALICE GRID client (https://github.com/adriansev/jalien_py) does its downloads (to have fallback between replicas).
The fallback did not work in this case, and I blame the subject of this issue :)
Is my conclusion correct? If not, why was the next source in the metalink not tried, and how can it be fixed?
Thanks a lot!
Hi @abh3 :) Well, the last time I had this discussion with Michal, I remember that there are errors that would be incompatible with the whole download, while others should be recoverable in order to continue to the next source... but it was not easy to identify which is which. So after the last fix, the conclusion was that if I encounter any other case I should just report it :) so here I am :) and this is the ticket :) xrootd/src/XrdCl/XrdClFileStateHandler.cc Lines 2944 to 2945 in 6a1dfa3
And to answer your request: errors from metalink sources should be recoverable so that the metalink can proceed to the next source; the whole value of metalinks is the ability to fall back between sources that are replicas of the same file. (There is also the multi-source download, but I do not use it because in the ALICE case replicas can be far too geographically distributed.)
Thanks. Sorry, I thought you were using xrootd-l, not GitHub, for this exchange. Thank you for logging the problem.
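The distinction described above, between errors that invalidate the whole download and errors that should only trigger a switch to the next source, could be modelled roughly as follows. This is only a sketch and not the logic in XrdClFileStateHandler.cc: the error names errSocketTimeout and errOperationInterrupted come from this thread, while `errExampleFatal`, `FATAL_FOR_WHOLE_DOWNLOAD` and `should_try_next_source` are hypothetical placeholders.

```python
# Hypothetical placeholder for errors that make the whole download pointless;
# the thread notes that identifying the real members of this set is not easy.
FATAL_FOR_WHOLE_DOWNLOAD = frozenset({"errExampleFatal"})


def should_try_next_source(error_name: str, sources_left: int) -> bool:
    """Decide whether a metalink download should move on to the next replica.

    Every error is treated as recoverable unless it is explicitly listed as
    fatal, so a newly seen error such as errOperationInterrupted falls back
    by default instead of aborting the transfer.
    """
    if sources_left <= 0:
        return False  # nothing left to fall back to
    return error_name not in FATAL_FOR_WHOLE_DOWNLOAD
```

The design choice in this sketch is a deny-list rather than an allow-list: with an allow-list of recoverable errors, each error missing from the list (first errSocketTimeout, now errOperationInterrupted) blocks the fallback until it is added.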
Thank you for the bug report. I am adding |
xrdlog.txt
Hi! Just to have this tracked here: I stumbled upon this case (see attached log) where
a weird error stops the metalink from proceeding to the next source (so no fallback).
It seems to me that the problem is that errOperationInterrupted is not treated as recoverable by IsRecoverable here.
I previously had this case with errSocketTimeout, and it was fixed this way by Michal :)
So, could the experts take a look please? :)
Thanks a lot!!