metalink:: errOperationInterrupted is not IsRecoverable #2169

Closed
adriansev opened this issue Jan 22, 2024 · 6 comments

@adriansev
Contributor

xrdlog.txt
Hi! Just to have this tracked here: I ran into a case (see the attached log) where
an unexpected error stops metalink from proceeding to the next source, so there is no fallback.
It seems to me that the problem is that errOperationInterrupted is not treated as recoverable (IsRecoverable) here.
I previously hit the same case with errSocketTimeout, and Michal fixed it this way :)
So, could the experts take a look, please? :)
Thanks a lot!!

@abh3
Member

abh3 commented Jan 22, 2024

The "operation Interrupted" was due to a stream timeout (i.e. the operation took too long, likely while opening a file). The error was with the mgm redirector eos-mgm.grid.pub.ro. Did the connection to the mgm use a metalink file? Clearly, the same thing happened with the connection to eos-fst-storage7.grid.pub.r but that one appeared to recover, likely because the error was merely "socket timeout". However, the recovery appears to last far longer than it should, as the connection continuously times out afterwards; I don't quite understand that, so I'm not sure this was a good fix.

I guess the bottom line is: which connection was using a metalink file?

@adriansev
Contributor Author

Yes, I was using a metalink to access the file (I can send the file to you by email).
That is actually how the ALICE GRID client does its downloads (to have fallback between replicas).
The fallback did not work in this case, and I blame the subject of this issue :)
Is my conclusion correct? If not, why was the next source in the metalink not tried, and how can it be fixed?
Thanks a lot!

@abh3
Member

abh3 commented Jan 23, 2024 via email

@adriansev
Contributor Author

Hi @abh3 :) Well, the last time I had this discussion with Michal, I remember that some errors are incompatible with continuing the download at all, while others should be recoverable so that the download can continue with the next source. It was not easy to identify which is which, so after the last fix the conclusion was that if I encountered any other case I should just report it :) so here I am :) and this is the ticket :)
This is why I suggested adding errOperationInterrupted here, just after errSocketTimeout:

if( status.code == errSocketError || status.code == errInvalidSession ||
status.code == errTlsError || status.code == errSocketTimeout )

And to answer your question: errors on metalink sources should be recoverable so that the metalink can proceed to the next source. The whole value of metalinks is the ability to fall back between sources that are replicas of the same file. (There is also multi-source download, but I do not use it because in the ALICE case replicas can be far too geographically distributed.)
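
To make the suggestion concrete, here is a minimal, self-contained sketch of the idea. It is not the XRootD implementation: the enum values, the `Status` struct, `errHypotheticalFatal`, and `OpenWithFallback` are hypothetical stand-ins written for illustration. Only the error code names in the condition come from the snippet quoted above, and the stop-on-non-recoverable behaviour is an assumption based on what was observed in the log.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins for the status codes named in this thread.
// The numeric values are illustrative, not the real XrdCl constants.
enum ErrorCode {
  errNone = 0,
  errSocketError,
  errInvalidSession,
  errTlsError,
  errSocketTimeout,
  errOperationInterrupted,
  errHypotheticalFatal  // made-up code that stays outside the recoverable list
};

struct Status {
  ErrorCode code = errNone;
  bool IsOK() const { return code == errNone; }
};

// Same shape as the quoted condition, with errOperationInterrupted appended.
bool IsRecoverable(const Status &status) {
  return status.code == errSocketError || status.code == errInvalidSession ||
         status.code == errTlsError || status.code == errSocketTimeout ||
         status.code == errOperationInterrupted;
}

// Assumed fallback behaviour: try each replica in order and move on to the
// next one only when the failure is recoverable.
// Returns the index of the replica that opened, or -1 if none did.
int OpenWithFallback(const std::vector<std::string> &replicas,
                     const std::function<Status(const std::string &)> &open) {
  for (std::size_t i = 0; i < replicas.size(); ++i) {
    Status st = open(replicas[i]);
    if (st.IsOK()) return static_cast<int>(i);
    if (!IsRecoverable(st)) return -1;  // non-recoverable: no fallback
  }
  return -1;  // every replica failed with a recoverable error
}
```

With this sketch, a first replica that fails with errOperationInterrupted no longer ends the attempt: the loop continues to the second replica. Without that code in the list, the same failure would return -1 immediately, which matches the missing fallback described in this issue.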

@abh3
Member

abh3 commented Jan 23, 2024

Thanks. Sorry, I thought you were using xrootd-l, not GitHub, for this exchange. Thank you for logging the problem.

@amadio
Member

amadio commented Jan 24, 2024

Thank you for the bug report. I am adding errOperationInterrupted to the list of recoverable errors. The change will be in our next release, which should come out shortly because of a problem with 5.6.5.
