Redirection failure loop #132
Comments
Hi Brian, did you observe the same issue with xrdcopy? Cheers,
Hi, I copy the xrdcopy behavior below. It didn't even get this far - it was unable to connect to the data server and immediately gave up. If it asks cms-xrd-global.cern.ch for a different replica, there is another one available. Brian
Thanks, I will have a look at both.
I cannot reproduce, I get redirected to Spain starting from both CERN and SLAC. I will recreate these conditions artificially on Monday and see what's going wrong.
Shoot - sorry, I think we booted Beijing from the federation the following day and I forgot to update the ticket.
That's OK, I have enough info to reproduce with the imposters.
I have reproduced with the imposters. This is what happens:

The new client works as expected, but that's probably not really what we want. It goes from FNAL to global to the IHEP redirector to the IHEP disk server. It fails at the disk server and goes back to global, adding the disk server to the "tried" CGI. It then gets back to the IHEP redirector, where it gets "No servers have read access to the file", since the disk server that has the file is excluded by "tried". Should it add the whole chain, from the load balancer to the failing server, to the "tried" CGI? I am not quite sure. I would probably opt for marking everything from the disk server up to the first encountered manager. Any opinions?

When using the old client, the whole CGI gets lost. It's not hugely complicated to fix, but not trivial either: to do it properly, I would have to add functionality to track whole redirection chains. Currently the old client tracks only the last connection. It's probably not worth it.
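To make the two options concrete, here is a minimal sketch of the retry behavior described above. It is a toy Python model, not XRootD client code: the disk server names (`disk.ihep.example`, `disk.ba.example`), the `redirect` and `open_file` helpers, and the rule that a manager always picks its first non-excluded child are all assumptions made for illustration. The `"server"` policy adds only the failing disk server to the tried list; the `"chain"` policy also adds the manager that handed it out.

```python
# Hypothetical topology: manager -> children. Disk server names are invented.
CHILDREN = {
    "cms-xrd-global.cern.ch": ["cmsdbs.ihep.ac.cn", "xrootd.ba.infn.it"],
    "cmsdbs.ihep.ac.cn": ["disk.ihep.example"],
    "xrootd.ba.infn.it": ["disk.ba.example"],
}
UNREACHABLE = {"disk.ihep.example"}  # models the firewalled disk server


def redirect(manager, tried):
    """Return the first direct child of `manager` that is not in `tried`."""
    for child in CHILDREN[manager]:
        if child not in tried:
            return child
    return None  # stands in for "No servers have read access to the file"


def open_file(policy, max_attempts=5):
    """Follow redirects from the global redirector, retrying on failure."""
    tried = set()
    for _ in range(max_attempts):
        chain = ["cms-xrd-global.cern.ch"]
        while chain[-1] in CHILDREN:  # keep going while we are at a manager
            nxt = redirect(chain[-1], tried)
            if nxt is None:
                return ("error", chain[-1], sorted(tried))
            chain.append(nxt)
        server = chain[-1]
        if server not in UNREACHABLE:
            return ("ok", server, sorted(tried))
        tried.add(server)  # both policies exclude the failing server
        if policy == "chain":
            tried.add(chain[-2])  # also exclude the manager right above it
    return ("gave up", None, sorted(tried))


print(open_file("server"))
print(open_file("chain"))
```

Under these assumptions, the `"server"` policy ends with an error at cmsdbs.ihep.ac.cn on the second attempt, while the `"chain"` policy causes the global redirector to skip cmsdbs.ihep.ac.cn and the open succeeds via xrootd.ba.infn.it.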
Hi Lukasz, It would seem that we should also exclude the first encountered manager. Honestly, I don't know the whole history behind the behavior of the old client in this regard - it might be nice to hear from @abh3, as he has a long memory (and perhaps there's a downside we're forgetting). For the old client - I think it might be sufficient to just fix the fact that it doesn't blacklist the disk server on a TCP connection failure. Brian
Hi Brian, I can do it, no problem, but I would also like to hear Andy's opinion on the matter. @abh3? Cheers,
We ran into an interesting Xrootd failure. See below. My interpretation of the problem:
We have no idea why cmsdbs.ihep.ac.cn gets the redirection 100% of the time in testing; our best guess is that it is related to redirector load (another connected redirector, xrootd.ba.infn.it, also has the file).
None of the servers require CMS auth (well, except the one that is behind a firewall causing the issue in the first place!) so a dev should be able to retry.