v5.4.3's libXrdCl.3.0.0 hangs on macOS 12.5.1 #1779
Comments
@Axel-Naumann : thanks for reporting this issue! Could you please provide client side logs (just export |
Hi @simonmichal , thank you for the quick reaction, here's the full log:
Looks like broken IPv6 routing? EDIT: confirmed, turning off IPv6 works around this. Is it XRootD's responsibility to be more resilient in this case?
Hmm, according to the log it tried IPv6 and found that it did not work and then tried IPv4 and everything worked after that. So, where do you see that it is not resilient -- I must be missing it. |
Indeed! OTOH, it seems to retry for every operation (the overall read slowed down from about 30 seconds originally to more than 40 minutes), and I'd hope that the fallback could happen a bit faster than 2.5 minutes. I do not know the details, though, nor whether that's caused by our usage in ROOT or by XRootD. This seems to have been caused by broken IPv6 routing; renewing the IPv6 address resurrected things to the point where IPv6 now works on that machine. I could try with a hand-assigned, hopefully unrouted IPv6 address to get more of a log dump of what happens after the initial connection, should that be useful?
A common solution would be RFC 8305, the "Happy Eyeballs" algorithm. This is implemented by browsers to smooth the transition to IPv6: basically, after ~100ms of waiting for an IPv6 TCP connection, the client starts a second, parallel connection attempt over IPv4 and then uses whichever connection completes first. Of course, this is somewhat tricky concurrency, and the extra complexity may only be worthwhile for big applications like a browser.
The problem here is that IPv6 is fully functional according to the OS (there's a route to the destination, there's a global IP address, and nothing is sending back ICMP packets with "no route to host" to quickly terminate the connection attempt), so the routing configuration error is indistinguishable from the case of "server is unresponsive". Hence, you're hitting the connection timeout before the "second" DNS address is tried.
Perhaps a simpler "fail fast" algorithm than Happy Eyeballs is, for hostnames which resolve to
The downside of the latter idea is "complexity kills" for what is ultimately an end-user misconfiguration. The potential upside is that it'd help immensely with cases where the
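For illustration only, the Happy Eyeballs idea described above can be sketched in a few lines of Python. This is a simplified sketch, not how XRootD or any browser implements it: the two `connect_*` callables and the simulated connectors are hypothetical stand-ins for real connection attempts, and RFC 8305 covers more (address sorting, more than two candidates) than this race between one IPv6 and one IPv4 attempt.

```python
import asyncio


async def happy_eyeballs(connect_v6, connect_v4, delay=0.1):
    """Race an IPv6 attempt against an IPv4 attempt started `delay` seconds later.

    `connect_v6` and `connect_v4` are hypothetical coroutine functions that
    return a connection object or raise on failure.
    """
    v6 = asyncio.ensure_future(connect_v6())
    # Give IPv6 a head start of `delay` seconds.
    done, _ = await asyncio.wait({v6}, timeout=delay)
    if done and v6.exception() is None:
        return v6.result()

    # IPv6 is slow or has already failed: start IPv4 as well.
    v4 = asyncio.ensure_future(connect_v4())
    pending = {v4} if done else {v6, v4}
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return task.result()  # first successful attempt wins
    finally:
        # Abandon whichever attempt is still in flight.
        for task in pending:
            task.cancel()
    raise ConnectionError("both IPv6 and IPv4 attempts failed")


async def blackholed_v6():
    """Simulates packets being silently dropped: no error, just no answer."""
    await asyncio.sleep(60)
    return "v6"


async def working_v4():
    """Simulates a healthy IPv4 path."""
    await asyncio.sleep(0.01)
    return "v4"


if __name__ == "__main__":
    print(asyncio.run(happy_eyeballs(blackholed_v6, working_v4)))
```

With the simulated black-holed IPv6 path, the call returns the IPv4 result shortly after the head-start delay instead of waiting for the IPv6 attempt to time out. For real sockets in Python, `loop.create_connection()` accepts a `happy_eyeballs_delay` argument that enables this behaviour without hand-written racing code.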
Hi guys, sorry for the late reply, but I took a few days off.
@Axel-Naumann : I had a look at the logs and here's what I see: The first request is sent at:
this triggers the client to open a connection; the client first fails to connect with IPv6 and then retries and succeeds with IPv4 (this takes about 1.5 min):
then the client gets redirected to a data server and again fails to connect to the IPv6 address but then succeeds with IPv4 (this again takes less than 1.5 min):
After a successful open, the client sends 3 read requests, a query and finally closes the file at:
In total it takes about 2 minutes 40 seconds, so I don't understand where the 40 minutes you mention come from. Once the connection is established, it is reused for all the requests the client issues.
It is tunable, you can set the connection window with
@bbockelm : we could implement this as a feature to be enabled by the user |
The original application that demonstrated the problem made several |
@eguiraud : OK, this explains the 40 minutes, thanks!
P.S. Well, it explains the 40 minutes as long as the files that were opened were located on different storage servers (which is most likely true).
Just as a cross check, could you verify that the full debug log (attached) of the >20 minute run is indeed due to the different storage servers, and not due to ROOT / xrootd forgetting that ipv6 isn't a viable option? |
Well, in a sense it is both. For each file you suffer from the connection timeout because you need to open a new connection to a different data server and the IPv6 address is not reachable. If all the files were on the same data server, the connection would be reused for each of them once it is open. You could avoid this effect of aggregated connection timeouts by opening those files in parallel.
P.S. As you might know, we now have the so-called declarative API, which makes it much easier to implement parallel access; maybe there's room for some collaboration in this area?
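To make the aggregation effect concrete, here is a small, self-contained Python sketch. It does not use XRootD or ROOT: `open_file` is a hypothetical stand-in that simply sleeps for a fixed "connection penalty" (scaled down from the roughly 1.5 minutes seen in the log), and the `root://` URLs are made up. It only assumes, as discussed above, that every file lives on a different data server and therefore pays the penalty once.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Scaled-down stand-in for the per-server IPv6 connection timeout.
CONNECT_PENALTY = 0.1  # seconds


def open_file(url):
    """Hypothetical open: pays the connection penalty for a new server."""
    time.sleep(CONNECT_PENALTY)
    return url


def open_sequential(urls):
    """Open files one after another: the penalties add up."""
    return [open_file(url) for url in urls]


def open_parallel(urls):
    """Open all files concurrently: the penalties overlap."""
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        return list(pool.map(open_file, urls))


def timed(fn, urls):
    """Return the wall-clock time fn(urls) takes, in seconds."""
    start = time.monotonic()
    fn(urls)
    return time.monotonic() - start


if __name__ == "__main__":
    # Made-up URLs, one per (hypothetical) data server.
    urls = [f"root://server{i}.example//data/file{i}.root" for i in range(5)]
    print(f"sequential: {timed(open_sequential, urls):.2f} s")
    print(f"parallel:   {timed(open_parallel, urls):.2f} s")
```

Sequentially, the total wait grows with the number of distinct servers; in parallel, it stays close to a single penalty. The same reasoning applies to the real timeouts, only with minutes instead of fractions of a second.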
I think this is understood, and there isn't really a bug to be fixed on the XRootD side (I looked at the logs and confirmed that the timeouts are for different servers), so I'm closing this ticket. Please open a new ticket if necessary. |
Hi,
With ROOT's build of its builtin xrootd-5.4.3 client on macOS 12.5.1 (and only there), just opening the file with
always hangs for several minutes with
Is this a known issue? Are we doing something wrong?
FYI @eguiraud who diagnosed this.