http TPC fails in mixed environments (IPv6 / dual stack vs. IPv4 only) #968

Closed
olifre opened this issue Apr 17, 2019 · 11 comments · Fixed by #969

Comments

@olifre
Contributor

olifre commented Apr 17, 2019

Example from the log on a redirector which is IPv4 only:

190415 05:50:29 9023 sysMacaroonGen: ID=a6d5a5c9-6d2a-48ec-8c2e-527ea117e541, resource=/cephfs/grid/atlas/atlaslocalgroupdisk/rucio/mc16_13TeV/b6/69/DAOD_HIGG4D2.16801697._000005.pool.root.1, name=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management, host=[::ffff:128.142.201.195], vorg=atlas, role=production, groups=/atlas /atlas/Role=production /atlas/Role=production/Capability=NULL, endorsements=/atlas/Role=production/Capability=NULL, base_activities=activity:READ_METADATA,UPLOAD,DOWNLOAD,DELETE,MANAGE,UPDATE_METADATA,LIST, user_caveat=activity:DOWNLOAD,LIST, expires=2019-04-15T06:01:29Z
190415 05:50:30 9022 XrootdBridge: /DC=ch/D.227:30@fts106.cern.ch login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management
190415 05:50:30 9031 Decode xrootd redirects /DC=ch/D.227:30@fts106.cern.ch to xrootd005.physik.uni-bonn.de:1094 /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/mc16_13TeV/b6/69/DAOD_HIGG4D2.16801697._000005.pool.root.1
190415 05:50:30 9022 XrootdXeq: /DC=ch/D.227:30@fts106.cern.ch disc 0:00:00 (send failure)
190415 05:50:30 9039 XrootdBridge: ddmadmin.228:26@fts106.cern.ch login as ddmadmin
190415 05:50:30 9031 Decode xrootd redirects ddmadmin.228:26@fts106.cern.ch to xrootd003.physik.uni-bonn.de:1094 /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/mc16_13TeV/b6/69/DAOD_HIGG4D2.16801697._000005.pool.root.1
190415 05:50:30 9039 XrootdXeq: ddmadmin.228:26@fts106.cern.ch disc 0:00:00 (send failure)
190415 05:50:30 9132 sysProcessPushReq: Starting a push request for resource https://webdav.grid.surfsara.nl:2882/pnfs/grid.sara.nl/data/atlas/atlasscratchdisk/rucio/mc16_13TeV/b6/69/DAOD_HIGG4D2.16801697._000005.pool.root.1
190415 05:50:30 9031 Decode xrootd gave ddmadmin err -2 'No servers are reachable via public IPv6 network to read the file.' /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/mc16_13TeV/b6/69/DAOD_HIGG4D2.16801697._000005.pool.root.1
190415 05:50:31 9131 XrdLink: Unable to receive from ddmadmin.0:30@fts106.cern.ch; connection reset by peer
190415 05:50:31 9131 XrdLink: Unable to send to ddmadmin.0:30@fts106.cern.ch; broken pipe

The partner site is dual-stacked, and the PUSH request failed with:
No servers are reachable via public IPv6 network to read the file.

Quoting @abh :

So, sadly, that means the http plugin is not setting the correct client capabilities.

@bbockelm
Contributor

@abh3 - I assume this is coming from here:

https://github.com/xrootd/xrootd/blob/master/src/XrdTpc/XrdTpcTPC.cc#L184

Do we need to add special opaque information to the URL to get the cmsd to do the correct thing with respect to IPv4 / IPv6?

@abh3
Member

abh3 commented Apr 18, 2019 via email

@bbockelm
Contributor

Why do you think this is coming from the bridge and not the SFS object? My understanding of @olifre's description is that everything else works except TPC (which is the only place where we invoke the filesystem directly).

@olifre
Contributor Author

olifre commented Apr 18, 2019

My understanding of @olifre's description is that everything else works except TPC

Just to confirm: That's true. To give examples, a streaming copy done by FTS works, GET / PUT works etc. Only TPC with our IPv4-only site against a dual-stacked site fails.

@abh3
Member

abh3 commented Apr 18, 2019 via email

@bbockelm
Contributor

@olifre - do you have the ability to test out a patch?

I think this will work:

diff --git a/src/XrdTpc/XrdTpcTPC.cc b/src/XrdTpc/XrdTpcTPC.cc
index 457bf28..8091ac6 100644
--- a/src/XrdTpc/XrdTpcTPC.cc
+++ b/src/XrdTpc/XrdTpcTPC.cc
@@ -181,6 +181,8 @@ int TPCHandler::OpenWaitStall(XrdSfsFile &fh, const std::string &resource,
 {
     int open_result;
     while (1) {
+        int orig_ucap = fh.error.getUCap();
+        fh.error.setUCap(orig_ucap | XrdOucEI::uIPv64);
         open_result = fh.open(resource.c_str(), mode, openMode, &sec,
                               authz.empty() ? NULL: authz.c_str());
         if ((open_result == SFS_STALL) || (open_result == SFS_STARTED)) {

It might take me a bit more time to completely reproduce your setup on my development VM.
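
To illustrate what the patch does with the capability bits, here is a minimal, self-contained sketch. It does not use the real XRootD headers: the `sketch::ErrInfo` class and the flag values are hypothetical stand-ins whose names only mirror `getUCap()`, `setUCap()` and `XrdOucEI::uIPv64` from the diff. The point is just that OR-ing the flag in adds the dual-stack capability without clearing anything the caller had already set.

```cpp
namespace sketch {

// Placeholder flag values, chosen for illustration only. The real
// XrdOucEI constants may use different bits.
constexpr int kIPv64 = 0x2;    // "can be served over IPv4 or IPv6"
constexpr int kOther = 0x100;  // some unrelated capability bit

// Hypothetical mock of an error-info object carrying the capability bits.
class ErrInfo {
public:
    int getUCap() const { return ucap_; }
    void setUCap(int cap) { ucap_ = cap; }

private:
    int ucap_ = 0;
};

// Mirror of the two added lines in the patch: read the current
// capabilities, then write them back with the dual-stack bit set.
inline void advertiseDualStack(ErrInfo &err) {
    const int orig_ucap = err.getUCap();
    err.setUCap(orig_ucap | kIPv64);
}

}  // namespace sketch
```

Because the update is a bitwise OR, calling `advertiseDualStack` on every pass through the retry loop is harmless: the second call leaves the value unchanged.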

@olifre
Contributor Author

olifre commented Apr 18, 2019

do you have the ability to test out a patch?

@bbockelm Sadly, still a no 😢 . We wanted to set up a test setup with DFS in addition to our current one long ago, but too many other windmills to fight came up (including IPv6 preparations, but these will still take a few months at least for a production solution).

@bbockelm
Contributor

Gotcha. Oddly enough, I don't have access to any hosts without IPv6, so I'm currently trying to remove "just enough" IPv6 support from Xrootd to reproduce the issue exactly.

bbockelm added a commit to bbockelm/xrootd that referenced this issue Apr 18, 2019
If we open a file using the SFS interface, the OFS plugin will, by default,
only query for servers that support IPv6.  If a cluster only has IPv4
addresses, then this will always fail.

This changes the default to dual-stack: the transfer can be serviced by
a server that speaks either IPv4 or IPv6 (or both!).  This is likely the
best we can do as we have no indication of whether the source side is IPv4,
IPv6, or dual stack itself.

Fixes xrootd#968
@bbockelm
Contributor

Ok, I figured out how to disable enough IPv6 on the development machine to trick xrootd into acting as an "ipv4-only" host. I was able to reproduce the issue and confirm that the fix works.

Basically, Xrootd tries to find a server that can perform the transfer and has some amount of protocol awareness. By default, it searches for a host that can handle an IPv6 transfer. I changed the default to query for hosts that can do transfers over either IPv4 or IPv6.
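
As a rough model of that selection step (not the actual cmsd logic), the sketch below filters a cluster by address family. The `Server` struct, the `selectServers` function and the family flags are all hypothetical names invented for this example. With an IPv4-only cluster, an IPv6-only query matches nothing, which corresponds to the "No servers are reachable via public IPv6 network" error, while a query for either family matches every server.

```cpp
#include <string>
#include <vector>

namespace sketch {

// Hypothetical address-family flags, for illustration only.
constexpr unsigned kV4 = 1u;
constexpr unsigned kV6 = 2u;

// Hypothetical description of one data server in the cluster.
struct Server {
    std::string name;
    unsigned families;  // bitwise OR of kV4 / kV6
};

// Return the names of all servers sharing at least one address
// family with the families the query asks for.
inline std::vector<std::string> selectServers(
        const std::vector<Server> &cluster, unsigned wanted) {
    std::vector<std::string> matches;
    for (const auto &server : cluster) {
        if ((server.families & wanted) != 0) {
            matches.push_back(server.name);
        }
    }
    return matches;
}

}  // namespace sketch
```

For a cluster of two IPv4-only servers, `selectServers(cluster, sketch::kV6)` returns an empty list, and `selectServers(cluster, sketch::kV4 | sketch::kV6)` returns both names.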

Now, this assumes that the remote side is compatible with your cluster (i.e., an IPv6-only source will be matched to an IPv4-only cluster ... but a failure will obviously still occur further downstream). It seems the assumption that the two sides are compatible is better than assuming the remote side is always IPv6-only.
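
The caveat can be made concrete with a small two-stage model. This is only a sketch under simplifying assumptions: `simulate`, `Outcome` and the family flags are hypothetical, and the real code path involves far more than two bitmask checks. Stage one is the server lookup, stage two is the actual connection between the chosen server and the remote endpoint.

```cpp
namespace sketch {

// Hypothetical address-family flags, for illustration only.
constexpr unsigned kV4 = 1u;
constexpr unsigned kV6 = 2u;

enum class Outcome {
    NoServerSelected,  // lookup fails, as in the original report
    ConnectFails,      // lookup succeeds, transfer fails downstream
    TransferPossible
};

// queried: families the lookup asks for
// server:  families the local data server supports
// remote:  families the remote endpoint supports
inline Outcome simulate(unsigned queried, unsigned server, unsigned remote) {
    if ((queried & server) == 0) {
        return Outcome::NoServerSelected;
    }
    if ((server & remote) == 0) {
        return Outcome::ConnectFails;
    }
    return Outcome::TransferPossible;
}

}  // namespace sketch
```

In this model, the reported case (IPv6-only query, IPv4-only server, dual-stack remote) ends in `NoServerSelected`. With the dual-stack query it becomes `TransferPossible`, and only the rare IPv6-only remote still fails, now at the connect stage.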

@simonmichal - this fix would be very good to backport.

@olifre
Contributor Author

olifre commented Apr 18, 2019

It seems the assumption that the two sides are compatible is better than assuming the remote side is always IPv6-only.

That's for sure the case: IPv6-only is still rare, but dual-stack vs. IPv4-only is pretty common. And I hope that IPv4-only will die out before IPv6-only arises; at least for us this will be true 😉.

Many thanks for the fix and explanation!
This would indeed be a good backport candidate. While IPv4-only servers are becoming rarer nowadays, our example shows they still exist. In case people are curious (that's now of course off-topic!): Our main blocker preventing IPv6 introduction is a controllable DHCPv6 workflow. Many clients don't play well with DHCPv6 (e.g. the client-id changing, especially when installing a machine via PXE) and we don't want to go down the routes of full autoconfiguration or full static configuration (both would be easy and straightforward, but come at a price).

ffurano pushed a commit to ffurano/xrootd-fbx4 that referenced this issue May 2, 2019
@olifre
Contributor Author

olifre commented May 15, 2019

For the record, I can confirm this fixes our issue in practice 👍 Thanks again!
