xrootd5:: incompatible communication between v4 and v5? #1324

Closed
adriansev opened this issue Nov 10, 2020 · 17 comments

Comments

@adriansev
Contributor

So, I tried xrootd 5 and so far I got this:

  1. the cms.perf directive is incompatible with the existing cms_monPerf script that was running in v4;
    in cmslog I get this:
    201110 03:16:56 9534 Meter: Perf monitor returned invalid output:
  2. even if I comment this out, I get in cmslog:
201110 03:18:16 10200 Meter: Found 2 filesystem(s); 216TB total (77% util); 51TB free (25TB max)
------ server@storage05.spacescience.ro phase 2 server initialization completed.
------ cmsd server@storage05.spacescience.ro:8853 initialization completed.
201110 03:18:16 10223 Start: Waiting for primary server to login.

and in xrdlog:

201110 03:18:17 10232 sysConfig: Configured as HTTP(s) data server.
------ HTTP protocol initialization completed.
------ xrootd server@storage05.spacescience.ro:1094 initialization completed.

and everything is stuck; the server is not registering with the redirector.
The logs can be seen here: https://cernbox.cern.ch/index.php/s/qzeEISru1gbL3uS
I would have expected some breakage in the ALICE xrootd plugins, but not for it to be completely dysfunctional...
Any idea what is going on? Under which conditions has v5 worked successfully?
Thank you!

@adriansev
Contributor Author

I've seen https://xrootd.slac.stanford.edu/doc/dev50/R5-Issues.htm, but even with every path declaration commented out it still does not work; see the nopath directory in the above cernbox share.

@adriansev
Contributor Author

Given the breakage, I would suggest that v5 be pulled from EPEL (where there are no old versions, only the latest is present) and kept only in the xroot repos for testing new versions (which were helpful to me in recovering the server, because I could not get back to v4 otherwise).

@xrootd-dev

xrootd-dev commented Nov 10, 2020 via email

@abh3
Member

abh3 commented Nov 10, 2020 via email

@adriansev
Contributor Author

So, the monPerf does not matter, as the problem persists even with its directive commented out.
The actual script that was used so far and was working in v4 is https://github.com/xrootd/xrootd/blob/master/utils/cms_monPerf

@abh3
Member

abh3 commented Nov 10, 2020 via email

@adriansev
Contributor Author

@abh3 see my cernbox share from my initial submission; the nopath directory within contains the configuration and logs from commenting out the {pid,admin}path directives and running the processes in debug mode.

@abh3
Member

abh3 commented Nov 10, 2020

OK, so according to the xrootd log it connected to the cmsd
201110 03:18:17 10245 cms_Finder: Connected to cmsd via /home/aliprod/alicexrdrun/admin/server/.olb/olbd.admin
According to the cmsd log it did not
201110 03:18:16 10223 Start: Waiting for primary server to login.

So, I've seen this before and it usually winds up being the case that a phantom cmsd is still running on that host. So, it did connect but not to the one you expected it to connect to. Could you check to see what is actually running? A ps -ef with a grep will usually illuminate the problem unless the "phantom" cmsd was killed but winds up still executing because the kernel can't get rid of it. Rare but it does happen.
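The `ps -ef` plus grep check suggested above could be sketched as follows. This is a minimal sketch, assuming the Linux procps column layout (`UID PID PPID C STIME TTY TIME CMD`); the helper names `find_processes` and `running_cmsd` are illustrative and not part of XRootD.

```python
import os
import subprocess


def find_processes(ps_output, name):
    """Return (pid, ppid, command) for each `ps -ef` row whose executable is `name`."""
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        # Assumed columns: UID PID PPID C STIME TTY TIME CMD
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        command = fields[7]
        # Match on the executable itself so "grep cmsd" is not reported.
        executable = os.path.basename(command.split()[0])
        if executable == name:
            matches.append((int(fields[1]), int(fields[2]), command))
    return matches


def running_cmsd():
    """List cmsd processes on this host (hypothetical convenience wrapper)."""
    out = subprocess.run(
        ["ps", "-ef"], capture_output=True, text=True, check=True
    ).stdout
    return find_processes(out, "cmsd")
```

If `running_cmsd()` reports more than one cmsd for the same instance, an older one may still be holding the admin socket that the xrootd connected to.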

@abh3
Member

abh3 commented Nov 10, 2020

Ah, a few other things to get a clearer picture. I assume you are not using containers or virtual machines for any of this. If that is not the case, please let me know what the actual running setup is. Additionally, could you go back to the logs and post what the running log shows several minutes after what has been posted? From the logs I don't know where the cutoff was. According to the messages, the cmsd put up an accept a full second before the corresponding connect. So, that is odd.
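The one-second gap can be checked directly from the two log lines quoted earlier in the thread. A minimal sketch, assuming the log prefix is `YYMMDD HH:MM:SS pid message` (inferred from the quoted lines, not from a format specification) and that both logs were written on the same host clock; `parse_log_line` is an illustrative helper:

```python
from datetime import datetime


def parse_log_line(line):
    """Split a log line into (timestamp, pid, message); layout is an assumption."""
    date, time, pid, message = line.split(None, 3)
    stamp = datetime.strptime(date + " " + time, "%y%m%d %H:%M:%S")
    return stamp, int(pid), message


cmsd_line = "201110 03:18:16 10223 Start: Waiting for primary server to login."
xrootd_line = (
    "201110 03:18:17 10245 cms_Finder: Connected to cmsd via "
    "/home/aliprod/alicexrdrun/admin/server/.olb/olbd.admin"
)

# Difference between the xrootd connect and the cmsd "waiting" message.
gap = parse_log_line(xrootd_line)[0] - parse_log_line(cmsd_line)[0]
print(gap.total_seconds())  # prints 1.0
```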

@simonmichal
Contributor

@adriansev : we will address this issue ASAP and if necessary push a patch to EPEL

We run regular tests with xrootd/cmsd setup with XRootD5, so another possibility is to try running a very basic xrootd+cmsd setup to see if that works for you (it should, it does work in our test suite) and then bisect over your config file to see which part causes the issue.
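Bisecting over a config file could look roughly like the sketch below. It assumes a known-good base config, exactly one offending directive among the extra lines, and directives that can be toggled independently. The `works` callback is hypothetical: in practice it would write out the candidate config, start xrootd and cmsd, and report whether the server logged in to the redirector.

```python
def bisect_config(base, extra, works):
    """Return the single directive in `extra` that breaks a working `base`.

    `works(lines)` is a caller-supplied check (hypothetical here) that
    returns True when the services start correctly with those lines.
    """
    if not works(base):
        raise ValueError("base config already fails")
    if works(base + extra):
        return None  # nothing to find
    good = list(base)
    candidates = list(extra)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if works(good + half):
            good += half  # first half is fine, keep it enabled
            candidates = candidates[len(half):]
        else:
            candidates = half  # culprit is in the first half
    return candidates[0]
```

With n optional directives this needs about log2(n) restarts of the services instead of n.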

BTW Would it be possible for you to run against our release candidates in your test suite so we can detect early possible problems in the future?

@olifre
Contributor

olifre commented Nov 10, 2020

the actual script that was used so far and was working in v4 is https://github.com/xrootd/xrootd/blob/master/utils/cms_monPerf

FWIW, I am using:

cms.perf int 60s pgm /usr/share/xrootd/utils/cms_monPerf 30

with /usr/share/xrootd/utils/cms_monPerf shipped with 5.0.2 here, and observe neither the connection issues nor the monitor error message (and I can confirm from the process tree that the monitoring script is running successfully as a subprocess). I am using the packages from http://xrootd.org/binaries/stable/slc/7/x86_64 here. But of course, Andy is right and the main issue (xrootd not connecting) should be the first focus here.

@adriansev
Contributor Author

@adriansev : we will address this issue ASAP and if necessary push a patch to EPEL

We run regular tests with xrootd/cmsd setup with XRootD5, so another possibility is to try running a very basic xrootd+cmsd setup to see if that works for you (it should, it does work in our test suite) and then bisect over your config file to see which part causes the issue.

BTW Would it be possible for you to run against our release candidates in your test suite so we can detect early possible problems in the future?

@simonmichal well, I have no test suite .. this was tried in production :D .. moreover, at this moment I do not have even the simplest hardware to set up a test storage system.
As for the basic config, I think that I have the most basic configuration: a simple xrootd server that subscribes to a redirector.
Do you see anything more than basic in my config? (besides alicetokenacc and n2n translation, which are irrelevant to service starting)

@abh3
Member

abh3 commented Nov 11, 2020

Any more information on the questions I asked? My point was that the xrootd did connect to some cmsd; it just wasn't the cmsd you thought it would connect to.

@adriansev
Contributor Author

@abh3 sorry, I forgot to answer :( So, this is a bona fide production server on which I am doing the test, so no containers. Also, the logs stop there; nothing moves after that point. We can move the discussion to private mail and I can give access to a sacrificial server.

@abh3
Member

abh3 commented Nov 11, 2020 via email

@adriansev
Contributor Author

@abh3 @simonmichal I updated xrootd from xroot-testing to 5.0.3-0.rc1 and it seems that the problem is solved. The server is seen and registered with the redirector. Waiting for the EPEL release.. I think that I can close this ticket...

@abh3
Member

abh3 commented Nov 13, 2020

Apparently, this has been solved by fetching the latest release.

@abh3 abh3 closed this as completed Nov 13, 2020