hang or crash inside XrdClient seen after fork with XRD_ENABLEFORKHANDLERS set #244
(I don't know how to attach a source file in github):
|
Hi David, OK, I tried to reproduce this problem with the latest release and it Andy On Wed, 17 Jun 2015, smithdh wrote:
|
Hi Andy, Thanks very much for looking at this and trying my problem reproducer. I've seen problems when using xrootd 3.3.6 and xrootd 4.2.1, on SL6, x86_64. When I tried it, my minimal program showed problems in more than 50% of runs, so I'd expect at least one failure in, say, 5 runs. Strictly speaking, however, it relies on undefined behaviour to make it crash or hang, so there may be no number of runs that is guaranteed to show the problem. But since it was so frequent in my tests, I took the risk that it would show the problem for you too. Would you mind trying: csh to see if one of the runs fails? If not, I clearly need to write better reproducer instructions or give some other information. Yours, |
Hi. For completeness, here's a link to the ROOT ticket where this particular problem was reported by the ATLAS group: https://sft.its.cern.ch/jira/browse/ROOT-7416 There are also further links in that ROOT ticket, which Stewart provided; they go back to a separate incident earlier this year which Lukasz was involved with. This was why ATLAS started to set XRD_ENABLEFORKHANDLERS in their environment. |
Just to be accurate, the problem that ATLAS reported in ROOT-7416 wasn't specifically about XRD_ENABLEFORKHANDLERS, but I assert that the variable was the cause of the problem. I've heard subsequently that they are able to run the workflow without the variable set. |
They started to enable the fork handlers because they started to fork Athena in order to use the kernel page sharing for the detector conditions data. AFAIK CMS has used all this for years without problems even with the old client. |
Oof - CMS never really used the forking mode in large-scale production. I suspect it was run more in unit tests than anywhere else! For Run II, we did multithreading instead - and that's actually seeing light-of-day. |
That's kind of surprising, given how much you pushed for it, but it means that it is probably a trivial issue. Happy debugging :) |
Unfortunately, it is not a trivial issue. The problem is that whenever The main question is "what context does the child process expect?". It's Andy On Fri, 19 Jun 2015, Lukasz Janyst wrote:
|
AFAIR the fork safety in the old client was done under the assumption that at the moment of forking there are no open files and therefore there should be no network traffic. This means that all the connections can be wiped out in both the parent and the child. Perhaps the email conversation pasted below can be of some use. Lukasz
|
And looking at David's code, the exact opposite is happening. This was never meant to work under these conditions. |
In the old client there is no way to safely abort or pause the ongoing traffic. It can be implemented, but it's a lot of gymnastics and I would not bother. The new client works fine under these circumstances. Unless you want to implement a major feature in obsolete code, I would close this issue, especially since ATLAS claims that the problematic release is deprecated anyway. |
From INC0717037.
|
Hi David et al, We now understand the problem. The old client is not geared for forking if any files are open. That means all open files must be closed prior to forking. The test case supplied did not follow that rule and I suspect that the failing ATLAS code does not follow that rule either. I am of the opinion that this problem should be closed as user error. If ATLAS needs files to remain open across a fork() then they will need to upgrade to the 4.x release and use the new client. I know that may be problematic because (until this is resolved) ROOT is not built against 4.x. Furthermore, certain C-library calls (like popen()) do an implicit fork which, when fork handlers are enabled, will cause the program to fail. Yes, in those cases it's not obvious that all files need to be closed before making such a call. Fixing this is a rather odious task. What is your opinion? |
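The "close everything before fork()" rule described above can be sketched as follows. This is only an illustration under stated assumptions: it uses a plain POSIX file descriptor as a stand-in for an old-client file, it does not call any XrdClient API, and the helper name `child_reads_after_fork` is hypothetical.

```cpp
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not part of XrdClient): read one byte in the parent,
// close the file BEFORE forking, and let the child open its own handle.
// Returns true if the child managed to read one byte from `path`.
bool child_reads_after_fork(const char* path) {
    char byte = 0;

    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;
    if (read(fd, &byte, 1) != 1) { close(fd); return false; }

    // The rule from the thread: nothing may be open at the moment of the fork.
    close(fd);

    pid_t pid = fork();
    if (pid < 0) return false;

    if (pid == 0) {
        // Child: start from a clean state and open a fresh handle.
        int cfd = open(path, O_RDONLY);
        if (cfd < 0) _exit(1);
        ssize_t n = read(cfd, &byte, 1);
        close(cfd);
        _exit(n == 1 ? 0 : 1);
    }

    // Parent: wait for the child; it would reopen the file here if needed.
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return false;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

With the old client the same ordering would apply to every remote file: close, fork (or call something like popen() that forks implicitly), then reopen in whichever process needs the file.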
There is no way to programmatically distinguish popen from fork in the handlers. The new client never aborts the connections in the parent; it just kills the problematic threads and takes all the locks, so that the child ends up in a clean state. After the fork the parent just restarts the poller thread and continues with the old connections and sessions still valid, and the child goes through recovery on an on-demand basis. This behavior allows all the file and filesystem handles (XrdCl::File and XrdCl::FileSystem) to stay valid and usable in both the child and the parent. In the old client, you can enable/disable fork handlers from C++ if no envvars are set. Perhaps that's the way ATLAS should go. |
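The "take all the locks, then let the child start clean" idea can be sketched with the standard `pthread_atfork` mechanism. This is a minimal sketch, not the XrdCl implementation: `g_state_lock`, `g_connections` and the function names are hypothetical, and the stopping and restarting of the poller thread that the real client performs is omitted.

```cpp
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical client-wide state, guarded by one lock.
pthread_mutex_t g_state_lock = PTHREAD_MUTEX_INITIALIZER;
int g_connections = 0;

// Runs in the forking thread just before fork(): take the lock so that no
// other thread can be in the middle of modifying the state.
void before_fork() { pthread_mutex_lock(&g_state_lock); }

// Runs in the parent after fork(): keep the existing connections.
void after_fork_parent() { pthread_mutex_unlock(&g_state_lock); }

// Runs in the child after fork(): drop the inherited state so that it is
// rebuilt on demand, then release the lock.
void after_fork_child() {
    g_connections = 0;
    pthread_mutex_unlock(&g_state_lock);
}

// Registers the three handlers; returns true on success.
bool install_fork_handlers() {
    return pthread_atfork(before_fork, after_fork_parent, after_fork_child) == 0;
}

// Forks and reports, via the child's exit status, how many connections the
// child sees. Returns -1 on error.
int connections_seen_by_child() {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        // The lock is usable here because the child handler released it.
        pthread_mutex_lock(&g_state_lock);
        int n = g_connections;
        pthread_mutex_unlock(&g_state_lock);
        _exit(n);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

After `install_fork_handlers()`, the parent's `g_connections` is unchanged by a fork while the child reports zero, which mirrors the description above: the parent continues with its sessions and the child recovers lazily. Note that the handlers fire for any fork(), which is why, as stated in the thread, they cannot tell an explicit fork apart from one made inside a library call.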
Thanks a lot to everyone. (And thanks Lukasz for following up with details). I'm passing on the information. I'll invite them to contact the team directly if they would like, on the XROOTD-L list or on this issue tracker. Or if I get relevant questions or requests I'll let you know. |
I am closing this as the old client doesn't support open files when a fork is done. |
Some ATLAS users reported crashes or hangs of their workflow; I've made some checks and believe it was due to a (relatively) recent addition to their ATLAS environment which set XRD_ENABLEFORKHANDLERS. The workflow was using xroot file access via XrdPosixXrootd.so.1 (so using XrdClient). The OS is 64-bit SL(C)6. The process forked a child process (due to a popen()).
I've made a minimal program which shows a problem fairly often, which directly uses XrdClient (either libXrdClient.so.1 from the 3.3.6 compatibility package or with .so.2 from 4.2.1). e.g.
(edit fork_handler_test2.cpp to open a file on a convenient xrootd server)
c++ fork_handler_test2.cpp -I/usr/include/xrootd -lXrdClient
export XRD_ENABLEFORKHANDLERS=1
./a.out
Observed result: in a fairly high fraction of attempts the program hangs or gives a segmentation fault. Expected result: the program reads 1 byte from the file and produces no output.