Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hang or crash inside XrdClient seen after fork with XRD_ENABLEFORKHANDLERS set #244

Closed
smithdh opened this issue Jun 17, 2015 · 17 comments
Closed
Assignees

Comments

@smithdh
Copy link
Contributor

smithdh commented Jun 17, 2015

Some ATLAS users reported crashes or hangs of their workflow; I've made some checks and believe it was due to a (relatively) recent addition to their ATLAS environment which set XRD_ENABLEFORKHANDLERS. The workflow was using xroot file access via XrdPosixXrootd.so.1 (so using XrdClient). The OS is 64bit SL(C)6. The process did fork a child process (due to a popen()).

I've made a minimal program which shows a problem fairly often, which directly uses XrdClient (either libXrdClient.so.1 from the 3.3.6 compatibility package or with .so.2 from 4.2.1). e.g.

(edit fork_handler_test2.cpp to open a file on a convenient xrootd server)

c++ fork_handler_test2.cpp -I/usr/include/xrootd -lXrdClient

export XRD_ENABLEFORKHANDLERS=1
./a.out

observed result is that in a fairly high fraction of attempts the program hangs or give a segmentation fault. Expected result is to read 1 byte from the file, no output.

@smithdh
Copy link
Contributor Author

smithdh commented Jun 17, 2015

(I don't know how to attach a source file in github):

#include <XrdClient/XrdClient.hh>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>

int main()
{
   XrdClient *xrd = new XrdClient("root://xroot.example.com//path/non_empty_file");

   if (!xrd) {
      fprintf(stderr, "Could not make an XrdClient\n");
      return 1;
   }

   bool bret = xrd->Open(0, kXR_open_read, false);

   if (!bret) {
      fprintf(stderr, "Could not open file\n");
      return 1;
   }

   FILE *stream = popen("/bin/true", "r");

   if (!stream) {
      fprintf(stderr, "Could not popen\n");
      return 1;
   }

   pclose(stream);

   int rlen = 1;
   long long off =0;
   char buf[1];
   int iret = xrd->Read(buf, off, rlen);

   if (iret != 1) {
     fprintf(stderr, "Unexpected return value from Read\n");
     return 1;
   }

   xrd->Close();
   delete xrd;

   return 0;
}

@abh3
Copy link
Member

abh3 commented Jun 18, 2015

Hi David,

OK, I tried to reproduce this problem with the latest release and it
worked every time. How many tries do I need? Also, what version are you
using for this test. I know there were several bugs corrected in this code
path in a 3.x release.

Andy

On Wed, 17 Jun 2015, smithdh wrote:

(I don't know how to attach a source file in github):

#include <XrdClient/XrdClient.hh>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>

int main()
{
XrdClient *xrd = new XrdClient("root://xroot.example.com//path/non_empty_file");

if (!xrd) {
fprintf(stderr, "Could not make an XrdClient\n");
return 1;
}

bool bret = xrd->Open(0, kXR_open_read, false);

if (!bret) {
fprintf(stderr, "Could not open file\n");
return 1;
}

FILE *stream = popen("/bin/true", "r");

if (!stream) {
fprintf(stderr, "Could not popen\n");
return 1;
}

pclose(stream);

int rlen = 1;
long long off =0;
char buf[1];
int iret = xrd->Read(buf, off, rlen);

if (iret != 1) {
fprintf(stderr, "Unexpected return value from Read\n");
return 1;
}

xrd->Close();
delete xrd;

return 0;
}


Reply to this email directly or view it on GitHub:
#244 (comment)

@smithdh
Copy link
Contributor Author

smithdh commented Jun 19, 2015

Hi Andy,

Thanks very much for looking at this and trying my problem reproducer. I've seen problems when using xrootd 3.3.6 and xrootd 4.2.1, on SL6, x86_64.

When I was trying my minimal program was showing problems in more than 50% of runs. So I'd expect the example to show at least one run with problems in, say 5 runs. However strictly speaking it relies that some undefined behaviour will make it crash or hang, so there may be no amount of runs which will show the problem. But, since it was so frequent in my test I took the risk that it would indeed show the problem for you.

Would you mind trying:

csh
setenv XRD_ENABLEFORKHANDLERS 1
setenv MALLOC_CHECK_ 2
setenv MALLOC_PERTURB_ 85
repeat 1000 ./a.out

to see if one of the runs fails? If not I see clearly need to make a better reproducer instructions or give some other information.

Yours,
David

@smithdh
Copy link
Contributor Author

smithdh commented Jun 19, 2015

Hi.

For complete information here, here's a link to the ROOT ticket which is where this particular problem was report by the ATLAS group:

https://sft.its.cern.ch/jira/browse/ROOT-7416

there are also further links in that ROOT ticket, which Stewart provided; They go back to a separate incident earlier this year which Lukasz was involved with. This was why ATLAS started to set XRD_ENABLEFORKHANDLERS in their environment.

@smithdh
Copy link
Contributor Author

smithdh commented Jun 19, 2015

here's a link to the ROOT ticket which is where this particular problem

Just to be accurate, the problem that atlas reported in ROOT-7416 wasn't specifically about XRD_ENABLEFORKHANDLERS, but I assert that that was the cause of the problem. I've heard subsequently that they are able to run the workflow without the variable set.

@ljanyst
Copy link
Contributor

ljanyst commented Jun 19, 2015

They started to enable the fork handlers because they started to fork Athena in order to use the kernel page sharing for the detector conditions data. AFAIK CMS has used all this for years without problems even with the old client.

@bbockelm
Copy link
Contributor

Oof - CMS never really used the forking mode in large-scale production. I suspect it was run more in unit tests than anywhere else!

For Run II, we did multithreading instead - and that's actually seeing light-of-day.

@ljanyst
Copy link
Contributor

ljanyst commented Jun 19, 2015

That's kind of surprising, given how much you pushed for it, but it means that it is probably a trivial issue. Happy debugging :)

@abh3
Copy link
Member

abh3 commented Jun 19, 2015

Unfortunately, it is not a trivial issue. The problem is that whenever
fork() is called the currennt connections in the parent are destroyed
making it impossible for the parent to continue execution in the current
context without getting a SEGV. The connections in the child are also
destroyed. Whether that's questionable as well. Does the child process
expect to reuse open connections created by the parent? If not, a fix may
be possible. Otherwise, it's rather complicated to fix this.

The main question is "what context does the child process expect?". It's
clear from the way programs are written, the parent process expects
established connections to still be alive. What does the child process
expect?

Andy

On Fri, 19 Jun 2015, Lukasz Janyst wrote:

That's kind of surprising, given how much you pushed for it, but it means that it is probably a trivial issue. Happy debugging :)


Reply to this email directly or view it on GitHub:
#244 (comment)

@ljanyst
Copy link
Contributor

ljanyst commented Jun 19, 2015

AFAIR the the fork safety in the old client was done under the assumption that at the moment of forking there is no open files and therefore there should be no network traffic. This means that all the connections can be wiped out in both the parent and the child.

Perhaps the email conversation pasted below can be of some use.

Lukasz

----- Original message -----
From: Lukasz Janyst <ljanyst@cern.ch>
To: Brian Bockelman <bbockelm@cse.unl.edu>
Cc: Andrew Hanushevsky <abh@slac.stanford.edu>, Peter Elmer <Peter.Elmer@cern.ch>, Christopher Jones <cdj@fnal.gov>, xrootd-dev <xrootd-dev@slac.stanford.edu>, Philippe Canal <pcanal@fnal.gov>
Subject: Re: Fork safety of XrdClient
Date: Wed, 12 Jan 2011 10:33:25 +0100

If your TFile object is deleted before the fork it should be fine to
just install the fork handlers. The destructors call everything that
is required for this to work.

   Lukasz

On Wed, Jan 12, 2011 at 9:13 AM, Lukasz Janyst <ljanyst@cern.ch> wrote:
> Hi Brian,
>
>   the disconnection stuff is exactly one of the things that I still
> have to work on. For some reason that I yet have to discover the
> Disconnect method is not called on Close, but still tries to commit
> some requests... Disconnect cannot be called from the fork handlers
> because they know only about the global ConnectionManager which
> handles Physical and Logical connections, the XrdClient object however
> has it's own Connection object which combines the two and is not
> registered to the ConnectionManager.
>
> Cheers,
>   Lukasz
>
>
>
> On Tue, Jan 11, 2011 at 8:38 PM, Brian Bockelman <bbockelm@cse.unl.edu> wrote:
>> Hi Lukasz,
>>
>> Thanks.  I have a plethora of teleconferences to attend today, but hopefully will hit this sometime in the next 24 hours.
>>
>> Why do we have to both install the fork handlers and disconnect the client connections?  Can't the latter just be done inside the fork handler (or maybe I'm asking the question too soon, and that's what prototype #2 does...)?
>>
>> My issue is that, in the current set of abstractions, we have no way to distinguish "please close the file and disconnect the client" and "please close the file".  The forking and the I/O handling code is done in two completely separate parts of the CMS framework.  Further, I'm not even sure of the lifetime of the file handle; it may be the TFile object is long gone by time we decide to fork.
>>
>> Brian
>>
>> On Jan 11, 2011, at 11:54 AM, Lukasz Janyst wrote:
>>
>>> Hi Brian,
>>>
>>>   I attach the patch for the first prototype. It hasn't been tested
>>> much (apart from the most simplest case) so it may cause your stuff to
>>> crash. It definitely still needs some work, but it would be helpful if
>>> you could try it out and give me some feedback.
>>>
>>>   Some remarks:
>>>
>>>   * at the initialization time you should call once and only once:
>>> XrdClientEnv::InstallForkHandlers()
>>>   * before forking, when you're done with your reading you should say:
>>>
>>>  cl.Close();
>>>  cl.GetClientConn()->Disconnect( FALSE );
>>>
>>>   for all your files opened with xroot (cl is an instance of the
>>> XrdClient class).
>>>
>>> Cheers,
>>>   Lukasz
>>>

@ljanyst
Copy link
Contributor

ljanyst commented Jun 19, 2015

An looking at David's code, the exact opposite is happening. This was never meant to work under these conditions.

@ljanyst
Copy link
Contributor

ljanyst commented Jun 19, 2015

In the old client there is no way to safely abort or pause the on-going traffic. It can be implemented, but it's a lot of gymnastics and I would not bother. The new client works fine under these circumstances. Unless you want to implement a major feature in an obsolete code, I would close this issue, especially that ATLAS claims that the problematic release is deprecated anyways.

@ljanyst
Copy link
Contributor

ljanyst commented Jun 19, 2015

From INC0717037.

12-06-2015 16:55:00 - Emil Obreshkov
H Belinda,

We see the mentioned problem in an old release branch (still 17 releases) which we expect will stop using soon so to safe people some time lets move over and that ticket can be closed since we have a workaround.

Thank you,
Emil

@abh3
Copy link
Member

abh3 commented Jun 19, 2015

Hi David et al,

We now understand the problem. The old client is not geared for forking if any files are open. That means all open files must be closed prior to forking. The test case supplied did not follow that rule and I suspect that the failing ATLAS code does not follow that rule as well. I am of the opinion that this problem should be closed as user error. If ATLAS needs files to remain open across a fork() then they will need to upgrade to the 4.x release and use the new client. I know that may be problematic because (until this is resolved) root is not built against 4.x. Furthermore, certain C-Library calls (like popen()) do an implicit fork which, when fork handlers are enabled, will cause the program to fail. Yes, in those cases it's not obvious that all files need to be closed before making such a call. Fixing this is a rather odious task. What is your opinion?

@ljanyst
Copy link
Contributor

ljanyst commented Jun 20, 2015

There is no way to programatically distinguish popen from fork in the handlers. The new client never aborts the connections in the parent, just kills the problematic threads and takes all the locks, so that the child ends up in the clean state. After the fork the parent just restarts the poller thread and continues with the old connections and sessions still valid, and the child goes through recovery on the on-demand basis. This behavior allows for all the file and filesystem handles (XrdCl::File and XrdCl::Filesystem) to stay valid and usable in both the child and the parent.

In the old client, you can enable/disable fork handlers from C++ if no envvars are set. Perhaps that's the way ATLAS should go.

@smithdh
Copy link
Contributor Author

smithdh commented Jun 23, 2015

Thanks a lot to everyone. (And thanks Lukasz for following to give details). I'm passing on the information. I'll invite them to contact the team directly if they would like, on the XROOTD-L list on this issue tracker. Or if I get relevant questions or requests I'll let you know.

@abh3
Copy link
Member

abh3 commented Jul 11, 2015

I am closing this as the old client doesn't support open files when a fork is done.

@abh3 abh3 closed this as completed Jul 11, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants