XrdHttp loses requests under modest concurrency #810

Closed
bbockelm opened this issue Aug 29, 2018 · 4 comments

@bbockelm
Contributor

After applying a workaround for issue #809, I notice that ab fails quite consistently after only a small number of repeated requests:

$ ab -k -s 30 -n 400 -c 100 http://host.example.com/some/path/hello_world
This is ApacheBench, Version 2.3 <$Revision: 1430300 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking hcc-briantest7.unl.edu (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
apr_pollset_poll: The timeout specified has expired (70007)
Total of 398 requests completed

This is a fairly vanilla setup; it should be reproducible with the HTTP module on top of a POSIX filesystem.

I haven't been able to diagnose it precisely, but it seems to only occur when this line pops into the log:

180828 16:07:34 880  ofs_Stall: Stall 3: File hello_world is being staged; estimated time to completion 3 seconds for /some/path/hello_world

I admit, I don't understand why the file would ever be considered as being staged with the default OFS plugin. However, the behavior is very much as if there's a callback not occurring.

NOTE: this issue is based on an investigation into user complaints about the service. The benchmark may be synthetic, but it shows something actually observed in the wild.

@bbockelm
Contributor Author

I should mention the workaround to get ab working is this one:

--- a/src/XrdHttp/XrdHttpProtocol.cc
+++ b/src/XrdHttp/XrdHttpProtocol.cc
@@ -1376,6 +1376,7 @@ int XrdHttpProtocol::StartSimpleResp(int code, const char *desc, const char *hea
     else ss << "Unknown";
   }
   ss << crlf;
+  ss << "Connection: Keep-Alive" << crlf;
 
   if (bodylen >= 0) ss << "Content-Length: " << bodylen << crlf;
 

Not a complete fix to #809, but good enough to enable testing.

@bbockelm
Contributor Author

The ofs_Stall is not coming from the OSS but from here:

https://github.com/xrootd/xrootd/blob/master/src/XrdOfs/XrdOfsHandle.cc#L203

It appears there's modest contention on the file descriptor table, which does not play particularly well with what appears to be an ad-hoc implementation of a timed lock:

https://github.com/xrootd/xrootd/blob/master/src/XrdOfs/XrdOfsHandle.cc#L504

It's not obvious why one would use that instead of a wrapper around pthread_mutex_timedlock; the hand-rolled version appears to offer guarantees similar to the standard function's, but with worse performance.
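
For illustration, a thin wrapper around pthread_mutex_timedlock could look roughly like the sketch below. This is not XRootD code: the helper name `timed_lock` and the millisecond parameter are assumptions made for the example, and it assumes a non-negative timeout.

```cpp
#include <pthread.h>
#include <time.h>

// Hypothetical helper (not part of XRootD): try to acquire `mtx`, giving up
// after roughly `timeout_ms` milliseconds. Returns true if the lock was taken.
bool timed_lock(pthread_mutex_t &mtx, long timeout_ms)
{
    // pthread_mutex_timedlock expects an absolute CLOCK_REALTIME deadline,
    // so start from "now" and add the relative timeout.
    timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;

    // Normalize so that tv_nsec stays below one second.
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec  += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    // 0 means the lock was acquired; ETIMEDOUT (or any other error)
    // is reported to the caller as a failure.
    return pthread_mutex_timedlock(&mtx, &deadline) == 0;
}
```

With a helper of this shape, a caller would report a stall only when `timed_lock` returns false, and the waiting itself would be left to the pthread implementation rather than a polling loop.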

@abh3
Member

abh3 commented Aug 29, 2018 via email

@bbockelm
Contributor Author

Ok, found the culprit. It's here:

https://github.com/xrootd/xrootd/blob/master/src/XrdXrootd/XrdXrootdTransit.cc#L418

If a stall occurs (it is easy to cause one by tweaking the XrdOfsHandle::WaitLock code to never wait), then the XrdXrootdProtocol object is scheduled to re-run at a later point. It's re-run in that function.

However, unlike the other overload of XrdXrootdTransit::Process, it doesn't appear to invoke the appropriate callback for realProt->Process, meaning the callback doesn't get pushed until the client itself causes activity (such as disconnecting).
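
To make the suspected failure mode concrete, here is a toy model of it. It is only a sketch: `ToyRequest`, `ToyBridge`, and the `notify` flag are hypothetical names invented for this example and do not mirror the real XrdXrootdTransit classes. It shows a request that stalls once, is resumed later, and produces a response only if the resume path also invokes the completion callback.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <utility>

// Toy model only: these types are hypothetical and are not XRootD classes.
struct ToyRequest {
    int attempts = 0;
    // Stalls on the first attempt and completes on the second.
    bool process() { return ++attempts > 1; }
};

class ToyBridge {
public:
    using Callback = std::function<void(const std::string &)>;

    explicit ToyBridge(Callback onDone) : onDone_(std::move(onDone)) {}

    // Path driven by client activity: completes the request or defers it.
    void processFromClient(ToyRequest &req) {
        if (req.process()) onDone_("response");
        else deferred_.push(&req);
    }

    // Path driven by the scheduler after a stall. With notify == false the
    // request finishes but nobody is told, which models the lost response.
    void resumeDeferred(bool notify) {
        std::queue<ToyRequest *> pending;
        pending.swap(deferred_);
        while (!pending.empty()) {
            ToyRequest *req = pending.front();
            pending.pop();
            if (!req->process()) { deferred_.push(req); continue; }
            if (notify) onDone_("response");
        }
    }

private:
    Callback onDone_;
    std::queue<ToyRequest *> deferred_;
};
```

In this model, calling `resumeDeferred(false)` completes the request silently, while `resumeDeferred(true)` delivers the response, which is the behavior one would expect both paths to share.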
