Skip to content
This repository has been archived by the owner on Nov 23, 2021. It is now read-only.

EndToEnd tests failing(hanging) on latest 4.0 snapshots on Linux #22

Open
carlbrown opened this issue Jul 17, 2017 · 24 comments
Open

EndToEnd tests failing(hanging) on latest 4.0 snapshots on Linux #22

carlbrown opened this issue Jul 17, 2017 · 24 comments
Assignees

Comments

@carlbrown
Copy link
Contributor

carlbrown commented Jul 17, 2017

With the last few swift-4.0 development snapshots, all but one of the XCTests with names that end with EndToEnd fail on Linux with timeouts. These are the tests that start a server on a random port and create a URLSession to send requests to it Those tests pass fine on Darwin (and on Linux under Swift 3.1.1).

If you comment all but one of the EndToEnd tests out, then the one you leave in (which ever one you pick) will pass. With two EndToEnd tests in, regardless of the order you choose, the first one will complete, and the second one will time out.

I'm guessing something isn't being cleaned up correctly, but I'm not sure what. Will update as I learn more.

@carlbrown carlbrown self-assigned this Jul 17, 2017
@carlbrown
Copy link
Contributor Author

I updated BlueSockets to the latest 4.0 toolchain (see https://github.com/IBM-Swift/BlueSocket/tree/UnsupportedWIP/) to see if that was the problem, and the BlueSocket tests run clean on Linux under the latest snapshots, so that's not it.

The investigation continues.

@carlbrown
Copy link
Contributor Author

carlbrown commented Jul 18, 2017

It appears that the issue is happening because, in the test that passes, the attempt to close the socket fails (although the test passes).

Here's the working output (with some extra prints) on the Mac:

Test testHelloEndToEnd() on port 53808
Connection Accepted
Listener called
process setting blocking mode
process resuming reader source <OS_dispatch_source: 0x7f8e8cf0b170> on socket 13
reader source event
reader source event about to read
reader source read 223 bytes
reader source sending to parser
reader source returned from parser
reader source event about to read
close called
socket valid calling close on it
checking reader queue <OS_dispatch_queue_serial: Read Queue 0[0x7f8e8ce2ef40]> before calling cancel
ReaderSource Event Error: Error code: -9982(0x-26FE), Bad file descriptor
close called
checking reader queue <OS_dispatch_queue_serial: Read Queue 0[0x7f8e8ce2ef40]> before calling cancel
reader source cancel
Calling close() on socket -1
close called

And here's the same (failing) call on Linux:

Test testOkEndToEnd() on port 43869
Connection Accepted
Listener called
process setting blocking mode
process resuming reader source Dispatch.DispatchSource on socket 9
reader source event
reader source event about to read
reader source read 228 bytes
reader source sending to parser
reader source returned from parser
close called
socket valid calling close on it
checking reader queue Dispatch.DispatchQueue before calling cancel
checking reader source Dispatch.DispatchSource before calling cancel
Test Case 'ServerTests.testOkEndToEnd' passed (0.021 seconds)

The two diverge at: ReaderSource Event Error: Error code: -9982(0x-26FE), Bad file descriptor
On the Mac when this happens, it indicates that the socket was closed as it was being read. This error doesn't happen on Linux (although the same close() call does), leaving the socket open so that the next (and subsequent) tests hang.

I can rework things so that this doesn't happen in these tests (and probably will), but that leaves open the possibility that it can happen in the wild, causing a server hang, so it needs to get fixed.

Next step to see if I'm correct, is to replicate this error in a much simpler sample project.

@carlbrown
Copy link
Contributor Author

carlbrown commented Jul 19, 2017

Here is the output of two test runs on the same system, one from Swift3.1.1 and the other from last week's 4.0 Snapshot:
swift.test.311.txt
swift.test.40.txt

In both cases, the code is doing a blocking call to Glibc.recv() in a DispatchSource's Read event handler when Glibc.shutdown() is called. In Swift 3.1.1, we get Got Count 0 from Glibc.recv(-1), but in Swift 4.0, the Glibc.recv() call doesn't return.

I'm also attaching a tar.gz of both source trees, to make it easier for other people to reproduce:
issue_22_complete_source.tar.gz

@carlbrown
Copy link
Contributor Author

Possibilities:

  1. It's possible the system calls are behaving differently, either:
    1. The system calls do something different when being called from Swift4 than when called from Swift3. I'd rate this unlikely.
    2. There's some global state that's set differently in Swift4 than Swift3 that's affecting the system calls. Possible, but I can't think of an obvious place to start looking.
    3. Some other reason.
  2. The system calls are behaving the same, but printing is broken, rendering my data invalid.
  3. The system calls are behaving the same but some kind of threading/concurrency problem is keeping things from being able to continue, causing the thread/queue calling recv() to hang.
  4. Something else.

@dgrove-oss
Copy link

Hi Carl. Am I reading it correctly that this test used to work on Swift 4 development snapshots until about a week ago? Asking because Apple merged several of their own PRs into dispatch during the July 13-July 17 time period. I'm trying to figure out if it is worth selectively backing those out to see if one of them was buggy. They touched bits of code dealing with Linux signal handling, ioctls, and processing of epoll.

@carlbrown
Copy link
Contributor Author

carlbrown commented Jul 19, 2017

As far as i know, it's never worked on Swift 4. It certainly hasn't worked on any of the 4.0-DEVELOPMENT snapshots that I've tried. It has worked on all the 3.x ones. (Sorry about the "on the last few 4.0 snapshots" language. I was trying to indicate that I hadn't tested it on earlier ones, not that it had worked on earlier ones).

One of the things I will do if I run out of other things to try is to go back in time snapshot by snapshot until I find the one where it stopped working. That will be a lot of downloading snapshots, though, so I was hoping to figure it out some other way.

@dgrove-oss
Copy link

ok, thanks. Then I will look from first principles instead of focusing on what Apple changed in their PRs last week.

@carlbrown
Copy link
Contributor Author

carlbrown commented Jul 20, 2017

Ok, so good news and bad news.

The good news is, I've discovered via brute force that the regression happens between the 2017-05-30 and 2017-05-31 snapshots (e.g. the tests pass with the May 30th snapshot and hang/fail with the May 31st one).

The bad news is there are no libDispatch changes between the swift-DEVELOPMENT-SNAPSHOT-05-30-a and swift-DEVELOPMENT-SNAPSHOT-05-31-a tags. So if the snapshots are representative of the tags, it's not a libDispatch issue.

The worse news is that of the changes between those tags, the biggest change is the addition of URLProtocol. I'm not sure how that could be causal, but the tests that are hanging do use URLSession, so it's possible that's the issue.

@dgrove-oss
Copy link

sounds like its over to the Foundation team then.

@carlbrown
Copy link
Contributor Author

carlbrown commented Jul 20, 2017

Quite possibly. I'm just really not sure how a change to URLSession (happening on one queue) could possibly cause Glibc.close(socket) not to interrupt Glibc.recv(Socket) on a different queue.

@ianpartridge
Copy link
Contributor

@pushkarnk

@carlbrown
Copy link
Contributor Author

I've discovered can fix this problem by applying a change to BlueSocket where we call Glibc.shutdown() when close()ing a socket that isConnected in addition to isListening. (c.f. https://github.com/carlbrown/BlueSocket/commit/f5dc916bf707b12b11634a89655f92c3836f5051#diff-cec92f2c4685d05c3d246f2bb8a42733L3324). This means that BlueSocket will call shutdown() on both the TCP connection being used to transfer data and the TCP socket that is listening for new connections (instead of just closing the listening socket and hoping that cleans up the accepted ones as well).

I think this is probably the right thing to do in any case (well, really I think BlueSocket should have exposed a shutdown() method, but that ship has sailed). However, that raises the problem that we have no way of knowing what other repositories out in the wild might break because they are relying on the existing behavior.

The other problem is: that "fix" in no way answers the question about why the previous version works fine on the Mac and on pre-May31st/pre-_HTTPURLProtocol versions of Swift, but fails on post-May30th/post-_HTTPURLProtocol versions.

So I'm not sure if I should apply the BlueSocket fix and just move on, or keep trying to answer the question about what changed, expecting that this won't be the only place that the post-_HTTPURLProtocol behavior might be causing people issues once they move to Swift4.

Opinions? @ianpartridge ? @pushkarnk ?

@pushkarnk
Copy link

pushkarnk commented Jul 25, 2017

I tried to get the tests running, but the swift test command fails for me with this:

Fetching https://github.com/IBM-Swift/CHTTPParser.git
Fetching https://github.com/IBM-Swift/BlueSocket.git
Fetching https://github.com/IBM-Swift/BlueSignals.git
Cloning https://github.com/IBM-Swift/BlueSocket.git
Resolving https://github.com/IBM-Swift/BlueSocket.git at 0.12.56
Cloning https://github.com/IBM-Swift/BlueSignals.git
Resolving https://github.com/IBM-Swift/BlueSignals.git at 0.9.48
Cloning https://github.com/IBM-Swift/CHTTPParser.git
Resolving https://github.com/IBM-Swift/CHTTPParser.git at 0.4.0
Compile CHTTPParser utils.c
Compile CHTTPParser http_parser.c
Compile Swift Module 'Socket' (3 sources)
Compile Swift Module 'Signals' (1 sources)
Compile Swift Module 'HTTP' (7 sources)
Compile Swift Module 'HTTPTests' (3 sources)
Compile Swift Module 'BlueSocketHTTP' (2 sources)
Compile Swift Module 'BlueSocketHTTPTests' (7 sources)
Compile Swift Module 'SwiftServerHTTPPackageTests' (1 sources)
Linking ./.build/x86_64-unknown-linux/debug/SwiftServerHTTPPackageTests.xctest
/usr/bin/ld.gold: error: cannot find -lCHTTPParser
/usr/bin/ld.gold: error: cannot find -lCHTTPParser
/usr/bin/ld.gold: error: cannot find -lCHTTPParser
clang: error: linker command failed with exit code 1 (use -v to see invocation)
<unknown>:0: error: link command failed with exit code 1 (use -v to see invocation)
error: terminated(1): /root/pushkar/swift/usr/bin/swift-build-tool -f /root/pushkar/http/.build/debug.yaml test

@pushkarnk
Copy link

Any idea what I am missing here?

@carlbrown
Copy link
Contributor Author

carlbrown commented Jul 25, 2017

You have to remove the link line from the module.map for CHTTPParser.

See the 0.5.0 tag: Kitura/CHTTPParser@2f5dc43#diff-f73852fef28faf8dabf66d2f28db6998L20

@carlbrown
Copy link
Contributor Author

carlbrown commented Jul 25, 2017

There's also a tar file in a comment above (#22 (comment)) that has the whole directory structure with everything checked out, if that's easier.

@pushkarnk
Copy link

Thanks @carlbrown I can see the failures now. I'll start the investigation.

@carlbrown
Copy link
Contributor Author

Any status on this @pushkarnk @ianpartridge ?

@pushkarnk
Copy link

@carlbrown I see that before the failing tests this message is printed:
Error accepting client connection: Error code: -9994(0x-270A), Invalid argument
Does it mean the test isn't able to connect to the server? Do you see this message in your env?

@carlbrown
Copy link
Contributor Author

@pushkarnk No, that means the socket was shutdown while the accept() system call was listening for new connections (which invalidates the socket passed as an argument).

You'll see that message on both the May 30th and May 31st snapshots.

@pushkarnk
Copy link

pushkarnk commented Aug 23, 2017

The ReaderSource Event Error: Error code: -9982(0x-26FE), Bad file descriptor message comes on Linux, when running the tests in LLDB and the tests seem to always pass.

@carlbrown
Copy link
Contributor Author

What snapshot are you seeing them passing with? What changed since 29 days ago when you said you saw the failures?

@pushkarnk
Copy link

pushkarnk commented Aug 24, 2017

@carlbrown I don't see them pass when simply run as swift test. They pass when the binary is run in LLDB. Again, with GDB I see the hang. This is with the latest toolchain. I am trying to understand how a debugger can make a difference.

@pushkarnk
Copy link

swiftlang/swift-corelibs-foundation#1195 has fixed this problem.

URLSession holds references to URLSessionTask. URLSessionTask holds a reference to an EasyHandle, which is a wrapper around a curl easy handle. Per this doc, to close a connection pertaining to an easy handle, curl_easy_cleanup needs to be called. Now, we call curl_easy_cleanup in the deinit for EasyHandle. In the early days of URLSession, to keep URLSessionTasks alive beyond the function in which they were locally held (because, the responses are async), we had introduced a retain cycle between URLSession and URLSessionTask, and we broke it after the task completed. Unfortunately, this cycle breaking was missed in the refactoring exercise. As a result, URLSessionTask objects accumulated in memory, holding EasyHandles along with them. EasyHandle.deinit() and hence curl_easy_cleanup were never invoked, and hence connections were kept alive. Hence this problem in swift-server/http. The client end did not initiate a connection shutdown and the server was stuck in read() blocking the queue on which reads() for the subsequent connections waited indefinitely.

swift test now goes through with 0 failures.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants