EndToEnd tests failing(hanging) on latest 4.0 snapshots on Linux #22

carlbrown · 2017-07-17T19:16:55Z

With the last few swift-4.0 development snapshots, all but one of the XCTests with names that end with EndToEnd fail on Linux with timeouts. These are the tests that start a server on a random port and create a URLSession to send requests to it. Those tests pass fine on Darwin (and on Linux under Swift 3.1.1).

If you comment out all but one of the EndToEnd tests, then the one you leave in (whichever one you pick) will pass. With two EndToEnd tests in, regardless of the order you choose, the first one will complete, and the second one will time out.

I'm guessing something isn't being cleaned up correctly, but I'm not sure what. Will update as I learn more.


carlbrown · 2017-07-17T23:08:04Z

I updated BlueSocket to the latest 4.0 toolchain (see https://github.com/IBM-Swift/BlueSocket/tree/UnsupportedWIP/) to see if that was the problem, and the BlueSocket tests run clean on Linux under the latest snapshots, so that's not it.

The investigation continues.

carlbrown · 2017-07-18T23:02:50Z

It appears that the issue is happening because, in the test that passes, the attempt to close the socket fails (although the test passes).

Here's the working output (with some extra prints) on the Mac:

Test testHelloEndToEnd() on port 53808
Connection Accepted
Listener called
process setting blocking mode
process resuming reader source <OS_dispatch_source: 0x7f8e8cf0b170> on socket 13
reader source event
reader source event about to read
reader source read 223 bytes
reader source sending to parser
reader source returned from parser
reader source event about to read
close called
socket valid calling close on it
checking reader queue <OS_dispatch_queue_serial: Read Queue 0[0x7f8e8ce2ef40]> before calling cancel
ReaderSource Event Error: Error code: -9982(0x-26FE), Bad file descriptor
close called
checking reader queue <OS_dispatch_queue_serial: Read Queue 0[0x7f8e8ce2ef40]> before calling cancel
reader source cancel
Calling close() on socket -1
close called

And here's the same (failing) call on Linux:

Test testOkEndToEnd() on port 43869
Connection Accepted
Listener called
process setting blocking mode
process resuming reader source Dispatch.DispatchSource on socket 9
reader source event
reader source event about to read
reader source read 228 bytes
reader source sending to parser
reader source returned from parser
close called
socket valid calling close on it
checking reader queue Dispatch.DispatchQueue before calling cancel
checking reader source Dispatch.DispatchSource before calling cancel
Test Case 'ServerTests.testOkEndToEnd' passed (0.021 seconds)

The two diverge at: ReaderSource Event Error: Error code: -9982(0x-26FE), Bad file descriptor
On the Mac when this happens, it indicates that the socket was closed as it was being read. This error doesn't happen on Linux (although the same close() call does), leaving the socket open so that the next (and subsequent) tests hang.

I can rework things so that this doesn't happen in these tests (and probably will), but that leaves open the possibility that it can happen in the wild, causing a server hang, so it needs to get fixed.

The next step, to see if I'm correct, is to replicate this error in a much simpler sample project.

carlbrown · 2017-07-19T17:39:51Z

Here is the output of two test runs on the same system, one from Swift 3.1.1 and the other from last week's 4.0 snapshot:
swift.test.311.txt
swift.test.40.txt

In both cases, the code is doing a blocking call to Glibc.recv() in a DispatchSource's Read event handler when Glibc.shutdown() is called. In Swift 3.1.1, we get Got Count 0 from Glibc.recv(-1), but in Swift 4.0, the Glibc.recv() call doesn't return.

I'm also attaching a tar.gz of both source trees, to make it easier for other people to reproduce:
issue_22_complete_source.tar.gz
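
For anyone who wants a smaller reproduction than the tarball, here is a minimal sketch of the situation described above: one thread blocks in recv() while another calls shutdown() on the same descriptor. It is not the code from the attached source trees. It uses a Unix-domain socketpair() rather than a TCP connection and a plain global queue rather than a DispatchSource, and the names ResultBox and recvResultAfterLocalShutdown are made up for illustration.

```swift
import Dispatch
#if canImport(Glibc)
import Glibc
#else
import Darwin
#endif

/// Hypothetical helper type: carries recv()'s return value out of the
/// background closure. Access is ordered by the semaphore below.
final class ResultBox: @unchecked Sendable {
    var value: Int? = nil
}

/// Blocks in recv() on a background queue, then calls shutdown() on the same
/// descriptor from the calling thread. Returns recv()'s result, or nil if
/// recv() has not returned within `timeout` seconds (the "hang" case).
func recvResultAfterLocalShutdown(timeout: Double = 2.0) -> Int? {
    #if canImport(Glibc)
    let streamType = Int32(SOCK_STREAM.rawValue)  // enum-typed in Glibc
    #else
    let streamType = SOCK_STREAM
    #endif
    var fds: [Int32] = [-1, -1]
    guard socketpair(AF_UNIX, streamType, 0, &fds) == 0 else { return nil }
    let reader = fds[0]
    let peer = fds[1]
    let box = ResultBox()
    let done = DispatchSemaphore(value: 0)

    DispatchQueue.global().async {
        var buffer = [UInt8](repeating: 0, count: 256)
        // Nothing is ever written to `peer`, so this blocks until shutdown().
        box.value = recv(reader, &buffer, 256, 0)
        done.signal()
    }

    usleep(100_000)  // give the reader a moment to block in recv()
    _ = shutdown(reader, Int32(SHUT_RDWR))

    // If recv() never returns, report nil and leave the descriptors open
    // rather than closing them underneath the blocked reader.
    guard done.wait(timeout: .now() + timeout) == .success else { return nil }
    _ = close(reader)
    _ = close(peer)
    return box.value
}
```

A return value of 0 corresponds to the "Got Count 0" result in the 3.1.1 log; nil would correspond to the recv() call that doesn't return.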

carlbrown · 2017-07-19T20:29:50Z

Possibilities:

- It's possible the system calls are behaving differently, either:
  1. The system calls do something different when being called from Swift4 than when called from Swift3. I'd rate this unlikely.
  2. There's some global state that's set differently in Swift4 than Swift3 that's affecting the system calls. Possible, but I can't think of an obvious place to start looking.
  3. Some other reason.
- The system calls are behaving the same, but printing is broken, rendering my data invalid.
- The system calls are behaving the same, but some kind of threading/concurrency problem is keeping things from being able to continue, causing the thread/queue calling recv() to hang.
- Something else.

dgrove-oss · 2017-07-19T22:18:42Z

Hi Carl. Am I reading it correctly that this test used to work on Swift 4 development snapshots until about a week ago? Asking because Apple merged several of their own PRs into dispatch during the July 13-July 17 time period. I'm trying to figure out if it is worth selectively backing those out to see if one of them was buggy. They touched bits of code dealing with Linux signal handling, ioctls, and processing of epoll.

carlbrown · 2017-07-19T23:30:14Z

As far as I know, it's never worked on Swift 4. It certainly hasn't worked on any of the 4.0-DEVELOPMENT snapshots that I've tried. It has worked on all the 3.x ones. (Sorry about the "on the last few 4.0 snapshots" language. I was trying to indicate that I hadn't tested it on earlier ones, not that it had worked on earlier ones.)

One of the things I will do if I run out of other things to try is to go back in time snapshot by snapshot until I find the one where it stopped working. That will be a lot of downloading snapshots, though, so I was hoping to figure it out some other way.

dgrove-oss · 2017-07-20T12:24:21Z

ok, thanks. Then I will look from first principles instead of focusing on what Apple changed in their PRs last week.

carlbrown · 2017-07-20T16:00:26Z

Ok, so good news and bad news.

The good news is, I've discovered via brute force that the regression happens between the 2017-05-30 and 2017-05-31 snapshots (i.e. the tests pass with the May 30th snapshot and hang/fail with the May 31st one).

The bad news is there are no libDispatch changes between the swift-DEVELOPMENT-SNAPSHOT-05-30-a and swift-DEVELOPMENT-SNAPSHOT-05-31-a tags. So if the snapshots are representative of the tags, it's not a libDispatch issue.

The worse news is that of the changes between those tags, the biggest change is the addition of URLProtocol. I'm not sure how that could be causal, but the tests that are hanging do use URLSession, so it's possible that's the issue.

dgrove-oss · 2017-07-20T18:19:23Z

Sounds like it's over to the Foundation team then.

carlbrown · 2017-07-20T18:27:58Z

Quite possibly. I'm just really not sure how a change to URLSession (happening on one queue) could possibly cause Glibc.close(socket) not to interrupt Glibc.recv(Socket) on a different queue.

ianpartridge · 2017-07-20T19:07:17Z

@pushkarnk

carlbrown · 2017-07-24T20:40:36Z

I've discovered I can fix this problem by applying a change to BlueSocket where we call Glibc.shutdown() when close()ing a socket that isConnected in addition to isListening (cf. https://github.com/carlbrown/BlueSocket/commit/f5dc916bf707b12b11634a89655f92c3836f5051#diff-cec92f2c4685d05c3d246f2bb8a42733L3324). This means that BlueSocket will call shutdown() on both the TCP connection being used to transfer data and the TCP socket that is listening for new connections (instead of just closing the listening socket and hoping that cleans up the accepted ones as well).
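
Roughly, the shape of that change looks like the sketch below. This is not the actual BlueSocket code (see the linked commit for that); shutdownThenClose is a made-up free function, and the two Bool parameters stand in for the isConnected and isListening properties mentioned above.

```swift
#if canImport(Glibc)
import Glibc
#else
import Darwin
#endif

/// Hypothetical helper: shut the socket down before closing it whenever it is
/// either a connected data socket or a listening socket, instead of only in
/// the listening case. Returns true if close() succeeded.
func shutdownThenClose(_ fd: Int32, isConnected: Bool, isListening: Bool) -> Bool {
    if isConnected || isListening {
        // The result is deliberately ignored in this sketch: shutdown() can
        // fail on a socket that is not connected, and we still want to close.
        _ = shutdown(fd, Int32(SHUT_RDWR))
    }
    return close(fd) == 0
}
```

With a helper like this, an accepted connection would be closed with `shutdownThenClose(fd, isConnected: true, isListening: false)`, so any reader blocked on that descriptor gets a chance to wake up before the descriptor goes away.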

I think this is probably the right thing to do in any case (well, really I think BlueSocket should have exposed a shutdown() method, but that ship has sailed). However, that raises the problem that we have no way of knowing what other repositories out in the wild might break because they are relying on the existing behavior.

The other problem is: that "fix" in no way answers the question about why the previous version works fine on the Mac and on pre-May31st/pre-_HTTPURLProtocol versions of Swift, but fails on post-May30th/post-_HTTPURLProtocol versions.

So I'm not sure if I should apply the BlueSocket fix and just move on, or keep trying to answer the question about what changed, expecting that this won't be the only place that the post-_HTTPURLProtocol behavior might be causing people issues once they move to Swift4.

Opinions? @ianpartridge ? @pushkarnk ?

pushkarnk · 2017-07-25T09:20:34Z

I tried to get the tests running, but the swift test command fails for me with this:

Fetching https://github.com/IBM-Swift/CHTTPParser.git
Fetching https://github.com/IBM-Swift/BlueSocket.git
Fetching https://github.com/IBM-Swift/BlueSignals.git
Cloning https://github.com/IBM-Swift/BlueSocket.git
Resolving https://github.com/IBM-Swift/BlueSocket.git at 0.12.56
Cloning https://github.com/IBM-Swift/BlueSignals.git
Resolving https://github.com/IBM-Swift/BlueSignals.git at 0.9.48
Cloning https://github.com/IBM-Swift/CHTTPParser.git
Resolving https://github.com/IBM-Swift/CHTTPParser.git at 0.4.0
Compile CHTTPParser utils.c
Compile CHTTPParser http_parser.c
Compile Swift Module 'Socket' (3 sources)
Compile Swift Module 'Signals' (1 sources)
Compile Swift Module 'HTTP' (7 sources)
Compile Swift Module 'HTTPTests' (3 sources)
Compile Swift Module 'BlueSocketHTTP' (2 sources)
Compile Swift Module 'BlueSocketHTTPTests' (7 sources)
Compile Swift Module 'SwiftServerHTTPPackageTests' (1 sources)
Linking ./.build/x86_64-unknown-linux/debug/SwiftServerHTTPPackageTests.xctest
/usr/bin/ld.gold: error: cannot find -lCHTTPParser
/usr/bin/ld.gold: error: cannot find -lCHTTPParser
/usr/bin/ld.gold: error: cannot find -lCHTTPParser
clang: error: linker command failed with exit code 1 (use -v to see invocation)
<unknown>:0: error: link command failed with exit code 1 (use -v to see invocation)
error: terminated(1): /root/pushkar/swift/usr/bin/swift-build-tool -f /root/pushkar/http/.build/debug.yaml test

pushkarnk · 2017-07-25T09:21:00Z

Any idea what I am missing here?

carlbrown · 2017-07-25T12:33:29Z

You have to remove the link line from the module.map for CHTTPParser.

See the 0.5.0 tag: Kitura/CHTTPParser@2f5dc43#diff-f73852fef28faf8dabf66d2f28db6998L20

carlbrown · 2017-07-25T13:28:48Z

There's also a tar file in a comment above (#22 (comment)) that has the whole directory structure with everything checked out, if that's easier.

pushkarnk · 2017-07-26T06:35:55Z

Thanks @carlbrown I can see the failures now. I'll start the investigation.

carlbrown · 2017-08-07T14:53:14Z

Any status on this @pushkarnk @ianpartridge ?

pushkarnk · 2017-08-10T10:45:58Z

@carlbrown I see that before the failing tests this message is printed:
Error accepting client connection: Error code: -9994(0x-270A), Invalid argument
Does it mean the test isn't able to connect to the server? Do you see this message in your env?

carlbrown · 2017-08-10T14:41:26Z

@pushkarnk No, that means the socket was shut down while the accept() system call was waiting for new connections (which invalidates the socket passed as an argument).

You'll see that message on both the May 30th and May 31st snapshots.

pushkarnk · 2017-08-23T09:05:20Z

The ReaderSource Event Error: Error code: -9982(0x-26FE), Bad file descriptor message does appear on Linux when running the tests in LLDB, and in that case the tests seem to always pass.

carlbrown · 2017-08-23T20:28:40Z

What snapshot are you seeing them passing with? What changed since 29 days ago when you said you saw the failures?

pushkarnk · 2017-08-24T07:09:09Z

@carlbrown I don't see them pass when simply run as swift test. They pass when the binary is run in LLDB. Again, with GDB I see the hang. This is with the latest toolchain. I am trying to understand how a debugger can make a difference.

pushkarnk · 2017-09-07T07:04:33Z

swiftlang/swift-corelibs-foundation#1195 has fixed this problem.

URLSession holds references to URLSessionTask. URLSessionTask holds a reference to an EasyHandle, which is a wrapper around a curl easy handle. Per this doc, to close a connection pertaining to an easy handle, curl_easy_cleanup needs to be called. Now, we call curl_easy_cleanup in the deinit for EasyHandle. In the early days of URLSession, to keep URLSessionTasks alive beyond the function in which they were locally held (because, the responses are async), we had introduced a retain cycle between URLSession and URLSessionTask, and we broke it after the task completed. Unfortunately, this cycle breaking was missed in the refactoring exercise. As a result, URLSessionTask objects accumulated in memory, holding EasyHandles along with them. EasyHandle.deinit() and hence curl_easy_cleanup were never invoked, and hence connections were kept alive. Hence this problem in swift-server/http. The client end did not initiate a connection shutdown and the server was stuck in read() blocking the queue on which reads() for the subsequent connections waited indefinitely.
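
To illustrate the ownership problem described above, here is a minimal sketch using stand-in types. It is not the swift-corelibs-foundation code: SketchSession, SketchTask, SketchHandle and cleanupCount are hypothetical names, and the onCleanup closure stands in for the curl_easy_cleanup call made from EasyHandle's deinit.

```swift
/// Stand-in for EasyHandle: its deinit is where the connection gets closed.
final class SketchHandle {
    private let onCleanup: () -> Void
    init(onCleanup: @escaping () -> Void) { self.onCleanup = onCleanup }
    deinit { onCleanup() }  // the real class calls curl_easy_cleanup here
}

/// Stand-in for URLSessionTask: owns a handle and holds its session strongly,
/// which is what forms the deliberate retain cycle.
final class SketchTask {
    let handle: SketchHandle
    var session: SketchSession?
    init(handle: SketchHandle, session: SketchSession) {
        self.handle = handle
        self.session = session
    }
}

/// Stand-in for URLSession: keeps its tasks alive while they are in flight.
final class SketchSession {
    private var tasks: [Int: SketchTask] = [:]

    func start(id: Int, onCleanup: @escaping () -> Void) {
        tasks[id] = SketchTask(handle: SketchHandle(onCleanup: onCleanup), session: self)
    }

    /// With breakCycle == false this models the regression: the task stays in
    /// the dictionary, so neither the task nor its handle is ever released.
    func complete(id: Int, breakCycle: Bool) {
        guard breakCycle, let task = tasks.removeValue(forKey: id) else { return }
        task.session = nil
    }
}

/// Runs one task to completion and reports how many handles were cleaned up.
func cleanupCount(breakCycle: Bool) -> Int {
    var count = 0
    do {
        let session = SketchSession()
        session.start(id: 1) { count += 1 }
        session.complete(id: 1, breakCycle: breakCycle)
    }
    return count
}
```

In this sketch, `cleanupCount(breakCycle: true)` runs the handle's deinit once the task completes, while `cleanupCount(breakCycle: false)` never runs it, which is the analogue of the connection staying open and the server's read never seeing the client go away.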

swift test now goes through with 0 failures.

carlbrown self-assigned this Jul 17, 2017

carlbrown mentioned this issue Aug 31, 2017

Swift 4 support #40

Closed
