New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermittent crash/hang with TLS on Windows under poor network conditions #191

Closed
enolan opened this Issue Apr 21, 2016 · 21 comments

Comments

Projects
None yet
3 participants
@enolan

enolan commented Apr 21, 2016

I can get a simple http-conduit program to crash or hang fairly often when the connection has occasional data corruption. I've also observed the hang when only simulating packet loss. Here's a repo with my test case. The program is tiny:

module Main where
import Network.HTTP.Conduit
import qualified Data.ByteString.Lazy.Char8 as LC8

main :: IO ()
main = do
    man <- newManager tlsManagerSettings
    req <- parseUrl "https://s3.amazonaws.com/hackage.fpcomplete.com/package/lens-4.12.3.tar.gz"
    response <- httpLbs req man
    LC8.putStrLn $ responseBody response

Using Clumsy in tamper mode set to 20%, it'll crash or hang maybe half the time.

When I instrument API calls with API Monitor I can see that the crashes happen immediately after a call to recv() fails. recv() returns SOCKET_ERROR which is a constant equal to -1. (When it's successful, recv() returns the number of bytes received.) Then memcpy() is called with the length set to -1. This causes the Windows equivalent of a segfault. Somewhere, the return value of recv() isn't getting checked.

The same thing with a cleartext HTTP connection is no problem. I also set up a simple cleartext HTTP server on a Linux machine and injected RSTs with tcpkill. This doesn't make it crash either.

I found this in the course of debugging commercialhaskell/stack#1689 (in the course of getting Idris building on AppVeyor). Stack hasn't ever crashed on me, but it seems likely the hangs and crashes in the simple http-conduit program are related to the hangs in Stack.

To reproduce:

  1. Clone the testcase repo
  2. stack build
  3. Download Clumsy
  4. Set Clumsy parameters to something like this and hit start.
  5. stack exec bad-memcpy-crash

You may have to run it several times. When it crashes you should see the Windows crash dialog and if you have Visual Studio installed, an offer to open a debugger.

Let me know if there's more I can do to figure this out.

@snoyberg

This comment has been minimized.

Show comment
Hide comment
@snoyberg

snoyberg Apr 21, 2016

Owner

I recently heard about vincenthz/hs-tls#124
causing trouble. Can you try upgrading to the latest tls package (by using
a nightly snapshot) and see if that fixes things?

On Thu, Apr 21, 2016, 7:11 AM Echo Nolan notifications@github.com wrote:

I can get a simple http-conduit program to crash or hang fairly often when
the connection has occasional data corruption. I've also observed the hang
when only simulating packet loss. Here's a repo with my test case
https://github.com/enolan/http-conduit-crash-hang. The program is tiny:

module Main whereimport Network.HTTP.Conduitimport qualified Data.ByteString.Lazy.Char8 as LC8
main :: IO ()
main = do
man <- newManager tlsManagerSettings
req <- parseUrl "https://s3.amazonaws.com/hackage.fpcomplete.com/package/lens-4.12.3.tar.gz"
response <- httpLbs req man
LC8.putStrLn $ responseBody response

Using Clumsy https://jagt.github.io/clumsy/ in tamper mode set to 20%,
it'll crash or hang maybe half the time.

When I instrument API calls with API Monitor
http://www.rohitab.com/apimonitor I can see that the crashes happen
immediately after a call to recv() fails http://i.imgur.com/rgIhYCH.png.
recv() returns SOCKET_ERROR which is a constant equal to -1. (When it's
successful, recv() returns the number of bytes received.) Then memcpy()
is called with the length set to -1. This causes the Windows equivalent of
a segfault. Somewhere, the return value of recv() isn't getting checked.

The same thing with a cleartext HTTP connection is no problem. I also set
up a simple cleartext HTTP server on a Linux machine and injected RSTs with
tcpkill. This doesn't make it crash either.

I found this in the course of debugging commercialhaskell/stack#1689
commercialhaskell/stack#1689 (in the course
of getting Idris building on AppVeyor). Stack hasn't ever crashed on me,
but it seems likely the hangs and crashes in the simple http-conduit
program are related to the hangs in Stack.

To reproduce:

  1. Clone the testcase repo
    https://github.com/enolan/http-conduit-crash-hang
  2. stack build
  3. Download Clumsy https://jagt.github.io/clumsy/
  4. Set Clumsy parameters to something like this
    http://i.imgur.com/whWZ956.png and hit start.
  5. stack exec bad-memcpy-crash

You may have to run it several times. When it crashes you should see the
Windows crash dialog and if you have Visual Studio installed, an offer to
open a debugger.

Let me know if there's more I can do to figure this out.


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
#191

Owner

snoyberg commented Apr 21, 2016

I recently heard about vincenthz/hs-tls#124
causing trouble. Can you try upgrading to the latest tls package (by using
a nightly snapshot) and see if that fixes things?

On Thu, Apr 21, 2016, 7:11 AM Echo Nolan notifications@github.com wrote:

I can get a simple http-conduit program to crash or hang fairly often when
the connection has occasional data corruption. I've also observed the hang
when only simulating packet loss. Here's a repo with my test case
https://github.com/enolan/http-conduit-crash-hang. The program is tiny:

module Main whereimport Network.HTTP.Conduitimport qualified Data.ByteString.Lazy.Char8 as LC8
main :: IO ()
main = do
man <- newManager tlsManagerSettings
req <- parseUrl "https://s3.amazonaws.com/hackage.fpcomplete.com/package/lens-4.12.3.tar.gz"
response <- httpLbs req man
LC8.putStrLn $ responseBody response

Using Clumsy https://jagt.github.io/clumsy/ in tamper mode set to 20%,
it'll crash or hang maybe half the time.

When I instrument API calls with API Monitor
http://www.rohitab.com/apimonitor I can see that the crashes happen
immediately after a call to recv() fails http://i.imgur.com/rgIhYCH.png.
recv() returns SOCKET_ERROR which is a constant equal to -1. (When it's
successful, recv() returns the number of bytes received.) Then memcpy()
is called with the length set to -1. This causes the Windows equivalent of
a segfault. Somewhere, the return value of recv() isn't getting checked.

The same thing with a cleartext HTTP connection is no problem. I also set
up a simple cleartext HTTP server on a Linux machine and injected RSTs with
tcpkill. This doesn't make it crash either.

I found this in the course of debugging commercialhaskell/stack#1689
commercialhaskell/stack#1689 (in the course
of getting Idris building on AppVeyor). Stack hasn't ever crashed on me,
but it seems likely the hangs and crashes in the simple http-conduit
program are related to the hangs in Stack.

To reproduce:

  1. Clone the testcase repo
    https://github.com/enolan/http-conduit-crash-hang
  2. stack build
  3. Download Clumsy https://jagt.github.io/clumsy/
  4. Set Clumsy parameters to something like this
    http://i.imgur.com/whWZ956.png and hit start.
  5. stack exec bad-memcpy-crash

You may have to run it several times. When it crashes you should see the
Windows crash dialog and if you have Visual Studio installed, an offer to
open a debugger.

Let me know if there's more I can do to figure this out.


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
#191

@enolan

This comment has been minimized.

Show comment
Hide comment
@enolan

enolan Apr 21, 2016

Still happens on nightly-2016-04-20 with tls-1.3.5 :(

enolan commented Apr 21, 2016

Still happens on nightly-2016-04-20 with tls-1.3.5 :(

@snoyberg

This comment has been minimized.

Show comment
Hide comment
@snoyberg

snoyberg Apr 21, 2016

Owner

I don't have ready access to a Windows machine to try this on. Can you test if this kind of problem occurs at all when using the connection package directly?

Pinging @vincenthz for any other thoughts on narrowing down the problem.

Owner

snoyberg commented Apr 21, 2016

I don't have ready access to a Windows machine to try this on. Can you test if this kind of problem occurs at all when using the connection package directly?

Pinging @vincenthz for any other thoughts on narrowing down the problem.

@vincenthz

This comment has been minimized.

Show comment
Hide comment
@vincenthz

vincenthz Apr 21, 2016

Contributor

it does smells like a network bug, more than anything linked to tls. I haven't actually really look at it, but to the best of my knowledge, I'm never setting any network buffer anywhere; they are all the result of some network call.

Contributor

vincenthz commented Apr 21, 2016

it does smells like a network bug, more than anything linked to tls. I haven't actually really look at it, but to the best of my knowledge, I'm never setting any network buffer anywhere; they are all the result of some network call.

@snoyberg

This comment has been minimized.

Show comment
Hide comment
@snoyberg

snoyberg Apr 22, 2016

Owner

@enolan Can you test if you can reproduce this at all with an insecure connection?

Owner

snoyberg commented Apr 22, 2016

@enolan Can you test if you can reproduce this at all with an insecure connection?

@enolan

This comment has been minimized.

Show comment
Hide comment
@enolan

enolan Apr 23, 2016

I tried pretty hard and couldn't. Debug printfs indicate the crash happens in hGetBuf. I'm trying to narrow it down further.

enolan commented Apr 23, 2016

I tried pretty hard and couldn't. Debug printfs indicate the crash happens in hGetBuf. I'm trying to narrow it down further.

@enolan

This comment has been minimized.

Show comment
Hide comment
@snoyberg

This comment has been minimized.

Show comment
Hide comment
@snoyberg

snoyberg May 4, 2016

Owner

Wow, good catch. Is there any potential workaround outside of GHC?

Owner

snoyberg commented May 4, 2016

Wow, good catch. Is there any potential workaround outside of GHC?

@enolan

This comment has been minimized.

Show comment
Hide comment
@enolan

enolan May 4, 2016

It only happens when sockets are converted to handles, so if tls used
network's API rather than hGet / hPut there wouldn't be a problem.

On Tue, May 3, 2016, 8:20 PM Michael Snoyman notifications@github.com
wrote:

Wow, good catch. Is there any potential workaround outside of GHC?


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#191 (comment)

enolan commented May 4, 2016

It only happens when sockets are converted to handles, so if tls used
network's API rather than hGet / hPut there wouldn't be a problem.

On Tue, May 3, 2016, 8:20 PM Michael Snoyman notifications@github.com
wrote:

Wow, good catch. Is there any potential workaround outside of GHC?


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#191 (comment)

@snoyberg

This comment has been minimized.

Show comment
Hide comment
@snoyberg

snoyberg May 4, 2016

Owner

I don't know the details of how this works internally to tls/connection. @vincenthz is there any reasonable way to move away from Handles here? Would you consider a PR from someone to make such a change?

Owner

snoyberg commented May 4, 2016

I don't know the details of how this works internally to tls/connection. @vincenthz is there any reasonable way to move away from Handles here? Would you consider a PR from someone to make such a change?

@vincenthz

This comment has been minimized.

Show comment
Hide comment
@vincenthz

vincenthz May 5, 2016

Contributor

Good catch @enolan !

I've pushed support to connection and socks (tls was already able to handle both), I would appreciate some testing before I push a new version if possible

Contributor

vincenthz commented May 5, 2016

Good catch @enolan !

I've pushed support to connection and socks (tls was already able to handle both), I would appreciate some testing before I push a new version if possible

@snoyberg

This comment has been minimized.

Show comment
Hide comment
@snoyberg

snoyberg May 5, 2016

Owner

Are there any changes that need to be made on the http-client-tls side to take advantage of this support, or just recompiling against master?

Owner

snoyberg commented May 5, 2016

Are there any changes that need to be made on the http-client-tls side to take advantage of this support, or just recompiling against master?

@vincenthz

This comment has been minimized.

Show comment
Hide comment
@vincenthz

vincenthz May 5, 2016

Contributor

it should just work as is

Contributor

vincenthz commented May 5, 2016

it should just work as is

@snoyberg

This comment has been minimized.

Show comment
Hide comment
@snoyberg

snoyberg May 5, 2016

Owner

Awesome, thank you :)

Owner

snoyberg commented May 5, 2016

Awesome, thank you :)

@enolan

This comment has been minimized.

Show comment
Hide comment
@enolan

enolan May 8, 2016

@vincenthz Sorry! I read network's code wrong. It imports send and recv itself, with the right types for Windows, but it uses those imports on everything but Windows. Switching to using network's API directly won't actually help here. My bad.

enolan commented May 8, 2016

@vincenthz Sorry! I read network's code wrong. It imports send and recv itself, with the right types for Windows, but it uses those imports on everything but Windows. Switching to using network's API directly won't actually help here. My bad.

@vincenthz

This comment has been minimized.

Show comment
Hide comment
@vincenthz

vincenthz May 8, 2016

Contributor

:(

could network detect the 4294967295 (wrapped -1) value on windows 64 bits as meaning error like -1 ?

Contributor

vincenthz commented May 8, 2016

:(

could network detect the 4294967295 (wrapped -1) value on windows 64 bits as meaning error like -1 ?

@enolan

This comment has been minimized.

Show comment
Hide comment
@enolan

enolan May 8, 2016

The crash happens before control returns from base unfortunately.

On Sun, May 8, 2016, 11:55 AM Vincent Hanquez notifications@github.com
wrote:

:(

could network detect the 4294967295 (wrapped -1) value on windows 64 bits
as meaning error like -1 ?


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#191 (comment)

enolan commented May 8, 2016

The crash happens before control returns from base unfortunately.

On Sun, May 8, 2016, 11:55 AM Vincent Hanquez notifications@github.com
wrote:

:(

could network detect the 4294967295 (wrapped -1) value on windows 64 bits
as meaning error like -1 ?


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#191 (comment)

@vincenthz

This comment has been minimized.

Show comment
Hide comment
@vincenthz

vincenthz May 8, 2016

Contributor

darn :\ . Could probably have network call a different foreign import maybe, one that returns the right thing, because fixing base is not going to fix any previous version, and it could take a really long time ..

Contributor

vincenthz commented May 8, 2016

darn :\ . Could probably have network call a different foreign import maybe, one that returns the right thing, because fixing base is not going to fix any previous version, and it could take a really long time ..

ghc-mirror pushed a commit to ghc/ghc that referenced this issue May 19, 2016

Use the correct return type for Windows' send()/recv() (Fix #12010)
Summary:
They return signed 32 bit ints on Windows, even on a 64 bit OS, rather than
Linux's 64 bit ssize_t. This means when recv() returned -1 to signal an error we
thought it was 4294967295. It was converted to an int, -1 and the buffer was
memcpy'd which caused a segfault. Other bad stuff happened with send()s.

See also note CSsize in System.Posix.Internals.

Add a test for #12010

Test Plan:
- GHC testsuite (T12010)
- http-conduit test (snoyberg/http-client#191)

Reviewers: austin, hvr, bgamari, Phyx

Reviewed By: Phyx

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2170

GHC Trac Issues: #12010
@enolan

This comment has been minimized.

Show comment
Hide comment
@enolan

enolan Jun 2, 2016

This is the rabbit hole with no bottom. It's actually fixable in network, and I did, but now I get a different crash. haskell/network#192 is the PR. If either of you could look at it and see what I'm missing that'd be great.

@vincenthz Your change to tls to use Network.Socket over Handles actually wasn't in vain. The network workaround only works if you're using network.

enolan commented Jun 2, 2016

This is the rabbit hole with no bottom. It's actually fixable in network, and I did, but now I get a different crash. haskell/network#192 is the PR. If either of you could look at it and see what I'm missing that'd be great.

@vincenthz Your change to tls to use Network.Socket over Handles actually wasn't in vain. The network workaround only works if you're using network.

@enolan

This comment has been minimized.

Show comment
Hide comment
@enolan

enolan Aug 4, 2016

This is fixed upstream in network-2.6.3.1 and asn1-encoding-0.9.4.

enolan commented Aug 4, 2016

This is fixed upstream in network-2.6.3.1 and asn1-encoding-0.9.4.

@enolan enolan closed this Aug 4, 2016

@snoyberg

This comment has been minimized.

Show comment
Hide comment
@snoyberg

snoyberg Aug 4, 2016

Owner

Awesome, thank you!

On Thu, Aug 4, 2016, 6:51 AM Echo Nolan notifications@github.com wrote:

This is fixed upstream in network-2.6.3.1 and asn1-encoding-0.9.4.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#191 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AADBB-oKQP-8e-CHy3_KDK-sVm3GNjimks5qcWGjgaJpZM4IMU3S
.

Owner

snoyberg commented Aug 4, 2016

Awesome, thank you!

On Thu, Aug 4, 2016, 6:51 AM Echo Nolan notifications@github.com wrote:

This is fixed upstream in network-2.6.3.1 and asn1-encoding-0.9.4.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#191 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AADBB-oKQP-8e-CHy3_KDK-sVm3GNjimks5qcWGjgaJpZM4IMU3S
.

bgamari added a commit to bgamari/ghc that referenced this issue Sep 5, 2016

Use the correct return type for Windows' send()/recv() (Fix #12010)
Summary:
They return signed 32 bit ints on Windows, even on a 64 bit OS, rather than
Linux's 64 bit ssize_t. This means when recv() returned -1 to signal an error we
thought it was 4294967295. It was converted to an int, -1 and the buffer was
memcpy'd which caused a segfault. Other bad stuff happened with send()s.

See also note CSsize in System.Posix.Internals.

Add a test for #12010

Test Plan:
- GHC testsuite (T12010)
- http-conduit test (snoyberg/http-client#191)

Reviewers: austin, hvr, bgamari, Phyx

Reviewed By: Phyx

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2170

GHC Trac Issues: #12010

(cherry picked from commit 1ee47c1)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment