New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tls-retrievecertificate: HandshakeFailed (Error_Packet_unexpected "Alert [(AlertLevel_Fatal,BadRecordMac)]" " expected: change cipher") #124

Closed
glasserc opened this Issue Nov 2, 2015 · 31 comments

Comments

Projects
None yet
@glasserc

glasserc commented Nov 2, 2015

Using a command like for x inseq 1 1000; do .cabal-sandbox/bin/tls-retrievecertificate --verify --chain www.google.com 443 > google-$x.txt; done, I've reliably reproduced the above error message (it's showed up between one and four times per thousand). I can reproduce this on hosts at www.google.com and api.twitter.com with approximately the same level of frequency. It seems to happen on my home, work, and cell phone Internet connections. I've seen this error on two different machines, which rules out hardware issues.

@wrl314

This comment has been minimized.

Show comment
Hide comment
@wrl314

wrl314 Nov 3, 2015

Seeing same thing here. If it helps, version 1.2.16 did not have this issue but 1.3.3 does.

wrl314 commented Nov 3, 2015

Seeing same thing here. If it helps, version 1.2.16 did not have this issue but 1.3.3 does.

@andyfriesen

This comment has been minimized.

Show comment
Hide comment
@andyfriesen

andyfriesen Nov 5, 2015

I've noticed this too. If you remove cipher_ECDHE_RSA_AES128GCM_SHA256 from the cipher suite, then the bug does not seem to occur, so perhaps the problem is with that particular cipher?

andyfriesen commented Nov 5, 2015

I've noticed this too. If you remove cipher_ECDHE_RSA_AES128GCM_SHA256 from the cipher suite, then the bug does not seem to occur, so perhaps the problem is with that particular cipher?

@vincenthz

This comment has been minimized.

Show comment
Hide comment
@vincenthz

vincenthz Nov 5, 2015

Owner

Thanks @andyfriesen, this is a very interesting piece of information. There's some Heisenbug somewhere

Owner

vincenthz commented Nov 5, 2015

Thanks @andyfriesen, this is a very interesting piece of information. There's some Heisenbug somewhere

@vincenthz vincenthz added the Bug label Nov 5, 2015

@kazu-yamamoto

This comment has been minimized.

Show comment
Hide comment
@kazu-yamamoto

kazu-yamamoto Dec 9, 2015

Collaborator

Communicating with OpenTLS and go's TLS, I often see this bug:

mighty: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))
mighty: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))
mighty: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))
mighty: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))

Trying to make 500 connections by a benchmark tool, I see this bug some times between 1 to 10.

Collaborator

kazu-yamamoto commented Dec 9, 2015

Communicating with OpenTLS and go's TLS, I often see this bug:

mighty: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))
mighty: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))
mighty: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))
mighty: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))

Trying to make 500 connections by a benchmark tool, I see this bug some times between 1 to 10.

@kazu-yamamoto

This comment has been minimized.

Show comment
Hide comment
@kazu-yamamoto

kazu-yamamoto Dec 9, 2015

Collaborator

@vincenthz If you kindly suggest how to debug, I will try.

Collaborator

kazu-yamamoto commented Dec 9, 2015

@vincenthz If you kindly suggest how to debug, I will try.

@vincenthz

This comment has been minimized.

Show comment
Hide comment
@vincenthz

vincenthz Dec 11, 2015

Owner

@kazu-yamamoto did you try removing the cipher implementation above to see if that make a difference ? One good place to start is, I guess, to analize if ciphers make a difference in the overall stability; With a bit of luck we have a smoking gun, which gives us a place to start

Owner

vincenthz commented Dec 11, 2015

@kazu-yamamoto did you try removing the cipher implementation above to see if that make a difference ? One good place to start is, I guess, to analize if ciphers make a difference in the overall stability; With a bit of luck we have a smoking gun, which gives us a place to start

@sseveran

This comment has been minimized.

Show comment
Hide comment
@sseveran

sseveran Dec 12, 2015

I am also seeing this issue. When making high volumes of HTTP Requests. What is meant by remove cipher_ECDHE_RSA_AES128GCM_SHA256? I am using the library indirectly via connection and http-client-tls. Happy to collect any needed data.

sseveran commented Dec 12, 2015

I am also seeing this issue. When making high volumes of HTTP Requests. What is meant by remove cipher_ECDHE_RSA_AES128GCM_SHA256? I am using the library indirectly via connection and http-client-tls. Happy to collect any needed data.

@waern

This comment has been minimized.

Show comment
Hide comment
@waern

waern Dec 12, 2015

I'm also seeing this issue when making high volumes of HTTP requests.

waern commented Dec 12, 2015

I'm also seeing this issue when making high volumes of HTTP requests.

@kazu-yamamoto

This comment has been minimized.

Show comment
Hide comment
@kazu-yamamoto

kazu-yamamoto Dec 14, 2015

Collaborator

ECDHE-RSA-AES128-SHA256 also has this bug. I guess that ECDHE has a race.

Collaborator

kazu-yamamoto commented Dec 14, 2015

ECDHE-RSA-AES128-SHA256 also has this bug. I guess that ECDHE has a race.

@mietek

This comment has been minimized.

Show comment
Hide comment
@mietek

mietek Dec 20, 2015

I’ve just seen this issue happen at random, when building elm-reactor.

...
Linking dist/dist-sandbox-a4740c98/build/elm-reactor/elm-reactor ...
clang: warning: -Wl,-headerpad_max_install_names: 'linker' input unused
clang: warning: argument unused during compilation: '-L/usr/local/lib'
clang: warning: argument unused during compilation: '-L/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/OpenGL.framework/Versions/Current/Libraries'
Failed to build SideBar.elm

Downloading elm-lang/core
Downloading evancz/elm-html
Downloading evancz/elm-markdown
Downloading evancz/virtual-dom
failed with 'TlsExceptionHostPort (HandshakeFailed (Error_Packet_unexpected "Alert [(AlertLevel_Fatal,BadRecordMac)]" " expected: change cipher")) "github.com" 80' when sending request to
    <http://github.com/evancz/virtual-dom/zipball/2.1.0/>

Custom build step: creating and collecting all static resources
cabal: Error: some packages failed to install:
elm-reactor-0.16 failed during the building phase. The exception was:
ExitFailure 1

cc @evancz

mietek commented Dec 20, 2015

I’ve just seen this issue happen at random, when building elm-reactor.

...
Linking dist/dist-sandbox-a4740c98/build/elm-reactor/elm-reactor ...
clang: warning: -Wl,-headerpad_max_install_names: 'linker' input unused
clang: warning: argument unused during compilation: '-L/usr/local/lib'
clang: warning: argument unused during compilation: '-L/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/OpenGL.framework/Versions/Current/Libraries'
Failed to build SideBar.elm

Downloading elm-lang/core
Downloading evancz/elm-html
Downloading evancz/elm-markdown
Downloading evancz/virtual-dom
failed with 'TlsExceptionHostPort (HandshakeFailed (Error_Packet_unexpected "Alert [(AlertLevel_Fatal,BadRecordMac)]" " expected: change cipher")) "github.com" 80' when sending request to
    <http://github.com/evancz/virtual-dom/zipball/2.1.0/>

Custom build step: creating and collecting all static resources
cabal: Error: some packages failed to install:
elm-reactor-0.16 failed during the building phase. The exception was:
ExitFailure 1

cc @evancz

@domenkozar

This comment has been minimized.

Show comment
Hide comment
@domenkozar

domenkozar Mar 2, 2016

I'm also hitting this:

lsExceptionHostPort (HandshakeFailed (Error_Packet_unexpected "Alert [(AlertLevel_Fatal,BadRecordMac)]" " expected: change cipher")) "graph.facebook.com" 443

@vincenthz do you want some debugging information?

domenkozar commented Mar 2, 2016

I'm also hitting this:

lsExceptionHostPort (HandshakeFailed (Error_Packet_unexpected "Alert [(AlertLevel_Fatal,BadRecordMac)]" " expected: change cipher")) "graph.facebook.com" 443

@vincenthz do you want some debugging information?

@creichert

This comment has been minimized.

Show comment
Hide comment
@creichert

creichert Mar 5, 2016

I've seen this error on server-side and client-side (from http-client-tls and warp-tls via keter).

If it helps, one non-haskell client I'm using throws this error:

Unhandled CL+SSL::SSL-ERROR-SSL in thread #<SB-THREAD:THREAD
                                            "main thread" RUNNING
                                             {1002A8C083}>:
  A failure in the SSL library occurred on handle #.(SB-SYS:INT-SAP #X00660010) (return code: 1). SSL error queue:
error:140943FC:SSL routines:SSL3_READ_BYTES:sslv3 alert bad record mac

creichert commented Mar 5, 2016

I've seen this error on server-side and client-side (from http-client-tls and warp-tls via keter).

If it helps, one non-haskell client I'm using throws this error:

Unhandled CL+SSL::SSL-ERROR-SSL in thread #<SB-THREAD:THREAD
                                            "main thread" RUNNING
                                             {1002A8C083}>:
  A failure in the SSL library occurred on handle #.(SB-SYS:INT-SAP #X00660010) (return code: 1). SSL error queue:
error:140943FC:SSL routines:SSL3_READ_BYTES:sslv3 alert bad record mac
@mightybyte

This comment has been minimized.

Show comment
Hide comment
@mightybyte

mightybyte Mar 30, 2016

I'm also seeing this error at a similarly low frequency.

mightybyte commented Mar 30, 2016

I'm also seeing this error at a similarly low frequency.

@vincenthz

This comment has been minimized.

Show comment
Hide comment
@vincenthz

vincenthz Apr 5, 2016

Owner

turn out it's a completely deterministic problem (not a Heisenbug !), but not the one I thought/looked for originally, so I'm still looking but I'm getting closer to actually solve this.

Owner

vincenthz commented Apr 5, 2016

turn out it's a completely deterministic problem (not a Heisenbug !), but not the one I thought/looked for originally, so I'm still looking but I'm getting closer to actually solve this.

@creichert

This comment has been minimized.

Show comment
Hide comment
@creichert

creichert Apr 5, 2016

@vincenthz That's great news! I'll be happy to provide more information or testing if you need it.

creichert commented Apr 5, 2016

@vincenthz That's great news! I'll be happy to provide more information or testing if you need it.

@vincenthz

This comment has been minimized.

Show comment
Hide comment
@vincenthz

vincenthz Apr 9, 2016

Owner

Looks like I've missed this during review of ac76363, the serialization need to be "filling" the resulting bytes to be of the size of the curve (i.e. using a P256, the shared key generated need to be always 32 bytes (padded by 0 if necessary), instead of producting less bytes if the leading bytes 0)

Roughly speaking: once in 256 tries, the code will generate a 0 leading byte shared key which would have resulted in this bug (key mismatch with the server).

Funny story: this is the bug I guessed originally, except when I checked against openssl using integers, it matched (as integer: 00x = x).

Just passed a 10000 tries tests without failing, so that should be it. fixed by 32451bb. I'll try to make a release ASAP, need to unwind other changes

Owner

vincenthz commented Apr 9, 2016

Looks like I've missed this during review of ac76363, the serialization need to be "filling" the resulting bytes to be of the size of the curve (i.e. using a P256, the shared key generated need to be always 32 bytes (padded by 0 if necessary), instead of producting less bytes if the leading bytes 0)

Roughly speaking: once in 256 tries, the code will generate a 0 leading byte shared key which would have resulted in this bug (key mismatch with the server).

Funny story: this is the bug I guessed originally, except when I checked against openssl using integers, it matched (as integer: 00x = x).

Just passed a 10000 tries tests without failing, so that should be it. fixed by 32451bb. I'll try to make a release ASAP, need to unwind other changes

@vincenthz vincenthz closed this Apr 9, 2016

@vincenthz

This comment has been minimized.

Show comment
Hide comment
@vincenthz

vincenthz Apr 9, 2016

Owner

released as tls-1.3.5

Owner

vincenthz commented Apr 9, 2016

released as tls-1.3.5

@creichert

This comment has been minimized.

Show comment
Hide comment
@creichert

creichert Apr 9, 2016

To test the fix I upgraded keter to the new version of tls. However, now I get the BadRecordMac error on each request. I could have missed something so I created a repo w/ the issue at https://github.com/creichert/keter/tree/wip-tls-bad-record-mac.

Roughly, my steps were:

$ cd keter

$ cat stack.yaml 
flags:
  keter: {}
packages:
  - '.'
extra-deps:
  - tls-1.3.5
  - cryptonite-0.14
resolver: lts-5.10

$ stack list-dependencies | grep tls
http-client-tls 0.2.2
tls 1.3.5
warp-tls 3.2.1

$ stack build
... wait for build 

On the last step, I run keter and navigate to https://127.0.0.1/ in a browser:

$ .stack-work/install/x86_64-linux/lts-5.10/7.10.3/bin/keter etc/keter-config.yaml  
keter: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))
keter: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))
keter: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))
keter: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))
keter: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))
keter: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))
keter: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))
keter: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))

Could I be missing something here? I'll be happy to provide more info if needed.

creichert commented Apr 9, 2016

To test the fix I upgraded keter to the new version of tls. However, now I get the BadRecordMac error on each request. I could have missed something so I created a repo w/ the issue at https://github.com/creichert/keter/tree/wip-tls-bad-record-mac.

Roughly, my steps were:

$ cd keter

$ cat stack.yaml 
flags:
  keter: {}
packages:
  - '.'
extra-deps:
  - tls-1.3.5
  - cryptonite-0.14
resolver: lts-5.10

$ stack list-dependencies | grep tls
http-client-tls 0.2.2
tls 1.3.5
warp-tls 3.2.1

$ stack build
... wait for build 

On the last step, I run keter and navigate to https://127.0.0.1/ in a browser:

$ .stack-work/install/x86_64-linux/lts-5.10/7.10.3/bin/keter etc/keter-config.yaml  
keter: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))
keter: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))
keter: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))
keter: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))
keter: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))
keter: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))
keter: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))
keter: HandshakeFailed (Error_Protocol ("bad record mac",True,BadRecordMac))

Could I be missing something here? I'll be happy to provide more info if needed.

@vincenthz

This comment has been minimized.

Show comment
Hide comment
@vincenthz

vincenthz Apr 9, 2016

Owner

Yes, you pinned cryptonite 0.14 which has a dhe bug, move to cryptonite 0.15

Owner

vincenthz commented Apr 9, 2016

Yes, you pinned cryptonite 0.14 which has a dhe bug, move to cryptonite 0.15

@creichert

This comment has been minimized.

Show comment
Hide comment
@creichert

creichert Apr 10, 2016

@vincenthz I knew I missed something. Everything appears to be working correctly now, thanks!

creichert commented Apr 10, 2016

@vincenthz I knew I missed something. Everything appears to be working correctly now, thanks!

@kazu-yamamoto

This comment has been minimized.

Show comment
Hide comment
@kazu-yamamoto

kazu-yamamoto Apr 10, 2016

Collaborator

@vincenthz Thank you for fixing my mistake.

Collaborator

kazu-yamamoto commented Apr 10, 2016

@vincenthz Thank you for fixing my mistake.

@mightybyte

This comment has been minimized.

Show comment
Hide comment
@mightybyte

mightybyte Apr 11, 2016

I put 1.3.5 into production today and I haven't seen any HandshakeFailed errors since, so it's looking good here. Thanks!

mightybyte commented Apr 11, 2016

I put 1.3.5 into production today and I haven't seen any HandshakeFailed errors since, so it's looking good here. Thanks!

@dmjio

This comment has been minimized.

Show comment
Hide comment
@dmjio

dmjio commented Apr 18, 2016

@vincenthz thanks !

@23Skidoo

This comment has been minimized.

Show comment
Hide comment
@23Skidoo

23Skidoo Feb 13, 2017

We're still seeing HandshakeFailed errors with tls-1.3.8. Will try to come up with a reproducible test case.

23Skidoo commented Feb 13, 2017

We're still seeing HandshakeFailed errors with tls-1.3.8. Will try to come up with a reproducible test case.

@vincenthz

This comment has been minimized.

Show comment
Hide comment
@vincenthz

vincenthz Feb 13, 2017

Owner

first thing is opening a new bug, leave this old bug alone. also upgrade to latest 1.3.9

Owner

vincenthz commented Feb 13, 2017

first thing is opening a new bug, leave this old bug alone. also upgrade to latest 1.3.9

@creichert

This comment has been minimized.

Show comment
Hide comment
@creichert

creichert Feb 13, 2017

@23Skidoo I was seeing the same thing on 1.3.8 and upgraded to 1.3.9 and it went away. #140 (comment)

creichert commented Feb 13, 2017

@23Skidoo I was seeing the same thing on 1.3.8 and upgraded to 1.3.9 and it went away. #140 (comment)

@23Skidoo

This comment has been minimized.

Show comment
Hide comment
@23Skidoo

23Skidoo Feb 13, 2017

@creichert @vincenthz Thanks! I'll try upgrading first.

23Skidoo commented Feb 13, 2017

@creichert @vincenthz Thanks! I'll try upgrading first.

ehohenstein-imvu added a commit to imvu/hs-tls that referenced this issue Mar 14, 2017

Fix for intermittent BadRecordMac problem during handshaking when usi…
…ng DHE key exchange and TLS 1.1 or TLS 1.2

This fixes issue vincenthz#124.

There was a change between the TLS 1.0 RFC and the TLS 1.1 RFC that specifies that
leading 0 bytes of the pre-master-secret should be stripped before using it to derive
the master secret. The hs-tls library was not doing this which lead to 1 out of 256
handshakes to fail when using Diffie-Hellman key exchange and TLS 1.1 or TLS 1.2 when
communicating with an RFC compliant peer. Note that because the same logic is used in
both client and server contexts, this affected both clients and servers but would
not have affected a client and server both using this library.

See section 8.1.2 of the relevant RFCs for details:
https://tools.ietf.org/html/rfc2246#section-8.1.2
https://tools.ietf.org/html/rfc4346#section-8.1.2
https://tools.ietf.org/html/rfc5246#section-8.1.2
@ehohenstein-imvu

This comment has been minimized.

Show comment
Hide comment
@ehohenstein-imvu

ehohenstein-imvu Mar 14, 2017

I believe I have found and fixed the problem as originally described, though with a small clarification. I have submitted a pull request for the fix as #197.

The TLS 1.1 and 1.2 RFC specficiations say that leading 0 bytes must be trimmed from the pre-master-secret before it is used to derive the master secret when using Diffie-Hellman key exchange but hs-tls was not doing this. The TLS 1.0 RFC does not say that leading 0 bytes should be trimmed. This problem would have caused approximately 1 out of 256 handshakes to fail with a BadRecordMac exception when communicating with an RFC compliant peer using DHE and TLS 1.1 or 1.2 but not ECDHE or TLS 1.0.

I think andyfriesen's comment about cipher_ECDHE_RSA_AES128GCM_SHA256 above is actually incorrect. I found that the problem was consistently reproducible in 1000 connection attempts to an OpenSSL server with all ciphers using DHE but none using ECDHE.

ehohenstein-imvu commented Mar 14, 2017

I believe I have found and fixed the problem as originally described, though with a small clarification. I have submitted a pull request for the fix as #197.

The TLS 1.1 and 1.2 RFC specficiations say that leading 0 bytes must be trimmed from the pre-master-secret before it is used to derive the master secret when using Diffie-Hellman key exchange but hs-tls was not doing this. The TLS 1.0 RFC does not say that leading 0 bytes should be trimmed. This problem would have caused approximately 1 out of 256 handshakes to fail with a BadRecordMac exception when communicating with an RFC compliant peer using DHE and TLS 1.1 or 1.2 but not ECDHE or TLS 1.0.

I think andyfriesen's comment about cipher_ECDHE_RSA_AES128GCM_SHA256 above is actually incorrect. I found that the problem was consistently reproducible in 1000 connection attempts to an OpenSSL server with all ciphers using DHE but none using ECDHE.

@23Skidoo

This comment has been minimized.

Show comment
Hide comment
@23Skidoo

23Skidoo Mar 14, 2017

For the record, upgrading the library version appears to have fixed this issue for us.

23Skidoo commented Mar 14, 2017

For the record, upgrading the library version appears to have fixed this issue for us.

@ehohenstein-imvu

This comment has been minimized.

Show comment
Hide comment
@ehohenstein-imvu

ehohenstein-imvu Mar 14, 2017

I was able to reproduce the problem as described in the initial bug report using revision 2be245c which was committed only about a week ago and after 1.3.9 was released, though using the tls-simpleclient application talking to an OpenSSL server and by using the command line flag '--use-cipher 158' which corresponds to DHE-RSA-AES128GCM-SHA256. I was also able to reproduce it using all of the following ciphers:
DHE-RSA-AES128-SHA1 (51)
DHE-RSA-AES256-SHA1 (57)
DHE-RSA-AES128-SHA256 (103)
DHE-RSA-AES256GCM-SHA384 (159)

ehohenstein-imvu commented Mar 14, 2017

I was able to reproduce the problem as described in the initial bug report using revision 2be245c which was committed only about a week ago and after 1.3.9 was released, though using the tls-simpleclient application talking to an OpenSSL server and by using the command line flag '--use-cipher 158' which corresponds to DHE-RSA-AES128GCM-SHA256. I was also able to reproduce it using all of the following ciphers:
DHE-RSA-AES128-SHA1 (51)
DHE-RSA-AES256-SHA1 (57)
DHE-RSA-AES128-SHA256 (103)
DHE-RSA-AES256GCM-SHA384 (159)

@kazu-yamamoto

This comment has been minimized.

Show comment
Hide comment
@kazu-yamamoto

kazu-yamamoto Mar 15, 2017

Collaborator

This issue is about ECDH and already fixed.
#197 is for DH, not ECDH.

Collaborator

kazu-yamamoto commented Mar 15, 2017

This issue is about ECDH and already fixed.
#197 is for DH, not ECDH.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment