segfault in hashable_siphash_24_sse2 #60

Closed
jaspervdj opened this Issue Feb 19, 2013 · 25 comments

Comments

Projects
None yet
7 participants

A long-running application of me seems to segfault every once in a long while. I ran it using GDB last time and got the backtrace -- it seems to fail on ByteString hashing in some particular case.

GDB gives me a segfault on 0x088ed28a in hashable_siphash_24_sse2. The disassembly of this method can be found here: https://gist.github.com/jaspervdj/4986452

I think I'm going to recompile hashable with debugging flags enabled and try to reproduce it. However, the issue only appears after running for a few days so it might take a while.

Owner

tibbe commented Feb 19, 2013

What version of hashable are you using?

hashable-1.2.0.5

Owner

tibbe commented Feb 19, 2013

Just to make sure, you're not creating ByteStrings in an unsafe way somewhere, in such a way that their memory might be reclaimed prematurely? (Just making sure we don't go off on a wild goose chase here.)

bos was assigned Feb 19, 2013

I'm not doing anything unsafe, but it's not a small application and some other external library might be creating a corrupt ByteString. I'll try to figure out something more reproducible, so I don't think you should spend any time on this issue at this point.

Compiled with debugging info, got it to crash again, gdb gives:

hashable_siphash24_sse2 (ik0=6260769432266699290, ik1=8526604106625318493,
m=0xb7bfffec "t", n=18) at cbits/siphash-sse2.c:79

This m points to a "timestamp" :: Text literal (using the IsString instance). The C function doesn't seem to fail with this particular input or alignment when I try it to call directly from a simple C file. If you need any more info, please let me know.

jystic commented Mar 14, 2013

I've got an application segfaulting in hashable_siphash24_sse2 (with hashable-1.2.0.5) as well.

I have a HashSet (Text, Text) that when converted to a [(Text, Text)] seems to fix the problem. The Texts are filenames and SHA1 hashes of git commits that are possibly being created in an unsafe way by gitlib.

Owner

tibbe commented Mar 14, 2013

@jystic Do you mind running the older (1.1.2.5) version of hashable for a while and see if that fixes the issue? If so, that would provide us with more evidence that hashable-1.2.0.5 is to blame.

jystic commented Mar 21, 2013

@tibbe Yep, just tried it, running the older (1.1.2.5) version of hashable does not result in the segfault.

So hashable-1.2.0.5 looks to be the cause.

Owner

tibbe commented Mar 21, 2013

@jystic Thanks for double checking. We will need to figure out a way to debug this. @bos would a core dump help?

We just got an email on our mailing list with the same complaint:

Hi,

I'm trying to deploy a snap-0.11 application under a sub-url and ran into a segmentation fault. I could reproduce it via:

    mkdir testapp
    cd testapp
    snap-0.11 init default
    cabal-dev install-deps
    # change src/Site.hs:85 from
    #   addRoutes routes
    # to
    #   addRoutes [("/foo/bar", route routes)]
    cabal-dev configure
    cabal-dev build
    dist/build/testapp/testapp

(I don't know if this is the intended way to install an application under a sub-url, but couldn't find documentation adressing this.)

While URLs like /foo/bar, /foo/bar/login etc. work as expected, the url /foo/ leads to a segmentation fault (using GHC 7.6.2):

...

I'm running Debian 6.0.7 (squeeze) on a i686 box:

    $ uname -a
    Linux palmer 3.2.0-0.bpo.4-686-pae #1 SMP Debian 3.2.39-2~bpo60+1 i686 GNU/Linux

...

 Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb6bfeb70 (LWP 13972)]
0x085bdf0a in hashable_siphash24_sse2 ()

bjpe commented Apr 2, 2013

Well, it's me who got the seg fault :) Again, the seg fault did not occur when I configured my application to use hashable < 1.2. Feel free to ask for more details if necessary!

Collaborator

bos commented Apr 3, 2013

@bjpe I cannot replicate your segfault, but your description is pretty vague.

What version of GHC are you using?

Also, does the code in src/Site.hs that's crashing for you look exactly like this?

    -- Commented out:
    -- addRoutes routes
    -- Replaced with:
    addRoutes [("/foo/", route routes)]

And is that the only change you've made?

Collaborator

bos commented Apr 3, 2013

A core dump would be very helpful, in case someone has one. The movq instruction from @jaspervdj's crash report looks innocuous to me ... oh, wait.

Look at this. Jasper's segfault is happening when the function is called at address 0xb7bfffec, with an input of 18 bytes.

Now consider this loop control structure:

for(i = 0; i < (n-n%8); i += 8)
{
  // whatever
}

This will start at 0xb7bfffec, and step through the loop twice until that i reaches 16.

The very next line does this:

mi = _mm_loadl_epi64((__m128i*)(m + i));
// This is followed by a bunch of screwing around to
// mask out bytes we shouldn't have been looking at.

So we're trying to load 8 bytes starting at 0xb7bfffec+16, or 0xb7bffffc...which is right next to a page boundary.

If that next page is either not mapped or read-protected, we'll get a segfault.

This explains why the bug is so difficult to trigger: the conditions to cause it are not obvious.

Owner

tibbe commented Apr 3, 2013

So the question is, why would it be unmapped or read-protected? Does GHC have such pages on its heap?

bjpe commented Apr 3, 2013

@bos My configuration is

  • GHC 7.6.2
  • cabal-install version 1.16.0.2 using version 1.16.0.3 of the Cabal library
  • cabal-dev 0.9.2 built with Cabal 1.16.0.3
  • Snap 0.9.3.3 Project Kickstarter (installed through snap-0.11.2)

I need to do the following to reproduce the segfault:

$ mkdir testapp
$ cd testapp
$ /path/to/snap-binary/snap init default

Then change src/Site.hs:

$ diff testapp/src/Site.hs fresh/src/Site.hs 
85,86c85
<     -- addRoutes routes
<     addRoutes [("/foo/bar", route routes)]
---
>     addRoutes routes

Then, back in testapp:

$ cabal-dev install
$ dist/build/testapp/testapp
$ curl localhost:8000/foo/ # in other terminal

Now curl returns with curl: (52) Empty reply from server and the testapp binary complains Segmentation fault.

Collaborator

bos commented Apr 3, 2013

@tibbe We get unlucky simply by walking off the end of a memory map.

@bos there's nothing unlucky about this as far as I can tell. Line 79 in that function assumes that it's safe to read 8 bytes at the address "m + i"; unfortunately we already know that this is probably beyond the end of the string (we know i >= n-n%8 as the postcondition of the loop).

Normally this is OK because the bits we didn't want are masked out immediately following, but when the 8-byte read crosses a page boundary we can get the situation we have here.

Collaborator

bos commented Apr 3, 2013

@gregorycollins That's what I meant.

nominolo commented Apr 4, 2013

To be fair, this kind of problem wouldn't occur if the read was properly aligned. AFAIK, unaligned reads are very fast on the latest Intel hardware, but it might be a serious performance problem on older hardware. So fixing alignment would fix this bug and a potential performance issue.

You can't -- otherwise S.drop 1 would require a copy.

nominolo commented Apr 4, 2013

I'm not talking about enforcing alignment for bytestrings but for the SSE code. I.e., you want something like:

if (aligned) {
  fast and aligned stuff
} else {
   handle bytes until first aligned byte
   fast and aligned stuff, but perhaps a bit slower due to byte shuffling
}
if (bytes leftover) {
    handle remaining unaligned bytes
}

Also, it would be nice to see the actual assembly generated for that _mm_loadl_epi64. An 8 byte move to an xmm register can cause a pipeline stall if not treated properly by the compiler.

gregorycollins referenced this issue in snapframework/snap-core Apr 17, 2013

Closed

Snap makes segmentation fault when ... #179

@bos bos added a commit to bos/hashable that referenced this issue Apr 18, 2013

@bos bos Add a regression test for the SSE2 bug (gh-60) (gh-60)
--HG--
rename : tests/Properties.hs => tests/Main.hs
f7ed096

@bos bos added a commit to bos/hashable that referenced this issue Apr 18, 2013

@bos bos Fix SSE2 read past end of input (gh-60) abd8ce3
Collaborator

bos commented Apr 18, 2013

I believe that this is fixed, and that the results of the hash function are unaffected. I wrote the fix a couple of weeks ago, but then broke my leg, and only tonight did I have time to sit down and write the rather complicated regression test necessary to reproduce the bug and ensure it is fixed (see f7ed096 for the fun). Sorry for the delay.

Please test HEAD on your systems before we mark this as closed.

Collaborator

bos commented Apr 18, 2013

Oh yeah, and the regression test tickles a bug in cabal-install 1.16.0, so you may need to rebuild cabal-install and use that before you can run the test suite. Fun times!

bjpe commented Apr 18, 2013

@bos Works! I just tested it for my occurence in combination with snap and it behaves as expected, thanks!

Collaborator

bos commented Apr 21, 2013

Fixed in v1.2.0.6.

bos closed this Apr 21, 2013

23Skidoo referenced this issue in snapframework/snap-core Apr 22, 2013

Closed

Blacklist hashable >= 1.2.0.0 && < 1.2.0.5 #164

@ichistmeinname ichistmeinname pushed a commit to ichistmeinname/Currygle that referenced this issue Sep 30, 2015

@bjpe bjpe Added constraint on hashable to avoid segfaulting
For details, see: tibbe/hashable#60
b454b56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment