Issue with RethinkDB hanging #4299

thaiat · 2015-05-28T11:53:10Z

Hi all,

We are in the process of writing a loopback connector for rethinkdb that uses the amazing rethinkdbdash

The code is available here : https://github.com/yoobic/loopback-connector-rethinkdbdash

After running test (npm test) for about 15 to 20 times, rethinkdb will hang.
I can see this message : 'Entering slow growth mode'

The only way to come back to normal is to execute:

pkill -9 rethinkdb

I'm on mac, lots of RAM and disk space available.
I can reproduce on different laptop.
Using latest of rethinkdb and rethinkdbdash

The text was updated successfully, but these errors were encountered:

neumino · 2015-05-28T14:10:49Z

You can enter slow growth mode, it's doesn't always mean that RethinkDB is dead/hanging.
It typically happen if you reach the maximum of open file. Basically rethinkdbdash will just slowly try to open connections instead of eagerly opening them.

I tried running npm test in a loop and ended up with #4300, not sure if it's related though.

Ijmir · 2015-05-28T18:08:33Z

After running the tests about 15-20 times in loop the db stops responding and all tests always fail.
Can't reach via web interface too. An error I saw was "Uncaught RqlDriverError: Handshake timedout" while trying to connect to DB in hanged mode.
After restarting db (by killing the process "kill -KILL "), it hangs even after the first time the test are running - no need to 20 times loop. Only when I delete the files in "rethinkdb-data" and restart db, it doesn't hang immediately.

mlucy · 2015-05-28T23:43:28Z

Thanks for the bug report.

Is there any error output from RethinkDB before it hangs?

It might be that you're hitting the maximum open file limit on OS X. Do you still see the hang if you run ulimit -n 10000 first?

thaiat · 2015-05-30T01:48:44Z

@mlucy still happenning, here is my ulimit :

-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8192
-c: core file size (blocks)         0
-v: address space (kbytes)          unlimited
-l: locked-in-memory size (kbytes)  unlimited
-u: processes                       709
-n: file descriptors                65536

Ijmir · 2015-05-31T09:32:37Z

@mlucy I get on server logs:

2015-05-31T12:17:43.549539000 125.589003s error: Could not write to socket: Protocol wrong type for socket
2015-05-31T12:20:45.407267000 307.452388s error: Could not write to socket: Protocol wrong type for socket

I ran ulimit -n 10000 but it still reproduced. My ulimit:

-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8192
-c: core file size (blocks)         0
-v: address space (kbytes)          unlimited
-l: locked-in-memory size (kbytes)  unlimited
-u: processes                       709
-n: file descriptors                10000

Web interface does not respond.

mlucy · 2015-06-02T06:38:32Z

Thanks for the update.

@danielmewes, any idea what might be going on here?

danielmewes · 2015-06-02T18:12:24Z

Very odd. I wonder if the log messages are relevant. They might be a red herring, since the issue appears to only happen after a while and then goes away once rethinkdb_data is deleted.
On the other hand I think it's also weird that we're getting a "Protocol wrong type for socket" error while writing to a presumably already existing socket. I would expect an unsupported protocol type to bubble up when we initially open the socket, not when writing to it.
Either the error message is wrong, or something very odd is going on. @Tryneus any ideas about the protocol error?

@Ijmir when RethinkDB hangs, could you check two things for us and post the results here?

how much CPU is the RethinkDB process using at that time?
what is the content of rethinkdb_data?

This data might help narrow down the possible causes of this.

danielmewes · 2015-06-02T18:55:56Z

Hmm the same "Protocol wrong type" error appeared here #4327 (comment) for a different user. It might be unrelated. Still curious what it means.

Ijmir · 2015-06-03T10:37:00Z

@danielmewes
memory 325MB
cpu 53%
I don't always get the "Protocol wrong type"
rethinkdb_data - https://dl.dropboxusercontent.com/u/15207747/rethinkdb_data.zip
this data hangs rethink db from 1-2 runs of test suite.

danielmewes · 2015-06-03T19:31:33Z

Thanks for the data and information @Ijmir ! Will try to reproduce.

danielmewes · 2015-06-03T20:32:36Z

@Ijmir I think I'm not running the tests correctly. All I get when I run npm test on your repository is


> loopback-connector-rethinkdbdash@0.0.2 test /home/ssd3/daniel/user-data/4299-data/loopback-connector-rethinkdbdash
> gulp lint

[13:29:28] Using gulpfile /home/ssd3/daniel/user-data/4299-data/loopback-connector-rethinkdbdash/gulpfile.js
[13:29:28] Starting 'static'...
[13:29:30] 
/home/ssd3/daniel/user-data/4299-data/loopback-connector-rethinkdbdash/lib/rethink.js
  590:0  warning  Unexpected todo comment  no-warning-comments

✖ 1 problem (0 errors, 1 warning)

[13:29:30] All lint files passed
[13:29:30] Finished 'static' after 1.18 s
[13:29:30] Starting 'lint'...
[13:29:30] Finished 'lint' after 6.71 μs

It doesn't seem to run any RethinkDB tests. Any idea what I'm doing wrong? (I'm not that familiar with node)

danielmewes · 2015-06-03T20:35:22Z

Oh never mind. Got it working by running

node_modules/gulp/bin/gulp.js test

instead I think.

danielmewes · 2015-06-03T20:40:34Z

On Linux I run into #4300 pretty quickly, so I'm going to test this on Mac next. Maybe we're just not catching the stack overflow properly on Mac.

thaiat · 2015-06-03T20:41:59Z

@danielmewes to run the test you have to execute

make test-verbose

danielmewes · 2015-06-03T21:16:36Z

Thanks, I can reproduce it on OS X.

danielmewes · 2015-06-04T22:17:01Z

The weird thing is that once RethinkDB gets into that state, htop shows that it spends ~50% CPU in kernel code. However according to dtrace there are virtually no syscalls going on.

I also tried attaching instruments timing-based profiler and it didn't show anything interesting either.

However I found that increasing the coroutine stack size avoids the issue, so I think the initial theory that this is the same as #4300 , except that something is broken about our coroutine stack overflow detection on OS X.
We do call mprotect successfully on the watchdog page, so something must go wrong with handling the signal that accessing it should trigger.

danielmewes · 2015-06-04T22:17:48Z

This will probably go away with #4024 (comment) , though we also need to find out why the stack overflow protection isn't working on OS X.

danielmewes · 2015-06-04T22:24:46Z

Ah, on OS X accessing a protected page can trigger SIGSEGV or SIGBUS. On Linux it always causes SIGSEGV. We are handling SIGSEGV but not SIGBUS.

danielmewes · 2015-06-04T22:51:27Z

A fix for us not catching the stack overflow on OS X is now in code review 2954 by @AtnNn .

@thaiat this fix will stop RethinkDB from hanging, but instead it is going to crash just like described by @neumino in #4300 .
So it doesn't fully resolve the problem. There must be some query in the test suite that's very deeply nested that's causing this to happen in the first place.

In the near future, we plan to implement a protection for such queries such that the query itself will fail, but the server won't crash (our current behavior is obviously bad).

Sorry you had to run into this.

danielmewes · 2015-06-08T19:02:59Z

Merged into v2.0.x as of fec4595 and cherry-picked into next in 82f2487 and 978bbf9

primitive-type · 2015-08-19T20:32:56Z

I believe I'm still experiencing an issue related to this: neumino/thinky#323

danielmewes · 2015-08-19T20:36:44Z

@primitive-type That failure looks unrelated to this issue. The "too many open files" message usually means that there's still a configuration problem on the OS side. Can you make sure that your increased maxfiles settings are correctly applied to the user / shell that you run RethinkDB from?

primitive-type · 2015-08-19T21:48:03Z

@danielmewes Sure, here is the result of ulimit -all:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2048
virtual memory          (kbytes, -v) unlimited

I closely followed the instructions at http://docs.basho.com/riak/latest/ops/tuning/open-files-limit/#Mac-OS-X

RethinkDB was using about 25% CPU at the time of the crash. (Mac OS X 10.10.4)

amir-rahnama · 2016-09-01T11:57:41Z

Recently running a Rethinkdb 2.2.5 cluster goes on my local machine into slow growth mode a lot. I am using Docker for Mac Version 1.12.1-rc1-beta23. I am not really using volumes or anything but still it happens.

Any ideas?

danielmewes · 2016-09-01T19:11:06Z

@ambodi It's likely a different issue from what this was about originally. Check out @neumino's comment here #4299 (comment) about what slow-growth mode means.

I recommend checking the ulimits, specifically the open file limit by running ulimit -a. You can also check the RethinkDB log_file for messages like these: error: accept() failed: Too many open files..

mlucy added this to the 2.0.x milestone May 28, 2015

mlucy added the tp:bug label May 28, 2015

mlucy self-assigned this May 28, 2015

danielmewes assigned danielmewes and unassigned mlucy Jun 3, 2015

danielmewes modified the milestones: 2.0.3, 2.0.x Jun 3, 2015

danielmewes modified the milestones: 2.0.x, 2.0.3 Jun 4, 2015

danielmewes modified the milestones: 2.0.3, 2.0.x Jun 4, 2015

danielmewes pushed a commit that referenced this issue Jun 4, 2015

Handle SIGBUS similar to SIGSEGV to fix #4299.

f4a7cd2

danielmewes added the st:review label Jun 6, 2015

danielmewes closed this as completed in 82f2487 Jun 8, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue with RethinkDB hanging #4299

Issue with RethinkDB hanging #4299

thaiat commented May 28, 2015

neumino commented May 28, 2015

Ijmir commented May 28, 2015

mlucy commented May 28, 2015

thaiat commented May 30, 2015

Ijmir commented May 31, 2015

mlucy commented Jun 2, 2015

danielmewes commented Jun 2, 2015

danielmewes commented Jun 2, 2015

Ijmir commented Jun 3, 2015

danielmewes commented Jun 3, 2015

danielmewes commented Jun 3, 2015

danielmewes commented Jun 3, 2015

danielmewes commented Jun 3, 2015

thaiat commented Jun 3, 2015

danielmewes commented Jun 3, 2015

danielmewes commented Jun 4, 2015

danielmewes commented Jun 4, 2015

danielmewes commented Jun 4, 2015

danielmewes commented Jun 4, 2015

danielmewes commented Jun 8, 2015

primitive-type commented Aug 19, 2015

danielmewes commented Aug 19, 2015

primitive-type commented Aug 19, 2015

amir-rahnama commented Sep 1, 2016

danielmewes commented Sep 1, 2016

Issue with RethinkDB hanging #4299

Issue with RethinkDB hanging #4299

Comments

thaiat commented May 28, 2015

neumino commented May 28, 2015

Ijmir commented May 28, 2015

mlucy commented May 28, 2015

thaiat commented May 30, 2015

Ijmir commented May 31, 2015

mlucy commented Jun 2, 2015

danielmewes commented Jun 2, 2015

danielmewes commented Jun 2, 2015

Ijmir commented Jun 3, 2015

danielmewes commented Jun 3, 2015

danielmewes commented Jun 3, 2015

danielmewes commented Jun 3, 2015

danielmewes commented Jun 3, 2015

thaiat commented Jun 3, 2015

danielmewes commented Jun 3, 2015

danielmewes commented Jun 4, 2015

danielmewes commented Jun 4, 2015

danielmewes commented Jun 4, 2015

danielmewes commented Jun 4, 2015

danielmewes commented Jun 8, 2015

primitive-type commented Aug 19, 2015

danielmewes commented Aug 19, 2015

primitive-type commented Aug 19, 2015

amir-rahnama commented Sep 1, 2016

danielmewes commented Sep 1, 2016