Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with RethinkDB hanging #4299

Closed
thaiat opened this issue May 28, 2015 · 25 comments
Closed

Issue with RethinkDB hanging #4299

thaiat opened this issue May 28, 2015 · 25 comments
Assignees
Milestone

Comments

@thaiat
Copy link

thaiat commented May 28, 2015

Hi all,

We are in the process of writing a loopback connector for rethinkdb that uses the amazing rethinkdbdash

The code is available here : https://github.com/yoobic/loopback-connector-rethinkdbdash

After running test (npm test) for about 15 to 20 times, rethinkdb will hang.
I can see this message : 'Entering slow growth mode'

The only way to come back to normal is to execute:

pkill -9 rethinkdb

I'm on mac, lots of RAM and disk space available.
I can reproduce on different laptop.
Using latest of rethinkdb and rethinkdbdash

@neumino
Copy link
Member

neumino commented May 28, 2015

You can enter slow growth mode, it's doesn't always mean that RethinkDB is dead/hanging.
It typically happen if you reach the maximum of open file. Basically rethinkdbdash will just slowly try to open connections instead of eagerly opening them.

I tried running npm test in a loop and ended up with #4300, not sure if it's related though.

@Ijmir
Copy link

Ijmir commented May 28, 2015

After running the tests about 15-20 times in loop the db stops responding and all tests always fail.
Can't reach via web interface too. An error I saw was "Uncaught RqlDriverError: Handshake timedout" while trying to connect to DB in hanged mode.
After restarting db (by killing the process "kill -KILL "), it hangs even after the first time the test are running - no need to 20 times loop. Only when I delete the files in "rethinkdb-data" and restart db, it doesn't hang immediately.

@mlucy mlucy added this to the 2.0.x milestone May 28, 2015
@mlucy mlucy added the tp:bug label May 28, 2015
@mlucy
Copy link
Member

mlucy commented May 28, 2015

Thanks for the bug report.

Is there any error output from RethinkDB before it hangs?

It might be that you're hitting the maximum open file limit on OS X. Do you still see the hang if you run ulimit -n 10000 first?

@mlucy mlucy self-assigned this May 28, 2015
@thaiat
Copy link
Author

thaiat commented May 30, 2015

@mlucy still happenning, here is my ulimit :

-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8192
-c: core file size (blocks)         0
-v: address space (kbytes)          unlimited
-l: locked-in-memory size (kbytes)  unlimited
-u: processes                       709
-n: file descriptors                65536

@Ijmir
Copy link

Ijmir commented May 31, 2015

@mlucy I get on server logs:

2015-05-31T12:17:43.549539000 125.589003s error: Could not write to socket: Protocol wrong type for socket
2015-05-31T12:20:45.407267000 307.452388s error: Could not write to socket: Protocol wrong type for socket

I ran ulimit -n 10000 but it still reproduced. My ulimit:

-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8192
-c: core file size (blocks)         0
-v: address space (kbytes)          unlimited
-l: locked-in-memory size (kbytes)  unlimited
-u: processes                       709
-n: file descriptors                10000

Web interface does not respond.

@mlucy
Copy link
Member

mlucy commented Jun 2, 2015

Thanks for the update.

@danielmewes, any idea what might be going on here?

@danielmewes
Copy link
Member

Very odd. I wonder if the log messages are relevant. They might be a red herring, since the issue appears to only happen after a while and then goes away once rethinkdb_data is deleted.
On the other hand I think it's also weird that we're getting a "Protocol wrong type for socket" error while writing to a presumably already existing socket. I would expect an unsupported protocol type to bubble up when we initially open the socket, not when writing to it.
Either the error message is wrong, or something very odd is going on. @Tryneus any ideas about the protocol error?

@Ijmir when RethinkDB hangs, could you check two things for us and post the results here?

  • how much CPU is the RethinkDB process using at that time?
  • what is the content of rethinkdb_data?

This data might help narrow down the possible causes of this.

@danielmewes
Copy link
Member

Hmm the same "Protocol wrong type" error appeared here #4327 (comment) for a different user. It might be unrelated. Still curious what it means.

@Ijmir
Copy link

Ijmir commented Jun 3, 2015

@danielmewes
memory 325MB
cpu 53%
I don't always get the "Protocol wrong type"
rethinkdb_data - https://dl.dropboxusercontent.com/u/15207747/rethinkdb_data.zip
this data hangs rethink db from 1-2 runs of test suite.

screen shot 2015-06-03 at 13 24 29
screen shot 2015-06-03 at 13 25 11

@danielmewes danielmewes assigned danielmewes and unassigned mlucy Jun 3, 2015
@danielmewes danielmewes modified the milestones: 2.0.3, 2.0.x Jun 3, 2015
@danielmewes
Copy link
Member

Thanks for the data and information @Ijmir ! Will try to reproduce.

@danielmewes
Copy link
Member

@Ijmir I think I'm not running the tests correctly. All I get when I run npm test on your repository is


> loopback-connector-rethinkdbdash@0.0.2 test /home/ssd3/daniel/user-data/4299-data/loopback-connector-rethinkdbdash
> gulp lint

[13:29:28] Using gulpfile /home/ssd3/daniel/user-data/4299-data/loopback-connector-rethinkdbdash/gulpfile.js
[13:29:28] Starting 'static'...
[13:29:30] 
/home/ssd3/daniel/user-data/4299-data/loopback-connector-rethinkdbdash/lib/rethink.js
  590:0  warning  Unexpected todo comment  no-warning-comments

✖ 1 problem (0 errors, 1 warning)

[13:29:30] All lint files passed
[13:29:30] Finished 'static' after 1.18 s
[13:29:30] Starting 'lint'...
[13:29:30] Finished 'lint' after 6.71 μs

It doesn't seem to run any RethinkDB tests. Any idea what I'm doing wrong? (I'm not that familiar with node)

@danielmewes
Copy link
Member

Oh never mind. Got it working by running

node_modules/gulp/bin/gulp.js test

instead I think.

@danielmewes
Copy link
Member

On Linux I run into #4300 pretty quickly, so I'm going to test this on Mac next. Maybe we're just not catching the stack overflow properly on Mac.

@thaiat
Copy link
Author

thaiat commented Jun 3, 2015

@danielmewes to run the test you have to execute

make test-verbose 

@danielmewes
Copy link
Member

Thanks, I can reproduce it on OS X.

@danielmewes
Copy link
Member

The weird thing is that once RethinkDB gets into that state, htop shows that it spends ~50% CPU in kernel code. However according to dtrace there are virtually no syscalls going on.

I also tried attaching instruments timing-based profiler and it didn't show anything interesting either.

However I found that increasing the coroutine stack size avoids the issue, so I think the initial theory that this is the same as #4300 , except that something is broken about our coroutine stack overflow detection on OS X.
We do call mprotect successfully on the watchdog page, so something must go wrong with handling the signal that accessing it should trigger.

@danielmewes
Copy link
Member

This will probably go away with #4024 (comment) , though we also need to find out why the stack overflow protection isn't working on OS X.

@danielmewes danielmewes modified the milestones: 2.0.x, 2.0.3 Jun 4, 2015
@danielmewes
Copy link
Member

Ah, on OS X accessing a protected page can trigger SIGSEGV or SIGBUS. On Linux it always causes SIGSEGV. We are handling SIGSEGV but not SIGBUS.

@danielmewes danielmewes modified the milestones: 2.0.3, 2.0.x Jun 4, 2015
@danielmewes
Copy link
Member

A fix for us not catching the stack overflow on OS X is now in code review 2954 by @AtnNn .

@thaiat this fix will stop RethinkDB from hanging, but instead it is going to crash just like described by @neumino in #4300 .
So it doesn't fully resolve the problem. There must be some query in the test suite that's very deeply nested that's causing this to happen in the first place.

In the near future, we plan to implement a protection for such queries such that the query itself will fail, but the server won't crash (our current behavior is obviously bad).

Sorry you had to run into this.

@danielmewes
Copy link
Member

Merged into v2.0.x as of fec4595 and cherry-picked into next in 82f2487 and 978bbf9

@primitive-type
Copy link

I believe I'm still experiencing an issue related to this: neumino/thinky#323

@danielmewes
Copy link
Member

@primitive-type That failure looks unrelated to this issue. The "too many open files" message usually means that there's still a configuration problem on the OS side. Can you make sure that your increased maxfiles settings are correctly applied to the user / shell that you run RethinkDB from?

@primitive-type
Copy link

@danielmewes Sure, here is the result of ulimit -all:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2048
virtual memory          (kbytes, -v) unlimited

I closely followed the instructions at http://docs.basho.com/riak/latest/ops/tuning/open-files-limit/#Mac-OS-X

RethinkDB was using about 25% CPU at the time of the crash. (Mac OS X 10.10.4)

@amir-rahnama
Copy link

Recently running a Rethinkdb 2.2.5 cluster goes on my local machine into slow growth mode a lot. I am using Docker for Mac Version 1.12.1-rc1-beta23. I am not really using volumes or anything but still it happens.

Any ideas?

@danielmewes
Copy link
Member

@ambodi It's likely a different issue from what this was about originally. Check out @neumino's comment here #4299 (comment) about what slow-growth mode means.

I recommend checking the ulimits, specifically the open file limit by running ulimit -a. You can also check the RethinkDB log_file for messages like these: error: accept() failed: Too many open files..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants