
Node restarts with no error message #5934

Open · bsharpe opened this issue Jul 11, 2016 · 10 comments

bsharpe (Contributor) commented Jul 11, 2016

One of our 3 nodes just restarted with no error message in the logs.

2016-07-11T12:54:36.864986769 256836.984920s info: Table 86d681c4-7810-4964-bac5-1e314f235666: Configuration is changing.

...and then...

2016-07-11T13:48:30.249995989 0.173834s notice: Running rethinkdb 2.3.4~0trusty (GCC 4.8.2)...
2016-07-11T13:48:30.263244308 0.187077s notice: Running on Linux 4.2.0-38-generic x86_64
2016-07-11T13:48:30.263315463 0.187145s notice: Loading data from directory /mnt/data/lib/rethinkdb/default
2016-07-11T13:48:30.437299564 0.361133s info: Cache size is set to 204800 MB
2016-07-11T13:48:30.440596177 0.364426s notice: Listening for intracluster connections on port 29015
2016-07-11T13:48:30.440606061 0.364436s warn: Attempted to join self, peer ignored
2016-07-11T13:48:30.441443230 0.365273s info: Attempting connection to 2 peers...

Ubuntu 14.04.1
Using RethinkDB 2.3.4 built from source as per instructions in the docs

We were under normal operations, plus a full backup was being performed at the same time.

larkost (Collaborator) commented Jul 11, 2016

One of the ways this can happen is that the out-of-memory killer (the OOM killer, an OS feature) decides the system would be more stable with some free memory and kills the largest memory user (databases are large users for good reason). Unfortunately there is no catchable signal sent, so there is nothing we can put in our logs. But if this is the case, then there is probably an entry in the system log about the event. Those entries usually have the text "Killed process" in the line, so it should be easy to find.

@bsharpe: can you check for that message? If it is there, then I would check your memory settings to make sure that RethinkDB's cache is not set to more than, say, 2/3 of the total memory size: r.db('rethinkdb').table('server_config') (look for cache_size_mb). We default to half the available memory, but you can set that (usually in a config file) to anything you like (including nonsensical values).
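For example, the per-server cache setting can be read with a short ReQL query. This is only a minimal sketch using the Python driver, with assumed connection details:

    import rethinkdb as r

    # Assumed connection details; any node in the cluster will do.
    conn = r.connect(host='localhost', port=28015)

    # server_config has one row per server; cache_size_mb is either a number
    # of megabytes or the string "auto" (half of the available memory).
    cursor = r.db('rethinkdb').table('server_config').pluck(
        'name', 'cache_size_mb').run(conn)
    for row in cursor:
        print('%s: %s' % (row['name'], row['cache_size_mb']))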

bsharpe (Contributor, Author) commented Jul 11, 2016

thx @larkost -- cache size was > 2/3 of available RAM... will tone it down. :)
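For reference, the cache can be lowered at runtime by updating the same system table. A sketch, where the 128000 MB figure is purely an illustrative assumption, not a recommendation:

    import rethinkdb as r

    conn = r.connect(host='localhost', port=28015)

    # 128000 MB is only an example value; pick roughly half the machine's RAM.
    # The equivalent permanent setting is cache-size in the config file
    # (or --cache-size on the command line).
    result = (r.db('rethinkdb').table('server_config')
              .update({'cache_size_mb': 128000})
              .run(conn))
    print(result)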

bsharpe (Contributor, Author) commented Jul 11, 2016

this would be very useful info to have in the docs...

danielmewes (Member) commented:

One thing is unclear to me though: the OOM killer would terminate the rethinkdb process without an error appearing in our log, but we don't have anything in place to restart it automatically. @bsharpe, is the automatic restart something that you set up?

bsharpe (Contributor, Author) commented Jul 11, 2016

@danielmewes yes, we set up monit to restart things...

bsharpe (Contributor, Author) commented Jul 11, 2016

@larkost confirmed...

Jul 11 16:34:24 db-1 kernel: [2246975.984824] Out of memory: Kill process 164167 (rethinkdb) score 983 or sacrifice child
Jul 11 16:34:24 db-1 kernel: [2246975.984891] Killed process 164167 (rethinkdb) total-vm:296731348kB, anon-rss:259131484kB, file-rss:1684kB

danielmewes (Member) commented:

OK, we need to figure out why something started using so much memory.

We've had some reports recently of increased memory usage by RethinkDB under disk I/O contention. Taking a backup might have pushed it into that scenario. We're still investigating that problem, so it's too early to tell whether it's a plausible explanation here or not.

Another thing we're looking into is whether the backup script itself could be using too much memory.

@bsharpe
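In the meantime, one way to watch where the memory is going is to poll the stats system table for cache usage per server. This is only a sketch and assumes the 2.3-era stats layout, where table_server rows expose storage_engine.cache.in_use_bytes:

    import rethinkdb as r
    from collections import defaultdict

    conn = r.connect(host='localhost', port=28015)

    # "table_server" rows carry per-table, per-server storage statistics.
    cursor = r.db('rethinkdb').table('stats').filter(
        lambda row: row['id'][0] == 'table_server').run(conn)

    cache_in_use = defaultdict(int)
    for row in cursor:
        cache_in_use[row['server']] += row['storage_engine']['cache']['in_use_bytes']

    for server, used in sorted(cache_in_use.items()):
        print('%s: %.1f MB of cache in use' % (server, used / 1024.0 / 1024.0))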

bsharpe (Contributor, Author) commented Jul 11, 2016

@danielmewes

Unfortunately... we were not running a backup at this time.
When we do, though:

  • we use rethinkdb dump
  • we run it on a separate server through a proxy
  • not related to #5935 -- that happens whenever we try to shard this one table (no backups during that time)
  • if I had to guess, from watching the dashboard -- I'd say that our write volume is about 1/10th of our read volume.

danielmewes (Member) commented:

@bsharpe So the out-of-memory situation happened after the backup finished (or before it started)? I'm a bit confused, because you wrote

We were under normal operations, plus a full backup was being performed at the same time.

in the first comment. Just want to make sure I understand what happened...

bsharpe (Contributor, Author) commented Jul 11, 2016

@danielmewes sorry -- we had two of these today. Yes, the first one was during a backup; the second one was not. When the backup was happening, it was run from a different machine via a proxy.
