
Node restarts with no error message #5934

Open · bsharpe opened this issue Jul 11, 2016 · 10 comments

bsharpe (Contributor) commented Jul 11, 2016

One of our 3 nodes just restarted with no error message in the logs.

2016-07-11T12:54:36.864986769 256836.984920s info: Table 86d681c4-7810-4964-bac5-1e314f235666: Configuration is changing.

...and then...

2016-07-11T13:48:30.249995989 0.173834s notice: Running rethinkdb 2.3.4~0trusty (GCC 4.8.2)...
2016-07-11T13:48:30.263244308 0.187077s notice: Running on Linux 4.2.0-38-generic x86_64
2016-07-11T13:48:30.263315463 0.187145s notice: Loading data from directory /mnt/data/lib/rethinkdb/default
2016-07-11T13:48:30.437299564 0.361133s info: Cache size is set to 204800 MB
2016-07-11T13:48:30.440596177 0.364426s notice: Listening for intracluster connections on port 29015
2016-07-11T13:48:30.440606061 0.364436s warn: Attempted to join self, peer ignored
2016-07-11T13:48:30.441443230 0.365273s info: Attempting connection to 2 peers...

Ubuntu 14.04.1
Using RethinkDB 2.3.4 built from source as per instructions in the docs

We were under normal operations, plus a full backup was being performed at the same time.

larkost (Collaborator) commented Jul 11, 2016

One of the ways this can happen is that the out-of-memory killer (the OOM killer, an OS feature) decides the system would be more stable with some free memory and kills the largest memory user (databases are large users for good reason). Unfortunately there is no catchable signal sent, so there is nothing we can put in our logs. But if this is the case, then there is probably an entry in the system log about the event. Those entries usually have the text "Killed process" in the line, so it should be easy to find.

@bsharpe: can you check for that message? If it is there, then I would check your memory settings to make sure that RethinkDB's cache is not set to more than, say, 2/3 of the total memory size: r.db('rethinkdb').table('server_config') (look for cache_size_mb). We default to half the available memory, but you can set that (usually in a config file) to anything you like (including nonsensical values).
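For example, the per-server cache setting can be read with a short ReQL query. This is only a minimal sketch using the Python driver, with assumed connection details:

    import rethinkdb as r

    # Assumed connection details; any node in the cluster will do.
    conn = r.connect(host='localhost', port=28015)

    # server_config has one row per server; cache_size_mb is either a number
    # of megabytes or the string "auto" (half of the available memory).
    cursor = r.db('rethinkdb').table('server_config').pluck(
        'name', 'cache_size_mb').run(conn)
    for row in cursor:
        print('%s: %s' % (row['name'], row['cache_size_mb']))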

bsharpe (Contributor, Author) commented Jul 11, 2016

thx @larkost -- cache size was > 2/3 of available RAM... will tone it down. :)
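For reference, the cache can be lowered at runtime by updating the same system table. A sketch, where the 128000 MB figure is purely an illustrative assumption, not a recommendation:

    import rethinkdb as r

    conn = r.connect(host='localhost', port=28015)

    # 128000 MB is only an example value; pick roughly half the machine's RAM.
    # The equivalent permanent setting is cache-size in the config file
    # (or --cache-size on the command line).
    result = (r.db('rethinkdb').table('server_config')
              .update({'cache_size_mb': 128000})
              .run(conn))
    print(result)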

bsharpe (Contributor, Author) commented Jul 11, 2016

this would be very useful info to have in the docs...

danielmewes (Member) commented:

One thing is unclear to me though: the OOM killer would terminate the rethinkdb process without an error appearing in our log, but we don't have anything in place to restart it automatically. @bsharpe, is the automatic restart something that you set up?

bsharpe (Contributor, Author) commented Jul 11, 2016

@danielmewes yes, we set up monit to restart things...

bsharpe (Contributor, Author) commented Jul 11, 2016

@larkost confirmed...

Jul 11 16:34:24 db-1 kernel: [2246975.984824] Out of memory: Kill process 164167 (rethinkdb) score 983 or sacrifice child
Jul 11 16:34:24 db-1 kernel: [2246975.984891] Killed process 164167 (rethinkdb) total-vm:296731348kB, anon-rss:259131484kB, file-rss:1684kB

danielmewes (Member) commented:

OK, we need to figure out why something started using so much memory.

We've had some reports recently of increased memory usage by RethinkDB under disk I/O contention. Taking a backup might have pushed it into that scenario. We're still investigating that problem, so it's too early to tell whether it's a plausible explanation here or not.

Another thing we're looking into is whether the backup script itself could be using too much memory.

@bsharpe
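In the meantime, one way to watch where the memory is going is to poll the stats system table for cache usage per server. This is only a sketch and assumes the 2.3-era stats layout, where table_server rows expose storage_engine.cache.in_use_bytes:

    import rethinkdb as r
    from collections import defaultdict

    conn = r.connect(host='localhost', port=28015)

    # "table_server" rows carry per-table, per-server storage statistics.
    cursor = r.db('rethinkdb').table('stats').filter(
        lambda row: row['id'][0] == 'table_server').run(conn)

    cache_in_use = defaultdict(int)
    for row in cursor:
        cache_in_use[row['server']] += row['storage_engine']['cache']['in_use_bytes']

    for server, used in sorted(cache_in_use.items()):
        print('%s: %.1f MB of cache in use' % (server, used / 1024.0 / 1024.0))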

bsharpe (Contributor, Author) commented Jul 11, 2016

@danielmewes

Unfortunately... we were not running a backup at this time.
When we do, though:

  • we use rethinkdb dump
  • we run it on a separate server through a proxy
  • not related to #5935 -- that happens whenever we try to shard this one table (no backups during that time)
  • if I had to guess, from watching the dashboard -- I'd say that our write volume is about 1/10th of our read volume.

danielmewes (Member) commented:

@bsharpe So the out-of-memory situation happened after the backup finished (or before it started)? I'm a bit confused, because you wrote

We were under normal operations, plus a full backup was being performed at the same time.

in the first comment. Just want to make sure I understand what happened...

bsharpe (Contributor, Author) commented Jul 11, 2016

@danielmewes sorry -- we had two of these today. Yes, the first one was during a backup; the second one was not. When the backup was happening, it was run from a different machine via a proxy.
