RethinkDB shuts down slowly on rotational drive #1547

Closed
danielmewes opened this issue Oct 16, 2013 · 12 comments

@danielmewes
Member

Just something I've noticed which might be worth investigating:

After inserting a lot of data (~100 GB) into a single RethinkDB instance on rotational drives, I shut down the server. There was only one table, with a cache size of just 1 GB. After about 15 minutes, it was still shutting down.

What's weird is the output of iostat -m:

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda             305.50         2.86         0.75          5          1
dm-0            201.50         0.74         0.75          1          1
dm-1            542.00         2.12         0.00          4          0

First of all, the write throughput is extremely low at less than a MB per second. Interestingly, RethinkDB is reading more data than it is writing. I assume this is the garbage collector at work.
At the same time, the relatively high number of transfers per second (305 on sda) could indicate that it is bound by disk seeks, but that's just speculation.

The server uses only very little CPU at this point.
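
(Not part of the original report, but for anyone trying to reproduce this: a minimal Python sketch that samples the same per-device read/write rates from /proc/diskstats once per second during shutdown. The device name sda is an assumption; use whatever device backs the data directory.)

```python
# Poll /proc/diskstats once per second and print iostat-like numbers.
# Assumes Linux; DEVICE is an assumption, adjust as needed.
import time

DEVICE = "sda"          # device backing the RethinkDB data directory (assumption)
SECTOR_BYTES = 512      # /proc/diskstats reports 512-byte sectors

def read_stats(device):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                # reads completed, sectors read, writes completed, sectors written
                return (int(fields[3]), int(fields[5]),
                        int(fields[7]), int(fields[9]))
    raise RuntimeError("device %s not found in /proc/diskstats" % device)

prev = read_stats(DEVICE)
while True:
    time.sleep(1)
    cur = read_stats(DEVICE)
    d_reads, d_rsect, d_writes, d_wsect = (c - p for c, p in zip(cur, prev))
    print("iops=%d read=%.2f MB/s write=%.2f MB/s"
          % (d_reads + d_writes,
             d_rsect * SECTOR_BYTES / 1e6,
             d_wsect * SECTOR_BYTES / 1e6))
    prev = cur
```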

@danielmewes
Member Author

hdparm -t /dev/sda reports that it can only read 8 MB/s while RethinkDB is shutting down. Without RethinkDB running the throughput is 124 MB/s on the same disk.

This indicates that the disk is actually saturated by the server.

If this is indeed the garbage collector:

  1. Why does the GC seem to have such a random access pattern? (Or is it inefficient on rotational drives in some other way?)
  2. Why is garbage collection not interrupted at shutdown? It just seems to keep going.

Another interesting effect is that memory utilization as reported by htop fluctuated significantly during the shutdown process. It repeatedly went up and down between about 30% and more than 67% of the machine's 6 GB of memory. I wouldn't expect memory consumption to increase (that much) during shutdown.
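
(Again not from the original report: a small Python sketch, assuming Linux and a single process whose comm name is rethinkdb, that logs the server's resident set size once per second so the fluctuation can be captured more precisely than by eyeballing htop.)

```python
# Log the RSS of the rethinkdb process once per second during shutdown.
import os
import time

def find_pid(name="rethinkdb"):
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/comm" % pid) as f:
                if f.read().strip() == name:
                    return pid
        except IOError:
            continue
    return None

pid = find_pid()
while pid is not None:
    try:
        with open("/proc/%s/status" % pid) as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    print(time.strftime("%H:%M:%S"), line.split()[1], "kB")
    except IOError:
        break  # the process exited, i.e. shutdown finished
    time.sleep(1)
```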

@coffeemug
Contributor

I don't think it's necessarily clear that it's GC that causes a slow shutdown and isn't being interrupted. We need to validate that theory first.

@wojons
Contributor

wojons commented Oct 22, 2013

If I may ask, did you shut down immediately after writing all this data? Also, how much memory does the server itself have? What sort of write durability was used for inserting the data?

@danielmewes
Member Author

Hi @wojons, I shut it down relatively shortly after writing the data, which I inserted with soft durability.
The RethinkDB server used between 2 and 4 GB of the system's 6 GB of memory.
Do you see similar (or different) behavior?
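
(For context, this is roughly what the soft-durability insert looks like from the Python driver of that era; the database, table name, batch size, and document shape below are placeholders, not the ones used in the original test.)

```python
import rethinkdb as r  # pre-2.x driver style: the module itself is the entry point

conn = r.connect("localhost", 28015)
r.db("test").table_create("bulk_test").run(conn)

# Placeholder documents; the real test inserted ~100 GB.
batch = [{"id": i, "payload": "x" * 1000} for i in range(10000)]

# durability="soft" acknowledges writes before they are flushed to disk,
# so a lot of unflushed data can accumulate in memory/cache.
r.db("test").table("bulk_test").insert(batch, durability="soft").run(conn)
```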

@wojons
Contributor

wojons commented Oct 22, 2013

@danielmewes So far in all of my tests I have had fast shutdowns on spinning disks (less than 30 seconds). I have seen your issue with other databases. Did you have any extra indexes other than the primary?

@danielmewes
Member Author

@wojons Thanks for chiming in. I did not have any secondary indexes.
The odd thing is that RethinkDB was actually reading from disk during shutdown. Because this was just a single machine, that couldn't have been due to backfills or other cluster operations either. As far as I know, the only other thing that could cause disk reads during shutdown is our garbage collector. As @coffeemug pointed out, that hypothesis has yet to be verified, though.

I will check later which exact circumstances are required to reproduce it.

@wojons
Contributor

wojons commented Oct 24, 2013

@danielmewes I think I just ran into your problem. I did a mass delete of 2 million items and the shutdown took a while.

@danielmewes danielmewes modified the milestones: backlog, subsequent Nov 25, 2014
@danielmewes danielmewes modified the milestones: outdated, backlog Jan 21, 2015
@danielmewes
Member Author

Closing as outdated.

@wojons
Contributor

wojons commented Jan 21, 2015

@danielmewes do we want to test this before closing it?

@Tryneus
Member

Tryneus commented Jan 21, 2015

If this was happening in debug mode, it may have been due to some debug checks that were performed. I ran into that a couple of months ago and removed the check, as it was rather antiquated. If this was happening in release mode, never mind.

@wojons
Contributor

wojons commented Jan 21, 2015

@Tryneus I can find some time this weekend to load 2M real records into a spinning-disk DB, then delete them, then shut down and see if it's still happening, if @danielmewes can assign it to me.
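
(A sketch of that reproduction, not from the thread itself; the database/table names match the earlier placeholder example, and the timing loop assumes pgrep is available on the host.)

```python
import subprocess
import time
import rethinkdb as r

conn = r.connect("localhost", 28015)

# Mass-delete the previously loaded rows; with durability="soft" the
# acknowledgements come back before the deletions reach disk.
r.db("test").table("bulk_test").delete(durability="soft").run(conn)
conn.close()

# Now stop the server from another terminal (Ctrl-C or `kill -INT <pid>`)
# and time how long the process takes to exit.
start = time.time()
while subprocess.call(["pgrep", "-x", "rethinkdb"],
                      stdout=subprocess.DEVNULL) == 0:
    time.sleep(1)
print("shutdown took %.0f seconds" % (time.time() - start))
```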

@danielmewes
Member Author

Yeah, I closed this because the last instance I'm aware of was quite a while ago, and we've changed and fixed a lot of things since then.

Please just re-open this if you see it again.
