hangs with asynchronous writes, millions of records, and ≥60% cache usage #5634
I've disabled swap (removed the partition) and reduced RethinkDB's cache size to below 16 GiB: the hang just happened earlier. I guess a million or two records had not yet been written to disk. In this state I cannot kill RethinkDB's main process.
The issue does not exist in RethinkDB 2.2.6.
Running on Ubuntu 16.04 (Xenial), updated as of today, with its Linux 4.4.0-4-generic kernel, SSD with ext4. RethinkDB is your image published for Docker and was started this way:

```
# sysctl -w vm.overcommit_memory=1
# (no memory limits were hit, though)
docker run --name bench_rethinkdb \
  -v /var/lib/rethinkdb:/data \
  --net=host --privileged \
  -d \
  rethinkdb:2.3 rethinkdb --bind all --no-update-check
```

(I've used 2.2.6 and 2.3.0.)

I cannot send you the data directory, but I can send the raw data ready for your import tool, packed as a squashfs image of about 2.5 GiB. That's all that is needed to reproduce the issue. No NDA needed. If that's okay with you, I will start the upload and link to it here in a day or two.
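The `vm.overcommit_memory=1` step above can be double-checked from code before starting the container; a minimal Python sketch (Linux-only, reading the sysctl through `/proc` — this helper is illustrative, not part of RethinkDB's tooling):

```python
def overcommit_mode():
    """Return vm.overcommit_memory: 0 = heuristic, 1 = always overcommit, 2 = never."""
    with open("/proc/sys/vm/overcommit_memory") as f:
        return int(f.read().strip())

print(overcommit_mode())
```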
Thanks for looking into this! https://[link not public anymore]/

```
docker run \
  -v /var/lib/rethinkdb:/data \
  --net=host --privileged \
  -d \
  rethinkdb:2.3 rethinkdb --bind all --no-update-check
```

In database 'test' create a new table 'benchdata', then:

```
sudo mount benchdata-people.csv.xz.squashfs /mnt
/usr/local/bin/rethinkdb-import \
  --force --table test.benchdata \
  --format csv -f /mnt/*.csv
```

After a few minutes, depending on RethinkDB's cache size, writes/s drop to zero and RethinkDB stops responding. The error occurs even with said importer. I had previously used a script written in Go for this (which pushed the data in batches of 100 rows per transaction), but found that it makes no difference. It also made no difference whether I used 'address' as a string or a nested map, whether 'isMale' is a string or a bool, or whether 'birthday' is a datetime or a string. My guess is that this is about millions of rows of <512 bytes each, and something kicking in eventually.
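The batched-insert strategy mentioned above (100 rows per transaction) can be sketched as follows. This is an illustrative Python rewrite, not the original Go importer; `insert` is a hypothetical stand-in for the driver's actual table-insert call:

```python
BATCH_SIZE = 100  # rows per insert, as in the original importer

def batched(rows, size=BATCH_SIZE):
    """Yield successive lists of at most `size` rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

def import_rows(rows, insert):
    """Feed every batch to `insert`, e.g. a closure around the driver's
    table insert (roughly: r.table('benchdata').insert(batch).run(conn))."""
    total = 0
    for batch in batched(rows):
        insert(batch)
        total += len(batch)
    return total
```

With this shape, insert load arrives in bursts of `BATCH_SIZE` rows rather than row-by-row, which is what made the workload "burstier" in the tests described later in the thread.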
@wmark Sorry for the delay, I was out of office last week. The download link no longer seems to be valid. Could you send me a new one via email to daniel@rethinkdb.com?
It looks like Docker has a default size limit for the changed data of a given container, depending on the storage backend used. I wonder what happens when that limit is exhausted. Do you think it's possible that you're hitting this?
I don't think I hit that limit: version 2.2.6 runs just fine, even with a cache of 22 GiB. The data is here: https://s.blitznote.com/unclassified/benchdata-people.csv.xz.squashfs (2.5 GiB)
Thanks for the re-upload @wmark. I got the data this time. Trying to reproduce now.
So far I couldn't reproduce. It has imported 46M rows at this point and is still writing at a steady rate.
My server went through and imported all 50M rows fine. I'll retry this on a larger machine in the coming days to see if that changes anything.
I will pull some memory (down from 32 GiB) and disable some threads (down from 24) tomorrow.
@wmark You mentioned that you tried both larger and smaller cache sizes. I used a cache of just 2 GB. Had you previously tried anything that small?
No, I've never tried it with anything less than the 7 GiB RethinkDB sets by default with 16 GiB of memory, which is the lowest I can go (without setting anything myself) because I only have memory sticks of 16 GiB each. Anyway, I've started a test series now on a fresh Ubuntu 16.04, 2×SSDs with ext4, Linux 4.4.0-21, Docker 1.11.0 (driver: overlay), and will update this comment with the results.
… 32 GiB, 4-thread combinations follow. Due to (1) and the new Linux kernel scheduling RethinkDB on the first half of the CoD (~ NUMA node), I suspect that a variable in RethinkDB is used as if it were atomic, but without proper locking.
Thanks for running more tests and helping us track this down. Can you explain a bit more why you think that the success of configuration 1 is an indication of a locking issue?
It's because this time RethinkDB's threads didn't span two or more NUMA nodes. I will run some more tests to confirm that.
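One way to probe the single-NUMA-node hypothesis is to pin the server process to the CPUs of just one node before it spawns its threads. A minimal, Linux-only Python sketch using `os.sched_setaffinity`; which CPU numbers belong to node 0 is an assumption here (`numactl --hardware` shows the real mapping on a given machine):

```python
import os

def pin_to_cpus(cpus):
    """Restrict the calling process (and threads it spawns later) to `cpus`,
    then return the affinity set actually in effect."""
    os.sched_setaffinity(0, set(cpus))  # pid 0 = the calling process
    return os.sched_getaffinity(0)

# Hypothetical mapping: assume CPUs 0-11 sit on NUMA node 0, so
# pin_to_cpus(range(12)) would confine the process to that node.
```

In practice one would run this in a small launcher before exec'ing the server, or use `numactl --cpunodebind=0 --membind=0` to the same effect.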
I've changed cache sizes, forced the kernel to schedule RethinkDB on two different NUMA nodes, switched to an importer written in Go to make inserts burstier – but cannot reproduce this with 4.4.0-21 anymore. Thanks for staying on this with me!
@wmark Thanks for putting so much work into this. I'll see if I can find anything suspicious in the kernel changelogs. So it seems Ubuntu's 4.4.0-4 was bad and 4.4.0-21 is good, right? In the meantime, I have also tested 2.3.0 and 2.3.1 on a larger server with more RAM (though I limited the cache size to 16 GB) and with two NUMA nodes. Both worked fine, despite the scheduler scheduling RethinkDB across both nodes. However, I was running a much older kernel on the host, namely 3.13.0-85-generic. So it seems plausible that this is somehow related to a kernel detail.
Well, I skimmed through the changelog http://changelogs.ubuntu.com/changelogs/pool/main/l/linux/linux_4.4.0-21.37/changelog but nothing popped out at me. That doesn't mean much, though.
I seem to be hitting an invisible wall when inserting some millions of rows (avg. 160 bytes in size):
As soon as I've inserted 18M rows, RethinkDB just stops responding.
Any pending INSERTs just hang.
CPU is at 0% at this point, there is no disk activity (no consolidation of the data on disk), and ~60% of "cache" has been used (I tried setting the cache to 16 GiB and, independently, to 22 GiB, without any difference); Linux shows that I still have some gigabytes of memory free (out of 32 GiB total).
The web interface shows "NaN% cache used" some time after the stalling.