Server quitting with data corruption error on queries #3090
Thanks @gato for reporting this, and sorry you ran into this. What version of RethinkDB were you running at that time? @danielmewes fixed something that could have led to such a crash; see #2410. If you were running a more recent version of RethinkDB, we should re-open the bug and try to track it down again.
The log file says he is running 1.13.4. So this looks like a new bug. @gato, would you be willing to send us a copy of the corrupted database files so we can debug the problem? @mglukhovsky can arrange a secure upload site.
@gato, please send me an email (mike@rethinkdb.com) if you'd be willing to send over the database files so we can track this down (happy to sign an NDA if necessary). Thanks for reporting this!
Hi, I've just installed the latest version from the Ubuntu repository (rethinkdb 1.14.1-0ubuntu1~trusty (GCC 4.8.2)). I will reimport the data now and run queries after that. If the error persists I will post an updated log and talk with @mglukhovsky. Thanks for the support. Marcelo
Hi, those are the logs:
Server 2 (the one with the error):
Both servers are identical: dual core, 2 GB RAM, and a 120 GB SATA 3 SSD. No other software is running on them (both are physical machines), and RethinkDB is configured to use 512 MB as cache. Anyway, version 1.15 was released and I saw a couple of thread_pool errors fixed in the commits that are part of it, so I will install this version and start over with the tests. Are disconnections and swapping normal, or could this be part of the problem? Is there any server parameter to make the logs more verbose, or do I have to compile my own version to get better logging? Thanks.
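For context, a cache limit like the 512 MB mentioned here is typically set via the `cache-size` option (in MB) in the instance's configuration file. This is a sketch; the path below is the Debian/Ubuntu package default and may differ on other setups:

```
# /etc/rethinkdb/instances.d/default.conf  (path assumed; Ubuntu package layout)
cache-size=512
```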
Hi @gato -- sorry you're running into this. We'll look into the issues and fix them ASAP. The best way to proceed is to work with @mglukhovsky to get us the data and have us replicate the issue internally. Unfortunately, with heartbeat timeouts there isn't much additional logging we can do to get to the root of the problem, so we have to work to replicate it internally.
OK, I will try version 1.15 and let you know. I asked about log verbosity in general: if I can set it to debug or something like that, I should do it before the test so we can get more info if something goes wrong.
@gato thanks, shoot me an email (mentioned above) if this hasn't been resolved in 1.15 so we can track this down. Thanks for your help on this!
Hi, I have been doing lots of tests. The servers are not asserting any more, but I'm still having issues. These are the tests I made:

1. Tried server version 1.15.0 and had the heartbeat issue on the servers, and the `rethinkdb import` command hangs, but the servers reconnect to each other and work. Server 1 had 1 GB of swap space used, but server 2 was OK (the import process runs on server 1 and connects to it).
2. Tried server version 1.15.1 and had issues with `rethinkdb import` aborting with an odd "buggy client" message. I assumed it was an old Python client library, so I ran `pip install --upgrade` and updated the client to 1.15.0.
3. Tried with the new version and got the same "buggy client" issue. I have multiple JSON files with data (4000+ files with 5000 docs each), so I binary-searched the file that had thrown the error to see if I could find the buggy row. After cutting JSONs for a couple of hours I found that some of the JSON files contained documents of another type (we want to import them later into another table). The odd thing here is that these different kinds of documents failed randomly, not always on the same one, but every time on one of them. So I removed them and tried again.
4. Now with the latest server version, the latest driver version, and only one kind of document, I got the same heartbeat problem as in the first test. The servers are still running, `rethinkdb import` hangs forever, and server 1 has 1 GB of swap used (no errors in the RethinkDB log except for the heartbeat issue).

Note that killing the import process does not free the memory, so I assume it is not `rethinkdb import` that is sucking up memory (but it could be, because it is invoked once per file, so if it leaks even small amounts of memory, after calling it some 2000 times a lot of memory will be leaked). I guess the heartbeat issue shows up when server 1 is swapping too much, and the memory leakage only shows on server 1 (the one running the import process, also the one that the client connects to; server 2 is linked to server 1).
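The per-file invocation described above would look roughly like the following; this is a hypothetical reconstruction, and the connect address, database, and table names are placeholders, not the actual values from the thread:

```
# Import each JSON file in its own rethinkdb-import process
# (illustrative; db/table/address are made up).
for f in data/*.json; do
    rethinkdb import -c localhost:28015 --table mydb.mytable --format json -f "$f" --force
done
```

Because each file spawns a fresh process, a small per-invocation leak would indeed compound across thousands of files, as noted above.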
Thank you for the great report @gato and for running these tests.
Also, @gato (and @mglukhovsky) I think the source JSON data would be very helpful for us to analyze the observed memory consumption locally. We are happy to sign an NDA if it contains sensitive data.
Hi @danielmewes |
Thank you @gato for the clarification. So I think this is essentially an issue of memory consumption and RethinkDB going into swap as a consequence at this point. We will test this more and investigate how to reduce our memory footprint once #2988 is fixed (since these could actually be the same bugs, or at least be related).
Hi! |
Hi @gato For running RethinkDB in Valgrind, you can follow these steps:
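The actual steps did not survive in this copy of the thread. A typical way to run a server binary under Valgrind's memcheck looks roughly like this (a sketch; the data directory path and flags are illustrative, and expect the server to run many times slower):

```
# Run RethinkDB under Valgrind memcheck (illustrative invocation).
valgrind --tool=memcheck --leak-check=full --log-file=valgrind.log \
    rethinkdb --directory /var/lib/rethinkdb/default/data --bind all
```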
Regarding the timeout in the import script:
I cannot quite reproduce the observed increase in memory usage. @gato, did you ever send us your data? That would help us reproduce the issue. Also, an example of the filter conditions that you've been using for testing would be useful.
Hi
Running this script continuously will consume the swap in around 18 hours (at least on our servers).
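The script itself was shared privately, but from the descriptions in this thread it was presumably a loop of filter/count queries along these lines. This is a guess at its shape, not the actual script; the table name, filter condition, and connection details are made up, and the `import rethinkdb as r` style matches the 1.x-era Python driver:

```
import rethinkdb as r  # legacy 1.x driver import style

conn = r.connect("localhost", 28015)
while True:
    # Repeatedly run a filtered count -- the kind of query reported
    # to slowly grow the server's memory usage until swap is exhausted.
    n = r.db("test").table("docs").filter({"type": "invoice"}).count().run(conn)
    print(n)
```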
@gato Sorry, I had missed the part about the data above.
I have been trying to reproduce this with data derived from the sample document @gato kindly provided for us. Using a cache size of 512 MB, these are the memory usage numbers I'm getting:
It seems I cannot quite reproduce the growing memory consumption from running the count queries. In this test memory usage actually slightly increased over time, but that's probably just a measurement artifact (I might have measured at different times during query execution). One thing that definitely requires more investigation is why the memory consumption was about 500 MB higher after inserting the documents, compared to when I later ran the queries after a restart. I will look into that next.
Hi @danielmewes |
@gato we are going to switch our default memory allocator from TCMalloc to jemalloc in RethinkDB 1.16.0. This has resolved the possibly related memory issue in #2988 (comment). I can send you a RethinkDB 1.15.2 binary for Ubuntu trusty with that change backported. Please drop me a quick email to daniel at rethinkdb.com if you want to give it a shot.
Hi @danielmewes, Thanks for your offer.
@gato We are planning to release 1.16.0 in early January. Please let me know which way works better for you.
I will wait then. Thanks.
@gato We've just released RethinkDB 1.16.
@gato Any updates? Are you still seeing this with 1.16?
Sorry, haven't tried yet; I'm currently working on something else right now and do not have quick access to the servers. Will try to make some time soon and test it.
No worries. Thanks for keeping us posted, @gato.
Hi @danielmewes, |
@gato Thanks a lot. I really appreciate that you took the time to re-try this. At this point we might have to conclude that 2 GB of RAM (is that still the hardware you're using?) isn't enough for importing this data set with the default configuration at the moment (though we should look into that more). To get past the import stage, you can run
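The concrete command was cut off in this copy of the thread. Judging from the context, reducing the import's parallelism is the usual lever for lowering peak memory usage; a sketch of such an invocation (the flag value, address, and table name are illustrative, not the exact command from the thread):

```
# Limit rethinkdb-import to a single concurrent client connection
# to reduce peak memory pressure during the import (illustrative).
rethinkdb import -c localhost:28015 --table mydb.mytable -f data.json --format json --clients 1
```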
Yes, all hardware is the same, in the same conditions and with the same data.
@gato did you get a chance to try the suggestion above?
Hi @danielmewes, |
@gato Thanks for the info. Sorry this isn't working; I know you must have spent a lot of time on this. So it seems like the 2 GB servers aren't enough for this. I'll try again to reproduce this and will run some memory profiling, but I'm not sure how quickly we can bring the memory usage down enough to make this work. For now the only option unfortunately seems to be bigger servers.
Hi @danielmewes |
Closing this. We might still want to look into reducing memory usage, but I think there's nothing specific left to do for this issue.
Hi!
I had this error a while ago running a cluster with 2 servers in a fully replicated configuration. It was loaded with 5 million 4 KB documents (and the final installation will have more than 20M).
The error always happens on the same server. I tried switching which server was master and which was replica, but always got the error pasted below.
I've checked the disk for errors but found none.
I've rebuilt the installation using rotational disks (3 TB, 7200 RPM), but the resulting database was really slow (more than 2 days to create a compound index, compared with about 3 hours to create the same one on an SSD).
This error is the original one; I will rebuild the installation with the latest version and update the ticket.