
Server quitting with data corruption error on queries #3090

Closed
gato opened this issue Sep 23, 2014 · 38 comments

@gato

gato commented Sep 23, 2014

Hi!
I had this error a while ago running a cluster of 2 servers in a fully replicated configuration. It was loaded with 5 million 4 KB documents (and the final installation will have more than 20M).
The error always happens on the same server. I tried switching which server was master and which was replica, but I always got the error pasted below.
I've checked the disk for errors but found none.
I've rebuilt the installation using rotational disks (3 TB, 7200 RPM), but the resulting database was really slow (more than 2 days to create a compound index, compared with about 3 h to create the same one on an SSD).

2014-08-21T11:47:19.362082555 0.128185s info: Running rethinkdb 1.13.4-0ubuntu1~trusty (GCC 4.8.2)...
2014-08-21T11:47:19.371603714 0.137707s info: Running on Linux 3.13.0-34-generic x86_64
2014-08-21T11:47:19.371719661 0.137822s info: Using cache size of 512 MB
2014-08-21T11:47:19.371927697 0.138030s warn: Requested cache size is larger than available memory.
2014-08-21T11:47:19.371969235 0.138072s info: Loading data from directory /mnt/v1/rethinkdb/data
2014-08-21T11:47:19.430678029 0.196781s info: Listening for intracluster connections on port 29015
2014-08-21T11:47:19.432559790 0.198662s info: Listening for client driver connections on port 28015
2014-08-21T11:47:19.432733746 0.198836s info: Listening for administrative HTTP connections on port 8080
2014-08-21T11:47:19.433650128 0.199753s info: Listening on addresses: 127.0.0.1, 127.0.1.1, 10.160.221.51, 10.160.222.51, ::1, fe80::218:71ff:fee3:6e6d%2, fe80::218:71ff:fee3:6e6e%3
2014-08-21T11:47:19.433665661 0.199768s info: Server ready
2014-08-21T11:47:46.608890602 27.374993s info: Connected to server "moev0" a130e6e5-f65f-413e-9162-45dbef72b80d
2014-08-21T12:24:00.783312435 2201.549415s error: Error in src/rdb_protocol/lazy_json.cc at line 22:
2014-08-21T12:24:00.783498550 2201.549601s error: Guarantee failed: [res == archive_result_t::SUCCESS] Deserialization of rdb value failed with error archive_result_t::RANGE_ERROR.
2014-08-21T12:24:00.783590087 2201.549693s error: Backtrace:
2014-08-21T12:24:02.018490331 2202.784594s error: Thu Aug 21 12:24:00 2014\n\n1: backtrace_t::backtrace_t() at 0xad8350 (/usr/bin/rethinkdb)\n2: format_backtrace(bool) at 0xad86e3 (/usr/bin/rethinkdb)\n3: report_fatal_error(char const*, int, char const*, ...) at 0xca3055 (/usr/bin/rethinkdb)\n4: get_data(rdb_value_t const*, buf_parent_t) at 0x98a142 (/usr/bin/rethinkdb)\n5: lazy_json_t::get() const at 0x98a2b4 (/usr/bin/rethinkdb)\n6: rget_cb_t::handle_pair(scoped_key_value_t&&, concurrent_traversal_fifo_enforcer_signal_t) at 0x90f665 (/usr/bin/rethinkdb)\n7: concurrent_traversal_adapter_t::handle_pair_coro(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t) at 0xa74e10 (/usr/bin/rethinkdb)\n8: callable_action_instance_t<std::_Bind<std::_Mem_fn<void (concurrent_traversal_adapter_t::*)(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> (concurrent_traversal_adapter_t*, scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> >::run_action() at 0xa74c54 (/usr/bin/rethinkdb)\n9: coro_t::run() at 0xa20578 (/usr/bin/rethinkdb)
2014-08-21T12:24:02.018751993 2202.784855s error: Exiting.

This error is the original one; I will rebuild the installation with the latest version and update the ticket.

@neumino
Member

neumino commented Sep 23, 2014

Thanks @gato for reporting this, and sorry you ran into this.

What version of RethinkDB were you running at that time?

@danielmewes fixed something that could have led to such a crash; see #2410.
The fix was shipped in RethinkDB 1.13.4 and 1.14.

If you were running a more recent version of RethinkDB, we should re-open the bug and try to track it down again.

@timmaxw
Member

timmaxw commented Sep 23, 2014

The log file says he is running 1.13.4. So this looks like a new bug.

@gato, would you be willing to send us a copy of the corrupted database files so we can debug the problem? @mglukhovsky can arrange a secure upload site.

@mglukhovsky
Member

@gato, please send me an email (mike@rethinkdb.com) if you'd be willing to send over the database files so we can track this down (happy to sign an NDA if necessary). Thanks for reporting this!

@gato
Author

gato commented Sep 24, 2014

Hi, I've just installed the latest version from the Ubuntu repository (rethinkdb 1.14.1-0ubuntu1~trusty (GCC 4.8.2)). I will reimport the data now and run queries after that. If the error persists I will post an updated log and talk with @mglukhovsky. Thanks for the support. Marcelo

@gato
Author

gato commented Sep 26, 2014

Hi,
I've reimported the original dataset (around 5M docs) and ran some tests, and they worked OK (I had 1 or 2 server disconnections, but that was all). So I started loading the full dataset (21M docs), and one of the servers failed while running the import process (at approx. 19M docs).

These are the logs:
Server 1:

2014-09-25T10:57:22.716848337 5.090075s info: Running rethinkdb 1.14.1-0ubuntu1~trusty (GCC 4.8.2)...
2014-09-25T10:57:22.740781195 5.114009s info: Running on Linux 3.13.0-34-generic x86_64
2014-09-25T10:57:22.740901040 5.114127s info: Using cache size of 512 MB
2014-09-25T10:57:22.741341817 5.114568s info: Loading data from directory /mnt/v0/rethinkdb/data
2014-09-25T10:57:22.777616743 5.150843s info: Listening for intracluster connections on port 29015
2014-09-25T10:57:22.777861916 5.151088s info: Attempting connection to 1 peer...
2014-09-25T10:57:22.786064698 5.159291s info: Listening for client driver connections on port 28015
2014-09-25T10:57:22.786244969 5.159471s info: Listening for administrative HTTP connections on port 8080
2014-09-25T10:57:22.786249516 5.159476s info: Connected to server "larryv0" a4d687b6-4bdb-4207-a48e-aeb9eaed4f2c
2014-09-25T10:57:22.790147613 5.163374s info: Listening on addresses: 127.0.0.1, 127.0.1.1, 10.160.220.50, 10.160.221.50, ::1, fe80::217:a4ff:fea7:ac3%2, fe80::217:a4ff:fea7:ac4%3
2014-09-25T10:57:22.790155112 5.163381s info: Server ready
2014-09-25T21:04:01.483095821 36403.805568s info: Disconnected from server "larryv0" a4d687b6-4bdb-4207-a48e-aeb9eaed4f2c
2014-09-25T21:04:04.471407758 36406.793879s info: Connected to server "larryv0" a4d687b6-4bdb-4207-a48e-aeb9eaed4f2c
2014-09-26T14:23:59.183050421 98801.505521s error: Heartbeat timeout, killing connection to peer ::ffff:10.160.222.51
2014-09-26T14:23:59.916072239 98802.238543s info: Disconnected from server "larryv0" a4d687b6-4bdb-4207-a48e-aeb9eaed4f2c

Server 2 (the one with the error):

2014-09-25T10:56:48.170626257 5.346056s info: Running rethinkdb 1.14.1-0ubuntu1~trusty (GCC 4.8.2)...
2014-09-25T10:56:48.185744257 5.361175s info: Running on Linux 3.13.0-34-generic x86_64
2014-09-25T10:56:48.185855786 5.361286s info: Using cache size of 512 MB
2014-09-25T10:56:48.186141421 5.361571s info: Loading data from directory /mnt/v0/rethinkdb/data
2014-09-25T10:56:48.238824341 5.414254s info: Listening for intracluster connections on port 29015
2014-09-25T10:56:48.239071038 5.414501s info: Attempting connection to 1 peer...
2014-09-25T10:57:22.833675987 39.759934s info: Connected to server "moev0" d9e26816-d546-41ee-9194-f83b0fc8ff7b
2014-09-25T10:57:22.835094344 39.761353s info: Listening for client driver connections on port 28015
2014-09-25T10:57:22.835380242 39.761639s info: Listening for administrative HTTP connections on port 8080
2014-09-25T10:57:22.837132536 39.763391s info: Listening on addresses: 127.0.0.1, 127.0.1.1, 10.160.221.51, 10.160.222.51, ::1, fe80::218:71ff:fee3:6e6d%2, fe80::218:71ff:fee3:6e6e%3
2014-09-25T10:57:22.837240277 39.763499s info: Server ready
2014-09-25T21:03:55.677450966 36432.603709s error: Heartbeat timeout, killing connection to peer ::ffff:10.160.221.50
2014-09-25T21:03:55.678208625 36432.604467s info: Disconnected from server "moev0" d9e26816-d546-41ee-9194-f83b0fc8ff7b
2014-09-25T21:04:04.561072618 36441.487331s info: Connected to server "moev0" d9e26816-d546-41ee-9194-f83b0fc8ff7b
2014-09-26T14:23:45.078447313 98822.004706s error: Error in src/arch/runtime/thread_pool.cc at line 343:
2014-09-26T14:23:45.143476886 98822.069735s error: Segmentation fault from reading the address 0x7c9b8ac5.
2014-09-26T14:23:45.143524434 98822.069783s error: Backtrace:
2014-09-26T14:24:14.645453066 98851.571713s error: Heartbeat timeout, killing connection to peer 10.160.220.50
2014-09-26T14:24:18.064532476 98854.990791s error: Heartbeat timeout, killing connection to peer 10.160.220.50
2014-09-26T14:24:20.064550995 98856.990809s error: Heartbeat timeout, killing connection to peer 10.160.220.50
2014-09-26T14:24:22.064589806 98858.990848s error: Heartbeat timeout, killing connection to peer 10.160.220.50
2014-09-26T14:24:27.722749596 98864.649008s error: Heartbeat timeout, killing connection to peer 10.160.220.50
2014-09-26T14:24:40.723142184 98877.649401s error: Heartbeat timeout, killing connection to peer 10.160.220.50

Both servers are identical: dual core, 2 GB RAM and a 120 GB SATA 3 SSD. No other software is running on them (both are physical machines), and RethinkDB is configured to use 512 MB as cache.
At the time of the crash, server 1 had 400 MB of swap allocated and server 2 had 1.1 GB, so maybe the cache value is configured too high.
The database is configured with 2 replicas, 2 shards, 2 acks.

Anyway, version 1.15 was released and I saw a couple of thread_pool errors fixed in the commits that are part of it, so I will install this version and start over with the tests.

Are disconnections and swapping normal, or could this be part of the problem? Is there any server parameter to make the logs more verbose, or do I have to compile my own version to get better logging?

Thanks.

@coffeemug
Contributor

Hi @gato -- sorry you're running into this. We'll look into the issues and fix them ASAP. The best way to proceed is to work with @mglukhovsky to get us the data and have us replicate the issue internally. Unfortunately, with heartbeat timeouts there isn't much additional logging we can do to get to the root of the problem, so we have to work to replicate it internally.

@gato
Author

gato commented Sep 26, 2014

OK, I will try version 1.15 and let you know. I asked about log verbosity in general: if I can set it to debug or something like that, I should do it before the test so we can get more info if something goes wrong.

@mglukhovsky
Member

@gato thanks, shoot me an email (mentioned above) if this hasn't been resolved in 1.15 so we can track this down. Thanks for your help on this!

@danielmewes added this to the 1.15.x milestone Oct 8, 2014
@gato
Author

gato commented Oct 9, 2014

Hi, I've been doing lots of tests; the servers are not asserting any more, but I'm still having issues. These are the tests I made:

1 - Tried server version 1.15.0 and had the heartbeat issue on the servers; the "rethinkdb import" command hung, but the servers reconnected to each other and kept working. Server 1 had 1 GB of swap space used, but server 2 was OK (the import process runs on server 1 and connects to it).

2 - Tried server version 1.15.1 and had issues with rethinkdb import aborting with an odd "buggy client" message. I assumed it was an old Python client library, so I ran "pip install --upgrade" and updated the client to 1.15.0.

3 - Tried with the new version and had the same buggy client issue. I have multiple JSON files with data (4000+ files with 5000 docs each), so I binary-searched the file that had thrown the error to see if I could find the buggy row. After cutting JSON for a couple of hours, I found that some of the JSON files contained documents of a different type (we want to import them later into another table). The odd thing here is that these different documents failed randomly, not always on the same one, but every time on one of them. So I removed them and tried again.

4 - Now with the latest server version, the latest driver version and only one kind of document, I got the same heartbeat problem as in the first test. The servers kept running, "rethinkdb import" hung forever, and server 1 had 1 GB of swap used (no errors in the rethinkdb log except for the heartbeat issue).

Note that killing the import process does not free the memory, so I assume "rethinkdb import" is not the one sucking up memory (but it could be, because it is invoked once per file, so if it leaks even small amounts of memory, after calling it some 2000 times a lot of memory will have been leaked).

I guess the heartbeat issue shows up when server 1 is swapping too much, and the memory leakage only shows on server 1 (the one running the import process, also the one that the client connects to; server 2 is also linked to server 1).
I will do one more test: make server 1 link to server 2 and run the import process on server 1 but connecting to server 2, to see if the leakage appears and on which server. If this doesn't work, the only thing I can do is compile the latest rethinkdb with some extra debug info to see what goes wrong.
Ideas are welcome.
Regards.
Marcelo

@mglukhovsky
Member

@Tryneus, do you have any ideas on why rethinkdb import would be hanging? Would it be helpful to get a copy of the JSON data @gato is trying to import?

@danielmewes
Member

Thank you for the great report @gato and for running these tests.
I can see two issues here, both of which seem to be independent of the original data corruption crash:

  • RethinkDB going into swap, which in turn causes heartbeat timeouts and makes rethinkdb import time out. In case it's not already at a low value, I recommend reducing the cache size to see if that improves things. See http://rethinkdb.com/docs/memory-usage/ for details, and the example after this list.
    May I ask how many tables you have in your database? There currently is a relatively large overhead (of several MB) per existing table. Not sure if that's an issue in this case though.
    Finally, we also have at least one other report (Apparent memory leak #2988) of excessive memory usage after running RethinkDB for a while. It's theoretically possible that your import process triggers the same behavior much quicker, though I don't want to draw any premature conclusions.
  • The second issue is that we give a bad error message when trying to insert strings that contain a null character. This is likely the cause of the "buggy client" error you've been seeing. See issue Improve driver handling of strings containing null bytes #3164 for that.
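Regarding lowering the cache size mentioned in the first point: assuming the stock Ubuntu package layout, a 256 MB cache would look roughly like this (the instance config file name below is only an illustration; the data directory is the one from your logs):

# in /etc/rethinkdb/instances.d/<instance>.conf
cache-size=256

# or, when starting the server by hand:
rethinkdb --cache-size 256 --directory /mnt/v0/rethinkdb/data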

@danielmewes
Member

Also, @gato (and @mglukhovsky) I think the source JSON data would be very helpful for us to analyze the observed memory consumption locally. We are happy to sign an NDA if it contains sensitive data.

@gato
Author

gato commented Oct 9, 2014

Hi @danielmewes
I know the "buggy client" is a different issue and it doesn't bother me right now; for now I will be happy loading the first set of data.
Answering your questions: I have only 1 database with 1 table (the one I'm trying to load); it's configured with replica count = 2, acks = 2 and shards = 2. I create the database and import a small subset of data, about 40k docs, to allow balanced sharding (our data already has UUIDs as keys, so I use them). After the replicas and sharding are configured, I delete those documents and start the full import process.
The servers are two old HP DL320s: 2 processors, 2 GB RAM and a 120 GB SSD, running Ubuntu 14.04 with nothing else. The cache is configured at 512 MB and no other configuration is modified.
As I previously commented, when I started the tests rethinkdb was at 1.13 (I guess), and with a smaller sample of data I got the corruption errors, not while importing but when querying the database. With version 1.14.1 I tried the same sample of data and it worked OK, so I tried to upload the whole dataset and started to have the disconnection issues and the segfault.
Then with 1.15.0 and 1.15.1 I still have the disconnections; I don't see the segfault, but the importing client hangs forever.
Anyway, I'm re-importing the data (as we speak), running rethinkdb import on one server but pointing to the other one (just to see whether the client is leaking memory, or whether the server part that talks to clients is).
I can't send you the data (at least right now) because it is not ours, it's from several different parties (8 at least), and it is very sensitive (traceability information for oncology and AIDS drugs), but I can compile the head version and increase the logging verbosity, even adding logs through the code, to help you identify the issue.
I will let you know the result of the test I'm running now, and we will see how to continue then.
Regards.
Marcelo

@danielmewes
Member

Thank you @gato for the clarification.

So at this point I think this is essentially an issue of memory consumption, with RethinkDB going into swap as a consequence.
For now, is there an option for you to get more memory for the machines, @gato?
Another option could be to reduce the number of concurrent clients that rethinkdb import uses. That should also reduce memory consumption on the server. The command line argument for that is rethinkdb import --clients NUM_CLIENTS. I suggest trying 1-4 clients.
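For example, for one of your JSON chunks an invocation could look roughly like this, using the db/table names from your test script (the file name is a placeholder; adjust host and port to your setup):

rethinkdb import -c localhost:28015 -f part-0001.json --table cointreau.row --format json --clients 2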

We will test this more and investigate how to reduce our memory footprint once #2988 is fixed (since these could actually be the same bugs, or at least be related).

@gato
Author

gato commented Oct 23, 2014

Hi!
I've done the test described before, and the import process ran for a longer period of time but failed the same way (although much more data was imported, ~20M docs out of 21M+).
I've restarted both processes and memory went back to normal.
So I've used this almost complete set to run a couple of query tests (all of the form filter({simple condition}).count().run()) in a loop. As none of them use indexes, each one takes around 34 minutes to complete (which is fine for me). After around 12 or 13 hours of running, almost all the swap was consumed.
I've built the database from the repository with debug code and started it with Valgrind, but the server is so slow running under Valgrind that "rethinkdb import" aborts with "Connection error during 'table check': timed out". Valgrind also complains a lot about "Conditional jump or move depends on uninitialised value at..." and "Use of uninitialised value of size 8 at ...".
Do you have a way to profile or log memory usage that I can use to help solve this issue? Do you use gperf instead of Valgrind? Are there any ./configure or make parameters that I need to set to run this test?
Thanks.
Marcelo

@danielmewes
Member

Hi @gato
thanks for helping us debug this.

For running RethinkDB in Valgrind, you can follow these steps:

  • configure without TCMalloc: ./configure --without-tcmalloc
  • build with special Valgrind hints enabled (otherwise Valgrind gets confused by our custom coroutine code): make DEBUG=1 VALGRIND=1
  • Use the supplied suppressions file when running Valgrind: cd build/debug_notcmalloc_valgrind; valgrind --suppressions=../../scripts/rethinkdb-valgrind-suppressions.supp rethinkdb

@danielmewes
Member

Regarding the timeout in the import script:
You can try modifying the script. It is usually installed to /usr/local/lib/python*/site-packages/rethinkdb, I think. Open the file _import.py in there and look for all occurrences of r.connect(host, port, auth_key=auth_key).
Add the optional timeout argument to each of them: r.connect(host, port, auth_key=auth_key, timeout=180).
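A rough sketch of what each edited call ends up doing (host, port and auth_key are placeholders here; in _import.py they come from the script's own arguments):

import rethinkdb as r

host, port, auth_key = "localhost", 28015, ""
# The stock call is r.connect(host, port, auth_key=auth_key); the optional
# timeout (in seconds) gives a slow or swapping server more time to answer
# before the 'table check' connection attempt gives up.
conn = r.connect(host, port, auth_key=auth_key, timeout=180)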

@danielmewes
Member

I cannot quite reproduce the observed increase in memory usage by running a filter().count() repeatedly (I kept it running for a few days).

@gato did you ever send us your data? That would help us to reproduce the issue. Also an example for the filter conditions that you've been using for testing would be useful.
@mglukhovsky can you get things set up for @gato to upload his data file?

@gato
Author

gato commented Oct 28, 2014

Hi,
no, I haven't sent you the data, for the reasons described above (previous message).
I have this script running in a loop:

#!/usr/bin/python
import rethinkdb as r
import datetime

def logme(str):
        print datetime.datetime.now(), ":", str

logme("connecting with server")
conn = r.connect( "larry", 28015).repl()

logme("query all")
count = r.db("cointreau").table("row").filter({}).count().run()
logme(["rows:", count])

logme("query by serie")
count = r.db("cointreau").table("row").filter({"process":{"dataSent":{"serie":"00040129"}}}).count().run()
logme(["rows:", count])

logme("query by lote")
count = r.db("cointreau").table("row").filter({"process":{"dataSent":{"lote":"PA141A"}}}).count().run()
logme(["rows:", count])

logme("done")

Running this script continuously will consume the swap in around 18 h (at least on our servers).
I'm sending @mglukhovsky a sample row with identifiers and some text replaced with random data, so you can have a better idea of the amount and type of data that we use.
Sorry I couldn't run with Valgrind enabled yet, but I will as soon as I have some time.

@danielmewes
Member

@gato Sorry, I had missed the part about the data above.
Thank you for your test script and the randomized sample document. We will try to use those to reproduce it on our end.

@danielmewes
Member

I have been trying to reproduce this with data derived from the sample document @gato kindly provided for us.

Using a cache size of 512 MB, these are the memory usage numbers I'm getting:

  • after inserting 20M documents: 2,107 MB (RES)
  • restarting RethinkDB afterwards (I wanted to see if this would change the memory used): 954 MB
  • after running two types of count queries (one filtered on serie with about 200 matches and one over all documents): 1,619 MB
  • after repeatedly running these queries for about 20 hours: 1,581 MB

It seems I cannot quite reproduce the growing memory consumption from running the count queries. In this test memory usage actually slightly decreased over time, but that's probably just a measurement artifact (I might have measured at different times during query execution).

One thing that definitely requires more investigation is why the memory consumption was about 500 MB higher after inserting the documents, compared to when I later ran the queries after a restart. I will look into that next.
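(In case it helps comparing against your servers: the numbers above are resident set sizes. On a standard Linux box something along these lines should report the same figure; the exact column layout is just whatever ps gives you.)

ps -o rss,vsz,cmd -C rethinkdb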

@gato
Author

gato commented Oct 31, 2014

Hi @danielmewes,
about the memory consumption when running queries: maybe it is something particular to my current installation.
Remember that I was importing the whole dataset when the big swap usage and heartbeat issues showed up and the import process was force-terminated. Then I restarted the servers and ran the tests with the imported data, which may have some degree of corruption. No data issues are shown in the logs, but I don't know for sure.
Perhaps fixing the memory issue in the import process will allow me to import the whole dataset without issues, and the query issues will be fixed too.

@danielmewes
Member

@gato we are going to switch our default memory allocator from TCMalloc to Jemalloc in RethinkDB 1.16.0. This has resolved the possibly related memory issue in #2988 (comment) .

I can send you a RethinkDB 1.15.2 binary for Ubuntu trusty with that change backported. Please drop me a quick email to daniel at rethinkdb.com if you want to give it a shot.

@gato
Author

gato commented Nov 26, 2014

Hi @danielmewes, thanks for your offer.
When will 1.16.0 be released? If it's not too far away, I'd prefer to wait for that version.
This is because setting up everything and running the tests takes me a couple of days.

@danielmewes
Member

@gato We are planning to release 1.16.0 in early January. Please let me know which way works better for you.

@gato
Author

gato commented Nov 27, 2014

I will wait then. Thanks.

@danielmewes modified the milestones: 1.15.x, 1.16.x Jan 29, 2015
@danielmewes modified the milestones: 1.16.x, 1.15.x Jan 29, 2015
@danielmewes
Member

@gato We've just released RethinkDB 1.16.
When you get a chance to try it out, please let us know if it improves memory consumption for you.

@danielmewes
Member

@gato Any updates? Are you still seeing this with 1.16?

@gato
Author

gato commented Feb 24, 2015

Sorry, I haven't tried it yet; I'm currently working on something else and don't have quick access to the servers. I will try to make some time soon and test it.
Regards.
Marcelo

@danielmewes
Member

No worries. Thanks for keeping us posted, @gato.

@gato
Author

gato commented Mar 31, 2015

Hi @danielmewes,
I've finally got the time to run the tests again, but I'm afraid I have to tell you that the server started to swap early and the import process hung 15 h later, having completed ~76% of the original dataset (16M of 21M docs), with one server using 1.5 GB of swap (the one that also has the import client running) and the other using 330 MB.
The version tested was 1.16.3~0trusty (GCC 4.8.2).
No other processes were running on those servers.
Regards.
Marcelo

@danielmewes
Member

@gato Thanks a lot. I really appreciate you took the time to re-try this.

At this point we might have to conclude that 2 GB of RAM (is that still the hardware you're using?) isn't enough for importing this data set with the default configuration at the moment (we should look into that more though).

To get past the import stage, you can run rethinkdb import with the --clients 1 parameter. That reduces the number of concurrently inserting connections from the default of 8 to a single one, which I expect will reduce the peak memory consumption during import.

@gato
Author

gato commented Mar 31, 2015

Yes, all the hardware is the same, under the same conditions and with the same data.
I will run the import process again with --clients 1 to see what happens and let you know.

@danielmewes modified the milestones: 1.16.x, 2.0.x Apr 14, 2015
@danielmewes
Member

@gato did you get a chance to try the --clients 1 import? Did it work?

@gato
Author

gato commented May 6, 2015

Hi @danielmewes,
yes, I've tried. After hitting some out-of-disk errors (my fault), I was able to load the data using --clients 1, although it was painfully slow (about 3 and a half days).
At the end of the import process some swapping had occurred: around 350-400 MB of swap on one node and a little less on the other.
After that, I tried a simple count query (which I knew had to traverse the whole dataset) and the server started to swap like crazy, the heartbeat between the nodes failed, and the query hung.
I haven't done anything after that.
Sorry.
Marcelo

@danielmewes self-assigned this May 6, 2015
@danielmewes
Member

@gato Thanks for the info. Sorry this isn't working. I know you must have spent a lot of time on this.

So it seems like the 2 GB servers aren't enough for this.

I'll try again to reproduce this and will run some memory profiling, but I'm not sure how quickly we can bring the memory usage down enough to make this work.

For now, unfortunately, the only option seems to be bigger servers.

@danielmewes modified the milestones: subsequent, 2.0.x May 6, 2015
@gato
Author

gato commented May 7, 2015

Hi @danielmewes,
I'm almost sure now that 2 GB isn't enough, but those are old servers and there's no point in buying memory for them anyway.
So don't waste much time chasing this issue, as this is an edge case (lots of data on old and small hardware). Feel free to close the issue or move it to the end of your backlog.
Thanks for your support.
Marcelo

@danielmewes
Member

Closing this. We might still want to look into reducing memory usage, but I think there's nothing specific left to do for this issue.

@danielmewes modified the milestones: outdated, subsequent Apr 29, 2016