
crash on ubuntu 12.04 #2399

Closed
colprog opened this issue May 15, 2014 · 26 comments

@colprog

colprog commented May 15, 2014

We have been running RethinkDB 1.12.4 for two weeks without a problem, but today it just crashed.

2014-05-01T01:25:32.395479491 0.024969s info: Running rethinkdb 1.12.4-0ubuntu1~precise (GCC 4.6.3)...
2014-05-01T01:25:32.397210840 0.026700s info: Running on Linux 3.2.0-29-generic x86_64
2014-05-01T01:25:32.397258453 0.026748s info: Using cache size of 593 MB
2014-05-01T01:25:32.397418814 0.026909s info: Loading data from directory /data/rethinkdb/db0
2014-05-01T01:25:32.564420794 0.193910s info: Listening for intracluster connections on port 29015
2014-05-01T01:25:32.588540351 0.218030s info: Listening for client driver connections on port 28015
2014-05-01T01:25:32.588758499 0.218248s info: Listening for administrative HTTP connections on port 8080
2014-05-01T01:25:32.588759897 0.218249s info: Listening on addresses: 127.0.0.1, 10.161.142.205
2014-05-01T01:25:32.588760444 0.218250s info: To fully expose RethinkDB on the network, bind to all addresses
2014-05-01T01:25:32.588761001 0.218250s info: by adding `bind=all' to the config file (/etc/rethinkdb/instances.d/db0.conf).
2014-05-01T01:25:32.588761535 0.218251s info: Server ready
2014-05-15T11:12:29.635045878 1244817.264535s error: Error in src/rdb_protocol/lazy_json.cc at line 19:
2014-05-15T11:12:29.635112958 1244817.264602s error: Guarantee failed: [res == archive_result_t::SUCCESS] Deserialization of rdb value failed with error archive_result_t::RANGE_ERROR.
2014-05-15T11:12:29.635135037 1244817.264624s error: Backtrace:
2014-05-15T11:12:29.819389752 1244817.448880s error: Thu May 15 11:12:29 2014\n\n1: format_backtrace(bool) at 0xd1cb43 (/usr/bin/rethinkdb)\n2: report_fatal_error(char const*, int, char const*, ...) at 0xdd4794 (/usr/bin/rethinkdb)\n3: get_data(rdb_value_t const*, buf_parent_t) at 0xd101a8 (/usr/bin/rethinkdb)\n4: lazy_json_t::get() const at 0xd10304 (/usr/bin/rethinkdb)\n5: rget_cb_t::handle_pair(scoped_key_value_t&&, concurrent_traversal_fifo_enforcer_signal_t) at 0xc95f89 (/usr/bin/rethinkdb)\n6: concurrent_traversal_adapter_t::handle_pair_coro(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t) at 0xb38ef7 (/usr/bin/rethinkdb)\n7: callable_action_instance_t<std::_Bind<std::_Mem_fn<void (concurrent_traversal_adapter_t::*)(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> (concurrent_traversal_adapter_t*, scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> >::run_action() at 0xb38dd8 (/usr/bin/rethinkdb)\n8: coro_t::run() at 0xdc751e (/usr/bin/rethinkdb)
2014-05-15T11:12:29.819499407 1244817.448989s error: Exiting.
2014-05-15T11:18:02.048885620 0.028158s info: Running rethinkdb 1.12.4-0ubuntu1~precise (GCC 4.6.3)...
2014-05-15T11:18:02.050985612 0.030258s info: Running on Linux 3.2.0-29-generic x86_64
2014-05-15T11:18:02.051027521 0.030300s info: Using cache size of 602 MB
2014-05-15T11:18:02.051150303 0.030423s info: Loading data from directory /data/rethinkdb/db0
2014-05-15T11:18:02.237516228 0.216788s info: Listening for intracluster connections on port 29015
2014-05-15T11:18:02.269815418 0.249088s info: Listening for client driver connections on port 28015
2014-05-15T11:18:02.270194158 0.249466s info: Listening for administrative HTTP connections on port 8080
2014-05-15T11:18:02.270195561 0.249468s info: Listening on addresses: 127.0.0.1, 10.161.142.205
2014-05-15T11:18:02.270196172 0.249468s info: To fully expose RethinkDB on the network, bind to all addresses
2014-05-15T11:18:02.270196707 0.249469s info: by adding `bind=all' to the config file (/etc/rethinkdb/instances.d/db0.conf).
2014-05-15T11:18:02.270200599 0.249473s info: Server ready
2014-05-15T11:18:33.728954227 31.708227s error: Error in src/rdb_protocol/lazy_json.cc at line 19:
2014-05-15T11:18:33.729024758 31.708297s error: Guarantee failed: [res == archive_result_t::SUCCESS] Deserialization of rdb value failed with error archive_result_t::RANGE_ERROR.
2014-05-15T11:18:33.729044585 31.708317s error: Backtrace:
2014-05-15T11:18:33.744542726 31.723815s error: Error in src/rdb_protocol/lazy_json.cc at line 19:
2014-05-15T11:18:33.744636516 31.723909s error: Guarantee failed: [res == archive_result_t::SUCCESS] Deserialization of rdb value failed with error archive_result_t::RANGE_ERROR.
2014-05-15T11:18:33.744659853 31.723932s error: Backtrace:
2014-05-15T11:18:33.892180415 31.871454s error: Thu May 15 11:18:33 2014\n\n1: format_backtrace(bool) at 0xd1cb43 (/usr/bin/rethinkdb)\n2: report_fatal_error(char const*, int, char const*, ...) at 0xdd4794 (/usr/bin/rethinkdb)\n3: get_data(rdb_value_t const*, buf_parent_t) at 0xd101a8 (/usr/bin/rethinkdb)\n4: lazy_json_t::get() const at 0xd10304 (/usr/bin/rethinkdb)\n5: rget_cb_t::handle_pair(scoped_key_value_t&&, concurrent_traversal_fifo_enforcer_signal_t) at 0xc95f89 (/usr/bin/rethinkdb)\n6: concurrent_traversal_adapter_t::handle_pair_coro(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t) at 0xb38ef7 (/usr/bin/rethinkdb)\n7: callable_action_instance_t<std::_Bind<std::_Mem_fn<void (concurrent_traversal_adapter_t::*)(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> (concurrent_traversal_adapter_t*, scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> >::run_action() at 0xb38dd8 (/usr/bin/rethinkdb)\n8: coro_t::run() at 0xdc751e (/usr/bin/rethinkdb)
2014-05-15T11:18:33.892286047 31.871558s error: Exiting.

@chrisvariety

Here's a cleaned-up backtrace, hope it's useful:

2014-05-01T01:25:32.395479491 0.024969s info: Running rethinkdb 1.12.4-0ubuntu1~precise (GCC 4.6.3)...
2014-05-01T01:25:32.397210840 0.026700s info: Running on Linux 3.2.0-29-generic x86_64
2014-05-01T01:25:32.397258453 0.026748s info: Using cache size of 593 MB
2014-05-01T01:25:32.397418814 0.026909s info: Loading data from directory /data/rethinkdb/db0
2014-05-01T01:25:32.564420794 0.193910s info: Listening for intracluster connections on port 29015
2014-05-01T01:25:32.588540351 0.218030s info: Listening for client driver connections on port 28015
2014-05-01T01:25:32.588758499 0.218248s info: Listening for administrative HTTP connections on port 8080
2014-05-01T01:25:32.588759897 0.218249s info: Listening on addresses: 127.0.0.1, 10.161.142.205
2014-05-01T01:25:32.588760444 0.218250s info: To fully expose RethinkDB on the network, bind to all addresses
2014-05-01T01:25:32.588761001 0.218250s info: by adding `bind=all' to the config file (/etc/rethinkdb/instances.d/db0.conf).
2014-05-01T01:25:32.588761535 0.218251s info: Server ready
2014-05-15T11:12:29.635045878 1244817.264535s error: Error in src/rdb_protocol/lazy_json.cc at line 19:
2014-05-15T11:12:29.635112958 1244817.264602s error: Guarantee failed: [res == archive_result_t::SUCCESS] Deserialization of rdb value failed with error archive_result_t::RANGE_ERROR.
2014-05-15T11:12:29.635135037 1244817.264624s error: Backtrace:
2014-05-15T11:12:29.819389752 1244817.448880s error: Thu May 15 11:12:29 2014

1: format_backtrace(bool) at 0xd1cb43 (/usr/bin/rethinkdb)
2: report_fatal_error(char const*, int, char const*, ...) at 0xdd4794 (/usr/bin/rethinkdb)
3: get_data(rdb_value_t const*, buf_parent_t) at 0xd101a8 (/usr/bin/rethinkdb)
4: lazy_json_t::get() const at 0xd10304 (/usr/bin/rethinkdb)
5: rget_cb_t::handle_pair(scoped_key_value_t&&, concurrent_traversal_fifo_enforcer_signal_t) at 0xc95f89 (/usr/bin/rethinkdb)
6: concurrent_traversal_adapter_t::handle_pair_coro(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t) at 0xb38ef7 (/usr/bin/rethinkdb)
7: callable_action_instance_t<std::_Bind<std::_Mem_fn<void (concurrent_traversal_adapter_t::*)(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> (concurrent_traversal_adapter_t*, scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> >::run_action() at 0xb38dd8 (/usr/bin/rethinkdb)
8: coro_t::run() at 0xdc751e (/usr/bin/rethinkdb)
2014-05-15T11:12:29.819499407 1244817.448989s error: Exiting.

@danielmewes
Member

Thank you for the report @colprog. Sorry you are experiencing this problem.
Also thank you @chrisvariety for cleaning up the backtrace!

The error message is typical of problems related to snapshotting in our cache, and we have seen and fixed a couple of issues before that resulted in similar crashes.
Apparently some problem still remains.
(This is not to say it is definitely related to snapshotting, but based on past experience it seems likely.)

@colprog Can you tell us a bit more about your cluster configuration, data and workload?
Specifically:

  • How many nodes are in the cluster?
  • How many tables do you have?
  • What are the approximate sizes of the documents stored in those tables? How many documents are there?
  • Do you have any secondary indexes?
  • Are the tables replicated and/or sharded?
  • What types of queries do you run against the database? Are there writes going on? Do you use any complex range queries, such as r.table().map(), r.table().filter() or r.table().group()?

I know it takes some time to answer all of these questions, and I really appreciate your help. I'm asking them because I would like to replicate your setup and workload as closely as possible in order to reproduce the bug.

@danielmewes danielmewes added this to the 1.12.x milestone May 15, 2014
@colprog
Author

colprog commented May 15, 2014

We have 1 node with 22 tables on it, no sharding or replication.
We are in development, so the traffic is minimal. If you like, I could send you the db files.

Running the following query crashes the server:
r.db("sheath_dev").table("role").getAll("8f240ce6-8c3e-4cf2-a997-4736816fb8c0", {index: "owner"})

@danielmewes
Member

@colprog Thank you for the info. The db file would be helpful. If you (g)zip the data directory, it should compress fairly well. You can try sending it to me in an email to daniel@rethinkdb.com. If that doesn't work we can provide you with an upload site.

@colprog
Author

colprog commented May 15, 2014

You can try this link:
http://oss.aliyuncs.com/dgmedias_dev/rethinkdb.tar.bz2

@danielmewes
Member

Thank you @colprog for the data. I downloaded it and can start a server with it.
I haven't tried to reproduce the issue yet, but will get to it very soon.
You can delete the file from your webspace.

(Internal note: the data is on newton in /home/daniel/2399-data.tar.bz2)

@colprog
Author

colprog commented May 15, 2014

Great! If you need anything else from me, please let me know.

@danielmewes
Member

I can reproduce the crash with the query you gave me, @colprog. Your information and data have been extremely helpful.
I will get to this issue ASAP.

@danielmewes
Member

The table file is already corrupted, and it is difficult or even impossible to trace how it got to that state from looking at the crash and the file.

My best bet right now is to a) read all the involved code again, and b) try to replicate as much of the table configuration as possible and leave a server running for a long time with additional checks enabled, in the hope that this will trigger the issue.

I don't know if this is actually relevant, but had you run previous versions of RethinkDB (e.g. 1.12.3) on the same data directory, @colprog? Or did you start from scratch after installing 1.12.4?

@colprog
Author

colprog commented May 16, 2014

@danielmewes The server has been upgraded from previous releases since 1.11.0, but I started a new set of tables yesterday, and they are now crashing as well.

@colprog
Author

colprog commented May 16, 2014

@danielmewes And it seems to be getting worse: a simple r.table("role") now crashes the server.

@colprog
Author

colprog commented May 16, 2014

On a newly created and healthy table, the following query sequence seems to trigger the bug and corrupt the table:

r.db('sheath_dev').table("role").insert({"bar":{},"cleared":{},"coins":120,"contribs":0,"createTime":{"$reql_type$":"TIME","epoch_time":1400221967.619,"timezone":"+00:00"},"dailyRefreshData":{"dailyReward":true,"qhourlyReward":1},"energy":5,"energyRefreshTime":{"$reql_type$":"TIME","epoch_time":1400221967.619,"timezone":"+00:00"},"exp":0,"formation":0,"formationLevel":[0,0,0,0,0,0],"golds":20,"id":"66583953-4553-4759-8918-18e3960452dc","lastLogOff":{"$reql_type$":"TIME","epoch_time":1400221968.917,"timezone":"+00:00"},"level":1,"levelCleared":{},"levelGain":{},"name":"新玩家","owner":"8f240ce6-8c3e-4cf2-a997-4736816fb8c0","partition":"f79279b1-3b29-4e08-83d9-bdec419355ce","souls":{},"spent":0,"storageRoom":25,"store":{},"taskClaimed":[],"taskData":{},"taskDone":[],"team":["a470caaf-983c-4516-a470-30c1fd2587d0","49b90dbd-e135-450b-bbc5-f311ec3925db",null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null],"title":"","tutorial":1,"vip":0})
r.db('sheath_dev').table("role").update({update: {"dailyRefreshData": r.literal({})}})

@danielmewes
Member

@colprog Thank you for figuring out the exact query that causes this to happen. That is incredibly helpful!
Looking into it.

@danielmewes
Member

Here is the simplest sequence of commands that I could find to trigger the corruption:

r.tableCreate("t1");
r.table("t1").insert({});
r.table("t1").update({a: {b: r.literal({})}});
r.table("t1")

Replacing the update with r.table("t1").update({a: r.literal({})}); does not crash, nor does r.table("t1").update({a: {b: {}}});.
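
For reference, here is a minimal sketch of driving this exact sequence from the official JavaScript driver in the callback style of the 1.12-era API (the host, port, and database are my assumptions; the queries themselves are the ones above):

var r = require('rethinkdb');

// Assumed connection parameters; adjust to your setup.
r.connect({host: 'localhost', port: 28015, db: 'test'}, function(err, conn) {
  if (err) throw err;
  r.tableCreate('t1').run(conn, function(err) {
    if (err) throw err;
    r.table('t1').insert({}).run(conn, function(err) {
      if (err) throw err;
      // The nested r.literal({}) under a field ("a") that does not yet exist is the trigger.
      r.table('t1').update({a: {b: r.literal({})}}).run(conn, function(err) {
        if (err) throw err;
        // Reading the table back after the corrupting update is what crashes the server.
        r.table('t1').run(conn, function(err, cursor) {
          if (err) throw err;
          cursor.toArray(function(err, rows) {
            console.log(err || rows);
            conn.close();
          });
        });
      });
    });
  });
});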

@danielmewes
Member

Valgrind reports a jump depending on uninitialized data. Could be related:

==20847== Thread 2:
==20847== Conditional jump or move depends on uninitialised value(s)
==20847==    at 0x2B415BD: ql::reader_t::load_items(ql::env_t*, ql::batchspec_t const&) (datum_stream.cc:189)
==20847==    by 0x2B41F16: ql::reader_t::next_batch(ql::env_t*, ql::batchspec_t const&) (datum_stream.cc:228)
==20847==    by 0x2B455C0: ql::lazy_datum_stream_t::next_batch_impl(ql::env_t*, ql::batchspec_t const&) (datum_stream.cc:683)
==20847==    by 0x2B4437A: ql::datum_stream_t::next_batch(ql::env_t*, ql::batchspec_t const&) (datum_stream.cc:554)
==20847==    by 0x29F441F: ql::replace_term_t::eval_impl(ql::scope_env_t*, ql::eval_flags_t) (writes.cc:218)
==20847==    by 0x2A8CCC3: ql::op_term_t::term_eval(ql::scope_env_t*, ql::eval_flags_t) (op.cc:149)
==20847==    by 0x2B532DA: ql::term_t::eval(ql::scope_env_t*, ql::eval_flags_t) (term.cc:363)
==20847==    by 0x2A18FC5: ql::rewrite_term_t::term_eval(ql::scope_env_t*, ql::eval_flags_t) (rewrites.cc:41)
==20847==    by 0x2B532DA: ql::term_t::eval(ql::scope_env_t*, ql::eval_flags_t) (term.cc:363)
==20847==    by 0x2B51D3E: ql::run(ql::protob_t<Query>, rdb_protocol_t::context_t*, signal_t*, Response*, ql::stream_cache2_t*) (term.cc:228)
==20847==    by 0x2A91D4D: query2_server_t::handle(ql::protob_t<Query>, Response*, query2_server_t::context_t*) (pb_server.cc:78)
==20847==    by 0x2A97978: boost::_mfi::mf3<bool, query2_server_t, ql::protob_t<Query>, Response*, query2_server_t::context_t*>::operator()(query2_server_t*, ql::protob_t<Query>, Response*, query2_server_t::context_t*) const (mem_fn_template.hpp:393)

@coffeemug
Contributor

/cc @mlucy

@danielmewes danielmewes modified the milestones: 1.12.5, 1.12.x May 16, 2014
@danielmewes
Member

(reader_t::items_index isn't initialized, apparently)

@danielmewes
Member

Initializing reader_t::items_index to 0 removes the Valgrind warning, but doesn't solve the table corruption.

@danielmewes
Member

This crashes the server as well, without having to read any data back from the table:

r.tableCreate("t1");
r.table("t1").insert({id: "foo"});
r.table("t1").get("foo").update({a: {b: r.literal({})}}, {returnVals: true})

The error and backtrace don't look very helpful:

error: Error in src/rpc/connectivity/cluster.cc at line 988:
error: Guarantee failed: [conn != __null] Attempted to kill connection to myself.
error: Backtrace:
error: Fri May 16 16:36:42 2014

       1: rethinkdb_backtrace(void**, int) at rethinkdb_backtrace.cc:100
       2: backtrace_t::backtrace_t() at backtrace.cc:202
       3: lazy_backtrace_formatter_t::lazy_backtrace_formatter_t() at backtrace.cc:282
       4: format_backtrace(bool) at backtrace.cc:197
       5: report_fatal_error(char const*, int, char const*, ...) at errors.cc:83
       6: connectivity_cluster_t::kill_connection(peer_id_t) at cluster.cc:988
       7: non-virtual thunk to connectivity_cluster_t::kill_connection(peer_id_t) at distccd_a87aa0e1.ii:998
       8: message_multiplexer_t::client_t::kill_connection(peer_id_t) at multiplexer.cc:105
       9: mailbox_manager_t::mailbox_read_coroutine(peer_id_t, threadnum_t, unsigned long, std::vector<char, std::allocator<char> >*, long) at mailbox.cc:201
        [...]
error: Exiting.
Trace/breakpoint trap (core dumped)

It doesn't seem necessary to serialize the updated value into the table to get bad behavior.
(Edit: The kill_connection() that's triggering the crash is actually called as a consequence of attempting to deserialize the write response though)

@danielmewes
Member

@mlucy: This is the datum that's actually generated by the update():

{
    "a": {
        "b": {
            "$reql_type$": "LITERAL",
            "value": {}
        }
    },
    "id": "foo"
}

As you can see, the LITERAL pseudotype becomes part of the document.
When the datum is deserialized, the datum_t::datum_t(std::map) constructor is called, which runs datum_t::maybe_sanitize_ptype(). maybe_sanitize_ptype() throws an exception because the literal keyword is not allowed in that context.

Any idea how the literal keyword can end up in the result of an update()?

@danielmewes
Member

Oh, I see: it never recurses into "a" because the field "a" does not exist on the left side of the merge. Hmm...

Besides the problem with merge(), we should make absolutely sure that either maybe_sanitize_ptype() is run on any datum_t before it is stored in a table, or alternatively that loading a bad pseudotype from disk doesn't crash the server.
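
To illustrate the merge() recursion issue in plain JavaScript (an illustrative model only; the function names and structure here are my assumptions, not RethinkDB's actual C++ implementation): when the field is missing on the left side, a naive merge copies the right-hand subtree verbatim, so the LITERAL wrapper is never resolved and gets written into the stored document.

function isLiteral(v) {
  return v !== null && typeof v === 'object' && v['$reql_type$'] === 'LITERAL';
}

// Strip LITERAL wrappers from a value that has no left-hand counterpart.
function resolveLiterals(v) {
  if (isLiteral(v)) return resolveLiterals(v.value);
  if (v !== null && typeof v === 'object' && !Array.isArray(v)) {
    var out = {};
    Object.keys(v).forEach(function(k) { out[k] = resolveLiterals(v[k]); });
    return out;
  }
  return v;
}

function mergeDatum(left, right) {
  if (isLiteral(right)) return resolveLiterals(right.value);
  if (left === undefined) {
    // This is the problematic path: returning `right` verbatim here keeps any
    // nested LITERAL wrappers in the result; they still have to be resolved.
    return resolveLiterals(right);
  }
  if (left === null || typeof left !== 'object' ||
      right === null || typeof right !== 'object' || Array.isArray(right)) {
    return right;
  }
  var result = {};
  Object.keys(left).forEach(function(k) { result[k] = left[k]; });
  Object.keys(right).forEach(function(k) {
    result[k] = mergeDatum(left.hasOwnProperty(k) ? left[k] : undefined, right[k]);
  });
  return result;
}

// mergeDatum({id: "foo"}, {a: {b: {"$reql_type$": "LITERAL", "value": {}}}})
// should yield {id: "foo", a: {b: {}}}, never a document that still contains
// the "$reql_type$": "LITERAL" wrapper.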

danielmewes pushed a commit that referenced this issue May 17, 2014
@danielmewes
Member

A fix is implemented in branch daniel_2399 and is in code review 1509 by @mlucy.

@AtnNn
Member

AtnNn commented May 20, 2014

How is the review going?

@danielmewes
Member

The fix has been merged into v1.12.x 3185a66
and next 4cbd347 (@AtnNn)

It will ship with RethinkDB 1.12.5.

danielmewes pushed a commit that referenced this issue May 20, 2014
danielmewes pushed a commit that referenced this issue May 20, 2014
@AtnNn
Member

AtnNn commented May 22, 2014

@colprog The fix for this issue has been released today as part of RethinkDB 1.12.5.

@colprog
Author

colprog commented May 22, 2014

Great, thanks! I'll upgrade right away.
