
crash on ubuntu 12.04 #2399

Closed
colprog opened this issue May 15, 2014 · 26 comments

@colprog

colprog commented May 15, 2014

We have been running RethinkDB 1.12.4 for two weeks without a problem, but today it just crashed.

2014-05-01T01:25:32.395479491 0.024969s info: Running rethinkdb 1.12.4-0ubuntu1~precise (GCC 4.6.3)...
2014-05-01T01:25:32.397210840 0.026700s info: Running on Linux 3.2.0-29-generic x86_64
2014-05-01T01:25:32.397258453 0.026748s info: Using cache size of 593 MB
2014-05-01T01:25:32.397418814 0.026909s info: Loading data from directory /data/rethinkdb/db0
2014-05-01T01:25:32.564420794 0.193910s info: Listening for intracluster connections on port 29015
2014-05-01T01:25:32.588540351 0.218030s info: Listening for client driver connections on port 28015
2014-05-01T01:25:32.588758499 0.218248s info: Listening for administrative HTTP connections on port 8080
2014-05-01T01:25:32.588759897 0.218249s info: Listening on addresses: 127.0.0.1, 10.161.142.205
2014-05-01T01:25:32.588760444 0.218250s info: To fully expose RethinkDB on the network, bind to all addresses
2014-05-01T01:25:32.588761001 0.218250s info: by adding `bind=all' to the config file (/etc/rethinkdb/instances.d/db0.conf).
2014-05-01T01:25:32.588761535 0.218251s info: Server ready
2014-05-15T11:12:29.635045878 1244817.264535s error: Error in src/rdb_protocol/lazy_json.cc at line 19:
2014-05-15T11:12:29.635112958 1244817.264602s error: Guarantee failed: [res == archive_result_t::SUCCESS] Deserialization of rdb value failed with error archive_result_t::RANGE_ERROR.
2014-05-15T11:12:29.635135037 1244817.264624s error: Backtrace:
2014-05-15T11:12:29.819389752 1244817.448880s error: Thu May 15 11:12:29 2014\n\n1: format_backtrace(bool) at 0xd1cb43 (/usr/bin/rethinkdb)\n2: report_fatal_error(char const*, int, char const*, ...) at 0xdd4794 (/usr/bin/rethinkdb)\n3: get_data(rdb_value_t const*, buf_parent_t) at 0xd101a8 (/usr/bin/rethinkdb)\n4: lazy_json_t::get() const at 0xd10304 (/usr/bin/rethinkdb)\n5: rget_cb_t::handle_pair(scoped_key_value_t&&, concurrent_traversal_fifo_enforcer_signal_t) at 0xc95f89 (/usr/bin/rethinkdb)\n6: concurrent_traversal_adapter_t::handle_pair_coro(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t) at 0xb38ef7 (/usr/bin/rethinkdb)\n7: callable_action_instance_t<std::_Bind<std::_Mem_fn<void (concurrent_traversal_adapter_t::*)(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> (concurrent_traversal_adapter_t*, scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> >::run_action() at 0xb38dd8 (/usr/bin/rethinkdb)\n8: coro_t::run() at 0xdc751e (/usr/bin/rethinkdb)
2014-05-15T11:12:29.819499407 1244817.448989s error: Exiting.
2014-05-15T11:18:02.048885620 0.028158s info: Running rethinkdb 1.12.4-0ubuntu1~precise (GCC 4.6.3)...
2014-05-15T11:18:02.050985612 0.030258s info: Running on Linux 3.2.0-29-generic x86_64
2014-05-15T11:18:02.051027521 0.030300s info: Using cache size of 602 MB
2014-05-15T11:18:02.051150303 0.030423s info: Loading data from directory /data/rethinkdb/db0
2014-05-15T11:18:02.237516228 0.216788s info: Listening for intracluster connections on port 29015
2014-05-15T11:18:02.269815418 0.249088s info: Listening for client driver connections on port 28015
2014-05-15T11:18:02.270194158 0.249466s info: Listening for administrative HTTP connections on port 8080
2014-05-15T11:18:02.270195561 0.249468s info: Listening on addresses: 127.0.0.1, 10.161.142.205
2014-05-15T11:18:02.270196172 0.249468s info: To fully expose RethinkDB on the network, bind to all addresses
2014-05-15T11:18:02.270196707 0.249469s info: by adding `bind=all' to the config file (/etc/rethinkdb/instances.d/db0.conf).
2014-05-15T11:18:02.270200599 0.249473s info: Server ready
2014-05-15T11:18:33.728954227 31.708227s error: Error in src/rdb_protocol/lazy_json.cc at line 19:
2014-05-15T11:18:33.729024758 31.708297s error: Guarantee failed: [res == archive_result_t::SUCCESS] Deserialization of rdb value failed with error archive_result_t::RANGE_ERROR.
2014-05-15T11:18:33.729044585 31.708317s error: Backtrace:
2014-05-15T11:18:33.744542726 31.723815s error: Error in src/rdb_protocol/lazy_json.cc at line 19:
2014-05-15T11:18:33.744636516 31.723909s error: Guarantee failed: [res == archive_result_t::SUCCESS] Deserialization of rdb value failed with error archive_result_t::RANGE_ERROR.
2014-05-15T11:18:33.744659853 31.723932s error: Backtrace:
2014-05-15T11:18:33.892180415 31.871454s error: Thu May 15 11:18:33 2014\n\n1: format_backtrace(bool) at 0xd1cb43 (/usr/bin/rethinkdb)\n2: report_fatal_error(char const*, int, char const*, ...) at 0xdd4794 (/usr/bin/rethinkdb)\n3: get_data(rdb_value_t const*, buf_parent_t) at 0xd101a8 (/usr/bin/rethinkdb)\n4: lazy_json_t::get() const at 0xd10304 (/usr/bin/rethinkdb)\n5: rget_cb_t::handle_pair(scoped_key_value_t&&, concurrent_traversal_fifo_enforcer_signal_t) at 0xc95f89 (/usr/bin/rethinkdb)\n6: concurrent_traversal_adapter_t::handle_pair_coro(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t) at 0xb38ef7 (/usr/bin/rethinkdb)\n7: callable_action_instance_t<std::_Bind<std::_Mem_fn<void (concurrent_traversal_adapter_t::*)(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> (concurrent_traversal_adapter_t*, scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> >::run_action() at 0xb38dd8 (/usr/bin/rethinkdb)\n8: coro_t::run() at 0xdc751e (/usr/bin/rethinkdb)
2014-05-15T11:18:33.892286047 31.871558s error: Exiting.

@chrisvariety

Here's a cleaned-up backtrace, hope it's useful:

2014-05-01T01:25:32.395479491 0.024969s info: Running rethinkdb 1.12.4-0ubuntu1~precise (GCC 4.6.3)...
2014-05-01T01:25:32.397210840 0.026700s info: Running on Linux 3.2.0-29-generic x86_64
2014-05-01T01:25:32.397258453 0.026748s info: Using cache size of 593 MB
2014-05-01T01:25:32.397418814 0.026909s info: Loading data from directory /data/rethinkdb/db0
2014-05-01T01:25:32.564420794 0.193910s info: Listening for intracluster connections on port 29015
2014-05-01T01:25:32.588540351 0.218030s info: Listening for client driver connections on port 28015
2014-05-01T01:25:32.588758499 0.218248s info: Listening for administrative HTTP connections on port 8080
2014-05-01T01:25:32.588759897 0.218249s info: Listening on addresses: 127.0.0.1, 10.161.142.205
2014-05-01T01:25:32.588760444 0.218250s info: To fully expose RethinkDB on the network, bind to all addresses
2014-05-01T01:25:32.588761001 0.218250s info: by adding `bind=all' to the config file (/etc/rethinkdb/instances.d/db0.conf).
2014-05-01T01:25:32.588761535 0.218251s info: Server ready
2014-05-15T11:12:29.635045878 1244817.264535s error: Error in src/rdb_protocol/lazy_json.cc at line 19:
2014-05-15T11:12:29.635112958 1244817.264602s error: Guarantee failed: [res == archive_result_t::SUCCESS] Deserialization of rdb value failed with error archive_result_t::RANGE_ERROR.
2014-05-15T11:12:29.635135037 1244817.264624s error: Backtrace:
2014-05-15T11:12:29.819389752 1244817.448880s error: Thu May 15 11:12:29 2014

1: format_backtrace(bool) at 0xd1cb43 (/usr/bin/rethinkdb)
2: report_fatal_error(char const*, int, char const*, ...) at 0xdd4794 (/usr/bin/rethinkdb)
3: get_data(rdb_value_t const*, buf_parent_t) at 0xd101a8 (/usr/bin/rethinkdb)
4: lazy_json_t::get() const at 0xd10304 (/usr/bin/rethinkdb)
5: rget_cb_t::handle_pair(scoped_key_value_t&&, concurrent_traversal_fifo_enforcer_signal_t) at 0xc95f89 (/usr/bin/rethinkdb)
6: concurrent_traversal_adapter_t::handle_pair_coro(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t) at 0xb38ef7 (/usr/bin/rethinkdb)
7: callable_action_instance_t<std::_Bind<std::_Mem_fn<void (concurrent_traversal_adapter_t::*)(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> (concurrent_traversal_adapter_t*, scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> >::run_action() at 0xb38dd8 (/usr/bin/rethinkdb)
8: coro_t::run() at 0xdc751e (/usr/bin/rethinkdb)
2014-05-15T11:12:29.819499407 1244817.448989s error: Exiting.

@danielmewes
Member

Thank you for the report @colprog. Sorry you are experiencing this problem.
Also thank you @chrisvariety for cleaning up the backtrace!

The error message is typical of problems related to snapshotting in our cache, and we have seen and fixed a couple of issues before that resulted in similar crashes.
Apparently some problem still remains.
(This is not to say it is definitely related to snapshotting, but based on past experience it seems likely.)

@colprog Can you tell us a bit more about your cluster configuration, data and workload?
Specifically:

  • How many nodes are in the cluster?
  • How many tables do you have?
  • What are the approximate sizes of the documents stored in those tables? How many documents are there?
  • Do you have any secondary indexes?
  • Are the tables replicated and/or sharded?
  • What types of queries do you run against the database? Are there writes going on? Do you use any complex range queries, such as r.table().map(), r.table().filter() or r.table().group()?

I know it takes some time to answer all of these questions, and I really appreciate your help. I'm asking them because I would like to replicate your setup and workload as closely as possible in order to reproduce the bug.

@danielmewes danielmewes added this to the 1.12.x milestone May 15, 2014
@colprog
Author

colprog commented May 15, 2014

We have 1 node with 22 tables on it, no sharding or replication.
We are in development, so the traffic is minimal. If you like, I could send you the db files.

Running the following query crashes the server:
r.db("sheath_dev").table("role").getAll("8f240ce6-8c3e-4cf2-a997-4736816fb8c0", {index: "owner"})

@danielmewes
Member

@colprog Thank you for the info. The db file would be helpful. If you (g)zip the data directory, it should compress fairly well. You can try sending it to me in an email to daniel@rethinkdb.com. If that doesn't work we can provide you with an upload site.

@colprog
Author

colprog commented May 15, 2014

You can try this link:
http://oss.aliyuncs.com/dgmedias_dev/rethinkdb.tar.bz2

@danielmewes
Member

Thank you @colprog for the data. I downloaded it and can start a server with it.
I haven't tried to reproduce the issue yet, but will get to it very soon.
You can delete the file from your webspace.

(Internal note: the data is on newton in /home/daniel/2399-data.tar.bz2)

@colprog
Author

colprog commented May 15, 2014

Great! If you need anything else from me, please let me know.

@danielmewes
Member

I can reproduce the crash with the query you gave me, @colprog. Your information and data have been extremely helpful.
I will get to this issue ASAP.

@danielmewes
Member

The table file is already corrupted, and it is difficult or even impossible to trace how it got to that state from looking at the crash and the file.

My best bet right now is to a) read all the involved code again, and b) try to replicate as much of the table configuration as possible and leave a server running for a long time with additional checks enabled, in the hope that this will trigger the issue.

I don't know if this is actually relevant, but had you run previous versions of RethinkDB (e.g. 1.12.3) on the same data directory, @colprog? Or did you start from scratch after installing 1.12.4?

@colprog
Author

colprog commented May 16, 2014

@danielmewes The server has been upgraded from previous releases since 1.11.0, but I started a new set of tables yesterday, and they are now crashing as well.

@colprog
Author

colprog commented May 16, 2014

@danielmewes And it seems to be getting worse: a simple r.table("role") now crashes the server.

@colprog
Author

colprog commented May 16, 2014

On a newly created and healthy table, the following query sequence seems to trigger the bug and corrupt the table:

r.db('sheath_dev').table("role").insert({"bar":{},"cleared":{},"coins":120,"contribs":0,"createTime":{"$reql_type$":"TIME","epoch_time":1400221967.619,"timezone":"+00:00"},"dailyRefreshData":{"dailyReward":true,"qhourlyReward":1},"energy":5,"energyRefreshTime":{"$reql_type$":"TIME","epoch_time":1400221967.619,"timezone":"+00:00"},"exp":0,"formation":0,"formationLevel":[0,0,0,0,0,0],"golds":20,"id":"66583953-4553-4759-8918-18e3960452dc","lastLogOff":{"$reql_type$":"TIME","epoch_time":1400221968.917,"timezone":"+00:00"},"level":1,"levelCleared":{},"levelGain":{},"name":"新玩家","owner":"8f240ce6-8c3e-4cf2-a997-4736816fb8c0","partition":"f79279b1-3b29-4e08-83d9-bdec419355ce","souls":{},"spent":0,"storageRoom":25,"store":{},"taskClaimed":[],"taskData":{},"taskDone":[],"team":["a470caaf-983c-4516-a470-30c1fd2587d0","49b90dbd-e135-450b-bbc5-f311ec3925db",null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null],"title":"","tutorial":1,"vip":0})
r.db('sheath_dev').table("role").update({update: {"dailyRefreshData": r.literal({})}})

@danielmewes
Member

@colprog Thank you for figuring out the exact query that causes this to happen. That is incredibly helpful!
Looking into it.

@danielmewes
Member

Here is the simplest sequence of commands that I could find to trigger the corruption:

r.tableCreate("t1");
r.table("t1").insert({});
r.table("t1").update({a: {b: r.literal({})}});
r.table("t1")

Replacing the update with r.table("t1").update({a: r.literal({})}); does not crash, nor does r.table("t1").update({a: {b: {}}});.
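
For reference, here is a minimal sketch of driving this exact sequence from the official JavaScript driver in the callback style of the 1.12-era API (the host, port, and database are my assumptions; the queries themselves are the ones above):

var r = require('rethinkdb');

// Assumed connection parameters; adjust to your setup.
r.connect({host: 'localhost', port: 28015, db: 'test'}, function(err, conn) {
  if (err) throw err;
  r.tableCreate('t1').run(conn, function(err) {
    if (err) throw err;
    r.table('t1').insert({}).run(conn, function(err) {
      if (err) throw err;
      // The nested r.literal({}) under a field ("a") that does not yet exist is the trigger.
      r.table('t1').update({a: {b: r.literal({})}}).run(conn, function(err) {
        if (err) throw err;
        // Reading the table back after the corrupting update is what crashes the server.
        r.table('t1').run(conn, function(err, cursor) {
          if (err) throw err;
          cursor.toArray(function(err, rows) {
            console.log(err || rows);
            conn.close();
          });
        });
      });
    });
  });
});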

@danielmewes
Member

Valgrind reports a jump depending on uninitialized data. Could be related:

==20847== Thread 2:
==20847== Conditional jump or move depends on uninitialised value(s)
==20847==    at 0x2B415BD: ql::reader_t::load_items(ql::env_t*, ql::batchspec_t const&) (datum_stream.cc:189)
==20847==    by 0x2B41F16: ql::reader_t::next_batch(ql::env_t*, ql::batchspec_t const&) (datum_stream.cc:228)
==20847==    by 0x2B455C0: ql::lazy_datum_stream_t::next_batch_impl(ql::env_t*, ql::batchspec_t const&) (datum_stream.cc:683)
==20847==    by 0x2B4437A: ql::datum_stream_t::next_batch(ql::env_t*, ql::batchspec_t const&) (datum_stream.cc:554)
==20847==    by 0x29F441F: ql::replace_term_t::eval_impl(ql::scope_env_t*, ql::eval_flags_t) (writes.cc:218)
==20847==    by 0x2A8CCC3: ql::op_term_t::term_eval(ql::scope_env_t*, ql::eval_flags_t) (op.cc:149)
==20847==    by 0x2B532DA: ql::term_t::eval(ql::scope_env_t*, ql::eval_flags_t) (term.cc:363)
==20847==    by 0x2A18FC5: ql::rewrite_term_t::term_eval(ql::scope_env_t*, ql::eval_flags_t) (rewrites.cc:41)
==20847==    by 0x2B532DA: ql::term_t::eval(ql::scope_env_t*, ql::eval_flags_t) (term.cc:363)
==20847==    by 0x2B51D3E: ql::run(ql::protob_t<Query>, rdb_protocol_t::context_t*, signal_t*, Response*, ql::stream_cache2_t*) (term.cc:228)
==20847==    by 0x2A91D4D: query2_server_t::handle(ql::protob_t<Query>, Response*, query2_server_t::context_t*) (pb_server.cc:78)
==20847==    by 0x2A97978: boost::_mfi::mf3<bool, query2_server_t, ql::protob_t<Query>, Response*, query2_server_t::context_t*>::operator()(query2_server_t*, ql::protob_t<Query>, Response*, query2_server_t::context_t*) const (mem_fn_template.hpp:393)

@coffeemug
Contributor

/cc @mlucy

@danielmewes danielmewes modified the milestones: 1.12.5, 1.12.x May 16, 2014
@danielmewes
Member

(reader_t::items_index isn't initialized, apparently)

@danielmewes
Member

Initializing reader_t::items_index to 0 removes the Valgrind warning, but doesn't solve the table corruption.

@danielmewes
Member

This crashes the server as well, without having to read any data back from the table:

r.tableCreate("t1");
r.table("t1").insert({id: "foo"});
r.table("t1").get("foo").update({a: {b: r.literal({})}}, {returnVals: true})

The error and backtrace don't look very helpful:

error: Error in src/rpc/connectivity/cluster.cc at line 988:
error: Guarantee failed: [conn != __null] Attempted to kill connection to myself.
error: Backtrace:
error: Fri May 16 16:36:42 2014

       1: rethinkdb_backtrace(void**, int) at rethinkdb_backtrace.cc:100
       2: backtrace_t::backtrace_t() at backtrace.cc:202
       3: lazy_backtrace_formatter_t::lazy_backtrace_formatter_t() at backtrace.cc:282
       4: format_backtrace(bool) at backtrace.cc:197
       5: report_fatal_error(char const*, int, char const*, ...) at errors.cc:83
       6: connectivity_cluster_t::kill_connection(peer_id_t) at cluster.cc:988
       7: non-virtual thunk to connectivity_cluster_t::kill_connection(peer_id_t) at distccd_a87aa0e1.ii:998
       8: message_multiplexer_t::client_t::kill_connection(peer_id_t) at multiplexer.cc:105
       9: mailbox_manager_t::mailbox_read_coroutine(peer_id_t, threadnum_t, unsigned long, std::vector<char, std::allocator<char> >*, long) at mailbox.cc:201
        [...]
error: Exiting.
Trace/breakpoint trap (core dumped)

It doesn't seem necessary to serialize the updated value into the table to get bad behavior.
(Edit: The kill_connection() that's triggering the crash is actually called as a consequence of attempting to deserialize the write response though)

@danielmewes
Member

@mlucy: This is the datum that's actually generated by the update():

{
    "a": {
        "b": {
            "$reql_type$": "LITERAL",
            "value": {}
        }
    },
    "id": "foo"
}

As you can see, the LITERAL pseudotype becomes part of the document.
When the datum is deserialized, the datum_t::datum_t(std::map) constructor is called, which runs datum_t::maybe_sanitize_ptype(). maybe_sanitize_ptype() throws an exception because the literal keyword is not allowed in that context.

Any idea how the literal keyword can end up in the result of an update()?

@danielmewes
Member

Oh, I see: it never recurses into "a" because the field "a" does not exist on the left side of the merge. Hmm...

Besides the problem with merge(), we should make absolutely sure that either maybe_sanitize_ptype() is run on any datum_t before it is stored in a table, or alternatively that loading a bad pseudotype from disk doesn't crash the server.
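
To illustrate the merge() recursion issue in plain JavaScript (an illustrative model only; the function names and structure here are my assumptions, not RethinkDB's actual C++ implementation): when the field is missing on the left side, a naive merge copies the right-hand subtree verbatim, so the LITERAL wrapper is never resolved and gets written into the stored document.

function isLiteral(v) {
  return v !== null && typeof v === 'object' && v['$reql_type$'] === 'LITERAL';
}

// Strip LITERAL wrappers from a value that has no left-hand counterpart.
function resolveLiterals(v) {
  if (isLiteral(v)) return resolveLiterals(v.value);
  if (v !== null && typeof v === 'object' && !Array.isArray(v)) {
    var out = {};
    Object.keys(v).forEach(function(k) { out[k] = resolveLiterals(v[k]); });
    return out;
  }
  return v;
}

function mergeDatum(left, right) {
  if (isLiteral(right)) return resolveLiterals(right.value);
  if (left === undefined) {
    // This is the problematic path: returning `right` verbatim here keeps any
    // nested LITERAL wrappers in the result; they still have to be resolved.
    return resolveLiterals(right);
  }
  if (left === null || typeof left !== 'object' ||
      right === null || typeof right !== 'object' || Array.isArray(right)) {
    return right;
  }
  var result = {};
  Object.keys(left).forEach(function(k) { result[k] = left[k]; });
  Object.keys(right).forEach(function(k) {
    result[k] = mergeDatum(left.hasOwnProperty(k) ? left[k] : undefined, right[k]);
  });
  return result;
}

// mergeDatum({id: "foo"}, {a: {b: {"$reql_type$": "LITERAL", "value": {}}}})
// should yield {id: "foo", a: {b: {}}}, never a document that still contains
// the "$reql_type$": "LITERAL" wrapper.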

danielmewes pushed a commit that referenced this issue May 17, 2014
@danielmewes
Member

A fix is implemented in branch daniel_2399 and is in code review 1509 by @mlucy.

@AtnNn
Member

AtnNn commented May 20, 2014

How is the review going?

@danielmewes
Member

The fix has been merged into v1.12.x 3185a66
and next 4cbd347 (@AtnNn)

It will ship with RethinkDB 1.12.5.

danielmewes pushed a commit that referenced this issue May 20, 2014
danielmewes pushed a commit that referenced this issue May 20, 2014
@AtnNn
Member

AtnNn commented May 22, 2014

@colprog The fix for this issue has been released today as part of RethinkDB 1.12.5.

@colprog
Author

colprog commented May 22, 2014

Great, thanks! I'll upgrade right away.
