Database corruption (triggers segfault) #52
Comments
It seems that a block is referenced which is actually not in the LBA (or at least not in the in-memory LBA that the serializer has after startup). In case you also want to try reproducing the potential database corruption: I had 4 shards, with the primaries split between the two servers (2 each, no replication / no secondaries). Then I ran the following stress client configuration:
This is where the read request for block id 19 actually originates from (the assertion is mine for debugging):
Ok, I can reproduce this rather reliably now.
My best guess is a race with replication / backfilling. (By the way: I'd recommend adding a line guarantee(data_token); to line 166 of mirrored.cc, between the index_read and the block_read. This gives a slightly more graceful crash on corrupted database files (i.e. no segfault). Corrupted database files might be encountered occasionally even in the absence of any bug in RethinkDB, after all.)
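For illustration, here is a minimal sketch of where such a guarantee would sit; the types, names, and surrounding code below are assumptions made for the example, not the actual contents of mirrored.cc:

```cpp
#include <cassert>
#define guarantee(cond) assert(cond)  // stand-in for RethinkDB's guarantee()

struct block_token_t;                                      // opaque on-disk location of a block
struct serializer_t {
    block_token_t *index_read(int block_id);               // LBA index lookup
    void block_read(block_token_t *token, void *buf_out);  // read the block contents
};

// Hypothetical read path: fail loudly when a corrupted file leaves the index
// without an entry for this block, instead of segfaulting on a null token later.
void load_block(serializer_t *ser, int block_id, void *buf_out) {
    block_token_t *token = ser->index_read(block_id);
    guarantee(token);
    ser->block_read(token, buf_out);
}
```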
@danielmewes -- you're a prince among men! This is definitely @srh's domain.
@danielmewes -- also, could you submit a pull request with your proposed change?
@coffeemug I have no idea in the world how race conditions in the clustering layer would be my domain.
I'll look into this. I think the bug reporter is going to have some bubbles in the pipeline, given that we need to write a separate outside service, and I can fill them with this.
Actually, I think I'm just being an idiot: even if the clustering layer is messed up, or if there's some race condition in the code sending stuff into the btree, that shouldn't cause the btree code (or whatever) to become broken.
@jdoliner I assigned this back to me because I think you were taken in by my moronic comment, but if you want to work on it, feel free to assign it back to yourself.
Well, my speculation about a problem with the backfilling code was due to the fact that the bug only shows for me while replication is being enabled. |
Joe said this would fit in his schedule more nicely, so assigning this back to him. |
Alright, more progress for the interested: the problem is that this causes a block to be looked up which the index doesn't know about. Here's the backtrace of where the faulty call is issued:
As we can see, it's coming from backfilling, which makes me hopeful that we may be able to recreate this with a unit test.
Alright, I believe I have a theory as to how this comes up: in mc_inner_buf_t::mc_inner_buf_t we spawn a coroutine which loads the inner buf using spawn_now_dangerously. There's a comment there that says we do this to make sure load_inner_buf() acquires the lock first. In load_inner_buf we then switch to the serializer's thread and read the block. However, since we switch threads, there's no guarantee that the thing the comment says we need actually happens. We get this error when we rapidly change the replica count from 1 to 2. Changing to 2 causes a backfill to happen. Changing to 1 causes all the data to be deleted. I suspect that a write from the backfill is somehow racing with a delete from the erase range.
I have to admit the name of the spawn method alone doesn't inspire confidence. ;-)
My thoughts exactly. This is basically grandmother's old spawn_now with a new, less flashy name that hopefully deters people from using it. Obviously deterrence hasn't worked as well as we might have hoped.
Alright, this theory is wrong. I ran with one thread and verified that nothing was getting reordered, and it still comes up. Rats.
Alright, some more information has been discovered: the problem arises when someone tries to read a deleted block. I have confirmed that the block does indeed get deleted before it's read. This is the backtrace of the offending delete (I'm not formatting it because that takes forever):
1: lazy_backtrace_t::lazy_backtrace_t() at 0x1df7b81 (../build/debug/rethinkdb)
But we can see that check_and_handle_underfull appears, at least superficially, to be the culprit.
There's light at the end of the tunnel. I have what I believe to be a fix. Currently in testing... |
Okaaay. I don't understand the full nature of this problem nor of the fix. So the fix might well be incomplete. It might have side-effects (though I cannot think of any). First, this is what fixes the problem for me:
I'm unsure whether the set_recency_dirty and set_needs_flush are required. The last one is important for patches only, which seem to be disabled by default. I didn't test whether it is required with patches enabled, but it shouldn't hurt either way, I think. Ok, here's a bit of background for this part of the code. It's all from memory, so things might have slightly changed or I might be mixing up some details. Anyway, let's start with what happens when you delete a block.
However, this shouldn't be necessary. First, we don't seem to set_dirty in the other alternative, where we allocate a fresh inner_buf, and that one doesn't seem to cause any problems. Second, everybody who allocates a fresh block should also initialize it before releasing the lock on it. When it gets initialized (either by applying patches or through get_data_major_write()), set_dirty should be called anyway. A last note:
Alright, I think the most obvious thing for me to do here is write a unit test which emulates the scenario you discuss, and we'll see if it does indeed cause this problem.
I think the API is really confusing. Being able to allocate a block and then have it disappear because you didn't write to it is really weird.
Yeah, but one also shouldn't allocate a block and then not initialize it with any data. At the very least, whoever allocates the block should put some header/magic into it. There are two things at fault here, it seems, which together manage to create catastrophe. By the way: are new blocks zeroed out anywhere? I can't find where that happens, which would mean that part of the data written to disk is random chunks of uninitialized memory. There is this piece of code in mirrored.cc:
But it is not enabled in release mode, and I cannot find how the allocator would guarantee to initialize the data (memory is ultimately allocated using posix_memalign, which doesn't mention initialization in its documentation). Am I missing something?
It's perfectly ok to write random data in release mode -- there is no need to zero it out if the code is correct (other than perhaps to ensure easier debugging later -- but that comes at a cost of performance). Perhaps the API should have a guarantee that triggers if the block has been allocated but not written to? (though this isn't an ideal solution, since it doesn't prevent bugs at compile time, which a better API could).
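One way such a guarantee could look (a hypothetical sketch only; buf_lock_sketch_t, freshly_allocated, and written are invented names, not an existing RethinkDB API):

```cpp
#include <cassert>

// Sketch: remember that a block was freshly allocated and trip an assertion if
// the lock is released before anything was ever written into the block.
struct buf_lock_sketch_t {
    bool freshly_allocated = false;  // set by the allocation path
    bool written = false;            // set by whatever initializes the block's data

    ~buf_lock_sketch_t() {
        assert(!freshly_allocated || written);  // allocated-but-never-written is a bug
    }
};
```

This only catches the mistake at runtime, which is the limitation mentioned above.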
@coffeemug I disagree. Maybe RethinkDB will have support for authentication in a couple of years and thereby handle passwords. Having code around that writes random pieces of memory to disk strikes me as really bad. This specific behavior will sure enough be forgotten eventually, and can have really bad side effects in the future (like writing unencrypted passwords to disk). Whatever, not really a bug for now... I'll pull-request a patch that makes sure that blocks, once allocated, will be persisted no matter whether they are initialized or not. This should actually be done in the buf_lock_t constructor, not in the mc_inner_buf_t::allocate method, where I had put the fix before...
So here's something. I just went through our code and discovered no places in which we allocate a block without initializing it, which seems to contradict our proposed mechanism here. We should look into figuring out where this block gets reallocated from.
Ok, so here's the thing. Initializing a block is usually done by calling get_data_major_write() on the buffer lock. get_data_major_write() in turn calls ensure_flush(), which is meant to - well - ensure that the buffer gets flushed.
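(The actual ensure_flush() source isn't quoted in this thread; what follows is only a hypothetical sketch of the behaviour described in the next paragraph, with assumed member names.)

```cpp
// Sketch, not the real writeback code: ensure_flush() only does its work when
// needs_flush has not been set since the last writeback.
struct writeback_buf_sketch_t {
    bool needs_flush = false;
    bool dirty = false;

    void ensure_flush() {
        if (needs_flush) return;  // already ensured; the next writeback resets the flag
        needs_flush = true;       // the only place where needs_flush gets set
        dirty = true;             // bookkeeping so the buf is part of the next writeback
    }
};
```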
As we can see, it only does its work when needs_flush has not been set yet. This is quite sane, as ensure_flush() is the only place where needs_flush gets set from, and it doesn't make sense to "ensure" a flush twice before another writeback has happened (which would reset the flag). Now there is one little problem with this logic: the part of mc_inner_buf_t::allocate() that is responsible for re-setting an existing buf_t so it can be re-used for a fresh block misses one crucial part: it doesn't reset the writeback_buf. Apart from the dirty flag, it also doesn't reset the needs_flush flag. So let's consider this situation:
I propose fixing this by making allocate() also reset the writeback state properly. A pull request is on the way.
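As a rough illustration of that direction (everything below is an assumption; the real patch touches mc_inner_buf_t::allocate() and its actual member names may differ):

```cpp
// Sketch of the proposed fix, not the actual patch: when an existing buf is
// recycled for a fresh block id, also clear the stale writeback bookkeeping so
// that a later ensure_flush() isn't short-circuited.
struct inner_buf_sketch_t {
    writeback_buf_sketch_t writeback_buf;   // see the sketch a few comments above

    void reset_for_reuse() {
        writeback_buf.needs_flush = false;  // otherwise ensure_flush() would be a no-op
        writeback_buf.dirty = false;        // drop the previous block's dirty state
        // whether recency-dirty state also needs resetting is discussed above
    }
};
```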
I'm getting a crash similar to #33
Except that instead of "Segmentation fault from reading the address (nil).", I have "Segmentation fault from reading the address 0xb0.".
Not sure if it is the same bug or a different one.
Release mode output:
Debug mode output (waayyy more helpful I assume):
The data on which the crash occurs is the result of running the stress client on a sharded (but non-replicated) database for 6 hours. After that I enabled replication and shut down the second server, because I wanted to test outdated reads. Upon restarting the second server, the crash occurred.
To reproduce, please download my RethinkDB data_dirs from here:
http://danielmewes.dnsalias.net/~daniel/.private/cluster_data_segfault.tar.bz2
Then start server 1 (should start up properly):
and server 2 (should crash):