
Crash during rapid resharding #728

Closed
jdoliner opened this issue Apr 26, 2013 · 26 comments

@jdoliner
Contributor

I uncovered the following crash while rapidly resharding:

error: Error in ../src/buffer_cache/mirrored/mirrored.cc at line 657:
error: Assertion failed: [inner_buf->version_id <= version_to_access] 
error: Backtrace:
error: Fri Apr 26 15:15:57 2013

       1: lazy_backtrace_t::lazy_backtrace_t() at backtrace.cc:249
       2: format_backtrace(bool) at backtrace.cc:196
       3: report_fatal_error(char const*, int, char const*, ...) at errors.cc:71
       4: mc_buf_lock_t::acquire_block(unsigned long) at mirrored.cc:657
       5: mc_buf_lock_t::initialize(unsigned long, file_account_t*, lock_in_line_callback_t*) at mirrored.cc:577
       6: mc_buf_lock_t::mc_buf_lock_t(mc_transaction_t*, unsigned int, access_t, buffer_cache_order_mode_t, lock_in_line_callback_t*) at mirrored.cc:517
       7: scc_buf_lock_t<mc_cache_t>::scc_buf_lock_t(scc_transaction_t<mc_cache_t>*, unsigned int, access_t, buffer_cache_order_mode_t, lock_in_line_callback_t*) at semantic_checking.tcc:138
       8: process_a_leaf_node(traversal_state_t*, scoped_ptr_t<scc_buf_lock_t<mc_cache_t> >*, int, btree_key_t const*, btree_key_t const*) at parallel_traversal.cc:469
       9: do_a_subtree_traversal_fsm_t::on_node_ready(scoped_ptr_t<scc_buf_lock_t<mc_cache_t> >*) at parallel_traversal.cc:400
       10: acquire_a_node_fsm_t::you_may_acquire() at parallel_traversal.cc:262
       11: boost::_mfi::mf0<void, acquisition_waiter_callback_t>::operator()(acquisition_waiter_callback_t*) const at mem_fn_template.hpp:50
       12: void boost::_bi::list1<boost::_bi::value<acquisition_waiter_callback_t*> >::operator()<boost::_mfi::mf0<void, acquisition_waiter_callback_t>, boost::_bi::list0>(boost::_bi::type<void>, boost::_mfi::mf0<void, acquisition_waiter_callback_t>&, boost::_bi::list0&, int) at bind.hpp:254
       13: boost::_bi::bind_t<void, boost::_mfi::mf0<void, acquisition_waiter_callback_t>, boost::_bi::list1<boost::_bi::value<acquisition_waiter_callback_t*> > >::operator()() at bind_template.hpp:21
       14: callable_action_instance_t<boost::_bi::bind_t<void, boost::_mfi::mf0<void, acquisition_waiter_callback_t>, boost::_bi::list1<boost::_bi::value<acquisition_waiter_callback_t*> > > >::run_action() at runtime_utils.hpp:58
       15: callable_action_wrapper_t::run() at runtime_utils.cc:67
       16: coro_t::run() at coroutines.cc:178

To reproduce:

  • Insert about 1M documents (I have no idea how important this number is).
  • Rapidly change the number of shards in the UI (don't wait for the resharding process to finish before changing the number).

This issue postdates secondary indexes, so they're a likely candidate, though I currently have no hypothesis for how they could be involved. I had no secondary indexes created at the time I hit this, so I can't think of a reason they'd be spawning parallel traversals. It seems more likely that it's the backfill parallel traversal that's erroring. But who knows.

@jdoliner
Contributor Author

There's a core file for this crash in ~jdoliner/issue_728 on newton.

@Tryneus
Member

Tryneus commented May 2, 2013

Working on this, by the way.

@jdoliner
Contributor Author

jdoliner commented May 2, 2013

Come talk to me about this. I have a few things you could do to prevent #727 from getting in your way too much.

@Tryneus
Member

Tryneus commented May 3, 2013

I keep running into hanging issues while trying to reproduce this. Even after your suggestions, I've got a secondary with its reactors stuck in nothing_when_done_erasing.

@Tryneus
Member

Tryneus commented May 3, 2013

You had a cluster when you first hit this issue, right? How many replicas for the table?

@jdoliner
Contributor Author

jdoliner commented May 4, 2013

Yeah, a cluster of 3 machines. No replicas, meaning just the master.


@Tryneus
Member

Tryneus commented May 9, 2013

I have never once observed this crash; I'm not sure what else I can do.

@Tryneus
Member

Tryneus commented May 9, 2013

Aha, I finally got this error today, based on the issue_727 branch, so this is unfortunately still around.

@danielmewes
Member

Wow, could this be another long-hidden bug in the snapshotting code? I feel for you guys :-|

@Tryneus
Member

Tryneus commented May 14, 2013

Moving this to 1.5.x since there is no way we can track this down in time for 1.5.

@coffeemug
Contributor

@Tryneus -- what's the state of this issue?

@Tryneus
Member

Tryneus commented May 22, 2013

No progress on this; I still have not seen it again despite tons of testing. Until we have a consistent way to reproduce this, I can't really expect us to track it down.

@AtnNn
Member

AtnNn commented Jul 2, 2013

Moving to 1.7.x

@coffeemug
Contributor

Just talked to Marc. This hasn't shown up in a while, and we can't fix it until we have a consistent way to reproduce it. Moving to backlog for now.

@danielmewes
Member

Just noticed that this is probably exactly the problem I'm seeing here: #1380 (there was a second bug in that issue which is already fixed).

I have a set of data files for a 3-node cluster with which this can be easily reproduced. I'm currently looking into this. Assigning to myself.
So far I still don't understand what's going on. Locking should prevent what we are seeing here from ever happening.

@danielmewes
Member

The changes in 17ed59b made the problem much more difficult to reproduce (which is good, because the bug is now relatively rare), but I have still seen it pop up twice since then. Investigating further...

@danielmewes
Member

So we somehow have the following situation (in chronological order):

  1. Write transaction t1 acquires its first block (I assume the superblock). This initializes its version id.
  2. (t1 releases the superblock.)
  3. Write transaction t2 acquires its first block (I assume the superblock). This initializes its version id.
  4. t2 gets a lock on a certain block b.
  5. (t2 releases b.)
  6. t1 gets a lock on the same block b.

The snapshotting, it turns out, was designed under the assumption that a write transaction can never bypass another one. That assumption seems to be violated somewhere, which might be a bug, or maybe it is by design?
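
To make that interleaving concrete, here is a rough model of why it trips the version check from the backtrace above. Every name in it is an assumption made for illustration; this is not the real cache code.

    // Rough model only -- names are invented, this is not RethinkDB's cache code.
    #include <cassert>
    #include <cstdint>

    struct block_t {
        uint64_t version_id = 0;  // version of the last write that touched this block
    };

    // A write transaction acquiring a block: the cache requires that the block
    // has not already been advanced past the transaction's own version.
    void write_acquire(block_t *b, uint64_t trx_version) {
        assert(b->version_id <= trx_version);  // the check from mirrored.cc:657
        b->version_id = trx_version;
    }

    int main() {
        block_t superblock, b;
        const uint64_t t1 = 1;  // t1 acquires its first block earlier -> older version id
        const uint64_t t2 = 2;  // t2 starts later -> newer version id

        write_acquire(&superblock, t1);  // t1 takes (and later releases) the superblock
        write_acquire(&superblock, t2);  // t2 takes the superblock

        write_acquire(&b, t2);  // t2 reaches block b first
        write_acquire(&b, t1);  // t1 arrives afterwards: 2 <= 1 fails, i.e. the
                                // assertion from this issue fires
    }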

@danielmewes
Member

There is strong evidence that our snapshotting logic cannot cope with the stats block which we acquire in parallel_traversal.cc on line 480 (process_a_leaf_node()). We acquire the stats block without any ordering guarantees, sometimes even repeatedly from the same transaction.

It might be possible that our snapshotting code could actually handle such use cases with minor modifications. But I'm not sure.

One solution would be to start a new transaction just to update the stat block each time we have to do so. We would lose the atomicity guarantees though, which is probably not very nice.

The other alternative is to acquire the stat block once, and to only ever acquire it while we still hold the superblock. We would then hold on to the lock while the parallel traversal is proceeding.
I imagine this might pose some issues for concurrency?
@Tryneus, @jdoliner, do you have an idea of how important it is for our performance that we yield the stat block?

@jdoliner
Contributor Author

jdoliner commented Oct 8, 2013

I can confirm that this is a block that we do acquire out of order, which is by design.

I don't think either of your solutions will work, for the reasons you give. How hard would it be to relax the constraint? Another option would be to use the same synchronization method for the stat block as we do for the secondary index block. One thing I'm not sure of is how we handle the stat block when we do an erase_range; it's been a while since I touched this code. That could be kind of hard to do using sindex_tokens for synchronization.

@danielmewes
Copy link
Member

Thanks for chiming in. I'm going to have a deeper look at what it takes to expand the snapshotting logic today.

@danielmewes
Member

So it turns out that this can be worked around in the cache rather easily.

However, with the implementation I have in mind, snapshotting semantics would be violated for the affected blocks.

Let's look at an example with three transactions. First let's say that everything happens in order, and we do not run into the conflict that this issue is about:

  1. trx-write-1 acquires superblock (version is initialized)
  2. trx-read-2 acquires superblock (version is initialized) and is snapshotted
  3. trx-write-3 acquires superblock (version is initialized)
  4. trx-write-1 acquires stat block, modifies it
  5. trx-write-3 acquires stat block, modifies it
  6. trx-read-2 acquires stat block. It can see the changes made by trx-write-1, but it cannot see the changes made by trx-write-3. This is the correct behavior.

Now let's swap steps 4 and 5. Our current implementation has undefined behavior in this case. The adapted one would do the following:
...
  4. trx-write-3 acquires stat block, modifies it
  5. trx-write-1 acquires stat block, modifies it
  6. trx-read-2 acquires stat block. It can see neither the changes made by trx-write-1 nor those made by trx-write-3. It should see the changes made by trx-write-1, so this behavior is incorrect. And it is theoretically impossible to behave correctly in this scenario without making additional assumptions about what kinds of operations the write transactions can perform: trx-read-2 can either see none of the changes or both of them (or deadlock, which I guess would be correct). After all, trx-write-1 might have modified the stat block depending on changes previously made by trx-write-3.

@jdoliner Would such a behavior be acceptable for the stats block? For a snapshotted read transaction, the stats block would (occasionally) appear inconsistent with the rest of the btree.
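
To make the adapted behavior concrete, here is a simplified model of it. This is not the actual snapshotting implementation: the mechanism is reduced to "preserve a copy of the block the first time a newer writer touches it", and every name is invented for illustration.

    // Simplified model of the adapted behavior -- not the actual RethinkDB
    // snapshotting code; it only illustrates why trx-read-2 sees trx-write-1's
    // change in the in-order case but neither change in the swapped case.
    #include <cassert>
    #include <cstdint>
    #include <cstdio>
    #include <map>

    struct block_t {
        int64_t value = 0;        // live contents (think: the stat counter)
        uint64_t version_id = 0;  // version of the last writer
        // Copies preserved for snapshotted readers, keyed by the version the
        // copy represents (the block just before a newer writer touched it).
        std::map<uint64_t, int64_t> snapshots;
    };

    void write(block_t *b, uint64_t writer_version, int64_t delta) {
        if (writer_version > b->version_id) {
            // A newer writer is about to modify the block: preserve the old
            // contents for readers snapshotted at older versions.
            b->snapshots[b->version_id] = b->value;
            b->version_id = writer_version;
        }
        b->value += delta;
    }

    int64_t snapshotted_read(const block_t &b, uint64_t reader_version) {
        // A reader snapshotted at reader_version uses the live block if it is
        // not newer, otherwise the newest preserved copy that is not newer.
        if (b.version_id <= reader_version) {
            return b.value;
        }
        assert(!b.snapshots.empty());
        auto it = b.snapshots.upper_bound(reader_version);
        --it;  // step back to the newest copy with version <= reader_version
        return it->second;
    }

    int main() {
        // trx-write-1 = version 1, trx-read-2 = version 2, trx-write-3 = version 3.

        // In-order case (the first list above): the reader sees trx-write-1's
        // change but not trx-write-3's.
        block_t in_order;
        write(&in_order, 1, +1);
        write(&in_order, 3, +10);
        std::printf("in order: %lld\n", (long long)snapshotted_read(in_order, 2));  // prints 1

        // Swapped case (steps 4 and 5 exchanged): the reader sees neither change.
        block_t swapped;
        write(&swapped, 3, +10);
        write(&swapped, 1, +1);
        std::printf("swapped: %lld\n", (long long)snapshotted_read(swapped, 2));  // prints 0
    }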

@danielmewes
Member

This is the proposed change by the way: aee406d

It does solve the crash (well, it actually removes the assertion, but at the same time adds well-defined behavior for cases where it used to be violated).

@jdoliner
Contributor Author

Short answer: yes, those semantics are fine.

Longer answer:
These semantic violations are independent of snapshotting. We generally allow txns to acquire the stat block in whatever order they want, so a read, whether snapshotted or not, may or may not see the changes made by the currently running write txns.

The two risks that come with violating a DAG locking structure are deadlocks and inconsistent data. The first is solved by adhering to the rule that while you have the stat block acquired you can't acquire any other blocks, so it's in essence a completely separate tree. The second is (mostly) solved by the fact that the only operations permitted on the stat block are increment and decrement, which commute. This means that, unlike with other blocks, writes can happen in any order and eventually result in the same value.

This does allow you to see the effects of some of the currently running writes but not others. We just live with that for now and only use the stat block as an out-of-date count on the tree.
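
A tiny illustration of the commutativity point, in case it helps (hypothetical code, not the actual stat block implementation): because the only permitted operations are increments and decrements, any interleaving of the same set of writes converges to the same final count.

    // Hypothetical illustration only -- not the actual stat block code.
    #include <cassert>
    #include <cstddef>
    #include <cstdint>

    int64_t apply(int64_t counter, const int64_t *deltas, size_t n) {
        for (size_t i = 0; i < n; ++i) {
            counter += deltas[i];  // increments and decrements only
        }
        return counter;
    }

    int main() {
        // One write transaction inserts 3 keys (+3), another erases 1 key (-1).
        const int64_t order_a[] = {+3, -1};  // writers reach the stat block in order
        const int64_t order_b[] = {-1, +3};  // writers reach the stat block out of order
        // Either order yields the same final count, which is why out-of-order
        // acquisition of the stat block is tolerable.
        assert(apply(0, order_a, 2) == apply(0, order_b, 2));
        return 0;
    }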

@danielmewes
Member

The fix is in code review 978.

@danielmewes
Member

This is in next as of commit c42fbc9.

@danielmewes
Member

... and also cherry-picked into v1.10.x as of d582a0d
