
Crash during rapid resharding #728

Closed
jdoliner opened this issue Apr 26, 2013 · 26 comments

@jdoliner
Contributor

I uncovered the following crash while rapidly resharding:

error: Error in ../src/buffer_cache/mirrored/mirrored.cc at line 657:
error: Assertion failed: [inner_buf->version_id <= version_to_access] 
error: Backtrace:
error: Fri Apr 26 15:15:57 2013

       1: lazy_backtrace_t::lazy_backtrace_t() at backtrace.cc:249
       2: format_backtrace(bool) at backtrace.cc:196
       3: report_fatal_error(char const*, int, char const*, ...) at errors.cc:71
       4: mc_buf_lock_t::acquire_block(unsigned long) at mirrored.cc:657
       5: mc_buf_lock_t::initialize(unsigned long, file_account_t*, lock_in_line_callback_t*) at mirrored.cc:577
       6: mc_buf_lock_t::mc_buf_lock_t(mc_transaction_t*, unsigned int, access_t, buffer_cache_order_mode_t, lock_in_line_callback_t*) at mirrored.cc:517
       7: scc_buf_lock_t<mc_cache_t>::scc_buf_lock_t(scc_transaction_t<mc_cache_t>*, unsigned int, access_t, buffer_cache_order_mode_t, lock_in_line_callback_t*) at semantic_checking.tcc:138
       8: process_a_leaf_node(traversal_state_t*, scoped_ptr_t<scc_buf_lock_t<mc_cache_t> >*, int, btree_key_t const*, btree_key_t const*) at parallel_traversal.cc:469
       9: do_a_subtree_traversal_fsm_t::on_node_ready(scoped_ptr_t<scc_buf_lock_t<mc_cache_t> >*) at parallel_traversal.cc:400
       10: acquire_a_node_fsm_t::you_may_acquire() at parallel_traversal.cc:262
       11: boost::_mfi::mf0<void, acquisition_waiter_callback_t>::operator()(acquisition_waiter_callback_t*) const at mem_fn_template.hpp:50
       12: void boost::_bi::list1<boost::_bi::value<acquisition_waiter_callback_t*> >::operator()<boost::_mfi::mf0<void, acquisition_waiter_callback_t>, boost::_bi::list0>(boost::_bi::type<void>, boost::_mfi::mf0<void, acquisition_waiter_callback_t>&, boost::_bi::list0&, int) at bind.hpp:254
       13: boost::_bi::bind_t<void, boost::_mfi::mf0<void, acquisition_waiter_callback_t>, boost::_bi::list1<boost::_bi::value<acquisition_waiter_callback_t*> > >::operator()() at bind_template.hpp:21
       14: callable_action_instance_t<boost::_bi::bind_t<void, boost::_mfi::mf0<void, acquisition_waiter_callback_t>, boost::_bi::list1<boost::_bi::value<acquisition_waiter_callback_t*> > > >::run_action() at runtime_utils.hpp:58
       15: callable_action_wrapper_t::run() at runtime_utils.cc:67
       16: coro_t::run() at coroutines.cc:178

To reproduce:

  • Insert about 1M documents (I have no idea how important this number is).
  • Rapidly change the number of shards in the UI (don't wait for the resharding process to finish before changing the number).

This issue postdates secondary indexes, so they're a likely candidate, though I currently have no hypothesis for how they could be involved. I had no secondary indexes created at the time I hit this, so I can't think of a reason they'd be spawning parallel traversals. It seems more likely that it's the backfill parallel traversal that's erroring. But who knows.

@jdoliner
Contributor Author

There's a core file for this crash in ~jdoliner/issue_728 on newton.

@Tryneus
Member

Tryneus commented May 2, 2013

Working on this, by the way.

@jdoliner
Contributor Author

jdoliner commented May 2, 2013

Come talk to me about this. I have a few things you could do to prevent #727 from getting in your way too much.

@Tryneus
Member

Tryneus commented May 3, 2013

I keep running into hanging issues while trying to reproduce this. Even after your suggestions, I've got a secondary with its reactors stuck in nothing_when_done_erasing.

@Tryneus
Member

Tryneus commented May 3, 2013

You had a cluster when you first hit this issue, right? How many replicas for the table?

@jdoliner
Contributor Author

jdoliner commented May 4, 2013

Yeah, a cluster of 3 machines. No replicas, meaning just the master.


@Tryneus
Member

Tryneus commented May 9, 2013

I have never once observed this crash; I'm not sure what else I can do.

@Tryneus
Member

Tryneus commented May 9, 2013

Aha, I finally got this error today, based on the issue_727 branch, so this is unfortunately still around.

@danielmewes
Member

Wow, could this be another long-hidden bug in the snapshotting code? I feel for you guys :-|

@Tryneus
Member

Tryneus commented May 14, 2013

Moving this to 1.5.x since there is no way we can track this down in time for 1.5.

@coffeemug
Contributor

@Tryneus -- what's the state of this issue?

@Tryneus
Member

Tryneus commented May 22, 2013

No progress on this; I still have not seen it again despite tons of testing. Until we have a consistent way to reproduce this, I can't really expect us to track it down.

@AtnNn
Member

AtnNn commented Jul 2, 2013

Moving to 1.7.x

@coffeemug
Contributor

Just talked to Marc. This hasn't shown up in a while, and we can't fix it until we have a consistent way to reproduce it. Moving to backlog for now.

@danielmewes
Member

Just noticed that this is probably exactly the problem I'm seeing here: #1380 (there was a second bug in that issue which is already fixed).

I have a set of data files for a 3-node cluster with which this can be easily reproduced. I'm currently looking into this. Assigning to myself.
So far I still don't understand what's going on. Locking should prevent what we are seeing here from ever happening.

@danielmewes
Member

The changes in 17ed59b made the problem much more difficult to reproduce (which is good, because the bug is now relatively rare), but I have still seen it pop up twice since then. Investigating further...

@danielmewes
Member

So we somehow have the following situation (in chronological order):

  1. Write transaction t1 acquires its first block (I assume the superblock). This initializes its version id.
  2. (t1 releases the superblock.)
  3. Write transaction t2 acquires its first block (I assume the superblock). This initializes its version id.
  4. t2 gets a lock on a certain block b.
  5. (t2 releases b.)
  6. t1 gets a lock on the same block b.

The snapshotting, it turns out, was designed under the assumption that a write transaction can never bypass another one. That assumption seems to be violated somewhere, which might be a bug, or maybe it is by design?
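
To make that interleaving concrete, here is a rough model of why it trips the version check from the backtrace above. Every name in it is an assumption made for illustration; this is not the real cache code.

    // Rough model only -- names are invented, this is not RethinkDB's cache code.
    #include <cassert>
    #include <cstdint>

    struct block_t {
        uint64_t version_id = 0;  // version of the last write that touched this block
    };

    // A write transaction acquiring a block: the cache requires that the block
    // has not already been advanced past the transaction's own version.
    void write_acquire(block_t *b, uint64_t trx_version) {
        assert(b->version_id <= trx_version);  // the check from mirrored.cc:657
        b->version_id = trx_version;
    }

    int main() {
        block_t superblock, b;
        const uint64_t t1 = 1;  // t1 acquires its first block earlier -> older version id
        const uint64_t t2 = 2;  // t2 starts later -> newer version id

        write_acquire(&superblock, t1);  // t1 takes (and later releases) the superblock
        write_acquire(&superblock, t2);  // t2 takes the superblock

        write_acquire(&b, t2);  // t2 reaches block b first
        write_acquire(&b, t1);  // t1 arrives afterwards: 2 <= 1 fails, i.e. the
                                // assertion from this issue fires
    }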

@danielmewes
Member

There is strong evidence that our snapshotting logic cannot cope with the stats block which we acquire in parallel_traversal.cc on line 480 (process_a_leaf_node()). We acquire the stats block without any ordering guarantees, sometimes even repeatedly from the same transaction.

It might be possible that our snapshotting code could actually handle such use cases with minor modifications. But I'm not sure.

One solution would be to start a new transaction just to update the stat block each time we have to do so. We would lose the atomicity guarantees though, which is probably not very nice.

The other alternative is to acquire the stat block once, and to only ever acquire it while we still hold the superblock. We would then hold on to the lock while the parallel traversal is proceeding.
I imagine this might pose some issues for concurrency?
@Tryneus, @jdoliner, do you have an idea of how important it is for our performance that we yield the stat block?

@jdoliner
Contributor Author

jdoliner commented Oct 8, 2013

I can confirm that this is a block that we do acquire out of order, which is by design.

I don't think either of your solutions will work, for the reasons you give. How hard would it be to relax the constraint? Another option would be to use the same synchronization method for the stat block as we do for the secondary index block. One thing I'm not sure of is how we handle the stat block when we do an erase_range; it's been a while since I touched this code. That could be kind of hard to do using sindex_tokens for synchronization.

@danielmewes
Copy link
Member

Thanks for chiming in. I'm going to have a deeper look at what it takes to expand the snapshotting logic today.

@danielmewes
Member

So it turns out that this can be worked around in the cache rather easily.

However, with the implementation I have in mind, snapshotting semantics would be violated for the affected blocks.

Let's look at an example with three transactions. First let's say that everything happens in order, and we do not run into the conflict that this issue is about:

  1. trx-write-1 acquires superblock (version is initialized)
  2. trx-read-2 acquires superblock (version is initialized) and is snapshotted
  3. trx-write-3 acquires superblock (version is initialized)
  4. trx-write-1 acquires stat block, modifies it
  5. trx-write-3 acquires stat block, modifies it
  6. trx-read-2 acquires stat block. It can see the changes made by trx-write-1, but it cannot see the changes made by trx-write-3. This is the correct behavior.

Now let's swap steps 4 and 5. Our current implementation has undefined behavior in this case. The adapted one would do the following:
...
  4. trx-write-3 acquires stat block, modifies it
  5. trx-write-1 acquires stat block, modifies it
  6. trx-read-2 acquires stat block. It can see neither the changes made by trx-write-1 nor those made by trx-write-3. It should see the changes made by trx-write-1, so this behavior is incorrect. And it is theoretically impossible to behave correctly in this scenario without making additional assumptions about what kinds of operations the write transactions can perform: trx-read-2 can either see none of the changes or both of them (or deadlock, which I guess would be correct). After all, trx-write-1 might have modified the stat block depending on changes previously made by trx-write-3.

@jdoliner Would such a behavior be acceptable for the stats block? For a snapshotted read transaction, the stats block would (occasionally) appear inconsistent with the rest of the btree.
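
To make the adapted behavior concrete, here is a simplified model of it. This is not the actual snapshotting implementation: the mechanism is reduced to "preserve a copy of the block the first time a newer writer touches it", and every name is invented for illustration.

    // Simplified model of the adapted behavior -- not the actual RethinkDB
    // snapshotting code; it only illustrates why trx-read-2 sees trx-write-1's
    // change in the in-order case but neither change in the swapped case.
    #include <cassert>
    #include <cstdint>
    #include <cstdio>
    #include <map>

    struct block_t {
        int64_t value = 0;        // live contents (think: the stat counter)
        uint64_t version_id = 0;  // version of the last writer
        // Copies preserved for snapshotted readers, keyed by the version the
        // copy represents (the block just before a newer writer touched it).
        std::map<uint64_t, int64_t> snapshots;
    };

    void write(block_t *b, uint64_t writer_version, int64_t delta) {
        if (writer_version > b->version_id) {
            // A newer writer is about to modify the block: preserve the old
            // contents for readers snapshotted at older versions.
            b->snapshots[b->version_id] = b->value;
            b->version_id = writer_version;
        }
        b->value += delta;
    }

    int64_t snapshotted_read(const block_t &b, uint64_t reader_version) {
        // A reader snapshotted at reader_version uses the live block if it is
        // not newer, otherwise the newest preserved copy that is not newer.
        if (b.version_id <= reader_version) {
            return b.value;
        }
        assert(!b.snapshots.empty());
        auto it = b.snapshots.upper_bound(reader_version);
        --it;  // step back to the newest copy with version <= reader_version
        return it->second;
    }

    int main() {
        // trx-write-1 = version 1, trx-read-2 = version 2, trx-write-3 = version 3.

        // In-order case (the first list above): the reader sees trx-write-1's
        // change but not trx-write-3's.
        block_t in_order;
        write(&in_order, 1, +1);
        write(&in_order, 3, +10);
        std::printf("in order: %lld\n", (long long)snapshotted_read(in_order, 2));  // prints 1

        // Swapped case (steps 4 and 5 exchanged): the reader sees neither change.
        block_t swapped;
        write(&swapped, 3, +10);
        write(&swapped, 1, +1);
        std::printf("swapped: %lld\n", (long long)snapshotted_read(swapped, 2));  // prints 0
    }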

@danielmewes
Member

This is the proposed change by the way: aee406d

It does solve the crash (well, it actually removes the assertion, but at the same time adds well-defined behavior for cases where it used to be violated).

@jdoliner
Contributor Author

Short answer: yes, those semantics are fine.

Longer answer:
These semantic violations are independent of snapshotting. We generally allow txns to acquire the stat block in whatever order they want, so a read, whether snapshotted or not, may or may not see the changes made by the currently running write txns.

The two risks that come with violating a DAG locking structure are deadlocks and inconsistent data. The first is solved by adhering to the rule that while you have the stat block acquired you can't acquire any other blocks, so it's in essence a completely separate tree. The second is (mostly) solved by the fact that the only operations permitted on the stat block are increment and decrement, which commute. This means that, unlike with other blocks, writes can happen in any order and eventually result in the same value.

This does allow you to see the effects of some of the currently running writes but not others. We just live with that for now and only use the stat block as an out-of-date count on the tree.
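
A tiny illustration of the commutativity point, in case it helps (hypothetical code, not the actual stat block implementation): because the only permitted operations are increments and decrements, any interleaving of the same set of writes converges to the same final count.

    // Hypothetical illustration only -- not the actual stat block code.
    #include <cassert>
    #include <cstddef>
    #include <cstdint>

    int64_t apply(int64_t counter, const int64_t *deltas, size_t n) {
        for (size_t i = 0; i < n; ++i) {
            counter += deltas[i];  // increments and decrements only
        }
        return counter;
    }

    int main() {
        // One write transaction inserts 3 keys (+3), another erases 1 key (-1).
        const int64_t order_a[] = {+3, -1};  // writers reach the stat block in order
        const int64_t order_b[] = {-1, +3};  // writers reach the stat block out of order
        // Either order yields the same final count, which is why out-of-order
        // acquisition of the stat block is tolerable.
        assert(apply(0, order_a, 2) == apply(0, order_b, 2));
        return 0;
    }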

@danielmewes
Member

The fix is in code review 978.

@danielmewes
Member

This is in next as of commit c42fbc9.

@danielmewes
Member

... and also cherry-picked into v1.10.x as of d582a0d
