Database corruption (triggers segfault) #52
Comments
It seems that a block is referenced which is actually not in the LBA (or at least not in the in-memory LBA that the serializer has after startup). In case you also want to try reproducing the potential database corruption: I had 4 shards, with the primaries split between the two servers (2 each, no replication / no secondaries). Then I ran the following stress client configuration:
This is where the read request for block id 19 actually originates from (the assertion is mine for debugging):
Ok, I can reproduce this rather reliably now.
My best guess is a race with replication / backfilling. (By the way: I'd recommend adding a line guarantee(data_token); to line 166 of mirrored.cc, between the index_read and the block_read. This gives a slightly more graceful crash on corrupted database files (i.e. no segfault). Corrupted database files might be encountered occasionally even in the absence of any bug in RethinkDB, after all.)
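For illustration, here is a minimal sketch of where such a guarantee would sit; the types, names, and surrounding code below are assumptions made for the example, not the actual contents of mirrored.cc:

```cpp
#include <cassert>
#define guarantee(cond) assert(cond)  // stand-in for RethinkDB's guarantee()

struct block_token_t;                                      // opaque on-disk location of a block
struct serializer_t {
    block_token_t *index_read(int block_id);               // LBA index lookup
    void block_read(block_token_t *token, void *buf_out);  // read the block contents
};

// Hypothetical read path: fail loudly when a corrupted file leaves the index
// without an entry for this block, instead of segfaulting on a null token later.
void load_block(serializer_t *ser, int block_id, void *buf_out) {
    block_token_t *token = ser->index_read(block_id);
    guarantee(token);
    ser->block_read(token, buf_out);
}
```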
@danielmewes -- you're a prince among men! This is definitely @srh's domain.
@danielmewes -- also, could you submit a pull request with your proposed change?
@coffeemug I have no idea in the world how race conditions in the clustering layer would be my domain.
I'll look into this. I think the bug reporter is going to have some bubbles in the pipeline, given that we need to write a separate outside service, and I can fill them with this.
Actually, I think I'm just being an idiot: even if the clustering layer is messed up, or if there's some race condition in the code sending stuff into the btree, that shouldn't cause the btree code (or whatever) to become broken.
@jdoliner I assigned this back to me because I think you were taken in by my moronic comment, but if you want to work on it, feel free to assign it back to yourself.
Well, my speculation about a problem with the backfilling code was due to the fact that the bug only shows for me while replication is being enabled. |
Joe said this would fit in his schedule more nicely, so assigning this back to him. |
Alright, more progress for the interested: the problem is that this causes a block to be looked up which the index doesn't know about. Here's the backtrace of where the faulty call is issued:
As we can see, it's coming from backfilling, which makes me hopeful that we may be able to recreate this with a unit test.
Alright, I believe I have a theory as to how this comes up: in mc_inner_buf_t::mc_inner_buf_t we spawn a coroutine which loads the inner buf using spawn_now_dangerously. There's a comment there that says we do this to make sure load_inner_buf() acquires the lock first. In load_inner_buf we then switch to the serializer's thread and read the block. However, since we switch threads, there's no guarantee that the thing the comment says we need actually happens. We get this error when we rapidly change the replica count from 1 to 2. Changing to 2 causes a backfill to happen. Changing to 1 causes all the data to be deleted. I suspect that a write from the backfill is somehow racing with a delete from the erase range.
I have to admit the name of the spawn method alone doesn't inspire confidence. ;-)
My thoughts exactly. This is basically grandmother's old spawn_now with a new, less flashy name that hopefully deters people from using it. Obviously deterrence hasn't worked as well as we might have hoped.
Alright, this theory is wrong. I ran with one thread and verified that nothing was getting reordered, and it still comes up. Rats.
Alright, some more information has been discovered: the problem arises when someone tries to read a deleted block. I have confirmed that the block does indeed get deleted before it's read. This is the backtrace of the offending delete (I'm not formatting it because that takes forever):
1: lazy_backtrace_t::lazy_backtrace_t() at 0x1df7b81 (../build/debug/rethinkdb)
But we can see that check_and_handle_underfull appears, at least superficially, to be the culprit.
There's light at the end of the tunnel. I have what I believe to be a fix. Currently in testing... |
Okaaay. I don't understand the full nature of this problem nor of the fix. So the fix might well be incomplete. It might have side-effects (though I cannot think of any). First, this is what fixes the problem for me:
I'm unsure whether the set_recency_dirty and set_needs_flush are required. The last one is important for patches only, which seem to be disabled by default. I didn't test whether it is required with patches enabled, but it shouldn't hurt either way, I think. Ok, here's a bit of background for this part of the code. It's all from memory, so things might have slightly changed or I might be mixing up some details. Anyway, let's start with what happens when you delete a block.
However, this shouldn't be necessary. First, we don't seem to set_dirty in the other alternative, where we allocate a fresh inner_buf, and that one doesn't seem to cause any problems. Second, everybody who allocates a fresh block should also initialize it before releasing the lock on it. When it gets initialized (either by applying patches or through get_data_major_write()), set_dirty should be called anyway. A last note:
Alright, I think the most obvious thing for me to do here is write a unit test which emulates the scenario you discuss, and we'll see if it does indeed cause this problem.
I think the API is really confusing. Being able to allocate a block and then have it disappear because you didn't write to it is really weird.
Yeah, but one also shouldn't allocate a block and then not initialize it with any data. At the very least, whoever allocates the block should put some header/magic into it. There are two things at fault here, it seems, which together manage to create catastrophe. By the way: are new blocks zeroed out anywhere? I can't find where that happens, which would mean that part of the data written to disk is random chunks of uninitialized memory. There is this piece of code in mirrored.cc:
But it is not enabled in release mode, and I cannot find how the allocator would guarantee to initialize the data (memory is ultimately allocated using posix_memalign, which doesn't mention initialization in its documentation). Am I missing something?
It's perfectly ok to write random data in release mode -- there is no need to zero it out if the code is correct (other than perhaps to ensure easier debugging later -- but that comes at a cost of performance). Perhaps the API should have a guarantee that triggers if the block has been allocated but not written to? (though this isn't an ideal solution, since it doesn't prevent bugs at compile time, which a better API could).
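One way such a guarantee could look (a hypothetical sketch only; buf_lock_sketch_t, freshly_allocated, and written are invented names, not an existing RethinkDB API):

```cpp
#include <cassert>

// Sketch: remember that a block was freshly allocated and trip an assertion if
// the lock is released before anything was ever written into the block.
struct buf_lock_sketch_t {
    bool freshly_allocated = false;  // set by the allocation path
    bool written = false;            // set by whatever initializes the block's data

    ~buf_lock_sketch_t() {
        assert(!freshly_allocated || written);  // allocated-but-never-written is a bug
    }
};
```

This only catches the mistake at runtime, which is the limitation mentioned above.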
@coffeemug I disagree. Maybe RethinkDB will have support for authentication in a couple of years and thereby handle passwords. Having code around that writes random pieces of memory to disk strikes me as really bad. This specific behavior will sure enough be forgotten eventually, and can have really bad side effects in the future (like writing unencrypted passwords to disk). Whatever, not really a bug for now... I'll pull-request a patch that makes sure that blocks, once allocated, will be persisted no matter whether they are initialized or not. This should actually be done in the buf_lock_t constructor, not in the mc_inner_buf_t::allocate method, where I had put the fix before...
So here's something. I just went through our code and discovered no places in which we allocate a block without initializing it, which seems to contradict our proposed mechanism here. We should look into figuring out where this block gets reallocated from.
Ok, so here's the thing. Initializing a block is usually done by calling get_data_major_write() on the buffer lock. get_data_major_write() in turn calls ensure_flush(), which is meant to - well - ensure that the buffer gets flushed.
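(The actual ensure_flush() source isn't quoted in this thread; what follows is only a hypothetical sketch of the behaviour described in the next paragraph, with assumed member names.)

```cpp
// Sketch, not the real writeback code: ensure_flush() only does its work when
// needs_flush has not been set since the last writeback.
struct writeback_buf_sketch_t {
    bool needs_flush = false;
    bool dirty = false;

    void ensure_flush() {
        if (needs_flush) return;  // already ensured; the next writeback resets the flag
        needs_flush = true;       // the only place where needs_flush gets set
        dirty = true;             // bookkeeping so the buf is part of the next writeback
    }
};
```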
As we can see, it only does its work when needs_flush has not been set yet. This is quite sane, as ensure_flush() is the only place where needs_flush gets set from, and it doesn't make sense to "ensure" a flush twice before another writeback has happened (which would reset the flag). Now there is one little problem with this logic: the part of mc_inner_buf_t::allocate() that is responsible for re-setting an existing buf_t so it can be re-used for a fresh block misses one crucial part: it doesn't reset the writeback_buf. Apart from the dirty flag, it also doesn't reset the needs_flush flag. So let's consider this situation:
I propose fixing this by making allocate() also reset the writeback state properly. A pull request is on the way.
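As a rough illustration of that direction (everything below is an assumption; the real patch touches mc_inner_buf_t::allocate() and its actual member names may differ):

```cpp
// Sketch of the proposed fix, not the actual patch: when an existing buf is
// recycled for a fresh block id, also clear the stale writeback bookkeeping so
// that a later ensure_flush() isn't short-circuited.
struct inner_buf_sketch_t {
    writeback_buf_sketch_t writeback_buf;   // see the sketch a few comments above

    void reset_for_reuse() {
        writeback_buf.needs_flush = false;  // otherwise ensure_flush() would be a no-op
        writeback_buf.dirty = false;        // drop the previous block's dirty state
        // whether recency-dirty state also needs resetting is discussed above
    }
};
```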
I'm getting a crash similar to #33
Except that instead of "Segmentation fault from reading the address (nil).", I have "Segmentation fault from reading the address 0xb0.".
Not sure if it is the same bug or a different one.
Release mode output:
Debug mode output (waayyy more helpful I assume):
The data on which the crash occurs is the result of running the stress client on a sharded (but non-replicated) database for 6 hours. After that I enabled replication and shut down the second server, because I wanted to test outdated reads. Upon restarting the second server, the crash occurred.
To reproduce, please download my RethinkDB data_dirs from here:
http://danielmewes.dnsalias.net/~daniel/.private/cluster_data_segfault.tar.bz2
Then start server 1 (should start up properly):
and server 2 (should crash):