Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix use-after-free bug in indexer state #896

Merged
merged 10 commits into from
Jun 9, 2020
Merged

Fix use-after-free bug in indexer state #896

merged 10 commits into from
Jun 9, 2020

Conversation

lava
Copy link
Member

@lava lava commented Jun 3, 2020

NOTE: I did not see any mechanism to ensure that index and indexer are destroyed in the correct order, but if one exists it would probably better to fix that mechanism rather than introducing shared_ptr here.

@dominiklohmann dominiklohmann added the bug Incorrect behavior label Jun 4, 2020
@tobim
Copy link
Member

tobim commented Jun 4, 2020

This should actually be covered by the protocol between the index and the indexers. The indexers send a done message once they are finished processing all incoming table slices, this is then handled by index_state::decrement_indexer_count() in the index actor.

Copy link
Member Author

@lava lava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit concerned about this comment in index.cpp:171: (it seems i cant add comments to lines that were not changed in this PR?)

    // Flush all unpersisted partitions. This only writes the meta state of
    // each partition. For actually writing the contents of each INDEXER we
    // need to rely on messaging.

If I'm understanding this correctly, it seems to suggest that indexers are supposed to continue writing after the index has shut down?

Also, it would be nice to write a test for this, but I'm not sure how to construct one.

libvast/src/system/index.cpp Outdated Show resolved Hide resolved
@tobim
Copy link
Member

tobim commented Jun 5, 2020

FYI: I can locally reproduce the issue with

python integration/integration.py -d it --app build/ci/bin/vast -t "Conn log counting"

The current change does not fix the problem.

@lava
Copy link
Member Author

lava commented Jun 5, 2020

@tobim I added a missing return; statement so it may be fixed now, although I still cant reproduce this locally on my machine.

@tobim tobim force-pushed the story/ch17132 branch 2 times, most recently from 215bf80 to 337d6b8 Compare June 9, 2020 12:44
@tobim
Copy link
Member

tobim commented Jun 9, 2020

@tenzir/backend a review would be very welcome.

CMakeLists.txt Show resolved Hide resolved
libvast/src/system/indexer_downstream_manager.cpp Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
Benno Evers and others added 7 commits June 9, 2020 16:33
When receiving an `exit` message in the index, we flush the
meta state of all partitions but dont wait until all indexers
are finished.

This can cause memory corruption issues, because the indexers
contain pointers to `measurement` structs that are stored
inside the `partition` classes, which are destructed along with
the index.

This could also lead to data loss, when an indexer handles a batch
after the index was already destructed.
@tobim
Copy link
Member

tobim commented Jun 9, 2020

I'll wait a bit with the rebase push so you can verify if your comments are addressed.

@dominiklohmann
Copy link
Member

I resolved every item but one. From my side this is ready to be merged, but I cannot reproduce the issue locally, which means I also cannot test it.

Copy link
Member

@dominiklohmann dominiklohmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code makes sense to me. Since I cannot test this, please test this again on your end before merging to verify (since you're the only that managed to reproduce it locally).

Copy link
Member Author

@lava lava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, apart from some minor things below. I couldn't test it locally, but for a CI issue probably a green CI run will be good enough anyways. (and i also cant approve because i'm the PR creator)

@@ -11,6 +11,9 @@ Every entry has a category for which we use the following visual abbreviations:

## Unreleased

- 🐞 A use after free bug would sometimes crash the node while it was shutting
down. [#896](https://github.com/tenzir/vast/pull/896)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The bug didn't crash the node, it was our asan instrumentation that did ;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically true, but that is just luck, the memory location could have been given back to the OS. Also I don't want to get too technical in the changelog.

configure Outdated Show resolved Hide resolved
CMakeLists.txt Outdated Show resolved Hide resolved
VAST_ASSERT(*it != nullptr);
if (buffered(**it) == 0u) {
// ... either removing them directly if the buffers are empty,
// meaning all table slices have been forwarded to the indexers,...
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this part an optimization, or necessary for correctness? I was wondering why we need to distinguish between empty/non-empty here and in unregister_partition(), but not in register_partition() below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe both instances are purely defensive, I don't want to rely on emit_batches_impl() to clean up partitions that are already done from its point of view (although it currently does).

@tobim tobim merged commit bcd4544 into master Jun 9, 2020
@tobim tobim deleted the story/ch17132 branch June 9, 2020 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants