Fix use-after-free bug in indexer state #896
Conversation
This should actually be covered by the protocol between the index and the indexers. The indexers send a
I'm a bit concerned about this comment in index.cpp:171 (it seems I can't add comments to lines that were not changed in this PR):
// Flush all unpersisted partitions. This only writes the meta state of
// each partition. For actually writing the contents of each INDEXER we
// need to rely on messaging.
If I'm understanding this correctly, it seems to suggest that indexers are supposed to continue writing after the index has shut down?
Also, it would be nice to write a test for this, but I'm not sure how to construct one.
FYI: I can locally reproduce the issue with `python integration/integration.py -d it --app build/ci/bin/vast -t "Conn log counting"`. The current change does not fix the problem.
@tobim I added a missing |
215bf80 to 337d6b8
@tenzir/backend a review would be very welcome.
When receiving an `exit` message in the index, we flush the meta state of all partitions but don't wait until all indexers are finished. This can cause memory corruption, because the indexers hold pointers to `measurement` structs that are stored inside the `partition` objects, which are destroyed along with the index. It could also lead to data loss if an indexer handles a batch after the index has already been destroyed.
I'll wait a bit with the rebase push so you can verify if your comments are addressed.
I resolved every item but one. From my side this is ready to be merged, but I cannot reproduce the issue locally, which means I also cannot test it.
The code makes sense to me. Since I cannot test this, please test it again on your end before merging to verify (since you're the only one who managed to reproduce it locally).
Looks good to me, apart from some minor things below. I couldn't test it locally, but for a CI issue a green CI run will probably be good enough anyway. (I also can't approve because I'm the PR creator.)
@@ -11,6 +11,9 @@ Every entry has a category for which we use the following visual abbreviations:

## Unreleased

- 🐞 A use after free bug would sometimes crash the node while it was shutting
  down. [#896](https://github.com/tenzir/vast/pull/896)
nit: The bug didn't crash the node, it was our ASan instrumentation that did ;)
Technically true, but that's just luck; the memory location could have been given back to the OS. Also, I don't want to get too technical in the changelog.
VAST_ASSERT(*it != nullptr);
if (buffered(**it) == 0u) {
  // ... either removing them directly if the buffers are empty,
  // meaning all table slices have been forwarded to the indexers,...
Is this part an optimization, or necessary for correctness? I was wondering why we need to distinguish between empty/non-empty here and in `unregister_partition()`, but not in `register_partition()` below.
I believe both instances are purely defensive; I don't want to rely on `emit_batches_impl()` to clean up partitions that are already done from its point of view (although it currently does).
NOTE: I did not see any mechanism to ensure that index and indexer are destroyed in the correct order, but if one exists it would probably be better to fix that mechanism rather than introducing `shared_ptr` here.