Fix segfault cuvs bench #2088

Merged
rapids-bot[bot] merged 2 commits into
rapidsai:release/26.06 from
aamijar:fix-segfault-cuvs-bench
May 15, 2026

Conversation

@aamijar
Member

@aamijar aamijar commented May 14, 2026

Resolves #2087

What does this PR do?

There is a race condition on the `!handle_.has_value()` check of the std::optional member.
While thread 1 is still assigning the handle via mmap, thread 2 can pass the same check and assign it as well. The reassignment destroys the value thread 1 stored, and that destructor calls munmap, leading to a segfault.

To solve this we use std::call_once so that only one thread can run the initialization.
std::call_once is not reentrant, so we also need to refactor the recursion into a while loop.
std::once_flag is not movable, so we allocate it on the heap and keep track of it with a std::unique_ptr instead.
We need to ensure the member variables of blob are movable, since blob is used in a std::variant.
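
The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from blob.hpp: `fake_mapping`, `lazy_blob`, and `count_mappings` are hypothetical names, and a counter stands in for the real mmap call. It assumes only standard-library behaviour: `std::call_once` runs the callable once per flag, and a `std::unique_ptr<std::once_flag>` member keeps the enclosing class movable.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <mutex>
#include <optional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for an mmap-backed handle: it only counts constructions.
struct fake_mapping {
  explicit fake_mapping(std::atomic<int>& counter) { counter.fetch_add(1); }
};

// Hypothetical lazily initialized blob, loosely modelled on the PR description.
class lazy_blob {
 public:
  explicit lazy_blob(std::atomic<int>& counter) : counter_(&counter) {}

  // Defaulted moves compile because the once_flag lives behind a unique_ptr.
  lazy_blob(lazy_blob&&) noexcept            = default;
  lazy_blob& operator=(lazy_blob&&) noexcept = default;

  const fake_mapping& handle()
  {
    // Only one thread runs the lambda; the others block until it has finished.
    std::call_once(*once_, [this] { handle_.emplace(*counter_); });
    return *handle_;
  }

 private:
  std::atomic<int>* counter_;
  std::unique_ptr<std::once_flag> once_ = std::make_unique<std::once_flag>();
  std::optional<fake_mapping> handle_;
};

// Calls handle() from several threads and reports how many mappings were created.
int count_mappings(int n_threads)
{
  std::atomic<int> counter{0};
  lazy_blob original{counter};
  lazy_blob blob = std::move(original);  // the moved-from object is not used again

  std::vector<std::thread> threads;
  for (int i = 0; i < n_threads; ++i) {
    threads.emplace_back([&blob] { blob.handle(); });
  }
  for (auto& t : threads) { t.join(); }
  return counter.load();
}
```

With the unsynchronized `has_value()` check, several threads could construct a mapping; in this sketch `count_mappings` should report exactly one construction regardless of the thread count.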

@aamijar aamijar requested a review from a team as a code owner May 14, 2026 22:18
@aamijar aamijar self-assigned this May 14, 2026
@aamijar aamijar added non-breaking Introduces a non-breaking change bug Something isn't working labels May 14, 2026
@aamijar aamijar moved this to In Progress in Unstructured Data Processing May 14, 2026
@coderabbitai

coderabbitai Bot commented May 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7875b8d6-7ac9-429c-af0f-a75b4ee25fcf

📥 Commits

Reviewing files that changed from the base of the PR and between 39f5d9a and fa616f0.

📒 Files selected for processing (2)
  • cpp/bench/ann/src/common/blob.hpp
  • cpp/bench/ann/src/common/dataset.hpp
🚧 Files skipped from review as they are similar to previous changes (2)
  • cpp/bench/ann/src/common/dataset.hpp
  • cpp/bench/ann/src/common/blob.hpp

📝 Walkthrough

Summary by CodeRabbit

  • Refactor
    • Improved thread-safety for lazy-initialized memory blobs to prevent concurrent access and re-entry issues.
    • Enhanced prewarming of blob-backed data before worker threads run to ensure consistent, ready-to-read state.
    • Optimized allocation-failure fallback to retry and disable problematic options without recursive initialization or unsafe re-entry.

Walkthrough

Replaces unsynchronized lazy mmap initialization in blob_mmap::handle() with std::call_once using a heap-allocated once_flag and an internal retry loop for hugepage fallbacks. Also pre-initializes ground-truth blob-backed data on the main thread before spawning worker threads.

Changes

Thread-safety fix for blob_mmap

| Layer / File(s) | Summary |
| --- | --- |
| Thread-safe handle initialization with call_once (cpp/bench/ann/src/common/blob.hpp) | SPDX year extended to 2026. Added <memory> and <mutex>. Rewrote blob_mmap::handle() to use a heap-allocated std::once_flag and std::call_once for one-time, thread-safe mmap initialization; hugepage-related mmap failures are retried inside the call_once lambda rather than via recursive handle() re-entry. |
| Prewarm ground-truth blob-backed data (cpp/bench/ann/src/common/dataset.hpp) | ground_truth_map constructor now forces ground_truth_set.data() and, if present, filter_bitset->data(MemoryType::kHostMmap) to initialize on the main thread before launching worker threads. |
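
The hugepage retry mentioned in the summary can be illustrated with a small sketch. Everything here is an assumption made for illustration: `map_options`, `try_map`, and `retrying_mapping` are hypothetical, `try_map` simulates an mmap that fails whenever huge pages are requested, and the `std::once_flag` is kept as a plain member for brevity (the PR heap-allocates it to keep the class movable). The point is only that the fallback runs as a loop inside the `std::call_once` callable, so `handle()` never re-enters itself on the same flag.

```cpp
#include <cassert>
#include <mutex>
#include <optional>
#include <stdexcept>

// Hypothetical mapping options; only the huge-page switch matters here.
struct map_options {
  bool use_hugepages = true;
};

// Hypothetical stand-in for mmap: pretends that huge pages are unavailable.
inline std::optional<int> try_map(const map_options& opt, int& attempts)
{
  ++attempts;
  if (opt.use_hugepages) { return std::nullopt; }  // simulated failure
  return 42;                                       // simulated mapping
}

class retrying_mapping {
 public:
  int handle()
  {
    std::call_once(once_, [this] {
      // Retry in a loop instead of calling handle() recursively, because
      // re-entering call_once on the same flag from its own callable is unsafe.
      while (true) {
        if (auto mapped = try_map(opt_, attempts_)) {
          value_ = *mapped;
          return;
        }
        if (!opt_.use_hugepages) { throw std::runtime_error("mapping failed"); }
        opt_.use_hugepages = false;  // disable the problematic option, then retry
      }
    });
    return *value_;
  }

  int attempts() const { return attempts_; }
  bool uses_hugepages() const { return opt_.use_hugepages; }

 private:
  std::once_flag once_;
  map_options opt_;
  int attempts_ = 0;
  std::optional<int> value_;
};
```

If the callable throws, `std::call_once` propagates the exception and leaves the flag unset, so a later call may attempt the initialization again.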

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Fix segfault cuvs bench' directly summarizes the main change: fixing a segmentation fault in the cuvs benchmark code, which matches the core objective. |
| Description check | ✅ Passed | The description clearly explains the race condition fix using std::call_once and refactoring to a while loop, directly relating to the changeset in blob.hpp and dataset.hpp. |
| Linked Issues check | ✅ Passed | The changes successfully implement the required fix: synchronizing lazy mmap initialization with std::call_once [#2087], refactoring recursion to a while loop, heap-allocating std::once_flag via unique_ptr, and pre-initializing blob state in ground_truth_map constructor. |
| Out of Scope Changes check | ✅ Passed | All changes are directly in-scope: updating copyright year, adding required headers (<memory>, <mutex>), synchronizing mmap initialization, and pre-warming blob storage, all addressing the race condition documented in #2087. |


Member

@dantegd dantegd left a comment

Nice catch on this one!

One thing I noticed while reading: ground_truth_map's ctor in dataset.hpp is the only place that goes parallel directly over a blob (everywhere else in dataset serializes via mutex_). With this PR it's safe, but if you ever want to make it future proof against new lazy paths in blob.hpp, prewarming ground_truth_set.data() and filter_bitset->data(...) on the main thread before spawning workers would do it. Not blocking, just a thought.
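
The prewarming idea can be sketched as below. This is an illustrative example under stated assumptions, not the dataset.hpp code: `lazy_values` and `parallel_sum` are hypothetical, and the lazy initializer is deliberately left unsynchronized to show that touching `data()` once on the calling thread makes the later concurrent calls read-only.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <optional>
#include <thread>
#include <vector>

// Hypothetical lazily initialized container with NO internal synchronization.
struct lazy_values {
  const std::vector<int>& data()
  {
    if (!values_.has_value()) {
      ++init_count;
      values_ = std::vector<int>{1, 2, 3, 4};
    }
    return *values_;
  }

  int init_count = 0;
  std::optional<std::vector<int>> values_;
};

// Sums the values once per worker thread, after prewarming on the calling thread.
long parallel_sum(lazy_values& lv, int n_threads)
{
  lv.data();  // prewarm: the only write to values_ happens before any thread starts

  std::vector<long> partial(static_cast<std::size_t>(n_threads), 0L);
  std::vector<std::thread> workers;
  for (int i = 0; i < n_threads; ++i) {
    workers.emplace_back([&lv, &partial, i] {
      const auto& v = lv.data();  // read-only here: has_value() is already true
      partial[static_cast<std::size_t>(i)] = std::accumulate(v.begin(), v.end(), 0L);
    });
  }
  for (auto& w : workers) { w.join(); }
  return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

Because thread creation happens after the prewarm call, every worker observes the initialized state, and the initializer runs once no matter how many workers are spawned.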

Comment thread cpp/bench/ann/src/common/blob.hpp Outdated
Comment on lines +430 to +434
// Heap-allocate the once_flag so that blob_mmap remains movable (std::once_flag
// itself is neither copyable nor movable). Multiple threads can race on the
// first call to handle() via blob<T>::data(); without this serialization each
// racing thread would mmap the file and the losing emplace would munmap the
// winner's mapping out from under it -- a flaky SIGSEGV in ann benchmarks.
Member

Tiny suggestion — the comment explains the destructor-during-emplace half of the bug really well, but the other half is that the reference handle() already returned to the winning thread also gets invalidated by the second emplace. Maybe worth tweaking the last line to something like:

// racing thread would mmap the file and the losing emplace would both munmap
// the winner's mapping AND invalidate the reference the winner had already
// returned to its caller -- a flaky SIGSEGV in ann benchmarks.

Totally optional, the existing comment is already good.
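
The destructor-during-emplace half of the bug can be reproduced safely in a single thread. The sketch below is hypothetical (`tracked_mapping` and `unmaps_after_second_emplace` are made-up names, and a counter replaces the real munmap); it relies only on the documented behaviour of `std::optional::emplace`, which destroys any contained value before constructing the new one.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for a mapping whose destructor would call munmap.
struct tracked_mapping {
  explicit tracked_mapping(int* unmapped) : unmapped_(unmapped) {}
  tracked_mapping(const tracked_mapping&)            = delete;
  tracked_mapping& operator=(const tracked_mapping&) = delete;
  ~tracked_mapping() { ++*unmapped_; }  // counts "munmap" calls

  int* unmapped_;
};

// Returns how many mappings were destroyed right after a second emplace.
int unmaps_after_second_emplace()
{
  int unmapped = 0;
  std::optional<tracked_mapping> handle;

  handle.emplace(&unmapped);  // the "winning" thread's mapping
  handle.emplace(&unmapped);  // the "losing" emplace destroys the first mapping

  // Any reference obtained from *handle before the second emplace would now
  // refer to an object whose lifetime has ended.
  return unmapped;
}
```

In the racing scenario, the first object is the mapping another thread is already reading through, which is why the second emplace is fatal there.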

Member Author

Thanks for the review! I addressed the future proofing and comments update here 39f5d9a

Member

I can't see 39f5d9a, you meant fa616f0 right?

Member Author

Yes, it got switched because of the rebase

@aamijar aamijar changed the base branch from main to release/26.06 May 15, 2026 18:04
@aamijar aamijar requested review from a team as code owners May 15, 2026 18:04
@aamijar aamijar requested a review from bdice May 15, 2026 18:04
@aamijar aamijar force-pushed the fix-segfault-cuvs-bench branch from 39f5d9a to fa616f0 Compare May 15, 2026 18:18
@copy-pr-bot

copy-pr-bot Bot commented May 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@aamijar aamijar removed request for a team and bdice May 15, 2026 18:19
@aamijar
Member Author

aamijar commented May 15, 2026

/ok to test fa616f0

@aamijar
Member Author

aamijar commented May 15, 2026

/merge

@rapids-bot rapids-bot Bot merged commit 972de08 into rapidsai:release/26.06 May 15, 2026
164 of 166 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Unstructured Data Processing May 15, 2026
Labels

bug Something isn't working non-breaking Introduces a non-breaking change

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[CI] Segfault in ann cuvs_bench binaries

2 participants