
fix(clp-s): Use smart pointers in the SchemaMatch pass to prevent use-after-free (fixes #1986). #1990

Merged
gibber9809 merged 5 commits into y-scope:main from gibber9809:fix-1986
Feb 19, 2026
Conversation

@gibber9809

@gibber9809 gibber9809 commented Feb 13, 2026

Description

This PR fixes a use-after-free bug in the SchemaMatch pass, where a cached mapping of column ID -> set<ColumnDescriptor*> could end up holding dangling pointers after parts of the AST were constant-propagated away. The bug and the AddressSanitizer output that helped us catch it are detailed in #1986.

The fix is simply to change the mapping to use a set<shared_ptr<ColumnDescriptor>>, since the AST itself already uses smart pointers.
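To illustrate why shared ownership removes the dangling reference, here is a minimal sketch. It is not the code from this PR: `ColumnDescriptor` is a hypothetical stand-in for `ast::ColumnDescriptor`, and `DescriptorCache` and `cache_survives_ast_rewrite` are illustrative names. It assumes the cache looks roughly like a map from column ID to a set of descriptors.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in for ast::ColumnDescriptor; the real class is richer.
struct ColumnDescriptor {
    explicit ColumnDescriptor(std::string n) : name{std::move(n)} {}
    std::string name;
};

// Hypothetical cache shaped like the mapping described above.
using DescriptorCache = std::map<int32_t, std::set<std::shared_ptr<ColumnDescriptor>>>;

// Returns true if the cached descriptor is still readable after the "AST"
// releases its own reference to it.
bool cache_survives_ast_rewrite() {
    DescriptorCache cache;
    auto ast_node = std::make_shared<ColumnDescriptor>("a.b");
    std::weak_ptr<ColumnDescriptor> observer{ast_node};

    // The cache takes shared ownership of the descriptor.
    cache[7].emplace(ast_node);

    // Simulate constant propagation dropping the node from the AST.
    ast_node.reset();

    // With a set<ColumnDescriptor*>, the object would have been destroyed at
    // this point and the cached pointer would dangle.
    if (observer.expired()) {
        return false;
    }
    auto const& descriptor = *cache.at(7).begin();
    return "a.b" == descriptor->name;
}
```

The trade-off is that a descriptor removed from the AST stays alive for as long as the cache holds it, which is what makes later reads of the cache safe.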

Besides this change, we also perform a small amount of cleanup in the surrounding code.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Re-ran unit tests with address sanitizer enabled and observed that the use-after-free no longer occurred.
  • Spot-checked that this change has no measurable impact on search speed using the open-source MongoDB dataset and queries.

Summary by CodeRabbit

  • Refactor
    • Improved internal pointer and ownership handling for schema matching to increase stability and memory safety.
    • Strengthened handling of unresolved or ambiguous columns, producing more reliable resolution across complex schemas.
    • Made const-correctness and lookup paths more consistent, enhancing overall robustness of schema-to-column mapping.

… can be invalidated during constant propagation.
@gibber9809 gibber9809 requested a review from a team as a code owner February 13, 2026 19:41
@gibber9809 gibber9809 linked an issue Feb 13, 2026 that may be closed by this pull request
@coderabbitai

coderabbitai bot commented Feb 13, 2026

Walkthrough

Replaces raw ColumnDescriptor* usage with std::shared_ptr<ast::ColumnDescriptor> const& across SchemaMatch, updates internal maps/sets to store shared_ptrs, switches insertions to try_emplace/emplace, and builds OrExpr for unresolved descriptors where applicable.

Changes

Cohort / File(s) | Summary
SchemaMatch core refactor (components/core/src/clp_s/search/SchemaMatch.hpp, components/core/src/clp_s/search/SchemaMatch.cpp) | APIs changed to accept std::shared_ptr<ast::ColumnDescriptor> const&; internal containers switched from raw-pointer keys/values to shared_ptr-aware storage (use of descriptor.get() for raw-keying where needed). Map/set insertions migrated to try_emplace/emplace; lookups use .at() in iterations; unresolved column handling now constructs OrExpr of candidate descriptors. Minor const-correctness and iteration adjustments throughout.
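The walkthrough mentions that insertions migrated to try_emplace/emplace. As a rough sketch of that pattern, assuming a hypothetical schema-ID-to-column-IDs map (`SchemaToColumns` and `add_column` are illustrative and not taken from SchemaMatch.cpp):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>

// Hypothetical map type; the real containers in SchemaMatch may differ.
using SchemaToColumns = std::map<int32_t, std::set<int32_t>>;

// Adds column_id under schema_id. Returns true if the column was newly added.
bool add_column(SchemaToColumns& mapping, int32_t schema_id, int32_t column_id) {
    // try_emplace constructs an empty set only if the key is absent and never
    // overwrites an existing entry.
    auto const result = mapping.try_emplace(schema_id);
    // set::emplace reports whether the element was actually inserted.
    return result.first->second.emplace(column_id).second;
}
```

Compared with `operator[]` followed by `insert`, this keeps the "insert only if absent" intent explicit and exposes whether anything changed.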

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

  • Issue #1986: Directly related — migrates SchemaMatch caches and APIs from raw ColumnDescriptor* to std::shared_ptr<ast::ColumnDescriptor> to address dangling-pointer/use-after-free concerns.
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title clearly and specifically describes the main change: replacing raw pointers with smart pointers in SchemaMatch to fix a use-after-free bug, with a reference to the related issue.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/core/src/clp_s/search/SchemaMatch.hpp (1)

72-78: 🧹 Nitpick | 🔵 Trivial

Remaining raw-pointer maps could become stale for the same reason.

m_column_to_descriptor now correctly holds shared_ptr to keep descriptors alive across AST mutations. However, m_descriptor_to_schema (line 77) and m_unresolved_descriptor_to_descriptor (line 78) are still keyed by raw ColumnDescriptor*. These are safe today because:

  1. m_unresolved_descriptor_to_descriptor is cleared in run() before the second pass.
  2. m_descriptor_to_schema is populated in populate_schema_mapping() which runs after constant propagation, so the AST is stable.

But this is fragile — if the pass ordering changes, these maps would suffer the same use-after-free. Consider documenting this invariant or, for consistency, switching these to shared_ptr-based keys as well.

🤖 Fix all issues with AI agents
In `@components/core/src/clp_s/search/SchemaMatch.cpp`:
- Around line 350-354: The loop is copying a std::shared_ptr on each iteration
causing unnecessary atomic ref-count ops; change the loop to take the element by
reference (use auto const& descriptor) so you don't copy the shared_ptr, then
continue to call descriptor->is_pure_wildcard() and descriptor.get() when
inserting into m_descriptor_to_schema via try_emplace (symbols:
m_column_to_descriptor, descriptor, is_pure_wildcard(), m_descriptor_to_schema,
try_emplace, descriptor.get()).
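The point about copying a shared_ptr per iteration can be observed through `use_count()`. The sketch below is illustrative only: `ColumnDescriptor` is a hypothetical stand-in, and the two helper functions exist just to contrast `auto` with `auto const&` in a range-based for loop.

```cpp
#include <cassert>
#include <memory>
#include <set>

// Hypothetical stand-in for ast::ColumnDescriptor.
struct ColumnDescriptor {
    bool pure_wildcard{false};
    bool is_pure_wildcard() const { return pure_wildcard; }
};

using DescriptorSet = std::set<std::shared_ptr<ColumnDescriptor>>;

// Iterates by value: each iteration copies the shared_ptr, which bumps the
// reference count (typically an atomic increment and decrement).
long max_use_count_by_value(DescriptorSet const& descriptors) {
    long max_count{0};
    for (auto descriptor : descriptors) {
        if (descriptor.use_count() > max_count) {
            max_count = descriptor.use_count();
        }
    }
    return max_count;
}

// Iterates by const reference: no copy, so the reference count is untouched.
long max_use_count_by_reference(DescriptorSet const& descriptors) {
    long max_count{0};
    for (auto const& descriptor : descriptors) {
        if (descriptor.use_count() > max_count) {
            max_count = descriptor.use_count();
        }
    }
    return max_count;
}
```

With a single descriptor owned only by the set, the by-value loop observes a count of 2 inside the loop body, while the by-reference loop observes 1.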


@LinZhihao-723 LinZhihao-723 left a comment


Two nits caught; otherwise lgtm.

gibber9809 and others added 2 commits February 19, 2026 11:13
Co-authored-by: Lin Zhihao <59785146+LinZhihao-723@users.noreply.github.com>

@LinZhihao-723 LinZhihao-723 left a comment


The PR title lgtm.


@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/core/src/clp_s/search/SchemaMatch.cpp (1)

361-389: 🧹 Nitpick | 🔵 Trivial

m_descriptor_to_schema still keyed by raw ColumnDescriptor* — consider whether this should also migrate.

intersect_schemas and intersect_and_sub_expr index m_descriptor_to_schema with raw ColumnDescriptor* (lines 382, 447, 453, 460, 469). These pointers are obtained from column.get() on live AST nodes, so they're valid during the intersection pass (which runs after constant propagation is complete). However, this leaves a mixed pointer-ownership model: m_column_to_descriptor uses shared_ptr values while m_descriptor_to_schema uses raw-pointer keys.

This is safe today because m_descriptor_to_schema is only populated and consumed within a single run() invocation after constant propagation. If future changes introduce another constant-propagation pass between populate_schema_mapping and intersect_schemas, the same use-after-free category could resurface. No action required for this PR, but worth documenting the invariant.
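One way the suggested invariant could be checked is sketched below. This is an assumption-laden illustration, not code from the PR: the container types only approximate m_column_to_descriptor and m_descriptor_to_schema (their real value types are not shown in this thread), and `raw_keys_are_owned` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <set>

// Hypothetical stand-in for ast::ColumnDescriptor.
struct ColumnDescriptor {};

// Assumed shapes for the two member maps discussed above.
using ColumnToDescriptor = std::map<int32_t, std::set<std::shared_ptr<ColumnDescriptor>>>;
using DescriptorToSchema = std::map<ColumnDescriptor*, std::set<int32_t>>;

// Returns true if every raw-pointer key in raw_keyed refers to a descriptor
// that is kept alive by some shared_ptr in owners.
bool raw_keys_are_owned(ColumnToDescriptor const& owners, DescriptorToSchema const& raw_keyed) {
    std::set<ColumnDescriptor const*> live;
    for (auto const& entry : owners) {
        for (auto const& descriptor : entry.second) {
            live.emplace(descriptor.get());
        }
    }
    for (auto const& entry : raw_keyed) {
        if (0 == live.count(entry.first)) {
            return false;
        }
    }
    return true;
}
```

A check like this could sit behind an `assert` in a debug build. It only compares addresses and never dereferences the raw keys, so it is safe to run even if a key is stale. It can miss a stale key whose address has been reused by a newer live descriptor, so it complements rather than replaces a documented pass-ordering invariant.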

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/core/src/clp_s/search/SchemaMatch.cpp` around lines 361 - 389,
intersect_schemas and intersect_and_sub_expr currently index
m_descriptor_to_schema with raw ColumnDescriptor* while m_column_to_descriptor
uses shared_ptr, which mixes ownership and risks use-after-free; either migrate
m_descriptor_to_schema to use shared_ptr<ColumnDescriptor> keys (update all
access sites in intersect_schemas, intersect_and_sub_expr,
populate_schema_mapping, and any places that call m_descriptor_to_schema to use
the shared_ptr from m_column_to_descriptor) or, if you choose not to change
types now, add a clear invariant comment and a runtime check in
run()/populate_schema_mapping guaranteeing that m_descriptor_to_schema is
populated and consumed within a single pass (and assert that pointers in
m_descriptor_to_schema match those held by m_column_to_descriptor) so future
refactors don't introduce lifetime bugs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.


Duplicate comments:
In `@components/core/src/clp_s/search/SchemaMatch.cpp`:
- Around line 350-355: Replace the redundant count+at lookup around
m_column_to_descriptor with a single find: use
m_column_to_descriptor.find(column_id) and check the iterator against end(),
then iterate over the found iterator->second (using the existing auto const&
descriptor) and update m_descriptor_to_schema (as currently done with
try_emplace and descriptor.get()) — this removes the extra map lookup while
keeping the descriptor->is_pure_wildcard() check and current insertion logic for
schema_id/column_id.
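The single-lookup pattern described in this comment might look like the following sketch. The types are hypothetical stand-ins, and because the thread does not show which branch of the is_pure_wildcard() check the real code takes, skipping pure wildcards here is an assumption made purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <set>

// Hypothetical stand-in for ast::ColumnDescriptor.
struct ColumnDescriptor {
    bool pure_wildcard{false};
    bool is_pure_wildcard() const { return pure_wildcard; }
};

using ColumnToDescriptor = std::map<int32_t, std::set<std::shared_ptr<ColumnDescriptor>>>;

// Counts the descriptors for column_id that are not pure wildcards, using one
// find() instead of a count() followed by at().
std::size_t count_concrete_descriptors(ColumnToDescriptor const& mapping, int32_t column_id) {
    auto const it = mapping.find(column_id);
    if (mapping.end() == it) {
        return 0;
    }
    std::size_t num_concrete{0};
    for (auto const& descriptor : it->second) {
        if (false == descriptor->is_pure_wildcard()) {
            ++num_concrete;
        }
    }
    return num_concrete;
}
```

The iterator returned by `find` serves both as the presence check and as the access path, so the map is searched once per column ID.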

@gibber9809 gibber9809 merged commit 309f125 into y-scope:main Feb 19, 2026
27 checks passed
@junhaoliao junhaoliao added this to the February 2026 milestone Feb 26, 2026

Development

Successfully merging this pull request may close these issues.

clp-s: Use-after-free in SchemaMatch.

3 participants