Skip to content

Fix BBKNN Trimming#659

Merged
Intron7 merged 5 commits into
mainfrom
fix-bbknn
May 15, 2026
Merged

Fix BBKNN Trimming#659
Intron7 merged 5 commits into
mainfrom
fix-bbknn

Conversation

@Intron7
Copy link
Copy Markdown
Member

@Intron7 Intron7 commented May 15, 2026

This fixes #657.

@Intron7
Copy link
Copy Markdown
Member Author

Intron7 commented May 15, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 15, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 15, 2026

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Get PR details and diff
gh pr view 659 --repo scverse/rapids-singlecell --json title,body,files,additions,deletions,commits

echo "=== PR DIFF ==="
gh pr diff 659 --repo scverse/rapids-singlecell

echo "=== LINKED ISSUE ==="
gh issue view 657 --repo scverse/rapids-singlecell

Length of output: 20602


🏁 Script executed:

#!/bin/bash
echo "=== Full test_neighbors.py imports and helpers ==="
head -60 tests/test_neighbors.py

echo ""
echo "=== Search for 'trim' stored in adata.uns ==="
rg -n "params.*trim|trim.*params|uns.*trim|trim.*uns" --type=py

echo ""
echo "=== trimming_gpu and trimming_cpu aliases in tests ==="
rg -n "trimming_gpu\|trimming_cpu\|trimming_gpu\|import.*trimming" tests/test_neighbors.py

echo ""
echo "=== Where trim is stored in _neighbors/__init__.py ==="
rg -n "trim" src/rapids_singlecell/preprocessing/_neighbors/__init__.py

echo ""
echo "=== itertools usage in test_neighbors.py ==="
rg -n "itertools" tests/test_neighbors.py

echo ""
echo "=== _trimming signature call sites ==="
rg -n "_trimming\|trimming" src/rapids_singlecell/preprocessing/_neighbors/ --type=py

Length of output: 2828


🏁 Script executed:

#!/bin/bash
echo "=== params dict and adata.uns storage in _neighbors/__init__.py ==="
sed -n '410,460p' src/rapids_singlecell/preprocessing/_neighbors/__init__.py

echo ""
echo "=== check if trim is included in params dict ==="
sed -n '420,445p' src/rapids_singlecell/preprocessing/_neighbors/__init__.py

echo ""
echo "=== full bbknn function signature (docstring + trim param) ==="
sed -n '255,345p' src/rapids_singlecell/preprocessing/_neighbors/__init__.py

echo ""
echo "=== edge case: block_size=1 budget check in bbknn.cu ==="
grep -n "block_size\|SHARED_MEM" src/rapids_singlecell/_cuda/bbknn/bbknn.cu

echo ""
echo "=== _TRIM_SORT_THRESHOLD usage context ==="
grep -n -A5 -B5 "_TRIM_SORT_THRESHOLD\|sort_tile_size\|max_row_nnz" src/rapids_singlecell/preprocessing/_neighbors/_helper/__init__.py

Length of output: 9009


@Intron7 Here is my review of the changes.


Summary

This PR correctly addresses both root causes from #657:

  1. Missing per-row sortfuzzy_simplicial_set treats the first non-zero distance per row as rho; concatenating per-batch columns without sorting caused incorrect sigma estimation and weight collapse near 1.0.
  2. Wrong trim default — was 10 * neighbors_within_batch, now correctly 10 * total_neighbors (= 10 * neighbors_within_batch * n_batches), matching upstream bbknn.

The CUDA crash fix (adaptive block-size halving for large trim) and the new block-cooperative find_top_k_per_row_sorted_kernel are solid engineering additions. Test coverage is comprehensive.


Issues

1. launch_find_top_k_per_row: shared-memory budget not guaranteed at block_size == 1

File: src/rapids_singlecell/_cuda/bbknn/bbknn.cu, in launch_find_top_k_per_row

The while-loop halves block_size until the budget fits or block_size == 1, but then unconditionally launches with whatever shared_mem_size remains:

while (block_size > 1 &&
       static_cast<size_t>(block_size) * per_thread_bytes > SHARED_MEM_BUDGET) {
    block_size /= 2;
}
// shared_mem_size = block_size * per_thread_bytes — may still exceed budget if block_size==1

If block_size reaches 1 and trim > 12288 (i.e., trim * 4 > 48 KB), the kernel launch will request more shared memory than the budget and silently fail or crash. A simple guard would prevent this:

if (static_cast<size_t>(block_size) * per_thread_bytes > SHARED_MEM_BUDGET) {
    throw std::runtime_error(
        "find_top_k_per_row: trim too large for shared-memory budget; "
        "use find_top_k_per_row_sorted instead");
}

In practice this requires max_row_nnz > 2048 and trim > 12288 simultaneously (the Python dispatch avoids it when max_row_nnz <= sort_tile_size()), but a safety check costs nothing.


2. _TRIM_SORT_THRESHOLD = 100 is undocumented

File: src/rapids_singlecell/preprocessing/_neighbors/_helper/__init__.py, line 133

The constant has no comment explaining its origin. A brief annotation would help future readers (e.g., empirical benchmark cutoff, or chosen to keep the "thread" kernel's shared-memory usage well under the 48 KB limit):

# Empirically, the block-cooperative CUB kernel is faster for trim >= 100;
# below this threshold the per-thread top-k kernel has less overhead.
_TRIM_SORT_THRESHOLD = 100

Observations / Nitpicks (non-blocking)

  • Large nnz bypass (pre-existing): when connectivities.nnz >= np.iinfo(np.int32).max, trimming is silently skipped. Not introduced here, but worth a follow-up issue.
  • test_bbknn_connectivities_not_collapsed thresholds (mean < 0.7, fraction > 0.99 < 0.5): magic numbers with no comment. A one-line note explaining they are regression bounds against the pre-fix behaviour would clarify intent.
  • find_top_k_per_row_sorted_kernel load pattern: elements are loaded with a blocked stride (threadIdx.x * ITEMS_PER_THREAD + i), which is the default CUB blocked arrangement — correct and consistent. An explicit BLOCK_LOAD_WARP_TRANSPOSE variant would improve L1 cache utilisation, but is not required for correctness.

The sort fix and trim default are the critical correctness changes, and both look right. Addressing item 1 above (or at minimum the explicit raise) would make the CUDA path fully robust. Everything else is minor.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 15, 2026

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • Bug Fixes

    • Fixed bbknn connectivity divergence from upstream implementation. Per-batch neighbor distances are now sorted before filtering, aligning with upstream behavior.
    • Trimming kernel no longer crashes when using large trim values and now provides improved performance through enhanced kernel dispatch.
  • Changes

    • Updated default trim parameter calculation to base on total neighbors across all batches instead of single-batch neighbors.

Walkthrough

This PR resolves bbknn divergence from upstream scanpy.external.pp.bbknn by introducing per-neighbor distance sorting, correcting the trim default formula, adding a block-cooperative radix-sort GPU kernel for large trim values, and implementing automatic kernel selection logic.

Changes

BBKNN Alignment and Trimming Improvements

Layer / File(s) Summary
Sorted radix-sort kernel implementation
src/rapids_singlecell/_cuda/bbknn/kernels_bbknn.cuh
New templated CUDA kernel find_top_k_per_row_sorted_kernel uses CUB's BlockRadixSort::SortDescending to compute trim thresholds per row; processes one block per row, pads short rows, and extracts the trim-th largest value from sorted descending order.
Kernel launcher and Python bindings
src/rapids_singlecell/_cuda/bbknn/bbknn.cu
Adds dynamic block-size reduction loop to launch_find_top_k_per_row for shared-memory constraints; introduces launch_find_top_k_per_row_sorted helper; exports new find_top_k_per_row_sorted function and sort_tile_size() query through Python bindings.
BBKNN neighbor sorting and trim default
src/rapids_singlecell/preprocessing/_neighbors/__init__.py
After per-batch KNN collection, sorts each observation's neighbors by ascending distance and reorders indices accordingly; updates default trim from 10 * neighbors_within_batch to 10 * total_neighbors to align with upstream formula.
Trimming kernel selection and dispatch
src/rapids_singlecell/preprocessing/_neighbors/_helper/__init__.py
Extends _trimming to accept optional kernel parameter; implements auto-selection comparing maximum row nonzero count against sort_tile_size() to choose between sorted and thread kernels; dispatches to appropriate function.
Regression and kernel-equivalence tests
tests/test_neighbors.py
Adds tests for per-row distance sorting, preventing connectivity weight collapse, and matching upstream trim default; replaces single trimming test with parameterized variants for multiple trim values and both kernel implementations.
Release notes documentation
docs/release-notes/0.15.1.md
Documents bbknn bug fixes: neighbor distance sorting, trim default alignment, large-trim crash resolution, and block-cooperative sort kernel auto-dispatch.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 23.08% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Fix BBKNN Trimming' directly addresses the main changeset, which implements fixes to BBKNN to resolve connectivity divergence from upstream bbknn, including sorting neighbors by distance and adjusting trimming behavior.
Description check ✅ Passed The description 'This fixes #657' is related to the changeset as it references the linked issue describing the BBKNN connectivity mismatch bug being addressed.
Linked Issues check ✅ Passed The code changes comprehensively address issue #657 requirements: per-batch KNN results are globally reordered by sorting neighbor distances [#657], a new sorted kernel is auto-dispatched for large trim values to improve speed, trimming behavior is aligned with upstream defaults (trim now defaults to total_neighbors instead of neighbors_within_batch) [#657], and connectivity construction via fuzzy_simplicial_set now uses distance-normalized (sorted) neighbors matching scanpy's approach [#657].
Out of Scope Changes check ✅ Passed All changes are directly related to fixing the BBKNN connectivity mismatch issue: CUDA kernel improvements for trimming, distance sorting logic, default trim calculation changes, and comprehensive test coverage for the fixes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-bbknn

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
src/rapids_singlecell/preprocessing/_neighbors/_helper/__init__.py (1)

133-133: 💤 Low value

Consider adding a brief comment documenting the threshold rationale.

The constant _TRIM_SORT_THRESHOLD = 100 lacks documentation. A brief comment explaining why 100 was chosen as the cutoff for kernel auto-selection would improve maintainability.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/rapids_singlecell/preprocessing/_neighbors/_helper/__init__.py` at line
133, Add a brief inline comment above the constant _TRIM_SORT_THRESHOLD = 100
that documents the rationale for choosing 100 as the cutoff used in kernel
auto-selection (e.g., tradeoff between sort/trim overhead and expected
neighborhood sizes, empirical benchmark, or reference to the algorithmic
behavior), mention units (number of neighbors/samples) and when it should be
adjusted; keep the comment short (1–2 sentences) and reference
_TRIM_SORT_THRESHOLD and the kernel auto-selection logic so future maintainers
can understand and tune it.
tests/test_neighbors.py (2)

158-166: ⚡ Quick win

Consider comparing against scanpy reference instead of hard-coded thresholds.

The hard-coded thresholds (mean < 0.7, > 0.99 fraction < 0.5) are specific to pbmc68k_reduced and may be fragile if the dataset changes or when testing other datasets. Since the PR objective is to match scanpy.external.pp.bbknn behavior, comparing the connectivity weight distribution (e.g., mean, percentiles) against scanpy's output would provide a more robust regression guard.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_neighbors.py` around lines 158 - 166, The test
test_bbknn_connectivities_not_collapsed uses fragile hard-coded thresholds;
instead run scanpy.external.pp.bbknn on a copy of the same AnnData (use
pbmc68k_reduced() to get a fresh copy), extract its connectivities via
adata_ref.obsp["connectivities"].data and compare distribution metrics (e.g.,
mean and a percentile like 99th) between the bbknn output and the scanpy
reference (from bbknn(adata, ...)) using numeric assertions (e.g.,
numpy.testing.assert_allclose or small relative tolerances) so the test asserts
equality/near-equality to scanpy behavior rather than fixed thresholds. Ensure
you reference the existing functions/objects:
test_bbknn_connectivities_not_collapsed, pbmc68k_reduced, bbknn, and
adata.obsp["connectivities"].data.

147-222: 🏗️ Heavy lift

Add edge case coverage per coding guidelines.

All new tests use only pbmc68k_reduced(). The coding guidelines require tests to "cover edge cases (single row, empty input, max-size input) beyond just 'runs without error'". Consider adding:

  • Single-row input to test per-row logic boundaries (sorting, trimming)
  • Empty or minimal-batch scenarios for bbknn
  • Large trim values relative to row nonzero counts to stress kernel selection

As per coding guidelines, Test validation: include numerical correctness checks against CPU reference implementations, and cover edge cases (single row, empty input, max-size input) beyond just 'runs without error'.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_neighbors.py` around lines 147 - 222, Add edge-case test cases
alongside the existing ones: create new tests that call bbknn and
trimming_gpu/trimming_cpu using a single-row AnnData (use pbmc68k_reduced() then
slice to one observation to exercise per-row sorting/trimming boundaries), an
empty/minimal-batch AnnData (zero rows or a single batch with zero neighbors) to
verify bbknn handles empty inputs, and a test that sets trim much larger than
available nonzeros to force the kernel paths; in each new test compare numerical
results to the CPU reference (trimming_cpu, adata.obsp connectivities from
bbknn, and X_to_GPU conversions) using the same assert_array_equal-style checks
so GPU kernels and bbknn outputs match the CPU reference.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/rapids_singlecell/preprocessing/_neighbors/_helper/__init__.py`:
- Line 133: Add a brief inline comment above the constant _TRIM_SORT_THRESHOLD =
100 that documents the rationale for choosing 100 as the cutoff used in kernel
auto-selection (e.g., tradeoff between sort/trim overhead and expected
neighborhood sizes, empirical benchmark, or reference to the algorithmic
behavior), mention units (number of neighbors/samples) and when it should be
adjusted; keep the comment short (1–2 sentences) and reference
_TRIM_SORT_THRESHOLD and the kernel auto-selection logic so future maintainers
can understand and tune it.

In `@tests/test_neighbors.py`:
- Around line 158-166: The test test_bbknn_connectivities_not_collapsed uses
fragile hard-coded thresholds; instead run scanpy.external.pp.bbknn on a copy of
the same AnnData (use pbmc68k_reduced() to get a fresh copy), extract its
connectivities via adata_ref.obsp["connectivities"].data and compare
distribution metrics (e.g., mean and a percentile like 99th) between the bbknn
output and the scanpy reference (from bbknn(adata, ...)) using numeric
assertions (e.g., numpy.testing.assert_allclose or small relative tolerances) so
the test asserts equality/near-equality to scanpy behavior rather than fixed
thresholds. Ensure you reference the existing functions/objects:
test_bbknn_connectivities_not_collapsed, pbmc68k_reduced, bbknn, and
adata.obsp["connectivities"].data.
- Around line 147-222: Add edge-case test cases alongside the existing ones:
create new tests that call bbknn and trimming_gpu/trimming_cpu using a
single-row AnnData (use pbmc68k_reduced() then slice to one observation to
exercise per-row sorting/trimming boundaries), an empty/minimal-batch AnnData
(zero rows or a single batch with zero neighbors) to verify bbknn handles empty
inputs, and a test that sets trim much larger than available nonzeros to force
the kernel paths; in each new test compare numerical results to the CPU
reference (trimming_cpu, adata.obsp connectivities from bbknn, and X_to_GPU
conversions) using the same assert_array_equal-style checks so GPU kernels and
bbknn outputs match the CPU reference.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 15907752-43b0-4ab9-91a1-befb828eb7e1

📥 Commits

Reviewing files that changed from the base of the PR and between 61eda66 and db2eba5.

📒 Files selected for processing (6)
  • docs/release-notes/0.15.1.md
  • src/rapids_singlecell/_cuda/bbknn/bbknn.cu
  • src/rapids_singlecell/_cuda/bbknn/kernels_bbknn.cuh
  • src/rapids_singlecell/preprocessing/_neighbors/__init__.py
  • src/rapids_singlecell/preprocessing/_neighbors/_helper/__init__.py
  • tests/test_neighbors.py

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 15, 2026

Codecov Report

❌ Patch coverage is 94.44444% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 88.14%. Comparing base (61eda66) to head (3a84899).

Files with missing lines Patch % Lines
...ecell/preprocessing/_neighbors/_helper/__init__.py 92.30% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #659      +/-   ##
==========================================
+ Coverage   88.13%   88.14%   +0.01%     
==========================================
  Files          96       96              
  Lines        7045     7060      +15     
==========================================
+ Hits         6209     6223      +14     
- Misses        836      837       +1     
Files with missing lines Coverage Δ
...ds_singlecell/preprocessing/_neighbors/__init__.py 91.11% <100.00%> (+0.41%) ⬆️
...ecell/preprocessing/_neighbors/_helper/__init__.py 79.16% <92.30%> (+2.11%) ⬆️

@Intron7 Intron7 merged commit 2926263 into main May 15, 2026
11 of 15 checks passed
@Intron7 Intron7 deleted the fix-bbknn branch May 15, 2026 10:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] bbknn mismatch to scanpy.external.pp.bbknn

2 participants