Fix BBKNN Trimming by Intron7 · Pull Request #659 · scverse/rapids-singlecell

Intron7 · 2026-05-15T09:04:04Z

This fixes #657.

Intron7 · 2026-05-15T09:20:15Z

coderabbitai · 2026-05-15T09:20:20Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai · 2026-05-15T09:23:31Z

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Get PR details and diff
gh pr view 659 --repo scverse/rapids-singlecell --json title,body,files,additions,deletions,commits

echo "=== PR DIFF ==="
gh pr diff 659 --repo scverse/rapids-singlecell

echo "=== LINKED ISSUE ==="
gh issue view 657 --repo scverse/rapids-singlecell

Length of output: 20602

🏁 Script executed:

#!/bin/bash
echo "=== Full test_neighbors.py imports and helpers ==="
head -60 tests/test_neighbors.py

echo ""
echo "=== Search for 'trim' stored in adata.uns ==="
rg -n "params.*trim|trim.*params|uns.*trim|trim.*uns" --type=py

echo ""
echo "=== trimming_gpu and trimming_cpu aliases in tests ==="
rg -n "trimming_gpu\|trimming_cpu\|trimming_gpu\|import.*trimming" tests/test_neighbors.py

echo ""
echo "=== Where trim is stored in _neighbors/__init__.py ==="
rg -n "trim" src/rapids_singlecell/preprocessing/_neighbors/__init__.py

echo ""
echo "=== itertools usage in test_neighbors.py ==="
rg -n "itertools" tests/test_neighbors.py

echo ""
echo "=== _trimming signature call sites ==="
rg -n "_trimming\|trimming" src/rapids_singlecell/preprocessing/_neighbors/ --type=py

Length of output: 2828

🏁 Script executed:

#!/bin/bash
echo "=== params dict and adata.uns storage in _neighbors/__init__.py ==="
sed -n '410,460p' src/rapids_singlecell/preprocessing/_neighbors/__init__.py

echo ""
echo "=== check if trim is included in params dict ==="
sed -n '420,445p' src/rapids_singlecell/preprocessing/_neighbors/__init__.py

echo ""
echo "=== full bbknn function signature (docstring + trim param) ==="
sed -n '255,345p' src/rapids_singlecell/preprocessing/_neighbors/__init__.py

echo ""
echo "=== edge case: block_size=1 budget check in bbknn.cu ==="
grep -n "block_size\|SHARED_MEM" src/rapids_singlecell/_cuda/bbknn/bbknn.cu

echo ""
echo "=== _TRIM_SORT_THRESHOLD usage context ==="
grep -n -A5 -B5 "_TRIM_SORT_THRESHOLD\|sort_tile_size\|max_row_nnz" src/rapids_singlecell/preprocessing/_neighbors/_helper/__init__.py

Length of output: 9009

@Intron7 Here is my review of the changes.

Summary

This PR correctly addresses both root causes from #657:

Missing per-row sort — fuzzy_simplicial_set treats the first non-zero distance per row as rho; concatenating per-batch columns without sorting caused incorrect sigma estimation and weight collapse near 1.0.
Wrong trim default — was 10 * neighbors_within_batch, now correctly 10 * total_neighbors (= 10 * neighbors_within_batch * n_batches), matching upstream bbknn.

The CUDA crash fix (adaptive block-size halving for large trim) and the new block-cooperative find_top_k_per_row_sorted_kernel are solid engineering additions. Test coverage is comprehensive.

Issues

1. `launch_find_top_k_per_row`: shared-memory budget not guaranteed at `block_size == 1`

File: src/rapids_singlecell/_cuda/bbknn/bbknn.cu, in launch_find_top_k_per_row

The while-loop halves block_size until the budget fits or block_size == 1, but then unconditionally launches with whatever shared_mem_size remains:

while (block_size > 1 &&
       static_cast<size_t>(block_size) * per_thread_bytes > SHARED_MEM_BUDGET) {
    block_size /= 2;
}
// shared_mem_size = block_size * per_thread_bytes — may still exceed budget if block_size==1

If block_size reaches 1 and trim > 12288 (i.e., trim * 4 > 48 KB), the kernel launch will request more shared memory than the budget and silently fail or crash. A simple guard would prevent this:

if (static_cast<size_t>(block_size) * per_thread_bytes > SHARED_MEM_BUDGET) {
    throw std::runtime_error(
        "find_top_k_per_row: trim too large for shared-memory budget; "
        "use find_top_k_per_row_sorted instead");
}

In practice this requires max_row_nnz > 2048 and trim > 12288 simultaneously (the Python dispatch avoids it when max_row_nnz <= sort_tile_size()), but a safety check costs nothing.

2. `_TRIM_SORT_THRESHOLD = 100` is undocumented

File: src/rapids_singlecell/preprocessing/_neighbors/_helper/__init__.py, line 133

The constant has no comment explaining its origin. A brief annotation would help future readers (e.g., empirical benchmark cutoff, or chosen to keep the "thread" kernel's shared-memory usage well under the 48 KB limit):

# Empirically, the block-cooperative CUB kernel is faster for trim >= 100;
# below this threshold the per-thread top-k kernel has less overhead.
_TRIM_SORT_THRESHOLD = 100

Observations / Nitpicks (non-blocking)

Large nnz bypass (pre-existing): when connectivities.nnz >= np.iinfo(np.int32).max, trimming is silently skipped. Not introduced here, but worth a follow-up issue.
test_bbknn_connectivities_not_collapsed thresholds (mean < 0.7, fraction > 0.99 < 0.5): magic numbers with no comment. A one-line note explaining they are regression bounds against the pre-fix behaviour would clarify intent.
find_top_k_per_row_sorted_kernel load pattern: elements are loaded with a blocked stride (threadIdx.x * ITEMS_PER_THREAD + i), which is the default CUB blocked arrangement — correct and consistent. An explicit BLOCK_LOAD_WARP_TRANSPOSE variant would improve L1 cache utilisation, but is not required for correctness.

The sort fix and trim default are the critical correctness changes, and both look right. Addressing item 1 above (or at minimum the explicit raise) would make the CUDA path fully robust. Everything else is minor.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai · 2026-05-15T09:25:37Z

📝 Walkthrough

Summary by CodeRabbit

Bug Fixes
- Fixed bbknn connectivity divergence from upstream implementation. Per-batch neighbor distances are now sorted before filtering, aligning with upstream behavior.
- Trimming kernel no longer crashes when using large trim values and now provides improved performance through enhanced kernel dispatch.
Changes
- Updated default trim parameter calculation to base on total neighbors across all batches instead of single-batch neighbors.

Walkthrough

This PR resolves bbknn divergence from upstream scanpy.external.pp.bbknn by introducing per-neighbor distance sorting, correcting the trim default formula, adding a block-cooperative radix-sort GPU kernel for large trim values, and implementing automatic kernel selection logic.

Changes

BBKNN Alignment and Trimming Improvements

Layer / File(s)	Summary
Sorted radix-sort kernel implementation `src/rapids_singlecell/_cuda/bbknn/kernels_bbknn.cuh`	New templated CUDA kernel `find_top_k_per_row_sorted_kernel` uses CUB's `BlockRadixSort::SortDescending` to compute trim thresholds per row; processes one block per row, pads short rows, and extracts the trim-th largest value from sorted descending order.
Kernel launcher and Python bindings `src/rapids_singlecell/_cuda/bbknn/bbknn.cu`	Adds dynamic block-size reduction loop to `launch_find_top_k_per_row` for shared-memory constraints; introduces `launch_find_top_k_per_row_sorted` helper; exports new `find_top_k_per_row_sorted` function and `sort_tile_size()` query through Python bindings.
BBKNN neighbor sorting and trim default `src/rapids_singlecell/preprocessing/_neighbors/__init__.py`	After per-batch KNN collection, sorts each observation's neighbors by ascending distance and reorders indices accordingly; updates default trim from `10 * neighbors_within_batch` to `10 * total_neighbors` to align with upstream formula.
Trimming kernel selection and dispatch `src/rapids_singlecell/preprocessing/_neighbors/_helper/__init__.py`	Extends `_trimming` to accept optional `kernel` parameter; implements auto-selection comparing maximum row nonzero count against `sort_tile_size()` to choose between sorted and thread kernels; dispatches to appropriate function.
Regression and kernel-equivalence tests `tests/test_neighbors.py`	Adds tests for per-row distance sorting, preventing connectivity weight collapse, and matching upstream trim default; replaces single trimming test with parameterized variants for multiple trim values and both kernel implementations.
Release notes documentation `docs/release-notes/0.15.1.md`	Documents bbknn bug fixes: neighbor distance sorting, trim default alignment, large-trim crash resolution, and block-cooperative sort kernel auto-dispatch.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 23.08% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Fix BBKNN Trimming' directly addresses the main changeset, which implements fixes to BBKNN to resolve connectivity divergence from upstream bbknn, including sorting neighbors by distance and adjusting trimming behavior.
Description check	✅ Passed	The description 'This fixes `#657`' is related to the changeset as it references the linked issue describing the BBKNN connectivity mismatch bug being addressed.
Linked Issues check	✅ Passed	The code changes comprehensively address issue `#657` requirements: per-batch KNN results are globally reordered by sorting neighbor distances [`#657`], a new sorted kernel is auto-dispatched for large trim values to improve speed, trimming behavior is aligned with upstream defaults (trim now defaults to total_neighbors instead of neighbors_within_batch) [`#657`], and connectivity construction via fuzzy_simplicial_set now uses distance-normalized (sorted) neighbors matching scanpy's approach [`#657`].
Out of Scope Changes check	✅ Passed	All changes are directly related to fixing the BBKNN connectivity mismatch issue: CUDA kernel improvements for trimming, distance sorting logic, default trim calculation changes, and comprehensive test coverage for the fixes.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix-bbknn

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (3)

src/rapids_singlecell/preprocessing/_neighbors/_helper/__init__.py (1)
133-133: 💤 Low value

Consider adding a brief comment documenting the threshold rationale.

The constant _TRIM_SORT_THRESHOLD = 100 lacks documentation. A brief comment explaining why 100 was chosen as the cutoff for kernel auto-selection would improve maintainability.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/rapids_singlecell/preprocessing/_neighbors/_helper/__init__.py` at line
133, Add a brief inline comment above the constant _TRIM_SORT_THRESHOLD = 100
that documents the rationale for choosing 100 as the cutoff used in kernel
auto-selection (e.g., tradeoff between sort/trim overhead and expected
neighborhood sizes, empirical benchmark, or reference to the algorithmic
behavior), mention units (number of neighbors/samples) and when it should be
adjusted; keep the comment short (1–2 sentences) and reference
_TRIM_SORT_THRESHOLD and the kernel auto-selection logic so future maintainers
can understand and tune it.
tests/test_neighbors.py (2)
158-166: ⚡ Quick win

Consider comparing against scanpy reference instead of hard-coded thresholds.

The hard-coded thresholds (mean < 0.7, > 0.99 fraction < 0.5) are specific to pbmc68k_reduced and may be fragile if the dataset changes or when testing other datasets. Since the PR objective is to match scanpy.external.pp.bbknn behavior, comparing the connectivity weight distribution (e.g., mean, percentiles) against scanpy's output would provide a more robust regression guard.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_neighbors.py` around lines 158 - 166, The test
test_bbknn_connectivities_not_collapsed uses fragile hard-coded thresholds;
instead run scanpy.external.pp.bbknn on a copy of the same AnnData (use
pbmc68k_reduced() to get a fresh copy), extract its connectivities via
adata_ref.obsp["connectivities"].data and compare distribution metrics (e.g.,
mean and a percentile like 99th) between the bbknn output and the scanpy
reference (from bbknn(adata, ...)) using numeric assertions (e.g.,
numpy.testing.assert_allclose or small relative tolerances) so the test asserts
equality/near-equality to scanpy behavior rather than fixed thresholds. Ensure
you reference the existing functions/objects:
test_bbknn_connectivities_not_collapsed, pbmc68k_reduced, bbknn, and
adata.obsp["connectivities"].data.
147-222: 🏗️ Heavy lift

Add edge case coverage per coding guidelines.

All new tests use only pbmc68k_reduced(). The coding guidelines require tests to "cover edge cases (single row, empty input, max-size input) beyond just 'runs without error'". Consider adding:

Single-row input to test per-row logic boundaries (sorting, trimming)

Empty or minimal-batch scenarios for bbknn

Large trim values relative to row nonzero counts to stress kernel selection

As per coding guidelines, Test validation: include numerical correctness checks against CPU reference implementations, and cover edge cases (single row, empty input, max-size input) beyond just 'runs without error'.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_neighbors.py` around lines 147 - 222, Add edge-case test cases
alongside the existing ones: create new tests that call bbknn and
trimming_gpu/trimming_cpu using a single-row AnnData (use pbmc68k_reduced() then
slice to one observation to exercise per-row sorting/trimming boundaries), an
empty/minimal-batch AnnData (zero rows or a single batch with zero neighbors) to
verify bbknn handles empty inputs, and a test that sets trim much larger than
available nonzeros to force the kernel paths; in each new test compare numerical
results to the CPU reference (trimming_cpu, adata.obsp connectivities from
bbknn, and X_to_GPU conversions) using the same assert_array_equal-style checks
so GPU kernels and bbknn outputs match the CPU reference.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/rapids_singlecell/preprocessing/_neighbors/_helper/__init__.py`:
- Line 133: Add a brief inline comment above the constant _TRIM_SORT_THRESHOLD =
100 that documents the rationale for choosing 100 as the cutoff used in kernel
auto-selection (e.g., tradeoff between sort/trim overhead and expected
neighborhood sizes, empirical benchmark, or reference to the algorithmic
behavior), mention units (number of neighbors/samples) and when it should be
adjusted; keep the comment short (1–2 sentences) and reference
_TRIM_SORT_THRESHOLD and the kernel auto-selection logic so future maintainers
can understand and tune it.

In `@tests/test_neighbors.py`:
- Around line 158-166: The test test_bbknn_connectivities_not_collapsed uses
fragile hard-coded thresholds; instead run scanpy.external.pp.bbknn on a copy of
the same AnnData (use pbmc68k_reduced() to get a fresh copy), extract its
connectivities via adata_ref.obsp["connectivities"].data and compare
distribution metrics (e.g., mean and a percentile like 99th) between the bbknn
output and the scanpy reference (from bbknn(adata, ...)) using numeric
assertions (e.g., numpy.testing.assert_allclose or small relative tolerances) so
the test asserts equality/near-equality to scanpy behavior rather than fixed
thresholds. Ensure you reference the existing functions/objects:
test_bbknn_connectivities_not_collapsed, pbmc68k_reduced, bbknn, and
adata.obsp["connectivities"].data.
- Around line 147-222: Add edge-case test cases alongside the existing ones:
create new tests that call bbknn and trimming_gpu/trimming_cpu using a
single-row AnnData (use pbmc68k_reduced() then slice to one observation to
exercise per-row sorting/trimming boundaries), an empty/minimal-batch AnnData
(zero rows or a single batch with zero neighbors) to verify bbknn handles empty
inputs, and a test that sets trim much larger than available nonzeros to force
the kernel paths; in each new test compare numerical results to the CPU
reference (trimming_cpu, adata.obsp connectivities from bbknn, and X_to_GPU
conversions) using the same assert_array_equal-style checks so GPU kernels and
bbknn outputs match the CPU reference.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 15907752-43b0-4ab9-91a1-befb828eb7e1

📥 Commits

Reviewing files that changed from the base of the PR and between 61eda66 and db2eba5.

📒 Files selected for processing (6)

docs/release-notes/0.15.1.md
src/rapids_singlecell/_cuda/bbknn/bbknn.cu
src/rapids_singlecell/_cuda/bbknn/kernels_bbknn.cuh
src/rapids_singlecell/preprocessing/_neighbors/__init__.py
src/rapids_singlecell/preprocessing/_neighbors/_helper/__init__.py
tests/test_neighbors.py

codecov-commenter · 2026-05-15T09:34:21Z

Codecov Report

❌ Patch coverage is 94.44444% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 88.14%. Comparing base (61eda66) to head (3a84899).

Files with missing lines	Patch %	Lines
...ecell/preprocessing/_neighbors/_helper/__init__.py	92.30%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #659      +/-   ##
==========================================
+ Coverage   88.13%   88.14%   +0.01%     
==========================================
  Files          96       96              
  Lines        7045     7060      +15     
==========================================
+ Hits         6209     6223      +14     
- Misses        836      837       +1

Files with missing lines	Coverage Δ
...ds_singlecell/preprocessing/_neighbors/__init__.py	`91.11% <100.00%> (+0.41%)`	⬆️
...ecell/preprocessing/_neighbors/_helper/__init__.py	`79.16% <92.30%> (+2.11%)`	⬆️

Intron7 added 2 commits May 14, 2026 19:13

Inital fix

4df6b7a

add release note and up blocksort size

db2eba5

Intron7 added the run-gpu-ci label May 15, 2026

github-actions Bot removed the run-gpu-ci label May 15, 2026

coderabbitai Bot reviewed May 15, 2026

View reviewed changes

adress comments

c008311

Intron7 added the run-gpu-ci label May 15, 2026

github-actions Bot removed the run-gpu-ci label May 15, 2026

Merge branch 'main' into fix-bbknn

3a84899

Intron7 added the run-gpu-ci label May 15, 2026

github-actions Bot removed the run-gpu-ci label May 15, 2026

Merge branch 'main' into fix-bbknn

3379560

Intron7 merged commit 2926263 into main May 15, 2026
11 of 15 checks passed

Intron7 deleted the fix-bbknn branch May 15, 2026 10:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix BBKNN Trimming#659

Fix BBKNN Trimming#659
Intron7 merged 5 commits into
mainfrom
fix-bbknn

Intron7 commented May 15, 2026

Uh oh!

Intron7 commented May 15, 2026

Uh oh!

coderabbitai Bot commented May 15, 2026

Uh oh!

coderabbitai Bot commented May 15, 2026

Uh oh!

coderabbitai Bot commented May 15, 2026

Summary by CodeRabbit

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

codecov-commenter commented May 15, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Intron7 commented May 15, 2026

Uh oh!

Intron7 commented May 15, 2026

Uh oh!

coderabbitai Bot commented May 15, 2026

Uh oh!

coderabbitai Bot commented May 15, 2026

Summary

Issues

1. launch_find_top_k_per_row: shared-memory budget not guaranteed at block_size == 1

2. _TRIM_SORT_THRESHOLD = 100 is undocumented

Observations / Nitpicks (non-blocking)

Uh oh!

coderabbitai Bot commented May 15, 2026

Summary by CodeRabbit

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1. `launch_find_top_k_per_row`: shared-memory budget not guaranteed at `block_size == 1`

2. `_TRIM_SORT_THRESHOLD = 100` is undocumented

codecov-commenter commented May 15, 2026 •

edited

Loading