64 bit indptr csr support#620
Conversation
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
📝 WalkthroughWalkthroughThis PR refactors CUDA kernels and Python bindings across 20+ modules to support 64-bit ( Changes64-bit Sparse Index Type Support & Binding Consolidation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/test_int64_indptr.py (1)
73-933:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftAdd CPU-reference and edge-case assertions, not only int32/int64 parity.
This suite is great for dispatch parity, but it can still pass when both paths are wrong in the same way. Please add numerical checks against NumPy/SciPy references for each operation family, and parameterize edge cases (single-row, empty input, and a stress/max-size shape used in CI).
As per coding guidelines
tests/**/*.py: Test validation: include numerical correctness checks against CPU reference implementations, and cover edge cases (single row, empty input, max-size input) beyond just 'runs without error'🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_int64_indptr.py` around lines 73 - 933, The tests currently only assert int32 vs int64 parity; update each test family (functions like test_sparse_qc_csr/test_sparse_qc_csc, test_mean_var_major/minor, test_nan_mean_major/minor, test_sparse2dense_c/test_sparse2dense_f, test_sparse_norm_res_csr/csc, test_sparse_aggr, test_sparse_var, etc.) to also compute a CPU reference using NumPy/SciPy (using _make_dense_cpu or SciPy sparse operations) and assert numerical equality (or use _close where floating rounding is expected) between the GPU result and the CPU reference rather than only between idx dtypes; additionally parameterize critical tests to include edge cases (single-row, empty matrix, and a CI-stress large shape) by reusing _make_csr/_make_csc with those shapes and adding assertions for those cases so each kernel is validated for correctness and edge behavior beyond int32/int64 parity, keeping existing helpers _eq/_close for comparisons.src/rapids_singlecell/_cuda/sparse2dense/kernels_s2d.cuh (1)
23-24:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd negative index guard to prevent out-of-bounds memory access.
Line 23 only checks
idx >= minorbut does not guard against negativeidxvalues. Ifidxis negative, it passes the guard and produces incorrect calculations at lines 29/32, leading to out-of-bounds memory access in the atomicAdd at line 34.Suggested fix
- if (idx >= minor) { + if (idx < 0 || idx >= minor) { continue; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/rapids_singlecell/_cuda/sparse2dense/kernels_s2d.cuh` around lines 23 - 24, The current guard only checks "idx >= minor" which allows negative idx values to slip through and cause out-of-bounds writes in the subsequent atomicAdd; modify the check around the kernel's index validation (the condition that currently reads "if (idx >= minor) { continue; }") to also guard against negative indices (e.g., ensure idx < 0 is handled) so that any idx < 0 or idx >= minor will skip processing, preventing invalid memory accesses in the atomicAdd using idx.
🧹 Nitpick comments (1)
src/rapids_singlecell/_cuda/mean_var/kernels_mv.cuh (1)
18-19: 💤 Low valueHard-coded shared memory size should be a named constant.
The shared memory arrays use a raw literal
64which should matchBLOCK_SIZE_MAJORdefined inmean_var.cu. Per coding guidelines, numeric literals for block sizes and shared memory sizes should be defined as named constants.Consider defining this in the header or passing as a template parameter to ensure consistency:
constexpr int BLOCK_SIZE_MAJOR = 64; // ... __shared__ double mean_place[BLOCK_SIZE_MAJOR]; __shared__ double var_place[BLOCK_SIZE_MAJOR];As per coding guidelines: "All numeric literals for block sizes, tile dimensions, shared memory sizes... MUST be defined as named constants."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/rapids_singlecell/_cuda/mean_var/kernels_mv.cuh` around lines 18 - 19, Replace the hard-coded literal 64 used for shared memory arrays mean_place and var_place with a named constant (e.g., BLOCK_SIZE_MAJOR) so the size stays in sync with mean_var.cu; declare constexpr int BLOCK_SIZE_MAJOR = 64 in an appropriate header or pass it as a template parameter and update the declarations __shared__ double mean_place[BLOCK_SIZE_MAJOR]; and __shared__ double var_place[BLOCK_SIZE_MAJOR]; ensuring the symbol name matches the one used in mean_var.cu.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/rapids_singlecell/_cuda/aggr/aggr.cu`:
- Around line 17-18: The sparse-kernel launch sites compute dim3
grid((unsigned)n_cells) and dim3 block(BLOCK_SIZE_SPARSE) and can produce
invalid launches when a dimension is zero; add an early return guard in each of
the four sparse launch helper functions so they return immediately if the work
size is zero (e.g., if n_cells==0 or any other computed grid dimension == 0)
before computing dim3 grid/block or launching the kernel (references: dim3 grid,
dim3 block, BLOCK_SIZE_SPARSE).
In `@src/rapids_singlecell/_cuda/aggr/kernels_aggr.cuh`:
- Around line 56-73: csr_to_coo_kernel currently truncates 64-bit indices by
casting index[p] to int and writing into int col buffers, corrupting indices ≥
INT_MAX; fix by propagating IdxT for column buffers and bindings: change
csr_to_coo_kernel signature to accept IdxT* __restrict__ col (and ensure row/col
uses IdxT where appropriate), update the nanobind binding that currently exposes
gpu_array_c<int, Device> to accept gpu_array_c<IdxT, Device> for the output
column array, and update the callers in _aggregated.py to allocate output column
buffers matching the input index dtype; alternatively, if API change is
unacceptable, add pre-launch validation that checks max(index) < INT_MAX and
fail with a clear error (guarding launch) and document the limitation.
In `@src/rapids_singlecell/_cuda/ligrec/kernels_ligrec.cuh`:
- Around line 36-38: The code narrows 64-bit sparse indices by assigning
index[gene] (type IdxT) to an int (gene_number) in the loop inside
kernels_ligrec.cuh; update that local variable to use IdxT (or another 64-bit
type) so indices aren't truncated and repeat the same change in
mean_sparse_kernel, ensuring all uses that compute offsets/column indices (e.g.,
the local gene_number and any index-based arithmetic) operate in 64-bit to match
the explicit instantiation with IdxT = long long.
In `@src/rapids_singlecell/_cuda/ligrec/ligrec.cu`:
- Around line 180-197: The binding def_interaction advertises stream support but
the kernel launch inside launch_interaction ignores the passed stream; update
launch_interaction<T> to take a cudaStream_t (or std::uintptr_t cast to
cudaStream_t) and use that stream in the kernel launch (use <<<grid, block, 0,
stream>>> or the equivalent CUDA launch API) so ordering with upstream writes to
mean/mask/g and downstream reads from res is respected; keep the def_interaction
call to launch_interaction<T>(..., (cudaStream_t)stream) but change
launch_interaction's signature and its internal kernel invocation to accept and
use the stream.
- Around line 24-35: The sparse launcher launch_sum_count_sparse can compute
grid.x == 0 when rows == 0 which yields an invalid CUDA launch; add an
early-return guard at the top of launch_sum_count_sparse (and any other sparse
launcher that computes grid from rows) to return immediately when rows <= 0 so
no kernel is launched, and keep the rest unchanged; also fix launch_interaction
so the interaction kernel is launched with the provided cudaStream_t stream (use
<<<grid, block, 0, stream>>> for interaction_kernel) instead of the current
launch that ignores the stream. Ensure you reference the functions
sum_and_count_sparse_kernel and interaction_kernel in the fixes.
In `@src/rapids_singlecell/_cuda/nb_types.h`:
- Around line 31-45: The function max_grid_dims reads CUDA results without
checking return codes, so wrap the calls to cudaGetDevice and
cudaGetDeviceProperties in error-checks inside max_grid_dims: call cudaGetDevice
and if it returns non-success do not change cached_dev or cached (keep the safe
fallback) and return cached; if cudaGetDevice succeeds then call
cudaGetDeviceProperties and only update cached[0..2] and cached_dev when
cudaGetDeviceProperties returns success; optionally log or assert the CUDA error
before returning the fallback. Ensure you reference and protect the existing
static thread_local variables cached and cached_dev and the calls to
cudaGetDevice and cudaGetDeviceProperties in the max_grid_dims function.
In `@src/rapids_singlecell/_cuda/sparse2dense/sparse2dense.cu`:
- Around line 12-13: The max_nnz parameter is currently declared as int and must
be widened to long long everywhere to avoid truncation for per-row nnz >
INT_MAX; update the function signatures (the top-level host function(s) in
sparse2dense.cu that accept max_nnz), any device/kernel wrappers and internal
calls that pass max_nnz to strided_grid_y(), and all call sites so that max_nnz
is consistently long long; ensure strided_grid_y() invocations and any
arithmetic or loop bounds using max_nnz are updated to use 64-bit types to
preserve correct grid_y calculation and kernel grid-stride behavior.
In `@src/rapids_singlecell/_cuda/spca/kernels_spca.cuh`:
- Around line 56-58: The code casts indices[value] (IdxT) to int in
kernels_spca.cuh before bounds checking, which can truncate 64-bit indices; fix
by first reading into the wide type (e.g., IdxT raw = indices[value] or int64_t
tmp = static_cast<int64_t>(indices[value])), validate that raw is within [0,
num_genes-1] and also fits in int32 range (<= INT_MAX and >= INT_MIN) and only
then static_cast<int> to produce gene_index used by
atomicAdd(&genes[gene_index], 1); update the checks around gene_index/indices so
bounds checks operate on the wide type (raw/tmp) rather than the narrowed int to
avoid silent truncation and index corruption.
In `@src/rapids_singlecell/_cuda/spca/spca.cu`:
- Around line 16-20: The kernel launch for gram_csr_upper_kernel uses dim3
grid(nrows) which is invalid when nrows == 0; add an early return guard (check
nrows == 0) before setting block/grid and before calling
gram_csr_upper_kernel<T, IdxT> to skip the launch (matching the pattern used by
launch_check_zero_genes), leaving CUDA_CHECK_LAST_ERROR call for non-zero
launches.
In `@src/rapids_singlecell/_cuda/wilcoxon_binned/kernels_wilcoxon_binned.cuh`:
- Around line 50-52: The CSC kernel narrows indices[i] to int (col/row) then
indexes gcodes[row] without validating the narrowed row, risking
wraparound/out-of-bounds; add an explicit bounds check after narrowing (e.g.,
int row = (int)indices[i]; if (row < 0 || row >= nrows) continue;) before any
gcodes[row] access in the CSC kernel, and likewise add a similar validation for
col in the CSR kernel (validate the narrowed int col is within [gene_start,
gene_stop) before using it) so both kernels mirror the CSR row check and prevent
buffer underruns.
In `@src/rapids_singlecell/_cuda/wilcoxon_binned/wilcoxon_binned.cu`:
- Around line 31-33: The kernel launches in launch_csr_hist, launch_csc_hist,
and launch_dense_hist can be invoked with a zero-sized grid (grid(n_cells) or
grid(n_genes)), which is invalid; add an early return guard at the start of each
function to check the relevant size (e.g., if n_cells == 0 return; in
launch_csr_hist, and if n_genes == 0 return; in launch_csc_hist and
launch_dense_hist) so the subsequent dim3 grid(...) and
csr_hist_kernel/csc_hist_kernel/dense_hist_kernel launches are never called with
zero dimensions.
In `@tests/test_int64_indptr.py`:
- Line 148: Update the comparison loops that iterate over outs to use strict
zipping so length mismatches raise errors: replace occurrences like "for a, b in
zip(outs[np.int32], outs[np.int64]):" with a strict zip call (add strict=True)
in all listed callsites (the for-loops comparing outs at lines referenced, e.g.,
the loop using outs[np.int32] and outs[np.int64] and the other similar for-loops
throughout tests/test_int64_indptr.py). Ensure each zip(...) call used for
pairwise assertions (comparing tuples/iterables from outs) includes strict=True
so tuple parity is enforced.
---
Outside diff comments:
In `@src/rapids_singlecell/_cuda/sparse2dense/kernels_s2d.cuh`:
- Around line 23-24: The current guard only checks "idx >= minor" which allows
negative idx values to slip through and cause out-of-bounds writes in the
subsequent atomicAdd; modify the check around the kernel's index validation (the
condition that currently reads "if (idx >= minor) { continue; }") to also guard
against negative indices (e.g., ensure idx < 0 is handled) so that any idx < 0
or idx >= minor will skip processing, preventing invalid memory accesses in the
atomicAdd using idx.
In `@tests/test_int64_indptr.py`:
- Around line 73-933: The tests currently only assert int32 vs int64 parity;
update each test family (functions like test_sparse_qc_csr/test_sparse_qc_csc,
test_mean_var_major/minor, test_nan_mean_major/minor,
test_sparse2dense_c/test_sparse2dense_f, test_sparse_norm_res_csr/csc,
test_sparse_aggr, test_sparse_var, etc.) to also compute a CPU reference using
NumPy/SciPy (using _make_dense_cpu or SciPy sparse operations) and assert
numerical equality (or use _close where floating rounding is expected) between
the GPU result and the CPU reference rather than only between idx dtypes;
additionally parameterize critical tests to include edge cases (single-row,
empty matrix, and a CI-stress large shape) by reusing _make_csr/_make_csc with
those shapes and adding assertions for those cases so each kernel is validated
for correctness and edge behavior beyond int32/int64 parity, keeping existing
helpers _eq/_close for comparisons.
---
Nitpick comments:
In `@src/rapids_singlecell/_cuda/mean_var/kernels_mv.cuh`:
- Around line 18-19: Replace the hard-coded literal 64 used for shared memory
arrays mean_place and var_place with a named constant (e.g., BLOCK_SIZE_MAJOR)
so the size stays in sync with mean_var.cu; declare constexpr int
BLOCK_SIZE_MAJOR = 64 in an appropriate header or pass it as a template
parameter and update the declarations __shared__ double
mean_place[BLOCK_SIZE_MAJOR]; and __shared__ double var_place[BLOCK_SIZE_MAJOR];
ensuring the symbol name matches the one used in mean_var.cu.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8f5b7164-8c8f-49b5-8450-ffdb2781ea8d
📒 Files selected for processing (37)
src/rapids_singlecell/_cuda/aggr/aggr.cusrc/rapids_singlecell/_cuda/aggr/kernels_aggr.cuhsrc/rapids_singlecell/_cuda/autocorr/autocorr.cusrc/rapids_singlecell/_cuda/autocorr/kernels_autocorr.cuhsrc/rapids_singlecell/_cuda/edistance/edistance.cusrc/rapids_singlecell/_cuda/gmm/gmm.cusrc/rapids_singlecell/_cuda/harmony/colsum/colsum.cusrc/rapids_singlecell/_cuda/harmony/kmeans/kmeans.cusrc/rapids_singlecell/_cuda/harmony/normalize/normalize.cusrc/rapids_singlecell/_cuda/harmony/outer/outer.cusrc/rapids_singlecell/_cuda/harmony/scatter/scatter.cusrc/rapids_singlecell/_cuda/hvg/hvg.cusrc/rapids_singlecell/_cuda/ligrec/kernels_ligrec.cuhsrc/rapids_singlecell/_cuda/ligrec/ligrec.cusrc/rapids_singlecell/_cuda/mean_var/kernels_mv.cuhsrc/rapids_singlecell/_cuda/mean_var/mean_var.cusrc/rapids_singlecell/_cuda/nanmean/kernels_nanmean.cuhsrc/rapids_singlecell/_cuda/nanmean/nanmean.cusrc/rapids_singlecell/_cuda/nb_types.hsrc/rapids_singlecell/_cuda/norm/kernels_norm.cuhsrc/rapids_singlecell/_cuda/norm/norm.cusrc/rapids_singlecell/_cuda/pr/kernels_pr.cuhsrc/rapids_singlecell/_cuda/pr/kernels_pr_hvg.cuhsrc/rapids_singlecell/_cuda/pr/pr.cusrc/rapids_singlecell/_cuda/qc/kernels_qc.cuhsrc/rapids_singlecell/_cuda/qc/qc.cusrc/rapids_singlecell/_cuda/qc_dask/kernels_qcd.cuhsrc/rapids_singlecell/_cuda/qc_dask/qc_kernels_dask.cusrc/rapids_singlecell/_cuda/scale/kernels_scale.cuhsrc/rapids_singlecell/_cuda/scale/scale.cusrc/rapids_singlecell/_cuda/sparse2dense/kernels_s2d.cuhsrc/rapids_singlecell/_cuda/sparse2dense/sparse2dense.cusrc/rapids_singlecell/_cuda/spca/kernels_spca.cuhsrc/rapids_singlecell/_cuda/spca/spca.cusrc/rapids_singlecell/_cuda/wilcoxon_binned/kernels_wilcoxon_binned.cuhsrc/rapids_singlecell/_cuda/wilcoxon_binned/wilcoxon_binned.cutests/test_int64_indptr.py
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #620 +/- ##
=======================================
Coverage ? 88.63%
=======================================
Files ? 98
Lines ? 7364
Branches ? 0
=======================================
Hits ? 6527
Misses ? 837
Partials ? 0
|
Cupy will add 64 bit indptr support for sparse matrices. This Draft PR is there to test the implementation and to later enable day 1 support.
Todo: Tests