
Parallel IDF #404

Merged
kaizhang merged 2 commits into scverse:main from joshchiou:idf_from_chunks_parallel
Jul 9, 2025


@joshchiou

The current implementation of idf_from_chunks is sequential, which makes the IDF step a bottleneck for very large matrices. This PR adds a version that parallelizes the IDF counting using the existing Rayon imports, which should improve performance on large datasets.
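
The general idea can be sketched as a map-reduce over chunks. This is a minimal illustration and not the code in this PR: it is written in Python with a thread pool standing in for Rayon (so it shows the structure only, since Python threads do not speed up CPU-bound work), the names `count_chunk`, `combine`, `idf_sequential`, and `idf_parallel` are hypothetical, each chunk is assumed to be a list of rows given as lists of nonzero column indices, and the formula `log(N / df)` with zero-count columns mapped to 0.0 is an assumption rather than the library's definition.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def count_chunk(chunk, n_cols):
    """Return (number of rows, per-column count of rows with a nonzero entry)."""
    counts = [0] * n_cols
    for row in chunk:
        for col in set(row):  # count each column at most once per row
            counts[col] += 1
    return len(chunk), counts


def combine(partials, n_cols):
    """Sum the per-chunk results, then apply the assumed IDF formula."""
    n_rows, totals = 0, [0] * n_cols
    for n, counts in partials:
        n_rows += n
        for j, c in enumerate(counts):
            totals[j] += c
    # Assumption: idf = log(N / df); zero-count columns get 0.0 (hypothetical choice).
    return [math.log(n_rows / c) if c > 0 else 0.0 for c in totals]


def idf_sequential(chunks, n_cols):
    # Baseline: count one chunk at a time.
    return combine((count_chunk(c, n_cols) for c in chunks), n_cols)


def idf_parallel(chunks, n_cols, workers=4):
    # Map: count every chunk independently. Reduce: sum the partial counts.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: count_chunk(c, n_cols), chunks))
    return combine(partials, n_cols)


chunks = [[[0, 2], [2]], [[1, 2], [0]]]  # two chunks, two rows each, four columns
assert idf_parallel(chunks, n_cols=4) == idf_sequential(chunks, n_cols=4)
```

Because the per-chunk counts in this sketch are integers, the order in which partial results are summed does not change the totals, so the parallel and sequential paths agree exactly here.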

@kaizhang
Member

kaizhang commented Jul 8, 2025

Hi Josh, thanks for the pull request! Have you tested if the new implementation produces the same output as the old one? And what is the performance gain for large datasets?

@joshchiou
Author

joshchiou commented Jul 8, 2025

Hi Kai - I added a couple of tests with simulated matrices that show that the new implementation produces identical results to the old one.

I haven't had a chance to benchmark the performance gain, but previously the single-threaded approach was stuck on the IDF step for a ~10M x 1M matrix for about a week without finishing. With the parallel implementation, it finished in < 8 hours using 48 CPUs.

Results from the test:
https://github.com/joshchiou/SnapATAC2/tree/idf_from_chunks_parallel/snapatac2-python/tests/test_idf

Tolerance: 1.00e-12

Test 1: Random matrices with varying density
✓ Random (density=0.05): Vectors are mathematically identical (max diff < 1.00e-12)
✓ Random (density=0.1): Vectors are mathematically identical (max diff < 1.00e-12)
✓ Random (density=0.3): Vectors are mathematically identical (max diff < 1.00e-12)
✓ Random (density=0.7): Vectors are mathematically identical (max diff < 1.00e-12)

Test 2: Uniform matrix (all columns have same count)
✓ Uniform matrix: Vectors are mathematically identical (max diff < 1.00e-12)

Test 3: Sparse matrix (some columns have zero counts)
✓ Sparse matrix: Vectors are mathematically identical (max diff < 1.00e-12)

Test 4: Multiple chunks with different sizes
✓ Multiple chunks: Vectors are mathematically identical (max diff < 1.00e-12)

Test 5: Single large matrix
✓ Large matrix: Vectors are mathematically identical (max diff < 1.00e-12)

Both idf_from_chunks and idf_from_chunks_parallel produce identical results.
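
For reference, a check of the kind reported above (maximum absolute difference below a tolerance) could look roughly like the following. This is a sketch, not the linked test code: the helper names `max_abs_diff` and `vectors_match` are hypothetical, and treating equal infinities as a zero difference and any NaN as a mismatch are assumptions about how edge cases might be handled.

```python
import math


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    worst = 0.0
    for x, y in zip(a, b):
        if math.isnan(x) or math.isnan(y):
            return math.inf  # assumption: any NaN counts as a mismatch
        if x == y:
            continue  # also covers equal infinities, where x - y would be NaN
        worst = max(worst, abs(x - y))
    return worst


def vectors_match(a, b, tol=1e-12):
    """True when no pair of elements differs by tol or more."""
    return max_abs_diff(a, b) < tol


assert vectors_match([0.5, 1.25], [0.5, 1.25])
assert not vectors_match([0.5, 1.25], [0.5, 1.26])
```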

@kaizhang kaizhang merged commit ab079fc into scverse:main Jul 9, 2025
1 check failed
kaizhang pushed a commit that referenced this pull request Nov 8, 2025
