feat: remove ligrec parallelize (#1125)
Conversation
Codecov Report ❌ Patch coverage is
Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1125      +/-   ##
==========================================
- Coverage   74.05%   73.90%    -0.16%
==========================================
  Files          39       39
  Lines        6495     6510      +15
  Branches     1122     1122
==========================================
+ Hits         4810     4811       +1
- Misses       1230     1249      +19
+ Partials      455      450       -5
```
for more information, see https://pre-commit.ci
```python
@njit(nogil=True, cache=True)
```
Why not `parallel=True` + `prange`? Because this is being run in a thread pool? Why not just make every individual step parallel?
https://numba.pydata.org/numba-doc/dev/user/parallel.html?highlight=njit#explicit-parallel-loops
Would this require rewriting into a reduction of some sort to prevent overlapping writes?
I see in the benchmarks that the speedup with more jobs does not really scale linearly, which is not what I would expect.
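The "reduction to prevent overlapping writes" idea can be illustrated without numba: give each worker a private accumulator and sum the partials afterwards. This is a minimal NumPy/threading sketch; `count_parallel` and the histogram task are hypothetical stand-ins, not squidpy code.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def count_parallel(data, n_bins, n_jobs=4):
    """Each worker fills a private buffer; the buffers are summed at the
    end, so no two threads ever write to the same array (the reduction)."""
    chunks = np.array_split(data, n_jobs)

    def worker(chunk):
        local = np.zeros(n_bins, dtype=np.int64)  # private to this thread
        for v in chunk:
            local[v] += 1
        return local

    with ThreadPoolExecutor(n_jobs) as ex:
        partials = list(ex.map(worker, chunks))
    return np.sum(partials, axis=0)  # reduction over per-thread results
```

With `prange`, numba can generate the same per-thread-buffer-plus-reduction pattern automatically for supported operations; the sketch just makes the structure explicit.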
> I see in the benchmarks that the speedup with more jobs does not really scale linearly, which is not what I would expect.

That's a good point worth investigating.
```python
def _worker(t: int) -> NDArrayA:
    local_counts = np.zeros((n_inter, n_cpairs), dtype=np.int64)
    rs = np.random.RandomState(None if seed is None else t + seed)
    perm = clustering.copy()
    for _ in range(chunk_sizes[t]):
        rs.shuffle(perm)
        _score_permutation(
            data_arr,
            perm,
            inv_counts,
            mean_obs,
            interactions,
            interaction_clusters,
            valid,
            local_counts,
        )
        pbar.update(1)
    return local_counts
```
Why can't this also be numba-ified with an outer loop of some sort? Why do we still need a thread pool? I thought "one giant kernel" was the goal.
Is shuffling not parallelizable? Certainly there are ways around this, like argsort + random indices or something. Other than that, I don't really see why the `range(chunk_sizes[t])` couldn't be parallelized. Is it the validity of `local_counts`? It seems like there should be ways around this.
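The "argsort + random indices" idea the comment mentions can be sketched in a few lines: a uniform shuffle is equivalent in distribution to sorting by i.i.d. random keys, and since each permutation only reads the input array, many such shuffles could run concurrently inside one kernel. `shuffle_via_argsort` is a hypothetical illustration, not squidpy code.

```python
import numpy as np

def shuffle_via_argsort(arr, rng):
    """Same distribution as rng.shuffle(arr): sort by i.i.d. random keys.
    The input is only read, never written, so calls are trivially parallel."""
    keys = rng.random(arr.shape[0])  # ties have probability ~0
    return arr[np.argsort(keys)]
```

The trade-off is an O(n log n) sort per permutation versus O(n) for Fisher-Yates, which is usually negligible when the array being shuffled is small.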
To keep the progress bar responsive and to get the same shuffling results as the old version.
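The "same shuffling results" point hinges on the per-worker seeding visible in `_worker`: worker `t` derives its stream from `seed + t`, so the sequence of permutations is fixed regardless of how threads interleave. A minimal NumPy-only sketch of that scheme (`chunk_permutations` is a hypothetical helper, not squidpy code):

```python
import numpy as np

def chunk_permutations(clustering, n_perms_in_chunk, t, seed):
    """Mirror the PR's seeding: worker t gets RandomState(seed + t), so its
    chain of shuffles is deterministic independent of thread scheduling."""
    rs = np.random.RandomState(None if seed is None else t + seed)
    perm = clustering.copy()
    out = []
    for _ in range(n_perms_in_chunk):
        rs.shuffle(perm)  # shuffles chain: each starts from the previous
        out.append(perm.copy())
    return out
```

Because each worker owns its own `RandomState`, re-running with the same `seed` and chunk layout reproduces the old version's permutations bit-for-bit.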
Could you explain a bit more?
- Why is the "same results" thing a hard blocker? `clustering` seems small, so copy + shuffle should be cheap as a pre-processing step, i.e., do all the "shuffle" stuff ahead of time / outside numba.
- Would you expect a giant kernel to be faster? My gut is "yes" given Severin's experience / our experience with `co_occurrence`, but I'm all ears.
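The "do all the shuffle stuff ahead of time" suggestion could look like this: generate every permutation up front in plain NumPy, producing one `(n_perms, n)` matrix that a single big kernel then consumes row by row. A hedged sketch, assuming `clustering` is a small 1-D label array; `pregenerate_shuffles` is a hypothetical helper, not squidpy code.

```python
import numpy as np

def pregenerate_shuffles(clustering, n_perms, seed=0):
    """Do all shuffling up front (cheap since `clustering` is small),
    yielding an (n_perms, n) matrix a single numba kernel could iterate
    over with prange, with no RNG calls inside the kernel."""
    rs = np.random.RandomState(seed)
    perms = np.empty((n_perms, clustering.shape[0]), dtype=clustering.dtype)
    perm = clustering.copy()
    for i in range(n_perms):
        rs.shuffle(perm)
        perms[i] = perm
    return perms
```

Memory is the main cost: `n_perms * n` labels, which stays small when `clustering` holds per-cell cluster ids.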
benchmark code
main results
pr results
Results compared:
Both faster and cleaner code. This removes `parallelize`.
Update: the reason main is faster when `n_jobs=1` is that main also sets `numba_parallel=True`, so it is still numba-parallel even though it is a single process.