Replace multiprocessing with joblib, parallelize ESS and TLE #24

Merged: 1 commit, Mar 11, 2024


JesseCresswell
Contributor

This PR addresses issue #21.
I used the following test code to demonstrate the issue and benchmark the fix:

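import numpy as np
import skdim
import time

# Fixed seed so every run benchmarks the same data
np.random.seed(0)
X = np.random.random((200000, 10))

# Uncomment exactly one estimator to benchmark
estimator = skdim.id.ESS() # LocalEstimator
# estimator = skdim.id.FisherS() # GlobalEstimator
# estimator = skdim.id.lPCA() # Special Case
n_jobs = 1

t0 = time.time()
# Local estimators produce pointwise dimensions via fit(); global ones via fit_pw()
if isinstance(estimator, skdim._commonfuncs.LocalEstimator):
    lid = estimator.fit(X, n_neighbors=10, n_jobs=n_jobs).dimension_pw_
elif isinstance(estimator, skdim._commonfuncs.GlobalEstimator):
    lid = estimator.fit_pw(X, n_neighbors=10, n_jobs=n_jobs).dimension_pw_
else:
    # Without this branch, lid would be undefined below
    raise TypeError(f"Unsupported estimator type: {type(estimator).__name__}")
t1 = time.time()

# Mean local intrinsic dimension, then wall-clock time in seconds
print(lid.mean())
print(t1-t0)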

LocalEstimator ESS
On commit 427b1881b, before my changes, I timed the kNN computation and the ESS computation separately while varying n_jobs. On a machine with 40 CPUs I observed the following:

| n_jobs | Time kNN (s) | Time ESS (s) | Mean LID |
|--------|--------------|--------------|----------|
| 1      | 92.71        | 146.98       | 8.959729 |
| 2      | 91.05        | 149.36       | 8.959729 |

After the changes in this PR I observed:

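| n_jobs | Time kNN (s) | Time ESS (s) | Mean LID |
|--------|--------------|--------------|----------|
| 1      | 95.44        | 150.98       | 8.959729 |
| 2      | 79.22        | 91.84        | 8.959729 |
| 20     | 99.55        | 17.03        | 8.959729 |
| 40     | 70.97        | 18.94        | 8.959729 |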

Conclusions:
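The ESS estimator code parallelizes effectively. Before this PR, n_jobs had no effect on ESS time (146.98 s vs 149.36 s); with it, ESS time drops from 150.98 s at n_jobs=1 to 17.03 s at n_jobs=20, with no further gain at n_jobs=40. The mean LID is identical in every run. The kNN timings vary a lot from run to run (70.97 s to 99.55 s, with no consistent trend) and do not clearly benefit from n_jobs>1.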
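For readers unfamiliar with the pattern, here is a minimal sketch of how a per-point computation can be spread across workers with joblib. It is not the code from this PR: `local_stat` and `fit_pointwise` are hypothetical names, the statistic is a placeholder rather than the ESS estimate, and the neighbor indices are random stand-ins for a real kNN query.

```python
import numpy as np
from joblib import Parallel, delayed


def local_stat(neighborhood):
    """Hypothetical per-point statistic standing in for a local ID estimate."""
    # Total variance divided by the largest per-axis variance; illustrative only.
    var = neighborhood.var(axis=0)
    return float(var.sum() / var.max())


def fit_pointwise(X, neighbor_idx, n_jobs=1):
    """Apply local_stat to each point's neighborhood, optionally in parallel."""
    # Parallel returns results in the same order as the input generator.
    results = Parallel(n_jobs=n_jobs)(
        delayed(local_stat)(X[idx]) for idx in neighbor_idx
    )
    return np.asarray(results)


rng = np.random.default_rng(0)
X = rng.random((500, 10))
# Assumption: random indices in place of the output of a real kNN query.
neighbor_idx = rng.integers(0, len(X), size=(len(X), 10))

serial = fit_pointwise(X, neighbor_idx, n_jobs=1)
parallel = fit_pointwise(X, neighbor_idx, n_jobs=2)
print(np.allclose(serial, parallel))
```

Because each neighborhood is processed independently, the result should not depend on n_jobs, which matches the unchanged mean LID in the tables above.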

GlobalEstimator LPCA
Before these changes

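| n_jobs | Time kNN (s) | Time LPCA (s) | Mean LID |
|--------|--------------|---------------|----------|
| 1      | 94.59        | 83.54         | 7.006095 |
| 2      | 91.35        | 69.74         | 7.006095 |
| 20     | 79.12        | 43.31         | 7.006095 |
| 40     | 57.86        | 41.53         | 7.006095 |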

With this PR

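| n_jobs | Time kNN (s) | Time LPCA (s) | Mean LID |
|--------|--------------|---------------|----------|
| 1      | 93.75        | 82.25         | 7.006095 |
| 2      | 88.16        | 47.64         | 7.006095 |
| 20     | 78.79        | 12.94         | 7.006095 |
| 40     | 60.44        | 15.94         | 7.006095 |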

Conclusions:
lPCA already ran in parallel through the GlobalEstimator code, but joblib gives an additional speedup over multiprocessing: at n_jobs=20 the LPCA time falls from 43.31 s to 12.94 s, with the mean LID unchanged.
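The timings in the tables can be turned into speedup and parallel-efficiency figures with a few lines of arithmetic. The sketch below uses only the numbers reported above; the helper name `speedups` is made up for illustration.

```python
# Estimator timings in seconds, copied from the tables above: {n_jobs: time}.
ess_after = {1: 150.98, 2: 91.84, 20: 17.03, 40: 18.94}
lpca_before = {1: 83.54, 2: 69.74, 20: 43.31, 40: 41.53}
lpca_after = {1: 82.25, 2: 47.64, 20: 12.94, 40: 15.94}


def speedups(times):
    """Map n_jobs to (speedup vs n_jobs=1, efficiency = speedup / n_jobs)."""
    base = times[1]
    return {n: (base / t, base / t / n) for n, t in times.items()}


for name, times in [
    ("ESS after", ess_after),
    ("LPCA before", lpca_before),
    ("LPCA after", lpca_after),
]:
    for n, (speedup, efficiency) in speedups(times).items():
        print(f"{name:12s} n_jobs={n:2d} "
              f"speedup={speedup:5.2f} efficiency={efficiency:4.2f}")
```

At n_jobs=20 this gives a speedup of about 8.9x for ESS and about 6.4x for LPCA with this PR, against about 1.9x for LPCA before it. These are single-run timings, so the figures are indicative only.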

Additional results are provided in this Google doc. They cover TLE, which I parallelized in this PR, and FisherS, a GlobalEstimator whose parallel behaviour was already strange before these changes:
https://docs.google.com/document/d/1Jo9VPFQwBaDP9Xj-8xiA_ueULNstXqF554VfFr6Wmeo/edit?usp=sharing

@j-bac j-bac merged commit a2a8006 into scikit-learn-contrib:master Mar 11, 2024
@j-bac
Collaborator

j-bac commented Mar 11, 2024

thanks Jesse. Really nice speedup!

@JesseCresswell
Contributor Author

Those were all the changes I planned to make. It would be great if you could make a new release as I'm actively using the repo for research.

@j-bac
Collaborator

j-bac commented Mar 11, 2024

Glad to hear it's useful. Just created a new pypi release
