Computation of estimator values pointwise on kNN sets is not parallelized #21

Closed
JesseCresswell opened this issue Mar 5, 2024 · 2 comments

Comments

@JesseCresswell
Contributor

I will use ESS as an example, since it is a pretty slow estimator. Because ESS is a LocalEstimator, calling fit first computes the kNNs (if they are not already provided):

dists, knnidx = get_nn(X, k=self.n_neighbors, n_jobs=n_jobs)

This in turn calls into the sklearn library, which properly parallelizes the computation based on the n_jobs parameter that we can provide.

Second, a call to self.fit in the ESS class runs a simple, single-threaded for loop over the data points:

for i in range(len(X)):
    self.dimension_pw_[i], self.essval_[i] = self._essLocalDimEst(
        X[knnidx[i, :]]
    )

Each iteration depends only on its own neighborhood, so the computations are "embarrassingly parallel". I have implemented parallelization locally and seen at least a 6x speedup in ESS computation on datasets of size 5,000 to 50,000.

I can open a PR with these changes as an example if you are willing to add joblib as a dependency.
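For illustration, here is a minimal sketch of how a loop like the one above could be distributed with joblib. It is not the code from my local implementation or from any PR: `local_estimate` is a hypothetical stand-in for `self._essLocalDimEst` (it computes a toy statistic, not ESS), `fit_pointwise` is a hypothetical free function rather than a method of the ESS class, and the brute-force neighbor search in the usage lines only stands in for `get_nn`.

```python
import numpy as np
from joblib import Parallel, delayed


def local_estimate(neighborhood):
    """Hypothetical stand-in for self._essLocalDimEst.

    Returns a (dimension, value) pair for one kNN neighborhood.
    The statistic is a toy participation ratio, NOT the ESS estimator.
    """
    centered = neighborhood - neighborhood.mean(axis=0)
    lam = np.linalg.svd(centered, compute_uv=False) ** 2
    dimension = lam.sum() ** 2 / (lam ** 2).sum()
    return dimension, lam.sum()


def fit_pointwise(X, knnidx, n_jobs=1):
    """Evaluate local_estimate on every kNN set, optionally in parallel.

    joblib returns results in the same order as the input iterable,
    so result i still belongs to data point i.
    """
    results = Parallel(n_jobs=n_jobs)(
        delayed(local_estimate)(X[knnidx[i, :]]) for i in range(len(X))
    )
    dimension_pw, values = zip(*results)
    return np.array(dimension_pw), np.array(values)


# Usage: brute-force kNN indices as a stand-in for get_nn (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
knnidx = np.argsort(dists, axis=1)[:, 1:11]  # 10 neighbors, skipping self
dimension_pw, values = fit_pointwise(X, knnidx, n_jobs=2)
```

Because each task only needs its own slice `X[knnidx[i, :]]`, the workers share no state, and `n_jobs` can be passed through in the same way as for the kNN step.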

@j-bac
Collaborator

j-bac commented Mar 5, 2024

Makes sense and sounds like an awesome speedup! ESS is currently one of the slowest estimators, but the same strategy might work for others. Please feel free to open a PR; joblib is already a dependency since it is used by sklearn. We can probably replace multiprocessing with joblib everywhere.

@JesseCresswell
Contributor Author

Closed by PR #24
