-
-
Notifications
You must be signed in to change notification settings - Fork 25.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Changed implementation of Birch.predict to use pairwise_distances… #16149
Conversation
…_chunked in order to reduce memory footprint (scikit-learn#16027)
Benchmark script + output performance_16027.txt Bellow examples of running the benchmark on several sample sizes.
Is there a location in the scikit-learn source code tree where I can put the performance test? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @alexshacked. I don't think we need to store the performance tests in the codebase. Your reports in the PR discussion should be enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. Thanks @alexshacked !
You're right @jeremiedbb. On large input size, the increase in memory doesen't go beyond the 1GB chunk size
|
Should we be passing Y_norm_squared as a parameter for euclidean_distances to avoid recomputing them? |
Right @jnothman. I get it. No need to calculate YY for each chunk. |
… when processing is not parallel (scikit-learn#16027)
@@ -579,10 +580,9 @@ def predict(self, X): | |||
""" | |||
X = check_array(X, accept_sparse='csr') | |||
self._check_fit(X) | |||
reduced_distance = safe_sparse_dot(X, self.subcluster_centers_.T) | |||
reduced_distance *= -2 | |||
reduced_distance += self._subcluster_norms |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant that we could pass these sub cluster norms into pairwise_distances_argmin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like this ?
kwargs = { 'Y_norm_squared': self._subcluster_norms}
return self.subcluster_labels_[
pairwise_distances_argmin(X, self.subcluster_centers_, metric_kwargs = kwargs)
]
…ough pairwise_distances_argmin (scikit-learn#16027)
@jnothman _subcluster_norms is passed to pairwise_distances_argmin. pairwise.py is not changed by this PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!!
Please add an |Efficiency|
entry to the change log at doc/whats_new/v0.23.rst
. Like the other entries there, please reference this pull request with :pr:
and credit yourself (and other contributors if applicable) with :user:
Thank you for the PR @alexshacked ! |
Fixes #16027
Attempting to reduce memory footprint of Birch.predict. Please see solution description at issue
#16027 (comment)
Benchmark script to be added soon