PERF Slightly speedup MinCovDet. #29835

Merged
ogrisel merged 1 commit into scikit-learn:main from anntzer:fastmcd on Sep 16, 2024

Conversation

@anntzer anntzer commented Sep 11, 2024

Contributor

`support` doesn't need to be repeatedly converted from a list of indices (from argsort) to a boolean mask (just do it once at the end); furthermore, the distances don't need to be fully sorted (in O(n log n)); rather, only the n_support first indices need to be selected (in O(n)).

Locally, this patch speeds up the following simple benchmark by ~15%.

    import sklearn.covariance
    import numpy as np
    np.random.seed(1)
    # unit gaussian plus 10% outliers
    t = np.concatenate([np.random.randn(1000, 2), [2, 4] * np.random.randn(100, 2)])
    %timeit sklearn.covariance.MinCovDet().fit(t).covariance_
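
A minimal sketch of the idea described above, not the code in this patch: keep the support as an index array across iterations, pick the `n_support` smallest distances with `np.argpartition`, and build the boolean mask a single time at the end. The helper names (`select_support_indices`, `indices_to_mask`), the toy refinement loop, and the choice of `n_support` are assumptions made for illustration.

```python
import numpy as np


def select_support_indices(dist, n_support):
    """Return indices of the n_support smallest distances (unordered)."""
    # argpartition guarantees the first n_support positions hold the
    # n_support smallest entries, without sorting them among themselves.
    return np.argpartition(dist, n_support - 1)[:n_support]


def indices_to_mask(support_indices, n_samples):
    """Convert an index array into a boolean mask of length n_samples."""
    mask = np.zeros(n_samples, dtype=bool)
    mask[support_indices] = True
    return mask


rng = np.random.default_rng(1)
# Same shape of data as the benchmark: unit Gaussian plus 10% outliers.
X = np.concatenate(
    [rng.standard_normal((1000, 2)), [2, 4] * rng.standard_normal((100, 2))]
)
n_samples = X.shape[0]
n_support = n_samples // 2 + 1  # hypothetical support size

# Toy refinement loop (illustrative only): the support stays an index array.
support_indices = rng.permutation(n_samples)[:n_support]
for _ in range(5):
    location = X[support_indices].mean(axis=0)
    covariance = np.cov(X[support_indices], rowvar=False)
    centered = X - location
    # Squared Mahalanobis distance of every sample to the current estimate.
    dist = ((centered @ np.linalg.pinv(covariance)) * centered).sum(axis=1)
    support_indices = select_support_indices(dist, n_support)

# The boolean mask is built once, after the loop.
support_mask = indices_to_mask(support_indices, n_samples)
```

Indexing `X[support_indices]` selects the same rows as indexing with the equivalent mask, so the mask is only needed for whatever is returned at the end.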

github-actions Bot commented Sep 11, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: fc772c7. Link to the linter CI: here

@RahulVadisetty91 RahulVadisetty91 left a comment

Use np.argsort to ensure exact n_support smallest distances are selected.

support_indices = np.argsort(dist)[:n_support]
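
For context on the suggestion above, a small check that `np.argpartition(dist, n_support - 1)[:n_support]` picks the same samples as `np.argsort(dist)[:n_support]` when the distances are distinct; only the order within the selection differs. The random `dist` array and the `n_support` value are made up for illustration. With tied distances at the cutoff, the two calls may pick different tied indices, though the selected distance values still match.

```python
import numpy as np

rng = np.random.default_rng(1)
dist = rng.random(1100)  # illustrative distances, distinct in practice
n_support = 551          # hypothetical support size

by_sort = np.argsort(dist)[:n_support]
by_partition = np.argpartition(dist, n_support - 1)[:n_support]

# Same set of samples, possibly in a different order.
same_indices = set(by_sort.tolist()) == set(by_partition.tolist())
# Same selected distance values once both selections are sorted.
same_values = np.array_equal(np.sort(dist[by_sort]), np.sort(dist[by_partition]))
print(same_indices, same_values)
```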

@jeremiedbb jeremiedbb left a comment

Member

Thanks for the PR @anntzer. Even if it's not a major speed improvement, the changes are minimal so it's a net improvement to me. LGTM

Comment thread sklearn/covariance/_robust_covariance.py Outdated

@ogrisel ogrisel left a comment

Member

The change LGTM. I timed a similar improvement. I expected the improvement to increase when increasing n_samples but it does not seem so. Anyways I agree with @jeremiedbb's comment above.

Please add a changelog entry in doc/whats_new/v1.6.rst.
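
To look at how the selection step alone scales with `n_samples`, a rough timing sketch that compares a full `np.argsort` against `np.argpartition` on random distances. It isolates only this step and says nothing about the rest of `MinCovDet.fit`; the helper name `time_selection`, the sizes, and the repeat count are assumptions, and timings will vary by machine.

```python
import timeit

import numpy as np


def time_selection(n_samples, repeat=20, seed=0):
    """Average seconds per call for argsort vs argpartition selection."""
    rng = np.random.default_rng(seed)
    dist = rng.random(n_samples)
    n_support = n_samples // 2 + 1  # hypothetical support size
    t_sort = timeit.timeit(lambda: np.argsort(dist)[:n_support], number=repeat)
    t_part = timeit.timeit(
        lambda: np.argpartition(dist, n_support - 1)[:n_support], number=repeat
    )
    return t_sort / repeat, t_part / repeat


for n in (1_000, 10_000, 100_000):
    t_sort, t_part = time_selection(n)
    print(f"n_samples={n}: argsort {t_sort:.2e}s, argpartition {t_part:.2e}s")
```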

Comment thread sklearn/covariance/_robust_covariance.py Outdated
Comment thread sklearn/covariance/_robust_covariance.py Outdated
`support` doesn't need to be repeatedly converted from a list of indices
(from argsort) to a boolean mask (just do it once at the end);
furthermore, the distances don't need to be fully sorted (in O(n log
n)); rather, only the n_support first indices need to be selected (in
O(n)).

Locally, this patch speeds up the following simple benchmark by ~15%.

    np.random.seed(1)
    # unit gaussian plus 10% outliers
    t = np.concatenate([np.random.randn(1000, 2), [2, 4] * np.random.randn(100, 2)])
    %timeit sklearn.covariance.MinCovDet().fit(t).covariance_

anntzer commented Sep 13, 2024

Contributor Author

Sure, renamed support to support_indices and added a changelog entry.

@ogrisel ogrisel left a comment

Member

LGTM. Thanks for the follow-up.
