Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

change stats.mode to collections.Counter, for better performance. #18987

Merged
merged 8 commits into from Dec 14, 2020

Conversation

DavidKatz-il
Copy link
Contributor

Reference Issues/PRs

Fixes #18978

What does this implement/fix? Explain your changes.

As suggested in the issue we replaced scipy.stats.mode with collections.Counter since it has better performance.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @DavidKatz-il !

May you provide a benchmark as a comment to compare this implementation to main repo?

sklearn/impute/_base.py Show resolved Hide resolved
@DavidKatz-il
Copy link
Contributor Author

# string/object
import string
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='most_frequent')
np.random.seed(42)
length = 10000
arr = np.random.choice(list(string.ascii_lowercase),  size=(length, length))
flat = arr.flatten()
np.put(flat, np.random.choice(flat.shape[0], size=int(flat.shape[0] * 0.1), replace=False), np.nan)
arr = np.reshape(flat, (length, length))
print('simple imputer string/object:')
%time imputer.fit(pd.DataFrame(arr))

# numeric
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='most_frequent')
np.random.seed(42)
length = 10000
arr = np.random.uniform(size=(length, length))
flat = arr.flatten()
np.put(flat, np.random.choice(flat.shape[0], size=int(flat.shape[0] * 0.1), replace=False), np.nan)
arr = np.reshape(flat, (length, length))
print('simple imputer numeric:')
%time imputer.fit(pd.DataFrame(arr))
ArraySize stats numeric counter numeric stats string counter string
100 X 100 16 ms 7 ms 20 ms 7 ms
1000 X 1000 250 ms 373 ms 689 ms 132 ms
10000 X 10000 11.5 s 38.4 s 1min 23s 16.4 s

@glemaitre
Copy link
Contributor

Indeed, as unique in the encoder, we should use Counter only on object dtype and otherwise the mode strategy:

if values.dtype == object:
return _unique_python(values, return_inverse=return_inverse)
# numerical
out = np.unique(values, return_inverse=return_inverse)

Copy link
Member

@amueller amueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good.

Please add an entry to the change log at doc/whats_new/v*.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

sklearn/impute/_base.py Outdated Show resolved Hide resolved
sklearn/impute/_base.py Show resolved Hide resolved
sklearn/impute/_base.py Show resolved Hide resolved
sklearn/impute/tests/test_impute.py Outdated Show resolved Hide resolved
sklearn/impute/tests/test_impute.py Outdated Show resolved Hide resolved
Copy link
Contributor

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last nitpicks

doc/whats_new/v0.24.rst Outdated Show resolved Hide resolved
sklearn/impute/tests/test_impute.py Show resolved Hide resolved
sklearn/impute/tests/test_impute.py Show resolved Hide resolved
@glemaitre
Copy link
Contributor

Thanks @DavidKatz-il

So this a fix that should be back-ported in 0.24.X for the release. I will label it.

@glemaitre glemaitre merged commit 0937b4a into scikit-learn:master Dec 14, 2020
5 checks passed
@glemaitre glemaitre added this to the 0.24 milestone Dec 14, 2020
@ogrisel ogrisel added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label Dec 16, 2020
@DavidKatz-il DavidKatz-il deleted the stats-to-counter-18978 branch December 16, 2020 20:29
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Dec 22, 2020
@glemaitre glemaitre mentioned this pull request Dec 22, 2020
14 tasks
@glemaitre glemaitre mentioned this pull request Apr 22, 2021
12 tasks
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Apr 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:impute To backport PR merged in master that need a backport to a release branch defined based on the milestone.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SimpleImputer performance on strings
5 participants