change stats.mode to collections.Counter, for better performance. #18987
Thank you for the PR @DavidKatz-il !
Could you post a benchmark in a comment comparing this implementation against main?
# string/object
import string
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')
np.random.seed(42)
length = 10000
arr = np.random.choice(list(string.ascii_lowercase), size=(length, length)).astype(object)  # object dtype so np.nan is stored as NaN, not cast to a string
flat = arr.flatten()
np.put(flat, np.random.choice(flat.shape[0], size=int(flat.shape[0] * 0.1), replace=False), np.nan)
arr = np.reshape(flat, (length, length))
print('simple imputer string/object:')
%time imputer.fit(pd.DataFrame(arr))
# numeric
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')
np.random.seed(42)
length = 10000
arr = np.random.uniform(size=(length, length))
flat = arr.flatten()
np.put(flat, np.random.choice(flat.shape[0], size=int(flat.shape[0] * 0.1), replace=False), np.nan)
arr = np.reshape(flat, (length, length))
print('simple imputer numeric:')
%time imputer.fit(pd.DataFrame(arr))
Indeed, as scikit-learn/sklearn/utils/_encode.py Lines 30 to 33 in c9677d6
LGTM
This looks good.
Please add an entry to the change log at doc/whats_new/v*.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
c931d20 to a24dced
Last nitpicks
Thanks @DavidKatz-il. So this is a fix that should be back-ported to 0.24.X for the release. I will label it.
Reference Issues/PRs
Fixes #18978
What does this implement/fix? Explain your changes.
As suggested in the issue, we replaced scipy.stats.mode with collections.Counter since it has better performance.
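For readers who want to see the idea in isolation, here is a minimal sketch of finding the most frequent value with collections.Counter. It is not the code from this PR: the helper name `most_frequent` is hypothetical, and the tie-breaking rule (smallest value wins) is an assumption chosen to mirror the behavior of scipy.stats.mode.

```python
from collections import Counter


def most_frequent(values):
    """Return (value, count) for the most common item in `values`.

    Hypothetical helper for illustration only; ties are resolved by
    picking the smallest value, which is an assumption made here to
    mimic scipy.stats.mode.
    """
    if len(values) == 0:
        raise ValueError("values must not be empty")
    # Counter hashes each item once, so counting is a single pass.
    counts = Counter(values)
    top_count = max(counts.values())
    # Among all values that reach the top count, keep the smallest one.
    top_value = min(v for v, c in counts.items() if c == top_count)
    return top_value, top_count


print(most_frequent(["a", "b", "b", "c", "c"]))  # ('b', 2)
```

Because Counter relies on hashing rather than sorting or repeated comparisons, it works on strings and other hashable objects, which is likely where most of the gain on object columns comes from.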