Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Improves efficiency of ColumnTransformer for string keys #16431

Merged
merged 10 commits into from Feb 21, 2020

Conversation

thomasjpfan
Copy link
Member

Reference Issues/PRs

Fixes #16327

What does this implement/fix? Explain your changes.

Uses the pandas index to find column indices

Any other comments?

Is a benchmark for this PR

from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import make_column_transformer
import numpy as np
import pandas as pd

n_features = 10_000
selected_features = 7_000
columns = [f'col_{i}' for i in range(n_features)]
df = pd.DataFrame(np.ones((1, n_features)), columns=columns)

ct = make_column_transformer(*[(FunctionTransformer(), [col])
                               for col in columns[:selected_features]])

This PR

%%timeit
ct.fit_transform(df)
# 8.13 s ± 76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

master

%%timeit
ct.fit_transform(df)
# 16.4 s ± 62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, aside for one comment below. Thanks @thomasjpfan !

else:
stop = n_columns + 1
return list(range(n_columns)[slice(start, stop)])
else:
columns = list(key)

try:
column_indices = [all_columns.index(col) for col in columns]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have some test, to gracefully fail with duplicate columns names, where get_loc wouldn't behave as expected?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not have a test for this behavior. The original behavior uses index, which returns the first appearance of the column name.

I would consider this a bug and error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated PR with a check on the return of get_loc.

@jnothman
Copy link
Member

Please add an Efficiency entry to the change log

Copy link
Contributor

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some nitpicks otherwise LGTM

column_indices = []
for col in columns:
col_idx = all_columns.get_loc(col)
if not isinstance(col_idx, int):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if not isinstance(col_idx, int):
if not isinstance(col_idx, numbers.Integer):

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numbers.Integer does not exist :(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh Integral sorry

sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_utils.py Outdated Show resolved Hide resolved
doc/whats_new/v0.23.rst Show resolved Hide resolved
@glemaitre glemaitre added this to REVIEWED AND WAITING FOR CHANGES in Guillaume's pet Feb 19, 2020
@thomasjpfan thomasjpfan changed the title [MRG] Improves efficiency of ColumnTransformer for string keys ENH Improves efficiency of ColumnTransformer for string keys Feb 21, 2020
@glemaitre glemaitre merged commit f703b85 into scikit-learn:master Feb 21, 2020
@glemaitre
Copy link
Contributor

Thanks @thomasjpfan

thomasjpfan added a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 22, 2020
@glemaitre glemaitre moved this from REVIEWED AND WAITING FOR CHANGES to MERGED in Guillaume's pet Feb 23, 2020
panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020
gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

ColumnTransformer performance very slow when columns specification a very large list of column names
4 participants