ENH Improves efficiency of ColumnTransformer for string keys #16431

Merged
glemaitre merged 10 commits into scikit-learn:master from thomasjpfan:faster_column_transformer on Feb 21, 2020

Conversation

@thomasjpfan

Member

Reference Issues/PRs

Fixes #16327

What does this implement/fix? Explain your changes.

Uses the pandas Index of the DataFrame columns to look up column indices, instead of searching a list of column names.
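The difference between the two lookups can be sketched as follows. This is a minimal illustration, not the code from the PR: the helper names `indices_via_list` and `indices_via_index` are hypothetical, and the tiny DataFrame is made up for the example. `list.index` scans the list for every name, while `pandas.Index.get_loc` uses the index's own label lookup.

```python
import pandas as pd


def indices_via_list(all_columns, columns):
    # Hypothetical helper: mirrors a list-based lookup.
    # list.index scans from the start of the list on every call.
    all_columns = list(all_columns)
    return [all_columns.index(col) for col in columns]


def indices_via_index(all_columns, columns):
    # Hypothetical helper: relies on pandas.Index.get_loc,
    # which returns an integer position when the label is unique.
    return [all_columns.get_loc(col) for col in columns]


df = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "c"])
wanted = ["c", "a"]

list_result = indices_via_list(df.columns, wanted)
index_result = indices_via_index(df.columns, wanted)
```

Both helpers return the same positions for unique column names; they differ only in how each position is found.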

Any other comments?

Here is a benchmark for this PR:

from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import make_column_transformer
import numpy as np
import pandas as pd

n_features = 10_000
selected_features = 7_000
columns = [f'col_{i}' for i in range(n_features)]
df = pd.DataFrame(np.ones((1, n_features)), columns=columns)

ct = make_column_transformer(*[(FunctionTransformer(), [col])
                               for col in columns[:selected_features]])

This PR

%%timeit
ct.fit_transform(df)
# 8.13 s ± 76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

master

%%timeit
ct.fit_transform(df)
# 16.4 s ± 62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@rth rth left a comment

Member

LGTM, aside from one comment below. Thanks @thomasjpfan!

Comment thread sklearn/utils/__init__.py
columns = list(key)

try:
    column_indices = [all_columns.index(col) for col in columns]

Member

Do we have a test that checks we fail gracefully with duplicate column names, where get_loc wouldn't behave as expected?

Member Author

We do not have a test for this behavior. The original implementation uses list.index, which returns the position of the first occurrence of the column name.

I would consider duplicate column names a bug and raise an error.

Member Author

Updated the PR with a check on the return value of get_loc.
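For context, `pandas.Index.get_loc` returns an integer only when the label is unique; for a duplicated label it returns a slice or a boolean mask instead. A check along the lines discussed here could look like the sketch below. The function name `column_index_or_error` and the error message are hypothetical, not taken from the PR.

```python
import numbers

import pandas as pd


def column_index_or_error(all_columns, col):
    # Hypothetical helper illustrating the kind of check discussed above.
    loc = all_columns.get_loc(col)
    # For duplicated labels get_loc returns a slice or a boolean mask,
    # so anything that is not an integer signals a non-unique column.
    if not isinstance(loc, numbers.Integral):
        raise ValueError(f"Selected column {col!r} is not unique")
    return loc


unique_cols = pd.Index(["a", "b", "c"])
dup_cols = pd.Index(["a", "b", "a"])

position = column_index_or_error(unique_cols, "b")

try:
    column_index_or_error(dup_cols, "a")
    duplicate_detected = False
except ValueError:
    duplicate_detected = True
```

With `list.index`, the duplicated name would silently resolve to its first position; the check turns that case into an explicit error.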

@jnothman

Member

Please add an Efficiency entry to the change log.

@glemaitre glemaitre left a comment

Member

Just some nitpicks, otherwise LGTM.

Comment thread sklearn/utils/__init__.py Outdated
column_indices = []
for col in columns:
    col_idx = all_columns.get_loc(col)
    if not isinstance(col_idx, int):

Member

Suggested change
- if not isinstance(col_idx, int):
+ if not isinstance(col_idx, numbers.Integer):

Member Author

numbers.Integer does not exist :(

Member

Oh, numbers.Integral, sorry.
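A short sketch of why the abstract base class matters for this kind of check: NumPy integer scalars are not instances of the built-in `int`, but they are registered as `numbers.Integral`. The variable names below are made up for illustration.

```python
import numbers

import numpy as np

np_int = np.int64(3)

# The built-in int check rejects NumPy integer scalars on Python 3.
is_builtin_int = isinstance(np_int, int)

# numbers.Integral accepts both Python ints and NumPy integer scalars.
np_is_integral = isinstance(np_int, numbers.Integral)
py_is_integral = isinstance(3, numbers.Integral)

# A slice, as returned by get_loc for duplicated sorted labels, is rejected.
slice_is_integral = isinstance(slice(0, 2), numbers.Integral)

# The numbers module has no attribute named Integer.
has_integer = hasattr(numbers, "Integer")
```

One caveat worth keeping in mind is that `bool` is also a `numbers.Integral`, so this check alone does not exclude `True` or `False`.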

Comment thread sklearn/utils/__init__.py Outdated
Comment thread sklearn/utils/tests/test_utils.py Outdated
Comment thread doc/whats_new/v0.23.rst
@thomasjpfan thomasjpfan changed the title from "[MRG] Improves efficiency of ColumnTransformer for string keys" to "ENH Improves efficiency of ColumnTransformer for string keys" on Feb 21, 2020
@glemaitre glemaitre merged commit f703b85 into scikit-learn:master Feb 21, 2020
@glemaitre

Member

Thanks @thomasjpfan

thomasjpfan added a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 22, 2020
panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020
gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
Development

Successfully merging this pull request may close these issues.

ColumnTransformer performance very slow when columns specification a very large list of column names

4 participants