Skip to content

FIX handle column names renaming in ColumnTransformer #28262

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Jan 31, 2024

Conversation

glemaitre
Copy link
Member

@glemaitre glemaitre commented Jan 25, 2024

closes #28260

The adapters in charge of renaming columns with set_output to a DataFrame-like object was failing in case that the original DataFrame-like as duplicated columns (that is possible in pandas.

We inverse the stacking and and renaming to avoid potential column duplicate that should not happen.

Copy link

github-actions bot commented Jan 25, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 4c16345. Link to the linter CI: here

@glemaitre
Copy link
Member Author

I could also use the code snippet of the original issue as an integration test. However, they weird stuff to be investigated in between:

  • Polars does not allow duplicated columns so it should be failing,
  • It does not fail on the issue snippet because the FunctionTransformer returns a pandas DataFrame while requesting a polars DataFrame.

So more to follow.

@glemaitre glemaitre added this to the 1.4.1 milestone Jan 25, 2024
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion about the error message.

continue
names_out = feature_names_outs[names_idx : names_idx + X.shape[1]]
adapter.rename_columns(X, names_out)
names_idx += X.shape[1]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this code ugly but I don't see how to improve it without reshaping boilerplate

@glemaitre
Copy link
Member Author

I think it is ready for a review @ogrisel @adrinjalali

There quite some boilerplate code for generating the error but I think this is OK. For the rest, I'm just annoyed between having a flat list and a list of list. But I don't see how to simplify if I want to avoid touching more code that is not related to this issue.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments but otherwise LGTM.

raise ValueError(
"Concatenating DataFrames from the transformer's output lead to"
" an inconsistent number of samples. The output may have Pandas"
" Indexes that do not match."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: we could actually check if that last sentence hold to make to the error message more precise (and not use "may" when it's not the case).

But let's keep that for a follow-up PR ;)

glemaitre and others added 3 commits January 26, 2024 19:51
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like there must be a better way to do it 😅

continue
names_out = feature_names_outs[names_idx : names_idx + X.shape[1]]
adapter.rename_columns(X, names_out)
names_idx += X.shape[1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_num_features and _num_samples instead of .shape?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we play with container that are dataframe-like container. So they will implement the shape.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM

@adrinjalali adrinjalali enabled auto-merge (squash) January 31, 2024 16:25
@adrinjalali adrinjalali merged commit 4a28ba3 into scikit-learn:main Jan 31, 2024
glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Feb 10, 2024
…8262)

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Feb 13, 2024
…8262)

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
glemaitre added a commit that referenced this pull request Feb 13, 2024
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ColumnTransformer output unexpected prefixed feature names from FunctionTransformer() step
3 participants