Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ColumnTransformer input feature name and count validation #14544

Merged
merged 10 commits into from Aug 7, 2019

Conversation

adrinjalali
Copy link
Member

Fixes #14251.

This raises a deprecation warning if the X at the time of fit and transform don't match in terms of feature names or count. It implements option (1) of #14251 (comment)

@amueller
Copy link
Member

amueller commented Aug 1, 2019

now you only need to adjust the tests ;)

@adrinjalali
Copy link
Member Author

The failing test is test_sag_regressor_computed_correctly, not related.

@NicolasHug do you agree with this version?

@NicolasHug
Copy link
Member

My general thought is that we should have gone for a stricter version of #14237 and just error for any discrepancy match, as a bug fix.

But OK for doing a deprecation. I'm OK with this but we still need to error when n_features_fit < n_features_transform and negative indices are used. This is a bug that isn't caught right now, see #14251 (comment)

@jnothman
Copy link
Member

jnothman commented Aug 5, 2019

I agree with all of that @NicolasHug. I lean a little more towards a deprecation.

@adrinjalali
Copy link
Member Author

I added a test for the case where the indexing is negative, I'd appreciate a close review on that part.

sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/compose/_column_transformer.py Show resolved Hide resolved
@adrinjalali
Copy link
Member Author

@amueller you fine with the changes?

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few comments, other than that LGTM

X_array_fewer = np.array([[0, 1, 2], ]).T
err_msg = 'Number of features'
with pytest.raises(ValueError, match=err_msg):
ct.transform(X_array_fewer)
with pytest.warns(DeprecationWarning, match=msg):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why warn when passing less features? I thought the only scenario we still accept is passing more features, not less?

This will be confusing for users since the message says it should not error until 0.24

with pytest.raises(ValueError, match=err_msg):
tf.transform(X_trans_df)
with pytest.warns(DeprecationWarning, match=warn_msg):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, why warn here since we error as well?


tf = ColumnTransformer([('bycol', Trans(), [0, -1])])
tf.fit(df_extra)
msg = "At least one negative column was used to"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also check that no warning is raised when n_features matches with negative indexing? Just in case.

@adrinjalali
Copy link
Member Author

@NicolasHug done :)

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Adrin, LGTM if all goes green

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RFC ColumnTransformer input validation and requirements
4 participants