
ENH Adds Column name consistency #18010

Merged: 71 commits, Aug 17, 2021

Conversation

@thomasjpfan (Member) commented Jul 27, 2020

Reference Issues/PRs

Towards #17407

@jnothman (Member) commented Aug 2, 2020

Thanks for piloting this. It's looking good. Ideally we release this on all estimators at once, but yes, there's little benefit in doing it all in one mammoth PR.

@thomasjpfan thomasjpfan marked this pull request as ready for review September 4, 2020 21:55
@thomasjpfan (Member, Author) left a comment:

This PR is ready to review. Currently it only adds column name consistency to dummy and impute, with a test to check for it. There is an ignore list of modules that the new test skips; as the consistency check is added to each module, it can be removed from the list.

The codecov error message is strange. I am pretty sure my tests cover the patch, and I can verify that on my local system (with --cov sklearn).

Review thread on sklearn/tests/test_common.py (outdated, resolved)
@amueller (Member) left a comment:

Looks good apart from minor nitpicks. Requires some dev docs, I think.

Review threads on sklearn/base.py (outdated, resolved)
@@ -793,7 +793,7 @@ def transform(self, X):
# Need not validate X again as it would have already been validated
# in the Imputer calling MissingIndicator
if not self._precomputed:
X = self._validate_input(X, in_fit=True)
Member:

Was that a bug? Can you add a regression test?

Review thread on sklearn/tests/test_array_out.py (outdated, resolved)


@pytest.mark.parametrize("array_type", ["dataframe", "dataarray"])
def test_pandas_get_feature_names(array_type):
Member:

Can you test with integer column names as well? And maybe a mix of integer, string, and object column names? You know, for fun?

Member:

I remember we had a conversation about what kind of column names we'd accept. And we kinda concluded that we'd restrict column names to be str, and I think we should stick to that?

Member:

I also intuitively feel that feature names implies str. It would be a good idea to add a test to check the error message we raise for invalid feature names.

Member Author:

I would be okay with this if all our get_feature_names implementations output strings, but for DictVectorizer we do not:

from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer(sparse=False)
X = [{2: 1, 4: 2}, {3: 3, 5: 1}]
v.fit(X)
v.get_feature_names()
# [2, 3, 4, 5]

On a second note, I think enforcing strings will break backward compatibility:

from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

X, _ = load_iris(as_frame=True, return_X_y=True)
X.columns = ['a', 'b', 2, 3]
scaler = StandardScaler().fit(X)
_ = scaler.transform(X) # currently works

Member:

I know integers as columns currently works, but I think we should deprecate the usage and start returning strings as column names ourselves. Having integers as column names (or row indices for that matter) creates a ton of silent bugs.

Member:

Becoming consistent would be great, and since ColumnTransformer also does not accept mixed data types (and I am not sure that we can get around that without adding some ambiguities), I would prefer to deprecate the mixed type support where it currently exists.

Member:

Another option would be to accept integer columns in _validate_data and automatically convert them to be stored as strings in feature_names_in_.

I am not sure this is a good idea though.

@thomasjpfan (Member, Author) commented Jul 17, 2021:

To summarize here are the options:

  1. Only store feature_names_in_ when column names are all strings. If a dataframe is passed with mixed or integer column names, show a warning stating that feature_names_in_ is not set.
  2. Support integer and/or mixed columns and deprecate.
  3. Convert to strings and store them into feature_names_in_ and warn that this is happening.
  4. Support any type of column names.

I am +0.5 for Option 2 as long as we update DictVectorizer in a way to always output strings. Option 3 will lead to weird edge cases with column names like: ['4', 4] (string and integer).

Option 4 is not consistent with ColumnTransformer. ColumnTransformer uses the names to actually slice the DataFrame, so it's reasonable to enforce strings there. All the other estimators only require an equality check, so I do not think we need to be restrictive on the column names.

To move this along I'll update the PR to option 2. (Support integer and/or mixed columns and deprecate.)

Member:

Do we need a SLEP or just a vote on how to handle column names? I think some of us would be happy with whatever pandas accepts, and some of us [really] don't want any column names other than simple strings.

Member Author:

How we handle names is fairly public, so having a short SLEP to formalize the behavior is useful.

Review threads on sklearn/tests/test_base.py (outdated, resolved) and sklearn/utils/estimator_checks.py (resolved)
@adrinjalali (Member) left a comment:

I'm generally happy with this PR; we just need to figure out some of the cases:

Review threads on sklearn/base.py (resolved)
return

fitted_feature_names = getattr(self, "feature_names_in_", None)
if fitted_feature_names is None:
Member:

Do we want to raise if (fitted_feature_names is None) xor (feature_names_in is None)? I think we should.

Member Author:

This means we would start warning when training on a numpy array and predicting on a pandas dataframe. What would be a good message?

  1. train on numpy and predict on dataframe: Convert your dataframe into a numpy array before predicting?
  2. train on dataframe and predict on numpy array: Convert your numpy array into a dataframe before predicting?

Member:

Maybe it would be enough to mention that the type of the array in fit is different from the one at predict, and that we expect a consistent type between fit and predict.

Member:

Actually, I don't know if we should raise an error or only emit a warning mentioning that no information about the columns would be available.

Member:

I'd be happy with a warning (maybe) but I think a warning when the types differ between fit and predict is a nice thing to have.

Member:

I'm ambivalent about a warning. On the one hand it seems dangerous because it means you can't check consistency, but on the other hand people will ignore warnings if there are too many. I think +0.5 for the warning?

Member Author:

> Maybe it would be enough to mention that the type of the array in fit is different than the one at predict and we expect a consistent type between fit and predict.

If we were to support xarray in the future, I do not think we can work strictly on type. It would have to be "are there feature names in fit" and "are there feature names in predict".

Updated the PR with the warnings.





Supports:
- pandas DataFrame
- xarray DataArray
Member:

Maybe elaborate on the requirements for the xarray coords to be considered as feature names.



def _get_feature_names(X):
"""Get feature names from X.
Member:

Should this function enforce string feature names?

Member:

I believe so, we can always relax that requirement later if we decide to.

@ogrisel (Member) left a comment:

Besides the existing comment and the additional comment below on the check for no_validation tagged estimators, this LGTM:




Review threads on sklearn/tests/test_base.py and sklearn/tests/test_common.py (outdated, resolved)



Review threads on sklearn/utils/estimator_checks.py (resolved)
@thomasjpfan (Member, Author):

> I'm not sure if I understand the question / suggestion. I really would prefer not to warn on all int column names.

It's more of a question about mixed columns or datetime column names. If we start raising errors in 1.2 because of scikit-learn's restriction on types in feature names, then we do not support "array-likes" anymore.

> I'm ok with not doing a consistency check for int, but can someone remind me what the reason for that was?

I think the reasons were:

  1. Consistent with ColumnTransformer
  2. Using ints for selection can be complicated
  3. Better to be restrictive now and add it in the future.

I'm still on the side of supporting ints.

@amueller (Member):

Sounds good! I like the PR as it's now.

@amueller (Member):

@ogrisel @glemaitre is your approval still valid? ;)

@glemaitre (Member):

yep

@ogrisel (Member) left a comment:

LGTM. Thank you very much for the follow-up. Merging!

@ogrisel ogrisel merged commit 416898b into scikit-learn:main Aug 17, 2021
@amueller (Member):

OMG!!!! 🥳

@GaelVaroquaux (Member):

Side note: using parquet rather than CSV seems to force column names to be strings by default: "0" (that just broke my code, but that's another story).

Maybe in the long term, a way forward is to push for more parquet?

@ogrisel (Member) commented Aug 20, 2021:

> Maybe in the long term, a way forward is to push for more parquet?

That would be an option to explore; in particular, what memory copies are involved when going from/to pandas to/from arrow data tables and other heterogeneous column containers.

ogrisel added a commit to scikit-learn-inria-fondation/follow-up that referenced this pull request Sep 7, 2021
## August 31st, 2021

### Gael

* TODO: Jeremy's renewal, Chiara's replacement, Mathis's consulting gig

### Olivier

- input feature names: main PR [#18010](scikit-learn/scikit-learn#18010) that links into sub PRs
  - remaining (need review): [#20853](scikit-learn/scikit-learn#20853) (found a bug in `OvOClassifier.n_features_in_`)
- reviewing `get_feature_names_out`: [#18444](scikit-learn/scikit-learn#18444)
- next: give feedback to Chiara on ARM wheel building [#20711](scikit-learn/scikit-learn#20711) (needed for the release)
- next: assist Adrin for the release process
- next: investigate regression in loky that blocks the cloudpickle release [#432](cloudpipe/cloudpickle#432)
- next: come back to intel to write a technical roadmap for a possible collaboration

### Julien

 - Was on holidays
 - Planned week @ Nexedi, Lille, from September 13th to 17th
 - Reviewed PRs
     - [`#20567`](scikit-learn/scikit-learn#20567) Common Private Loss module
     - [`#18310`](scikit-learn/scikit-learn#18310) ENH Add option to centered ICE plots (cICE)
     - Other PRs prior to holidays
 - [`#20254`](scikit-learn/scikit-learn#20254)
     - Adapted benchmarks on `pdist_aggregation` to test #20254 against sklearnex
     - Adapting PR for `fast_euclidean` and `fast_sqeuclidean` on user-facing APIs
     - Next: comparing against scipy's 
     - Next: Having feedback on [#20254](scikit-learn/scikit-learn#20254) would also help
- Next: I need to block time to study Cython code.

### Mathis
- `sklearn_benchmarks`
  - Adapting benchmark script to run on Margaret
  - Fix issue with profiling files too big to be deployed on Github Pages
  - Ensure deterministic benchmark results
  - Working on declarative pipeline specification
  - Next: run long HPO benchmarks on Margaret

### Arturo

- Finished MOOC!
- Finished filling [Loïc's notes](https://notes.inria.fr/rgSzYtubR6uSOQIfY9Fpvw#) to find questions with score under 60% (Issue [#432](INRIA/scikit-learn-mooc#432))
    - started addressing easy-to-fix questions, resulting in gitlab MRs [#21](https://gitlab.inria.fr/learninglab/mooc-scikit-learn/mooc-scikit-learn-coordination/-/merge_requests/21) and [#22](https://gitlab.inria.fr/learninglab/mooc-scikit-learn/mooc-scikit-learn-coordination/-/merge_requests/22)
    - currently working on expanding the notes up to 70%
- Continued cross-linking forum posts with issues in GitHub, resulting in [#444](INRIA/scikit-learn-mooc#444), [#445](INRIA/scikit-learn-mooc#445), [#446](INRIA/scikit-learn-mooc#446), [#447](INRIA/scikit-learn-mooc#447) and [#448](INRIA/scikit-learn-mooc#448)

### Jérémie
- back from holidays, catching up
- Mathis' benchmarks
- trying to find what's going on with ASV benchmarks
  (asv should display the versions of all build and runtime dependencies for each run)

### Guillaume

- back from holidays
- Next:
    - release with Adrin
    - check the PR and issue trackers

### TODO / Next

- Expand Loïc’s notes up to 70% (Arturo)
- Create presentation to discuss my experience doing the MOOC (Arturo)
- Help with the scikit-learn release (Olivier, Guillaume)
- HR: Jeremy's renewal, Chiara's replacement (Gael)
- Mathis's consulting gig (Olivier, Gael, Mathis)
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@avm19 (Contributor) left a comment:

@thomasjpfan I think I found something.

Comment on lines +478 to +481
if not missing_names and not missing_names:
message += (
"Feature names must be in the same order as they were in fit.\n"
)
Contributor:

Is this a typo? I guess the intended line 478 is
if not missing_names and not unexpected_names:

@thomasjpfan (Member, Author) commented Apr 9, 2022:

Yeah, it's a typo. Are you interested in opening a PR to fix it?

Contributor:

Sure! (It is difficult to say no at this point!)

Contributor:

Pull request created: #23091
