
ENH Adds Column name consistency #18010

Merged: 71 commits, Aug 17, 2021

Conversation

@thomasjpfan (Member) commented Jul 27, 2020

Reference Issues/PRs

Towards #17407

@jnothman (Member) commented Aug 2, 2020

Thanks for piloting this. It's looking good. Ideally we release this on all estimators at once, but yes, there's little benefit in doing it all in one mammoth PR.

@thomasjpfan thomasjpfan marked this pull request as ready for review September 4, 2020 21:55
@thomasjpfan (Member, Author) left a comment:

This PR is ready to review. Currently it only adds column name consistency to dummy and impute, with a test to check for it. There is an ignore list of modules that the new test skips; as the consistency check is added to each module, it can be removed from the list.

The codecov error message is strange. I am pretty sure my tests cover the patch, and I can verify that on my local system (with --cov sklearn).

Review thread on sklearn/tests/test_common.py (outdated, resolved)
@amueller (Member) left a comment:

Looks good apart from minor nitpicks. Requires some dev docs, I think.

Review threads on sklearn/base.py (outdated, resolved)
@@ -793,7 +793,7 @@ def transform(self, X):
# Need not validate X again as it would have already been validated
# in the Imputer calling MissingIndicator
if not self._precomputed:
X = self._validate_input(X, in_fit=True)
Member:

Was that a bug? Can you add a regression test?

Review thread on sklearn/tests/test_array_out.py (outdated, resolved)


@pytest.mark.parametrize("array_type", ["dataframe", "dataarray"])
def test_pandas_get_feature_names(array_type):
Member:

Can you test with integer column names as well? And maybe a mix of integer, string, and object column names? You know, for fun?

Member:

I remember we had a conversation about what kind of column names we'd accept. And we kinda concluded that we'd restrict column names to be str, and I think we should stick to that?

Member:

I also intuitively feel that feature names implies str. It would be a good idea to add a test to check the error message we raise for invalid feature names.

Member Author:

I would be okay with this if all our get_feature_names implementations output strings, but for DictVectorizer we do not:

from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer(sparse=False)
X = [{2: 1, 4: 2}, {3: 3, 5: 1}]
v.fit(X)
v.get_feature_names()
# [2, 3, 4, 5]

On a second note, I think enforcing strings will break backward compatibility:

from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

X, _ = load_iris(as_frame=True, return_X_y=True)
X.columns = ['a', 'b', 2, 3]
scaler = StandardScaler().fit(X)
_ = scaler.transform(X) # currently works

Member:

I know integers as columns currently works, but I think we should deprecate the usage and start returning strings as column names ourselves. Having integers as column names (or row indices for that matter) creates a ton of silent bugs.

Member:

Becoming consistent would be great, and since ColumnTransformer also does not accept mixed data types (and I am not sure that we can get around that without adding some ambiguities), I would prefer to deprecate the mixed type support where it currently exists.

Member:

Another option would be to accept integer columns in _validate_data and automatically convert them to be stored as strings in feature_names_in_.

I am not sure this is a good idea though.

@thomasjpfan (Member, Author) commented Jul 17, 2021:

To summarize here are the options:

  1. Only store feature_names_in_ when column names are all strings. If a dataframe is passed with mixed or integer column names, show a warning stating that feature_names_in_ is not set.
  2. Support integer and/or mixed columns and deprecate.
  3. Convert to strings and store them into feature_names_in_ and warn that this is happening.
  4. Support any type of column names.

I am +0.5 for Option 2 as long as we update DictVectorizer in a way to always output strings. Option 3 will lead to weird edge cases with column names like: ['4', 4] (string and integer).

Option 4 is not consistent with ColumnTransformer. ColumnTransformer uses the names to actually slice the DataFrame, so it's reasonable to enforce strings there. All the other estimators only require an equality check, so I do not think we need to be restrictive on the column names.

To move this along I'll update the PR to option 2. (Support integer and/or mixed columns and deprecate.)

Member:

Do we need a SLEP or just a vote on how to handle column names? I think some of us would be happy with whatever pandas accepts, and some of us [really] don't want any column names other than simple strings.

Member Author:

How we handle names is fairly public, so having a short SLEP to formalize the behavior is useful.

Review threads on sklearn/tests/test_base.py (outdated, resolved) and sklearn/utils/estimator_checks.py (resolved)
@adrinjalali (Member) left a comment:

I'm generally happy with this PR; we just need to figure out some of the cases:

Review threads on sklearn/base.py (resolved)
return

fitted_feature_names = getattr(self, "feature_names_in_", None)
if fitted_feature_names is None:
Member:

Do we want to raise if (fitted_feature_names is None) xor (feature_names_in is None)? I think we should.

Member Author:

This means we would start warning when training on a numpy array and predicting on a pandas dataframe. What would be a good message?

  1. train on numpy and predict on dataframe: Convert your dataframe into a numpy array before predicting?
  2. train on dataframe and predict on numpy array: Convert your numpy array into a dataframe before predicting?

Member:

Maybe it would be enough to mention that the type of the array in fit is different from the one at predict, and that we expect a consistent type between fit and predict.

Member:

Actually, I don't know if we should raise an error or only emit a warning mentioning that no information about the columns would be available.

Member:

I'd be happy with a warning (maybe) but I think a warning when the types differ between fit and predict is a nice thing to have.

Member:

I'm ambivalent about a warning. On the one hand it seems dangerous because it means you can't check consistency, but on the other hand people will ignore warnings if there are too many. I think +0.5 for the warning?

Member Author:

> Maybe it would be enough to mention that the type of the array in fit is different than the one at predict and we expect a consistent type between fit and predict.

If we were to support xarray in the future, I do not think we can work strictly on type. It would have to be "are there feature names in fit" and "are there feature names in predict".

Updated the PR with the warnings.





Supports:
- pandas DataFrame
- xarray DataArray
Member:

Maybe elaborate on the requirements for the xarray coords to be considered as feature names.



def _get_feature_names(X):
"""Get feature names from X.
Member:

Should this function enforce string feature names?

Member:

I believe so, we can always relax that requirement later if we decide to.

@ogrisel (Member) left a comment:

Besides the existing comment and the additional comment below on the check for no_validation tagged estimators, this LGTM:




Review threads on sklearn/tests/test_base.py and sklearn/tests/test_common.py (outdated, resolved)



Review threads on sklearn/utils/estimator_checks.py (resolved)
@thomasjpfan (Member, Author):

> I'm not sure if I understand the question / suggestion. I really would prefer not to warn on all int column names.

It's more of a question about mixed columns or datetime column names. If we start raising errors in 1.2 because of scikit-learn's restriction on types in feature names, then we do not support "array-likes" anymore.

> I'm ok with not doing a consistency check for int, but can someone remind me what the reason for that was?

I think the reasons were:

  1. Consistent with ColumnTransformer
  2. Using ints for selection can be complicated
  3. Better to be restrictive now and add it in the future.

I'm still on the side of supporting ints.

@amueller (Member):

Sounds good! I like the PR as it's now.

@amueller (Member):

@ogrisel @glemaitre is your approval still valid? ;)

@glemaitre (Member):

yep

@ogrisel (Member) left a comment:

LGTM. Thank you very much for the follow-up. Merging!

@ogrisel ogrisel merged commit 416898b into scikit-learn:main Aug 17, 2021
@amueller (Member):

OMG!!!! 🥳

@GaelVaroquaux (Member):

Side note: using parquet rather than CSV seems to force column names to be strings by default: "0" (that just broke my code, but that's another story).

Maybe in the long term, a way forward is to push for more parquet?

@ogrisel (Member) commented Aug 20, 2021:

> Maybe in the long term, a way forward is to push for more parquet?

That would be an option to explore; in particular, what memory copies are involved when going from/to pandas to/from arrow data tables and other heterogeneous column containers.

ogrisel added a commit to scikit-learn-inria-fondation/follow-up that referenced this pull request Sep 7, 2021
## August 31st, 2021

### Gael

* TODO: Jeremy's renewal, Chiara's replacement, Mathis's consulting gig

### Olivier

- input feature names: main PR [#18010](scikit-learn/scikit-learn#18010) that links into sub PRs
  - remaining (need review): [#20853](scikit-learn/scikit-learn#20853) (found a bug in `OvOClassifier.n_features_in_`)
- reviewing `get_feature_names_out`: [#18444](scikit-learn/scikit-learn#18444)
- next: give feedback to Chiara on ARM wheel building [#20711](scikit-learn/scikit-learn#20711) (needed for the release)
- next: assist Adrin for the release process
- next: investigate regression in loky that blocks the cloudpickle release [#432](cloudpipe/cloudpickle#432)
- next: come back to intel to write a technical roadmap for a possible collaboration

### Julien

 - Was on holidays
 - Planned week @ Nexedi, Lille, from September 13th to 17th
 - Reviewed PRs
     - [`#20567`](scikit-learn/scikit-learn#20567) Common Private Loss module
     - [`#18310`](scikit-learn/scikit-learn#18310) ENH Add option to centered ICE plots (cICE)
     - Other PRs prior to holidays
 - [`#20254`](scikit-learn/scikit-learn#20254)
     - Adapted benchmarks on `pdist_aggregation` to test #20254 against sklearnex
     - Adapting PR for `fast_euclidean` and `fast_sqeuclidean` on user-facing APIs
     - Next: comparing against scipy's 
     - Next: Having feedback on [#20254](scikit-learn/scikit-learn#20254) would also help
- Next: I need to block time to study Cython code.

### Mathis
- `sklearn_benchmarks`
  - Adapting benchmark script to run on Margaret
  - Fix issue with profiling files too big to be deployed on Github Pages
  - Ensure deterministic benchmark results
  - Working on declarative pipeline specification
  - Next: run long HPO benchmarks on Margaret

### Arturo

- Finished MOOC!
- Finished filling [Loïc's notes](https://notes.inria.fr/rgSzYtubR6uSOQIfY9Fpvw#) to find questions with score under 60% (Issue [#432](INRIA/scikit-learn-mooc#432))
    - started addressing easy-to-fix questions, resulting in gitlab MRs [#21](https://gitlab.inria.fr/learninglab/mooc-scikit-learn/mooc-scikit-learn-coordination/-/merge_requests/21) and [#22](https://gitlab.inria.fr/learninglab/mooc-scikit-learn/mooc-scikit-learn-coordination/-/merge_requests/22)
    - currently working on expanding the notes up to 70%
- Continued cross-linking forum posts with issues in GitHub, resulting in [#444](INRIA/scikit-learn-mooc#444), [#445](INRIA/scikit-learn-mooc#445), [#446](INRIA/scikit-learn-mooc#446), [#447](INRIA/scikit-learn-mooc#447) and [#448](INRIA/scikit-learn-mooc#448)

### Jérémie
- back from holidays, catching up
- Mathis' benchmarks
- trying to find what's going on with ASV benchmarks
  (asv should display the versions of all build and runtime dependencies for each run)

### Guillaume

- back from holidays
- Next:
    - release with Adrin
    - check the PR and issue trackers

### TODO / Next

- Expand Loïc’s notes up to 70% (Arturo)
- Create presentation to discuss my experience doing the MOOC (Arturo)
- Help with the scikit-learn release (Olivier, Guillaume)
- HR: Jeremy's renewal, Chiara's replacement (Gael)
- Mathis's consulting gig (Olivier, Gael, Mathis)
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@avm19 (Contributor) left a comment:

@thomasjpfan I think I found something.

Comment on lines +478 to +481
if not missing_names and not missing_names:
message += (
"Feature names must be in the same order as they were in fit.\n"
)
Contributor:

Is this a typo? I guess the intended line 478 is
if not missing_names and not unexpected_names:

@thomasjpfan (Member, Author) commented Apr 9, 2022:

Yeah, it's a typo. Are you interested in opening a PR to fix it?

Contributor:

Sure! (It is difficult to say no at this point!)

Contributor:

Pull request created: #23091
