Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] EHN add support for scalar, slice and mask in safe_indexing axis=0 #14475

Merged
merged 55 commits into from Sep 4, 2019

Conversation

glemaitre
Copy link
Contributor

@glemaitre glemaitre commented Jul 25, 2019

This is the follow-up of #14035.

It is a refactoring of safe_indexing splitting the X indexing depending on the type and reorganizing the tests as well.

@glemaitre glemaitre changed the title EHN add support for scalar, slice and mask in safe_indexing axis=0 [MRG] EHN add support for scalar, slice and mask in safe_indexing axis=0 Jul 25, 2019
@glemaitre
Copy link
Contributor Author

@glemaitre glemaitre commented Jul 25, 2019

@jnothman @NicolasHug As promised with a bit of delay, I draft something to make safe_indexing consistent. Happy to have feedback.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Nice! :)

sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_utils.py Outdated Show resolved Hide resolved
Copy link
Member

@NicolasHug NicolasHug left a comment

Dumb question, but is there a real use-case for this?

doc/whats_new/v0.22.rst Outdated Show resolved Hide resolved
sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/__init__.py Outdated Show resolved Hide resolved
@glemaitre
Copy link
Contributor Author

@glemaitre glemaitre commented Jul 30, 2019

Dumb question, but is there a real use-case for this?

Passing a mask array to select sample does not seem inappropriate.
This is feature is mainly to make safe_indexing consistent after @jnothman had concerned about it. I agree with him that having a close behaviour and support for axis=0 and axis=1 would make sense.

sklearn/utils/__init__.py Outdated Show resolved Hide resolved
Copy link
Member

@jnothman jnothman left a comment

Does this mean that previously unsupported code will now work, such as cv splitters returning masks or feature selectors...? Are we being too lenient in our APIs then?

sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/__init__.py Outdated Show resolved Hide resolved
- To select multiple elements (i.e. rows or columns), `indices` can be
one of the following: `list`, `array`, `slice`. The type used in
these containers can be one of the following: `int`, `bool`, and
`str`. `str` is only supported when `X` is a dataframe.
Copy link
Member

@NicolasHug NicolasHug Aug 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are slices of bool and str?

Would this work?

To select multiple elements (i.e. rows or columns), indices can be
a slice, or a container (list orarray). The type used in
these containers can be one of the following: int, bool, and
str. str is only supported when X is a dataframe.

Copy link
Contributor Author

@glemaitre glemaitre Aug 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slice of bool will give you the first line (it casting stuff into int). We should raise an error for such case I think.

slice of string is useful with dataframe:

df.loc[slice('xxx', 'yyy')]

I see that we have an issue because we only use iloc and then it would not work. So the docstring is wrong

Copy link
Member

@NicolasHug NicolasHug Aug 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really want to support slices of strings? I'd say we don't for now to keep it simple?

Copy link
Contributor Author

@glemaitre glemaitre Aug 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is already supported for the column with dataframe

Copy link
Contributor Author

@glemaitre glemaitre Aug 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is:

from sklearn.datasets import fetch_openml
iris = fetch_openml('iris', as_frame=True)
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import make_column_transformer
transformer = make_column_transformer(
    (FunctionTransformer(), slice('sepallength', 'sepalwidth'))
)
transformer.fit_transform(iris.data, iris.target)

@glemaitre
Copy link
Contributor Author

@glemaitre glemaitre commented Aug 1, 2019

I refactor the code which should make it more understandable but less easy to review by looking at the diff (sorry for that). I think that I will need to do the same regarding the tests if we want to keep this indexing manageable.

So before to go further, I think that we need to answer a couple of questions:

  • for axis=1, X cannot be a list?
  • do we want to support slicing with string for both axis=0 and axis=1?
  • we can now index using boolean array-like or arrays. I first thought that @jnothman wanted this (#14035 (comment)) but his last comment #14475 (review) question this additional support.

I think that this is the main blocker here.

Copy link
Member

@amueller amueller left a comment

I'm not sure if I understand what's going on. Are we considering safe_indexing not public and so we can break backward-compatibility?

sklearn/impute/_iterative.py Outdated Show resolved Hide resolved
sklearn/metrics/cluster/unsupervised.py Outdated Show resolved Hide resolved
@glemaitre
Copy link
Contributor Author

@glemaitre glemaitre commented Aug 19, 2019

I'm not sure if I understand what's going on. Are we considering safe_indexing not public and so we can break backward-compatibility?

As mentioned in one of my comment, I reintroduced support for boolean array-like and integer slice meaning since we were supporting it.

@jnothman
Copy link
Member

@jnothman jnothman commented Aug 20, 2019

Copy link
Member

@jnothman jnothman left a comment

I'm somewhat confused by this _check_key_type(key, superclass) design. It repeats very similar checks for each of three classes, but also has specialised code paths depending on superclass. Why not _determine_key_type(key) -> {'s', 'i', 'b', 'none-slice'} called exactly once per safe_indexing?

"""Index a pandas dataframe or a series."""
# check whether we should index with loc or iloc
if _check_key_type(key, int):
by_name = False
Copy link
Member

@jnothman jnothman Aug 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we be raising a warning/error if X.index.dtype.kind in 'iu' (and similar on the other axis)? This won't handle the case where there are ints of object dtype in the index.


if hasattr(key, 'shape'):
# Work-around for indexing with read-only key in pandas
# FIXME: solved in pandas 0.25
Copy link
Member

@jnothman jnothman Aug 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use an explicit version comparison to take this codepath?

Copy link
Contributor Author

@glemaitre glemaitre Aug 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then we need to make an explicit pandas import

sklearn/utils/__init__.py Outdated Show resolved Hide resolved
if np.isscalar(key) or isinstance(key, slice):
# key is a slice or a scalar
return X[key]
key_set = set(key)
Copy link
Member

@jnothman jnothman Aug 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

key will usually be an array, won't it? Should we cast into an array at some point?

Copy link
Contributor Author

@glemaitre glemaitre Aug 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if we need to do so. We can always postpone until required.

sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/__init__.py Outdated Show resolved Hide resolved
return None
if isinstance(key, tuple(dtype_to_str.keys())):
try:
return dtype_to_str[type(key)]
Copy link
Member

@jnothman jnothman Aug 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this is the right behaviour for a scalar bool

Copy link
Contributor Author

@glemaitre glemaitre Aug 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean? bool and np.bool_ will return the string 'bool'. What would you expect instead?

Note that in safe indexing, we are not using scalar bool to index.

Copy link
Member

@jnothman jnothman Aug 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that in safe indexing, we are not using scalar bool to index.

Okay ;)

sklearn/utils/__init__.py Outdated Show resolved Hide resolved
return set_type
if hasattr(key, 'dtype'):
try:
return array_dtype_to_str[key.dtype.kind]
Copy link
Member

@jnothman jnothman Aug 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is behaviour with dtype='O' where the array contains only ints or only bools correct? (How does numpy handle this?)

Copy link
Contributor Author

@glemaitre glemaitre Aug 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumPy will return 'O' and nothing about 'i' or 'b' so it will fail. I am not really sure how to solve this issue without adding so much complexity.

One would need to try converting the array to bool and int first. However, I thought that we are usually dtype=object to handle string.

Copy link
Member

@thomasjpfan thomasjpfan Aug 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This maintains the behavior on master where we use ('O', 'U', 'S') to mean string:

if hasattr(key, 'dtype'):
if superclass is int:
return key.dtype.kind == 'i'
elif superclass is bool:
return key.dtype.kind == 'b'
else:
# superclass = str
return key.dtype.kind in ('O', 'U', 'S')

Copy link
Member

@jnothman jnothman left a comment

I've not reviewed tests thoroughly, but I'm pretty happy with this!

sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/__init__.py Outdated Show resolved Hide resolved
glemaitre and others added 2 commits Aug 21, 2019
Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>
@rth
Copy link
Member

@rth rth commented Sep 4, 2019

Looks like there are two approvals. Anything blocking the merge?

@glemaitre
Copy link
Contributor Author

@glemaitre glemaitre commented Sep 4, 2019

I think this is ready to be merged. I address all comments.

@jnothman jnothman merged commit ae5f558 into scikit-learn:master Sep 4, 2019
17 checks passed
@jnothman
Copy link
Member

@jnothman jnothman commented Sep 4, 2019

Thanks @glemaitre

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants