Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] EHN add support for scalar, slice and mask in safe_indexing axis=0 #14475

Merged
merged 55 commits into from Sep 4, 2019

Conversation

@glemaitre
Copy link
Contributor

glemaitre commented Jul 25, 2019

This is the follow-up of #14035.

It is a refactoring of safe_indexing splitting the X indexing depending on the type and reorganizing the tests as well.

@glemaitre glemaitre changed the title EHN add support for scalar, slice and mask in safe_indexing axis=0 [MRG] EHN add support for scalar, slice and mask in safe_indexing axis=0 Jul 25, 2019
@glemaitre

This comment has been minimized.

Copy link
Contributor Author

glemaitre commented Jul 25, 2019

@jnothman @NicolasHug As promised with a bit of delay, I draft something to make safe_indexing consistent. Happy to have feedback.

glemaitre added 2 commits Jul 25, 2019
Copy link
Member

thomasjpfan left a comment

Nice! :)

sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_utils.py Outdated Show resolved Hide resolved
Copy link
Contributor

NicolasHug left a comment

Dumb question, but is there a real use-case for this?

doc/whats_new/v0.22.rst Outdated Show resolved Hide resolved
sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/__init__.py Outdated Show resolved Hide resolved
glemaitre added 5 commits Jul 29, 2019
fix
@glemaitre

This comment has been minimized.

Copy link
Contributor Author

glemaitre commented Jul 30, 2019

Dumb question, but is there a real use-case for this?

Passing a mask array to select sample does not seem inappropriate.
This is feature is mainly to make safe_indexing consistent after @jnothman had concerned about it. I agree with him that having a close behaviour and support for axis=0 and axis=1 would make sense.

sklearn/utils/__init__.py Outdated Show resolved Hide resolved
Copy link
Member

jnothman left a comment

Does this mean that previously unsupported code will now work, such as cv splitters returning masks or feature selectors...? Are we being too lenient in our APIs then?

sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/__init__.py Outdated Show resolved Hide resolved
- To select multiple elements (i.e. rows or columns), `indices` can be
one of the following: `list`, `array`, `slice`. The type used in
these containers can be one of the following: `int`, `bool`, and
`str`. `str` is only supported when `X` is a dataframe.

This comment has been minimized.

Copy link
@NicolasHug

NicolasHug Aug 1, 2019

Contributor

what are slices of bool and str?

Would this work?

To select multiple elements (i.e. rows or columns), indices can be
a slice, or a container (list orarray). The type used in
these containers can be one of the following: int, bool, and
str. str is only supported when X is a dataframe.

This comment has been minimized.

Copy link
@glemaitre

glemaitre Aug 1, 2019

Author Contributor

slice of bool will give you the first line (it casting stuff into int). We should raise an error for such case I think.

slice of string is useful with dataframe:

df.loc[slice('xxx', 'yyy')]

I see that we have an issue because we only use iloc and then it would not work. So the docstring is wrong

This comment has been minimized.

Copy link
@NicolasHug

NicolasHug Aug 1, 2019

Contributor

do we really want to support slices of strings? I'd say we don't for now to keep it simple?

This comment has been minimized.

Copy link
@glemaitre

glemaitre Aug 1, 2019

Author Contributor

It is already supported for the column with dataframe

This comment has been minimized.

Copy link
@glemaitre

glemaitre Aug 1, 2019

Author Contributor

What I mean is:

from sklearn.datasets import fetch_openml
iris = fetch_openml('iris', as_frame=True)
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import make_column_transformer
transformer = make_column_transformer(
    (FunctionTransformer(), slice('sepallength', 'sepalwidth'))
)
transformer.fit_transform(iris.data, iris.target)
glemaitre added 4 commits Aug 1, 2019
@glemaitre

This comment has been minimized.

Copy link
Contributor Author

glemaitre commented Aug 1, 2019

I refactor the code which should make it more understandable but less easy to review by looking at the diff (sorry for that). I think that I will need to do the same regarding the tests if we want to keep this indexing manageable.

So before to go further, I think that we need to answer a couple of questions:

  • for axis=1, X cannot be a list?
  • do we want to support slicing with string for both axis=0 and axis=1?
  • we can now index using boolean array-like or arrays. I first thought that @jnothman wanted this (#14035 (comment)) but his last comment #14475 (review) question this additional support.

I think that this is the main blocker here.

Copy link
Member

amueller left a comment

I'm not sure if I understand what's going on. Are we considering safe_indexing not public and so we can break backward-compatibility?

sklearn/impute/_iterative.py Outdated Show resolved Hide resolved
sklearn/metrics/cluster/unsupervised.py Outdated Show resolved Hide resolved
glemaitre added 4 commits Aug 19, 2019
…into is/consistent_safe_indexing
@glemaitre

This comment has been minimized.

Copy link
Contributor Author

glemaitre commented Aug 19, 2019

I'm not sure if I understand what's going on. Are we considering safe_indexing not public and so we can break backward-compatibility?

As mentioned in one of my comment, I reintroduced support for boolean array-like and integer slice meaning since we were supporting it.

@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Aug 20, 2019

Copy link
Member

jnothman left a comment

I'm somewhat confused by this _check_key_type(key, superclass) design. It repeats very similar checks for each of three classes, but also has specialised code paths depending on superclass. Why not _determine_key_type(key) -> {'s', 'i', 'b', 'none-slice'} called exactly once per safe_indexing?

"""Index a pandas dataframe or a series."""
# check whether we should index with loc or iloc
if _check_key_type(key, int):
by_name = False

This comment has been minimized.

Copy link
@jnothman

jnothman Aug 20, 2019

Member

should we be raising a warning/error if X.index.dtype.kind in 'iu' (and similar on the other axis)? This won't handle the case where there are ints of object dtype in the index.


if hasattr(key, 'shape'):
# Work-around for indexing with read-only key in pandas
# FIXME: solved in pandas 0.25

This comment has been minimized.

Copy link
@jnothman

jnothman Aug 20, 2019

Member

should we use an explicit version comparison to take this codepath?

This comment has been minimized.

Copy link
@glemaitre

glemaitre Aug 20, 2019

Author Contributor

then we need to make an explicit pandas import

sklearn/utils/__init__.py Outdated Show resolved Hide resolved
if np.isscalar(key) or isinstance(key, slice):
# key is a slice or a scalar
return X[key]
key_set = set(key)

This comment has been minimized.

Copy link
@jnothman

jnothman Aug 20, 2019

Member

key will usually be an array, won't it? Should we cast into an array at some point?

This comment has been minimized.

Copy link
@glemaitre

glemaitre Aug 20, 2019

Author Contributor

I don't know if we need to do so. We can always postpone until required.

sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/__init__.py Outdated Show resolved Hide resolved
return None
if isinstance(key, tuple(dtype_to_str.keys())):
try:
return dtype_to_str[type(key)]

This comment has been minimized.

Copy link
@jnothman

jnothman Aug 20, 2019

Member

Not sure this is the right behaviour for a scalar bool

This comment has been minimized.

Copy link
@glemaitre

glemaitre Aug 21, 2019

Author Contributor

What do you mean? bool and np.bool_ will return the string 'bool'. What would you expect instead?

Note that in safe indexing, we are not using scalar bool to index.

This comment has been minimized.

Copy link
@jnothman

jnothman Aug 21, 2019

Member

Note that in safe indexing, we are not using scalar bool to index.

Okay ;)

sklearn/utils/__init__.py Outdated Show resolved Hide resolved
return set_type
if hasattr(key, 'dtype'):
try:
return array_dtype_to_str[key.dtype.kind]

This comment has been minimized.

Copy link
@jnothman

jnothman Aug 20, 2019

Member

Is behaviour with dtype='O' where the array contains only ints or only bools correct? (How does numpy handle this?)

This comment has been minimized.

Copy link
@glemaitre

glemaitre Aug 21, 2019

Author Contributor

NumPy will return 'O' and nothing about 'i' or 'b' so it will fail. I am not really sure how to solve this issue without adding so much complexity.

One would need to try converting the array to bool and int first. However, I thought that we are usually dtype=object to handle string.

This comment has been minimized.

Copy link
@thomasjpfan

thomasjpfan Aug 22, 2019

Member

This maintains the behavior on master where we use ('O', 'U', 'S') to mean string:

if hasattr(key, 'dtype'):
if superclass is int:
return key.dtype.kind == 'i'
elif superclass is bool:
return key.dtype.kind == 'b'
else:
# superclass = str
return key.dtype.kind in ('O', 'U', 'S')

glemaitre added 3 commits Aug 21, 2019
Copy link
Member

jnothman left a comment

I've not reviewed tests thoroughly, but I'm pretty happy with this!

sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/__init__.py Outdated Show resolved Hide resolved
glemaitre and others added 2 commits Aug 21, 2019
Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>
@rth

This comment has been minimized.

Copy link
Member

rth commented Sep 4, 2019

Looks like there are two approvals. Anything blocking the merge?

@glemaitre

This comment has been minimized.

Copy link
Contributor Author

glemaitre commented Sep 4, 2019

I think this is ready to be merged. I address all comments.

@jnothman jnothman merged commit ae5f558 into scikit-learn:master Sep 4, 2019
17 checks passed
17 checks passed
LGTM analysis: C/C++ No code changes detected
Details
LGTM analysis: JavaScript No code changes detected
Details
LGTM analysis: Python 1 new alert
Details
ci/circleci: deploy Your tests passed on CircleCI!
Details
ci/circleci: doc Your tests passed on CircleCI!
Details
ci/circleci: doc-min-dependencies Your tests passed on CircleCI!
Details
ci/circleci: lint Your tests passed on CircleCI!
Details
codecov/patch 99% of diff hit (target 96.88%)
Details
codecov/project 96.9% (+0.02%) compared to 1a14920
Details
scikit-learn.scikit-learn Build #20190821.27 succeeded
Details
scikit-learn.scikit-learn (Linux py35_conda_openblas) Linux py35_conda_openblas succeeded
Details
scikit-learn.scikit-learn (Linux py35_ubuntu_atlas) Linux py35_ubuntu_atlas succeeded
Details
scikit-learn.scikit-learn (Linux pylatest_conda_mkl_pandas) Linux pylatest_conda_mkl_pandas succeeded
Details
scikit-learn.scikit-learn (Linux32 py35_ubuntu_atlas_32bit) Linux32 py35_ubuntu_atlas_32bit succeeded
Details
scikit-learn.scikit-learn (Windows py35_pip_openblas_32bit) Windows py35_pip_openblas_32bit succeeded
Details
scikit-learn.scikit-learn (Windows py37_conda_mkl) Windows py37_conda_mkl succeeded
Details
scikit-learn.scikit-learn (macOS pylatest_conda_mkl) macOS pylatest_conda_mkl succeeded
Details
@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Sep 4, 2019

Thanks @glemaitre

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants
You can’t perform that action at this time.