Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX IndexError in fetch_openml('zoo') #14623

Merged

Conversation

@HabchiSarra
Copy link
Contributor

HabchiSarra commented Aug 10, 2019

Reference Issues/PRs

Fixes #14340

What does this implement/fix? Explain your changes.

The shape extraction from data_qualities used NumberOfFeatures, which excluded the ignored features.
This exclusion caused a bug in the data conversion since we tried to reshape the whole dataset with a lower number of features.
This fix includes all features in the shape extraction.

sklearn/datasets/openml.py Outdated Show resolved Hide resolved
sklearn/datasets/openml.py Outdated Show resolved Hide resolved
Copy link
Member

thomasjpfan left a comment

There is a naming convention when dealing with data in sklearn/datasets/tests/data/openmp/62. You can learn about the conventions by looking at other data in the test directory.

For example the arff data is named data-v1-download-1.arff.gz, etc.

@HabchiSarra HabchiSarra force-pushed the HabchiSarra:fix-ignored-features-openml branch from a2505a6 to 53223a6 Aug 12, 2019
@HabchiSarra

This comment has been minimized.

Copy link
Contributor Author

HabchiSarra commented Aug 12, 2019

Thanks, @thomasjpfan.
I followed your recommendations and changed my commit accordingly.

@amueller

This comment has been minimized.

Copy link
Member

amueller commented Aug 12, 2019

tests are failing.

@HabchiSarra HabchiSarra force-pushed the HabchiSarra:fix-ignored-features-openml branch from 53223a6 to c6fdb40 Aug 12, 2019
@HabchiSarra

This comment has been minimized.

Copy link
Contributor Author

HabchiSarra commented Aug 12, 2019

@amueller I fixed the tests.
However, for the failing doc build, I don't think it is related to my modifications.
Do you have an idea about its root cause or how to fix it?

@amueller

This comment has been minimized.

Copy link
Member

amueller commented Aug 12, 2019

Thanks! it's indeed unrelated to your changes, please merge with the current master branch which fixed this issue.

@HabchiSarra HabchiSarra force-pushed the HabchiSarra:fix-ignored-features-openml branch from c6fdb40 to 60bc2b6 Aug 12, 2019
def _get_data_shape(data_qualities):
# Using the data_info dictionary from _get_data_info_by_name to extract
# the number of samples / features
def _get_data_instances(data_qualities):

This comment has been minimized.

Copy link
@jnothman

jnothman Aug 13, 2019

Member
Suggested change
def _get_data_instances(data_qualities):
def _get_num_samples(data_qualities):

I think this would be clearer naming for readers of scikit-learn code.

This comment has been minimized.

Copy link
@HabchiSarra

HabchiSarra Aug 13, 2019

Author Contributor

I changed the name, thanks.

Parameters
----------
data_qualities : list

This comment has been minimized.

Copy link
@jnothman

jnothman Aug 13, 2019

Member

can't be a list. It's keyed by strings.

This comment has been minimized.

Copy link
@glemaitre

glemaitre Aug 13, 2019

Contributor

This is indeed a list of dict. Each dict has 2 keys "name" and "values".
This looks really weird because we could have a single dict with "name" being the key and "values" the associated value but this is not the case. This is actually what the dict comprehension is doing in l.448.

data_qualities : list of dict

This comment has been minimized.

Copy link
@HabchiSarra

HabchiSarra Aug 13, 2019

Author Contributor

Yes, it is a list of dict.
I updated the documentation accordingly.

qualities = {d['name']: d['value'] for d in data_qualities}
try:
return (int(float(qualities['NumberOfInstances'])),
int(float(qualities['NumberOfFeatures'])))
instances = int(float(qualities['NumberOfInstances']))
except AttributeError:

This comment has been minimized.

Copy link
@jnothman

jnothman Aug 13, 2019

Member

I don't see how an AttributeError would be raised here.

This comment has been minimized.

Copy link
@glemaitre

glemaitre Aug 13, 2019

Contributor

It should be a KeyError isn't it?

This comment has been minimized.

Copy link
@HabchiSarra

HabchiSarra Aug 13, 2019

Author Contributor

I changed the way we get the value, so it's not needed anymore.
See https://github.com/scikit-learn/scikit-learn/pull/14623/files#diff-ea672b15dd808c88257c58681d17bb6aR448

Copy link
Contributor

glemaitre left a comment

A couple of comments.

sklearn/datasets/openml.py Outdated Show resolved Hide resolved
sklearn/datasets/openml.py Outdated Show resolved Hide resolved
Parameters
----------
data_qualities : list

This comment has been minimized.

Copy link
@glemaitre

glemaitre Aug 13, 2019

Contributor

This is indeed a list of dict. Each dict has 2 keys "name" and "values".
This looks really weird because we could have a single dict with "name" being the key and "values" the associated value but this is not the case. This is actually what the dict comprehension is doing in l.448.

data_qualities : list of dict

sklearn/datasets/openml.py Outdated Show resolved Hide resolved
sklearn/datasets/openml.py Show resolved Hide resolved
qualities = {d['name']: d['value'] for d in data_qualities}
try:
return (int(float(qualities['NumberOfInstances'])),
int(float(qualities['NumberOfFeatures'])))
instances = int(float(qualities['NumberOfInstances']))
except AttributeError:

This comment has been minimized.

Copy link
@glemaitre

glemaitre Aug 13, 2019

Contributor

It should be a KeyError isn't it?

sklearn/datasets/openml.py Outdated Show resolved Hide resolved
sklearn/datasets/openml.py Outdated Show resolved Hide resolved
sklearn/datasets/tests/test_openml.py Show resolved Hide resolved
sklearn/datasets/tests/test_openml.py Show resolved Hide resolved
@HabchiSarra HabchiSarra force-pushed the HabchiSarra:fix-ignored-features-openml branch from 60bc2b6 to e6e8a1d Aug 13, 2019
@HabchiSarra

This comment has been minimized.

Copy link
Contributor Author

HabchiSarra commented Aug 13, 2019

I addressed the reviews, does the new version suit you?

Copy link
Member

thomasjpfan left a comment

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

@HabchiSarra HabchiSarra force-pushed the HabchiSarra:fix-ignored-features-openml branch from e6e8a1d to 53921ca Aug 13, 2019
Copy link
Contributor

glemaitre left a comment

nitpicking otherwise LGTM

sklearn/datasets/openml.py Show resolved Hide resolved
# Using the data_info dictionary from _get_data_info_by_name to extract
# the number of samples / features
def _get_num_samples(data_qualities):
"""Get the number of samples from data qualities

This comment has been minimized.

Copy link
@glemaitre

glemaitre Aug 13, 2019

Contributor

missing full stop.

-------
instances : int
The number of samples in the dataset or -1 if data qualities are
unavailable

This comment has been minimized.

Copy link
@glemaitre

glemaitre Aug 13, 2019

Contributor

missing full stop

Returns
-------
instances : int

This comment has been minimized.

Copy link
@glemaitre

glemaitre Aug 13, 2019

Contributor
Suggested change
instances : int
n_samples : int

This comment has been minimized.

Copy link
@HabchiSarra

HabchiSarra Aug 13, 2019

Author Contributor

Ok, it's done.

The shape extraction from data_qualities was using NumberOfFeatures,
which excluded the ignored features.
This exclusion caused a bug in the data conversion, since we tried
to reshape the whole dataset with a lower number of features.

This fix uses data_features to include ignored features in the shape
extraction

Fixes #14340
@HabchiSarra HabchiSarra force-pushed the HabchiSarra:fix-ignored-features-openml branch from 53921ca to ea80272 Aug 13, 2019
@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Aug 13, 2019

Thanks @HabchiSarra!

@jnothman jnothman merged commit e49b9d3 into scikit-learn:master Aug 13, 2019
17 checks passed
17 checks passed
LGTM analysis: C/C++ No code changes detected
Details
LGTM analysis: JavaScript No code changes detected
Details
LGTM analysis: Python No new or fixed alerts
Details
ci/circleci: deploy Your tests passed on CircleCI!
Details
ci/circleci: doc Your tests passed on CircleCI!
Details
ci/circleci: doc-min-dependencies Your tests passed on CircleCI!
Details
ci/circleci: lint Your tests passed on CircleCI!
Details
codecov/patch 100% of diff hit (target 96.88%)
Details
codecov/project 96.88% (+<.01%) compared to 92af3da
Details
scikit-learn.scikit-learn Build #20190813.69 succeeded
Details
scikit-learn.scikit-learn (Linux py35_conda_openblas) Linux py35_conda_openblas succeeded
Details
scikit-learn.scikit-learn (Linux py35_ubuntu_atlas) Linux py35_ubuntu_atlas succeeded
Details
scikit-learn.scikit-learn (Linux pylatest_conda_mkl_pandas) Linux pylatest_conda_mkl_pandas succeeded
Details
scikit-learn.scikit-learn (Linux32 py35_ubuntu_atlas_32bit) Linux32 py35_ubuntu_atlas_32bit succeeded
Details
scikit-learn.scikit-learn (Windows py35_pip_openblas_32bit) Windows py35_pip_openblas_32bit succeeded
Details
scikit-learn.scikit-learn (Windows py37_conda_mkl) Windows py37_conda_mkl succeeded
Details
scikit-learn.scikit-learn (macOS pylatest_conda_mkl) macOS pylatest_conda_mkl succeeded
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.