Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX IndexError in fetch_openml('zoo') #14623

Merged

Conversation

HabchiSarra
Copy link
Contributor

@HabchiSarra HabchiSarra commented Aug 10, 2019

Reference Issues/PRs

Fixes #14340

What does this implement/fix? Explain your changes.

The shape extraction from data_qualities used NumberOfFeatures, which excluded the ignored features.
This exclusion caused a bug in the data conversion since we tried to reshape the whole dataset with a lower number of features.
This fix includes all features in the shape extraction.

sklearn/datasets/openml.py Outdated Show resolved Hide resolved
sklearn/datasets/openml.py Outdated Show resolved Hide resolved
Copy link
Member

@thomasjpfan thomasjpfan left a comment

There is a naming convention when dealing with data in sklearn/datasets/tests/data/openmp/62. You can learn about the conventions by looking at other data in the test directory.

For example the arff data is named data-v1-download-1.arff.gz, etc.

@HabchiSarra HabchiSarra force-pushed the fix-ignored-features-openml branch from a2505a6 to 53223a6 Compare Aug 12, 2019
@HabchiSarra
Copy link
Contributor Author

@HabchiSarra HabchiSarra commented Aug 12, 2019

Thanks, @thomasjpfan.
I followed your recommendations and changed my commit accordingly.

@amueller
Copy link
Member

@amueller amueller commented Aug 12, 2019

tests are failing.

@HabchiSarra HabchiSarra force-pushed the fix-ignored-features-openml branch from 53223a6 to c6fdb40 Compare Aug 12, 2019
@HabchiSarra
Copy link
Contributor Author

@HabchiSarra HabchiSarra commented Aug 12, 2019

@amueller I fixed the tests.
However, for the failing doc build, I don't think it is related to my modifications.
Do you have an idea about its root cause or how to fix it?

@amueller
Copy link
Member

@amueller amueller commented Aug 12, 2019

Thanks! it's indeed unrelated to your changes, please merge with the current master branch which fixed this issue.

@HabchiSarra HabchiSarra force-pushed the fix-ignored-features-openml branch from c6fdb40 to 60bc2b6 Compare Aug 12, 2019
def _get_data_shape(data_qualities):
# Using the data_info dictionary from _get_data_info_by_name to extract
# the number of samples / features
def _get_data_instances(data_qualities):
Copy link
Member

@jnothman jnothman Aug 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _get_data_instances(data_qualities):
def _get_num_samples(data_qualities):

I think this would be clearer naming for readers of scikit-learn code.

Copy link
Contributor Author

@HabchiSarra HabchiSarra Aug 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the name, thanks.

Parameters
----------
data_qualities : list
Copy link
Member

@jnothman jnothman Aug 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't be a list. It's keyed by strings.

Copy link
Contributor

@glemaitre glemaitre Aug 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed a list of dict. Each dict has 2 keys "name" and "values".
This looks really weird because we could have a single dict with "name" being the key and "values" the associated value but this is not the case. This is actually what the dict comprehension is doing in l.448.

data_qualities : list of dict

Copy link
Contributor Author

@HabchiSarra HabchiSarra Aug 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is a list of dict.
I updated the documentation accordingly.

qualities = {d['name']: d['value'] for d in data_qualities}
try:
return (int(float(qualities['NumberOfInstances'])),
int(float(qualities['NumberOfFeatures'])))
instances = int(float(qualities['NumberOfInstances']))
except AttributeError:
Copy link
Member

@jnothman jnothman Aug 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how an AttributeError would be raised here.

Copy link
Contributor

@glemaitre glemaitre Aug 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be a KeyError isn't it?

Copy link
Contributor Author

@HabchiSarra HabchiSarra Aug 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the way we get the value, so it's not needed anymore.
See https://github.com/scikit-learn/scikit-learn/pull/14623/files#diff-ea672b15dd808c88257c58681d17bb6aR448

Copy link
Contributor

@glemaitre glemaitre left a comment

A couple of comments.

sklearn/datasets/openml.py Outdated Show resolved Hide resolved
sklearn/datasets/openml.py Outdated Show resolved Hide resolved
Parameters
----------
data_qualities : list
Copy link
Contributor

@glemaitre glemaitre Aug 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed a list of dict. Each dict has 2 keys "name" and "values".
This looks really weird because we could have a single dict with "name" being the key and "values" the associated value but this is not the case. This is actually what the dict comprehension is doing in l.448.

data_qualities : list of dict

sklearn/datasets/openml.py Outdated Show resolved Hide resolved
sklearn/datasets/openml.py Show resolved Hide resolved
qualities = {d['name']: d['value'] for d in data_qualities}
try:
return (int(float(qualities['NumberOfInstances'])),
int(float(qualities['NumberOfFeatures'])))
instances = int(float(qualities['NumberOfInstances']))
except AttributeError:
Copy link
Contributor

@glemaitre glemaitre Aug 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be a KeyError isn't it?

sklearn/datasets/openml.py Outdated Show resolved Hide resolved
sklearn/datasets/openml.py Outdated Show resolved Hide resolved
sklearn/datasets/tests/test_openml.py Show resolved Hide resolved
sklearn/datasets/tests/test_openml.py Show resolved Hide resolved
@HabchiSarra HabchiSarra force-pushed the fix-ignored-features-openml branch from 60bc2b6 to e6e8a1d Compare Aug 13, 2019
@HabchiSarra
Copy link
Contributor Author

@HabchiSarra HabchiSarra commented Aug 13, 2019

I addressed the reviews, does the new version suit you?

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

@HabchiSarra HabchiSarra force-pushed the fix-ignored-features-openml branch from e6e8a1d to 53921ca Compare Aug 13, 2019
Copy link
Contributor

@glemaitre glemaitre left a comment

nitpicking otherwise LGTM

sklearn/datasets/openml.py Show resolved Hide resolved
# Using the data_info dictionary from _get_data_info_by_name to extract
# the number of samples / features
def _get_num_samples(data_qualities):
"""Get the number of samples from data qualities
Copy link
Contributor

@glemaitre glemaitre Aug 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing full stop.

-------
instances : int
The number of samples in the dataset or -1 if data qualities are
unavailable
Copy link
Contributor

@glemaitre glemaitre Aug 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing full stop

Returns
-------
instances : int
Copy link
Contributor

@glemaitre glemaitre Aug 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
instances : int
n_samples : int

Copy link
Contributor Author

@HabchiSarra HabchiSarra Aug 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it's done.

The shape extraction from data_qualities was using NumberOfFeatures,
which excluded the ignored features.
This exclusion caused a bug in the data conversion, since we tried
to reshape the whole dataset with a lower number of features.

This fix uses data_features to include ignored features in the shape
extraction

Fixes scikit-learn#14340
@HabchiSarra HabchiSarra force-pushed the fix-ignored-features-openml branch from 53921ca to ea80272 Compare Aug 13, 2019
@jnothman
Copy link
Member

@jnothman jnothman commented Aug 13, 2019

Thanks @HabchiSarra!

@jnothman jnothman merged commit e49b9d3 into scikit-learn:master Aug 13, 2019
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants