FIX IndexError in fetch_openml('zoo') #14623
There is a naming convention for data in sklearn/datasets/tests/data/openml/62. You can learn the conventions by looking at other data in the test directory.
For example, the ARFF data is named data-v1-download-1.arff.gz, etc.
a2505a6 to 53223a6
Thanks, @thomasjpfan.
Tests are failing.
53223a6 to c6fdb40
@amueller I fixed the tests.
Thanks! It's indeed unrelated to your changes; please merge with the current master branch, which fixed this issue.
c6fdb40 to 60bc2b6
sklearn/datasets/openml.py
Outdated
def _get_data_shape(data_qualities):
# Using the data_info dictionary from _get_data_info_by_name to extract
# the number of samples / features
def _get_data_instances(data_qualities):
def _get_data_instances(data_qualities):
def _get_num_samples(data_qualities):
I think this would be clearer naming for readers of scikit-learn code.
I changed the name, thanks.
sklearn/datasets/openml.py
Outdated
    Parameters
    ----------
    data_qualities : list
can't be a list. It's keyed by strings.
This is indeed a list of dict. Each dict has 2 keys, "name" and "value".
This looks really weird because we could have a single dict with "name" being the key and "value" the associated value, but this is not the case. This is actually what the dict comprehension is doing in l.448.
data_qualities : list of dict
Yes, it is a list of dict.
I updated the documentation accordingly.
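To illustrate the structure discussed in this thread, here is a minimal sketch with made-up quality values (the sample entries are hypothetical, not taken from a real OpenML response, and the values are assumed to arrive as strings). It shows how a list of `{"name": ..., "value": ...}` dicts collapses into a single lookup dict via a dict comprehension.

```python
# Hypothetical sample shaped like the list of dicts described above;
# the values are invented for illustration.
data_qualities = [
    {"name": "NumberOfInstances", "value": "101.0"},
    {"name": "NumberOfFeatures", "value": "17.0"},
]

# Collapse the list into one dict keyed by quality name.
qualities = {d["name"]: d["value"] for d in data_qualities}

# Values are assumed to be strings such as "101.0", hence int(float(...)).
n_samples = int(float(qualities["NumberOfInstances"]))
print(n_samples)
```

This is why documenting the parameter as `list of dict` is more accurate than `list` alone: each element must expose both keys for the comprehension to work.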
sklearn/datasets/openml.py
Outdated
    qualities = {d['name']: d['value'] for d in data_qualities}
    try:
        return (int(float(qualities['NumberOfInstances'])),
                int(float(qualities['NumberOfFeatures'])))
        instances = int(float(qualities['NumberOfInstances']))
    except AttributeError:
I don't see how an AttributeError would be raised here.
It should be a KeyError, shouldn't it?
I changed the way we get the value, so it's not needed anymore.
See https://github.com/scikit-learn/scikit-learn/pull/14623/files#diff-ea672b15dd808c88257c58681d17bb6aR448
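For reference, a missing key in a plain dict lookup raises `KeyError`, not `AttributeError`, so an `except AttributeError` clause would never trigger for an absent quality. The sketch below shows this, along with one way to drop the `try`/`except` entirely by using `dict.get` with a default. The helper name `get_num_samples_sketch` is hypothetical, and this is an illustration of the idea rather than the code in the linked diff.

```python
def get_num_samples_sketch(data_qualities, default=-1):
    """Hypothetical helper: return the sample count, or `default` if unknown."""
    if data_qualities is None:
        return default
    qualities = {d["name"]: d["value"] for d in data_qualities}
    # dict.get returns `default` instead of raising when the key is absent.
    return int(float(qualities.get("NumberOfInstances", default)))


# A missing key raises KeyError, which `except AttributeError` would not catch.
try:
    {}["NumberOfInstances"]
    caught = None
except KeyError:
    caught = "KeyError"

print(caught)
print(get_num_samples_sketch([{"name": "NumberOfInstances", "value": "101.0"}]))
print(get_num_samples_sketch([]))
```

With `dict.get`, the "quality not reported" case and the "no qualities at all" case both fall through to the same default, so no exception handling is needed.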
60bc2b6 to e6e8a1d
I addressed the reviews. Does the new version suit you?
Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
e6e8a1d to 53921ca
sklearn/datasets/openml.py
Outdated
# Using the data_info dictionary from _get_data_info_by_name to extract
# the number of samples / features
def _get_num_samples(data_qualities):
    """Get the number of samples from data qualities
missing full stop.
sklearn/datasets/openml.py
Outdated
    -------
    instances : int
        The number of samples in the dataset or -1 if data qualities are
        unavailable
missing full stop
sklearn/datasets/openml.py
Outdated
    Returns
    -------
    instances : int
    instances : int
    n_samples : int
Ok, it's done.
The shape extraction from data_qualities was using NumberOfFeatures, which excluded the ignored features. This exclusion caused a bug in the data conversion, since we tried to reshape the whole dataset with a lower number of features. This fix uses data_features to include ignored features in the shape extraction. Fixes scikit-learn#14340
53921ca to ea80272
Thanks @HabchiSarra!
Reference Issues/PRs
Fixes #14340
What does this implement/fix? Explain your changes.
The shape extraction from data_qualities used NumberOfFeatures, which excluded the ignored features.
This exclusion caused a bug in the data conversion since we tried to reshape the whole dataset with a lower number of features.
This fix includes all features in the shape extraction.
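A toy illustration of the mismatch, using invented numbers and a hypothetical feature list (the `is_ignore` field name and the `reshape_rows` helper are assumptions made for this sketch, and none of it is the actual scikit-learn code): if the flat data holds every column but the feature count omits the ignored ones, the row-major layout no longer lines up.

```python
# Hypothetical feature metadata: one of three columns is flagged as ignored.
data_features = [
    {"name": "animal", "is_ignore": "true"},
    {"name": "hair", "is_ignore": "false"},
    {"name": "legs", "is_ignore": "false"},
]
n_samples = 2

# Flat, row-major values for ALL columns, ignored ones included.
flat = ["bear", 1, 4, "dove", 0, 2]


def reshape_rows(values, n_rows, n_cols):
    """Split a flat row-major list into rows, checking the size first."""
    if len(values) != n_rows * n_cols:
        raise ValueError(
            f"cannot reshape {len(values)} values into ({n_rows}, {n_cols})"
        )
    return [values[i * n_cols:(i + 1) * n_cols] for i in range(n_rows)]


# Counting only non-ignored features gives too few columns.
n_kept = sum(f["is_ignore"] == "false" for f in data_features)
try:
    reshape_rows(flat, n_samples, n_kept)
    mismatch = False
except ValueError:
    mismatch = True

# Counting every feature matches the data that was actually downloaded.
rows = reshape_rows(flat, n_samples, len(data_features))
print(mismatch, rows)
```

Deriving the column count from the full feature list keeps the shape consistent with the raw data; ignored columns can then be dropped after the reshape rather than before it.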