fetch_openml('zoo') raises IndexError in sklearn.datasets.openml._convert_arff_data #14340

Closed
azrdev opened this issue Jul 13, 2019 · 7 comments · Fixed by #14623

azrdev commented Jul 13, 2019

OpenML 'zoo' dataset fails to load.

>>> import sklearn.datasets
>>> sklearn.datasets.fetch_openml( 'zoo')
/usr/lib/python3.7/site-packages/sklearn/datasets/openml.py:305: UserWarning: Multiple active versions of the dataset matching the name zoo exist. Versions may be fundamentally different, returning version 1.
  " {version}.".format(name=name, version=res[0]['version']))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.7/site-packages/sklearn/datasets/openml.py", line 643, in fetch_openml
    X, y = _convert_arff_data(arff['data'], col_slice_x, col_slice_y, shape)
  File "/usr/lib/python3.7/site-packages/sklearn/datasets/openml.py", line 249, in _convert_arff_data
    y = data[:, col_slice_y]
IndexError: index 17 is out of bounds for axis 1 with size 17

First reported as openml/OpenML#989

Versions

>>> import sklearn; sklearn.show_versions()
/tmp/env/lib/python3.7/site-packages/numpy/distutils/system_info.py:639: UserWarning: 
    Atlas (http://math-atlas.sourceforge.net/) libraries not found.
    Directories to search for the libraries can be specified in the
    numpy/distutils/site.cfg file (section [atlas]) or by setting
    the ATLAS environment variable.
  self.calc_info()

System:
    python: 3.7.3 (default, Jun 24 2019, 04:54:02)  [GCC 9.1.0]
executable: /tmp/env/bin/python
   machine: Linux-5.1.16-arch1-1-ARCH-x86_64-with-arch

BLAS:
    macros: NO_ATLAS_INFO=1, HAVE_CBLAS=None
  lib_dirs: /usr/lib64
cblas_libs: cblas

Python deps:
       pip: 19.0.3
setuptools: 40.8.0
   sklearn: 0.21.2
     numpy: 1.16.4
     scipy: 1.3.0
    Cython: None
    pandas: 0.24.2
@amueller
Member

Thanks for the report, looks like a bug indeed.

@amueller
Member

amueller commented Jul 14, 2019

We should ignore the ignored features (in this case 'animal') and change the indices accordingly. See https://github.com/openml/openml-python/blob/347c4a6c2a7b072de574d2bd2f5e0952f6375a84/openml/datasets/dataset.py#L537 for a reference implementation.
@janvanrijn any chance you wanna take a stab at it?
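
A minimal sketch of what "ignore the ignored features and change the indices accordingly" could look like, assuming the column names and the set of ignored names are already known. The helper `split_columns` and every column name other than 'animal' are hypothetical, chosen only for illustration; this is not the code from openml-python or scikit-learn.

```python
def split_columns(names, ignored, targets):
    """Drop ignored columns and express the rest as indices into the reduced array.

    Hypothetical helper: `names` is the full column list, `ignored` a set of
    column names to drop, `targets` the list of target column names.
    """
    # Indices (into the original data) of the columns we keep.
    kept = [i for i, name in enumerate(names) if name not in ignored]
    # Position of each kept column once the ignored ones are removed.
    new_pos = {names[i]: j for j, i in enumerate(kept)}
    target_cols = [new_pos[t] for t in targets]
    feature_cols = [j for j, i in enumerate(kept) if names[i] not in targets]
    return kept, feature_cols, target_cols


# Toy column list: 'animal' is ignored, 'type' is the target.
names = ["animal", "hair", "legs", "type"]
kept, feature_cols, target_cols = split_columns(names, {"animal"}, ["type"])
```

Here `kept` selects columns from the raw data, while `feature_cols` and `target_cols` index into the array that remains after the ignored column is gone, so the target index shifts down by one.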

@corona10
Contributor

@amueller Hi, if this issue is not hard to solve, can I take a look at it?

@amueller
Member

@corona10 Not entirely sure how hard it is, feel free to have a look.

@HabchiSarra
Contributor

@corona10 Hi, are you still working on this issue? Otherwise, I can take it on.

@corona10
Contributor

Go ahead please

HabchiSarra pushed a commit to HabchiSarra/scikit-learn that referenced this issue Aug 10, 2019
The shape extraction from data_qualities was using NumberOfFeatures,
which excluded the ignored features.
This exclusion caused a bug in the data conversion, since we tried
to reshape the whole dataset with a lower number of features.

This commit returns all features in the shape extraction.

Fixes scikit-learn#14340
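
The mismatch described in this commit message can be reproduced on toy data. The sketch below is an illustration only, not the scikit-learn code: `to_array` is a hypothetical helper, and the 3x4 table stands in for a dataset whose first column is ignored (like 'animal') and whose last column is the target.

```python
import numpy as np

# Toy stand-in for decoded ARFF rows: column 0 is an ignored attribute,
# columns 1-2 are features, column 3 is the target.
rows = [
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0, 2.0],
    [2.0, 1.0, 1.0, 1.0],
]
n_rows = len(rows)
n_all_columns = len(rows[0])     # 4, counts the ignored column
n_reported = n_all_columns - 1   # 3, mimics a count that skips ignored columns
target_index = 3                 # index of the target in the full table


def to_array(rows, shape):
    """Flatten the rows and reshape them to `shape` (hypothetical helper)."""
    flat = np.fromiter(
        (value for row in rows for value in row),
        dtype="float64",
        count=shape[0] * shape[1],
    )
    return flat.reshape(shape)


# Shape built from the smaller count: the array is only 3 columns wide,
# so the target index points past the last column.
narrow = to_array(rows, (n_rows, n_reported))
try:
    narrow[:, [target_index]]
    failed = False
except IndexError:
    failed = True

# Shape built from every column: the target index is valid again.
full = to_array(rows, (n_rows, n_all_columns))
y = full[:, [target_index]]
```

With the smaller count, the values are also misaligned across rows, so the out-of-bounds index is only the visible symptom of the wrong shape.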
@HabchiSarra
Contributor

Hi,
I proposed a fix by including the ignored features in the data reshaping.
Could you please check it and give me feedback about the PR?

HabchiSarra pushed a commit to HabchiSarra/scikit-learn that referenced this issue Aug 12, 2019
The shape extraction from data_qualities was using NumberOfFeatures,
which excluded the ignored features.
This exclusion caused a bug in the data conversion, since we tried
to reshape the whole dataset with a lower number of features.

This fix uses data_features to include ignored features in the shape
extraction

Fixes scikit-learn#14340
HabchiSarra pushed a commit to HabchiSarra/scikit-learn that referenced this issue Aug 13, 2019
The shape extraction from data_qualities was using NumberOfFeatures,
which excluded the ignored features.
This exclusion caused a bug in the data conversion, since we tried
to reshape the whole dataset with a lower number of features.

This fix uses data_features to include ignored features in the shape
extraction

Fixes scikit-learn#14340