Tests failing #2

Closed
tdhopper opened this issue Aug 13, 2013 · 15 comments

Comments

@tdhopper

I'm seeing a bunch of tests fail. I'm on Windows 7 with Python 2.7.5 via Anaconda 1.6.2 (64-bit).

C:\Anaconda\Lib\site-packages>python -m doctest README.rst
**********************************************************************
File "README.rst", line 75, in README.rst
Failed example:
    mapper.fit_transform(data)
Exception raised:
    Traceback (most recent call last):
      File "C:\Anaconda\lib\doctest.py", line 1289, in __run
        compileflags, 1) in test.globs
      File "<doctest README.rst[5]>", line 1, in <module>
        mapper.fit_transform(data)
      File "sklearn\base.py", line 408, in fit_transform
        return self.fit(X, **fit_params).transform(X)
      File "sklearn_pandas\__init__.py", line 46, in fit
        transformer.fit(X[columns])
      File "sklearn\preprocessing\label.py", line 241, in fit
        self.classes_ = unique_labels(y)
      File "sklearn\utils\multiclass.py", line 98, in unique_labels
        raise ValueError("Unknown label type")
    ValueError: Unknown label type
**********************************************************************
File "README.rst", line 89, in README.rst
Failed example:
    mapper.transform({'pet': ['cat'], 'children': [5.]})
Exception raised:
    Traceback (most recent call last):
      File "C:\Anaconda\lib\doctest.py", line 1289, in __run
        compileflags, 1) in test.globs
      File "<doctest README.rst[6]>", line 1, in <module>
        mapper.transform({'pet': ['cat'], 'children': [5.]})
      File "sklearn_pandas\__init__.py", line 52, in transform
        fea = transformer.transform(X[columns])
      File "sklearn\preprocessing\label.py", line 261, in transform
        self._check_fitted()
      File "sklearn\preprocessing\label.py", line 221, in _check_fitted
        raise ValueError("LabelBinarizer was not fitted yet.")
    ValueError: LabelBinarizer was not fitted yet.
**********************************************************************
File "README.rst", line 103, in README.rst
Failed example:
    mapper2.fit_transform(data)
Expected:
    array([[ 47.62288153],
           [-18.38596516],
           [  1.62873661],
           [-15.3709553 ],
           [-10.36602451],
           [ 16.62846476],
           [ -6.38116123],
           [-15.37597671]])
Got:
    array([[ 47.62195051],
           [-18.39077736],
           [  1.63037658],
           [-15.36917967],
           [-10.36208485],
           [ 16.62998504],
           [ -6.38386526],
           [-15.376405  ]])
**********************************************************************
File "README.rst", line 123, in README.rst
Failed example:
    cross_val_score(pipe, data, data.salary, sklearn.metrics.mean_squared_error)

Exception raised:
    Traceback (most recent call last):
      File "C:\Anaconda\lib\doctest.py", line 1289, in __run
        compileflags, 1) in test.globs
      File "<doctest README.rst[10]>", line 1, in <module>
        cross_val_score(pipe, data, data.salary, sklearn.metrics.mean_squared_error)
      File "sklearn_pandas\__init__.py", line 34, in cross_val_score
        return cross_validation.cross_val_score(df, X_indices, *args, **kwargs)
      File "sklearn\cross_validation.py", line 1152, in cross_val_score
        for train, test in cv)
      File "sklearn\externals\joblib\parallel.py", line 517, in __call__
        self.dispatch(function, args, kwargs)
      File "sklearn\externals\joblib\parallel.py", line 312, in dispatch
        job = ImmediateApply(func, args, kwargs)
      File "sklearn\externals\joblib\parallel.py", line 136, in __init__
        self.results = func(*args, **kwargs)
      File "sklearn\cross_validation.py", line 1060, in _cross_val_score
        estimator.fit(X_train, y_train, **fit_params)
      File "sklearn_pandas\__init__.py", line 19, in fit
        self.estimator.fit(self._get_row_subset(x), y)
      File "sklearn\pipeline.py", line 130, in fit
        Xt, fit_params = self._pre_transform(X, y, **fit_params)
      File "sklearn\pipeline.py", line 120, in _pre_transform
        Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
      File "sklearn\base.py", line 411, in fit_transform
        return self.fit(X, y, **fit_params).transform(X)
      File "sklearn_pandas\__init__.py", line 46, in fit
        transformer.fit(X[columns])
      File "sklearn\preprocessing\label.py", line 241, in fit
        self.classes_ = unique_labels(y)
      File "sklearn\utils\multiclass.py", line 98, in unique_labels
        raise ValueError("Unknown label type")
    ValueError: Unknown label type
**********************************************************************
1 items had failures:
   4 of  11 in README.rst
***Test Failed*** 4 failures.
@paulgb
Collaborator

paulgb commented Aug 14, 2013

Thanks for reporting this Tim.

I suspected having floating point calculations in the tests might lead to some issues -- I've changed the tests to only match the first two digits after the decimal.
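For anyone hitting the same kind of doctest mismatch: the exact README change isn't shown in this thread, but a tolerance-based comparison is one way to make such a check robust. The sketch below is an assumption, not the project's actual fix. It takes the expected and actual values from the report above and compares them with NumPy's np.allclose.

```python
import numpy as np

# Values copied from the failing doctest output in the report.
expected = np.array([47.62288153, -18.38596516, 1.62873661, -15.3709553,
                     -10.36602451, 16.62846476, -6.38116123, -15.37597671])
got = np.array([47.62195051, -18.39077736, 1.63037658, -15.36917967,
                -10.36208485, 16.62998504, -6.38386526, -15.376405])

# Exact comparison fails because the platforms differ in the low digits.
exact_match = np.array_equal(expected, got)

# Rounding to two decimals is brittle: -10.366... and -10.362... land on
# different sides of a rounding boundary.
rounded_match = np.array_equal(np.round(expected, 2), np.round(got, 2))

# An absolute tolerance of 0.01 accepts every pair in this report.
close_enough = np.allclose(expected, got, rtol=0, atol=1e-2)

print(exact_match, rounded_match, close_enough)
```

With these particular numbers only the tolerance check passes, which is why a tolerance tends to be a safer choice than matching printed digits.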

Could you please paste the output of pip freeze for me? In the meantime I'll see if I can reproduce with the latest sklearn and pandas.

@tdhopper
Author

Flask==0.10.1
Jinja2==2.6
MDP==3.3
PIL==1.1.7
PySAL==1.5.0
PySide==1.1.2
PyYAML==3.10
Pygments==1.6
SQLAlchemy==0.8.1
Sphinx==1.1.3
Werkzeug==0.9.1
astropy==0.2.3
atom==0.2.3
binstar-client==0.1.0
biopython==1.61
bitarray==0.8.1
boto==2.9.6
casuarius==1.1
chaco==4.2.1
conda==1.8.1
cubes==0.10.2
distribute==0.6.45
docutils==0.10
enable==4.2.1
enaml==0.7.6
gevent==0.13.8
gevent-websocket==0.3.6
gevent-zeromq==0.2.2
greenlet==0.4.1
grin==1.2.1
h5py==2.1.1
ipython==0.13.2
itsdangerous==0.21
keyring==1.4
llvmmath==0.1
llvmpy==0.11.3
lxml==3.2.1
matplotlib==1.2.1
menuinst==1.0.1
meta==development
moves==0.1
networkx==1.7
nltk==2.0.4
nose==1.3.0
numba==0.9.0
numexpr==2.0.1
numpy==1.7.1
pandas==0.12.0
pep8==1.4.5
ply==3.4
praw==2.1.4
psutil==0.7.1
py==1.4.14
pycosat==0.6.0
pycparser==2.09.1
pycrypto==2.6
pyface==4.2.1
pyflakes==0.7.2
pyparsing==1.5.6
pyreadline==2.0-dev1
pytest==2.3.5
python-dateutil==2.1
pytz==2013b
pywin32==218.4
pyzmq==2.2.0.1
requests==1.2.3
rope==0.9.4
scikit-image==0.8.2
scikit-learn==0.14.1
scipy==0.12.0
simplejson==3.3.0
six==1.3.0
sklearn-pandas==0.0.3
spyder==2.2.0
statsmodels==0.4.3
sympy==0.7.2
tables==2.4.0
tornado==3.1
traits==4.2.1
traitsui==4.2.1
tweepy==2.1
update-checker==0.5
vincent==0.2
wsgiref==0.1.2
xlrd==0.9.2
xlwt==0.7.5

@paulgb
Collaborator

paulgb commented Aug 15, 2013

I'm able to reproduce this. It seems the interface of sklearn has changed. The following code fails with scikit-learn 0.14.1 but works with scikit-learn 0.13.1:

import pandas as pd
import numpy as np
import sklearn.preprocessing

data = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
                     'children': [4., 6, 3, 3, 2, 3, 5, 4],
                     'salary':   [90, 24, 44, 27, 32, 59, 36, 27]})

lb = sklearn.preprocessing.LabelBinarizer()
lb.fit(data.pet)
print(lb.transform(data.pet))

I need to investigate further to see if this is a sklearn bug or if the tests need to be adjusted appropriately.

@dolaameng

I second that this is a "bug" in sklearn 0.14, in the file sklearn/utils/multiclass.py, function type_of_target, line 293:

if y.ndim > 2 or y.dtype == object:
    return 'unknown'

It will return 'unknown' for any np.array of strings. So it works fine with ['cat', 'dog', 'fish'], but no longer with np.asarray(['cat', 'dog', 'fish']).

@paulgb
Collaborator

paulgb commented Aug 16, 2013

Good find. type_of_target actually does return 'multiclass' for np.array(['cat', 'dog', 'fish']), because that array has a fixed-width string dtype ('|S4'). But calling as_matrix() on the pandas DataFrame gives a matrix with dtype 'object' because the columns have different types (or maybe this is always the pandas behaviour; I haven't checked).

Still, it seems a little weird for the behaviour to change based on the data representation of two arrays that numpy considers equivalent.

I'm going to take a better look at the sklearn code to see if there's a good reason behind this and if I can't find one I'll file a bug report on that project.

In [38]: p = np.array(['a', 'b', 'c'])

In [39]: q = np.array(['a', 'b', 'c'], dtype='object')

In [40]: np.array_equal(p, q)
Out[40]: True

In [41]: type_of_target(p)
Out[41]: 'multiclass'

In [42]: type_of_target(q)
Out[42]: 'unknown'

@paulgb
Collaborator

paulgb commented Aug 20, 2013

It seems to be a deeper disconnect between sklearn and pandas than I'd hoped: pandas seems to want string arrays to have the "object" dtype, while sklearn expects them to have the appropriate numpy string datatype. I've found a few hack solutions but none that I feel good about publishing. I'm continuing to dig into the sklearn code to see if there's a better way.

@ogrisel
Contributor

ogrisel commented Aug 20, 2013

I think we want to support dtype='object' for arrays of variable-length strings as well in scikit-learn: type_of_target(q) == 'unknown' is probably a bug.

@ogrisel
Contributor

ogrisel commented Aug 20, 2013

I wish numpy had a dtype for variable length strings...

@paulgb
Collaborator

paulgb commented Aug 20, 2013

Yes, it's a shame that an array of sequences looks the same (from a dtype perspective) as an array of variable-length strings.

The least hacky workaround that I can think of without changing sklearn is to convert the arrays to fixed-length strings (np.array(X, dtype='|S')) before sending them off to sklearn, but it would be great to fix sklearn instead. It looks to me like it would just be a matter of looping over the array in sklearn to check whether the elements are all str/unicode objects. Thoughts?
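As a rough sketch of both ideas (scan an object array to confirm it holds only strings, then convert it to a fixed-width string dtype): the helper names all_strings and to_fixed_width are hypothetical and do not come from sklearn or sklearn_pandas. It is written for Python 3, where astype(str) yields a unicode ('U') dtype instead of the '|S' dtype mentioned above.

```python
import numpy as np

def all_strings(arr):
    """Return True if every element of the array is a str (hypothetical helper)."""
    return all(isinstance(v, str) for v in arr.ravel())

def to_fixed_width(arr):
    """Convert an object array of strings to a fixed-width string array.

    Arrays that are not object-typed, or that hold non-string objects,
    are returned unchanged.
    """
    arr = np.asarray(arr)
    if arr.dtype == object and all_strings(arr):
        return arr.astype(str)
    return arr

q = np.array(['cat', 'dog', 'fish'], dtype=object)
mixed = np.array(['cat', 1.0], dtype=object)

print(to_fixed_width(q).dtype.kind)   # fixed-width unicode kind
print(to_fixed_width(mixed).dtype)    # left as object
```

The scan costs one pass over the data, which is the trade-off raised later in the thread.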

@ogrisel
Contributor

ogrisel commented Aug 20, 2013

+1 for the temporary workaround in sklearn_pandas to restore sklearn 0.14 compatibility, and I will create an issue for the sklearn project.

@ogrisel
Contributor

ogrisel commented Sep 19, 2013

@paulgb wouldn't converting to a list of str or unicode objects before sending to sklearn work even better?

@paulgb
Collaborator

paulgb commented Sep 20, 2013

That would work, but the issue is more with knowing when to convert to strings without having to do a scan of the entire table.

I experimented a little more with pandas internals and it seems that object is used as a table dtype any time a table has heterogeneous types, but as a column dtype only when the content is a string. So my workaround is to convert the matrix dtype to string if and only if every column in the mapping has the object dtype.

I've updated the code here and on PyPI to version 0.0.4, which includes this fix. I'd like to wait to hear some feedback on whether it solved the problem before closing this issue.
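A minimal sketch of that rule, for illustration only: extract_columns is a hypothetical name, not the actual sklearn_pandas code, and it uses .values and astype(str) as they behave in current pandas and Python 3. The pet column is given an explicit object dtype so the example behaves the same on pandas versions that store strings differently by default.

```python
import numpy as np
import pandas as pd

def extract_columns(df, columns):
    """Return the selected columns as a NumPy array (hypothetical helper).

    The array is converted to a string dtype only when every selected
    column has the object dtype; otherwise it is returned as-is.
    """
    subset = df[columns]
    values = subset.values
    if all(subset[c].dtype == object for c in columns):
        values = values.astype(str)
    return values

data = pd.DataFrame({
    'pet': pd.Series(['cat', 'dog', 'fish'], dtype=object),
    'children': [4.0, 6.0, 3.0],
})

pets = extract_columns(data, ['pet'])               # all columns are object
mixed = extract_columns(data, ['pet', 'children'])  # heterogeneous columns

print(pets.dtype.kind, mixed.dtype)
```

Checking column dtypes avoids scanning every cell, at the cost of misclassifying an object column that holds something other than strings.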

@paulgb
Collaborator

paulgb commented Sep 21, 2013

Closing this optimistically based on tests passing and things working for Tim (https://twitter.com/tdhopper/status/381088588739272704)

paulgb closed this as completed Sep 21, 2013
@tdhopper
Author

First time one of my tweets has ever been mentioned as a reason for closing a bug report.

@linehammer

There is a mismatch between what you can pass and what you are actually passing. As a result, scikit-learn cannot tell which type of problem you want to solve (regression or classification). The "Unknown label type: 'unknown'" error is raised because of the Y values that you pass to scikit-learn.

Solutions

  • Group your Y values into bins (classes, for example: 0, 1, 2, 3) and apply classification modeling to your data.

  • In most cases, your Y values are of type object, so sklearn cannot recognize their type. Add the line y = y.astype('int') before you pass the variable into the classifier.
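Both suggestions can be sketched with NumPy alone. The bin edges and sample values below are made up for illustration, and whether binning suits your data is a modelling decision this snippet does not settle.

```python
import numpy as np

# Suggestion 1: group continuous Y values into integer classes.
y_continuous = np.array([0.2, 1.7, 3.1, 2.4])
bin_edges = [1.0, 2.0, 3.0]          # illustrative edges, chosen arbitrarily
y_binned = np.digitize(y_continuous, bin_edges)

# Suggestion 2: Y holds integers but is stored with the object dtype.
y_object = np.array([0, 1, 1, 0], dtype=object)
y_int = y_object.astype('int')

print(y_binned, y_int.dtype.kind)
```

Note that astype('int') only helps when the object array really contains integer-like values; string labels such as 'cat' would raise an error.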
