Tests failing #2

tdhopper · 2013-08-13T15:34:02Z

I'm seeing a bunch of tests fail. I'm on Windows 7 with Python 2.7.5 via Anaconda 1.6.2 (64-bit).

C:\Anaconda\Lib\site-packages>python -m doctest README.rst
**********************************************************************
File "README.rst", line 75, in README.rst
Failed example:
    mapper.fit_transform(data)
Exception raised:
    Traceback (most recent call last):
      File "C:\Anaconda\lib\doctest.py", line 1289, in __run
        compileflags, 1) in test.globs
      File "<doctest README.rst[5]>", line 1, in <module>
        mapper.fit_transform(data)
      File "sklearn\base.py", line 408, in fit_transform
        return self.fit(X, **fit_params).transform(X)
      File "sklearn_pandas\__init__.py", line 46, in fit
        transformer.fit(X[columns])
      File "sklearn\preprocessing\label.py", line 241, in fit
        self.classes_ = unique_labels(y)
      File "sklearn\utils\multiclass.py", line 98, in unique_labels
        raise ValueError("Unknown label type")
    ValueError: Unknown label type
**********************************************************************
File "README.rst", line 89, in README.rst
Failed example:
    mapper.transform({'pet': ['cat'], 'children': [5.]})
Exception raised:
    Traceback (most recent call last):
      File "C:\Anaconda\lib\doctest.py", line 1289, in __run
        compileflags, 1) in test.globs
      File "<doctest README.rst[6]>", line 1, in <module>
        mapper.transform({'pet': ['cat'], 'children': [5.]})
      File "sklearn_pandas\__init__.py", line 52, in transform
        fea = transformer.transform(X[columns])
      File "sklearn\preprocessing\label.py", line 261, in transform
        self._check_fitted()
      File "sklearn\preprocessing\label.py", line 221, in _check_fitted
        raise ValueError("LabelBinarizer was not fitted yet.")
    ValueError: LabelBinarizer was not fitted yet.
**********************************************************************
File "README.rst", line 103, in README.rst
Failed example:
    mapper2.fit_transform(data)
Expected:
    array([[ 47.62288153],
           [-18.38596516],
           [  1.62873661],
           [-15.3709553 ],
           [-10.36602451],
           [ 16.62846476],
           [ -6.38116123],
           [-15.37597671]])
Got:
    array([[ 47.62195051],
           [-18.39077736],
           [  1.63037658],
           [-15.36917967],
           [-10.36208485],
           [ 16.62998504],
           [ -6.38386526],
           [-15.376405  ]])
**********************************************************************
File "README.rst", line 123, in README.rst
Failed example:
    cross_val_score(pipe, data, data.salary, sklearn.metrics.mean_squared_error)

Exception raised:
    Traceback (most recent call last):
      File "C:\Anaconda\lib\doctest.py", line 1289, in __run
        compileflags, 1) in test.globs
      File "<doctest README.rst[10]>", line 1, in <module>
        cross_val_score(pipe, data, data.salary, sklearn.metrics.mean_squared_error)
      File "sklearn_pandas\__init__.py", line 34, in cross_val_score
        return cross_validation.cross_val_score(df, X_indices, *args, **kwargs)
      File "sklearn\cross_validation.py", line 1152, in cross_val_score
        for train, test in cv)
      File "sklearn\externals\joblib\parallel.py", line 517, in __call__
        self.dispatch(function, args, kwargs)
      File "sklearn\externals\joblib\parallel.py", line 312, in dispatch
        job = ImmediateApply(func, args, kwargs)
      File "sklearn\externals\joblib\parallel.py", line 136, in __init__
        self.results = func(*args, **kwargs)
      File "sklearn\cross_validation.py", line 1060, in _cross_val_score
        estimator.fit(X_train, y_train, **fit_params)
      File "sklearn_pandas\__init__.py", line 19, in fit
        self.estimator.fit(self._get_row_subset(x), y)
      File "sklearn\pipeline.py", line 130, in fit
        Xt, fit_params = self._pre_transform(X, y, **fit_params)
      File "sklearn\pipeline.py", line 120, in _pre_transform
        Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
      File "sklearn\base.py", line 411, in fit_transform
        return self.fit(X, y, **fit_params).transform(X)
      File "sklearn_pandas\__init__.py", line 46, in fit
        transformer.fit(X[columns])
      File "sklearn\preprocessing\label.py", line 241, in fit
        self.classes_ = unique_labels(y)
      File "sklearn\utils\multiclass.py", line 98, in unique_labels
        raise ValueError("Unknown label type")
    ValueError: Unknown label type
**********************************************************************
1 items had failures:
   4 of  11 in README.rst
***Test Failed*** 4 failures.

paulgb · 2013-08-14T00:09:29Z

Thanks for reporting this Tim.

I suspected having floating point calculations in the tests might lead to some issues -- I've changed the tests to only match the first two digits after the decimal.
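One way to make such a check robust to platform-dependent floating-point noise is to compare with an absolute tolerance rather than matching printed digits. The sketch below only illustrates that idea with the expected and actual values from the report above; it is not the change that was committed, and the tolerance of 0.01 is an assumption.

```python
import numpy as np

# Values copied from the failing doctest output in the report above.
expected = np.array([47.62288153, -18.38596516, 1.62873661, -15.3709553,
                     -10.36602451, 16.62846476, -6.38116123, -15.37597671])
got = np.array([47.62195051, -18.39077736, 1.63037658, -15.36917967,
                -10.36208485, 16.62998504, -6.38386526, -15.376405])

# Comparison with NumPy's default (very tight) tolerances fails,
# just as the exact-text doctest did.
strict_match = np.allclose(expected, got)

# An absolute tolerance of 0.01 (an assumed value) accepts the small drift.
loose_match = np.allclose(expected, got, rtol=0, atol=1e-2)

print(strict_match, loose_match)
```

Plain rounding to two decimals would not be enough for this data: -10.36602451 rounds to -10.37, while -10.36208485 rounds to -10.36.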

Could you please paste the output of pip freeze for me? In the meantime I'll see if I can reproduce with the latest sklearn and pandas.

tdhopper · 2013-08-14T19:00:00Z

Flask==0.10.1
Jinja2==2.6
MDP==3.3
PIL==1.1.7
PySAL==1.5.0
PySide==1.1.2
PyYAML==3.10
Pygments==1.6
SQLAlchemy==0.8.1
Sphinx==1.1.3
Werkzeug==0.9.1
astropy==0.2.3
atom==0.2.3
binstar-client==0.1.0
biopython==1.61
bitarray==0.8.1
boto==2.9.6
casuarius==1.1
chaco==4.2.1
conda==1.8.1
cubes==0.10.2
distribute==0.6.45
docutils==0.10
enable==4.2.1
enaml==0.7.6
gevent==0.13.8
gevent-websocket==0.3.6
gevent-zeromq==0.2.2
greenlet==0.4.1
grin==1.2.1
h5py==2.1.1
ipython==0.13.2
itsdangerous==0.21
keyring==1.4
llvmmath==0.1
llvmpy==0.11.3
lxml==3.2.1
matplotlib==1.2.1
menuinst==1.0.1
meta==development
moves==0.1
networkx==1.7
nltk==2.0.4
nose==1.3.0
numba==0.9.0
numexpr==2.0.1
numpy==1.7.1
pandas==0.12.0
pep8==1.4.5
ply==3.4
praw==2.1.4
psutil==0.7.1
py==1.4.14
pycosat==0.6.0
pycparser==2.09.1
pycrypto==2.6
pyface==4.2.1
pyflakes==0.7.2
pyparsing==1.5.6
pyreadline==2.0-dev1
pytest==2.3.5
python-dateutil==2.1
pytz==2013b
pywin32==218.4
pyzmq==2.2.0.1
requests==1.2.3
rope==0.9.4
scikit-image==0.8.2
scikit-learn==0.14.1
scipy==0.12.0
simplejson==3.3.0
six==1.3.0
sklearn-pandas==0.0.3
spyder==2.2.0
statsmodels==0.4.3
sympy==0.7.2
tables==2.4.0
tornado==3.1
traits==4.2.1
traitsui==4.2.1
tweepy==2.1
update-checker==0.5
vincent==0.2
wsgiref==0.1.2
xlrd==0.9.2
xlwt==0.7.5

paulgb · 2013-08-15T01:44:59Z

I'm able to reproduce this. It seems the interface of sklearn has changed. The following code fails with scikit-learn 0.14.1 but works with scikit-learn 0.13.1:

import pandas as pd
import numpy as np
import sklearn.preprocessing

data = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
                     'children': [4., 6, 3, 3, 2, 3, 5, 4],
                     'salary':   [90, 24, 44, 27, 32, 59, 36, 27]})

lb = sklearn.preprocessing.LabelBinarizer()
lb.fit(data.pet)
print(lb.transform(data.pet))

I need to investigate further to see if this is a sklearn bug or if the tests need to be adjusted appropriately.

dolaameng · 2013-08-15T02:33:32Z

I second that it is a "bug" in sklearn 0.14, in file sklearn/utils/multiclass.py, function type_of_target, line 293:

if y.ndim > 2 or y.dtype == object:
    return 'unknown'

It returns 'unknown' for any strings held in an np.array. So it works fine with ['cat', 'dog', 'fish'], but no longer with np.asarray(['cat', 'dog', 'fish']).

paulgb · 2013-08-16T04:06:24Z

Good find. type_of_target actually will return 'multiclass' for np.array(['cat', 'dog', 'fish']), because it has the right dtype ('|S1'). But calling as_matrix() on the pandas dataframe gives a matrix with dtype 'object' because the columns have different types (or maybe this is always the pandas behaviour, I haven't checked).

Still, it seems a little weird for the behaviour to change based on the data representation of two arrays that numpy considers equivalent.

I'm going to take a better look at the sklearn code to see if there's a good reason behind this and if I can't find one I'll file a bug report on that project.

In [38]: p = np.array(['a', 'b', 'c'])

In [39]: q = np.array(['a', 'b', 'c'], dtype='object')

In [40]: np.array_equal(p, q)
Out[40]: True

In [41]: type_of_target(p)
Out[41]: 'multiclass'

In [42]: type_of_target(q)
Out[42]: 'unknown'
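The dtype difference behind the session above can be checked with NumPy alone. This is a minimal sketch that does not call sklearn; note that on Python 3 NumPy infers a Unicode dtype (kind 'U') for the first array rather than the '|S1' byte-string dtype mentioned in this thread.

```python
import numpy as np

p = np.array(['a', 'b', 'c'])                # fixed-width string dtype inferred by NumPy
q = np.array(['a', 'b', 'c'], dtype=object)  # same values stored as Python objects

# 'U' (Unicode) on Python 3, 'S' (bytes) on Python 2; 'O' means object.
print(p.dtype.kind, q.dtype.kind)

# The values are identical even though the dtypes differ.
same_values = p.tolist() == q.tolist()
print(same_values)
```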

paulgb · 2013-08-20T03:46:12Z

It seems to be a deeper disconnect between sklearn and pandas than I'd hoped: pandas seems to want string arrays to have the "object" dtype, while sklearn expects them to have the appropriate numpy string datatype. I've found a few hack solutions but none that I feel good about publishing. I'm continuing to dig into the sklearn code to see if there's a better way.

ogrisel · 2013-08-20T13:39:32Z

I think we want to support dtype='object' for arrays of variable-length strings as well in scikit-learn: type_of_target(q) == 'unknown' is probably a bug.

ogrisel · 2013-08-20T13:40:18Z

I wish numpy had a dtype for variable length strings...

paulgb · 2013-08-20T13:48:20Z

Yes, it's a shame that an array of sequences looks the same (from a dtype perspective) as an array of variable-length strings.

The least hacky workaround I can think of without changing sklearn is to convert the arrays to fixed-length strings (np.array(X, dtype='|S')) before sending them off to sklearn, but it would be great to fix sklearn instead. It looks to me like it would just be a matter of looping over the array in sklearn to check whether the elements are all str/unicode objects. Thoughts?
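Both ideas in the comment above, the element-type check and the fixed-width conversion, can be sketched as follows. The helper names all_strings and to_fixed_width are hypothetical and are not part of sklearn or sklearn-pandas. The sketch targets Python 3, where the '|S' dtype would produce bytes, so it converts with str instead.

```python
import numpy as np

def all_strings(arr):
    """Hypothetical check: True if every element of the array is a str."""
    return all(isinstance(value, str) for value in arr.ravel())

def to_fixed_width(arr):
    """Hypothetical helper: convert an object array that holds only strings
    to a fixed-width string dtype; return anything else unchanged."""
    arr = np.asarray(arr)
    if arr.dtype == object and all_strings(arr):
        return arr.astype(str)
    return arr

labels = np.array(['cat', 'dog', 'fish'], dtype=object)
mixed = np.array(['cat', 1, 2.5], dtype=object)

converted = to_fixed_width(labels)  # becomes a fixed-width string array
untouched = to_fixed_width(mixed)   # stays an object array

print(converted.dtype.kind, untouched.dtype.kind)
```

The check costs a full pass over the data, which is the scanning overhead raised later in this thread.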

ogrisel · 2013-08-20T13:54:31Z

+1 for the temp workaround in sklearn_pandas to restore sklearn 0.14 compat, and I will create an issue for the sklearn project.

ogrisel · 2013-09-19T18:20:01Z

@paulgb wouldn't converting to a list of str or unicode objects before sending to sklearn work even better?

paulgb · 2013-09-20T03:15:02Z

That would work, but the issue is more with knowing when to convert to strings without having to do a scan of the entire table.

I experimented a little more with pandas internals and it seems that object is used as a table dtype any time a table has heterogeneous types, but as a column dtype only when the content is a string. So my workaround solution is to convert the matrix dtype to string if and only if every column in the mapping has the object dtype.
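The rule described above can be sketched roughly as follows. This illustrates the idea only and is not the code shipped in sklearn-pandas 0.0.4. The helper name extract_columns is hypothetical, and to_numpy() stands in for the as_matrix() call of the pandas version discussed in this thread.

```python
import numpy as np
import pandas as pd

def extract_columns(df, columns):
    """Hypothetical helper: return df[columns] as a NumPy array, converting to
    a fixed-width string dtype only when every selected column is object."""
    subset = df[columns]
    values = subset.to_numpy()
    if all(dtype == object for dtype in subset.dtypes):
        values = values.astype(str)
    return values

data = pd.DataFrame({
    # dtype=object is set explicitly so the example does not depend on the
    # default string dtype of the installed pandas version.
    'pet': pd.Series(['cat', 'dog', 'fish'], dtype=object),
    'children': [4., 6., 3.],
})

only_strings = extract_columns(data, ['pet'])       # converted to strings
mixed = extract_columns(data, ['pet', 'children'])  # left as object

print(only_strings.dtype.kind, mixed.dtype.kind)
```

Checking column dtypes avoids scanning every value, at the cost of relying on pandas storing string columns as object.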

I've updated the code here and on PyPI to version 0.0.4, which includes this fix. I'd like to wait to hear some feedback on whether it solved the problem before closing this issue.

paulgb · 2013-09-21T21:03:12Z

Closing this optimistically based on tests passing and things working for Tim (https://twitter.com/tdhopper/status/381088588739272704)

tdhopper · 2013-09-21T22:41:56Z

First time one of my tweets has ever been mentioned as a reason for closing a bug report.

linehammer · 2021-04-21T04:59:13Z

There is a mismatch between what you can pass and what you are actually passing: scikit-learn cannot recognize which type of problem you want to solve (regression or classification). The "Unknown label type: 'unknown'" error relates to the Y values that you pass to scikit-learn.

Solutions:

Group your Y values into bins (classes, for example 0, 1, 2, 3) and apply classification modeling to your data.
In most cases, your Y values are of type object, so sklearn cannot recognize their type. Add the line y = y.astype('int') before you pass the variable into the classifier.
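A minimal NumPy sketch of the two suggestions above, assuming numeric targets. The bin edges and sample values are made up for illustration. Neither step applies to the string labels discussed earlier in this thread, since casting 'cat' to int would raise an error.

```python
import numpy as np

# Continuous targets grouped into classes 0..3; the bin edges are assumed.
y_continuous = np.array([0.2, 1.5, 2.7, 3.9])
y_binned = np.digitize(y_continuous, bins=[1.0, 2.0, 3.0])

# Integer labels that ended up in an object array, cast back to integers.
y_object = np.array([0, 1, 1, 2], dtype=object)
y_int = y_object.astype(int)

print(y_binned.tolist(), y_int.dtype.kind)
```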

paulgb mentioned this issue Aug 15, 2013

mapper.fit_transform() fails #3

Closed

ogrisel mentioned this issue Aug 20, 2013

Multiclass and multilabel classifiers should accept arrays with string labels with dtype=object scikit-learn/scikit-learn#2374

Closed

paulgb added a commit that referenced this issue Sep 20, 2013

major cleanup and docs push, workaround fix for bug #2

8c4af33

paulgb closed this as completed Sep 21, 2013
