MultiOutputClassifier.predict_proba fails if targets have different number of values #8093

Closed
pjbull opened this issue Dec 21, 2016 · 9 comments
Labels
Bug

Comments

@pjbull (Contributor) commented Dec 21, 2016

Description

If two target columns are categorical and have a different number of unique values, MultiOutputClassifier.predict_proba raises a ValueError when trying to np.dstack the per-output probability matrices.

Steps/Code to Reproduce

Example:

from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

import numpy as np

# random features
X = np.random.normal(size=(100, 100))

# random labels
Y = np.concatenate([
        np.random.choice(['a', 'b'], (100, 1)),     # first column can have 2 values
        np.random.choice(['d', 'e', 'f'], (100, 1)) # second column can have 3 values
    ], axis=1)

clf = MultiOutputClassifier(LogisticRegression())

clf.fit(X, Y)

clf.predict_proba(X)

Expected Results

No error is thrown. It looks like the RandomForestClassifier handles data of this shape and returns a list of numpy arrays. I would expect the same behavior in this case.
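
For comparison, a minimal sketch of that list-of-arrays behavior, reusing the X and Y above (n_estimators chosen arbitrarily):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=10)
rf.fit(X, Y)

proba = rf.predict_proba(X)   # list with one array per output column
print(type(proba))            # <class 'list'>
print(proba[0].shape)         # (100, 2) -- probabilities over 'a'/'b'
print(proba[1].shape)         # (100, 3) -- probabilities over 'd'/'e'/'f'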

Actual Results

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-40-84c34a558c92> in <module>()
     18 clf.fit(X, Y)
     19 
---> 20 clf.predict_proba(X)

anaconda/lib/python3.5/site-packages/sklearn/multioutput.py in predict_proba(self, X)
    224 
    225         results = np.dstack([estimator.predict_proba(X) for estimator in
--> 226                             self.estimators_])
    227         return results
    228 

anaconda/lib/python3.5/site-packages/numpy/lib/shape_base.py in dstack(tup)
    366 
    367     """
--> 368     return _nx.concatenate([atleast_3d(_m) for _m in tup], 2)
    369 
    370 def _replace_zero_by_x_arrays(sub_arys):

ValueError: all the input array dimensions except for the concatenation axis must match exactly

Versions

Darwin-15.6.0-x86_64-i386-64bit
Python 3.5.2 |Anaconda custom (x86_64)| (default, Jul  2 2016, 17:52:12) 
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]
NumPy 1.11.1
SciPy 0.18.1
Scikit-Learn 0.18.1

If returning a list is the right fix, happy to submit a PR for this.
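
A rough sketch of what that could look like in MultiOutputClassifier.predict_proba, based on the lines shown in the traceback above (a sketch, not the final patch):

def predict_proba(self, X):
    # return one (n_samples, n_classes_i) array per output, so each
    # output is free to have a different number of classes
    results = [estimator.predict_proba(X) for estimator in
               self.estimators_]
    return results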

@jnothman (Member) commented Dec 21, 2016

Thanks for reporting, and for the patch.

@yupbank (Contributor) commented Dec 21, 2016

How about using a sparse matrix or a 3d array? Some "extra" probabilities get introduced, though...

@jnothman (Member) commented Dec 21, 2016

@yupbank (Contributor) commented Dec 21, 2016

Then it cannot be indexed in a matrix fashion, and a bunch of the np.arg* functions cannot be used..

@pjbull (Contributor, Author) commented Dec 21, 2016

To me, it makes more sense to leave it up to the consumer whether to stack as a sparse matrix (which introduces NaNs) or to concatenate into a larger 2d matrix (which makes it harder to separate out individual estimators). Both are one-line calls in numpy, but both make it more challenging to reason about which predictions belong to which estimator/class.

The list of arrays lets you easily construct either of those, but it's much more of a pain to go the other way and back out probabilities for any one estimator from those structures.
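
As a rough illustration, assuming proba is the list returned by such a predict_proba, the consumer-side one-liners would be something like:

import numpy as np

# proba: list of arrays, proba[i] has shape (n_samples, n_classes_i)
proba_second_output = proba[1]   # per-estimator access stays trivial

# one wide 2d matrix of shape (n_samples, sum of n_classes_i); the column
# boundaries per estimator then have to be tracked separately
wide = np.hstack(proba)

# stacking into a single 3d array only works when every output happens
# to have the same number of classes:
# cube = np.dstack(proba)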

@yupbank (Contributor) commented Dec 21, 2016

A sparse matrix does not necessarily introduce NaNs; it fills with 0s by default..
Something like
(output, max_classes, probability)?
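
Something like this dense, zero-padded sketch of that layout (assuming proba is the list of per-output arrays):

import numpy as np

n_samples = proba[0].shape[0]
max_classes = max(p.shape[1] for p in proba)

# (n_outputs, n_samples, max_classes), zero-padded wherever an output
# has fewer than max_classes classes
padded = np.zeros((len(proba), n_samples, max_classes))
for i, p in enumerate(proba):
    padded[i, :, :p.shape[1]] = p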

@pjbull (Contributor, Author) commented Dec 21, 2016

I don't like it, because then the user has to track how many classes are valid for each estimator.

@jnothman (Member) commented Dec 21, 2016

@maniteja123 (Contributor) commented Dec 22, 2016

Thanks for the report and fix. Sorry, I wasn't aware that a list of probabilities is the common return for predict_proba on multi-output estimators, and I also didn't think about the issue of a different number of classes for different outputs.
