
MultiOutputClassifier.predict_proba fails if targets have different number of values #8093

Closed
pjbull opened this Issue Dec 21, 2016 · 9 comments

@pjbull
Contributor

pjbull commented Dec 21, 2016

Description

If two target columns are categorical and have a different number of unique values, MultiOutputClassifier.predict_proba raises a ValueError when trying to dstack the probability matrices.

Steps/Code to Reproduce

Example:

from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

import numpy as np

# random features
X = np.random.normal(size=(100, 100))

# random labels
Y = np.concatenate([
        np.random.choice(['a', 'b'], (100, 1)),     # first column can have 2 values
        np.random.choice(['d', 'e', 'f'], (100, 1)) # second column can have 3 values
    ], axis=1)

clf = MultiOutputClassifier(LogisticRegression())

clf.fit(X, Y)

clf.predict_proba(X)

Expected Results

No error is thrown. It looks like the RandomForestClassifier handles data of this shape and returns a list of numpy arrays. I would expect the same behavior in this case.
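For comparison, here is a minimal sketch (using the same shape of random data as above, with its own seed) showing that RandomForestClassifier handles multi-output targets natively and returns one probability array per output:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 100))
Y = np.concatenate([
    rng.choice(['a', 'b'], (100, 1)),     # first output: 2 classes
    rng.choice(['d', 'e', 'f'], (100, 1)) # second output: 3 classes
], axis=1)

rf = RandomForestClassifier(random_state=0).fit(X, Y)

# For multi-output targets, the forest returns a list with one
# (n_samples, n_classes_k) array per output, so the differing
# class counts are not a problem.
probs = rf.predict_proba(X)
print(type(probs), len(probs))          # list of 2 arrays
print(probs[0].shape, probs[1].shape)   # (100, 2) (100, 3)
```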

Actual Results

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-40-84c34a558c92> in <module>()
     18 clf.fit(X, Y)
     19 
---> 20 clf.predict_proba(X)

anaconda/lib/python3.5/site-packages/sklearn/multioutput.py in predict_proba(self, X)
    224 
    225         results = np.dstack([estimator.predict_proba(X) for estimator in
--> 226                             self.estimators_])
    227         return results
    228 

anaconda/lib/python3.5/site-packages/numpy/lib/shape_base.py in dstack(tup)
    366 
    367     """
--> 368     return _nx.concatenate([atleast_3d(_m) for _m in tup], 2)
    369 
    370 def _replace_zero_by_x_arrays(sub_arys):

ValueError: all the input array dimensions except for the concatenation axis must match exactly

Versions

Darwin-15.6.0-x86_64-i386-64bit
Python 3.5.2 |Anaconda custom (x86_64)| (default, Jul  2 2016, 17:52:12) 
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]
NumPy 1.11.1
SciPy 0.18.1
Scikit-Learn 0.18.1

If returning a list is the right fix, happy to submit a PR for this.
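In the meantime, a workaround is to bypass the failing dstack and query each fitted per-output estimator directly (a self-contained sketch reproducing the setup above with a fixed seed):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 100))
Y = np.concatenate([
    rng.choice(['a', 'b'], (100, 1)),
    rng.choice(['d', 'e', 'f'], (100, 1))
], axis=1)

clf = MultiOutputClassifier(LogisticRegression()).fit(X, Y)

# Call predict_proba on each underlying estimator instead of on
# the wrapper, yielding one (n_samples, n_classes_k) array per
# output regardless of how many classes each output has.
probs = [est.predict_proba(X) for est in clf.estimators_]
```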

@jnothman
Member

jnothman commented Dec 21, 2016

thanks for reporting, and for the patch

@yupbank
Contributor

yupbank commented Dec 21, 2016

How about using a sparse matrix or a 3D array? Some "extra" probabilities are introduced, though...

@jnothman
Member

jnothman commented Dec 21, 2016

@yupbank
Contributor

yupbank commented Dec 21, 2016

Then it cannot be indexed in a matrix fashion, and a bunch of the np.arg* functions cannot be used.

@pjbull
Contributor

pjbull commented Dec 21, 2016

To me, it makes more sense to leave the decision up to the consumer: stack as a sparse matrix (which introduces NaNs) or concatenate into a larger 2D matrix (which makes it harder to separate out individual estimators). Both are one-line calls in numpy, but both make it more challenging to reason about which predictions belong to which estimator/class.

The list of arrays lets you easily construct either of those, but it's much more of a pain to go the other way and back out probabilities for any one estimator from those structures.
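Both constructions mentioned here are easy to build starting from the list; a sketch, where probs stands in for a list of per-output probability arrays of different widths:

```python
import numpy as np

# Stand-ins for per-output probability arrays, e.g. 2 classes for
# the first output and 3 for the second.
probs = [np.full((100, 2), 0.5), np.full((100, 3), 1.0 / 3.0)]

# One-liner: concatenate into a single wide 2D matrix.
wide = np.hstack(probs)  # shape (100, 5)

# Pad each output's array with NaN up to the widest output, then
# stack into a 3D array of shape (n_outputs, n_samples, max_classes).
max_classes = max(p.shape[1] for p in probs)
padded = np.stack([
    np.pad(p, ((0, 0), (0, max_classes - p.shape[1])),
           constant_values=np.nan)
    for p in probs
])  # shape (2, 100, 3)
```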

@yupbank
Contributor

yupbank commented Dec 21, 2016

A sparse matrix does not necessarily introduce NaNs; it introduces 0s by default.
Something like
(output, max_classes, probability)?

@pjbull
Contributor

pjbull commented Dec 21, 2016

I don't like it, because the user has to track which number of classes is valid for each estimator.

@jnothman
Member

jnothman commented Dec 21, 2016

@maniteja123
Contributor

maniteja123 commented Dec 22, 2016

Thanks for the report and fix. Sorry, I wasn't aware that returning a list of probabilities is the common convention for predict_proba on multi-output estimators, and I also didn't think about the issue of different numbers of classes for different outputs.

@raghavrv raghavrv closed this in #8095 Dec 22, 2016
