Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
MultiOutputClassifier.predict_proba fails if targets have different number of values #8093
If two target columns are categorical and have a different number unique values, MultiOutputClassifier.predict_proba raises a value error when trying to
Steps/Code to Reproduce
from sklearn.linear_model import LogisticRegression from sklearn.multioutput import MultiOutputClassifier import numpy as np # random features X = np.random.normal(size=(100, 100)) # random labels Y = np.concatenate([ np.random.choice(['a', 'b'], (100, 1)), # first column can have 2 values np.random.choice(['d', 'e', 'f'], (100, 1)) # second column can have 3 ], axis=1) clf = MultiOutputClassifier(LogisticRegression()) clf.fit(X, Y) clf.predict_proba(X)
No error is thrown. It looks like the
If returning a list is the right fix, happy to submit a PR for this.
referenced this issue
Dec 21, 2016
What's wrong with a list of arrays?…
On 22 December 2016 at 06:07, Peng Yu ***@***.***> wrote: how about use a sparse matrix of 3d array? some "extra" probability are imported though... — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#8093 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz6xFglpZUaCX2GBGtqiwhAOlRfOSBks5rKXkGgaJpZM4LShdL> .
To me, it makes more sense to leave the decision to stack as a sparse matrix (which introduces NaNs) or concatenate into a larger 2d matrix (which makes it harder to separate out individual estimators) up to the consumer. Both are one line calls in numpy, but make it more challenging to reason about which predictions belong to which estimator/class.
The list of arrays lets you easily construct either of those, but it's much more of a pain to go the other way and back out probabilities for any one estimator from those structures.
list of arrays is already what we've been doing for multioutput probabilities for a long time, AFAIK. I don't know why it got implemented here differently.…
On 22 December 2016 at 08:13, Peng Yu ***@***.***> wrote: sparse matrix does not necessary introduce NaN but also 0s by default.. something like (output, max_classes, probability) ? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#8093 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz68dbGr5hNV2JnsmKIxRFp9XGzyGnks5rKZaJgaJpZM4LShdL> .