New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

order of returned probabilites unclear for cross_val_predict with method=predict_proba #7863

Closed
simonm3 opened this Issue Nov 13, 2016 · 9 comments

Comments

Projects
None yet
3 participants
@simonm3

simonm3 commented Nov 13, 2016

Cross_val_predict has a new method parameter which is typically set to "predict_proba" to return probabilities for each class.

However the order of the classes returned is unclear. Either self.classes_ needs to be set; or the results need to be returned in a predictable order. Otherwise we have a list of probabilities for each class but no way to know which column relates to which class.

@jnothman jnothman added the Bug label Nov 14, 2016

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Nov 14, 2016

Member

Yes, I suppose this (and disagreements in the set of classes between splits) was overlooked.

(For internal models, I'm fairly sure that classes_ is always alphabetically ordered.)

Member

jnothman commented Nov 14, 2016

Yes, I suppose this (and disagreements in the set of classes between splits) was overlooked.

(For internal models, I'm fairly sure that classes_ is always alphabetically ordered.)

@dalmia

This comment has been minimized.

Show comment
Hide comment
@dalmia

dalmia Nov 15, 2016

Contributor

Can I start working on this?

Contributor

dalmia commented Nov 15, 2016

Can I start working on this?

@dalmia

This comment has been minimized.

Show comment
Hide comment
@dalmia

dalmia Nov 16, 2016

Contributor

After running a few tests on the iris dataset using LogisticRegression as the estimator, it became clear that the order of the classes appearing did not matter in the final result. This gives a simple and clear explanation.

Contributor

dalmia commented Nov 16, 2016

After running a few tests on the iris dataset using LogisticRegression as the estimator, it became clear that the order of the classes appearing did not matter in the final result. This gives a simple and clear explanation.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Nov 16, 2016

Member

Iris aside, it's possible to create a cross-validation strategy that will
land up with only a subset of the classes being present in the training
data for a particular training set. This should be handled.

Firstly the documentation should point out that the classes will be in
sorted order, and this should, in theory at least, be confirmed, reordering
if necessary, from the underlying estimators.

On 16 November 2016 at 11:41, Aman Dalmia notifications@github.com wrote:

After running a few tests on the iris datasets using LogisticRegression as
the estimator, it became clear that the order of the classes appearing did
not matter in the final result. This
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_log_proba
gives a simple and clear explanation.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#7863 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz65Z6-gNAXqLB0OC8OTlEeBKG9pBeks5q-lFNgaJpZM4Kw2SU
.

Member

jnothman commented Nov 16, 2016

Iris aside, it's possible to create a cross-validation strategy that will
land up with only a subset of the classes being present in the training
data for a particular training set. This should be handled.

Firstly the documentation should point out that the classes will be in
sorted order, and this should, in theory at least, be confirmed, reordering
if necessary, from the underlying estimators.

On 16 November 2016 at 11:41, Aman Dalmia notifications@github.com wrote:

After running a few tests on the iris datasets using LogisticRegression as
the estimator, it became clear that the order of the classes appearing did
not matter in the final result. This
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_log_proba
gives a simple and clear explanation.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#7863 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz65Z6-gNAXqLB0OC8OTlEeBKG9pBeks5q-lFNgaJpZM4Kw2SU
.

@dalmia

This comment has been minimized.

Show comment
Hide comment
@dalmia

dalmia Nov 16, 2016

Contributor

True, we could miss out on a few observations in cross-validation. As for the confirmation of the classes being sorted, since these lines of the cross_val_predict function already ensure that only estimators capable of calculating the probability are passed 'predict_proba' as the method, and all such estimators return the classes in sorted order, do we need any other mode of confirmation?
Please correct me if I misunderstood what you intended to say.

 if not callable(getattr(estimator, method)):
        raise AttributeError('{} not implemented in estimator'
                             .format(method))
Contributor

dalmia commented Nov 16, 2016

True, we could miss out on a few observations in cross-validation. As for the confirmation of the classes being sorted, since these lines of the cross_val_predict function already ensure that only estimators capable of calculating the probability are passed 'predict_proba' as the method, and all such estimators return the classes in sorted order, do we need any other mode of confirmation?
Please correct me if I misunderstood what you intended to say.

 if not callable(getattr(estimator, method)):
        raise AttributeError('{} not implemented in estimator'
                             .format(method))
@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Nov 16, 2016

Member

Well, we don't promise that all return classes in sorted order; all store
classes in .classes_

On 16 November 2016 at 14:44, Aman Dalmia notifications@github.com wrote:

True, we could miss out on a few observations in cross-validation. As for
the confirmation of the classes being sorted, since these lines of the
cross_val_predict function already ensure that only estimators capable of
calculating the probability are passed 'predict_proba' as the method, and
all such estimators return the classes in sorted order, do we need any
other mode of confirmation?
Please correct me if I misunderstood what you intended to say.

if not callable(getattr(estimator, method)):
raise AttributeError('{} not implemented in estimator'
.format(method))


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#7863 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz643iTLmmU5CBwLaMK5h6Iwqak0f3ks5q-nwFgaJpZM4Kw2SU
.

Member

jnothman commented Nov 16, 2016

Well, we don't promise that all return classes in sorted order; all store
classes in .classes_

On 16 November 2016 at 14:44, Aman Dalmia notifications@github.com wrote:

True, we could miss out on a few observations in cross-validation. As for
the confirmation of the classes being sorted, since these lines of the
cross_val_predict function already ensure that only estimators capable of
calculating the probability are passed 'predict_proba' as the method, and
all such estimators return the classes in sorted order, do we need any
other mode of confirmation?
Please correct me if I misunderstood what you intended to say.

if not callable(getattr(estimator, method)):
raise AttributeError('{} not implemented in estimator'
.format(method))


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#7863 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz643iTLmmU5CBwLaMK5h6Iwqak0f3ks5q-nwFgaJpZM4Kw2SU
.

@dalmia

This comment has been minimized.

Show comment
Hide comment
@dalmia

dalmia Nov 16, 2016

Contributor

Then how about, in the _fit_and_predict function, I retrieve the classes_ attribute, make a dictionary of the column indices and the class labels, sort them on the class labels and apply the transformation to the predictions, thereby ensuring that a sorted result is returned?

Contributor

dalmia commented Nov 16, 2016

Then how about, in the _fit_and_predict function, I retrieve the classes_ attribute, make a dictionary of the column indices and the class labels, sort them on the class labels and apply the transformation to the predictions, thereby ensuring that a sorted result is returned?

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Nov 16, 2016

Member

That sort of thing, yes

On 16 November 2016 at 16:23, Aman Dalmia notifications@github.com wrote:

Then how about, in the fit_and_predict function, I retrieve the 'classes'
attribute, make a dictionary of the column indices and the class labels,
sort them and apply the transformation to the predictions, thereby ensuring
that a sorted result is returned?


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#7863 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz61TyLjZQJX7Jjehj00p84uYkIiMHks5q-pNWgaJpZM4Kw2SU
.

Member

jnothman commented Nov 16, 2016

That sort of thing, yes

On 16 November 2016 at 16:23, Aman Dalmia notifications@github.com wrote:

Then how about, in the fit_and_predict function, I retrieve the 'classes'
attribute, make a dictionary of the column indices and the class labels,
sort them and apply the transformation to the predictions, thereby ensuring
that a sorted result is returned?


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#7863 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz61TyLjZQJX7Jjehj00p84uYkIiMHks5q-pNWgaJpZM4Kw2SU
.

@simonm3

This comment has been minimized.

Show comment
Hide comment
@simonm3

simonm3 Nov 16, 2016

That would be a good idea as it would make sure 3rd party classifiers used the same standard. I know it works today for LogisticRegression and RandomForestClassifier. However I am unsure for XGBClassifier as that is outside sklearn.

simonm3 commented Nov 16, 2016

That would be a good idea as it would make sure 3rd party classifiers used the same standard. I know it works today for LogisticRegression and RandomForestClassifier. However I am unsure for XGBClassifier as that is outside sklearn.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment