[MRG+1] TST Move roc_auc_score from METRIC_UNDEFINED_BINARY to METRIC_UNDEFINED_MULTICLASS #9786

Merged: 27 commits merged into scikit-learn:master on Sep 27, 2017

Conversation

qinhanmin2014 (Member) commented Sep 16, 2017

Reference Issue

Proposed in #9567 by @jnothman

What does this implement/fix? Explain your changes.

METRIC_UNDEFINED_BINARY lists metrics that don't support binary inputs and METRIC_UNDEFINED_MULTICLASS lists metrics that don't support multiclass inputs, so it seems that roc_auc_score belongs in METRIC_UNDEFINED_MULTICLASS.
In order to pass the tests in test_common.py, I have to:
(1) add a check to ensure that the shape of sample_weight is [n_samples] (a regression test is already in test_common.py; a sketch of such a check follows below);
(2) carefully choose the scaling value in the test to reduce the small errors introduced by floating-point arithmetic (e.g., 4.2 + 2.1 == 6.3 is False).
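
For illustration only, a minimal sketch of the kind of shape check meant in (1), using sklearn's check_consistent_length and column_or_1d utilities; the helper name is hypothetical and the actual diff may differ:

import numpy as np
from sklearn.utils import check_consistent_length, column_or_1d


def _validate_sample_weight(y_true, sample_weight):
    # Hypothetical helper (not the actual diff): coerce sample_weight to
    # shape [n_samples] and check that its length matches y_true.
    sample_weight = column_or_1d(sample_weight)
    check_consistent_length(y_true, sample_weight)
    return np.asarray(sample_weight, dtype=float)


y_true = np.array([0, 1, 1, 0])
print(_validate_sample_weight(y_true, [1.0, 2.0, 0.5, 1.5]).shape)  # (4,)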

Any other comments?

cc @jnothman

qinhanmin2014 added some commits Sep 16, 2017

jnothman (Member) commented Sep 18, 2017

What kinds of discrepancy between expected and actual do we get if we leave the scaling as it was?

jnothman (Member) commented Sep 18, 2017

We may just need a higher tolerance in assert_almost_equal because we're now dealing with a threshold-based metric rather than discrete labels.

qinhanmin2014 (Member) commented Sep 18, 2017

@jnothman Thanks.
If we use the original scaling value, we have to set decimal=2 to pass the test.
Here is the code to reproduce the test:

import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.utils import check_random_state

n_samples = 50
random_state = check_random_state(0)
y_true = random_state.randint(0, 2, size=(n_samples, ))
y_pred = random_state.randint(0, 2, size=(n_samples, ))
y_score = random_state.random_sample(size=(n_samples,))
rng = np.random.RandomState(0)
sample_weight = rng.randint(1, 10, size=len(y_true))
roc_auc_score(y_true, y_score, sample_weight=sample_weight)
# 0.38036523593708349
roc_auc_score(y_true, y_score, sample_weight=sample_weight*2)
# 0.38036523593708349
roc_auc_score(y_true, y_score, sample_weight=sample_weight*0.3)
# 0.38733004532124787

I checked the implementation and it seems right. The difference seems to be introduced by floating-point arithmetic.
So should we keep the original scaling value and set decimal=2?

jnothman (Member) commented Sep 18, 2017

As long as we don't have any estimators that require scores in (0, 1), we could use integer scores to encourage more stability...?

jnothman (Member) commented Sep 18, 2017

Integer scores, where scores are likely to be equal, would also be a more challenging test to pass.

qinhanmin2014 (Member) commented Sep 18, 2017

@jnothman Sorry but I don't quite understand. What do you mean by 'integer scores'? Could you please provide more details? Thanks :)
From my perspective, I can only come up with two solutions:
(1) reduce the precision requirement;
(2) keep the default precision requirement and choose a 'good' scaling value (e.g., 0.5) for sample_weight.

jnothman (Member) commented Sep 18, 2017

Currently we have y_score = random_state.random_sample(size=(n_samples,)) generating floats in (0,1). Instead we could use random_state.randint(10, size=(n_samples,)) to generate ints in [0,9].

qinhanmin2014 (Member) commented Sep 18, 2017

@jnothman
Do you mean something like this?

y_true = random_state.randint(0, 2, size=(n_samples, ))
y_score = random_state.randint(0, 10, size=(n_samples, ))
sample_weight = rng.randint(1, 10, size=len(y_true))
roc_auc_score(y_true, y_score, sample_weight=sample_weight)

This approach works for roc_auc_score, but brier_score_loss requires scores in (0, 1), so the common test cannot pass this way.

jnothman (Member) commented Sep 18, 2017

qinhanmin2014 added some commits Sep 18, 2017

@jnothman jnothman changed the title from [MRG] TST Move roc_auc_score from METRIC_UNDEFINED_BINARY to METRIC_UNDEFINED_MULTICLASS to [MRG+1] TST Move roc_auc_score from METRIC_UNDEFINED_BINARY to METRIC_UNDEFINED_MULTICLASS Sep 18, 2017

qinhanmin2014 (Member) commented Sep 18, 2017

@jnothman
I got strange test failure when decimal=2 (it can pass locally). Could you please help me? Thanks.
(currently I use decimal=1 to pass the test)

AssertionError: 
Arrays are not almost equal to 2 decimals
roc_auc_score sample_weight is not invariant under scaling
 ACTUAL: 0.38036523593708343
 DESIRED: 0.38733004532124793
>>  raise AssertionError('\nArrays are not almost equal to 2 decimals\nroc_auc_score sample_weight is not invariant under scaling\n ACTUAL: 0.38036523593708343\n DESIRED: 0.38733004532124793')

jnothman (Member) commented Sep 18, 2017

qinhanmin2014 (Member) commented Sep 18, 2017

yes on the face of it that looks weird. assert_allclose is a variant which gives more explicit control of tolerances if you'd rather.

@jnothman Thanks. assert_allclose works. CIs are green. Is this OK for you?
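
For reference, a small illustration of the two tolerance styles discussed here, using the two scores quoted above (illustrative only; both assertions pass since the absolute difference is about 0.007):

import numpy as np

a = 0.38036523593708349   # weighted score
b = 0.38733004532124787   # score with sample_weight scaled by 0.3

# assert_almost_equal with decimal=2 checks abs(a - b) < 1.5 * 10**-2
np.testing.assert_almost_equal(a, b, decimal=2)

# assert_allclose exposes the absolute and relative tolerances explicitly
np.testing.assert_allclose(a, b, rtol=0, atol=1e-2)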

jnothman (Member) commented Sep 18, 2017

Yes, but we'll wait for another review to be sure I'm not missing something!

lesteve (Member) commented Sep 18, 2017

atol=1e-2 seems quite high and is only required for roc_auc metrics (all the other metrics pass the tests for the default values i.e. atol=0, rtol=1e-7), do we understand why this is the case?

With scaling=1/17 I get differences that are bigger than 1% (in relative difference) which does not seem like it can be explained by floating point differences ...

E               AssertionError: 
E               Not equal to tolerance rtol=1e-07, atol=0
E               roc_auc_score sample_weight is not invariant under scaling
E               (mismatch 100.0%)
E                x: array(0.3803652359370835)
E                y: array(0.3757664622767263)

jnothman (Member) commented Sep 18, 2017

qinhanmin2014 (Member) commented Sep 18, 2017

@lesteve Thanks for your review.
also ping @jnothman
I traced the whole calculation process of roc_auc_score again. The main reason for the differences lies in the auc function in ranking.py. On entry, the parameters (i.e., x and y) are almost the same (difference < 1e-7), but since we introduce a very small difference in x, we get a different sort order and therefore a much bigger difference from np.trapz. See the following code snippet:

import numpy as np

x1 = [0.6198347107438017, 0.6776859504132231, 0.6776859504132231, 
      0.6776859504132231, 0.6776859504132231, 0.6776859504132231,
      0.74380165289256195, 0.74380165289256195, 0.8925619834710744, 
      0.92561983471074383, 1.0, 1.0, 
      1.0, 1.0, 1.0]
x2 = [0.61983471074380181, 0.6776859504132231, 0.67768595041322344, 
      0.67768595041322321, 0.67768595041322321, 0.67768595041322321,
      0.74380165289256217, 0.74380165289256217, 0.89256198347107418, 
      0.92561983471074349, 0.99999999999999956, 0.99999999999999967,
      0.99999999999999989, 0.99999999999999967, 1.0]
y1 = [0.62096774193548387, 0.62096774193548387, 0.62903225806451613, 
      0.66129032258064513, 0.717741935483871, 0.7338709677419355,
      0.7338709677419355, 0.75806451612903225, 0.75806451612903225,
      0.75806451612903225, 0.75806451612903225, 0.80645161290322576, 
      0.81451612903225812, 0.84677419354838712, 0.94354838709677424]
y2 = [0.62096774193548399, 0.62096774193548399, 0.62903225806451624,
      0.66129032258064524, 0.717741935483871, 0.7338709677419355, 
      0.7338709677419355, 0.75806451612903225, 0.75806451612903225,
      0.75806451612903225, 0.75806451612903225, 0.80645161290322587, 
      0.81451612903225812, 0.84677419354838712, 0.94354838709677424]
x1 = np.array(x1)
x2 = np.array(x2)
y1 = np.array(y1)
y2 = np.array(y2)
np.testing.assert_allclose(x1, x2) #pass
np.testing.assert_allclose(y1, y2) #pass
order = np.lexsort((y1, x1))
x1, y1 = x1[order], y1[order]
order = np.lexsort((y2, x2))
x2, y2 = x1[order], y1[order]
print(np.trapz(y1, x1))
# 0.27865902426
print(np.trapz(y2, x2))
# 0.275193281792

If we want to solve the problem, we will need a more robust sorting function (e.g., one that treats two floats as equal when |x1 - x2| < 1e-7) instead of simply calling np.lexsort. We may also need to modify the following statement to use np.allclose(..., 0).

optimal_idxs = np.where(np.r_[True,
    np.logical_or(np.diff(fps, 2), np.diff(tps, 2)), True])[0]

There is also a hackier way to achieve the goal: round x and y before sorting.

x = np.round(x, 10)
y = np.round(y, 10)
order = np.lexsort((y, x))

This gives a stable result very close to the original:

roc_auc_score(y_true, y_score, sample_weight=sample_weight)
# before(unstable):0.38036523593708349
# after(stable):0.38036523591966814

jnothman (Member) commented Sep 18, 2017

lesteve (Member) commented Sep 19, 2017

@jnothman I trust you on this one. If you think that atol=1e-2 is fine, then let's do that. Maybe have a special case for auc in the test with a FIXME to show that the higher tolerance is only needed for roc_auc scores?

Revert "try another way"
This reverts commit 6b2cf79.

qinhanmin2014 (Member) commented Sep 19, 2017

@jnothman So it seems the reason for this problem is clear (np.lexsort and np.trapz in the auc function), and I have come back to the original solution you proposed. Do you think we need a separate test for roc_auc_score here, as proposed by @lesteve?
BTW, @jnothman @lesteve, I'm suddenly wondering why we need to sort in auc. It seems we already ensure that x (fpr) and y (tpr) are non-decreasing here? If so, this could be solved easily by setting the reorder parameter to False in auc?

lesteve (Member) commented Sep 19, 2017

Do you think we need a seperate test for roc_auc_score here as proposed by @lesteve?

Just for clarity I was not advocating a completely separate test but more a special case, i.e. something like this:

# FIXME: roc_auc scores are more unstable than other scores
kwargs = {'atol': 1e-2} if 'roc_auc' in name else {}

for scaling in [2, 0.3]:
    np.testing.assert_allclose(
        weighted_score,
        metric(y1, y2, sample_weight=sample_weight * scaling),
        err_msg=("%s sample_weight is not invariant "
                 "under scaling" % name),
        **kwargs)

@jnothman jnothman changed the title from [MRG+1] TST Move roc_auc_score from METRIC_UNDEFINED_BINARY to METRIC_UNDEFINED_MULTICLASS to [MRG] TST Move roc_auc_score from METRIC_UNDEFINED_BINARY to METRIC_UNDEFINED_MULTICLASS Sep 19, 2017

jnothman (Member) commented Sep 19, 2017

Yeah, I'm taking back my lgtm. I don't think we're testing the right thing if we let atol=1e-2.

BTW, @jnothman @lesteve I'm suddenly wondering why we need to sort in auc. Seems that we already ensure that x(fpr) and y(tpr) are not decreasing here? If so, this can be solved easily by setting parameter reorder to False in auc?

Indeed, this seems to pass tests in sklearn/metrics... if we trust the coverage. If we confirm that reordering should not be necessary, then this may be a good solution.

@jnothman

I would also much rather modify y_score than keeping a high tolerance.

qinhanmin2014 (Member) commented Sep 19, 2017

@jnothman
I believe we can do so. See the following code in roc_auc_score

fpr, tpr, thresholds = roc_curve(y_true, y_score, sample_weight=sample_weight)
return auc(fpr, tpr, reorder=True)

The doc of roc_curve clearly states that

Returns

fpr : array, shape = [>2]
        Increasing false positive rates...
tpr : array, shape = [>2]
        Increasing true positive rates...

The problem is the following code in auc:

dx = np.diff(x)
if np.any(dx < 0):
    if np.all(dx <= 0):
        direction = -1
    else:
        raise ValueError...

Do you have an acceptable way to avoid raising an error here for something like [1, 1+1e-10, 1]?
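
Purely as an illustration of the kind of tolerance that could be used (this is a hypothetical sketch, not part of this PR), a direction check that tolerates tiny negative steps might look like this:

import numpy as np


def auc_direction(x, atol=1e-10):
    # Hypothetical sketch: tolerate tiny negative steps caused by
    # floating-point error when deciding the integration direction.
    dx = np.diff(x)
    if np.all(dx >= -atol):
        return 1      # treat x as increasing
    if np.all(dx <= atol):
        return -1     # treat x as decreasing
    raise ValueError("x is neither increasing nor decreasing: {}".format(x))


print(auc_direction(np.array([0.0, 1.0, 1.0 + 1e-12, 1.0])))  # 1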

qinhanmin2014 (Member) commented Sep 19, 2017

I would also much rather modify y_score than keeping a high tolerance.

But it seems that brier_score_loss currently prevents us from doing so.

jnothman (Member) commented Sep 19, 2017

qinhanmin2014 (Member) commented Sep 19, 2017

ping @jnothman @lesteve: I have combined both of your suggestions and come up with this solution, which does not affect the other tests.
I have to admit that sometimes the problem seems so scary.

import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.utils import check_random_state

n_samples = 50
random_state = check_random_state(0)
y_true = random_state.randint(0, 2, size=(n_samples, ))
y_pred = random_state.randint(0, 2, size=(n_samples, ))
y_score = random_state.random_sample(size=(n_samples,))
y_score = np.round(y_score, 1)
rng = np.random.RandomState(0)
sample_weight = rng.randint(1, 10, size=len(y_true))
roc_auc_score(y_true, y_score, sample_weight=sample_weight*2)
# 0.39089576113036523
roc_auc_score(y_true, y_score, sample_weight=sample_weight*0.3)
# 0.41558917621967495

I've not understood what the problem is in that auc code if tpr and fpr are increasing

This means that we needn't sort them again in auc, which will at least save some time.

qinhanmin2014 (Member) commented Sep 20, 2017

ping @jnothman for the previous comment
It seems we can get a stable result and also save some computation time if we skip the unnecessary sort in auc; this is because of how np.trapz computes the area. WDYT? Shall I give it a try here? Thanks.
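
To illustrate why the sort order matters for np.trapz: it integrates the points in the order given, so a segment that goes backwards in x contributes negative area. The arrays below are made up purely for the example:

import numpy as np

y = np.array([1.0, 1.0, 1.0])

x_sorted = np.array([0.0, 1.0, 2.0])
x_backtrack = np.array([0.0, 2.0, 1.0])  # same values, different order

print(np.trapz(y, x_sorted))     # 2.0
print(np.trapz(y, x_backtrack))  # 1.0 (the 2.0 -> 1.0 segment subtracts area)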

qinhanmin2014 (Member) commented Sep 20, 2017

@lesteve Thanks a lot for the detailed instruction. It seems you have found the main reason for the problem. Now it seems we can even pass the test with auc(fpr, tpr, reorder=True).
So which do you prefer: (1) directly return np.trapz(tpr, fpr), or (2) return auc(fpr, tpr, reorder=True)?
The first makes the calculation faster; the second makes the change smaller.

lesteve (Member) commented Sep 20, 2017

I would prefer (3) auc(fpr, tpr). We do not need to reorder since we know we have increasing fpr (by construction). Then I would not touch the auc function to minimize controversial changes in this PR.

A few tests to make sure that we always get increasing fpr and tpr in _binary_clf_curve and roc_curve would be needed too. If there are some already, it looks like they were not stringent enough ...
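
In code, option (3) would amount to something like the sketch below, based on the roc_auc_score snippet quoted earlier in this thread (the helper name is made up for the example):

from sklearn.metrics import auc, roc_curve


def binary_roc_auc(y_true, y_score, sample_weight=None):
    # Sketch of option (3): rely on roc_curve returning increasing fpr,
    # so auc does not need to reorder.
    fpr, tpr, _ = roc_curve(y_true, y_score, sample_weight=sample_weight)
    return auc(fpr, tpr)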

qinhanmin2014 (Member) commented Sep 20, 2017

@lesteve Thanks. Comments addressed.

qinhanmin2014 (Member) commented Sep 21, 2017

@jnothman We have changed our approach by rewriting a statement here to ensure that fpr (fps) and tpr (tps) are increasing. We also added a regression test. Please double-check for your +1. Thanks.
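
Roughly, the rewritten statement computes fps by direct accumulation instead of subtraction. The sketch below uses plain np.cumsum and made-up values to show the difference (the real code uses stable_cumsum and threshold indexing, which are omitted here):

import numpy as np

y_true = np.array([0, 0, 1, 1, 1])
weight = np.repeat(0.2, 5)

tps = np.cumsum(y_true * weight)
fps_subtraction = np.cumsum(weight) - tps             # old: subtraction
fps_accumulation = np.cumsum((1 - y_true) * weight)   # new: direct accumulation

print(np.diff(fps_subtraction))   # can contain tiny negative values
print(np.diff(fps_accumulation))  # non-negative by construction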

qinhanmin2014 (Member) commented Sep 23, 2017

ping @jnothman @lesteve for further review. Thanks :)

jnothman (Member) commented Sep 25, 2017

@jnothman

Apart from nitpicks, LGTM

sklearn/metrics/tests/test_ranking.py:

@@ -371,6 +371,20 @@ def test_roc_curve_drop_intermediate():
[1.0, 0.9, 0.7, 0.6, 0.])
def test_roc_curve_fpr_tpr_increasing():

jnothman (Member) commented Sep 26, 2017

I feel like the fact that elements are sorted for one random sample isn't a very strong assurance. There are edge cases that could be further tested (such as having repeated thresholds), too, but I'm not sure what reasonable edge cases for this test are.

lesteve (Member) commented Sep 27, 2017

Basically, the edge cases are when the two definitions of fps below are not equal because of floating-point errors (the second line is the subtraction-based definition used on master, the third the direct accumulation):

tps = stable_cumsum(y_true * weight)[threshold_idxs]
fps = stable_cumsum(weight)[threshold_idxs] - tps
fps = stable_cumsum((1 - y_true) * weight)[threshold_idxs]

It is not obvious to me how to construct a simple failing example, but with a little thought there may be a way to put one together.

For full details, the best thing is to look at the definition of _binary_clf_curve, especially how the other variables are defined.

lesteve (Member) commented Sep 27, 2017

OK I found a simpler example:

def test_roc_curve_fpr_tpr_increasing():
    # Ensure that fpr and tpr returned by roc_curve are increasing
    # Construct an edge case with float y_score and sample_weight
    # when some adjacent values of fpr and tpr are the same.
    y_true = [0, 0, 1, 1, 1]
    y_score = [0.1, 0.7, 0.3, 0.4, 0.5]
    sample_weight = np.repeat(0.2, 5)
    fpr, tpr, _ = roc_curve(y_true, y_score,
                            sample_weight=sample_weight)
    assert_equal((np.diff(fpr) < 0).sum(), 0)
    assert_equal((np.diff(tpr) < 0).sum(), 0)

Are you happier with this one @jnothman?

lesteve (Member) commented Sep 27, 2017

(Pdb) np.diff(fpr)
array([  5.00000000e-01,   0.00000000e+00,   2.22044605e-16,
        -3.33066907e-16,   5.00000000e-01])

qinhanmin2014 (Member) commented Sep 26, 2017

@jnothman Thanks for your precious time :) Comments addressed.

Do you really think this belongs here? Do we believe this will change user results often enough to caution them here? This makes it seem like their ROC scores will have suddenly changed... I think "bug fix" is more appropriate in any case.

After consideration, I removed the statement about roc_curve (the change in the result is really small, most of the time much less than 1e-7) and moved the statement about roc_auc_score to the bug fix section.

It's not really a regression test

I removed the comment since I now think it is unnecessary. But from my perspective, this is a regression test (it fails on master): in master we cannot ensure that fpr and tpr are increasing, because fps is computed by subtraction instead of accumulation. Though the error is really small, it is actually the core cause of the issue (it leads to the wrong sort order and the wrong roc_auc_score).

I feel like the fact that elements are sorted for one random sample isn't a very strong assurance. There are edge cases that could be further tested (such as having repeated thresholds), too, but I'm not sure what reasonable edge cases for this test are.

The test is constructed based on @lesteve's suggestion. I think it hits the edge case (float y_score and sample_weight where some adjacent values of fpr and tpr are the same). Now, for adjacent equal values, we no longer get a non-increasing sequence (e.g., [a+1e-10, a-1e-10]). I have added a comment.

WDYT? Thanks :)

jnothman (Member) commented Sep 26, 2017

Will merge when @lesteve approves. Might even be reasonable to include in 0.19.1

@jnothman jnothman added this to the 0.19.1 milestone Sep 26, 2017

lesteve and others added some commits Sep 27, 2017

qinhanmin2014 (Member) commented Sep 27, 2017

@lesteve Thanks for the review and the improvement. I have slightly updated the comment of the test.
It seems the test at least hits some of the edge cases (it fails on master). In these cases, y_score and sample_weight are floats and some adjacent values of fpr and tpr are actually the same. The small errors introduced by floating-point arithmetic seem unavoidable; in this PR, at least, we guarantee that fpr and tpr are increasing and avoid significant error in roc_auc_score. WDYT?

qinhanmin2014 (Member) commented Sep 27, 2017

@lesteve Thanks a lot for your help :) I totally agree with all your suggestions and have learnt a lot.

lesteve (Member) commented Sep 27, 2017

I think this is good to go, merging, thanks a lot!

Another piece of advice while I am at it: choose better names for your branches; you can surely do better than test-feature-3.

@lesteve lesteve merged commit 8fb648a into scikit-learn:master Sep 27, 2017

6 checks passed:

ci/circleci: Your tests passed on CircleCI!
codecov/patch: 100% of diff hit (target 96.17%)
codecov/project: Absolute coverage decreased by <0.01% but relative coverage increased by +3.82% compared to dc99f91
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
lgtm analysis: Python: No alert changes
lesteve (Member) commented Sep 27, 2017

Follow-up PRs, @qinhanmin2014, if you are up to it (I would rather have two separate PRs in this case):

  • with this change, the reorder option of the auc function is no longer used in our code. I think we should deprecate the reorder parameter and potentially (up for debate) allow some threshold in case there are small negative values in np.diff(tpr) or np.diff(fpr) (a possible deprecation sketch follows this list)
  • in this PR we spotted a place where check_consistent_length(X, y) was used where check_consistent_length(X, y, sample_weight) should have been called; it would be good to double-check that this error is not present in other places in our codebase.
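
For the first item, one conventional way to deprecate a keyword argument, sketched on a heavily simplified auc; this is only an illustration and the actual follow-up PR may look different:

import warnings

import numpy as np


def auc(x, y, reorder='deprecated'):
    # Simplified sketch: warn when callers still pass reorder explicitly.
    if reorder != 'deprecated':
        warnings.warn("The 'reorder' parameter of auc is deprecated and will "
                      "be removed in a future release.", DeprecationWarning)
        if reorder:
            order = np.lexsort((y, x))
            x, y = np.asarray(x)[order], np.asarray(y)[order]
    return np.trapz(y, x)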

jnothman (Member) commented Sep 27, 2017

qinhanmin2014 (Member) commented Sep 28, 2017

@lesteve @jnothman Thanks a lot :) I'll try taking these issues.

qinhanmin2014 (Member) commented Oct 4, 2017

@lesteve
I have opened #9870 to address your second concern, since it might need more discussion. Also, it seems that part of it depends on #9828. That PR already has a +1, so if you have time, please have a look at it. Thanks a lot :)

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

[MRG+1] Fix floating bug in roc_auc_score (#9786)
* ensure fpr and tpr are increasing in roc_curve with non integer sample weights
* add tests and move roc_auc_score from METRIC_UNDEFINED_BINARY to METRIC_UNDEFINED_MULTICLASS

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

[MRG+1] Fix floating bug in roc_auc_score (#9786)
* ensure fpr and tpr are increasing in roc_curve with non integer sample weights
* add tests and move roc_auc_score from METRIC_UNDEFINED_BINARY to METRIC_UNDEFINED_MULTICLASS

lesteve referenced this pull request Jul 16, 2018: regression AUC reorder? #11572 (closed)
