
roc_auc_score computation is wrong for large samples #6842

Closed
arogozhnikov opened this issue May 29, 2016 · 15 comments
arogozhnikov commented May 29, 2016

Hi, we've recently found a strange inconsistency in the behavior of roc_auc_score. After long digging, @tata-antares found out that it is computed wrongly for large data samples if the weights passed are float32.

Minimal example

from sklearn.metrics import roc_auc_score
import numpy
numpy.random.seed(42)

n_samples = 4 * 10 ** 7
y = numpy.random.randint(2, size=n_samples)
prediction = numpy.random.normal(size=n_samples) + y * 0.01
trivial_weight = numpy.ones(n_samples)

roc_auc_score(y, prediction)
# prints 0.50273924526046887

roc_auc_score(y, prediction, sample_weight=trivial_weight.astype('float32'))
# prints nan and a spurious warning:
/moosefs/miniconda/envs/ipython_py2/lib/python2.7/site-packages/sklearn/metrics/ranking.py:530: UndefinedMetricWarning: No negative samples in y_true, false positive value should be meaningless
  UndefinedMetricWarning)

roc_auc_score(y, prediction, sample_weight=trivial_weight.astype('float64'))
#prints 0.50273924526046887

The value with dtype=float64 is correct. In the case we originally investigated, passing float32 weights returned an ordinary-looking number that was very different from the correct one (e.g. 0.58 instead of 0.52), so it was only by chance that we detected the problem.

sklearn == '0.17.1'

nelson-liu commented May 29, 2016

Maybe related to issues #6711 and #3864? Never mind, it's not.

arogozhnikov (Author) commented:
#3864 is due to clipping; this may be related to #6711, but there seem to be no weights involved there.

jnothman commented May 31, 2016

I can confirm getting "UndefinedMetricWarning: No negative samples in y_true, false positive value should be meaningless" but I'm not able to replicate the nan. Is there a reason you did not report the rest of your platform details?

jnothman (Member) commented:
Scratch that. I do get nan. Just wasn't reading output correctly.

jnothman (Member) commented:
This is what it comes down to:

In [25]: np.cumsum(trivial_weight.astype('float32'))[-1]
Out[25]: 16777216.0

In [26]: np.cumsum(trivial_weight.astype('float64'))[-1]
Out[26]: 40000000.0

In [27]: np.sum(trivial_weight.astype('float32'))
Out[27]: 40000000.0

In [28]: np.sum(trivial_weight.astype('float32')).dtype
Out[28]: dtype('float32')

Cumsum is losing precision, even when a float32 sum does not. This might be a numpy bug. If so, it lies in np.add.accumulate:

In [10]: np.add.accumulate(trivial_weight.astype('float32'))[-1]
Out[10]: 16777216.0
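The saturation value 16777216 is exactly 2**24: float32 has a 24-bit significand, so once the running total reaches 2**24, adding 1.0 rounds back to the same number. A minimal demonstration (not from the thread; a small n stands in for the 4e7 samples above):

```python
import numpy as np

# float32 has a 24-bit significand, so 2**24 + 1 is not representable;
# adding 1.0 to 2**24 rounds back down to 2**24.
cap = np.float32(2 ** 24)           # 16777216.0
assert cap + np.float32(1.0) == cap

# That is exactly where a float32 running sum of ones gets stuck:
w = np.ones(2 ** 24 + 10, dtype=np.float32)
print(np.add.accumulate(w)[-1])     # 16777216.0, not 16777226.0
```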

We can solve this here by always passing dtype='float64' to cumsum, or by doing so only when the number of samples is large. There might also be a way to perform the cumsum in batches so as to reduce the issue. In any case, we can detect the error by doing the cumsum and checking whether its final value equals the sum.

jnothman commented May 31, 2016

@arogozhnikov, please let me know if you report a bug to numpy. Feel free to submit a PR here.

and @nelson-liu, this is entirely independent of those other bugs.

@jnothman jnothman added the Bug label May 31, 2016
nelson-liu (Contributor) commented:
@jnothman ah ok, sorry I didn't look super closely into it; I just remembered seeing those other two threads and xref'd them here. thanks for letting me know.

arogozhnikov commented Jun 3, 2016

@jnothman
it's not a bug, it's expected behavior, since this code

x = trivial_weight.astype('float32')
for i in range(len(x) - 1):
    x[i + 1] += x[i]

gives the same (wrong) result.

np.sum is a different story: one can change the order of summation there in any possible way.
As for np.cumsum, fixing this would roughly double the number of computations (even in an optimal implementation), so I don't think the numpy team will do it (at least not until they need multithreading for np.cumsum).
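To illustrate the distinction being drawn: numpy's sum is free to reorder the additions (it uses pairwise summation), so on this input its float32 partial sums stay exactly representable, while a strict left-to-right accumulation saturates. A small sketch, not from the thread, with a reduced sample size:

```python
import numpy as np

n = 2 ** 24 + 10
w = np.ones(n, dtype=np.float32)

# Sequential accumulation (what cumsum must produce) saturates at 2**24:
print(np.add.accumulate(w)[-1])   # 16777216.0

# np.sum combines partial sums in a tree (pairwise summation), so every
# intermediate value here stays an exactly representable float32:
print(np.sum(w))                  # 16777226.0
```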

jnothman commented Jun 4, 2016

I don't find your claim to be true:


In [9]: x = trivial_weight.copy()

In [10]: for i in range(len(x) - 1):
   ....:         x[i + 1] += x[i]
   ....:

In [11]: x[-1]
Out[11]: 40000000.0

jnothman commented Jun 4, 2016

whoops forgot astype. stupid.

jnothman commented Jun 4, 2016

you're right.... though that's really quite nasty.

@amueller amueller added this to the 0.18 milestone Aug 15, 2016
jnothman (Member) commented:
Would we be happy with a fix that performs cumsum in float64?

amueller (Member) commented:
+1 for cumsum in float64.

arogozhnikov commented Aug 31, 2016

Would we be happy with a fix that performs cumsum in float64?

Yes, sounds good enough.

jnothman (Member) commented:
Is there a way to test this? Would we be better off with a (somewhat costly) helper to ensure stability even in the float64 case? Like:

def stable_cumsum(arr):
    out = np.cumsum(arr, dtype=np.float64)
    expected = np.sum(arr, dtype=np.float64)
    if not np.isclose(out[-1], expected):
        raise RuntimeError('cumsum was found to be unstable: '
                           'its last element does not correspond to sum')
    return out
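A self-contained check of a helper along these lines (redefined here so the snippet runs on its own, with a reduced sample size; this is a sketch, not necessarily the final scikit-learn implementation):

```python
import numpy as np

def stable_cumsum(arr):
    # Accumulate in float64 regardless of the input dtype, then verify
    # the final value against an independently computed sum.
    out = np.cumsum(arr, dtype=np.float64)
    expected = np.sum(arr, dtype=np.float64)
    if not np.isclose(out[-1], expected):
        raise RuntimeError('cumsum was found to be unstable: '
                           'its last element does not correspond to sum')
    return out

w = np.ones(2 ** 24 + 10, dtype=np.float32)
print(np.cumsum(w)[-1])        # 16777216.0: the float32 accumulator saturates
print(stable_cumsum(w)[-1])    # 16777226.0: correct total
```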
