scoreatpercentile returns wrong value #972

Closed
isofer opened this Issue · 5 comments

4 participants

@isofer

scoreatpercentile of a Series returns the wrong value

In [1341]: a = np.random.rand(100)

In [1342]: b = pd.Series(a)

In [1343]: a[:10]

Out[1343]: 
array([ 0.6131142 ,  0.65266141,  0.24583156,  0.70179786,  0.33361506,
        0.65042728,  0.70192276,  0.02727854,  0.65948894,  0.44326182])

In [1348]: scoreatpercentile(a,1) 
Out[1348]: 0.010388922650144839 #correct value

In [1349]: scoreatpercentile(b,1) 
Out[1349]: 0.65226593993834392 #incorrect value

In [1350]: scoreatpercentile(a,2)
Out[1350]: 0.011971896338709577 #correct value

In [1351]: scoreatpercentile(b,2)
Out[1351]: 0.25396815348880808 #incorrect value



I'm not sure whether this is a pandas issue or a scipy issue. I am aware of the quantile method, but I still wonder whether this can be fixed.

@wesm
Owner

The problem is the semantics of integer indexes with pandas objects. Either pass b.values to scoreatpercentile or use b.quantile (for example, b.quantile(0.01) for the 1st percentile). I think scipy.stats should be calling np.asarray on the input; you could raise an issue with them about it.

http://pandas.pydata.org/pandas-docs/stable/indexing.html#advanced-indexing-with-integer-labels
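
A minimal sketch of the two workarounds suggested above. It uses np.percentile as a stand-in for scipy's scoreatpercentile so that the example depends only on numpy and pandas; that substitution, the fixed seed, and the variable names are assumptions made for illustration. Note that quantile takes a fraction in [0, 1], so the 1st percentile is 0.01.

```python
import numpy as np
import pandas as pd

# Fixed seed so the run is repeatable (an assumption; the original used np.random.rand).
rng = np.random.RandomState(0)
a = rng.rand(100)
b = pd.Series(a)

# Reference value computed on the plain ndarray.
from_array = np.percentile(a, 1)

# Workaround 1: hand the underlying ndarray to the array function.
from_values = np.percentile(b.values, 1)

# Workaround 2: use the Series method, which takes a fraction rather than a percent.
from_quantile = b.quantile(0.01)

print(from_array, from_values, from_quantile)
```

All three calls use linear interpolation by default, so they should agree up to floating-point rounding.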

@wesm wesm closed this
@lodagro
Collaborator

If you look at the scoreatpercentile code, the issue can be spotted quickly.
The input is sorted and then indexed. Since b has an integer index, the issue here is label versus positional indexing on the Series.

def scoreatpercentile(a, per, limit=()):
    values = np.sort(a,axis=0)
    if limit:
        values = values[(limit[0] <= values) & (values <= limit[1])]

    idx = per /100. * (values.shape[0] - 1)
    if (idx % 1 == 0):
        return values[idx]
    else:
        return _interpolate(values[int(idx)], values[int(idx) + 1], idx % 1)
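
The label-versus-position mismatch can be sketched with a tiny Series. This is an illustration only: it uses the explicit .loc and .iloc accessors and sort_values from current pandas rather than the plain values[idx] lookup in the code above, and the three sample values are made up.

```python
import pandas as pd

# Default integer index: labels 0, 1, 2.
s = pd.Series([0.6, 0.1, 0.9])

# Sorting reorders the rows but each value keeps its original label,
# so the index becomes [1, 0, 2].
sorted_s = s.sort_values()

# Label lookup: the row whose label is 0, i.e. the original first element.
by_label = sorted_s.loc[0]

# Positional lookup: the first row after sorting, i.e. the smallest value.
by_position = sorted_s.iloc[0]

print(by_label, by_position)
```

A percentile routine needs the positional lookup; a label lookup on a sorted Series with an integer index returns elements in their original order instead.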

If you give b a non-integer index, the issue does not show up.

In [54]: b.index = pandas.util.testing.makeStringIndex(100)

In [55]: stats.scoreatpercentile(b, 1)
Out[55]: 0.063875501677037982

In [56]: stats.scoreatpercentile(a, 1)
Out[56]: 0.063875501677037982

@isofer

Thanks.
I'll raise the issue with scipy.

@jseabold
Collaborator

This is another instance of Series not quite being array-like. scoreatpercentile can't call asarray because it has to deal with array-like matrices and masked arrays. Maybe that will change in the future if these go away (matrix likely won't). Just thinking out loud, but it might be worth considering whether you really want to preserve the new sorted index for the default integer index in a Series/DataFrame. Then again, it might not.
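
A small sketch of why a blanket asarray call is awkward for masked arrays: np.asarray returns the underlying data as a plain ndarray and drops the mask, while np.asanyarray passes ndarray subclasses through unchanged. The sample values are made up for illustration.

```python
import numpy as np

# The middle element is masked out, e.g. a sentinel for missing data.
m = np.ma.masked_array([1.0, 99.0, 3.0], mask=[False, True, False])

# asarray converts to a base ndarray: the mask is lost and 99.0 reappears.
plain = np.asarray(m)

# asanyarray keeps the MaskedArray subclass, mask included.
kept = np.asanyarray(m)

print(type(plain).__name__, plain.mean())
print(type(kept).__name__, kept.mean())
```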

@isofer

A scipy developer's reply to this issue:
http://projects.scipy.org/scipy/ticket/1634
