
DataFrame.describe(percentiles=[]) still returns 50% percentile. #11866

Closed
dragoljub opened this issue Dec 18, 2015 · 14 comments
Labels
Docs, Numeric Operations (Arithmetic, Comparison, and Logical operations)

Comments

@dragoljub

The DataFrame.describe() method docs seem to indicate that you can pass percentiles=None to not compute any percentiles; however, by default it still computes the 25%, 50%, and 75% percentiles. The closest I can get is passing an empty list, which still computes the 50% percentile. I would expect passing an empty list to return no percentile computations at all.

Should we allow passing an empty list to not compute any percentiles?

pandas 0.17.1

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df = pd.DataFrame(np.random.randn(10,6))

In [4]: df.describe(percentiles=None)
Out[4]:
               0          1          2          3          4          5  
count  10.000000  10.000000  10.000000  10.000000  10.000000  10.000000
mean   -0.116736  -0.160728   0.066763  -0.068867  -0.242050   0.390091
std     0.771704   0.837520   0.875747   0.955985   1.093919   0.923464
min    -1.347786  -1.140541  -1.297533  -1.347824  -2.085290  -0.825807
25%    -0.580527  -0.613640  -0.558291  -0.538433  -0.836046  -0.275567
50%    -0.261526  -0.395307   0.007595  -0.248025   0.000515   0.314278
75%     0.329780   0.154053   0.708768   0.407732   0.366278   1.192338
max     1.285276   1.649528   1.485076   1.697162   1.551388   1.762939

In [15]: df.describe(percentiles=[])
Out[15]:
               0          1          2          3          4          5  
count  10.000000  10.000000  10.000000  10.000000  10.000000  10.000000
mean   -0.116736  -0.160728   0.066763  -0.068867  -0.242050   0.390091
std     0.771704   0.837520   0.875747   0.955985   1.093919   0.923464
min    -1.347786  -1.140541  -1.297533  -1.347824  -2.085290  -0.825807
50%    -0.261526  -0.395307   0.007595  -0.248025   0.000515   0.314278
max     1.285276   1.649528   1.485076   1.697162   1.551388   1.762939
@rockg
Contributor

rockg commented Dec 18, 2015

I think the goal here is to return the median, which is a useful statistic, and the code comments here echo that. We can clear up the documentation if that would help. What are you trying to achieve?

@dragoljub
Author

I was just trying to avoid computing any percentiles/median, because that often involves sorting, which can take some time depending on how many columns of data you are looking at. I suppose having the 50%/median in describe by default makes sense. Still, I would expect that passing an empty list would skip even the 50%/median.

@jreback
Contributor

jreback commented Dec 18, 2015

median does not involve sorting, as it's implemented using a skip list

in fact it's just O(n) (and it's in C)
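Pandas' internals aside, a selection-based median does avoid a full sort. A minimal NumPy sketch (my own illustration, not pandas code) using np.partition, which places the k-th smallest element at index k in O(n) expected time:

```python
import numpy as np

def median_via_partition(a):
    """Median by partial selection (np.partition), not a full O(n log n) sort."""
    a = np.asarray(a, dtype=float)
    n = a.size
    mid = n // 2
    if n % 2:  # odd length: single middle element
        return np.partition(a, mid)[mid]
    # even length: average of the two middle elements
    part = np.partition(a, [mid - 1, mid])
    return 0.5 * (part[mid - 1] + part[mid])
```

For example, `median_via_partition([3, 1, 2])` returns 2.0 without ever fully sorting the input.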

@shoyer
Member

shoyer commented Dec 19, 2015

Yep, there's not much to be gained by dropping percentiles -- every summary operation is O(n).

On the other hand, if you have actual big data, then you probably want to use approximate (sketch) algorithms for quantiles so you can do stream processing. But that's not really a problem for pandas...
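For illustration only (real quantile sketches such as t-digest or Greenwald-Khanna provide error guarantees and bounded memory), the crudest streaming approximation is to reservoir-sample the stream and take the exact quantile of the sample:

```python
import numpy as np

def approx_quantile(stream, q, sample_size=1000, seed=0):
    """Crude approximate streaming quantile via reservoir sampling.

    Keeps a fixed-size uniform random sample of the stream (Algorithm R),
    then computes the exact quantile of that sample. Only an illustration;
    real sketch algorithms give provable error bounds.
    """
    rng = np.random.default_rng(seed)
    reservoir = []
    for i, x in enumerate(stream):
        if i < sample_size:
            reservoir.append(x)
        else:
            j = rng.integers(0, i + 1)  # replace a slot with prob sample_size/(i+1)
            if j < sample_size:
                reservoir[j] = x
    return float(np.quantile(reservoir, q))
```

On a long uniform stream, the estimated median lands close to the true 0.5 while holding only `sample_size` values in memory.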

@dragoljub
Author

Medians should be fast, but take a look at the performance difference I'm getting. Even if I hack up a quick 'describe' function with concat and transpose, it's quite a bit faster than df.describe(). When I remove the median, it's an additional 2x faster compared with computing the median. ❓ 😕

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(100000,1000), columns=['C{}'.format(i) for i in range(1000)])

%time a = df.describe(percentiles=[])
    Wall time: 17.8 s

%time b = pd.concat([df.count(), df.mean(), df.std(), df.min(), df.median(), df.max()], axis=1).T
    Wall time: 10.8 s

%time c = pd.concat([df.count(), df.mean(), df.std(), df.min(), df.max()], axis=1).T
    Wall time: 4.94 s

np.array_equal(a,b)
    True

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 44 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.1
nose: 1.3.7
pip: 7.1.2
setuptools: 16.0
Cython: 0.23
numpy: 1.9.2
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 4.0.0
sphinx: None
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.1
numexpr: 2.4.3
matplotlib: 1.5.0
openpyxl: 2.2.5
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: None
lxml: 3.4.4
bs4: 4.4.0
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.8
pymysql: None
psycopg2: None
Jinja2: None

@jreback
Contributor

jreback commented Dec 19, 2015

@dragoljub you realize this has nothing to do with median per se and much more to do with a column-by-column application of functions. .describe is essentially a fancy .apply. Note that it could be implemented to do this on blocks, and it would be much faster.
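A rough sketch of what block-wise computation could look like for an all-float frame (my own illustration, not how pandas implements .describe): each statistic is a single vectorized NumPy reduction over every column at once, rather than a Python-level loop over columns.

```python
import numpy as np

def describe_blockwise(a):
    """Describe-style stats on a 2D float array, block-wise.

    Each statistic is one vectorized reduction along axis=0, covering
    all columns in a single call, instead of one call per column.
    """
    return {
        "count": np.sum(~np.isnan(a), axis=0).astype(float),
        "mean": np.nanmean(a, axis=0),
        "std": np.nanstd(a, axis=0, ddof=1),  # ddof=1 matches pandas' sample std
        "min": np.nanmin(a, axis=0),
        "50%": np.nanmedian(a, axis=0),
        "max": np.nanmax(a, axis=0),
    }
```

Wrapping the result back into a DataFrame indexed by the statistic names is then straightforward.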

@shoyer
Member

shoyer commented Dec 19, 2015

That is a fair point... medians may still be O(n) in time and space, but they are indeed slower than calculating moments.

In any case I agree that this is a bug. A fix would be appreciated!

@jreback
Contributor

jreback commented Dec 19, 2015

this is essentially the same issue as #11623 (the perf part)

@jreback jreback added Performance Memory or execution speed performance API Design Numeric Operations Arithmetic, Comparison, and Logical operations Difficulty Intermediate labels Dec 19, 2015
@jreback jreback added this to the Next Major Release milestone Dec 19, 2015
@dragoljub
Author

Yes block level computation would be great! 👍

The other point I'm making is:
Should we have an escape hatch in df.describe() for users that don't want to compute medians for 1000's of columns? Even with block level computation the median computation takes several times longer than all the other statistics combined. 🐢
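In the meantime, a hypothetical escape-hatch helper (not a pandas API) can reproduce the non-percentile half of describe() from the cheap O(n) reductions timed earlier in the thread:

```python
import pandas as pd

def describe_no_percentiles(df):
    """Summary stats without any percentile (including the median).

    Builds the frame from cheap columnwise reductions only, skipping
    the comparatively expensive median/quantile computation entirely.
    """
    out = pd.concat(
        [df.count(), df.mean(), df.std(), df.min(), df.max()], axis=1
    ).T
    out.index = ["count", "mean", "std", "min", "max"]
    return out
```

Usage: `describe_no_percentiles(df)` returns a frame shaped like `df.describe()` minus every percentile row.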

@RhysU

RhysU commented Feb 25, 2019

If the empty list always computes the 50th percentile, how about a documentation update indicating this is expected behavior.

@mroeschke mroeschke added Docs and removed API Design Performance Memory or execution speed performance labels Apr 21, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@fjetter
Member

fjetter commented Mar 24, 2025

I noticed this behavior changing on the latest nightly. I assume that was an intentional change? I couldn't find anything in the changelog.

@TomAugspurger
Contributor

#61158 might be relevant (cc @MartinBraquet).

@TomAugspurger
Contributor

AFAICT, #60550 and this issue are near duplicates. The only difference is whether percentiles=[] and percentiles=[0.25] should both mean "don't include the median". Reading through #60550, the new behavior seems to be that the median is only included if passed explicitly or with the default percentiles=None.

So I'll close this but LMK if I've misunderstood.

@MartinBraquet
Contributor

MartinBraquet commented Mar 24, 2025

@TomAugspurger That's correct. Starting with 3.0, the median will only be included by default if percentiles is None. Any non-None value for percentiles will not return the median unless it's included in the list passed as percentiles. Thanks.
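The rule described here can be sketched as a tiny resolution helper (hypothetical, not pandas' actual implementation):

```python
def resolve_percentiles(percentiles):
    """Sketch of the pandas >= 3.0 percentile-resolution rule.

    - None           -> the default [0.25, 0.5, 0.75]
    - any list given -> used exactly as passed; the median (0.5) appears
      only if the caller put it in the list
    """
    if percentiles is None:
        return [0.25, 0.5, 0.75]
    return sorted(set(float(p) for p in percentiles))
```

Under this rule, `resolve_percentiles([])` yields an empty list, matching the behavior originally requested in this issue.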


10 participants