DataFrame.describe(percentiles=[]) still returns 50% percentile. #11866
Comments
I think the goal here is to return the median which I think is a useful statistic and the code comments here echo that. We can clear up the documentation if that would help. What are you trying to achieve? |
I was just trying to avoid computing any percentiles/median because that often involves sorting which could take some time depending on how many columns of data you are looking at. I suppose the 50%/median makes sense to have in describe as a default. Still I would think that passing an empty list would not compute even the 50%/median. |
Median does not involve sorting, as it's implemented using a skip list; in fact it's just O(n) (and it's in C). |
Yep, there's not much to be gained by dropping percentiles -- every summary operation is O(n). On the other hand, if you have actual big data, then you probably want to use approximate (sketch) algorithms for quantiles so you can do stream processing. But that's not really a problem for pandas... |
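To make the complexity argument above concrete, here is a minimal sketch of computing a median by selection rather than by a full sort. It uses quickselect, which is O(n) on average; this is a swapped-in technique for illustration only and is not claimed to be what pandas does internally. The names `quickselect` and `median` are hypothetical helpers, and NaN handling is deliberately left out.

```python
import random


def quickselect(values, k):
    """Return the k-th smallest element (0-based) without fully sorting."""
    items = list(values)
    while True:
        pivot = random.choice(items)
        # Partition into three groups around the pivot.
        lows = [x for x in items if x < pivot]
        pivots = [x for x in items if x == pivot]
        highs = [x for x in items if x > pivot]
        if k < len(lows):
            items = lows                      # answer lies among smaller values
        elif k < len(lows) + len(pivots):
            return pivot                      # answer is the pivot itself
        else:
            k -= len(lows) + len(pivots)      # answer lies among larger values
            items = highs


def median(values):
    """Median via selection; averages the two middle values for even n."""
    n = len(values)
    if n == 0:
        raise ValueError("median of an empty sequence")
    mid = n // 2
    if n % 2:
        return quickselect(values, mid)
    return (quickselect(values, mid - 1) + quickselect(values, mid)) / 2
```

Even with linear average cost, each selection makes several passes over the data, so a median can still be noticeably slower than a single-pass statistic such as the mean.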
Medians should be fast, but take a look at the performance difference I'm getting. Even if I hack up a quick 'describe' function with concat and transpose, it's quite a bit faster than df.describe(). When I remove the median, it's an additional 2x faster compared with computing the median. ❓ 😕
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(100000,1000), columns=['C{}'.format(i) for i in range(1000)])
%time a = df.describe(percentiles=[])
Wall time: 17.8 s
%time b = pd.concat([df.count(), df.mean(), df.std(), df.min(), df.median(), df.max()], axis=1).T
Wall time: 10.8 s
%time c = pd.concat([df.count(), df.mean(), df.std(), df.min(), df.max()], axis=1).T
Wall time: 4.94 s
np.array_equal(a,b)
True
pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 44 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.17.1
nose: 1.3.7
pip: 7.1.2
setuptools: 16.0
Cython: 0.23
numpy: 1.9.2
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 4.0.0
sphinx: None
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.1
numexpr: 2.4.3
matplotlib: 1.5.0
openpyxl: 2.2.5
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: None
lxml: 3.4.4
bs4: 4.4.0
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.8
pymysql: None
psycopg2: None
Jinja2: None |
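For anyone who only wants the non-percentile statistics, one possible workaround is to request them by name with `DataFrame.agg`. This is a sketch under the assumption that the five statistics below are all that is needed; the helper name `summary_without_percentiles` is hypothetical, and no claim is made here about its speed relative to the timings above.

```python
import numpy as np
import pandas as pd


def summary_without_percentiles(df):
    """Return count, mean, std, min and max per column, with no percentiles."""
    # Passing a list of function names yields one labeled row per statistic.
    return df.agg(["count", "mean", "std", "min", "max"])


rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.standard_normal((1000, 5)),
    columns=["C{}".format(i) for i in range(5)],
)
summary = summary_without_percentiles(df)
print(summary)
```

The result is indexed by statistic name, so it can be compared row by row against the matching rows of `df.describe()`.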
@dragoljub you realize this has nothing to do with |
That is a fair point... medians may still be O(n) in time and space, but they are indeed slower than calculating moments. In any case I agree that this is a bug. A fix would be appreciated! |
this is essentially the same issue as #11623 (the perf part) |
Yes block level computation would be great! 👍 The other point I'm making is: |
If the empty list always computes the 50th percentile, how about a documentation update indicating that this is expected behavior? |
I noticed this behavior changing on the latest nightly. I assume that's been an intentional change? I couldn't find anything in the changelog. |
#61158 might be relevant (cc @MartinBraquet). |
AFAICT, #60550 and this issue are near duplicates. The only difference is whether So I'll close this but LMK if I've misunderstood. |
@TomAugspurger That's correct. Starting with 3.0, the median is included by default only if percentiles is None. Any non-None value for percentiles will not return the median unless 0.5 is included in the list passed as percentiles. Thanks. |
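A quick way to see which behavior an installed pandas has is to check whether the `50%` row appears for each argument. This is a minimal sketch; the variable names are illustrative, and the result for the empty list is printed rather than assumed because, per the comments above, it depends on the pandas version.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 3)), columns=["C0", "C1", "C2"])

default_summary = df.describe()                     # percentiles=None
empty_summary = df.describe(percentiles=[])         # version-dependent result
explicit_summary = df.describe(percentiles=[0.5])   # median requested explicitly

print("pandas version:", pd.__version__)
print("50% row with percentiles=None: ", "50%" in default_summary.index)
print("50% row with percentiles=[]:   ", "50%" in empty_summary.index)
print("50% row with percentiles=[0.5]:", "50%" in explicit_summary.index)
```

Passing `percentiles=[0.5]` is the explicit way to ask for the median regardless of how the empty list is treated.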
The DataFrame.describe() docs seem to indicate that you can pass percentiles=None to skip computing percentiles; however, by default it still computes 25%, 50%, and 75%. The best I can do is pass an empty list, which computes only the 50% percentile. I would expect passing an empty list to return no percentile computations.
Should we allow passing an empty list to not compute any percentiles?
pandas 0.17.1