
ENH: more flexible describe() + tests #8164

Closed
wants to merge 1 commit

Conversation

@bthyreau (Contributor) commented Sep 3, 2014

This patch adds a return_type keyword argument to describe() to make it more
flexible on mixed-type DataFrames. The user can now choose to return the numeric
summary, the categorical summary, or both, as well as 'auto' (the previous behaviour,
and the default) and 'same', which keeps the columns identical to the original (useful, e.g. with groupby()).

@jreback (Contributor) commented Sep 3, 2014

pls show examples of the use case for this

@bthyreau (Contributor Author) commented Sep 3, 2014

Rationale

When working with a DataFrame of mixed types, i.e. containing numeric, string, categorical, etc. columns, the current behaviour of describe() is a bit rough: it summarizes only the numerical columns, or, if there are none, only the categorical columns. With this change, describe() becomes more flexible in its return form, which considerably smoothed my interactive data-analysis sessions.

From the doc

  return_type : string, default 'auto'
       The result format. Most relevant for mixed-types dataframes.
       One of 'numeric_only', 'categorical_only', 'both', 'same', 'auto'
       'auto' returns the 'numeric_only' description, except on fully
       categorical frames, where it returns 'categorical_only'.
       'both' returns the pair, 'same' returns a frame with same columns
       as original dataframe

Example

Although a real-life scenario would be more convincing, here is a small example:

In [1]: import numpy as np
In [2]: import pandas as pd

In [3]: df = pd.DataFrame({'colA': ['foo', 'foo', 'bar'] * 10,
   ...:                         'colB': ['a', 'b', 'c', 'd', 'e'] * 6,
   ...:                         'colC': np.arange(30), 'colD' : np.ones(30)})

In [4]: df.head()
Out[4]: 
  colA colB  colC  colD
0  foo    a     0     1
1  foo    b     1     1
2  bar    c     2     1
3  foo    d     3     1
4  foo    e     4     1

# the old behaviour picks columns based on the dtypes in the dataframe. Not so nice.
In [6]: df.describe()
Out[6]: 
            colC  colD
count  30.000000    30
mean   14.500000     1
std     8.803408     0
min     0.000000     1
25%     7.250000     1
50%    14.500000     1
75%    21.750000     1
max    29.000000     1

# using the new option, we can explicitly ask to describe each type
In [8]: df.describe(return_type="categorical_only")
Out[8]: 
       colA colB
count    30   30
unique    2    5
top     foo    d
freq     20    6

In [9]: df.describe(return_type="numeric_only")
Out[9]: 
            colC  colD
count  30.000000    30
mean   14.500000     1
std     8.803408     0
min     0.000000     1
25%     7.250000     1
50%    14.500000     1
75%    21.750000     1
max    29.000000     1

# using option "same" returns a df with similar-columns
In [11]: df.describe(return_type="same") 
Out[11]: 
       colA colB      colC colD
count    30   30        30   30
unique    2    5       NaN  NaN
top     foo    d       NaN  NaN
freq     20    6       NaN  NaN
mean    NaN  NaN      14.5    1
std     NaN  NaN  8.803408    0
min     NaN  NaN         0    1
25%     NaN  NaN      7.25    1
50%     NaN  NaN      14.5    1
75%     NaN  NaN     21.75    1
max     NaN  NaN        29    1

# one of my favorite patterns, using groupby:
In [13]: out = df.groupby("colA").describe(return_type="same")

In [14]: out.unstack(0)
       colB           colC           colD     
colA    bar  foo       bar       foo  bar  foo
count    10   20        10        20   10   20
unique    5    5       NaN       NaN  NaN  NaN
top       d    d       NaN       NaN  NaN  NaN
freq      2    4       NaN       NaN  NaN  NaN
mean    NaN  NaN      15.5        14    1    1
std     NaN  NaN  9.082951  8.855566    0    0
min     NaN  NaN         2         0    1    1
25%     NaN  NaN      8.75      6.75    1    1
50%     NaN  NaN      15.5        14    1    1
75%     NaN  NaN     22.25     21.25    1    1
max     NaN  NaN        29        28    1    1

@jreback (Contributor) commented Sep 3, 2014

Is there a reason you think that the above approach is better than:

In [8]: df.select_dtypes(['object']).describe()
Out[8]: 
       colA colB
count    30   30
unique    2    5
top     foo    d
freq     20    6

In [9]: df.select_dtypes(['number']).describe()
Out[9]: 
            colC  colD
count  30.000000    30
mean   14.500000     1
std     8.803408     0
min     0.000000     1
25%     7.250000     1
50%    14.500000     1
75%    21.750000     1
max    29.000000     1

(possibly adding include/exclude to .describe, which then does a select_dtypes internally, might be ok though)

@bthyreau (Contributor Author) commented Sep 4, 2014

  • The current behaviour of describe() is unintuitive (unless the DataFrame has a homogeneous type): it returns a subset of the columns, and that subset depends on which other columns are present.
  • describe(), as an easy, short, concise way to get an overview of a df's values, is especially valuable during interactive sessions. A manual approach, or "select_dtypes" pre-filtering, can of course produce the same results, but it defeats some of the practicality of describe().
  • Getting the result of describe() with the same columns as the original df is quite handy, e.g. you can re-use the same column-subset list. For example, compare:
In [60]: df.dtypes
Out[60]: 
colA     object
colB     object
colC      int32
colD    float64
dtype: object


In [56]: model_col = ["colA","colB"]

In [57]: df.loc[:,model_col].describe().loc[:,model_col]
Out[57]: 
       colA colB
count    30   30
unique    2    5
top     foo    d
freq     20    6

In [58]: model_col = ["colA","colB","colC"]

In [59]: df.loc[:,model_col].describe().loc[:,model_col]
Out[59]: 
       colA  colB       colC
count   NaN   NaN  30.000000
mean    NaN   NaN  14.500000
std     NaN   NaN   8.803408
min     NaN   NaN   0.000000
25%     NaN   NaN   7.250000
50%     NaN   NaN  14.500000
75%     NaN   NaN  21.750000
max     NaN   NaN  29.000000

Here we have lost the count, unique, top, etc. of colA and colB as soon as we introduced colC into the model.

However, it is no longer lost when using return_type="same":

In [61]: df.loc[:,model_col].describe(return_type="same").loc[:,model_col]
Out[61]: 
       colA colB      colC
count    30   30        30
unique    2    5       NaN
top     foo    d       NaN
freq     20    6       NaN
mean    NaN  NaN      14.5
std     NaN  NaN  8.803408
min     NaN  NaN         0
25%     NaN  NaN      7.25
50%     NaN  NaN      14.5
75%     NaN  NaN     21.75
max     NaN  NaN        29
  • It also works out-of-the-box with groupby() result objects. This is very useful when exploring data interactively: I can easily view/stack the summary results of my various grouping pivots. An alternative would need several lines of code, possibly a small helper function (or an ugly lambda) every time.

In the example above, even with select_dtypes(), I'm not sure how to properly include a summary of colB (categorical) in the default df.groupby("colA").describe() output. In my proposal, it's just a matter of df.groupby("colA").describe(return_type="same").

Of course, it's more convincing with large real-world DataFrames of mixed types (as used, e.g., in psychology), where it's easy to lose track of every column and its type.

  • Most of the internal type-selecting logic was already implemented, so it was only a matter of exposing it.

@jreback (Contributor) commented Sep 4, 2014

ok, your idea of 'same' is ok, but the API is not consistent with the pandas style.

I would be ok with adding include/exclude keywords to .describe to filter (with a default of include='number', exclude=None). You could simply make it accept include='same' (and intercept that) to implement the functionality (and use .select_dtypes internally).

It's a well-constructed and general API by @cpcloud.
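
For illustration, a minimal sketch of that kind of dispatch (the helper name and exact defaults are hypothetical, not the eventual implementation):

import pandas as pd

def describe_filtered(df, include=None, exclude=None):
    # Hypothetical helper mirroring the suggested API: with no filter given,
    # keep the previous type-dependent behaviour; otherwise delegate the
    # column selection to select_dtypes and describe what remains.
    if include is None and exclude is None:
        return df.describe()
    return df.select_dtypes(include=include, exclude=exclude).describe()

For the frame above, describe_filtered(df, include=['object']) would reproduce the Out[8] summary.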

@bthyreau (Contributor Author) commented Sep 5, 2014

Ok, I implemented your suggested API, and it's indeed more flexible while retaining the usability. Great!

Now it's possible to specify the output form using include=/exclude= lists.
I override 'include' to add the 'all' (alias '*') keyword, which enforces no filtering; the None/None (default) pair enforces the previous, type-dependent behaviour.

Some snippets below:

>>> from pandas import Series
>>> from pandas import DataFrame
>>> import pandas.util.testing as tm
>>> import numpy as np
>>> 
>>> df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8,
...         'catB': ['a', 'b', 'c', 'd'] * 6,
...         'numC': np.arange(24),
...         'numD': np.arange(24.) + .5,
...         'ts': tm.makeTimeSeries()[:24].index})
>>> 
>>> 
>>> df.describe(include=["number","object"])
       catA catB       numC       numD
count    24   24  24.000000  24.000000
unique    2    4        NaN        NaN
top     foo    d        NaN        NaN
freq     16    6        NaN        NaN
mean    NaN  NaN  11.500000  12.000000
std     NaN  NaN   7.071068   7.071068
min     NaN  NaN   0.000000   0.500000
25%     NaN  NaN   5.750000   6.250000
50%     NaN  NaN  11.500000  12.000000
75%     NaN  NaN  17.250000  17.750000
max     NaN  NaN  23.000000  23.500000
>>> df.loc[:,:].describe() # as before
            numC       numD
count  24.000000  24.000000
mean   11.500000  12.000000
std     7.071068   7.071068
min     0.000000   0.500000
25%     5.750000   6.250000
50%    11.500000  12.000000
75%    17.250000  17.750000
max    23.000000  23.500000
>>> 
>>> df.loc[:,['catA','catB','ts']].describe() # contains NaN, as before
       catA catB                   ts
count    24   24                   24
unique    2    4                   24
first   NaN  NaN  2000-01-03 00:00:00
last    NaN  NaN  2000-02-03 00:00:00
top     foo    d  2000-01-31 00:00:00
freq     16    6                    1
>>> 
>>> df.describe(include=["object"])
       catA catB
count    24   24
unique    2    4
top     foo    d
freq     16    6
>>> df.describe(include='*')
       catA catB       numC       numD                   ts
count    24   24  24.000000  24.000000                   24
unique    2    4        NaN        NaN                   24
top     foo    d        NaN        NaN  2000-01-31 00:00:00
freq     16    6        NaN        NaN                    1
first   NaN  NaN        NaN        NaN  2000-01-03 00:00:00
last    NaN  NaN        NaN        NaN  2000-02-03 00:00:00
mean    NaN  NaN  11.500000  12.000000                  NaN
std     NaN  NaN   7.071068   7.071068                  NaN
min     NaN  NaN   0.000000   0.500000                  NaN
25%     NaN  NaN   5.750000   6.250000                  NaN
50%     NaN  NaN  11.500000  12.000000                  NaN
75%     NaN  NaN  17.250000  17.750000                  NaN
max     NaN  NaN  23.000000  23.500000                  NaN
>>> 
>>> df.loc[:,['catA','catB']].describe(include='*')
       catA catB
count    24   24
unique    2    4
top     foo    d
freq     16    6
>>> df.describe(include='*', exclude='XXX')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/generic.py", line 3681, in describe
    raise ValueError("exclude must be None when include is '%s'" % include)
ValueError: exclude must be None when include is '*'
>>> 
>>> df.groupby("catA").describe(include='*') # my favorite
            catB       numC       numD                   ts
catA                                                       
bar  count     8   8.000000   8.000000                    8
     unique    4        NaN        NaN                    8
     top       d        NaN        NaN  2000-01-31 00:00:00
     freq      2        NaN        NaN                    1
     first   NaN        NaN        NaN  2000-01-05 00:00:00
     last    NaN        NaN        NaN  2000-02-03 00:00:00
     mean    NaN  12.500000  13.000000                  NaN
     std     NaN   7.348469   7.348469                  NaN
     min     NaN   2.000000   2.500000                  NaN
     25%     NaN   7.250000   7.750000                  NaN
     50%     NaN  12.500000  13.000000                  NaN
     75%     NaN  17.750000  18.250000                  NaN
     max     NaN  23.000000  23.500000                  NaN
foo  count    16  16.000000  16.000000                   16
     unique    4        NaN        NaN                   16
     top       d        NaN        NaN  2000-01-25 00:00:00
     freq      4        NaN        NaN                    1
     first   NaN        NaN        NaN  2000-01-03 00:00:00
     last    NaN        NaN        NaN  2000-02-02 00:00:00
     mean    NaN  11.000000  11.500000                  NaN
     std     NaN   7.118052   7.118052                  NaN
     min     NaN   0.000000   0.500000                  NaN
     25%     NaN   5.500000   6.000000                  NaN
     50%     NaN  11.000000  11.500000                  NaN
     75%     NaN  16.500000  17.000000                  NaN
     max     NaN  22.000000  22.500000                  NaN
>>> df.groupby("catA").describe(include=["object", "datetime", "number"], exclude=["float"])
            catB       numC                   ts
catA                                            
bar  count     8   8.000000                    8
     unique    4        NaN                    8
     top       d        NaN  2000-01-31 00:00:00
     freq      2        NaN                    1
     first   NaN        NaN  2000-01-05 00:00:00
     last    NaN        NaN  2000-02-03 00:00:00
     mean    NaN  12.500000                  NaN
     std     NaN   7.348469                  NaN
     min     NaN   2.000000                  NaN
     25%     NaN   7.250000                  NaN
     50%     NaN  12.500000                  NaN
     75%     NaN  17.750000                  NaN
     max     NaN  23.000000                  NaN
foo  count    16  16.000000                   16
     unique    4        NaN                   16
     top       d        NaN  2000-01-25 00:00:00
     freq      4        NaN                    1
     first   NaN        NaN  2000-01-03 00:00:00
     last    NaN        NaN  2000-02-02 00:00:00
     mean    NaN  11.000000                  NaN
     std     NaN   7.118052                  NaN
     min     NaN   0.000000                  NaN
     25%     NaN   5.500000                  NaN
     50%     NaN  11.000000                  NaN
     75%     NaN  16.500000                  NaN
     max     NaN  22.000000                  NaN

Some minor design-decision points

  • Meaningless input raises exceptions (mostly through select_dtypes), except through groupby(), which catches them and falls back to returning 'all'.
  • The index of the returned results (count/mean/50%/unique/first...) still varies depending on the DataFrame content (which kind of contradicts my initial complaint about columns; I never saw that as a problem when it concerned rows). Its order is also not stable.
  • I thought it might be convenient to move the '*' dtype-filtering trick directly into select_dtypes(). Would that be useful?

for x in fself.columns]
# merge individual outputs, preserving index order as possible
names = []
ldesc_indexes = sorted([x.index for x in ldesc], key=len)
Review comment (Contributor):

hmm concat does this by default (preserve order and concat other axes)
so no need for all this section (you might need to tweak some options to concat)

Review comment (Contributor Author):

Unless I missed something, I believe that by default only the column order is preserved; the index order is lexsorted, which makes the output a bit unfriendly. Without it:

pd.concat(ldesc, keys=fself.columns, axis=1)
Out[38]: 
       catA catB       numC       numD                   ts
25%     NaN  NaN   5.750000   6.250000                  NaN
50%     NaN  NaN  11.500000  12.000000                  NaN
75%     NaN  NaN  17.250000  17.750000                  NaN
count    24   24  24.000000  24.000000                   24
first   NaN  NaN        NaN        NaN  2000-01-03 00:00:00
freq     16    6        NaN        NaN                    1
last    NaN  NaN        NaN        NaN  2000-02-03 00:00:00
max     NaN  NaN  23.000000  23.500000                  NaN
mean    NaN  NaN  11.500000  12.000000                  NaN
min     NaN  NaN   0.000000   0.500000                  NaN
std     NaN  NaN   7.071068   7.071068                  NaN
top     foo    d        NaN        NaN  2000-01-31 00:00:00
unique    2    4        NaN        NaN                   24

It interleaves the output of the different types. Although the output index order of describe() is not supposed to be guaranteed, I found it very convenient for interactive sessions.

Review comment (Contributor):

you don't need to use keys
that's the problem, just construct the sub-frames and concat glues them together in the same order

Review comment (Contributor Author):

Sorry, I'm confused.

Here, the ldesc list is already in column order, ready to be concat'ed; I use keys= only as a way to set back the column names that the Series.describe() calls lost.
The index (row) order has to be arranged "by blocks" according to a logic that cannot trivially be inferred from the list (i.e. categorical rows if any, then numeric if any); even if I could somehow avoid the lexsort and use the inputs' orders (which I failed to do), it would order the blocks differently depending on the first passed column's dtype, which is less friendly for the user. So I believe it's necessary to pre-create the index order ("names" here).

I admit I could use join_axes=Index(names) instead of .loc[names], though.
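
For reference, a self-contained sketch of the two forms being weighed here (ldesc, columns and names are the hypothetical inputs from the snippets above; join_axes is the concat keyword available in pandas at the time):

import pandas as pd

def order_rows(ldesc, columns, names):
    # ldesc: list of per-column summary Series, already in column order;
    # columns: their names; names: the pre-computed row order.
    # Either reindex the rows after concatenation...
    return pd.concat(ldesc, keys=columns, axis=1).loc[names]
    # ...or, equivalently, pass the row order straight to concat:
    # return pd.concat(ldesc, keys=columns, axis=1, join_axes=[pd.Index(names)])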

@cpcloud (Member) commented Sep 5, 2014

Introducing this kind of coupling hardly seems worth it for the "inconvenience" of having to call a single method.

@cpcloud (Member) commented Sep 5, 2014

Also, how is 'all' and star in select_dtypes different from just not calling the method?

@bthyreau (Contributor Author)

Ok, following the previous comment, I refrained from touching select_dtypes.

I left the row-creation logic as it was, as I believed it was necessary, as discussed before.

I'm quite happy with the current code. In addition to the docstrings, I also added a brief overview in the main doc.

[I also have a short changelog entry ready, but I guess it's rude to commit it (e.g. in v0.15.txt) before knowing whether you plan to merge this at all :)]

@bthyreau changed the title from "more flexible describe() + tests" to "ENH: more flexible describe() + tests" on Sep 11, 2014
.. ipython:: python

   frame = DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})
   frame.describe(include=['object'])
Review comment (Contributor):

show frame.describe(), e.g. the default case

@bthyreau (Contributor Author)

Ok, I updated the main doc and docstring following your request.
Thanks.

As for the rationale behind those loops: they are necessary to compute the order of the row axis (the list of statistics). describe() must output results that are immediately practical for users; without the loop, as you showed, the percentiles are not surrounded by min/max, count ends up in the middle, etc., due to the default lexsorting logic of Index operations.

In detail, in the snippet below:

  • The first loop is basically the apply (except the concatenation is delayed, so I can still access each series' length).
  • The second loop is the index-sorting logic.

That's why I gave up using apply in this case. I also experimented with other approaches, such as the various Index manipulation functions or pre-computing the row keys, but they didn't improve much.

Note also that, as a side effect, the whole function seems to be slightly faster than the bare, logic-less apply.

def test1(fself, percentile_width=None, percentiles=[]):
    ldesc = []
    for name, col in fself.iteritems():
        s = col.describe(percentile_width=percentile_width,
                         percentiles=percentiles)
        s.name = name
        ldesc.append(s)
    # set a convenient order for rows
    names = []
    ldesc_indexes = sorted([x.index for x in ldesc], key=len)
    for idxnames in ldesc_indexes:
        for name in idxnames:
            if name not in names:
                names.append(name)
    d = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
    return d

In [84]: %timeit test1(df, percentiles=[.42])
100 loops, best of 3: 5.4 ms per loop

In [85]: %timeit df.apply(lambda x : x.describe(percentile_width = None, percentiles=[.42]))
100 loops, best of 3: 6.59 ms per loop

Same pattern on a wider (24, 500)-shaped df: 458 ms vs 499 ms

# set a convenient order for rows
names = []
ldesc_indexes = sorted([x.index for x in ldesc], key=len)
for idxnames in ldesc_indexes:
Review comment (Contributor):

at the very least this should be a list comprehension then

@bthyreau (Contributor Author)

Well, it's only style, but if you want the list comprehension back, then fine; while at it, to make an actual improvement, I changed the behaviour on Series so that the name gets filled in at creation time. See the commit.
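
For illustration only (not the actual commit), the idea is that each per-column summary carries its name from the start, so the explicit s.name = name assignment becomes unnecessary; roughly:

import pandas as pd

def describe_1d(col):
    # Hypothetical sketch: build the summary Series with its name already set,
    # instead of assigning s.name = name after the fact.
    stats = [col.count(), col.mean(), col.std(), col.min(), col.max()]
    return pd.Series(stats, index=['count', 'mean', 'std', 'min', 'max'],
                     name=col.name)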

@jreback (Contributor) commented Sep 14, 2014

pls squash to a single commit

fself = self.select_dtypes(include=include, exclude=exclude)
ldesc = [col.describe(percentile_width=percentile_width,
percentiles=percentiles) for _, col in fself.iteritems()]
# set a convenient order for rows
Review comment (Contributor):

blank line between blocks

@bthyreau (Contributor Author)

Ok, I squashed all commits into one, which updates the code and the main doc.
What I haven't committed yet is the changelog/release-notes text, which I have ready. Should I commit it somewhere?
Thanks

@@ -490,6 +490,23 @@ number of unique values and most frequently occurring values:
s = Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])
s.describe()

Note that on a mixed-type DataFrame object, `describe` will restrict the summary to
include only numerical columns or, if none are present, only categorical columns:
Review comment (Contributor):

pls add a similar note to v0.15.0.txt (and include this PR number as a reference). Put it in the API section, and include a reference to this doc section.

frame.describe(include=['object'])
frame.describe(include=['number'])
frame.describe(include='all')

Review comment (Member):

Can you refer to select_dtypes here? (as it is in that function that the include/exclude arguments are explained in more detail)

@jorisvandenbossche (Member)

@bthyreau added some more comments (sorry it took a while to look at it).
Further, I have another remark: is the '*' option needed? Is just using 'all' not enough? (it's only two characters more). I don't think * is used anywhere in pandas at the moment, and it would just be two aliases for the same thing.

@cpcloud: are you ok with including this, as you objected in the first place?

@bthyreau (Contributor Author) commented Oct 1, 2014

  • Thanks for your feedback. Following your comments about the condition path, I finally decided to refactor that whole function. I was initially reluctant to change it, as it alters some of the (unspecified) order of the fields in the output; but it makes the code quite a bit smaller and much clearer now.
    The only user-visible change is that the order of the output rows is slightly different (in a more consistent way, IMHO), and I had to alter an existing unit test which assumed an order.
  • Also, I wanted to confirm the behaviour using the new pd.Categorical data type, but I think select_dtypes() can't (explicitly) filter on it (yet?). Am I right?

@jreback (Contributor) commented Oct 1, 2014

select_dtypes handles categorical (pass in 'category' as the dtype)
let's drop using * and just use all to include all dtypes
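
For example (a made-up frame; describe(include=...) assumes the API from this PR):

import pandas as pd

df = pd.DataFrame({'cat': pd.Categorical(['foo', 'foo', 'bar']),
                   'num': [1, 2, 3]})
df.select_dtypes(include=['category'])   # keeps only the categorical column
df.describe(include=['category'])        # summary restricted to categoricals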

@bthyreau (Contributor Author) commented Oct 1, 2014

@jreback ok thanks. Ok to drop "*" if you think it's inconsistent with the rest of pandas.
I'll fix the branch ASAP (it fails due to a minor discrepancy regarding the status of "bool", plus I want to assess categories) and push it again soon.

@bthyreau (Contributor Author) commented Oct 2, 2014

  • Ok, removed '*' as you requested.
    [Of course, if it were only up to me, "all" would be the default describe() behaviour (since I can't figure out the motivation for silently dropping random columns), so in real-world use I will have to alias it anyway.] I hope I made the doc clear enough to warn about this pitfall.

Thanks

@jreback (Contributor) commented Oct 2, 2014

@bthyreau well

describe is most useful for numeric columns only, and that is the default,
so it provides backwards compat.

It does not drop random columns; rather, it selects numeric columns by default
(which is how most operations work, FYI).

@@ -3658,6 +3658,16 @@ def abs(self):
The percentiles to include in the output. Should all
be in the interval [0, 1]. By default `percentiles` is
[.25, .5, .75], returning the 25th, 50th, and 75th percentiles.
include, exclude : list-like, 'all', or None (default)
Specify the form of the returned result. Either:
Review comment (Member):

You have to leave an empty line after this line so that the list will be rendered as a list (in the online HTML docstring pages). The other option is to remove the line and just have the list items; then there does not need to be an empty line.
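
For instance, the rendering issue boils down to this (a hypothetical docstring fragment, not the actual patch; note the blank line before the bullets):

def describe(self, include=None, exclude=None):
    """
    ...
    include, exclude : list-like, 'all', or None (default)
        Specify the form of the returned result. Either:

        - None (default) : keep the previous, type-dependent behaviour
        - a list of dtypes or strings : restrict the summary to those dtypes
        - 'all' (include only) : summarize every column
    """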

@bthyreau (Contributor Author) commented Oct 3, 2014

Ok, refactored a bit to avoid the recomputation of parameters due to recursion. Thanks for pointing it out; as a bonus, the code paths are shorter and easier to follow!
I also updated the docstring following your suggestion about more explicit mixed-type cases.

Thanks!

@@ -3751,42 +3767,45 @@ def describe_categorical_1d(data):

elif issubclass(data.dtype.type, np.datetime64):
asint = data.dropna().values.view('i8')
names += ['first', 'last', 'top', 'freq']
Review comment (Contributor):

I suspect this does not work with Timedelta, but easy to fix, do something like:

if com.needs_i8_conversion(data):
    boxer = com.i8_boxer(data)
    asint = data.dropna().asi8

    names = ......

then just use boxer rather than lib.Timestamp

Review comment (Contributor Author):

Thanks for pointing that out.

Hmm, although your snippet works well, Timedelta seems closer to a numeric type than to an object one: select_dtypes() filters it with np.number, and users would want min/max/mean values, not first/last. So instead I moved it to the describe_numeric code path. Since it's new in 0.15, I think it's ok to do so even in the "default" describe call. However, std() doesn't make a lot of sense, so an alternative would be to create another specific index for timedelta.

i.e. the current code behaves:

In [4]: df = DataFrame({'A_cat': Categorical(['foo', 'foo', 'bar'] * 8),
   ...:                 'B_obj': ['a', 'b', 'c', 'd'] * 6,
   ...:                 'C_int': np.arange(24, dtype='int64'),
   ...:                 'D_ts': tm.makeTimeSeries(24).index,
   ...:                 'E_tdelta': to_timedelta(np.arange(24)%20,"D")})

In [6]: df.describe(include="all")
Out[6]: 
       A_cat B_obj      C_int                 D_ts          E_tdelta
count     24    24  24.000000                   24                24
unique     2     4        NaN                   24               NaN
top      foo     d        NaN  2000-01-31 00:00:00               NaN
freq      16     6        NaN                    1               NaN
first    NaN   NaN        NaN  2000-01-03 00:00:00               NaN
last     NaN   NaN        NaN  2000-02-03 00:00:00               NaN
mean     NaN   NaN  11.500000                  NaN   8 days 04:00:00
std      NaN   NaN   7.071068                  NaN      5.354416e+14
min      NaN   NaN   0.000000                  NaN   0 days 00:00:00
25%      NaN   NaN   5.750000                  NaN   2 days 18:00:00
50%      NaN   NaN  11.500000                  NaN   7 days 12:00:00
75%      NaN   NaN  17.250000                  NaN  13 days 06:00:00
max      NaN   NaN  23.000000                  NaN  19 days 00:00:00

In [7]: df.describe()
Out[7]: 
           C_int          E_tdelta
count  24.000000                24
mean   11.500000   8 days 04:00:00
std     7.071068      5.354416e+14
min     0.000000   0 days 00:00:00
25%     5.750000   2 days 18:00:00
50%    11.500000   7 days 12:00:00
75%    17.250000  13 days 06:00:00
max    23.000000  19 days 00:00:00

Review comment (Contributor):

this is fine (I am going to fix the std of the Timedelta right now actually). go ahead and make those slight changes (below) and we'll get this in.

Review comment (Contributor):

On second thought, even though std IS possible, it's too tricky for right now (the problem is that var is not allowed because it CAN overflow and not be representable as a Timedelta, so std needs special handling; punting for now). Putting a wrapper around these ops to catch a TypeError and turn it into a NaN is the best solution, I think.
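
A minimal sketch of the kind of wrapper being suggested (the helper name is hypothetical):

import numpy as np

def _safe_reduce(series, op):
    # Run a reduction such as 'std' or 'mean'; if the dtype rejects it
    # (as timedelta std/var did at the time), report NaN instead of raising.
    try:
        return getattr(series, op)()
    except TypeError:
        return np.nan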

Review comment (Contributor Author):

Hmm; in the example above, the float64 value (5.354416e+14) for std() is actually correct, and converts correctly with to_timedelta(df.std()) to 6 days 04:44:01.629993; [var(), otoh, throws an OverflowError, but describe() doesn't use var()].

I'm not sure why the returned std value is not a timedelta64 but a float64; it looks like a problem with the std() call. Did you mean I should still catch that in describe() and recover? (I may have misunderstood since I'm not familiar with that part.)

I fixed your other comment - thanks!

@jreback (Contributor) commented Oct 5, 2014

@bthyreau small change.

Some .std() calls on a Timedelta will work (except for the returned inference). But that's the problem: some will also overflow. This stems from std = sqrt(var), and var can be a really big number.

In [7]: max_int = np.iinfo(np.int64).max

In [8]: max_int
Out[8]: 9223372036854775807

In [9]: pd.Timedelta(max_int)
Out[9]: Timedelta('106751 days 23:47:16.854775')

In [10]: big_float = 3e19

In [12]: pd.Timedelta(big_float)
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
OverflowError: Python int too large to convert to C long

In [13]: pd.Timedelta(np.sqrt(big_float))
Out[13]: Timedelta('0 days 00:00:05.477225')

Going to create an issue to fix this, but don't have time right now.
So just wrap the .std() as I show above to catch this (it will raise TypeError because it doesn't allow std/var ATM).

The complication is that std is allowed, but not var (and std CALLS var). So this needs to be done in a non-hacky way.

@jreback (Contributor) commented Oct 5, 2014

see here: #8471

lmk when you make that change and push.

@jreback (Contributor) commented Oct 5, 2014

@bthyreau if you can address this soon that would be great;
if not, let me know - I'm probably going to do an rc tonight and this goes in.

@jreback (Contributor) commented Oct 5, 2014

I think #8476 will allow this to merge cleanly. so hold off

@jreback (Contributor) commented Oct 5, 2014

@bthyreau ok, I think if you rebase this should work

@bthyreau (Contributor Author) commented Oct 5, 2014

ok great. Rebasing and pushing now

This enhances describe()'s output via new include/exclude list arguments,
letting the user specify the dtypes to be summarized in the output.
This provides a simple way to override the automatic type-filtering done
by default; it's also convenient with groupby().
Also includes documentation and changelog entries.
@jreback (Contributor) commented Oct 6, 2014

merged via 6d3803d

thanks!

@jreback (Contributor) commented Oct 6, 2014

side issue:

I think we may need a rounding option or something to make some of the default
display look better. This comes up most often with timedeltas, but maybe other dtypes too.

This example is from your tests.

In [23]: df['td'] = df['ts'].diff()

In [24]: df.describe(include='all')
Out[24]: 
       catA catB       numC       numD                   ts                      td
count    24   24  24.000000  24.000000                   24                      23
unique    2    4        NaN        NaN                   24                     NaN
top     foo    d        NaN        NaN  2000-01-31 00:00:00                     NaN
freq     16    6        NaN        NaN                    1                     NaN
first   NaN  NaN        NaN        NaN  2000-01-03 00:00:00                     NaN
last    NaN  NaN        NaN        NaN  2000-02-03 00:00:00                     NaN
mean    NaN  NaN  11.500000  12.000000                  NaN  1 days 08:20:52.173913
std     NaN  NaN   7.071068   7.071068                  NaN  0 days 18:36:09.225425
min     NaN  NaN   0.000000   0.500000                  NaN         1 days 00:00:00
25%     NaN  NaN   5.750000   6.250000                  NaN         1 days 00:00:00
50%     NaN  NaN  11.500000  12.000000                  NaN         1 days 00:00:00
75%     NaN  NaN  17.250000  17.750000                  NaN         1 days 00:00:00
max     NaN  NaN  23.000000  23.500000                  NaN         3 days 00:00:00

In [25]: df['td'].describe(include='all')
Out[25]: 
count                        23
mean     1 days 08:20:52.173913
std      0 days 18:36:09.225425
min             1 days 00:00:00
25%             1 days 00:00:00
50%             1 days 00:00:00
75%             1 days 00:00:00
max             3 days 00:00:00
Name: td, dtype: object

In [26]: df['td'].describe(include='all')['mean']
Out[26]: Timedelta('1 days 08:20:52.173913')

You can 'fix' this by rounding (and you can check Timedelta(...).resolution to make sure that you are not cutting things off, e.g. Timedelta('1min 1s').resolution -> 's').

In [27]: df['td'].describe(include='all')['mean'].round('s')
Out[27]: Timedelta('1 days 08:20:52')

So we probably need a wrapper for various functions (e.g. mean/std) to do this (for numeric-like dtypes), with an option.
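
A rough sketch of what such a wrapper could look like (purely illustrative; the name and the fixed 's' resolution are assumptions, and checking Timedelta(...).resolution first, as above, would avoid cutting anything off):

import pandas as pd

def _display_stat(series, op, freq='s'):
    # Compute a reduction and, for timedelta results, round to the given
    # resolution so the describe() output stays readable.
    value = getattr(series, op)()
    if isinstance(value, pd.Timedelta):
        return value.round(freq)
    return value

With the frame above, _display_stat(df['td'], 'mean') would match the rounded In [27] output.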

@bthyreau If you think this is worthwhile, pls create a new issue.

@jorisvandenbossche
@cpcloud

@jreback closed this on Oct 6, 2014
Labels: API Design, Dtype Conversions

4 participants