# Data Aggregation and Group Operations

* Split a pandas object into pieces using one or more keys
* Compute group summary statistics
* Apply functions to each column of a DataFrame
* Apply within-group transformations or other manipulations
* Compute pivot tables and cross-tabulations
* Perform quantile analysis and other group analyses

## GroupBy Mechanics

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

In [4]:
grouped = df['data1'].groupby(df['key1'])

In [5]:
grouped

<pandas.core.groupby.SeriesGroupBy object at 0x7f2478507d50>

In [6]:
grouped.mean()

key1
a    0.371471
b   -0.552806
Name: data1, dtype: float64

In [10]:
df['data1'].groupby([df['key1'], df['key2']]).mean()

key1  key2
a     one     0.451829
      two     0.210756
b     one    -0.335555
      two    -0.770057
Name: data1, dtype: float64

In [11]:
df['data1'].groupby([df['key1'], df['key2']]).sum()

key1  key2
a     one     0.903658
      two     0.210756
b     one    -0.335555
      two    -0.770057
Name: data1, dtype: float64

In [12]:
df['data1'].groupby([df['key1'], df['key2']]).count()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

In [13]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()

In [14]:
means

key1  key2
a     one     0.451829
      two     0.210756
b     one    -0.335555
      two    -0.770057
Name: data1, dtype: float64

In [15]:
means.unstack()

key2       one       two
key1
a     0.451829  0.210756
b    -0.335555 -0.770057


In the examples above, the group keys are all Series, though they could be any arrays of the right length.

In [16]:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])

In [17]:
states

array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'], 
      dtype='|S10')

In [18]:
years = np.array([2005, 2005, 2006, 2005, 2006])

In [19]:
df['data1'].groupby([states, years]).mean()

California  2005    0.210756
            2006   -0.335555
Ohio        2005    0.006300
            2006    0.121001
Name: data1, dtype: float64

Frequently the grouping information you're looking for is located in the same DataFrame as the data you're looking to summarize. In that case, you can pass column names as the group keys.

In [22]:
df.groupby('key1').mean() # passing a column name as the key requires the DataFrame method

         data1     data2
key1
a     0.371471 -0.721434
b    -0.552806 -0.544059


In [25]:
df.groupby('key1').mean().stack().unstack(0) # transpose the dataframe

key1          a         b
data1  0.371471 -0.552806
data2 -0.721434 -0.544059


In [26]:
df.groupby(['key1', 'key2']).mean()

              data1     data2
key1 key2
a    one   0.451829 -0.367857
     two   0.210756 -1.428587
b    one  -0.335555   0.34532
     two  -0.770057 -1.433438


In [30]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
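The equivalence between column-name keys and Series keys, and the difference between `size` and `count`, can be checked on a small hand-written frame. This is a minimal sketch; the frame `toy` and its values are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data, small enough to verify by eye.
toy = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, np.nan],
})

# Grouping by a column name and by the equivalent Series gives the same result.
by_name = toy.groupby("key1")["data1"].mean()
by_series = toy["data1"].groupby(toy["key1"]).mean()

# size() counts rows per group; count() counts only non-missing values.
sizes = toy.groupby(["key1", "key2"]).size()
counts = toy.groupby(["key1", "key2"])["data1"].count()

print(by_name)
print(sizes)
print(counts)
```

Missing values are skipped by `mean` and `count`, which is why the `('a', 'one')` group has two rows but only one counted value.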

### Iterating Over Groups
The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data. Consider the following example:

In [36]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)
    print('\n')

a
      data1     data2 key1 key2
0  0.782657 -0.208217    a  one
1  0.210756 -1.428587    a  two
4  0.121001 -0.527497    a  one


b
      data1     data2 key1 key2
2 -0.335555  0.345320    b  one
3 -0.770057 -1.433438    b  two




In [40]:
for key, group in df.groupby(['key1', 'key2']):
    print("First key is '%s' and second key is '%s'" % (key[0], key[1]))
    print(group)
    print('\n')

First key is 'a' and second key is 'one'
      data1     data2 key1 key2
0  0.782657 -0.208217    a  one
4  0.121001 -0.527497    a  one


First key is 'a' and second key is 'two'
      data1     data2 key1 key2
1  0.210756 -1.428587    a  two


First key is 'b' and second key is 'one'
      data1    data2 key1 key2
2 -0.335555  0.34532    b  one


First key is 'b' and second key is 'two'
      data1     data2 key1 key2
3 -0.770057 -1.433438    b  two




In [41]:
pieces = dict(list(df.groupby('key1')))

In [42]:
pieces['b']

      data1     data2 key1 key2
2 -0.335555   0.34532    b  one
3 -0.770057 -1.433438    b  two


In [44]:
pieces['a']

      data1     data2 key1 key2
0  0.782657 -0.208217    a  one
1  0.210756 -1.428587    a  two
4  0.121001 -0.527497    a  one


In [47]:
dict(list(df.groupby('key1')))

{'a':       data1     data2 key1 key2
 0  0.782657 -0.208217    a  one
 1  0.210756 -1.428587    a  two
 4  0.121001 -0.527497    a  one, 'b':       data1     data2 key1 key2
 2 -0.335555  0.345320    b  one
 3 -0.770057 -1.433438    b  two}
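As a compact illustration of iteration, the sketch below records which row labels land in each group and then retrieves a single group with `get_group`, which avoids building the full dict. The frame `toy` is a made-up example, not data from the notebook.

```python
import pandas as pd

# Hypothetical data for illustration.
toy = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "value": [10, 20, 30, 40, 50],
})

# Iterating yields (group name, sub-DataFrame) pairs.
row_labels = {}
for name, group in toy.groupby("key1"):
    row_labels[name] = list(group.index)

# get_group returns one piece directly.
b_piece = toy.groupby("key1").get_group("b")

print(row_labels)
print(b_piece)
```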

By default `groupby` groups on `axis=0`, but you can group on any of the other axes. For example, we could group the columns of our example `df` here by `dtype` like so:

In [48]:
df.dtypes

data1    float64
data2    float64
key1      object
key2      object
dtype: object

In [49]:
grouped = df.groupby(df.dtypes, axis=1)

In [52]:
grouped = dict(list(grouped))

In [54]:
grouped.keys()

[dtype('O'), dtype('float64')]

In [63]:
for key, value in grouped.items():
    print("Key is '%s'" % key)
    print(value)
    print('\n')

Key is 'object'
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one


Key is 'float64'
      data1     data2
0  0.782657 -0.208217
1  0.210756 -1.428587
2 -0.335555  0.345320
3 -0.770057 -1.433438
4  0.121001 -0.527497




### Selecting a Column or Subset of Columns
Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of selecting those columns for aggregation.

In [71]:
df.groupby('key1')['data1'].mean() # syntactic sugar for df['data1'].groupby(df['key1']).mean()

key1
a    0.371471
b   -0.552806
Name: data1, dtype: float64

In [73]:
df.groupby('key1')[['data1']].mean() # syntactic sugar for df[['data1']].groupby(df['key1']).mean()

         data1
key1
a     0.371471
b    -0.552806


In [74]:
df.groupby(['key1', 'key2'])[['data2']].mean()

              data2
key1 key2
a    one  -0.367857
     two  -1.428587
b    one    0.34532
     two  -1.433438


The object returned by this indexing operation is a grouped DataFrame if a list or array is passed and a grouped Series if just a single column name is passed as a scalar.
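The distinction between the two return types can be confirmed directly. A minimal sketch, assuming a made-up frame `toy`:

```python
import pandas as pd

# Hypothetical data for illustration.
toy = pd.DataFrame({
    "key1": ["a", "a", "b"],
    "data1": [1.0, 2.0, 3.0],
    "data2": [4.0, 5.0, 6.0],
})
gb = toy.groupby("key1")

as_series = gb["data1"].mean()    # scalar column name -> Series result
as_frame = gb[["data1"]].mean()   # list of column names -> DataFrame result

print(type(as_series).__name__)
print(type(as_frame).__name__)
```

Selecting columns before aggregating also means only those columns are computed, which matters on wide frames.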

### Grouped with Dicts and Series
Grouping information may exist in a form other than an array. Let's consider another example DataFrame.

In [75]:
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

In [76]:
people

               a         b         c         d         e
Joe     0.094362  0.642776  0.385145  -0.60908 -0.529635
Steve   0.435791  1.192925  -0.52309  1.344837  0.429187
Wes     0.101914  0.584637  0.477111 -0.680072  1.646028
Jim    -1.257609   0.05352  0.116609  1.959327  0.294609
Travis  0.392042  0.610157 -0.117805  0.396418 -0.088487


In [83]:
people.mean(axis=1) # row means

Joe      -0.003286
Steve     0.575930
Wes       0.425924
Jim       0.233291
Travis    0.238465
dtype: float64

In [91]:
# add a few NAs (`.ix` has been removed from pandas; `.iloc` selects the same cells by position)
people.iloc[2:3, [1, 2]] = np.nan

In [92]:
people

               a         b         c         d         e
Joe     0.094362  0.642776  0.385145  -0.60908 -0.529635
Steve   0.435791  1.192925  -0.52309  1.344837  0.429187
Wes     0.101914       NaN       NaN -0.680072  1.646028
Jim    -1.257609   0.05352  0.116609  1.959327  0.294609
Travis  0.392042  0.610157 -0.117805  0.396418 -0.088487


Now, suppose I have a group correspondence for the columns and want to aggregate the columns by group (the example below takes the mean):

In [94]:
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}

In [95]:
by_column = people.groupby(mapping, axis=1)

In [96]:
by_column.mean()

            blue       red
Joe    -0.111968  0.069168
Steve   0.410874  0.685968
Wes    -0.680072  0.873971
Jim     1.037968  -0.30316
Travis  0.139306  0.304571


The same functionality holds for a Series used as the mapping.

In [97]:
map_series = pd.Series(mapping)

In [98]:
map_series

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [99]:
people.groupby(map_series, axis=1).mean()

            blue       red
Joe    -0.111968  0.069168
Steve   0.410874  0.685968
Wes    -0.680072  0.873971
Jim     1.037968  -0.30316
Travis  0.139306  0.304571


In [100]:
people.groupby(map_series, axis=1).count()

        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3
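Recent pandas releases deprecate `axis=1` in `DataFrame.groupby`, so the column-mapping idea can also be written by transposing, grouping the (now row) labels, and transposing back. The sketch below assumes a small made-up frame `scores`; the mapping includes an unused key (`'z'`) to show that extra keys are harmless.

```python
import pandas as pd

# Hypothetical data for illustration.
scores = pd.DataFrame(
    {"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]},
    index=["Joe", "Steve"],
)

# Keys that match no column ('z' here) are simply unused.
mapping = {"a": "red", "b": "red", "c": "blue", "z": "orange"}

# Transpose so the columns become the index, group by the mapping,
# aggregate, then transpose back to the original orientation.
by_color = scores.T.groupby(mapping).sum().T

print(by_color)
```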


### Grouping with Functions

In [101]:
people.groupby(len).sum() # len of index values (in this case, the names)

          a         b         c         d         e
3 -1.061332  0.696296  0.501754  0.670175  1.411002
5  0.435791  1.192925  -0.52309  1.344837  0.429187
6  0.392042  0.610157 -0.117805  0.396418 -0.088487


In [102]:
# group by len and an array
key_list = ['one', 'one', 'one', 'two', 'two']

In [103]:
people.groupby([len, key_list]).sum()

              a         b         c         d         e
3 one  0.196277  0.642776  0.385145 -1.289151  1.116393
  two -1.257609   0.05352  0.116609  1.959327  0.294609
5 one  0.435791  1.192925  -0.52309  1.344837  0.429187
6 two  0.392042  0.610157 -0.117805  0.396418 -0.088487
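A function passed as a group key is called once per index label, and its return value becomes the group name. A minimal sketch with made-up data (the frame and the `score` column are illustrative only):

```python
import pandas as pd

# Hypothetical data for illustration.
people_toy = pd.DataFrame(
    {"score": [1, 2, 3, 4]},
    index=["Joe", "Steve", "Wes", "Jim"],
)

# Group by the length of each index label.
by_length = people_toy.groupby(len).sum()

# Any callable works, for example the first letter of each name.
by_initial = people_toy.groupby(lambda name: name[0]).sum()

print(by_length)
print(by_initial)
```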


### Grouping by Index Levels

A final convenience for hierarchically-indexed data sets is the ability to aggregate using one of the levels of an axis index. To do this, pass the level number or name using the `level` keyword.

In [105]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                     names=['cty', 'tenor'])

In [106]:
heir_df = pd.DataFrame(np.random.randn(4, 5),
                       columns=columns)

In [107]:
heir_df

cty          US                            JP
tenor         1         3         5         1         3
0     -0.409122   0.37151  1.302132 -0.349013  0.399253
1     -0.045113 -0.361555 -1.227326  -0.93205 -0.272072
2      0.188757 -0.869613 -1.146343  0.746753 -1.729842
3     -1.572284 -1.431826  1.054139  0.295854 -0.739949


In [108]:
heir_df.groupby(level='cty', axis=1).count()

cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3
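The `level` keyword works the same way on a row index, and accepts either the level name or its position. A minimal sketch, assuming a made-up Series `rates` indexed by the same two levels:

```python
import pandas as pd

# Hypothetical data: a Series with a two-level row index.
index = pd.MultiIndex.from_arrays(
    [["US", "US", "US", "JP", "JP"], [1, 3, 5, 1, 3]],
    names=["cty", "tenor"],
)
rates = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=index)

by_name = rates.groupby(level="cty").sum()   # level given by name
by_number = rates.groupby(level=0).sum()     # same level given by position

print(by_name)
```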


### Data Aggregation

In [109]:
df

      data1     data2 key1 key2
0  0.782657 -0.208217    a  one
1  0.210756 -1.428587    a  two
2 -0.335555   0.34532    b  one
3 -0.770057 -1.433438    b  two
4  0.121001 -0.527497    a  one


In [112]:
df.groupby('key1')['data1'].quantile(0.9) # quantile is a Series method 

key1
a    0.668277
b   -0.379005
Name: data1, dtype: float64

To use your own aggregation functions, pass any function that aggregates an array to the `aggregate` or `agg` method.

In [113]:
def peak_to_peak(arr):
    return arr.max() - arr.min()

In [114]:
df.groupby('key1').agg(peak_to_peak)

         data1     data2
key1
a     0.661656   1.22037
b     0.434503  1.778758


In [115]:
df.groupby('key1')['data1'].agg(peak_to_peak)

key1
a    0.661656
b    0.434503
Name: data1, dtype: float64
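To see that a custom aggregation receives each group's values and returns one number per group, the sketch below compares `peak_to_peak` with the same quantity built from the built-in `max` and `min` reductions. The data are made up so the answer can be verified by hand.

```python
import pandas as pd

# Hypothetical data for illustration.
toy = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 4.0, 2.0, 7.0, 3.0],
})

def peak_to_peak(arr):
    # Range of the values within one group.
    return arr.max() - arr.min()

grouped = toy.groupby("key1")["data1"]
custom = grouped.agg(peak_to_peak)
builtin = grouped.max() - grouped.min()

print(custom)
```

Custom Python functions are generally slower than the optimized built-in reductions, so the built-in form is worth preferring when one exists.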

You'll notice that some methods like `describe` also work, even though they are not aggregations, strictly speaking.

In [116]:
df.groupby('key1').describe()

               data1     data2
key1
a    count       3.0       3.0
     mean   0.371471 -0.721434
     std    0.358914  0.632878
     min    0.121001 -1.428587
     25%    0.165879 -0.978042
     50%    0.210756 -0.527497
     75%    0.496706 -0.367857
     max    0.782657 -0.208217
b    count       2.0       2.0
     mean  -0.552806 -0.544059


In [118]:
# import requests and StringIO (in Python 3, StringIO lives in the io module)
from io import StringIO
import requests

In [119]:
# link to data
data_link = "https://raw.githubusercontent.com/wesm/pydata-book/master/ch08/tips.csv"

In [120]:
string = requests.get(data_link).content

In [121]:
tips = pd.read_csv(StringIO(string.decode('utf-8')))

In [122]:
tips

   total_bill   tip     sex  smoker  day    time  size
0       16.99  1.01  Female      No  Sun  Dinner     2
1       10.34  1.66    Male      No  Sun  Dinner     3
2       21.01  3.50    Male      No  Sun  Dinner     3
3       23.68  3.31    Male      No  Sun  Dinner     2
4       24.59  3.61  Female      No  Sun  Dinner     4
5       25.29  4.71    Male      No  Sun  Dinner     4
6        8.77  2.00    Male      No  Sun  Dinner     2
7       26.88  3.12    Male      No  Sun  Dinner     4
8       15.04  1.96    Male      No  Sun  Dinner     2
9       14.78  3.23    Male      No  Sun  Dinner     2


In [123]:
# create tip percentage
tips['tip_pct'] = tips['tip'] / tips['total_bill']

In [126]:
tips[:5]

   total_bill   tip     sex  smoker  day    time  size   tip_pct
0       16.99  1.01  Female      No  Sun  Dinner     2  0.059447
1       10.34  1.66    Male      No  Sun  Dinner     3  0.160542
2       21.01   3.5    Male      No  Sun  Dinner     3  0.166587
3       23.68  3.31    Male      No  Sun  Dinner     2   0.13978
4       24.59  3.61  Female      No  Sun  Dinner     4  0.146808


### Column-wise and Multiple Function Application
As you've seen above, aggregating a Series or all of the columns of a DataFrame is a matter of using `aggregate` with the desired function or calling a method like `mean` or `std`. However, you may want to aggregate using a different function depending on the column or multiple functions at once. Fortunately, this is straightforward to do.

In [127]:
# first group by sex and smoker
tip_group = tips.groupby(['sex', 'smoker'])

In [128]:
group_pct = tip_group['tip_pct']

In [129]:
group_pct.mean()

sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771
Name: tip_pct, dtype: float64

In [130]:
# or 
group_pct.agg('mean')

sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771
Name: tip_pct, dtype: float64

In [131]:
# now use this method for multiple functions
group_pct.agg(['mean', 'std', peak_to_peak])

                   mean       std  peak_to_peak
sex    smoker
Female No      0.156921  0.036421      0.195876
       Yes      0.18215  0.071595      0.360233
Male   No      0.160669  0.041849      0.220186
       Yes     0.152771  0.090588      0.674707


In [133]:
# replace the names
group_pct.agg([('Foo', 'mean'), ('Bar', 'std'), ('PeakToPeak', peak_to_peak)])

                    Foo       Bar  PeakToPeak
sex    smoker
Female No      0.156921  0.036421    0.195876
       Yes      0.18215  0.071595    0.360233
Male   No      0.160669  0.041849    0.220186
       Yes     0.152771  0.090588    0.674707


With a DataFrame, you have more options as you can specify a list of functions to apply to all of the columns or different functions per column. To start, let's compute three statistics for two columns, `tip_pct` and `total_bill`.

In [134]:
functions = ['count', 'mean', 'max']

In [135]:
result = tip_group[['tip_pct', 'total_bill']].agg(functions)  # select several columns with a list

In [136]:
result

              tip_pct                     total_bill
                count      mean       max      count       mean    max
sex    smoker
Female No          54  0.156921  0.252672         54  18.105185  35.83
       Yes         33   0.18215  0.416667         33  17.977879   44.3
Male   No          97  0.160669   0.29199         97  19.791237  48.33
       Yes         60  0.152771  0.710345         60    22.2845  50.81


As you can see, the resulting DataFrame has hierarchical columns, the same as you would get by aggregating each column separately and using `concat` to glue the results together, with the column names as the `keys` argument.

In [137]:
result['tip_pct']

               count      mean       max
sex    smoker
Female No         54  0.156921  0.252672
       Yes        33   0.18215  0.416667
Male   No         97  0.160669   0.29199
       Yes        60  0.152771  0.710345
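The `concat` equivalence described above can be checked on a small frame. This is a sketch under the assumption of made-up data (`toy`); it aggregates each column separately, glues the pieces with `keys`, and compares the result with the one-step multi-function aggregation.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
toy = pd.DataFrame({
    "sex": ["F", "F", "M", "M"],
    "tip_pct": [0.10, 0.20, 0.15, 0.25],
    "total_bill": [10.0, 20.0, 30.0, 50.0],
})
grouped = toy.groupby("sex")
functions = ["count", "mean", "max"]
columns = ["tip_pct", "total_bill"]

# One step: several functions applied to several columns.
combined = grouped[columns].agg(functions)

# Two steps: aggregate each column alone, then concatenate side by side,
# using the column names as the outer level of the column index.
pieces = [grouped[col].agg(functions) for col in columns]
glued = pd.concat(pieces, axis=1, keys=columns)

print(combined)
```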


As above, a list of tuples with custom names can be passed.

In [138]:
ftuples = [('Durchschnitt', 'mean'), ('Abweichung', np.var)]

In [139]:
tip_group[['tip_pct', 'total_bill']].agg(ftuples)  # select several columns with a list

                   tip_pct               total_bill
              Durchschnitt Abweichung Durchschnitt Abweichung
sex    smoker
Female No         0.156921   0.001327    18.105185  53.092422
       Yes         0.18215   0.005126    17.977879  84.451517
Male   No         0.160669   0.001751    19.791237  76.152961
       Yes        0.152771   0.008206      22.2845  98.244673


Now, suppose you wanted to apply potentially different functions to one or more of the columns. The trick is to pass a dict to agg that contains a mapping of column names to any of the function specifications listed so far.

In [140]:
tip_group.agg({'tip': np.max, 'size': 'sum'})

                tip  size
sex    smoker
Female No       5.2   140
       Yes      6.5    74
Male   No       9.0   263
       Yes     10.0   150


In [141]:
tip_group.agg({'tip_pct': ['min', 'max', 'mean', 'std'],
               'size': 'sum'})

                tip_pct                                size
                    min       max      mean       std   sum
sex    smoker
Female No      0.056797  0.252672  0.156921  0.036421   140
       Yes     0.056433  0.416667   0.18215  0.071595    74
Male   No      0.071804   0.29199  0.160669  0.041849   263
       Yes     0.035638  0.710345  0.152771  0.090588   150
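The shape of the result depends on the dict values: with one function per column the columns stay flat, while a list of functions for any column produces hierarchical columns. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical data for illustration.
toy = pd.DataFrame({
    "smoker": ["No", "No", "Yes", "Yes"],
    "tip": [1.0, 3.0, 2.0, 5.0],
    "size": [2, 4, 1, 3],
})
grouped = toy.groupby("smoker")

# One function per column: flat columns.
per_column = grouped.agg({"tip": "max", "size": "sum"})

# A list for at least one column: two-level columns.
mixed = grouped.agg({"tip": ["min", "max"], "size": "sum"})

print(per_column)
print(mixed)
```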


### Returning Aggregated Data in "unindexed" Form

In [142]:
# with index
tips.groupby(['sex', 'smoker']).mean()

               total_bill       tip      size   tip_pct
sex    smoker
Female No       18.105185  2.773519  2.592593  0.156921
       Yes      17.977879  2.931515  2.242424   0.18215
Male   No       19.791237  3.113402   2.71134  0.160669
       Yes        22.2845  3.051167       2.5  0.152771


In [143]:
# without index
tips.groupby(['sex', 'smoker'], as_index=False).mean()

      sex  smoker  total_bill       tip      size   tip_pct
0  Female      No   18.105185  2.773519  2.592593  0.156921
1  Female     Yes   17.977879  2.931515  2.242424   0.18215
2    Male      No   19.791237  3.113402   2.71134  0.160669
3    Male     Yes     22.2845  3.051167       2.5  0.152771
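The same flat layout can be reached by calling `reset_index` on the indexed result; `as_index=False` just skips building the index in the first place. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical data for illustration.
toy = pd.DataFrame({
    "sex": ["F", "F", "F", "M", "M", "M"],
    "smoker": ["No", "No", "Yes", "No", "Yes", "Yes"],
    "tip": [1.0, 3.0, 2.0, 4.0, 6.0, 8.0],
})

# Group keys returned as ordinary columns.
flat = toy.groupby(["sex", "smoker"], as_index=False).mean()

# Equivalent: build the hierarchical index, then move it back into columns.
via_reset = toy.groupby(["sex", "smoker"]).mean().reset_index()

print(flat)
```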


### Group-wise Operations and Transformations

One way to add group-level means as new columns is to aggregate, then merge the result back onto the original data.

In [144]:
df

      data1     data2 key1 key2
0  0.782657 -0.208217    a  one
1  0.210756 -1.428587    a  two
2 -0.335555   0.34532    b  one
3 -0.770057 -1.433438    b  two
4  0.121001 -0.527497    a  one


In [145]:
k1_means = df.groupby('key1').mean().add_prefix('mean_')

In [146]:
k1_means

      mean_data1  mean_data2
key1
a       0.371471   -0.721434
b      -0.552806   -0.544059


In [147]:
pd.merge(df, k1_means, left_on='key1', right_index=True)

      data1     data2 key1 key2  mean_data1  mean_data2
0  0.782657 -0.208217    a  one    0.371471   -0.721434
1  0.210756 -1.428587    a  two    0.371471   -0.721434
4  0.121001 -0.527497    a  one    0.371471   -0.721434
2 -0.335555   0.34532    b  one   -0.552806   -0.544059
3 -0.770057 -1.433438    b  two   -0.552806   -0.544059


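Another method is to use `transform`. It produces the same values, but the rows keep their original order instead of being regrouped by key as in the merge result.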

In [160]:
trans = df.groupby('key1').transform(np.mean).add_prefix('mean_')

In [162]:
pd.concat([df, trans], axis=1)

      data1     data2 key1 key2  mean_data1  mean_data2
0  0.782657 -0.208217    a  one    0.371471   -0.721434
1  0.210756 -1.428587    a  two    0.371471   -0.721434
2 -0.335555   0.34532    b  one   -0.552806   -0.544059
3 -0.770057 -1.433438    b  two   -0.552806   -0.544059
4  0.121001 -0.527497    a  one    0.371471   -0.721434
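Because `transform` returns a result aligned to the original index, it combines directly with the source column. A common use is demeaning within groups; the sketch below uses made-up data so the numbers are easy to check.

```python
import pandas as pd

# Hypothetical data for illustration.
toy = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 5.0, 6.0],
})

# One value per original row: the mean of that row's group.
group_mean = toy.groupby("key1")["data1"].transform("mean")

# Subtracting works row by row because the indexes match.
demeaned = toy["data1"] - group_mean

print(group_mean)
print(demeaned)
```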


### Apply: General split-apply-combine

In [168]:
# define function that selects the rows with the largest values in a particular column
def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]

In [170]:
top(tips, n=6)

     total_bill   tip     sex  smoker  day    time  size   tip_pct
109       14.31   4.0  Female     Yes  Sat  Dinner     2  0.279525
183       23.17   6.5    Male     Yes  Sun  Dinner     4  0.280535
232       11.61  3.39    Male      No  Sat  Dinner     2   0.29199
67         3.07   1.0  Female     Yes  Sat  Dinner     1  0.325733
178         9.6   4.0  Female     Yes  Sun  Dinner     2  0.416667
172        7.25  5.15    Male     Yes  Sun  Dinner     2  0.710345


In [171]:
tips.groupby('smoker').apply(top)

            total_bill   tip     sex  smoker   day    time  size   tip_pct
smoker
No     88        24.71  5.85    Male      No  Thur   Lunch     2  0.236746
       185       20.69   5.0    Male      No   Sun  Dinner     5  0.241663
       51        10.29   2.6  Female      No   Sun  Dinner     2  0.252672
       149        7.51   2.0    Male      No  Thur   Lunch     2  0.266312
       232       11.61  3.39    Male      No   Sat  Dinner     2   0.29199
Yes    109       14.31   4.0  Female     Yes   Sat  Dinner     2  0.279525
       183       23.17   6.5    Male     Yes   Sun  Dinner     4  0.280535
       67         3.07   1.0  Female     Yes   Sat  Dinner     1  0.325733
       178         9.6   4.0  Female     Yes   Sun  Dinner     2  0.416667
       172        7.25  5.15    Male     Yes   Sun  Dinner     2  0.710345


In [172]:
tips.groupby(['smoker', 'day']).apply(top, n=2, column='total_bill')

                 total_bill   tip     sex  smoker   day    time  size   tip_pct
smoker day
No     Fri  91        22.49   3.5    Male      No   Fri  Dinner     2  0.155625
            94        22.75  3.25  Female      No   Fri  Dinner     2  0.142857
       Sat  59        48.27  6.73    Male      No   Sat  Dinner     4  0.139424
            212       48.33   9.0    Male      No   Sat  Dinner     4   0.18622
       Sun  112       38.07   4.0    Male      No   Sun  Dinner     3   0.10507
            156       48.17   5.0    Male      No   Sun  Dinner     6  0.103799
       Thur 85        34.83  5.17  Female      No  Thur   Lunch     4  0.148435
            142       41.19   5.0    Male      No  Thur   Lunch     5  0.121389
Yes    Fri  90        28.97   3.0    Male     Yes   Fri  Dinner     2  0.103555
            95        40.17  4.73    Male     Yes   Fri  Dinner     4   0.11775


As another example, call `describe` on a grouped column:

In [174]:
result = tips.groupby('smoker')['tip_pct'].describe()

In [175]:
result

smoker       
No      count    151.000000
        mean       0.159328
        std        0.039910
        min        0.056797
        25%        0.136906
        50%        0.155625
        75%        0.185014
        max        0.291990
Yes     count     93.000000
        mean       0.163196
        std        0.085119
        min        0.035638
        25%        0.106771
        50%        0.153846
        75%        0.195059
        max        0.710345
dtype: float64

In [176]:
result.unstack()

        count      mean       std       min       25%       50%       75%       max
smoker
No        151  0.159328   0.03991  0.056797  0.136906  0.155625  0.185014   0.29199
Yes        93  0.163196  0.085119  0.035638  0.106771  0.153846  0.195059  0.710345


In [177]:
result.unstack('smoker')

smoker        No       Yes
count      151.0      93.0
mean    0.159328  0.163196
std      0.03991  0.085119
min     0.056797  0.035638
25%     0.136906  0.106771
50%     0.155625  0.153846
75%     0.185014  0.195059
max      0.29199  0.710345


By default the group keys become the outer level of the result's index; passing `group_keys=False` to `groupby` suppresses them:

In [178]:
# with group keys
tips.groupby('smoker').apply(top)

            total_bill   tip     sex  smoker   day    time  size   tip_pct
smoker
No     88        24.71  5.85    Male      No  Thur   Lunch     2  0.236746
       185       20.69   5.0    Male      No   Sun  Dinner     5  0.241663
       51        10.29   2.6  Female      No   Sun  Dinner     2  0.252672
       149        7.51   2.0    Male      No  Thur   Lunch     2  0.266312
       232       11.61  3.39    Male      No   Sat  Dinner     2   0.29199
Yes    109       14.31   4.0  Female     Yes   Sat  Dinner     2  0.279525
       183       23.17   6.5    Male     Yes   Sun  Dinner     4  0.280535
       67         3.07   1.0  Female     Yes   Sat  Dinner     1  0.325733
       178         9.6   4.0  Female     Yes   Sun  Dinner     2  0.416667
       172        7.25  5.15    Male     Yes   Sun  Dinner     2  0.710345


In [179]:
# without group keys
tips.groupby('smoker', group_keys=False).apply(top)

     total_bill   tip     sex  smoker   day    time  size   tip_pct
88        24.71  5.85    Male      No  Thur   Lunch     2  0.236746
185       20.69   5.0    Male      No   Sun  Dinner     5  0.241663
51        10.29   2.6  Female      No   Sun  Dinner     2  0.252672
149        7.51   2.0    Male      No  Thur   Lunch     2  0.266312
232       11.61  3.39    Male      No   Sat  Dinner     2   0.29199
109       14.31   4.0  Female     Yes   Sat  Dinner     2  0.279525
183       23.17   6.5    Male     Yes   Sun  Dinner     4  0.280535
67         3.07   1.0  Female     Yes   Sat  Dinner     1  0.325733
178         9.6   4.0  Female     Yes   Sun  Dinner     2  0.416667
172        7.25  5.15    Male     Yes   Sun  Dinner     2  0.710345
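For the specific task of picking the top rows per group, there is an alternative that avoids `apply` altogether: sort once, then take the last `n` rows of each group with `GroupBy.tail`. This is a different technique from the `top` function used above, shown here as a sketch on made-up data; like `group_keys=False`, it keeps the original row labels.

```python
import pandas as pd

# Hypothetical data for illustration.
toy = pd.DataFrame({
    "smoker": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "tip_pct": [0.10, 0.30, 0.20, 0.40, 0.05, 0.25],
})

# Sort ascending, then keep the last two rows within each group,
# i.e. the two largest tip_pct values per smoker category.
top2 = toy.sort_values("tip_pct").groupby("smoker").tail(2)

print(top2)
```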


In [186]:
# with group keys: stack the columns into the innermost index level
tips.groupby('smoker').apply(top).stack()

smoker                 
No      88   total_bill       24.71
             tip               5.85
             sex               Male
             smoker              No
             day               Thur
             time             Lunch
             size                 2
             tip_pct       0.236746
        185  total_bill       20.69
             tip                  5
             sex               Male
             smoker              No
             day                Sun
             time            Dinner
             size                 5
             tip_pct       0.241663
        51   total_bill       10.29
             tip                2.6
             sex             Female
             smoker              No
             day                Sun
             time            Dinner
             size                 2
             tip_pct       0.252672
        149  total_bill        7.51
             tip                  2
             sex               Male
    

In [187]:
# with group keys: stack, then unstack to recover the table (rows come back sorted by index label)
tips.groupby('smoker').apply(top).stack().unstack()

            total_bill   tip     sex  smoker   day    time  size   tip_pct
smoker
No     51        10.29   2.6  Female      No   Sun  Dinner     2  0.252672
       88        24.71  5.85    Male      No  Thur   Lunch     2  0.236746
       149        7.51   2.0    Male      No  Thur   Lunch     2  0.266312
       185       20.69   5.0    Male      No   Sun  Dinner     5  0.241663
       232       11.61  3.39    Male      No   Sat  Dinner     2   0.29199
Yes    67         3.07   1.0  Female     Yes   Sat  Dinner     1  0.325733
       109       14.31   4.0  Female     Yes   Sat  Dinner     2  0.279525
       172        7.25  5.15    Male     Yes   Sun  Dinner     2  0.710345
       178         9.6   4.0  Female     Yes   Sun  Dinner     2  0.416667
       183       23.17   6.5    Male     Yes   Sun  Dinner     4  0.280535


### Quantile and Bucket Analysis

In [188]:
frame = pd.DataFrame({'data1': np.random.randn(1000),
                      'data2': np.random.randn(1000)})

In [190]:
frame[:5]

      data1     data2
0 -0.098911 -1.086666
1  0.459876  0.904293
2  1.433638 -0.131928
3   0.17662 -0.453357
4 -1.016423 -0.310098


In [191]:
factor = pd.cut(frame.data1, 4)

In [193]:
factor[:5]

0    (-1.379, 0.37]
1     (0.37, 2.119]
2     (0.37, 2.119]
3    (-1.379, 0.37]
4    (-1.379, 0.37]
Name: data1, dtype: category
Categories (4, object): [(-3.135, -1.379] < (-1.379, 0.37] < (0.37, 2.119] < (2.119, 3.868]]

In [194]:
def get_stats(group):
    return {'min': group.min(),
            'max': group.max(),
            'count': group.count(),
            'mean': group.mean()}

In [195]:
grouped = frame.data2.groupby(factor)

In [196]:
grouped.mean()

data1
(-3.135, -1.379]    0.131713
(-1.379, 0.37]     -0.017407
(0.37, 2.119]       0.023584
(2.119, 3.868]     -0.343248
Name: data2, dtype: float64

In [197]:
frame.data1.groupby(factor).mean()

data1
(-3.135, -1.379]   -1.801166
(-1.379, 0.37]     -0.402291
(0.37, 2.119]       0.944516
(2.119, 3.868]      2.457562
Name: data1, dtype: float64

In [198]:
grouped.apply(get_stats)

data1                  
(-3.135, -1.379]  count     71.000000
                  max        2.989046
                  mean       0.131713
                  min       -2.802857
(-1.379, 0.37]    count    558.000000
                  max        2.837292
                  mean      -0.017407
                  min       -3.120593
(0.37, 2.119]     count    345.000000
                  max        2.828254
                  mean       0.023584
                  min       -2.340711
(2.119, 3.868]    count     26.000000
                  max        2.044392
                  mean      -0.343248
                  min       -1.499685
dtype: float64

In [199]:
grouped.apply(get_stats).unstack()

                  count       max      mean       min
data1
(-3.135, -1.379]     71  2.989046  0.131713 -2.802857
(-1.379, 0.37]      558  2.837292 -0.017407 -3.120593
(0.37, 2.119]       345  2.828254  0.023584 -2.340711
(2.119, 3.868]       26  2.044392 -0.343248 -1.499685


In [200]:
# for quantiles
grouping = pd.qcut(frame.data1, 10, labels=False)

In [201]:
grouped = frame.data2.groupby(grouping)

In [202]:
grouped.apply(get_stats).unstack()

       count       max      mean       min
data1
0        100  2.989046  0.004926 -2.802857
1        100  2.446142 -0.030596 -3.120593
2        100  2.625605   0.01299 -2.006462
3        100  2.837292   0.14386 -2.693699
4        100  2.321671 -0.023444 -2.564311
5        100  2.593309 -0.081502 -2.629838
6        100  2.724077 -0.034147 -2.204518
7        100  2.828254  0.129157 -2.060767
8        100  2.265786  0.002047  -2.31916
9        100  2.044392 -0.134784 -2.340711
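The contrast between the two binning functions shows up clearly on skewed data: `cut` makes bins of equal width (so group sizes vary), while `qcut` makes bins of roughly equal size. A minimal sketch, assuming a made-up deterministic Series of squares rather than the random data above:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed data: 0, 1, 4, 9, ..., 9801.
values = pd.Series(np.arange(100, dtype=float) ** 2)

# Equal-width bins over the value range.
equal_width = pd.cut(values, 4)

# Equal-size bins based on sample quartiles; labels=False gives bin numbers.
equal_count = pd.qcut(values, 4, labels=False)

# observed=False keeps every interval in the result, even an empty one.
width_sizes = values.groupby(equal_width, observed=False).size()
count_sizes = values.groupby(equal_count).size()

print(width_sizes)
print(count_sizes)
```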


### Example: Filling Missing Values with Group-specific Values

In [203]:
states = ['Ohio', 'New York', 'Vermont', 'Florida', 'Oregon', 'Nevada', 'California', 'Idaho']

In [204]:
group_key = ['East'] * 4 + ['West'] * 4

In [205]:
data = pd.Series(np.random.randn(8), index=states)

In [206]:
data[['Vermont', 'Nevada', 'Idaho']] = np.nan

In [207]:
data

Ohio          0.465702
New York      0.417743
Vermont            NaN
Florida      -0.712033
Oregon       -0.885029
Nevada             NaN
California    1.811195
Idaho              NaN
dtype: float64

In [208]:
data.groupby(group_key).mean()

East    0.057137
West    0.463083
dtype: float64

In [209]:
fill_mean = lambda g: g.fillna(g.mean())

In [210]:
data.groupby(group_key).apply(fill_mean)

Ohio          0.465702
New York      0.417743
Vermont       0.057137
Florida      -0.712033
Oregon       -0.885029
Nevada        0.463083
California    1.811195
Idaho         0.463083
dtype: float64