# Data Aggregation and Group Operations

* Split a pandas object into pieces using one or more keys
* Compute group summary statistics
* Apply functions to each column of a DataFrame
* Apply within-group transformations or other manipulations
* Compute pivot tables and cross-tabulations
* Perform quantile analysis and other group analyses
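
As a rough sketch of what "split, apply, combine" means, the three steps can be written out by hand and compared with `groupby`. The frame and its values below are made up for illustration and are not the `df` used later in these notes.

```python
import pandas as pd

# Hypothetical data, chosen so the group means are easy to check by eye.
df = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'a'],
                   'value': [1.0, 2.0, 3.0, 6.0, 5.0]})

# Split: collect the values that belong to each key.
pieces = {key: df.loc[df['key'] == key, 'value'] for key in df['key'].unique()}

# Apply: reduce each piece to a single number.
means = {key: piece.mean() for key, piece in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(means).sort_index()

# The same computation expressed with groupby.
grouped = df.groupby('key')['value'].mean()

print(manual)
print(grouped)
```

Both results should hold the same two group means; `groupby` simply does the bookkeeping for you.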

## GroupBy Mechanics

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

In [4]:
grouped = df['data1'].groupby(df['key1'])

In [5]:
grouped

<pandas.core.groupby.SeriesGroupBy object at 0x7f2478507d50>

In [6]:
grouped.mean()

key1
a    0.371471
b   -0.552806
Name: data1, dtype: float64

In [10]:
df['data1'].groupby([df['key1'], df['key2']]).mean()

key1  key2
a     one     0.451829
      two     0.210756
b     one    -0.335555
      two    -0.770057
Name: data1, dtype: float64

In [11]:
df['data1'].groupby([df['key1'], df['key2']]).sum()

key1  key2
a     one     0.903658
      two     0.210756
b     one    -0.335555
      two    -0.770057
Name: data1, dtype: float64

In [12]:
df['data1'].groupby([df['key1'], df['key2']]).count()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

In [13]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()

In [14]:
means

key1  key2
a     one     0.451829
      two     0.210756
b     one    -0.335555
      two    -0.770057
Name: data1, dtype: float64

In [15]:
means.unstack()

key2       one       two
key1
a     0.451829  0.210756
b    -0.335555 -0.770057


In the examples above, the group keys are all Series, though they could be any arrays of the right length:

In [16]:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])

In [17]:
states

array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'], 
      dtype='|S10')

In [18]:
years = np.array([2005, 2005, 2006, 2005, 2006])

In [19]:
df['data1'].groupby([states, years]).mean()

California  2005    0.210756
            2006   -0.335555
Ohio        2005    0.006300
            2006    0.121001
Name: data1, dtype: float64

Frequently the grouping information you're looking for is located in the same DataFrame as the data you're looking to summarize. In that case, you can pass column names as the group keys.

In [22]:
df.groupby('key1').mean() # must be the dataframe method to pass the key like this

         data1     data2
key1
a     0.371471 -0.721434
b    -0.552806 -0.544059


In [25]:
df.groupby('key1').mean().stack().unstack(0) # transpose the dataframe

key1          a         b
data1  0.371471 -0.552806
data2 -0.721434 -0.544059


In [26]:
df.groupby(['key1', 'key2']).mean()

              data1     data2
key1 key2
a    one   0.451829 -0.367857
     two   0.210756 -1.428587
b    one  -0.335555  0.345320
     two  -0.770057 -1.433438


In [30]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
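
The `size` result above matches the earlier `count` result only because `df` has no missing values. `size` counts rows per group, while `count` counts non-null values. A small sketch of the difference, assuming one value of `data1` is replaced by NaN (that NaN is my own addition, not part of the original data):

```python
import numpy as np
import pandas as pd

# Same keys as df above; the NaN in data1 is a hypothetical modification.
demo = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                     'data1': [0.782657, np.nan, -0.335555, -0.770057, 0.121001]})

sizes = demo.groupby('key1').size()             # rows per group, NaN included
counts = demo.groupby('key1')['data1'].count()  # non-null values per group

print(sizes)
print(counts)
```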

### Iterating Over Groups
The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data. Consider the following example:

In [36]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)
    print('\n')

a
      data1     data2 key1 key2
0  0.782657 -0.208217    a  one
1  0.210756 -1.428587    a  two
4  0.121001 -0.527497    a  one


b
      data1     data2 key1 key2
2 -0.335555  0.345320    b  one
3 -0.770057 -1.433438    b  two




In [40]:
for key, group in df.groupby(['key1', 'key2']):
    print("First key is '%s' and second key is '%s'" % (key[0], key[1]))
    print(group)
    print('\n')

First key is 'a' and second key is 'one'
      data1     data2 key1 key2
0  0.782657 -0.208217    a  one
4  0.121001 -0.527497    a  one


First key is 'a' and second key is 'two'
      data1     data2 key1 key2
1  0.210756 -1.428587    a  two


First key is 'b' and second key is 'one'
      data1    data2 key1 key2
2 -0.335555  0.34532    b  one


First key is 'b' and second key is 'two'
      data1     data2 key1 key2
3 -0.770057 -1.433438    b  two




In [41]:
pieces = dict(list(df.groupby('key1')))

In [42]:
pieces['b']

      data1     data2 key1 key2
2 -0.335555  0.345320    b  one
3 -0.770057 -1.433438    b  two


In [44]:
pieces['a']

      data1     data2 key1 key2
0  0.782657 -0.208217    a  one
1  0.210756 -1.428587    a  two
4  0.121001 -0.527497    a  one


In [47]:
dict(list(df.groupby('key1')))

{'a':       data1     data2 key1 key2
 0  0.782657 -0.208217    a  one
 1  0.210756 -1.428587    a  two
 4  0.121001 -0.527497    a  one, 'b':       data1     data2 key1 key2
 2 -0.335555  0.345320    b  one
 3 -0.770057 -1.433438    b  two}
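
If only one piece is needed, building the whole dict is not necessary. A minimal sketch using the `get_group` method of the GroupBy object, with the `key1` and `data1` values copied from `df` above:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [0.782657, 0.210756, -0.335555, -0.770057, 0.121001]})

# Pull out a single group by its key instead of materializing every group.
piece_b = df.groupby('key1').get_group('b')
print(piece_b)
```

This should select the same rows as `pieces['b']`.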

By default `groupby` groups on `axis=0`, but you can group on any of the other axes. For example, we could group the columns of our example `df` here by `dtype` like so:

In [48]:
df.dtypes

data1    float64
data2    float64
key1      object
key2      object
dtype: object

In [49]:
grouped = df.groupby(df.dtypes, axis=1)

In [52]:
grouped = dict(list(grouped))

In [54]:
grouped.keys()

[dtype('O'), dtype('float64')]

In [63]:
for key, value in grouped.items():
    print("Key is '%s'" % key)
    print(value)
    print('\n')

Key is 'object'
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one


Key is 'float64'
      data1     data2
0  0.782657 -0.208217
1  0.210756 -1.428587
2 -0.335555  0.345320
3 -0.770057 -1.433438
4  0.121001 -0.527497
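
Newer pandas releases have deprecated the `axis=1` argument to `groupby`, so the cell above may warn or fail depending on the installed version. One possible way to get the same "columns by dtype" split without it is to build the groups from `df.dtypes` directly. This is a sketch of an alternative, not the method used above, and the small frame here is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two float columns and one integer column.
demo = pd.DataFrame({'x': [1.5, 2.5],
                     'y': [0.1, 0.2],
                     'n': np.array([1, 2], dtype='int64')})

# Collect column names under the string name of their dtype.
cols_by_dtype = {}
for col, dt in demo.dtypes.items():
    cols_by_dtype.setdefault(str(dt), []).append(col)

# Select each set of columns to get one sub-frame per dtype.
frames = {name: demo[cols] for name, cols in cols_by_dtype.items()}

for name, frame in frames.items():
    print("Key is '%s'" % name)
    print(frame)
```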




### Selecting a Column or Subset of Columns
Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of selecting those columns for aggregation.

In [71]:
df.groupby('key1')['data1'].mean() # syntactic sugar for df['data1'].groupby(df['key1']).mean()

key1
a    0.371471
b   -0.552806
Name: data1, dtype: float64

In [73]:
df.groupby('key1')[['data1']].mean() # syntactic sugar for df[['data1']].groupby(df['key1']).mean()

         data1
key1
a     0.371471
b    -0.552806


In [74]:
df.groupby(['key1', 'key2'])[['data2']].mean()

              data2
key1 key2
a    one  -0.367857
     two  -1.428587
b    one   0.345320
     two  -1.433438


The object returned by this indexing operation is a grouped DataFrame if a list or array is passed and a grouped Series if just a single column name is passed as a scalar.
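
A quick way to see the difference is to look at the type of the aggregated result. The sketch below reuses the `key1` and `data1` values from `df`:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [0.782657, 0.210756, -0.335555, -0.770057, 0.121001]})

as_series = df.groupby('key1')['data1'].mean()   # scalar column name
as_frame = df.groupby('key1')[['data1']].mean()  # list of column names

print(type(as_series).__name__)
print(type(as_frame).__name__)
```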

### Grouping with Dicts and Series
Grouping information may exist in a form other than an array. Let's consider another example DataFrame.

In [75]:
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

In [76]:
people

               a         b         c         d         e
Joe     0.094362  0.642776  0.385145 -0.609080 -0.529635
Steve   0.435791  1.192925 -0.523090  1.344837  0.429187
Wes     0.101914  0.584637  0.477111 -0.680072  1.646028
Jim    -1.257609  0.053520  0.116609  1.959327  0.294609
Travis  0.392042  0.610157 -0.117805  0.396418 -0.088487


In [83]:
people.mean(axis=1) # row means

Joe      -0.003286
Steve     0.575930
Wes       0.425924
Jim       0.233291
Travis    0.238465
dtype: float64

In [91]:
# add a few NaN values (row 2, columns 'b' and 'c')
people.iloc[2:3, [1, 2]] = np.nan

In [92]:
people

               a         b         c         d         e
Joe     0.094362  0.642776  0.385145 -0.609080 -0.529635
Steve   0.435791  1.192925 -0.523090  1.344837  0.429187
Wes     0.101914       NaN       NaN -0.680072  1.646028
Jim    -1.257609  0.053520  0.116609  1.959327  0.294609
Travis  0.392042  0.610157 -0.117805  0.396418 -0.088487


Now, suppose I have a group correspondence for the columns and want to aggregate the columns by group (here, by taking the mean). Note that the mapping contains an unused key, `'f'`, which is fine:

In [94]:
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}

In [95]:
by_column = people.groupby(mapping, axis=1)

In [96]:
by_column.mean()

            blue       red
Joe    -0.111968  0.069168
Steve   0.410874  0.685968
Wes    -0.680072  0.873971
Jim     1.037968 -0.303160
Travis  0.139306  0.304571
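
Since newer pandas releases have deprecated `axis=1` in `groupby`, a possible alternative is to transpose, group the (now row) labels with the dict, and transpose back. The sketch below assumes an all-numeric frame and uses only the `Joe` and `Wes` rows from `people` to keep it short:

```python
import numpy as np
import pandas as pd

people_small = pd.DataFrame(
    [[0.094362, 0.642776, 0.385145, -0.609080, -0.529635],
     [0.101914, np.nan, np.nan, -0.680072, 1.646028]],
    columns=['a', 'b', 'c', 'd', 'e'],
    index=['Joe', 'Wes'])

mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

# Transpose so the column labels become the index, group by the mapping,
# aggregate, then transpose back to the original orientation.
by_color = people_small.T.groupby(mapping).mean().T
print(by_color)
```

The unused `'f'` key is ignored, and NaN values are skipped by `mean`, so the numbers should agree with the `Joe` and `Wes` rows of the table above.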


The same functionality holds for a Series, which can be viewed as a fixed-size mapping:

In [97]:
map_series = pd.Series(mapping)

In [98]:
map_series

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [99]:
people.groupby(map_series, axis=1).mean()

            blue       red
Joe    -0.111968  0.069168
Steve   0.410874  0.685968
Wes    -0.680072  0.873971
Jim     1.037968 -0.303160
Travis  0.139306  0.304571


In [100]:
people.groupby(map_series, axis=1).count()

        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3


### Grouping with Functions

In [101]:
people.groupby(len).sum() # len of index values (in this case, the names)

          a         b         c         d         e
3 -1.061332  0.696296  0.501754  0.670175  1.411002
5  0.435791  1.192925 -0.523090  1.344837  0.429187
6  0.392042  0.610157 -0.117805  0.396418 -0.088487


In [102]:
# group by len and an array
key_list = ['one', 'one', 'one', 'two', 'two']

In [103]:
people.groupby([len, key_list]).sum()

              a         b         c         d         e
3 one  0.196277  0.642776  0.385145 -1.289151  1.116393
  two -1.257609  0.053520  0.116609  1.959327  0.294609
5 one  0.435791  1.192925 -0.523090  1.344837  0.429187
6 two  0.392042  0.610157 -0.117805  0.396418 -0.088487
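
A function passed as a group key is called once per index value, and its return values are used as the group names. To make that concrete, this sketch compares `groupby(len)` with grouping on the precomputed name lengths. The column values are made up for illustration:

```python
import pandas as pd

# Same index as people; the single column of values is hypothetical.
demo = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 5.0]},
                    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

by_func = demo.groupby(len).sum()                   # len applied to each name
by_array = demo.groupby(demo.index.map(len)).sum()  # lengths computed up front

print(by_func)
print(by_array)
```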


### Grouping by Index Levels

A final convenience for hierarchically-indexed data sets is the ability to aggregate using one of the levels of an axis index. To do this, pass the level number or name using the `level` keyword.

In [105]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                     names=['cty', 'tenor'])

In [106]:
heir_df = pd.DataFrame(np.random.randn(4, 5),
                       columns=columns)

In [107]:
heir_df

cty          US                            JP
tenor         1         3         5         1         3
0     -0.409122  0.371510  1.302132 -0.349013  0.399253
1     -0.045113 -0.361555 -1.227326 -0.932050 -0.272072
2      0.188757 -0.869613 -1.146343  0.746753 -1.729842
3     -1.572284 -1.431826  1.054139  0.295854 -0.739949


In [108]:
heir_df.groupby(level='cty', axis=1).count()

cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3
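
The `level` keyword works the same way on a hierarchical row index, and accepts either the level name or its number. A minimal sketch, assuming a Series indexed by the same `cty` and `tenor` labels with made-up values:

```python
import pandas as pd

index = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                   [1, 3, 5, 1, 3]],
                                  names=['cty', 'tenor'])

# Hypothetical values, chosen so the sums are easy to verify.
s = pd.Series([1.0, 2.0, 3.0, 10.0, 20.0], index=index)

by_name = s.groupby(level='cty').sum()  # refer to the level by name
by_number = s.groupby(level=0).sum()    # or by position

print(by_name)
print(by_number)
```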


## Data Aggregation

In [109]:
df

      data1     data2 key1 key2
0  0.782657 -0.208217    a  one
1  0.210756 -1.428587    a  two
2 -0.335555  0.345320    b  one
3 -0.770057 -1.433438    b  two
4  0.121001 -0.527497    a  one


In [112]:
df.groupby('key1')['data1'].quantile(0.9) # quantile is a Series method 

key1
a    0.668277
b   -0.379005
Name: data1, dtype: float64
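
To check my understanding of where these numbers come from, the sketch below recomputes the 0.9 quantile per group by applying the Series method to each group through `agg`. It assumes the default linear interpolation and uses the `data1` values printed above:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [0.782657, 0.210756, -0.335555, -0.770057, 0.121001]})

direct = df.groupby('key1')['data1'].quantile(0.9)

# Same result by calling Series.quantile on each group explicitly.
via_agg = df.groupby('key1')['data1'].agg(lambda s: s.quantile(0.9))

# By hand for group 'a': sorted values are 0.121001, 0.210756, 0.782657,
# and position 0.9 * (3 - 1) = 1.8 falls 80% of the way from the 2nd to the 3rd.
by_hand_a = 0.210756 + 0.8 * (0.782657 - 0.210756)

print(direct)
print(via_agg)
print(by_hand_a)
```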

To use your own aggregation functions, pass any function that aggregates an array to the `agg` method.

In [113]:
def peak_to_peak(arr):
    return arr.max() - arr.min()

In [114]:
df.groupby('key1').agg(peak_to_peak)

         data1     data2
key1
a     0.661656  1.220370
b     0.434503  1.778758


In [115]:
df.groupby('key1')['data1'].agg(peak_to_peak)

key1
a    0.661656
b    0.434503
Name: data1, dtype: float64
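
`agg` also accepts a list, so a custom function can sit next to built-in aggregations named by string. A small sketch using the `data1` values from `df`; as far as I can tell, the result column for a custom function takes the function's name:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [0.782657, 0.210756, -0.335555, -0.770057, 0.121001]})

def peak_to_peak(arr):
    # Range of the values in one group.
    return arr.max() - arr.min()

# One output column per entry in the list.
summary = df.groupby('key1')['data1'].agg(['min', 'max', peak_to_peak])
print(summary)
```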

You'll notice that some methods like `describe` also work, even though they are not aggregations, strictly speaking.

In [116]:
df.groupby('key1').describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,count,3.0,3.0
a,mean,0.371471,-0.721434
a,std,0.358914,0.632878
a,min,0.121001,-1.428587
a,25%,0.165879,-0.978042
a,50%,0.210756,-0.527497
a,75%,0.496706,-0.367857
a,max,0.782657,-0.208217
b,count,2.0,2.0
b,mean,-0.552806,-0.544059
