# Data Aggregation and Group Operations

* Split a pandas object into pieces using one or more keys
* Compute group summary statistics
* Apply functions to each column of a DataFrame
* Apply within-group transformations or other manipulations
* Compute pivot tables and cross-tabulations
* Perform quantile analysis and other group analyses
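
The first two items in the list can be sketched by hand to show what `groupby` does in one call. This is a minimal illustration on a hypothetical toy frame (the column names `key` and `value` are not from the notes below), not a description of how pandas implements it internally.

```python
import pandas as pd

# Hypothetical toy data, chosen so the group means are easy to check by hand
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'a'],
                    'value': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Split: one sub-Series of values per distinct key
pieces = {k: toy.loc[toy['key'] == k, 'value'] for k in toy['key'].unique()}

# Apply: reduce each piece to a single number
applied = {k: piece.mean() for k, piece in pieces.items()}

# Combine: reassemble the per-group results into one Series
manual = pd.Series(applied).sort_index()

# groupby performs the same split, apply, and combine steps in one expression
auto = toy.groupby('key')['value'].mean()
```

Both `manual` and `auto` hold one mean per key; the `groupby` version also labels the result index with the key name.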

## GroupBy Mechanics

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

In [4]:
grouped = df['data1'].groupby(df['key1'])

In [5]:
grouped

<pandas.core.groupby.SeriesGroupBy object at 0x7f2478507d50>

In [6]:
grouped.mean()

key1
a    0.371471
b   -0.552806
Name: data1, dtype: float64

In [10]:
df['data1'].groupby([df['key1'], df['key2']]).mean()

key1  key2
a     one     0.451829
      two     0.210756
b     one    -0.335555
      two    -0.770057
Name: data1, dtype: float64

In [11]:
df['data1'].groupby([df['key1'], df['key2']]).sum()

key1  key2
a     one     0.903658
      two     0.210756
b     one    -0.335555
      two    -0.770057
Name: data1, dtype: float64

In [12]:
df['data1'].groupby([df['key1'], df['key2']]).count()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

In [13]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()

In [14]:
means

key1  key2
a     one     0.451829
      two     0.210756
b     one    -0.335555
      two    -0.770057
Name: data1, dtype: float64

In [15]:
means.unstack()

key2       one       two
key1
a     0.451829  0.210756
b    -0.335555 -0.770057


In the examples above, the group keys are all Series, though they could be any arrays of the right length.

In [16]:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])

In [17]:
states

array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'], 
      dtype='|S10')

In [18]:
years = np.array([2005, 2005, 2006, 2005, 2006])

In [19]:
df['data1'].groupby([states, years]).mean()

California  2005    0.210756
            2006   -0.335555
Ohio        2005    0.006300
            2006    0.121001
Name: data1, dtype: float64
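
Note that the result above has no index level names, because plain NumPy arrays carry no name. A small sketch of this, using fixed values instead of random data so the result can be checked by hand; the level names `state` and `year` are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Fixed values stand in for the random data1 column
data1 = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name='data1')
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])

# Array keys are matched to the data by position
by_arrays = data1.groupby([states, years]).mean()
names_before = list(by_arrays.index.names)

# Label the levels after the fact (names chosen here for illustration)
by_arrays.index.names = ['state', 'year']
```

After the assignment, `by_arrays` prints with a `state  year` header like the Series-keyed results earlier.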

Frequently the grouping information you're looking for is located in the same DataFrame as the data you're looking to summarize. In that case, you can pass column names as the group keys.

In [22]:
df.groupby('key1').mean() # column names work as keys only with DataFrame.groupby, not Series.groupby

         data1     data2
key1
a     0.371471 -0.721434
b    -0.552806 -0.544059
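
The output above silently drops the non-numeric `key2` column, which is the behavior of the older pandas version these notes were run on. As far as I know, pandas 2.0 and later raise a `TypeError` here instead, so on a recent version one of the two forms below is needed. This is a sketch with fixed values replacing the random data.

```python
import pandas as pd

# Same layout as df above, with fixed values instead of random ones
frame = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                      'key2': ['one', 'two', 'one', 'two', 'one'],
                      'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
                      'data2': [10.0, 20.0, 30.0, 40.0, 50.0]})

# Option 1: select the numeric columns explicitly before aggregating
by_selection = frame.groupby('key1')[['data1', 'data2']].mean()

# Option 2: ask mean() to skip non-numeric columns
by_flag = frame.groupby('key1').mean(numeric_only=True)
```

Explicit selection makes it obvious which columns end up in the result; `numeric_only=True` is shorter when the frame has many numeric columns.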


In [25]:
df.groupby('key1').mean().stack().unstack(0) # transposes the result (equivalent to .T)

key1          a         b
data1  0.371471 -0.552806
data2 -0.721434 -0.544059
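
For a result with a single-level index and single-level columns, the `stack().unstack(0)` round trip gives the same frame as `.T`. A quick check on fixed values (the frame below is a stand-in for `df`, with the numeric columns selected explicitly):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                      'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
                      'data2': [10.0, 20.0, 30.0, 40.0, 50.0]})

group_means = frame.groupby('key1')[['data1', 'data2']].mean()

# stack() moves the column labels into the index;
# unstack(0) then moves the key1 level out to the columns
roundtrip = group_means.stack().unstack(0)

# The plain transpose has the same labels and values
transposed = group_means.T
same_values = np.allclose(roundtrip.values, transposed.values)
```

`.T` is the shorter spelling; the stack/unstack form becomes useful when only one of several index levels should move.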


In [26]:
df.groupby(['key1', 'key2']).mean()

              data1     data2
key1 key2
a    one   0.451829 -0.367857
     two   0.210756 -1.428587
b    one  -0.335555  0.345320
     two  -0.770057 -1.433438


In [30]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
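
`size()` here matches the `count()` output from earlier only because the data contain no missing values. A small sketch of the difference, using a hypothetical frame with one `NaN` inserted:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in group 'a'
frame = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                      'data1': [1.0, np.nan, 3.0, 4.0, 5.0]})

# size() counts rows per group, missing values included
sizes = frame.groupby('key1').size()

# count() counts only the non-missing values in each group
counts = frame.groupby('key1')['data1'].count()
```

Group `a` has three rows but only two non-missing `data1` values, so the two results differ for that group.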