In [1]:
%pylab inline
import pandas as pd
import numpy as np

Populating the interactive namespace from numpy and matplotlib


A group key can take several forms:
1. [A list or array of values that is the same length as the axis being grouped](#list_key)
    * [single list/array](#single_list_key)
    * [multiple list/array](#multi_list_key)
2. [A value indicating a column name in a DataFrame](#colname_key)
    * [iterate over groups](#sample_iterate)
    * [select certain columns to aggregate](#syntactic_sugar_column)
3. [A dict or Series giving a correspondence between the values on the axis being grouped and the group names](#group_with_dict)
    * [groupby on columns](#groupby_on_columns)
4. [A function to be invoked on the axis **index** or the individual labels in the **index** ](#group_funcs)

As of this writing, any missing values in a group key will be **excluded** from the result. 
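
The exclusion of missing keys can be checked with a small sketch. The `values` and `keys` names are made up for illustration, and the `dropna=False` option is assumed to be available (it exists in newer pandas releases, not in the version this notebook was written with):

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row has a missing group key.
values = pd.Series([1, 2, 3, 5])
keys = pd.Series(['a', np.nan, 'a', 'b'])

# Default behaviour: the row whose key is NaN is left out of every group.
default_sum = values.groupby(keys).sum()
print(default_sum)

# Assumption: a pandas version that supports dropna=False,
# which keeps NaN as a group of its own.
kept_sum = values.groupby(keys, dropna=False).sum()
print(kept_sum)
```

With the default settings only the groups `a` and `b` appear, so the value 2 does not contribute to any sum.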

<a id="list_key"></a>
## use any list/array of the <span style="color:red">right length</span> as key
<a id="single_list_key"></a>
### a single list/array

In [2]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.arange(1,6),
                   'data2' : np.arange(6,11)})
df

   data1  data2 key1 key2
0      1      6    a  one
1      2      7    a  two
2      3      8    b  one
3      4      9    b  two
4      5     10    a  one


In [3]:
grps = df.data1.groupby(df.key1)
for key, grp in grps:
    # print() with a single argument works in both Python 2 and Python 3
    print("%s: \n%s" % (key, str(grp)))

a: 
0    1
1    2
4    5
Name: data1, dtype: int32
b: 
2    3
3    4
Name: data1, dtype: int32


In [4]:
grps.mean()

key1
a       2.666667
b       3.500000
Name: data1, dtype: float64

<a id="multi_list_key"></a>
### multiple list/array as key

In [5]:
grps = df[["data1","data2"]].groupby([df.key1,df.key2])
for key, grp in grps:
    # key is a tuple such as ('a', 'one') when grouping by two arrays
    print("%s: \n%s" % (key, str(grp)))

('a', 'one'): 
   data1  data2
0      1      6
4      5     10
('a', 'two'): 
   data1  data2
1      2      7
('b', 'one'): 
   data1  data2
2      3      8
('b', 'two'): 
   data1  data2
3      4      9


In [6]:
grps.mean()

           data1  data2
key1 key2
a    one       3      8
     two       2      7
b    one       3      8
     two       4      9


In [20]:
grps = df.data1.groupby([df.key1,df.key2])
series = grps.mean() # grouping a single column by two keys returns a Series with a hierarchical index
series.unstack() # pivot the inner index level (key2) into columns, giving a DataFrame

key2  one  two
key1
a       3    2
b       3    4
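
As a rough cross-check of the cell above, the sketch below rebuilds the same `df`, confirms that the grouped result carries a `MultiIndex`, and compares the unstacked table with `pivot_table`. Using `pivot_table` here is an illustration of an alternative route, not something the notebook itself relies on:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.arange(1, 6),
                   'data2': np.arange(6, 11)})

# Grouping one column by two keys gives a Series indexed by (key1, key2).
series = df.data1.groupby([df.key1, df.key2]).mean()
is_hierarchical = isinstance(series.index, pd.MultiIndex)

# unstack() moves the inner level (key2) into the columns.
wide = series.unstack()

# pivot_table computes the same key1-by-key2 table of means in one call.
pivoted = df.pivot_table(values='data1', index='key1',
                         columns='key2', aggfunc='mean')
print(wide)
```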


<a id="colname_key"></a>
## use column name as key

In [8]:
df.groupby("key1").size()

key1
a       3
b       2
dtype: int64

In [9]:
df.groupby(["key1","key2"]).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

In [10]:
df.groupby(["key1","key2"]).mean()

           data1  data2
key1 key2
a    one       3      8
     two       2      7
b    one       3      8
     two       4      9


<a id="sample_iterate"></a>
### iterate over groups

In [11]:
grps = df.groupby(["key1","key2"])
for index, ((key1, key2), grp) in enumerate(grps):
    # each item is ((key1, key2), sub-DataFrame) because two keys were used
    print("********************** Group %d" % (index + 1))
    print("Key: <%s,%s>" % (key1, key2))
    print(grp)

********************** Group 1
Key: <a,one>
   data1  data2 key1 key2
0      1      6    a  one
4      5     10    a  one
********************** Group 2
Key: <a,two>
   data1  data2 key1 key2
1      2      7    a  two
********************** Group 3
Key: <b,one>
   data1  data2 key1 key2
2      3      8    b  one
********************** Group 4
Key: <b,two>
   data1  data2 key1 key2
3      4      9    b  two


<a id="syntactic_sugar_column"></a>
### select certain columns to aggregate
Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of selecting those columns for aggregation. This means that:

```python
df.groupby('key1')['data1']
df.groupby('key1')[['data2']]
```
are <span style="color:red;font-size:1.5em">syntactic sugar</span> for:
```python
df['data1'].groupby(df['key1'])
df[['data2']].groupby(df['key1'])
```
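
The equivalence can be checked directly. This is a minimal sketch that rebuilds the same `df` and compares both spellings; the variable names `sugar` and `explicit` are only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.arange(1, 6),
                   'data2': np.arange(6, 11)})

# Single column name: both forms aggregate to a Series.
sugar = df.groupby('key1')['data1'].mean()
explicit = df['data1'].groupby(df['key1']).mean()
print(sugar.equals(explicit))

# List of column names: both forms aggregate to a DataFrame.
sugar_frame = df.groupby('key1')[['data2']].mean()
explicit_frame = df[['data2']].groupby(df['key1']).mean()
print(sugar_frame.equals(explicit_frame))
```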

In [12]:
# pass in a single column, return a series
df.groupby(['key1', 'key2'])['data2'].size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

In [13]:
# pass in a list, return a DataFrame, even if the list contains a single element
# (the column names must be wrapped in a list; a bare tuple is rejected by newer pandas)
df.groupby(['key1', 'key2'])[['data1','data2']].mean()

           data1  data2
key1 key2
a    one       3      8
     two       2      7
b    one       3      8
     two       4      9


<a id="group_with_dict"></a>
## Grouping with Dicts and Series

In [14]:
people = pd.DataFrame(np.arange(1,26).reshape(5,5),columns=['a', 'b', 'c', 'd', 'e'],index=['Tom', 'Alice', 'Mary', 'Jim', 'Steve'])
people

        a   b   c   d   e
Tom     1   2   3   4   5
Alice   6   7   8   9  10
Mary   11  12  13  14  15
Jim    16  17  18  19  20
Steve  21  22  23  24  25


In [15]:
genders = {'Tom':'male','Jim':'male','Steve':'male','Alice':'female','Mary':'female'}
grps = people.groupby(genders)
for key, grp in grps:
    # key is the mapped value ('female' or 'male'), not the original index label
    print("********************* %s" % key)
    print(grp)

********************* female
        a   b   c   d   e
Alice   6   7   8   9  10
Mary   11  12  13  14  15
********************* male
        a   b   c   d   e
Tom     1   2   3   4   5
Jim    16  17  18  19  20
Steve  21  22  23  24  25


<a id="groupby_on_columns"></a>
### groupby on columns
* by default `axis=0`, which groups the rows
* passing `axis=1` groups the columns instead

In [16]:
colors = pd.Series(   {"a":"red","c":"red","e":"red","b":"blue","d":"blue"} )
people.groupby(colors,axis=1).sum()

       blue  red
Tom       6    9
Alice    16   24
Mary     26   39
Jim      36   54
Steve    46   69
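
The column-wise grouping above can also be written without `axis=1`, which may be useful because recent pandas releases appear to deprecate that argument on `DataFrame.groupby`. The sketch below is one possible workaround under that assumption: transpose, group the rows, and transpose back.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(np.arange(1, 26).reshape(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Tom', 'Alice', 'Mary', 'Jim', 'Steve'])
colors = pd.Series({"a": "red", "c": "red", "e": "red",
                    "b": "blue", "d": "blue"})

# After transposing, the column labels a..e form the row index,
# so the colors Series lines up with them as an ordinary row-wise key.
by_color = people.T.groupby(colors).sum().T
print(by_color)
```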


<a id="group_funcs"></a>
## Grouping with Functions

In [17]:
s = pd.Series(["c","a","d","a","b","a","c","a","b","d","a","a","d","a"])
s

0     c
1     a
2     d
3     a
4     b
5     a
6     c
7     a
8     b
9     d
10    a
11    a
12    d
13    a
dtype: object

In [18]:
s.groupby(lambda index: s[index]).size() # the function receives each index label, not the value itself

a    7
b    2
c    2
d    3
dtype: int64
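
Because the function is applied to index labels, any callable that accepts a label can serve as a key. The sketch below groups the `people` frame from earlier by the length of each name using the built-in `len`, and shows that grouping a Series by its own values needs no lambda. The shorter Series `s` here is made up for illustration:

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(np.arange(1, 26).reshape(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Tom', 'Alice', 'Mary', 'Jim', 'Steve'])

# len is called once per index label: 'Tom' -> 3, 'Mary' -> 4, 'Alice' -> 5, ...
by_name_length = people.groupby(len).sum()
print(by_name_length)

# Grouping a Series by its own values: pass the Series itself as the key.
s = pd.Series(["c", "a", "d", "a", "b", "a"])
sizes = s.groupby(s).size()
print(sizes)
```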