In [1]:
import numpy as np
print("numpy version: {}".format(np.__version__))
import pandas as pd 
print("pandas version: {}".format(pd.__version__))
import matplotlib
import matplotlib.pyplot as plt
print("matplotlib version: {}".format(matplotlib.__version__))
import scipy as sp
print("scipy version: {}".format(sp.__version__))
import sklearn as sl
print("scikit-learn: {}".format(sl.__version__))
import seaborn as sns
print("seaborn: {}".format(sns.__version__))
import statsmodels as sm
print("statsmodels: {}".format(sm.__version__))

numpy version: 1.17.4
pandas version: 0.25.3
matplotlib version: 3.1.2
scipy version: 1.3.3
scikit-learn: 0.21.3
seaborn: 0.9.0
statsmodels: 0.10.2


# GroupBy Mechanics

Hadley Wickham, an author of many popular packages for the R programming language, coined the term split-apply-combine for describing group operations. In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (```axis=0```) or its columns (```axis=1```). Once this is done, a function is applied to each group, producing a new value. Finally, the results of all those function applications are combined into a result object. The form of the resulting object will usually depend on what’s being done to the data.

![title](images/group_aggregation.png)
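The three stages can be spelled out by hand to see what ```groupby``` does in one call. This is a minimal sketch on a small made-up frame (`toy`, `pieces`, `applied` and `manual` are illustrative names, not part of the notebook's data):

```python
import pandas as pd

# Hypothetical toy data, not the notebook's random frame.
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'c'],
                    'value': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Split: one sub-Series of values per distinct key.
pieces = {k: toy.loc[toy['key'] == k, 'value'] for k in toy['key'].unique()}

# Apply: reduce each piece to a single number.
applied = {k: piece.sum() for k, piece in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(applied).sort_index()

# groupby performs the same three steps in a single call.
via_groupby = toy.groupby('key')['value'].sum()

print(manual)
print(via_groupby)
```

Both printed Series should hold the same per-key sums; the manual version only makes the intermediate pieces visible.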

Each grouping key can take many forms, and the keys do not have to be all of the same type:

- A list or array of values that is the same length as the axis being grouped
- A value indicating a column name in a DataFrame
- A dict or Series giving a correspondence between the values on the axis being grouped and the group names
- A function to be invoked on the axis index or the individual labels in the index
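As a quick illustration of the four key forms listed above, the sketch below groups one small made-up frame in each way. The frame and the names `by_list`, `by_column`, `by_dict` and `by_func` are assumptions for demonstration only:

```python
import pandas as pd

# Hypothetical data with string labels in the index.
toy = pd.DataFrame({'team': ['x', 'y', 'x', 'y'],
                    'score': [10, 20, 30, 40]},
                   index=['ann', 'bob', 'cy', 'dee'])

# 1. A list with the same length as the axis being grouped.
by_list = toy['score'].groupby(['u', 'u', 'v', 'v']).sum()

# 2. A column name.
by_column = toy.groupby('team')['score'].sum()

# 3. A dict mapping index labels to group names.
label_to_group = {'ann': 'first', 'bob': 'first',
                  'cy': 'second', 'dee': 'second'}
by_dict = toy['score'].groupby(label_to_group).sum()

# 4. A function called once per index label (here, the label length).
by_func = toy['score'].groupby(len).sum()

for result in (by_list, by_column, by_dict, by_func):
    print(result, end='\n\n')
```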

In [2]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                    'key2' : ['one', 'two', 'one', 'two', 'one'],
                    'data1' : np.random.randn(5),
                    'data2' : np.random.randn(5)})

In [3]:
df

  key1 key2     data1     data2
0    a  one  1.557371 -1.040017
1    a  two -1.205718  1.294198
2    b  one -2.013940 -0.217927
3    b  two  0.833489  1.048025
4    a  one -0.568030  0.286620


Suppose you wanted to compute the mean of the ```data1``` column using the labels from ```key1```. There are a number of ways to do this. One is to access ```data1``` and call ```groupby``` with the column (a Series) at ```key1```:

In [5]:
grouped = df['data1'].groupby(df['key1'])

In [6]:
grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f74bca7b890>

The idea is that this object has all of the information needed to then apply some operation to each of the groups.

In [7]:
grouped.mean()

key1
a   -0.072126
b   -0.590226
Name: data1, dtype: float64

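Here the data (a Series) has been aggregated according to the group key, producing a new Series that is now indexed by the unique values in the ```key1``` column. If you instead pass several arrays as a list, the result has a hierarchical index made up of the unique pairs of keys observed: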

In [8]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()

In [9]:
means

key1  key2
a     one     0.494670
      two    -1.205718
b     one    -2.013940
      two     0.833489
Name: data1, dtype: float64

In [10]:
means.unstack()

key2       one       two
key1
a     0.494670 -1.205718
b    -2.013940  0.833489


In [11]:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])

In [12]:
years = np.array([2005, 2005, 2006, 2005, 2006])

In [13]:
df['data1'].groupby([states, years]).mean()

California  2005   -1.205718
            2006   -2.013940
Ohio        2005    1.195430
            2006   -0.568030
Name: data1, dtype: float64

In [14]:
df.groupby('key1').mean()

         data1     data2
key1
a    -0.072126  0.180267
b    -0.590226  0.415049
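Note that ```key2``` is missing from the result above: under the pandas version listed at the top of this notebook, non-numeric columns are silently dropped from the aggregation. Newer pandas releases (2.0 and later) raise an error here instead. A sketch of two ways to get the same result explicitly, using a small made-up frame (`toy`, `explicit` and `skipped` are illustrative names):

```python
import pandas as pd

# Hypothetical frame with one non-numeric column besides the group key.
toy = pd.DataFrame({'key1': ['a', 'a', 'b'],
                    'key2': ['one', 'two', 'one'],
                    'data1': [1.0, 3.0, 5.0]})

# Option 1: select the numeric columns explicitly before aggregating.
explicit = toy.groupby('key1')[['data1']].mean()

# Option 2: ask mean() to skip non-numeric columns.
skipped = toy.groupby('key1').mean(numeric_only=True)

print(explicit)
print(skipped)
```

Selecting columns explicitly is the more portable choice, since it does not depend on how a given release treats non-numeric columns.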


In [15]:
df.groupby(['key1', 'key2']).mean()

              data1     data2
key1 key2
a    one   0.494670 -0.376699
     two  -1.205718  1.294198
b    one  -2.013940 -0.217927
     two   0.833489  1.048025


Regardless of the objective in using ```groupby```, a generally useful GroupBy method is ```size```, which returns a Series containing group sizes (any missing values in a group key are excluded from the result):

In [16]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
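The exclusion of missing keys is easy to overlook because the example frame has none. A minimal sketch with one deliberately missing key (the frame `toy` is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which one group key is missing.
toy = pd.DataFrame({'key': ['a', 'b', np.nan, 'a'],
                    'value': [1, 2, 3, 4]})

sizes = toy.groupby('key').size()

print(sizes)                 # the row whose key is NaN is not counted
print(sizes.sum(), len(toy)) # group sizes add up to one fewer than the row count
```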

### Iterating Over Groups

In [19]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)
    print()

a
  key1 key2     data1     data2
0    a  one  1.557371 -1.040017
1    a  two -1.205718  1.294198
4    a  one -0.568030  0.286620

b
  key1 key2     data1     data2
2    b  one -2.013940 -0.217927
3    b  two  0.833489  1.048025



In [20]:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)
    print()

('a', 'one')
  key1 key2     data1     data2
0    a  one  1.557371 -1.040017
4    a  one -0.568030  0.286620

('a', 'two')
  key1 key2     data1     data2
1    a  two -1.205718  1.294198

('b', 'one')
  key1 key2    data1     data2
2    b  one -2.01394 -0.217927

('b', 'two')
  key1 key2     data1     data2
3    b  two  0.833489  1.048025



In [21]:
pieces = dict(list(df.groupby('key1')))

In [22]:
pieces['b']

  key1 key2     data1     data2
2    b  one -2.013940 -0.217927
3    b  two  0.833489  1.048025
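Building a dict materializes every group at once. When only one group is needed, ```get_group``` is an alternative; the sketch below compares the two on a small made-up frame (`toy` and `b_rows` are illustrative names):

```python
import pandas as pd

# Hypothetical frame with the same key layout as df.
toy = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                    'value': [1, 2, 3, 4, 5]})

grouped = toy.groupby('key1')

# Materialize every group, as in the cell above.
pieces = dict(list(grouped))

# Fetch a single group on demand.
b_rows = grouped.get_group('b')

print(b_rows)
print(b_rows.equals(pieces['b']))
```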


By default, ```groupby``` groups on ```axis=0```, but you can group on any of the other axes. For example, the columns of ```df``` can be grouped by dtype:

In [23]:
df.dtypes

key1      object
key2      object
data1    float64
data2    float64
dtype: object

In [24]:
grouped = df.groupby(df.dtypes, axis=1)

In [25]:
for dtype, group in grouped:
    print(dtype)
    print(group)

float64
      data1     data2
0  1.557371 -1.040017
1 -1.205718  1.294198
2 -2.013940 -0.217927
3  0.833489  1.048025
4 -0.568030  0.286620
object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one
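The ```axis=1``` argument works with the pandas version used in this notebook, but it is deprecated in newer releases (2.1 and later). One possible substitute, sketched here under that assumption, is to group the column labels themselves and then select the matching columns. The frame `toy` and the names `dtype_names` and `by_dtype` are illustrative:

```python
import pandas as pd

# Hypothetical frame with mixed column dtypes.
toy = pd.DataFrame({'key1': ['a', 'b'],
                    'data1': [1.0, 2.0],
                    'data2': [3.0, 4.0]})

# Use the string name of each column's dtype as the group key.
dtype_names = toy.dtypes.astype(str)

# Group the column labels by dtype name, then pull out those columns.
by_dtype = {name: toy[list(cols)]
            for name, cols in toy.columns.to_series().groupby(dtype_names)}

for name, sub_frame in by_dtype.items():
    print(name)
    print(sub_frame)
```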


### Selecting a Column or Subset of Columns

Indexing a GroupBy object created from a DataFrame with a column name or array
of column names has the effect of column subsetting for aggregation. This means
that:
```python
df.groupby('key1')['data1']
df.groupby('key1')[['data2']]
```
are syntactic sugar for:
```python
df['data1'].groupby(df['key1'])
df[['data2']].groupby(df['key1'])
```
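A short check of this equivalence, as a sketch on a small made-up frame (`toy` and the four result names are illustrative):

```python
import pandas as pd

# Hypothetical frame with one key column and two data columns.
toy = pd.DataFrame({'key1': ['a', 'a', 'b'],
                    'data1': [1.0, 2.0, 3.0],
                    'data2': [4.0, 5.0, 6.0]})

# Scalar column name: both spellings aggregate a Series.
short_series = toy.groupby('key1')['data1'].mean()
long_series = toy['data1'].groupby(toy['key1']).mean()

# List of column names: both spellings aggregate a DataFrame.
short_frame = toy.groupby('key1')[['data2']].mean()
long_frame = toy[['data2']].groupby(toy['key1']).mean()

print(short_series.equals(long_series))
print(short_frame.equals(long_frame))
```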
Especially for large datasets, it may be desirable to aggregate only a few columns. For example, in the preceding dataset, to compute means for just the ```data2``` column and get the result as a DataFrame, we could write:

In [28]:
df.groupby(['key1', 'key2'])[['data2']].mean()

              data2
key1 key2
a    one  -0.376699
     two   1.294198
b    one  -0.217927
     two   1.048025


In [29]:
df

  key1 key2     data1     data2
0    a  one  1.557371 -1.040017
1    a  two -1.205718  1.294198
2    b  one -2.013940 -0.217927
3    b  two  0.833489  1.048025
4    a  one -0.568030  0.286620


The object returned by this indexing operation is a grouped DataFrame if a list or
array is passed or a grouped Series if only a single column name is passed as a scalar:

In [30]:
s_grouped = df.groupby(['key1', 'key2'])['data2']

In [31]:
s_grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f74bca21ed0>

In [32]:
s_grouped.mean()

key1  key2
a     one    -0.376699
      two     1.294198
b     one    -0.217927
      two     1.048025
Name: data2, dtype: float64

In [33]:
s_grouped.sum()

key1  key2
a     one    -0.753397
      two     1.294198
b     one    -0.217927
      two     1.048025
Name: data2, dtype: float64
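The difference between the two selection styles shows up in both the grouped object and the aggregated result. A minimal sketch on a made-up frame (`toy`, `scalar_sel` and `list_sel` are illustrative names):

```python
import pandas as pd

# Hypothetical frame with one key column and one data column.
toy = pd.DataFrame({'key1': ['a', 'a', 'b'],
                    'data2': [4.0, 5.0, 6.0]})

grouped = toy.groupby('key1')

scalar_sel = grouped['data2']    # single column name passed as a scalar
list_sel = grouped[['data2']]    # list containing one column name

print(type(scalar_sel).__name__, type(scalar_sel.mean()).__name__)
print(type(list_sel).__name__, type(list_sel.mean()).__name__)
```

The scalar form should report a grouped Series that aggregates to a Series, and the list form a grouped DataFrame that aggregates to a DataFrame.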

### Grouping with Dicts and Series

In [34]:
people = pd.DataFrame(np.random.randn(5, 5),
                        columns=['a', 'b', 'c', 'd', 'e'],
                        index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

In [35]:
people.iloc[2:3, [1, 2]] = np.nan

In [36]:
people

               a         b         c         d         e
Joe    -1.151169 -1.107079 -1.672781 -0.274629 -1.100421
Steve   0.053982 -0.629894  1.299768 -0.527279 -0.717513
Wes     0.723746       NaN       NaN  0.211896  0.960253
Jim     0.532258  0.341073  0.403278 -0.374384  0.145850
Travis -0.330230  1.303093  0.205246  2.498775  0.985992


In [37]:
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f' : 'orange'}

In [38]:
by_column = people.groupby(mapping, axis=1)

In [39]:
by_column.sum()

            blue       red
Joe    -1.947410 -3.358669
Steve   0.772490 -1.293425
Wes     0.211896  1.683999
Jim     0.028895  1.019182
Travis  2.704021  1.958855
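Two details of the result above are worth spelling out: the unused mapping key ```'f'``` does not produce an ```orange``` column, and the grouping runs over column labels rather than rows. The sketch below reproduces both on a small made-up frame. It transposes instead of passing ```axis=1```, on the assumption that the reader may be on a newer pandas release where that argument is deprecated; `toy` and `by_color` are illustrative names:

```python
import pandas as pd

# Hypothetical frame with three columns to be grouped by color.
toy = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                   columns=['a', 'b', 'c'], index=['Joe', 'Wes'])

# 'f' has no matching column; unused mapping keys are simply ignored.
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'f': 'orange'}

# Transpose so column labels become row labels, group, then transpose back.
by_color = toy.T.groupby(mapping).sum().T

print(by_color)
```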


In [41]:
map_series = pd.Series(mapping)

In [42]:
map_series

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [43]:
people.groupby(map_series, axis=1).count()

        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3


### Grouping with Functions

Using Python functions is a more generic way of defining a group mapping compared with a dict or Series. Any function passed as a group key will be called once per index value, with the return values being used as the group names. More concretely, consider the example DataFrame from the previous section, which has people’s first names as index values. Suppose you wanted to group by the length of the names; while you could compute an array of string lengths, it’s simpler to just pass the ```len``` function:

In [44]:
people.groupby(len).sum()

          a         b         c         d         e
3  0.104835 -0.766006 -1.269503 -0.437117  0.005682
5  0.053982 -0.629894  1.299768 -0.527279 -0.717513
6 -0.330230  1.303093  0.205246  2.498775  0.985992
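To make the "called once per index value" behavior concrete, the sketch below compares passing ```len``` with passing a precomputed array of name lengths. The frame `toy` and the names `by_func`, `name_lengths` and `by_array` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with first names as index labels.
toy = pd.DataFrame({'x': [1, 2, 3, 4]},
                   index=['Joe', 'Steve', 'Wes', 'Travis'])

# Function key: len is called on each index label.
by_func = toy.groupby(len).sum()

# Array key: the same lengths computed up front.
name_lengths = np.array([len(name) for name in toy.index])
by_array = toy.groupby(name_lengths).sum()

print(by_func)
print(by_array)
```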


In [45]:
key_list = ['one', 'one', 'one', 'two', 'two']

In [46]:
people.groupby([len, key_list]).min()

              a         b         c         d         e
3 one -1.151169 -1.107079 -1.672781 -0.274629 -1.100421
  two  0.532258  0.341073  0.403278 -0.374384  0.145850
5 one  0.053982 -0.629894  1.299768 -0.527279 -0.717513
6 two -0.330230  1.303093  0.205246  2.498775  0.985992


### Grouping by Index Levels

In [47]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'], [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])

In [48]:
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)

In [49]:
hier_df

cty          US                            JP
tenor         1         3         5         1         3
0     -0.018647  1.073508  1.270190 -0.791814 -1.443821
1      0.332702 -0.150500  0.841814 -1.764949  0.327163
2     -0.505958  2.033832 -1.245403  1.197849 -0.681379
3     -1.624970 -0.043434  0.000965 -0.349007 -2.064542


In [50]:
hier_df.groupby(level='cty', axis=1).count()

cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3


In [53]:
hier_df.groupby(level='tenor', axis=1).count()

tenor  1  3  5
0      2  2  1
1      2  2  1
2      2  2  1
3      2  2  1
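The ```level``` keyword accepts either a level name or its integer position. The sketch below shows both on a small made-up frame. It transposes first so that the hierarchical levels sit on the row index, which avoids ```axis=1``` for readers on newer pandas releases where that argument is deprecated; `toy`, `long_form`, `by_name` and `by_position` are illustrative names:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two column levels, as in hier_df but smaller.
columns = pd.MultiIndex.from_arrays([['US', 'US', 'JP'], [1, 3, 1]],
                                    names=['cty', 'tenor'])
toy = pd.DataFrame(np.arange(6).reshape(2, 3), columns=columns)

# Move the column levels onto the row index.
long_form = toy.T

# Group by level name, or equivalently by level position.
by_name = long_form.groupby(level='cty').sum()
by_position = long_form.groupby(level=0).sum()

print(by_name)
print(by_name.equals(by_position))
```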


In [65]:
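# The function receives each row label (0, 1, 2, 3), not the cell values,
# so label 0 falls in the False group and labels 1-3 in the True group.
hier_df.groupby(by=[lambda x: abs(x) > 0]).count()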

cty   US       JP
tenor  1  3  5  1  3
False  1  1  1  1  1
True   3  3  3  3  3


# Data Aggregation