# GroupBy: Split, Apply, Combine

In [18]:
import numpy as np
np.random.seed(0)
import pandas as pd

## Apply

Apply a function along an axis of the DataFrame.

Objects passed to the function are Series objects whose index is either the DataFrame’s index (axis=0, so the function receives one column at a time) or the DataFrame’s columns (axis=1, so the function receives one row at a time).  
By default, the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.

In [19]:
data = [[4, 9], [4, 9], [4, 9]]
df = pd.DataFrame(data, columns=['A', 'B'])

print(df)

   A  B
0  4  9
1  4  9
2  4  9


In [20]:
print(df.apply(np.sqrt))

     A    B
0  2.0  3.0
1  2.0  3.0
2  2.0  3.0


In [21]:
print(df.apply(np.sum, axis=0))

A    12
B    27
dtype: int64


In [22]:
print(df.apply(np.sum, axis=1))

0    13
1    13
2    13
dtype: int64


In [23]:
np.sum(df, axis=1)

0    13
1    13
2    13
dtype: int64
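
As a small sketch of how the result_type argument changes the outcome, the example below applies a row-wise function that returns a list. The lambda and the names `as_lists` and `expanded` are illustrative and not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame([[4, 9], [4, 9], [4, 9]], columns=['A', 'B'])

# Returning a list per row: by default the result is a Series of lists.
as_lists = df.apply(lambda row: [row['A'], row['B'] * 2], axis=1)

# result_type='expand' spreads each returned list across columns instead.
expanded = df.apply(
    lambda row: [row['A'], row['B'] * 2], axis=1, result_type='expand'
)

print(as_lists)
print(expanded)
```

The first call keeps one list per row in a single Series, while the second should produce a DataFrame with one column per list element.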

## GroupBy

By “group by” we are referring to a process involving one or more of the following steps:

- Splitting the data into groups based on some criteria.
- Applying a function to each group independently.
- Combining the results into a data structure.
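
The three steps above can be sketched by hand before handing the work to groupby. This is only an illustration under the assumption of a small frame with `key` and `data` columns; the names `pieces`, `sums`, `manual` and `via_groupby` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data': range(10, 16),
})

# Split: one sub-DataFrame per distinct key.
pieces = {k: df[df['key'] == k] for k in sorted(df['key'].unique())}

# Apply: reduce each piece independently.
sums = {k: piece['data'].sum() for k, piece in pieces.items()}

# Combine: collect the per-group results into a single Series.
manual = pd.Series(sums, name='data')
manual.index.name = 'key'

# The same computation expressed with groupby.
via_groupby = df.groupby('key')['data'].sum()

print(manual)
print(via_groupby)
```

Both Series should hold the same per-key totals; groupby simply performs the split, apply and combine steps in one call.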

In [24]:
keys = ['A', 'B', 'C', 'A', 'B', 'C']
names = ['N1', 'N1', 'N2', 'N2', 'N3', 'N4']

df = pd.DataFrame(
    {
        'key': keys,
        'name': names,
        'data': range(10, 16)
    }
)

print(df)

  key name  data
0   A   N1    10
1   B   N1    11
2   C   N2    12
3   A   N2    13
4   B   N3    14
5   C   N4    15


In [25]:
for c in ['A', 'B', 'C']:
    print(df[df['key'] == c].data.sum())

23
25
27


In [26]:
grouped = df.groupby('key')

In [27]:
for v in grouped:
    print(v)

('A',   key name  data
0   A   N1    10
3   A   N2    13)
('B',   key name  data
1   B   N1    11
4   B   N3    14)
('C',   key name  data
2   C   N2    12
5   C   N4    15)
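
Since iterating over a GroupBy object yields (label, sub-DataFrame) pairs, the tuple can be unpacked directly in the loop. The sketch below rebuilds the same frame so it runs on its own; `group_sizes` and `group_a` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'name': ['N1', 'N1', 'N2', 'N2', 'N3', 'N4'],
    'data': range(10, 16),
})
grouped = df.groupby('key')

# Each iteration yields a (group label, sub-DataFrame) pair.
group_sizes = {}
for key, group in grouped:
    group_sizes[key] = len(group)
    print(key, group['data'].tolist())

# get_group fetches a single group by its label.
group_a = grouped.get_group('A')
print(group_a)
```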


### Aggregate

Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data.

An obvious one is aggregation via the aggregate() method or, equivalently, its alias agg():

In [28]:
# Select the numeric column explicitly so the string column 'name' is not summed.
print(df.groupby('key')[['data']].aggregate('sum'))

     data
key      
A      23
B      25
C      27


In [29]:
print(df.groupby('key').agg(['min', 'max', 'mean', 'std', 'var']))

    data                        
     min max  mean      std  var
key                             
A     10  13  11.5  2.12132  4.5
B     11  14  12.5  2.12132  4.5
C     12  15  13.5  2.12132  4.5


In [30]:
print(df.groupby('key').aggregate('min'))

    name  data
key           
A     N1    10
B     N1    11
C     N2    12


In [31]:
grouped = df.groupby(['key', 'name'])

print(grouped.aggregate('min'))

          data
key name      
A   N1      10
    N2      13
B   N1      11
    N3      14
C   N2      12
    N4      15
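
When different columns need different reductions, named aggregation can make the output columns explicit. The following is a sketch assuming a pandas version that supports the `agg(new_name=(column, function))` form; the output column names `total`, `smallest` and `n_names` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'name': ['N1', 'N1', 'N2', 'N2', 'N3', 'N4'],
    'data': range(10, 16),
})

# Each keyword names an output column; the tuple is (input column, function).
summary = df.groupby('key').agg(
    total=('data', 'sum'),
    smallest=('data', 'min'),
    n_names=('name', 'nunique'),
)

print(summary)
```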


### Transform

The transform method returns an object with the same index (and therefore the same size) as the one being grouped.

In [32]:
print(grouped.transform(lambda x: x**2))

   data
0   100
1   121
2   144
3   169
4   196
5   225
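
A common use of transform is to broadcast a per-group statistic back onto every row, for example to center values within each group. The sketch below groups by `key` only; the column names `group_mean` and `centered` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data': range(10, 16),
})

# transform('mean') returns one value per original row, aligned on the index.
df['group_mean'] = df.groupby('key')['data'].transform('mean')

# Subtracting the broadcast mean centers each value within its group.
df['centered'] = df['data'] - df['group_mean']

print(df)
```

Unlike aggregate, which would return one row per key, the result here keeps all six rows of the original frame.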


### Filter

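The filter method returns a subset of the original object. Suppose we want to keep only the rows that belong to groups whose data sum is greater than 12. Because grouped is keyed on both key and name, each group here holds a single row, so the group sum is simply that row's data value.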

In [33]:
# The filter function must return a single boolean per group, not a Series.
print(grouped.filter(lambda x: x['data'].sum() > 12))

  key name  data
3   A   N2    13
4   B   N3    14
5   C   N4    15

With dropna=False, the rows of groups that fail the test are kept and filled with NaN instead of being dropped:


In [34]:
# The filter function must return a single boolean per group, not a Series.
print(grouped.filter(lambda x: x['data'].sum() > 12, dropna=False))

   key name  data
0  NaN  NaN   NaN
1  NaN  NaN   NaN
2  NaN  NaN   NaN
3    A   N2  13.0
4    B   N3  14.0
5    C   N4  15.0
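
Filtering becomes more interesting when groups contain several rows. As a sketch, the example below groups by `key` alone and keeps only the groups whose `data` sum exceeds a threshold; the threshold of 24 and the name `kept` are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'name': ['N1', 'N1', 'N2', 'N2', 'N3', 'N4'],
    'data': range(10, 16),
})

# The function is called once per group and must return a single boolean.
# Group sums are A=23, B=25, C=27, so every row of group A is dropped.
kept = df.groupby('key').filter(lambda g: g['data'].sum() > 24)

print(kept)
```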
