# Group By: split-apply-combine

In this notebook we will cover:
* Splitting the data into groups based on some criteria.
* Applying a function to each group independently.
* Combining the results into a data structure.

If you like you can read some more here: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
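
Before loading the real data, the three steps can be seen on a tiny made-up table. This is only a sketch: the `toy` dataframe and its values are invented for illustration and do not come from `airquality.csv`.

```python
import pandas as pd

# Hypothetical toy data, not taken from airquality.csv
toy = pd.DataFrame(
    {
        "component": ["CO", "O3", "CO", "O3"],
        "value": [300.0, 30.0, 280.0, 50.0],
    }
)

# Split: one group per unique component
grouped_toy = toy.groupby("component")

# Apply + combine: the mean of each group, collected into one Series
toy_means = grouped_toy["value"].mean()
print(toy_means)
```

The result has one row per group, indexed by the group name.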

# 0. Load data

In [1]:
import pandas as pd

%matplotlib inline

In [3]:
airquality = pd.read_csv("../workshop/data/airquality.csv", delimiter=";", decimal=",")

# rename columns from Dutch to English
airquality.columns = ["time", "location", "component", "value", "airquality_index"]

# 1. Group by

In [5]:
# Group the data based on 'component'

grouped = airquality.groupby("component")
print(grouped)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1137026d0>


We haven't done anything special yet: defining a grouping only provides a mapping of labels to group names.

The `groups` attribute is a dict whose keys are the unique group names and whose values are the axis labels (here, the row labels) belonging to each group.


PS: by default, `groupby` sorts the group keys.
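
A small sketch of both points, using an invented `toy` dataframe (an assumption for illustration only): `.groups` maps each group name to its row labels, and passing `sort=False` keeps the groups in order of first appearance instead of sorting them.

```python
import pandas as pd

# Hypothetical toy data; "O3" deliberately appears before "CO"
toy = pd.DataFrame(
    {
        "component": ["O3", "CO", "O3", "CO"],
        "value": [30.7, 298.1, 28.4, 287.7],
    }
)

# .groups maps each group name to the row labels that belong to it
toy_groups = toy.groupby("component").groups
co_labels = list(toy_groups["CO"])

# Default (sort=True): group keys come out sorted
sorted_keys = toy.groupby("component")["value"].sum().index.tolist()

# sort=False: group keys keep their order of first appearance
unsorted_keys = (
    toy.groupby("component", sort=False)["value"].sum().index.tolist()
)

print(co_labels, sorted_keys, unsorted_keys)
```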

In [6]:
print(grouped.groups)

{'CO': Int64Index([    0,    12,    15,    22,    28,    35,    42,    49,    57,
               63,
            ...
            45395, 45407, 45409, 45421, 45428, 45435, 45438, 45450, 45454,
            45458],
           dtype='int64', length=6489), 'FN': Int64Index([    3,     9,    17,    25,    31,    38,    44,    52,    59,
               66,
            ...
            45391, 45401, 45406, 45413, 45420, 45423, 45431, 45443, 45444,
            45455],
           dtype='int64', length=6549), 'NO': Int64Index([    1,    13,    16,    23,    29,    36,    48,    50,    58,
               64,
            ...
            45393, 45398, 45403, 45410, 45416, 45426, 45432, 45440, 45448,
            45453],
           dtype='int64', length=6534), 'NO2': Int64Index([    6,    10,    20,    27,    34,    41,    47,    55,    62,
               69,
            ...
            45389, 45396, 45404, 45411, 45417, 45429, 45436, 45442, 45445,
            45456],
           dtype='int64', length=6

In [7]:
# Choose a group and print the first 3 rows (use .get_group)
print(grouped.get_group("CO").head(3))

                         time              location component  value  \
0   2019-01-01 01:00:00+01:00  Amsterdam-Vondelpark        CO  298.1   
12  2019-01-01 02:00:00+01:00  Amsterdam-Vondelpark        CO  287.7   
15  2019-01-01 03:00:00+01:00  Amsterdam-Vondelpark        CO  244.6   

    airquality_index  
0                  2  
12                 2  
15                 1  


In [9]:
# Print each name and group in a for loop
for name, group in grouped:
    print(name)
    print(group.head())

CO
                         time              location component  value  \
0   2019-01-01 01:00:00+01:00  Amsterdam-Vondelpark        CO  298.1   
12  2019-01-01 02:00:00+01:00  Amsterdam-Vondelpark        CO  287.7   
15  2019-01-01 03:00:00+01:00  Amsterdam-Vondelpark        CO  244.6   
22  2019-01-01 04:00:00+01:00  Amsterdam-Vondelpark        CO  219.9   
28  2019-01-01 05:00:00+01:00  Amsterdam-Vondelpark        CO  214.5   

    airquality_index  
0                  2  
12                 2  
15                 1  
22                 1  
28                 1  
FN
                         time              location component  value  \
3   2019-01-01 01:00:00+01:00  Amsterdam-Vondelpark        FN   3.38   
9   2019-01-01 02:00:00+01:00  Amsterdam-Vondelpark        FN   1.87   
17  2019-01-01 03:00:00+01:00  Amsterdam-Vondelpark        FN   1.66   
25  2019-01-01 04:00:00+01:00  Amsterdam-Vondelpark        FN   0.76   
31  2019-01-01 05:00:00+01:00  Amsterdam-Vondelpark        FN  

In [15]:
# Group the dataframe by component and airquality index and show the rows belonging to
# the group with component "PM10" and index 11
airquality.groupby(["component", "airquality_index"]).get_group(('PM10', 11))

,time,location,component,value,airquality_index
4,2019-01-01 01:00:00+01:00,Amsterdam-Vondelpark,PM10,425.9,11
28304,2019-06-20 11:00:00+02:00,Amsterdam-Vondelpark,PM10,250.5,11
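
When grouping by several columns, each group name is a tuple with one entry per grouping column, which is why `get_group` receives `('PM10', 11)` above. A minimal sketch on invented data (the `toy` dataframe is hypothetical):

```python
import pandas as pd

# Hypothetical toy data, not taken from airquality.csv
toy = pd.DataFrame(
    {
        "component": ["PM10", "PM10", "CO", "PM10"],
        "airquality_index": [11, 2, 2, 11],
        "value": [425.9, 18.0, 298.1, 250.5],
    }
)

by_two_keys = toy.groupby(["component", "airquality_index"])

# One group per distinct (component, airquality_index) combination
n_groups = by_two_keys.ngroups

# Select a single group by passing the key as a tuple
pm10_11 = by_two_keys.get_group(("PM10", 11))
print(n_groups)
print(pm10_11)
```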


# 2. Apply step

There are many things you can do with the group objects we've created. Some examples are:
* Standard aggregations such as `.sum()` and `.mean()`
* More specialized aggregations using `.agg()`
* Any other transformation using functions you define yourself

Another useful function you can call on a group is `.describe()`

In [19]:
# try using .describe() on the groups you get when you group by component
airquality.groupby("component").describe()

,value,value,value,value,value,value,value,value,airquality_index,airquality_index,airquality_index,airquality_index,airquality_index,airquality_index,airquality_index,airquality_index
component,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
CO,6489.0,334.22515,91.344264,156.6,283.7,314.0,361.6,1985.8,6489.0,1.965018,0.356605,1.0,2.0,2.0,2.0,5.0
FN,6549.0,0.700971,0.649833,0.04,0.31,0.49,0.86,8.79,6549.0,1.216674,0.469526,1.0,1.0,1.0,1.0,6.0
NO,6534.0,2.608815,9.255303,-2.0,0.1,0.5,1.7,203.8,6534.0,1.082492,0.403668,1.0,1.0,1.0,1.0,7.0
NO2,6534.0,20.814264,14.899702,1.8,10.4,16.1,26.075,106.6,6534.0,2.485461,1.29784,1.0,2.0,2.0,3.0,8.0
O3,6529.0,54.44019,29.048798,-1.3,36.0,54.5,70.6,215.9,6529.0,4.042732,1.585158,1.0,3.0,4.0,5.0,10.0
PM10,6397.0,18.389042,13.0979,-0.4,10.6,15.8,22.6,425.9,6397.0,2.287479,1.055105,1.0,2.0,2.0,3.0,11.0
PM25,6427.0,10.456916,11.268769,-4.9,4.5,7.4,12.3,419.0,6427.0,1.754318,1.310677,1.0,1.0,1.0,2.0,11.0


In [18]:
# Calculate the mean airquality_index of each component
airquality.groupby("component").airquality_index.mean()

component
CO      1.965018
FN      1.216674
NO      1.082492
NO2     2.485461
O3      4.042732
PM10    2.287479
PM25    1.754318
Name: airquality_index, dtype: float64

In [20]:
# try to get the mean as well as the total sum of the value of each component
airquality.groupby("component").value.agg(["sum", "mean"])

component,sum,mean
CO,2168787.0,334.22515
FN,4590.66,0.700971
NO,17046.0,2.608815
NO2,136000.4,20.814264
O3,355440.0,54.44019
PM10,117634.7,18.389042
PM25,67206.6,10.456916


In [21]:
# It is also possible to apply a different aggregation to each column
# Can you try to get the sum of the value, and the mean of the airquality_index per component?
airquality.groupby("component").agg({"value": "sum", "airquality_index": "mean"})

component,value,airquality_index
CO,2168787.0,1.965018
FN,4590.66,1.216674
NO,17046.0,1.082492
NO2,136000.4,2.485461
O3,355440.0,4.042732
PM10,117634.7,2.287479
PM25,67206.6,1.754318
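
A related option is named aggregation, where each output column gets an explicit name. The sketch below runs on an invented `toy` dataframe, and the output names `total_value` and `mean_index` are arbitrary choices made for this example.

```python
import pandas as pd

# Hypothetical toy data, not taken from airquality.csv
toy = pd.DataFrame(
    {
        "component": ["CO", "CO", "O3", "O3"],
        "value": [300.0, 280.0, 30.0, 50.0],
        "airquality_index": [2, 1, 3, 4],
    }
)

# Named aggregation: output_name=(input_column, aggregation)
summary = toy.groupby("component").agg(
    total_value=("value", "sum"),
    mean_index=("airquality_index", "mean"),
)
print(summary)
```

Compared with the dict form, this makes clear in the result which aggregation each column holds.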


You can also apply your own custom functions to groups, using `apply`. You can pass either a `lambda` or a named function you define yourself.

In [23]:
# Try to get the max - min value of each component using a lambda function
airquality.groupby("component").value.apply(lambda group: group.max() - group.min())

component
CO      1829.20
FN         8.75
NO       205.80
NO2      104.80
O3       217.20
PM10     426.30
PM25     423.90
Name: value, dtype: float64

In [24]:
# Try to do the same but with a custom function
def max_min(group):
    return group.max() - group.min()

airquality.groupby("component").value.apply(max_min)

component
CO      1829.20
FN         8.75
NO       205.80
NO2      104.80
O3       217.20
PM10     426.30
PM25     423.90
Name: value, dtype: float64
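
Because this function returns a single value per group, the same result can also be obtained with `.agg()`, or by combining the built-in `.max()` and `.min()` aggregations. A sketch on invented data (the `toy` dataframe is hypothetical) that compares the three:

```python
import pandas as pd

# Hypothetical toy data, not taken from airquality.csv
toy = pd.DataFrame(
    {
        "component": ["CO", "CO", "CO", "O3", "O3"],
        "value": [300.0, 280.0, 250.0, 30.0, 50.0],
    }
)

values_by_component = toy.groupby("component")["value"]

# Three ways to compute max - min per group
range_apply = values_by_component.apply(lambda s: s.max() - s.min())
range_agg = values_by_component.agg(lambda s: s.max() - s.min())
range_builtin = values_by_component.max() - values_by_component.min()

print(range_builtin)
```

The last form avoids calling a Python function once per group, which tends to matter on larger data.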

Try to group the data per component and airquality_index and calculate the count of observations (hint: `size()`). Then turn the result into a dataframe.

In [38]:
airquality.groupby(["component","airquality_index"]).size().to_frame()

component,airquality_index,0
CO,1,521
CO,2,5682
CO,3,279
CO,4,6
CO,5,1
FN,1,5252
FN,2,1212
FN,3,54
FN,4,26
FN,5,4


This format isn't great. Try to create a dataframe with one column for each airquality_index and the counts as values in the cells.

In [45]:
grouped_df = airquality.groupby(["component","airquality_index"]).size().to_frame()
grouped_df.unstack().fillna(value=0)

,0,0,0,0,0,0,0,0,0,0,0
airquality_index,1,2,3,4,5,6,7,8,9,10,11
component,,,,,,,,,,,
CO,521.0,5682.0,279.0,6.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
FN,5252.0,1212.0,54.0,26.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0
NO,6192.0,206.0,93.0,32.0,6.0,3.0,2.0,0.0,0.0,0.0,0.0
NO2,1474.0,2499.0,1285.0,732.0,330.0,165.0,47.0,2.0,0.0,0.0,0.0
O3,565.0,684.0,675.0,1948.0,1733.0,613.0,211.0,80.0,15.0,5.0,0.0
PM10,1339.0,2939.0,1358.0,573.0,129.0,34.0,13.0,4.0,4.0,2.0,2.0
PM25,4221.0,937.0,444.0,505.0,183.0,71.0,52.0,8.0,2.0,3.0,1.0
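
A possible variation, sketched on invented data (the `toy` dataframe is hypothetical): calling `unstack(fill_value=0)` directly on the Series returned by `size()` fills missing combinations with 0 in one step and should keep the counts as integers, since no `NaN` is ever introduced. Skipping `to_frame()` also avoids the extra `0` column level.

```python
import pandas as pd

# Hypothetical toy data; CO never has airquality_index 2
toy = pd.DataFrame(
    {
        "component": ["CO", "CO", "O3", "O3", "O3"],
        "airquality_index": [1, 1, 1, 2, 2],
    }
)

# Count rows per (component, airquality_index), then pivot the
# airquality_index level into columns, filling gaps with 0
counts = (
    toy.groupby(["component", "airquality_index"])
    .size()
    .unstack(fill_value=0)
)
print(counts)
```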


# 3. Transformations

So far we've been applying functions to the groups that usually return one value per group. The `transform` method instead returns an object that is indexed the same way (and so has the same number of rows) as the one being grouped.
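
The difference in shape is easy to see on a tiny invented example (the `toy` dataframe is hypothetical): aggregating gives one row per group, while `transform` gives one row per original row.

```python
import pandas as pd

# Hypothetical toy data, not taken from airquality.csv
toy = pd.DataFrame(
    {
        "component": ["CO", "O3", "CO", "O3"],
        "value": [300.0, 30.0, 280.0, 50.0],
    }
)

# Aggregation: one value per group (2 rows)
mean_per_group = toy.groupby("component")["value"].mean()

# Transformation: the group mean repeated for every row (4 rows),
# aligned with the index of the original dataframe
mean_per_row = toy.groupby("component")["value"].transform("mean")

print(len(mean_per_group), len(mean_per_row))
```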

Let's start with a small example. Write a function which returns the difference between a column's value and the mean of the column. Then use `transform` to apply the function to each group of component. 

In [54]:
def diff_mean(x):
    return x - x.mean()

In [55]:
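# Select the numeric columns explicitly: recent pandas versions no longer
# silently skip columns (such as time and location) on which the function fails
transformed = airquality.groupby("component")[["value", "airquality_index"]].transform(diff_mean)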
print(f"Original df shape: {airquality.shape}, transformed df shape: {transformed.shape}")
print(transformed.head())

Original df shape: (45459, 5), transformed df shape: (45459, 2)
        value  airquality_index
0  294.441038         -3.510353
1    4.918128         -1.681640
2   28.825906          0.449635
3    2.301305          0.408720
4  424.496031          8.831990


Let's add the new columns to our dataframe and look at the result.

In [57]:
airquality[["value_diff_mean", "airquality_index_diff_mean"]] = transformed
print(airquality.head())

                        time              location component   value  \
0  2019-01-01 01:00:00+01:00  Amsterdam-Vondelpark        CO  298.10   
1  2019-01-01 01:00:00+01:00  Amsterdam-Vondelpark        NO    5.20   
2  2019-01-01 01:00:00+01:00  Amsterdam-Vondelpark        O3   30.70   
3  2019-01-01 01:00:00+01:00  Amsterdam-Vondelpark        FN    3.38   
4  2019-01-01 01:00:00+01:00  Amsterdam-Vondelpark      PM10  425.90   

   airquality_index  value_diff_mean  airquality_index_diff_mean  
0                 2       294.441038                   -3.510353  
1                 1         4.918128                   -1.681640  
2                 3        28.825906                    0.449635  
3                 3         2.301305                    0.408720  
4                11       424.496031                    8.831990  


Another example where `transform` could be useful is when you want to standardize your data.

In the next cells, try to standardize the data per component by subtracting the group mean and dividing by the group standard deviation. Add the new columns to the dataframe and check that the mean of the standardized columns is indeed 0 and the standard deviation 1.

In [64]:
airquality = airquality.drop(["value_diff_mean", "airquality_index_diff_mean"],axis=1)

In [66]:
def standardize(x):
    return (x - x.mean()) / x.std()

In [68]:
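# Select the numeric columns explicitly so the function is not applied to time and location
airquality[["value_standardized", "airquality_index_standardized"]] = (
    airquality.groupby("component")[["value", "airquality_index"]].transform(standardize)
)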

In [71]:
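# Check the columns: the overall mean should be ~0; the overall std comes out just
# below 1 because each group was standardized separately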
print(airquality.value_standardized.mean())
print(airquality.value_standardized.std())
print(airquality.airquality_index_standardized.mean())
print(airquality.airquality_index_standardized.std())

-1.0003461380286942e-17
0.9999340028378061
7.002422966200859e-17
0.9999340028378061
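
Since the standardization is done per component, a stricter check looks at the mean and standard deviation within each group rather than over the whole column. A sketch on invented data (the `toy` dataframe is hypothetical; `standardize` is the function defined above, repeated here so the example is self-contained):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from airquality.csv
toy = pd.DataFrame(
    {
        "component": ["CO"] * 3 + ["O3"] * 3,
        "value": [300.0, 280.0, 260.0, 30.0, 50.0, 70.0],
    }
)


def standardize(x):
    return (x - x.mean()) / x.std()


toy["value_standardized"] = toy.groupby("component")["value"].transform(standardize)

# Mean and std within each group: should be 0 and 1
per_group = toy.groupby("component")["value_standardized"].agg(["mean", "std"])

# Std over the whole column: not exactly 1
overall_std = toy["value_standardized"].std()

print(per_group)
print(overall_std)
```

Within each group the standard deviation is 1 up to floating-point error, while the standard deviation of the whole column is smaller than 1.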


Another useful application of `transform` is when you have missing values.
If you like, you can corrupt the value column with the code below, which replaces each value with a missing value with a probability of 10 percent. Then try to impute the missing values, for example with the average per group.

In [82]:
import numpy as np
airquality["value_with_nan"] = airquality.value.mask(np.random.random(airquality.shape[0]) < .1)

In [83]:
print(airquality.isnull().sum())

time                                0
location                            0
component                           0
value                               0
airquality_index                    0
value_standardized                  0
airquality_index_standardized       0
value_with_nan                   4520
dtype: int64
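
One possible way to do the imputation, sketched on invented data (the `toy` dataframe and the `value_imputed` column name are assumptions for illustration): `transform("mean")` returns the group mean for every row, which can then be passed to `fillna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per component
toy = pd.DataFrame(
    {
        "component": ["CO", "CO", "CO", "O3", "O3", "O3"],
        "value_with_nan": [300.0, np.nan, 280.0, 30.0, 50.0, np.nan],
    }
)

# Mean of the non-missing values of each group, repeated for every row
group_means = toy.groupby("component")["value_with_nan"].transform("mean")

# Fill only the missing entries; existing values are left untouched
toy["value_imputed"] = toy["value_with_nan"].fillna(group_means)
print(toy)
```

The mean skips missing values by default, so each gap is filled with the average of the observed values in its own group.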
