# Group By: split-apply-combine

In this notebook we will cover:
* Splitting the data into groups based on some criteria.
* Applying a function to each group independently.
* Combining the results into a data structure.

If you like, you can read more in the pandas user guide: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
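To make the three steps concrete, here is a minimal sketch on a small toy frame. The data and the column names `component` and `value` are invented for illustration and are not the workshop dataset. It does the split, apply, and combine steps by hand and then compares the result with the `groupby` one-liner.

```python
import pandas as pd

# Toy data, invented for illustration (not the workshop dataset)
toy = pd.DataFrame({
    "component": ["NO2", "PM10", "NO2", "PM10", "O3"],
    "value": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Split: one sub-frame per unique component
pieces = {name: group for name, group in toy.groupby("component")}

# Apply: compute the mean value of each piece independently
means = {name: group["value"].mean() for name, group in pieces.items()}

# Combine: collect the per-group results into a single Series
by_hand = pd.Series(means, name="value").sort_index()

# The one-liner performs all three steps at once
one_liner = toy.groupby("component")["value"].mean()
print(one_liner)
```

Both `by_hand` and `one_liner` hold one mean per component, indexed by the component name.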

In [None]:
### Steps for use with colab
# First step to mount your Google Drive
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My\ Drive
# Clone Pyladies repo 
#! git clone --recursive https://github.com/pyladiesams/Pandas-advanced-nov2019.git
# Install requirements
! pip install pandas==0.25.3
import pandas as pd
# Move into repo
%cd /content/drive/My\ Drive/Pandas-advanced-nov2019/workshop/

# 0. Load data

In [1]:
import pandas as pd

%matplotlib inline

In [3]:
airquality = pd.read_csv("./data/airquality.csv", delimiter=";", decimal=",")

# rename columns from Dutch to English
airquality.columns = ["time", "location", "component", "value", "airquality_index"]

# 1. Group by

In [5]:
# Group the data based on 'component'
grouped = airquality.groupby("component")
print(grouped)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1137026d0>


We haven't done anything special yet: grouping just sets up a mapping from labels to group names.

The `groups` attribute is a dict whose keys are the computed unique groups and whose values are the axis labels belonging to each group.


PS: by default, grouping sorts the group keys (pass `sort=False` to `groupby` to keep the order of first appearance).
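As a small illustration of the `groups` mapping and the default sorting, the sketch below uses an invented toy frame (the name `toy` and its contents are assumptions, not the workshop data).

```python
import pandas as pd

# Toy data, invented for illustration; note that "PM10" appears before "NO2"
toy = pd.DataFrame({
    "component": ["PM10", "NO2", "PM10", "NO2"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

grouped_toy = toy.groupby("component")

# Keys are the group names, values are the row labels that fall in each group
groups = {name: list(labels) for name, labels in grouped_toy.groups.items()}
print(groups)

# Iterating over the groupby object yields the groups in sorted key order
sorted_keys = [name for name, _ in grouped_toy]
print(sorted_keys)

# With sort=False the groups come out in order of first appearance
unsorted_keys = [name for name, _ in toy.groupby("component", sort=False)]
print(unsorted_keys)
```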

In [6]:
print(grouped.groups)

In [5]:
# Choose a group and print the first 3 rows (use .get_group)

## your code here

In [4]:
# Print each name and group in a for loop

## your code here

In [7]:
# Group the dataframe by component and airquality index and print the rows related to
# group "PM10" and index 11

## your code here

# 2. Apply step

There are many things you can do with the group objects we've created. Some examples are:
* Standard aggregations such as `.sum()` and `.mean()`
* More special aggregations using `.agg()`
* Any other transformations using your own defined functions
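The aggregation options in the list above can be sketched on a toy frame. The columns `station`, `temp`, and `rain` are hypothetical and chosen only for this example.

```python
import pandas as pd

# Toy data, invented for illustration (not the workshop dataset)
toy = pd.DataFrame({
    "station": ["A", "A", "B", "B"],
    "temp": [10.0, 14.0, 20.0, 22.0],
    "rain": [1.0, 3.0, 0.0, 2.0],
})
by_station = toy.groupby("station")

# A standard aggregation: one number per group
mean_temp = by_station["temp"].mean()

# Several aggregations at once: pass a list of function names to .agg()
temp_stats = by_station["temp"].agg(["mean", "sum"])

# A different aggregation per column: pass a dict {column: function}
per_column = by_station.agg({"temp": "sum", "rain": "mean"})

print(temp_stats)
print(per_column)
```

With a list, `.agg()` returns one column per function; with a dict, it returns one column per key.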

Another useful function you can call on a group is `.describe()`

In [8]:
# try using .describe() on the groups you get when you group by component

## your code here

In [9]:
# Calculate the mean airquality_index of each component

## your code here

In [10]:
# try to get the mean as well as the total sum of the value of each component

## your code here

In [11]:
# It is also possible to apply the aggregation specific for certain columns
# Can you try to get the sum of the value, and the mean of the airquality_index per
# component?


## your code here

You can also apply your own custom functions to groups using `apply`, either with a `lambda` or with a named function.
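As a rough sketch of both styles, the example below computes the change from the first to the last observation in each group. The toy frame and the helper name `first_to_last` are made up for illustration.

```python
import pandas as pd

# Toy data, invented for illustration (not the workshop dataset)
toy = pd.DataFrame({
    "station": ["A", "A", "A", "B", "B"],
    "temp": [10.0, 14.0, 13.0, 20.0, 26.0],
})

# With a lambda: each group arrives as a Series holding that group's temps
change_lambda = toy.groupby("station")["temp"].apply(
    lambda s: s.iloc[-1] - s.iloc[0]
)


# The same logic as a named function (hypothetical helper)
def first_to_last(s):
    """Return the last value of the group minus its first value."""
    return s.iloc[-1] - s.iloc[0]


change_named = toy.groupby("station")["temp"].apply(first_to_last)
print(change_named)
```

A named function is easier to document and reuse; a `lambda` is handy for one-off expressions.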

In [12]:
# Try to get the max - min value of each component using a lambda function
airquality.groupby("component").value.apply(lambda group:  ## your code here )

In [13]:
# Try to do the same but with a custom function
def max_min(group):
    return ## your code here

airquality.groupby("component").value.apply(## your code here)

Try to group the data by component and airquality_index and count the observations in each group (hint: `size()`). Then turn the result into a dataframe.

In [14]:
## your code here

This format isn't great. Try to create a dataframe with one column for each airquality_index and the counts as values in the cells.
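The difference between the long and the wide layout can be sketched on a toy frame. The columns `station` and `level` are invented for this example, and `unstack` is one possible way to pivot, not necessarily the intended solution.

```python
import pandas as pd

# Toy data, invented for illustration (not the workshop dataset)
toy = pd.DataFrame({
    "station": ["A", "A", "A", "B", "B"],
    "level": [1, 1, 2, 2, 2],
})

# size() counts the rows per (station, level) pair; the result is a Series
# with a two-level index
counts = toy.groupby(["station", "level"]).size()

# Long format: one row per (station, level) pair, with the count as a column
long_counts = counts.reset_index(name="count")

# Wide format: one row per station, one column per level; pairs that never
# occur are filled with 0
wide_counts = counts.unstack(fill_value=0)

print(long_counts)
print(wide_counts)
```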

In [15]:
## your code here

# 3. Transformations

So far we've been applying functions to the groups that usually return one value per group. The `transform` method instead returns an object that is indexed the same (and so has the same size) as the one being grouped.
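The difference in shape is easy to see on a toy frame (invented here for illustration): an aggregation returns one row per group, while `transform` returns one row per original row.

```python
import pandas as pd

# Toy data, invented for illustration (not the workshop dataset)
toy = pd.DataFrame({
    "station": ["A", "B", "A", "B", "A"],
    "temp": [10.0, 20.0, 14.0, 24.0, 12.0],
})

# Aggregation: one value per group, indexed by the group key
aggregated = toy.groupby("station")["temp"].mean()

# Transform: the group mean repeated for every row, indexed like the original
broadcast = toy.groupby("station")["temp"].transform("mean")

print(aggregated.shape, broadcast.shape)
print(broadcast)
```

Because `broadcast` shares the index of `toy`, it can be assigned directly as a new column.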

Let's start with a small example. Write a function that returns the difference between each value in a column and the column's mean. Then use `transform` to apply the function to each component group.

In [16]:
def diff_mean(x):
    return ## your code here

In [17]:
transformed = airquality.groupby("component").transform(lambda x: diff_mean(x))
print(f"Original df shape: {airquality.shape}, transformed df shape: {transformed.shape}")
print(transformed.head())

Let's add the new columns to our dataframe and look at the result.

In [18]:
airquality[["value_diff_mean", "airquality_index_diff_mean"]] = transformed
print(airquality.head())

Another example where `transform` could be useful is when you want to standardize your data.

In the next cells, try to standardize the data per component: group by component, then subtract the group mean and divide by the group standard deviation. Add the new columns to the dataframe and check that each has a mean of (approximately) zero and a standard deviation of (approximately) 1.

In [64]:
airquality = airquality.drop(["value_diff_mean", "airquality_index_diff_mean"],axis=1)

In [66]:
def standardize(x):
    return ## your code here

In [68]:
airquality[["value_standardized", "airquality_index_standardized"]] = \
    ## your code here

In [19]:
# Check that the columns are correct: means close to 0, standard deviations close to 1
print(airquality.value_standardized.mean())
print(airquality.value_standardized.std())
print(airquality.airquality_index_standardized.mean())
print(airquality.airquality_index_standardized.std())

Another useful application of `transform` is when you have missing values.
If you like, you can corrupt the value column using the code below, which replaces each value with a missing value with a 10 percent chance. Then try to impute the missing values, for example with the average per group.

In [82]:
import numpy as np
airquality["value_with_nan"] = \
    airquality.value.mask(np.random.random(airquality.shape[0]) < .1)
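As a minimal sketch of imputing with the per-group average, the example below works on an invented toy frame with hand-placed missing values. The column names are hypothetical, and combining `transform("mean")` with `fillna` is one possible approach among several.

```python
import numpy as np
import pandas as pd

# Toy data with hand-placed gaps, invented for illustration
toy = pd.DataFrame({
    "station": ["A", "A", "A", "B", "B", "B"],
    "temp": [10.0, np.nan, 14.0, 20.0, 22.0, np.nan],
})

# Mean of each group (NaNs are skipped), repeated for every row of that group
group_mean = toy.groupby("station")["temp"].transform("mean")

# Fill each gap with the mean of the group the row belongs to
toy["temp_filled"] = toy["temp"].fillna(group_mean)

print(toy)
```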

In [20]:
# print(airquality.isnull().sum())

In [21]:
## your code here