<img src=images/gdd-logo.png width=300px align=right> 

# Aggregation in Pandas

This section covers aggregating data with `df.groupby()`, including:

* [Grouped aggregation in pandas](#grouped-agg)
* [Single aggregations with `df.groupby()`](#groupby)
* [Multiple aggregations with `.agg()`](#mult-aggs)
* [<mark>Exercise: Using aggregations</mark>](#ex-aggs)

Let's import pandas and load the data again, since this is a new notebook.

In [None]:
import pandas as pd

chickweight = (
    pd.read_csv('data/chickweight.csv')
    .rename(str.lower, axis='columns')
)

To get statistics for the data you can use methods like:

* `df.describe()`: gives a selection of statistics for each numeric field
* `df.mean()`: calculates the mean of each column or row
* `df.max()`: calculates the max of each column or row
* `df.min()`: calculates the min of each column or row

In [None]:
chickweight.describe()

In [None]:
chickweight.mean()
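
Whether these methods work on "each column or row" is controlled by the `axis` argument. The following is a minimal sketch on a made-up `toy` dataframe (not the chick data), so the numbers are easy to check by hand:

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({'a': [1, 3], 'b': [5, 7]})

# Default (axis=0): one mean per column.
col_means = toy.mean()

# axis=1: one mean per row.
row_means = toy.mean(axis=1)
```

Here `col_means` holds the means of `a` and `b`, while `row_means` holds one value per row.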

<a id = 'grouped-agg'></a>
## Grouped Aggregation in Pandas

However, what if you want statistics for specific groups within the dataset, e.g. the mean weight for diet 2? In that case you need to ***aggregate*** the data.

Grouped aggregation splits the original dataset into sub-dataframes, calculates a statistic on each one, and combines the results.

<img src="images/03_Aggregations/split-groupby-combine.png" width="440" height="440" align="center"/>
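
To make the split, apply and combine steps concrete, here is a rough sketch that performs them by hand and compares the result with `groupby`. The `toy` dataframe and its values are made up for illustration; it only borrows the column names `diet` and `weight`.

```python
import pandas as pd

# Hypothetical toy data standing in for chickweight (values are made up).
toy = pd.DataFrame({
    'diet': [1, 1, 2, 2],
    'weight': [40, 60, 50, 90],
})

# Split: select the rows of each diet. Apply: take the mean weight of each.
manual = {
    diet: toy.loc[toy['diet'] == diet, 'weight'].mean()
    for diet in toy['diet'].unique()
}

# Combine: collect the per-group results into a single Series.
manual = pd.Series(manual)

# groupby performs the same three steps in one chain.
grouped = toy.groupby('diet')['weight'].mean()
```

Both `manual` and `grouped` contain one mean weight per diet.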

There are a few ways to do this in pandas.

<a id = 'groupby'></a>
## Single aggregations

To get the overall mean weight you can use:

In [None]:
(
    chickweight
    ['weight']
    .mean()
)

But what if you want to see what the mean weight is depending on the diet the chicken was on?

Then you first need to perform a `.groupby()` in order to split the data into those different diets.

In [None]:
(
    chickweight
    .groupby('diet')
    ['weight']
    .mean()
)

### <mark>Practice: Try out groupby</mark>

1. Find the minimum `weight` of the chickens at each `time` period.

There are lots of statistics that can be applied to numerical columns, like:
* `.mean()`
* `.sum()`
* `.min()` and `.max()`
* `.first()`
* `.var()`
* `.size()`
* `.count()`

2. Try them out. What does each one do?

In [None]:
# %load answers/03_Aggregations/practice-aggs.py

<details>
    
  <summary><span style="color:blue">Show answer</span></summary>
  
* `.mean()`: Return the mean of the values 
* `.sum()`: Return the sum of the values 
* `.min()` and `.max()`: Return the minimum or maximum of the values 
* `.first()`: Return the first non-null entry of each column
* `.var()`: Return the variance of each column
* `.size()`: Return the number of rows 
* `.count()`: Return the count of non-null entries of each column

</details>
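
The difference between `.size()` and `.count()` only shows up when values are missing. A small sketch, assuming a made-up `toy` dataframe with one missing weight:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing weight at time 0.
toy = pd.DataFrame({
    'time': [0, 0, 2, 2],
    'weight': [40.0, np.nan, 50.0, 55.0],
})

by_time = toy.groupby('time')['weight']

sizes = by_time.size()    # counts every row in the group, nulls included
counts = by_time.count()  # counts only the non-null values
```

For `time == 0` the size is 2 but the count is 1, because one of the two weights is missing.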

<mark>**Question:** What happens to the index when you use groupby?</mark>

<details>
    
  <summary><span style="color:blue">Show answer</span></summary>
  
Whatever column(s) you group on will become the new index. 

</details>
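
If you would rather keep the grouping column as a regular column, `.groupby()` accepts `as_index=False`. A minimal sketch on a made-up `toy` dataframe:

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({'diet': [1, 1, 2], 'weight': [40, 60, 50]})

# Default: the grouping column becomes the index of the result.
indexed = toy.groupby('diet')['weight'].mean()

# as_index=False: the grouping column stays a normal column.
flat = toy.groupby('diet', as_index=False)['weight'].mean()
```

`indexed` is a Series indexed by `diet`, while `flat` is a dataframe with the columns `diet` and `weight`.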

You can even perform aggregations when the data is grouped by **multiple** columns:

In [None]:
(
    chickweight
    .groupby(['diet', 'time'])
    .mean()
)

If grouping by more than one column, the indices will be nested in a _multiindex_, with the order of the list items determining the hierarchy (e.g. `['diet', 'time']` will first group by diet and then within each diet by time).
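
The nesting can be inspected and used for selection. The sketch below assumes a made-up `toy` dataframe; `.loc[]` selects on the outer level and `.xs()` takes a cross-section on an inner level.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    'diet': [1, 1, 2, 2],
    'time': [0, 2, 0, 2],
    'weight': [40, 60, 42, 70],
})

nested = toy.groupby(['diet', 'time'])['weight'].mean()

# The index levels follow the order of the list passed to groupby.
level_names = list(nested.index.names)

# Outer level: all time periods for diet 1.
diet_1 = nested.loc[1]

# Inner level: all diets at time 2.
time_2 = nested.xs(2, level='time')
```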

To take those columns out of the index we can use the `.reset_index()` method.

<mark>**Question**: Why does the below code *not work* when you comment out `.reset_index()`?</mark>

In [None]:
(
    chickweight
    .groupby(['time', 'diet'])
    .mean()
    .reset_index()
    .loc[lambda df: df['time'] == 10]
)

<details>
    
  <summary><span style="color:blue">Show answer</span></summary>
  
  Because the filter uses the column `time`, and after a `.groupby()` on `time` and `diet` these columns become the index of the resulting dataframe. They can therefore no longer be filtered like *normal* columns.

   `.reset_index()` turns `time` and `diet` back into columns, so they can be used in the filter.

   Keeping the grouping columns in the index is sometimes fine, but sometimes you will want to undo it. You can use `.reset_index()` at the end to do this.

   The index of a dataframe typically behaves differently from a column.

</details>
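
As a sketch of the point made in the answer, the example below shows that `time` is no longer a column after grouping, and two ways to filter on it anyway. The `toy` dataframe is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    'time': [0, 0, 10, 10],
    'diet': [1, 2, 1, 2],
    'weight': [40, 42, 100, 120],
})

means = toy.groupby(['time', 'diet']).mean()

# 'time' now lives in the index, so it is not among the columns.
time_is_column = 'time' in means.columns

# Option 1: move the index levels back to columns, then filter as usual.
via_reset = means.reset_index().loc[lambda df: df['time'] == 10]

# Option 2: keep the index and filter on the values of the 'time' level.
via_level = means.loc[means.index.get_level_values('time') == 10]
```

Both options select the same rows. The first returns `time` and `diet` as columns, the second keeps them in the index.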

<a id = 'mult-aggs'></a>
## Multiple aggregations with `.agg()`

You can get the results for more than one aggregation by using a keyword for the name of the **output column** and a tuple to specify the **old column** and the **aggregator** to use.

```python
(
    df
    .groupby('grouping_col')
    .agg(output_col = ('old_col', 'aggregator'))
)
```

**Example**: Find the number of chickens (the size of `chick`) and the mean `weight` when grouped by `time`.

In [None]:
(
    chickweight
    .groupby('time')
    .agg(num_chickens = ('chick', 'size'), 
         weight_mean = ('weight', 'mean'),
        )
)

Note: this way of aggregating became available in pandas 0.25, so you need `pandas >= 0.25`. To check your pandas version you can run the following cell:

In [None]:
pd.__version__
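
The `(column, aggregator)` tuples used above are shorthand for `pd.NamedAgg`, which spells out the two parts by name. A minimal sketch comparing both forms on a made-up `toy` dataframe:

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    'time': [0, 0, 2, 2],
    'chick': [1, 2, 1, 2],
    'weight': [40, 42, 50, 58],
})

# Tuple form, as in the example above.
with_tuples = toy.groupby('time').agg(
    num_chickens=('chick', 'size'),
    weight_mean=('weight', 'mean'),
)

# Equivalent, more explicit form using pd.NamedAgg.
with_namedagg = toy.groupby('time').agg(
    num_chickens=pd.NamedAgg(column='chick', aggfunc='size'),
    weight_mean=pd.NamedAgg(column='weight', aggfunc='mean'),
)
```

Both calls should give the same dataframe. The tuple form is shorter, while `pd.NamedAgg` can be easier to read when there are many aggregations.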

<a id = 'ex-aggs'></a>
### <mark>Exercise: Using aggregations</mark>

Determine the following aggregate information **per time period**:

- maximum chick id (use the chick column)
- median weight
- std (standard deviation) of weight

- any extras of your choice

**Bonus**: Once you have performed the `.groupby()`, filter the data to only show the time periods where the median weight exceeds `150g`.

In [None]:
# %load answers/03_Aggregations/ex-aggs.py

# Conclusion

You have now seen how to do a single aggregation, for example:

```python
(
    df
    .groupby('column')
    .mean()
)
```

And also that you can do multiple aggregations, with control over the output column name, for example:

```python
(
    df
    .groupby('column')
    .agg(min_col = ('old_col', 'min'),
         max_col = ('old_col', 'max')
        )
)
```

An important thing to note is that `.groupby()` returns a GroupBy object, which is **not a DataFrame until you perform some kind of aggregation function**.

Once it is a dataframe, you can use any DataFrame method on it, for example `.loc[]`.
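
A quick way to see this for yourself is to check the types before and after aggregating. This is a minimal sketch on a made-up `toy` dataframe:

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({'diet': [1, 1, 2], 'weight': [40, 60, 80]})

grouped = toy.groupby('diet')   # a GroupBy object, not yet aggregated
result = grouped.mean()         # aggregating returns a DataFrame

is_df_before = isinstance(grouped, pd.DataFrame)
is_df_after = isinstance(result, pd.DataFrame)

# DataFrame methods such as .loc[] work on the aggregated result.
heavy = result.loc[lambda df: df['weight'] > 50]
```

`is_df_before` is `False` and `is_df_after` is `True`, and `heavy` keeps only the diets whose mean weight is above 50.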