# Recap from last week

1. Load Titanic data set
2. Count number of missing values in column `Age`
3. Compute mean age
4. Replace missing values with mean age rounded to nearest integer
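
The four recap steps can be sketched as follows. This is a minimal illustration under stated assumptions: the tiny `df` below is made-up data standing in for the Titanic data set (in the lecture, the real data would be read with `pd.read_csv()` from the location given by `DATA_PATH`).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic data set (step 1)
df = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0]})

# Step 2: count missing values in column Age
n_missing = df["Age"].isna().sum()

# Step 3: compute mean age (missing values are ignored by default)
mean_age = df["Age"].mean()

# Step 4: replace missing values with the mean rounded to the nearest integer
mean_age_rounded = np.round(mean_age)
df["Age"] = df["Age"].fillna(mean_age_rounded)

print(f"Missing: {n_missing}, mean age: {mean_age:.2f}")
```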

```python
# Use files in the local data/ directory (default)
DATA_PATH = '../data'

# Uncomment this to load data directly from GitHub instead
# DATA_PATH = 'https://raw.githubusercontent.com/richardfoltyn/TECH2-H24/main/data'
```

***

# Grouping and aggregation with pandas

## Aggregation and reduction

Grouped aggregations follow the *split-apply-combine* pattern:

1. *Split* data into groups based on some criteria;
2. *Apply* some function to each group separately; and
3. *Combine* the results into a single `DataFrame` or `Series`.

See also the pandas [cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) for an illustration of such operations.

### Working with entire DataFrames

- Apply functions such as `mean()`, `min()`, `max()`, etc. to entire columns
- See Titanic example from earlier

### Working on subsets of data (grouping)

- [`groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) allows us to apply operations to *subsets* of data

*Example: Compute means by class (first, second, third)*
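
A minimal sketch of this example, assuming a made-up toy `DataFrame` whose column names (`Pclass`, `Age`, `Fare`) mimic the Titanic data:

```python
import pandas as pd

# Hypothetical toy data with Titanic-like column names
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [38.0, 54.0, 30.0, 14.0, 22.0, 26.0],
    "Fare": [71.0, 53.0, 13.0, 30.0, 7.0, 8.0],
})

# Split by class, apply mean() to each group, combine into one DataFrame
means = df.groupby("Pclass").mean()
print(means)
```

The result has one row per class, with the group key `Pclass` as the index.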

#### Built-in aggregations

There are numerous routines to aggregate grouped data, for example:

- [`mean()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.mean.html):
    averages within each group
- [`sum()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.sum.html):
    sum values within each group
- [`std()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.std.html), 
    [`var()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.var.html): 
    within-group standard deviation and variance
- [`quantile()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html):
    compute quantiles within each group
- [`size()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.size.html): 
    number of observations in each group
- [`count()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.count.html):
    number of non-missing observations in each group
- [`first()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.first.html), 
    [`last()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.last.html): 
    first and last elements in each group
- [`min()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.min.html), 
    [`max()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.max.html): 
    minimum and maximum elements within a group

See the [official documentation](https://pandas.pydata.org/docs/user_guide/groupby.html#built-in-aggregation-methods) for a complete list.

*Example: Number of elements within each group*
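
A short sketch contrasting `size()` and `count()`, again on made-up toy data. The missing ages are included on purpose to show that `count()` skips them while `size()` does not:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; two ages are missing
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Age": [38.0, np.nan, 30.0, 22.0, np.nan, 26.0],
})

# Number of observations in each group (missing values included)
n_obs = df.groupby("Pclass").size()

# Number of non-missing ages in each group
n_age = df.groupby("Pclass")["Age"].count()

print(n_obs)
print(n_age)
```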

*Example: Return first observation of each group*
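
One way this could look, assuming a made-up toy `DataFrame` without missing values (note that `first()` returns the first *non-missing* entry of each column, so results can differ from the first row when data are missing):

```python
import pandas as pd

# Hypothetical toy data, deliberately not sorted by class
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 2, 1],
    "Age": [22.0, 38.0, 26.0, 30.0, 54.0],
    "Fare": [7.0, 71.0, 8.0, 13.0, 53.0],
})

# First observation within each class, in order of appearance
first_obs = df.groupby("Pclass").first()
print(first_obs)
```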

<div class="alert alert-info">
<h3> Your turn</h3>
Use the Titanic data set to perform the following aggregations:
<ol>
    <li>Compute the average survival rate by sex (stored in the <TT>Sex</TT> column).</li>
    <li>Count the number of passengers aged 50+. Compute the average survival rate by sex for this group.</li>
    <li>Count the number of passengers below the age of 20 by class and sex. Compute the average survival rate for this group (by class and sex).</li>
</ol>
</div>

#### Writing custom aggregations

- Built-in functions don't cover all possible use cases
- Apply custom aggregation functions using
[`agg()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html)
(short-hand for [`aggregate()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html))
- Functions operate on each column separately
- Functions can be passed as a string (e.g., `"mean"`), as a function reference (e.g., `np.mean`), or as a lambda expression

*Example: Compute mean age by class using `agg()`*
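
A minimal sketch on made-up toy data, showing that passing the string `"mean"` and an equivalent lambda expression give the same result:

```python
import pandas as pd

# Hypothetical toy data with Titanic-like column names
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [38.0, 54.0, 30.0, 14.0, 22.0, 26.0],
})

groups = df.groupby("Pclass")

# Aggregation passed as a string
mean_str = groups["Age"].agg("mean")

# Same aggregation passed as a lambda expression
mean_lambda = groups["Age"].agg(lambda x: x.mean())

print(mean_str)
```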

*Example: Count number of passengers aged 40+ by class*
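
There is no built-in aggregation for this, so a custom function is one option. The sketch below uses made-up toy data and a lambda expression that counts the ages of at least 40 within each group:

```python
import pandas as pd

# Hypothetical toy data with Titanic-like column names
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3],
    "Age": [38.0, 54.0, 62.0, 45.0, 30.0, 22.0],
})

# (x >= 40) is a boolean Series; summing it counts the True values
n_40plus = df.groupby("Pclass")["Age"].agg(lambda x: (x >= 40).sum())
print(n_40plus)
```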

#### Applying multiple functions at once

- Multiple functions applied to the same column: `.agg(['function1', 'function2'])`

*Example: Compute mean and median age by class*
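
A minimal sketch on made-up toy data. Passing a list of function names returns one column per function:

```python
import pandas as pd

# Hypothetical toy data with Titanic-like column names
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2],
    "Age": [30.0, 40.0, 80.0, 20.0, 30.0],
})

# One output column per aggregation function
stats = df.groupby("Pclass")["Age"].agg(["mean", "median"])
print(stats)
```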

- Apply multiple functions to different columns (["named aggregation"](https://pandas.pydata.org/docs/user_guide/groupby.html#named-aggregation)):

```python
    groups.agg(
        new_column_name1=('column_name1', 'operation1'),
        new_column_name2=('column_name2', 'operation2'),
        ...
    )
```

*Example: Compute maximum fare and mean age by class*
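
Filling in the named-aggregation template above for this example might look as follows. The data are made up, and the output column names `max_fare` and `mean_age` are arbitrary choices for illustration:

```python
import pandas as pd

# Hypothetical toy data with Titanic-like column names
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [38.0, 54.0, 30.0, 14.0, 22.0, 26.0],
    "Fare": [71.0, 53.0, 13.0, 30.0, 7.0, 8.0],
})

# Each keyword defines one output column as (input column, operation)
result = df.groupby("Pclass").agg(
    max_fare=("Fare", "max"),
    mean_age=("Age", "mean"),
)
print(result)
```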

<div class="alert alert-info">
<h3> Your turn</h3>
Use the Titanic data set to perform the following aggregations:
<ol>
    <li>Compute the minimum, maximum and average age by embarkation port (stored in the column <TT>Embarked</TT>) in a single <TT>agg()</TT> operation.
    Note that there are several ways to solve this problem.</li>
    <li>Compute the number of passengers, the average age and the fraction of women by embarkation port in a single <TT>agg()</TT> operation. This one is more challenging and probably requires use of <TT>lambda</TT> expressions.</li>
</ol>
</div>

***

## Transformations

- Aggregations & reductions _reduce_ the dimensionality of the result (e.g., series of data => mean)
- Transformations apply group-level operations to each _observation_, so the dimension of the data remains unchanged
- Transformations can be applied using [`transform()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.transform.html)

*Example: Compute average fare by class and assign it to each observation*
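
A minimal sketch on made-up toy data. Unlike an aggregation, `transform()` returns one value per original row, so the result can be stored directly as a new column (the name `Fare_mean` is a hypothetical choice):

```python
import pandas as pd

# Hypothetical toy data with Titanic-like column names
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3],
    "Fare": [70.0, 50.0, 10.0, 30.0, 8.0],
})

# Class-level average fare, repeated for every passenger in that class
df["Fare_mean"] = df.groupby("Pclass")["Fare"].transform("mean")
print(df)
```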

*Example: Deviation from average fare*
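
Because the transformed series is aligned with the original rows, it can be subtracted from the fare column directly. A sketch on made-up toy data (the column name `Fare_dev` is a hypothetical choice):

```python
import pandas as pd

# Hypothetical toy data with Titanic-like column names
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3],
    "Fare": [70.0, 50.0, 10.0, 30.0, 8.0],
})

# Individual fare minus the average fare within the passenger's class
df["Fare_dev"] = df["Fare"] - df.groupby("Pclass")["Fare"].transform("mean")
print(df)
```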

<div class="alert alert-info">
<h3> Your turn</h3>
Use the Titanic data set to perform the following transformation:
<ol>
    <li>Compute the <i>excess</i> fare paid by each passenger relative to the minimum fare by embarkation port and class, i.e., compute <i>Fare - min(Fare)</i>
        by port and class.</li>
</ol>
</div>

***
# Working with time series data

- Time series data: indexed by time stamp, date, etc.
- Example: Quarterly GDP since 1950
- Pandas has comprehensive support for time series data

*Example: Create artificial daily data*

Construct three months of daily data from 2024-01-01 to 2024-03-31 using the 
[`date_range()`](https://pandas.pydata.org/docs/reference/api/pandas.date_range.html) function.
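
A minimal sketch of how such data could be built. The variable names `index` and `ts`, the random seed, and the use of normally distributed values are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Daily time stamps from 2024-01-01 to 2024-03-31 (both end points included)
index = pd.date_range("2024-01-01", "2024-03-31", freq="D")

# Artificial data: one random draw per day (seed chosen arbitrarily)
rng = np.random.default_rng(1234)
values = rng.normal(size=len(index))

# Series indexed by date
ts = pd.Series(values, index=index)
print(ts.head())
```

Since 2024 is a leap year, the index contains 31 + 29 + 31 = 91 days.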

***

## Indexing with date/time indices

- Can pass dates, date ranges, etc., directly to `.loc[]`
- Supports partial indexing

*Example: Select single date, date range, whole months*
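
The selections named in this example could be sketched as follows, assuming a made-up daily series `ts` whose values simply count the days since 2024-01-01:

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: value = number of days since 2024-01-01
index = pd.date_range("2024-01-01", "2024-03-31", freq="D")
ts = pd.Series(np.arange(len(index), dtype=float), index=index)

# Single date
one_day = ts.loc["2024-01-15"]

# Date range (unlike integer slicing, both end points are included)
some_days = ts.loc["2024-01-30":"2024-02-02"]

# Partial indexing: all observations in February 2024
february = ts.loc["2024-02"]

# Partial indexing with a range: February and March 2024
feb_mar = ts.loc["2024-02":"2024-03"]

print(some_days)
```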

***

## Lags, differences, and other useful transformations

Common time-series operations:

- [`shift()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html): move observations forward/backward
- [`diff()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.diff.html): compute difference across periods
- [`pct_change()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pct_change.html): compute percentage change across periods
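
The three methods can be compared side by side on a short made-up series. The column names in `result` are hypothetical labels chosen for this sketch:

```python
import pandas as pd

# Hypothetical daily series with four observations
index = pd.date_range("2024-01-01", periods=4, freq="D")
ts = pd.Series([100.0, 110.0, 121.0, 108.9], index=index)

result = pd.DataFrame({
    "value": ts,
    # Lag: value from the previous period
    "lag1": ts.shift(1),
    # First difference: value minus previous value
    "diff": ts.diff(),
    # Relative change (as a fraction, not in percent)
    "pct_change": ts.pct_change(),
})
print(result)
```

The first row of each derived column is missing because there is no previous observation to compare with.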


***
## Resampling and aggregation

- Resampling is like `groupby()`, but applied to time periods
- Observations can be grouped by year (`'YE'`), quarter (`'QE'`), month (`'ME'`), week (`'W'`), etc.
- Apply built-in methods to the grouped object just like with `groupby()` aggregation

*Example: Compute monthly averages*

*Example: Select last weekly observation*
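
A sketch using the same kind of made-up daily series. With the `'W'` alias, pandas by default forms weeks that end on Sunday, so `last()` returns each Sunday's value here:

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: value = number of days since 2024-01-01 (a Monday)
index = pd.date_range("2024-01-01", "2024-03-31", freq="D")
ts = pd.Series(np.arange(len(index), dtype=float), index=index)

# Group observations by week and keep the last observation of each week
weekly_last = ts.resample("W").last()
print(weekly_last.head())
```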