# GroupBy

## Split-Apply-Combine Strategy in Pandas

## Prerequisites

- Functions
- pandas introduction (1 & 2)
- Reshape

## Outcomes

- Understand the split-apply-combine strategy
- Use basic aggregation methods on `df.groupby`
- Group by multiple keys

In [None]:
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

## Split-Apply-Combine Strategy

Three steps:

1. **Split**: split the data into groups based on values in one or more columns
2. **Apply**: apply a function or routine to each group separately
3. **Combine**: combine the output into a DataFrame, using group identifiers as the index
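
To make the three steps concrete, here is a minimal sketch that performs them by hand on a tiny made-up table and then compares the result with a single `groupby` call. The names `toy`, `pieces`, `manual`, and `auto` are illustrative only and are not used elsewhere in this lecture.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1.0, 2.0, 3.0, 4.0]})

# Split: one sub-DataFrame per distinct key
pieces = {k: toy.loc[toy["key"] == k] for k in toy["key"].unique()}

# Apply: reduce each piece to a single number
means = {k: piece["value"].mean() for k, piece in pieces.items()}

# Combine: collect the results, indexed by the group key
manual = pd.Series(means, name="value").rename_axis("key")

# groupby performs all three steps in one call
auto = toy.groupby("key")["value"].mean()

print(manual)
print(auto)
```

Both Series hold the same group means; `groupby` simply saves us from writing the loop.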

## Sample Data

Let's create a simple dataset to explore the concepts:

In [None]:
C = np.arange(1, 7, dtype=float)
C[[3, 5]] = np.nan
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2],
    "B": [1, 1, 2, 2, 1, 1],
    "C": C,
})
df

## The Split Step

Call `groupby` on the DataFrame:

In [None]:
gbA = df.groupby("A")
type(gbA)
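
The object returned by `groupby` describes the split; no aggregate has been computed yet. One way to inspect it, sketched here on a rebuilt copy of the sample data (the name `demo` is ours), is to iterate over it, which yields one `(key, sub-DataFrame)` pair per group.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data so this cell runs on its own
C = np.arange(1, 7, dtype=float)
C[[3, 5]] = np.nan
demo = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2], "B": [1, 1, 2, 2, 1, 1], "C": C})

gb = demo.groupby("A")
print("number of groups:", gb.ngroups)

# Each iteration gives the group key and the rows belonging to it
group_sizes = {}
for key, group in gb:
    group_sizes[int(key)] = len(group)
    print(f"A = {key}: {len(group)} rows")
```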

## Examining Groups

Use `get_group()` to see individual groups:

In [None]:
gbA.get_group(1)

In [None]:
gbA.get_group(2)

## Grouping by Multiple Columns

Pass a list of column names to group by multiple keys:

In [None]:
gbAB = df.groupby(["A", "B"])
gbAB.get_group((1, 1))

In [None]:
gbAB.count()
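
Note that `count` reports non-missing values per column, not rows per group. The sketch below, on a rebuilt copy of the sample data (`demo` is our name), contrasts `count` with `size` and shows another built-in aggregation, `mean`, which also skips missing values.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data so this cell runs on its own
C = np.arange(1, 7, dtype=float)
C[[3, 5]] = np.nan
demo = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2], "B": [1, 1, 2, 2, 1, 1], "C": C})

gb = demo.groupby(["A", "B"])

counts = gb.count()  # non-missing values in each remaining column
sizes = gb.size()    # number of rows in each group, missing or not
means = gb.mean()    # NaN only where a group has no valid values

print(counts)
print(sizes)
print(means)
```

The group `(2, 2)` has one row but a count of zero for `C`, because its only value of `C` is missing.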

## Custom Aggregate Functions

Write custom aggregations and pass them to `.agg()`:

In [None]:
def num_missing(df):
    """Return the number of missing items in each column of df."""
    return df.isnull().sum()

num_missing(df)

In [None]:
gbA.agg(num_missing)
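
When `.agg()` receives a function, it calls that function on each column of each group. If you want different aggregations for different columns, one option is pandas' named aggregation syntax, sketched below on a rebuilt copy of the sample data. The output column names (`c_missing`, `c_mean`, `b_max`) and the name `demo` are our own choices.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data so this cell runs on its own
C = np.arange(1, 7, dtype=float)
C[[3, 5]] = np.nan
demo = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2], "B": [1, 1, 2, 2, 1, 1], "C": C})

def num_missing(col):
    """Return the number of missing items in col."""
    return col.isnull().sum()

# Each keyword is an output column: (input column, aggregation)
summary = demo.groupby("A").agg(
    c_missing=("C", num_missing),  # custom function
    c_mean=("C", "mean"),          # built-in aggregation by name
    b_max=("B", "max"),
)
print(summary)
```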

## Transforms: The `apply` Method

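Use `.apply()` to run an arbitrary function on each group. The function receives one group at a time as a DataFrame, and pandas combines the results: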

In [None]:
def smallest_by_b(df):
    return df.nsmallest(2, "B")

gbA.apply(smallest_by_b, include_groups=False)
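
Unlike an aggregation, the function passed to `.apply()` may return several rows per group. The sketch below repeats the idea on a rebuilt copy of the sample data (`demo` is our name) and inspects the index of the result. Here we select the columns `["B", "C"]` before calling `.apply()`, which is an alternative to `include_groups=False` that should also work on older pandas versions.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data so this cell runs on its own
C = np.arange(1, 7, dtype=float)
C[[3, 5]] = np.nan
demo = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2], "B": [1, 1, 2, 2, 1, 1], "C": C})

def smallest_by_b(group):
    """Keep the two rows of the group with the smallest values of B."""
    return group.nsmallest(2, "B")

# Selecting the non-key columns first keeps the grouping column out of each group
result = demo.groupby("A")[["B", "C"]].apply(smallest_by_b)
print(result)

# The combined index pairs each group key with the original row label
print(result.index.tolist())
```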

## `pd.Grouper`

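Use `pd.Grouper` when a column name alone does not describe the grouping, for example when grouping a date column by a time frequency or grouping by a level of the index: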

In [None]:
df2 = df.copy()
df2["Date"] = pd.date_range(
    start=pd.Timestamp.today().strftime("%m/%d/%Y"),
    freq="BQE",
    periods=df.shape[0]
)
df2 = df2.set_index("A")
df2

In [None]:
# Group by year
df2.groupby(pd.Grouper(key="Date", freq="YE")).count()

In [None]:
# Group by index level
df2.groupby(pd.Grouper(level="A")).count()
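
For intuition about grouping by year, a rough alternative is to group by a key derived from the date column. The sketch below uses made-up quarter-end dates (the name `events` is ours). It is not identical to `pd.Grouper`: the group labels here are integer years rather than timestamps.

```python
import numpy as np
import pandas as pd

# Made-up data with fixed dates so the output is reproducible
events = pd.DataFrame({
    "Date": pd.to_datetime([
        "2015-03-31", "2015-06-30", "2015-09-30",
        "2015-12-31", "2016-03-31", "2016-06-30",
    ]),
    "C": [1.0, 2.0, 3.0, np.nan, 5.0, np.nan],
})

# A Series of keys can be passed to groupby in place of a column name
per_year = events.groupby(events["Date"].dt.year)["C"].count()
print(per_year)
```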

## Case Study: Airline Delays

A real-world example using airline performance data for December 2016.

In [None]:
url = "https://datascience.quantecon.org/assets/data/airline_performance_dec16.csv.zip"
air_dec = pd.read_csv(url, parse_dates=['Date'])
air_dec.head()

## Weekly Average Delays

Compute the average arrival delay for each carrier in each week:

In [None]:
weekly_delays = (
    air_dec
    .groupby([pd.Grouper(key="Date", freq="W"), "Carrier"])
    ["ArrDelay"]
    .mean()
    .unstack(level="Carrier")
)
weekly_delays
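
To see what this chain does at each step, here is the same pattern on a small made-up table. The values and the name `flights` are invented for illustration; only the column names `Date`, `Carrier`, and `ArrDelay` mirror the airline data.

```python
import pandas as pd

# Made-up flights spanning two calendar weeks
flights = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-12-05", "2016-12-06", "2016-12-05",
        "2016-12-12", "2016-12-13", "2016-12-14",
    ]),
    "Carrier": ["AA", "AA", "DL", "AA", "DL", "DL"],
    "ArrDelay": [10.0, 20.0, 0.0, -5.0, 30.0, 10.0],
})

# Group by (week, carrier); the "W" frequency labels each week by its Sunday
long_form = (
    flights
    .groupby([pd.Grouper(key="Date", freq="W"), "Carrier"])
    ["ArrDelay"]
    .mean()
)

# Move the carrier level from the index to the columns
weekly = long_form.unstack(level="Carrier")
print(weekly)
```

After `unstack`, each row is a week and each column is a carrier, which is the shape the plotting code expects.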

## Visualizing Weekly Delays

In [None]:
axs = weekly_delays.plot.bar(
    figsize=(10, 8), subplots=True, legend=False, sharex=True,
    sharey=True, layout=(4, 3), grid=False
)

axs[0,0].get_figure().tight_layout()
for ax in axs[-1, :]:
    ax.set_xticklabels(weekly_delays.index.strftime("%a, %b. %d"))

## Analyzing Delay Types

The data records five categories of delay, each in its own column:

In [None]:
delay_cols = [
    'CarrierDelay',
    'WeatherDelay',
    'NASDelay',
    'SecurityDelay',
    'LateAircraftDelay'
]

## Pre-Christmas Week Analysis

In [None]:
pre_christmas = air_dec.loc[
    (air_dec["Date"] >= "2016-12-12") & (air_dec["Date"] <= "2016-12-18")
]

def positive(df):
    """Count the entries of df that are strictly positive."""
    return (df > 0).sum()

delay_totals = pre_christmas.groupby("Carrier")[delay_cols].agg(["sum", "mean", positive])
delay_totals
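
Passing a list to `.agg()` applies every aggregation to every selected column, so the result has two levels of column labels: the delay type and the aggregation. A minimal sketch on made-up numbers (the name `toy` and its values are invented) shows the layout:

```python
import pandas as pd

# Made-up delays in minutes, two flights per carrier
toy = pd.DataFrame({
    "Carrier": ["AA", "AA", "DL", "DL"],
    "WeatherDelay": [0.0, 12.0, 5.0, 0.0],
    "NASDelay": [3.0, 0.0, 0.0, 0.0],
})

def positive(col):
    """Count the entries of col that are strictly positive."""
    return (col > 0).sum()

summary = toy.groupby("Carrier")[["WeatherDelay", "NASDelay"]].agg(
    ["sum", "mean", positive]
)
print(summary)

# Columns are (delay type, aggregation) pairs; a custom function
# is labeled by its name
print(summary.columns.tolist())
```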

## Reshaping for Visualization

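Reshape the table so that delay types form the rows and the columns are indexed first by aggregation and then by carrier. Each aggregation can then be plotted with one subplot per carrier: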

In [None]:
reshaped_delays = (
    delay_totals
    .stack()
    .T
    .swaplevel(axis=1)
    .sort_index(axis=1)
)
reshaped_delays.head()
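
The reshaping chain is easier to follow one step at a time. The sketch below applies the same four methods to a small made-up summary table (the name `toy` and its values are invented) and prints the row and column labels after each step.

```python
import pandas as pd

# Made-up delays in minutes, two flights per carrier
toy = pd.DataFrame({
    "Carrier": ["AA", "AA", "DL", "DL"],
    "WeatherDelay": [0.0, 12.0, 5.0, 0.0],
    "NASDelay": [3.0, 0.0, 0.0, 0.0],
})

def positive(col):
    """Count the entries of col that are strictly positive."""
    return (col > 0).sum()

summary = toy.groupby("Carrier")[["WeatherDelay", "NASDelay"]].agg(
    ["sum", "mean", positive]
)

step1 = summary.stack()            # rows: (Carrier, aggregation); columns: delay type
step2 = step1.T                    # rows: delay type; columns: (Carrier, aggregation)
step3 = step2.swaplevel(axis=1)    # columns: (aggregation, Carrier)
reshaped = step3.sort_index(axis=1)  # group columns with the same aggregation together

for name, frame in [("stack", step1), ("T", step2), ("sorted", reshaped)]:
    print(name, "| rows:", frame.index.names, "| columns:", frame.columns.names)

# Selecting one aggregation leaves delay types as rows and carriers as columns
print(reshaped["mean"])
```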

## Plotting Delay Types

In [None]:
for agg in ["mean", "sum", "positive"]:
    axs = reshaped_delays[agg].plot(
        kind="bar", subplots=True, layout=(4, 3), figsize=(10, 8), legend=False,
        sharex=True, sharey=True
    )
    fig = axs[0, 0].get_figure()
    fig.suptitle(agg)

## Automating Analysis with Functions

Create reusable functions for repeated analysis:

In [None]:
def mean_delay_plot(df, freq, figsize=(10, 8)):
    """Make a bar chart of average flight delays for each carrier at a given frequency."""
    mean_delays = (
        df
        .groupby([pd.Grouper(key="Date", freq=freq), "Carrier"])
        ["ArrDelay"]
        .mean()
        .unstack(level="Carrier")
    )
    
    axs = mean_delays.plot.bar(
        figsize=figsize, subplots=True, legend=False, sharex=True,
        sharey=True, layout=(4, 3), grid=False
    )
    
    axs[0, 0].get_figure().tight_layout()
    for ax in axs[-1, :]:
        ax.set_xticklabels(mean_delays.index.strftime("%a, %b. %d"))
    
    return axs
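
One possible refinement, shown here only as a sketch, is to separate the computation from the plotting so that the table can be checked without drawing anything. The helper name `mean_delay_table` and the made-up `flights` data are our own and are not part of the lecture's code.

```python
import pandas as pd

def mean_delay_table(df, freq):
    """Average arrival delay per carrier at the given frequency (hypothetical helper)."""
    return (
        df
        .groupby([pd.Grouper(key="Date", freq=freq), "Carrier"])
        ["ArrDelay"]
        .mean()
        .unstack(level="Carrier")
    )

# Made-up flights spanning two calendar weeks
flights = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-12-05", "2016-12-06", "2016-12-05",
        "2016-12-12", "2016-12-13", "2016-12-14",
    ]),
    "Carrier": ["AA", "AA", "DL", "AA", "DL", "DL"],
    "ArrDelay": [10.0, 20.0, 0.0, -5.0, 30.0, 10.0],
})

# The same function serves any frequency
daily = mean_delay_table(flights, "D")
weekly = mean_delay_table(flights, "W")
print(daily)
print(weekly)
```

A plotting function could then call `mean_delay_table` and pass the result to `.plot.bar`.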

## Daily Frequency Analysis

In [None]:
mean_delay_plot(air_dec, "D", figsize=(16, 8));

## Delay Type Analysis Function

In [None]:
def delay_type_plot(df, start, end):
    """Make bar charts for delay types between start and end dates."""
    sub_df = df.loc[(df["Date"] >= start) & (df["Date"] <= end)]
    
    def positive(df):
        """Count the entries of df that are strictly positive."""
        return (df > 0).sum()
    
    aggs = sub_df.groupby("Carrier")[delay_cols].agg(["sum", "mean", positive])
    reshaped = aggs.stack().T.swaplevel(axis=1).sort_index(axis=1)
    
    for agg in ["mean", "sum", "positive"]:
        axs = reshaped[agg].plot(
            kind="bar", subplots=True, layout=(4, 3), figsize=(10, 8), legend=False,
            sharex=True, sharey=True
        )
        fig = axs[0, 0].get_figure()
        fig.suptitle(agg)
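
The date filter inside `delay_type_plot` keeps rows whose `Date` lies between `start` and `end`, with both endpoints included. As a sketch on made-up data (the name `flights` and its values are invented), the same mask can also be written with `Series.between`:

```python
import pandas as pd

# Made-up data: one day before the window, three inside, one after
flights = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-12-11", "2016-12-12", "2016-12-15", "2016-12-18", "2016-12-19",
    ]),
    "ArrDelay": [5.0, -3.0, 12.0, 0.0, 7.0],
})

start, end = pd.Timestamp("2016-12-12"), pd.Timestamp("2016-12-18")

# Two comparisons combined with &, as in delay_type_plot
mask_ops = (flights["Date"] >= start) & (flights["Date"] <= end)

# Equivalent single call; both endpoints are included by default
mask_between = flights["Date"].between(start, end)

window = flights.loc[mask_between]
print(window)
```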

## Analyzing Specific Days

In [None]:
# December 17th and 18th together
delay_type_plot(air_dec, "2016-12-17", "2016-12-18")

In [None]:
# Only December 17th
delay_type_plot(air_dec, "2016-12-17", "2016-12-17")

## Key Takeaways

- **Split-Apply-Combine** is a powerful paradigm for group analysis
- Use `groupby()` to split data into groups
- Apply aggregations with built-in methods or custom functions
- `pd.Grouper` handles grouping by time frequency or by index level
- **Automate** repeated analyses with functions

## Exercise: Cohort Analysis

Analyze Shopify order data by customer cohort.

In [None]:
random.seed(42)
np.random.seed(42)

url = "https://datascience.quantecon.org/assets/data/shopify_orders.csv.zip"
orders = pd.read_csv(url)
orders.info()

In [None]:
orders.head()

## Exercise Goal

**Goal**: Compute monthly totals (orders, sales, quantity) by:
- Customer cohort (month of first order)
- Customer type (first-time vs returning)

### Suggested Steps:
1. Convert the `Day` column to datetime
2. Add a column with each customer's first order date
3. Group by three keys: order month, customer cohort, and customer type
4. Apply an aggregation
5. Reshape the results
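
The steps above can be sketched on a tiny made-up table before you try them on the real data. Everything except the column name `Day` is an assumption: `customer_id`, `sales`, and the rule used here to label a customer as first-time or returning are hypothetical and may not match the Shopify file.

```python
import pandas as pd

# Made-up orders; column names other than "Day" are hypothetical
orders_toy = pd.DataFrame({
    "Day": ["2017-01-05", "2017-01-20", "2017-02-03", "2017-02-10", "2017-03-01"],
    "customer_id": [1, 2, 1, 3, 2],
    "sales": [10.0, 20.0, 5.0, 8.0, 12.0],
})

# Step 1: convert Day to datetime
orders_toy["Day"] = pd.to_datetime(orders_toy["Day"])

# Step 2: first order date per customer, broadcast back to every row
first_order = orders_toy.groupby("customer_id")["Day"].transform("min")
orders_toy["cohort"] = first_order.dt.to_period("M")
orders_toy["month"] = orders_toy["Day"].dt.to_period("M")

# Assumed rule: an order is "first-time" if placed in the customer's cohort month
is_first = orders_toy["month"] == orders_toy["cohort"]
orders_toy["customer_type"] = is_first.map({True: "first-time", False: "returning"})

# Steps 3 and 4: group by three keys and aggregate
totals = orders_toy.groupby(["month", "cohort", "customer_type"]).agg(
    orders=("sales", "count"),
    sales=("sales", "sum"),
)

# Step 5: reshape so customer types become columns
wide = totals["sales"].unstack("customer_type")
print(totals)
print(wide)
```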

## Practice Exercises

1. Explore aggregation methods on GroupBy objects
2. Write custom aggregation functions
3. Apply transforms with `.apply()`
4. Analyze delay patterns in airline data
5. Complete the cohort analysis exercise

## Questions?

### Resources:
- [Pandas GroupBy Documentation](https://pandas.pydata.org/pandas-docs/stable/groupby.html)
- Practice with real datasets
- Experiment with different aggregations