# Part 2.2 | Numerical Variables by Category

In this exercise, we'll explore how Starbucks customers respond to different promotional offers. The data contains customer interactions with various discount structures (BOGO, $2 off $10, $5 off $20, etc.).

**Key Question:** How does the incentive structure affect buying behavior?

## Setup

Run this cell to import libraries and load the data.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data
file_path = 'https://tayweid.github.io/econ-0150/parts/part-2-2/data/'
data = pd.read_csv(file_path + 'Starbucks_Promos.csv', index_col=0)
data.head()

---

## Exercise 2.2 | Revenue by Offer Type

**Task:** Visualize the data to answer whether Bogo 5 or Bogo 10 has higher average spending.

Create a boxplot with `Offer ID` on the x-axis and `Revenue` on the y-axis.
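
As a reference for the plotting call, here is a minimal sketch on a made-up toy table rather than the Starbucks data. The offer labels `A` and `B` and the revenue values are assumptions for illustration only; the column names match the ones used in this exercise.

```python
import pandas as pd
import seaborn as sns

# Toy stand-in for the real data; labels and values are invented.
toy = pd.DataFrame({
    "Offer ID": ["A", "A", "A", "B", "B", "B"],
    "Revenue": [0.0, 4.5, 12.0, 0.0, 9.0, 30.0],
})

# Categorical variable on the x-axis, numerical variable on the y-axis:
# seaborn draws one box per category.
ax = sns.boxplot(data=toy, x="Offer ID", y="Revenue")
ax.set_title("Revenue by offer (toy data)")
```

The same call should work on `data` once the toy table is swapped out.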

In [None]:
# Your code here


**Observation:** The distribution is hard to see — why are so many values compressed at zero?

---

## Log Transform for Skewed Data

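When data is heavily skewed, a log transformation can help us see the distribution better. We'll use log base 2 of one plus revenue, so that zero revenue maps to 0 instead of being undefined. Each one-unit increase then corresponds to roughly a **doubling** of spending.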

- log2(1+$7) = 3
- log2(1+$15) = 4  
- log2(1+$31) = 5
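
The arithmetic above can be checked with a short sketch on toy values. The revenue numbers are chosen for illustration; `log2_Revenue` is the column name used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy revenue values chosen so that 1 + revenue is a power of two.
toy = pd.DataFrame({"Revenue": [0, 1, 3, 7, 15, 31]})

# Adding 1 before taking the log keeps zero revenue defined: log2(1 + 0) = 0.
toy["log2_Revenue"] = np.log2(1 + toy["Revenue"])
print(toy)
```

Each step up in `log2_Revenue` corresponds to `1 + Revenue` doubling.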

In [None]:
# Your code here


---

## Exercise 2.2 | Log Revenue by Offer Type

**Task:** Create a boxplot with the log-transformed variable to better see the distribution.

Use `log2_Revenue` instead of `Revenue` on the y-axis.

In [None]:
# Your code here


**Observation:** Now we can see the data better — but wait, why are there so many zeros?

---

## Exercise 2.2 | Investigate the Data

**Task:** Count the unique values in the `Event` column to understand what's causing the zeros.

Use `.value_counts()` to see how many of each event type we have.
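
A minimal sketch of how `.value_counts()` behaves, using a toy column. The event labels other than `'transaction'` are assumptions and may not match the labels in the real data.

```python
import pandas as pd

# Toy event log; labels other than 'transaction' are hypothetical.
toy = pd.DataFrame({
    "Event": [
        "offer received", "transaction", "offer completed",
        "transaction", "transaction",
    ]
})

# Counts the rows for each distinct value, most frequent first.
counts = toy["Event"].value_counts()
print(counts)
```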

In [None]:
# Your code here


**Observation:** Not all rows are purchases! Offers and completions have zero revenue — only transactions are real spending.

---

## Exercise 2.2 | Filter for Transactions

**Task:** Keep only rows where `Event` equals `'transaction'` to focus on actual purchases.

Create a new dataframe called `transactions` with only the transaction rows.
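
Filtering with a boolean mask can be sketched on toy data as follows. The table and the name `toy_transactions` are illustrative assumptions, not part of the exercise data.

```python
import pandas as pd

# Toy table mixing purchases with non-purchase events (invented values).
toy = pd.DataFrame({
    "Event": ["offer received", "transaction", "offer completed", "transaction"],
    "Revenue": [0.0, 8.0, 0.0, 21.5],
})

# The comparison yields a True/False value per row;
# indexing with it keeps only the True rows.
mask = toy["Event"] == "transaction"
toy_transactions = toy[mask].copy()  # .copy() avoids chained-assignment warnings later
print(toy_transactions)
```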

In [None]:
# Your code here


---

## Exercise 2.2 | Visualize Filtered Data

**Task:** Create a boxplot of log revenue by offer type using only transactions.

Use the filtered `transactions` dataframe.

In [None]:
# Your code here


**Observation:** Now every row is a real purchase. Which offer type has higher spending?

---

## Exercise 2.2 | Grouped Statistics

**Task:** Calculate the mean, standard deviation, and count of log revenue by offer type.

Use `.groupby()` and `.agg()` to compute multiple statistics at once.
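
One way to combine `.groupby()` and `.agg()` is sketched below on toy values. The offer labels `A` and `B` and the numbers are assumptions for illustration.

```python
import pandas as pd

# Toy log-revenue values for two hypothetical offers.
toy = pd.DataFrame({
    "Offer ID": ["A", "A", "B", "B", "B"],
    "log2_Revenue": [2.0, 4.0, 3.0, 5.0, 4.0],
})

# Group rows by category, select the numerical column,
# then compute several statistics in one call.
stats = toy.groupby("Offer ID")["log2_Revenue"].agg(["mean", "std", "count"])
print(stats)
```

The result has one row per group and one column per statistic.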

In [None]:
# Your code here


**Observation:** 5off20 has the highest mean — but is that the whole story? There's substantial variation within each group.

---

## Exercise 2.2 | Compare Two Offers

**Task:** Filter for just Bogo 5 and Bogo 10, then create a boxplot to compare them.

Use `.isin()` to filter for multiple values.
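
A short sketch of `.isin()` on toy data, assuming hypothetical offer labels `A`, `B`, and `C`:

```python
import pandas as pd

# Toy table with three hypothetical offers.
toy = pd.DataFrame({
    "Offer ID": ["A", "B", "C", "A", "C"],
    "Revenue": [5.0, 9.0, 2.0, 7.0, 11.0],
})

# .isin() returns True for rows whose value appears in the list,
# so one mask can keep several categories at once.
subset = toy[toy["Offer ID"].isin(["A", "B"])]
print(subset)
```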

In [None]:
# Your code here


**Observation:** Bogo 10 has higher average spending, but look at the overlap! Many Bogo 5 buyers spent more than many Bogo 10 buyers. When distributions overlap this much, is the difference meaningful?

---

## Summary

### The Workflow: Filter → Transform → Group → Visualize

1. **Filter** — keep only relevant rows
2. **Transform** — log scale for skewed data
3. **Group** — organize by a categorical variable
4. **Visualize** — compare distributions across groups
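
The first three steps can be chained in a few lines. The sketch below uses an invented toy table (labels `A` and `B`, made-up revenues, and a hypothetical `purchases` name); the fourth step would pass the filtered table to a boxplot call.

```python
import numpy as np
import pandas as pd

# Toy data: one non-purchase row plus four purchases across two offers.
toy = pd.DataFrame({
    "Event": ["offer received", "transaction", "transaction",
              "transaction", "transaction"],
    "Offer ID": ["A", "A", "A", "B", "B"],
    "Revenue": [0.0, 3.0, 15.0, 7.0, 31.0],
})

# 1. Filter: keep only real purchases.
purchases = toy[toy["Event"] == "transaction"].copy()

# 2. Transform: log base 2 of (1 + revenue).
purchases["log2_Revenue"] = np.log2(1 + purchases["Revenue"])

# 3. Group: summary statistics per offer.
summary = purchases.groupby("Offer ID")["log2_Revenue"].agg(["mean", "std", "count"])
print(summary)
```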

### Key Takeaways

- **Summary statistics can hide problems** — always visualize
- **Filter your data** to make sure you're analyzing what you think you are
- **Log transformation** helps with skewed data
- **Boxplots by category** show distributions, not just means
- **Overlapping distributions** raise inference questions