# ECON 0150 | Replication Notebook

**Title:** Inflation and Holiday Sales

**Original Author:** Wakim

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis. You can run this notebook yourself to explore the data, reproduce the findings, and try the extension exercises at the end.

## About This Replication

**Research Question:** How does November inflation affect U.S. holiday sales?

**Data Source:** Holiday sales (billions USD) and November CPI year-over-year inflation data (2004-2024)

**Methods:** First-differences model to address time trends

**Main Finding:** In the first-differences model, a 1 percentage point increase in the year-to-year change in November inflation is associated with a $6.07 billion larger change in holiday sales (p = 0.019). This suggests that nominal sales rise with inflation.

**Course Concepts Used:**
- OLS regression
- First-differencing to remove trends
- Residual analysis
- Addressing autocorrelation

---
## Step 0 | Setup

First, we import the necessary libraries and load the data.

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

In [None]:
# Load data from course website
data_url = 'https://tayweid.github.io/econ-0150/projects/replications/0067/data/holiday_sales_inflation.csv'
df = pd.read_csv(data_url)

# Preview the data
df.head()

In [None]:
# Check the shape and data types
print(f"Dataset has {len(df)} rows and {len(df.columns)} columns")
print(f"\nColumns: {list(df.columns)}")
print(f"\nYears: {df['Year'].min()} to {df['Year'].max()}")

---
## Step 1 | Data Exploration

We explore the time series patterns in our key variables.

In [None]:
# Summary statistics
df[['Sales_Billions', 'CPI-YOY (Nov )']].describe()

In [None]:
# Time series plot
fig, axes = plt.subplots(2, 1, figsize=(10, 8))

axes[0].plot(df['Year'], df['Sales_Billions'], marker='o')
axes[0].set_xlabel('Year')
axes[0].set_ylabel('Holiday Sales (Billions USD)')
axes[0].set_title('U.S. Holiday Sales Over Time')
axes[0].grid(True, alpha=0.3)

axes[1].plot(df['Year'], df['CPI-YOY (Nov )'], marker='o', color='orange')
axes[1].set_xlabel('Year')
axes[1].set_ylabel('November CPI YoY (%)')
axes[1].set_title('November Inflation Rate Over Time')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---
## Step 2 | Simple OLS (Levels)

First, we run a simple regression of sales on inflation in levels (not differences).

In [None]:
# Prepare data for regression
df_clean = df.dropna(subset=['CPI-YOY (Nov )', 'Sales_Billions'])

# Simple OLS in levels
X = df_clean['CPI-YOY (Nov )']
y = df_clean['Sales_Billions']
X_const = sm.add_constant(X)

model_levels = sm.OLS(y, X_const).fit()
# Print the full summary: Step 4 also cites R-squared and Durbin-Watson
print(model_levels.summary())

In [None]:
# Visualization with regression line
plt.figure(figsize=(8, 5))
plt.scatter(X, y, alpha=0.8)

x_grid = np.linspace(X.min(), X.max(), 100)
y_pred = model_levels.predict(sm.add_constant(x_grid))
plt.plot(x_grid, y_pred, color='red', linewidth=2)

plt.xlabel('November CPI Year-over-Year (%)')
plt.ylabel('Holiday Sales (Billions USD)')
plt.title('Holiday Sales vs November Inflation (2004-2024)')
plt.tight_layout()
plt.show()

The levels regression shows a positive but only marginally significant relationship (p = 0.059). However, this approach ignores the time trends in the data, and a regression on trending series can produce a spurious correlation.
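
The spurious-correlation concern can be illustrated with a minimal sketch on synthetic data. The two series below are hypothetical (not the holiday sales data) and are built to share nothing except an upward trend; the seed, slopes, and sample size are arbitrary choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the illustration is repeatable
t = np.arange(200)

# Two hypothetical series: independent noise around different linear trends
x = 0.5 * t + rng.normal(size=t.size)
y = 2.0 * t + rng.normal(size=t.size)

# Correlation in levels is driven almost entirely by the shared trend
corr_levels = np.corrcoef(x, y)[0, 1]

# Correlation of the period-to-period changes removes the trend
corr_diffs = np.corrcoef(np.diff(x), np.diff(y))[0, 1]

print(f"Correlation in levels:      {corr_levels:.3f}")
print(f"Correlation in differences: {corr_diffs:.3f}")
```

In levels the two unrelated series look almost perfectly correlated, while their changes are close to uncorrelated.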

---
## Step 3 | First Differences Model

To address the time trends, we use first differences: we model the *change* in sales as a function of the *change* in inflation.
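
As a quick illustration of what differencing does, here is a small sketch on a made-up table (the `toy` values are hypothetical). Each row's difference is that year's value minus the previous year's, so the first row has no predecessor and comes out as `NaN`.

```python
import pandas as pd

# Hypothetical values, only to show the mechanics of .diff()
toy = pd.DataFrame({
    "Year": [2021, 2022, 2023, 2024],
    "Sales": [100.0, 110.0, 105.0, 120.0],
})

# Sort first: .diff() subtracts the previous *row*, not the previous year
toy = toy.sort_values("Year")
toy["dSales"] = toy["Sales"].diff()

print(toy)
print(f"Usable observations after differencing: {len(toy.dropna())}")
```

This is why a differenced regression always has one fewer observation than the original series.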

In [None]:
# Sort by year and compute first differences
df = df.sort_values('Year')
df['dSales'] = df['Sales_Billions'].diff()
df['dInflation'] = df['CPI-YOY (Nov )'].diff()

# Drop missing values from differencing
df_diff = df.dropna(subset=['dSales', 'dInflation'])
print(f"After differencing: {len(df_diff)} observations")
df_diff[['Year', 'dSales', 'dInflation']].head()

In [None]:
# First differences regression
X_diff = df_diff['dInflation']
y_diff = df_diff['dSales']
X_diff_const = sm.add_constant(X_diff)

model_diff = sm.OLS(y_diff, X_diff_const).fit()
# Print the full summary: Step 4 also cites the Durbin-Watson statistic
print(model_diff.summary())

In [None]:
# Visualization of first differences model
plt.figure(figsize=(8, 5))
plt.scatter(df_diff['dInflation'], df_diff['dSales'], alpha=0.8)

x_grid = np.linspace(df_diff['dInflation'].min(), df_diff['dInflation'].max(), 100)
y_pred = model_diff.predict(sm.add_constant(x_grid))
plt.plot(x_grid, y_pred, color='red', linewidth=2)

plt.xlabel('Change in November CPI YoY (percentage points)')
plt.ylabel('Change in Holiday Sales (Billions USD)')
plt.title('Change in Sales vs Change in Inflation')
plt.tight_layout()
plt.show()

In [None]:
# Residual plot
plt.figure(figsize=(8, 5))
plt.scatter(model_diff.fittedvalues, model_diff.resid, alpha=0.8)
plt.axhline(y=0, color='black', linestyle='--', linewidth=1)
plt.xlabel('Fitted Change in Sales (Billions)')
plt.ylabel('Residual')
plt.title('Residual Plot for First Differences Model')
plt.tight_layout()
plt.show()

---
## Step 4 | Results Interpretation

### Key Findings

**Levels Model (problematic):**
- Coefficient: 38.46 (p = 0.059) - only marginally significant
- R-squared: 0.176 - low explanatory power
- Durbin-Watson: 0.247 - far below 2, which is strong evidence of positive autocorrelation in the residuals (a problem for the standard errors)

**First Differences Model (preferred):**
- Coefficient: 6.07 (p = 0.019) - statistically significant at the 5% level
- Interpretation: A 1 percentage point increase in the year-to-year *change* in inflation is associated with a $6.07 billion larger *change* in holiday sales
- Durbin-Watson: 0.918 - closer to 2 than in the levels model, but still low enough to suggest some remaining positive autocorrelation
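
To make the Durbin-Watson numbers easier to read, the sketch below computes the statistic by hand on two made-up residual sequences. The helper name `durbin_watson_stat` and the residual values are hypothetical; statsmodels also provides `durbin_watson` in `statsmodels.stats.stattools`, and the full `summary()` output reports the same statistic.

```python
import numpy as np

def durbin_watson_stat(resid):
    """Sum of squared successive differences divided by sum of squared residuals."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Hypothetical residuals that stay on one side of zero for several periods
persistent = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

# Hypothetical residuals that flip sign every period
alternating = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])

dw_persistent = durbin_watson_stat(persistent)
dw_alternating = durbin_watson_stat(alternating)

print(f"Persistent residuals:  DW = {dw_persistent:.3f}")   # 0.667
print(f"Alternating residuals: DW = {dw_alternating:.3f}")  # 3.333
```

Values near 2 indicate little first-order autocorrelation, values well below 2 indicate positive autocorrelation, and values well above 2 indicate negative autocorrelation.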

### Why First Differences?

1. **Time Trends**: Both sales and inflation drift over time, and sales in particular rise in most years. A regression in levels can pick up these shared trends as a spurious correlation.

2. **Detrending**: First differencing removes the trends, leaving us with year-over-year changes that are more stable.

3. **Economic Interpretation**: The first differences model asks: "In years when inflation increased more than usual, did sales also increase more than usual?"
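
The detrending point can be checked on a toy series. The sketch below uses a hypothetical, perfectly linear trend (the intercept and slope are arbitrary) and shows that differencing leaves only a constant equal to the slope.

```python
import numpy as np

# Hypothetical linear trend: level = intercept + slope * t
intercept, slope = 50.0, 4.0
t = np.arange(10)
trend = intercept + slope * t

# Differencing removes the level and the trend, leaving the per-period change
d_trend = np.diff(trend)

print(d_trend)
```

In a differenced regression, an average per-period change of this kind is absorbed by the constant term, so the constant is the predicted change in the outcome when the regressor does not change.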

### Caution

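The constant (-26.13) in the first differences model is the predicted change in holiday sales, in billions of dollars, for a year in which November inflation is unchanged from the year before. On its own it does not show that sales growth has been slowing. A negative value is also hard to reconcile with sales that generally increase each year, so check the sign and size of this estimate against the regression output before interpreting it.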

---
## Replication Exercises

Try extending this analysis with the following exercises:

### Exercise 1: Nominal vs Real
The sales figures are in nominal dollars. Create a "real sales" variable by adjusting for CPI. Does the inflation-sales relationship still exist when using real sales?
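
One possible starting point, sketched on made-up numbers: chain the year-over-year inflation rates into a price index and divide nominal sales by it. The values and the choice of the first year as the base are assumptions for illustration, and chaining November year-over-year rates is only one way to approximate a price level.

```python
import numpy as np

# Hypothetical nominal sales and year-over-year inflation (percent)
nominal = np.array([100.0, 110.0, 121.0])
inflation_pct = np.array([2.0, 10.0, 10.0])

# Chain the growth factors into an index, rebased so the first year equals 100
growth = 1 + inflation_pct / 100
price_index = 100 * np.cumprod(growth) / growth[0]

# Real sales expressed in first-year dollars
real = nominal / (price_index / 100)

print("Price index:", price_index)
print("Real sales: ", real)
```

In this toy case nominal sales grow exactly as fast as prices, so real sales are flat.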

### Exercise 2: Lag Analysis
Test whether *last year's* inflation predicts *this year's* sales change. Create a lagged inflation variable and run the regression.
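
A lagged variable can be built with `shift`, as in this sketch on a made-up table (the column name `Inflation_lag1` and the values are hypothetical).

```python
import pandas as pd

# Hypothetical inflation values, only to show the mechanics of .shift()
toy = pd.DataFrame({
    "Year": [2021, 2022, 2023, 2024],
    "Inflation": [2.0, 4.0, 3.0, 1.0],
})

# Sort first so that "previous row" means "previous year"
toy = toy.sort_values("Year")
toy["Inflation_lag1"] = toy["Inflation"].shift(1)

print(toy)
```

As with differencing, the first row has no previous year and must be dropped before running the regression.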

### Exercise 3: Robust Standard Errors
Use heteroskedasticity-robust (HC3) standard errors. How do the results change?

### Challenge Exercise
The Durbin-Watson statistic suggests autocorrelation in residuals. Research what autocorrelation means and why it's a problem. Then try adding a lagged dependent variable (last year's sales change) to the model. Does this improve the residual diagnostics?

In [None]:
# Your code for Exercise 1: Nominal vs Real


In [None]:
# Your code for Exercise 2: Lag Analysis


In [None]:
# Your code for Exercise 3: Robust Standard Errors


In [None]:
# Your code for Challenge Exercise
