# 1: Data Generation for RDD Analysis

## Objective

Generate synthetic e-commerce shopping-session data for learning the fundamentals of Regression Discontinuity Design (RDD).

## Setup

**Running Variable:** Cart value (€0-200)  
**Cutoff:** €50 for free shipping eligibility  
**Treatment:** Free shipping (€0) vs standard shipping (€5.95)  
**Outcome:** Purchase completion (binary)  
**True Treatment Effect:** 8 percentage points  
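
The setup describes a sharp design: treatment follows deterministically from cart value, and each session's observed outcome is whichever potential outcome (`Y0` or `Y1`) matches its treatment status. The sketch below checks both relationships on a few rows copied from the data preview in the Generate Data section. Treating a cart of exactly €50 as eligible (`>=`) is an assumption; the source does not say how the boundary is handled.

```python
import pandas as pd

CUTOFF = 50.0  # free-shipping threshold in euros, from the setup

# A few rows copied from the df.head(10) preview in this notebook.
rows = pd.DataFrame(
    {
        "cart_value": [38.16, 105.30, 11.01, 75.00, 51.03],
        "treatment": [0, 1, 0, 1, 1],
        "completed_purchase": [1, 1, 0, 1, 1],
        "Y0": [1, 0, 0, 1, 1],
        "Y1": [0, 1, 1, 1, 1],
    }
)

# Sharp RDD: treatment is a deterministic function of the running variable.
# Assumption: the cutoff itself counts as treated (>=).
assigned = (rows["cart_value"] >= CUTOFF).astype(int)

# Observed outcome: Y1 for treated sessions, Y0 for untreated ones.
observed = rows["treatment"] * rows["Y1"] + (1 - rows["treatment"]) * rows["Y0"]

print("Assignment rule matches:", (assigned == rows["treatment"]).all())
print("Observed outcome matches:", (observed == rows["completed_purchase"]).all())
```

Only one of `Y0` and `Y1` is ever observable in real data. Synthetic data keeps both, which is what makes the true effect knowable here.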

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import sys
sys.path.append('../src')
from generate_data import generate_rdd_ecommerce_data
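
The source of `generate_rdd_ecommerce_data` lives in `../src` and is not shown in this notebook. As a rough illustration of what a generator for this setup might do, here is a minimal sketch. The function name `sketch_rdd_sessions`, the log-normal cart-value distribution, the linear baseline probability, and the `>=` cutoff rule are all assumptions made for illustration, not the actual implementation.

```python
import numpy as np
import pandas as pd


def sketch_rdd_sessions(n_sessions=10_000, cutoff=50.0,
                        treatment_effect=0.08, random_seed=42):
    """Hypothetical stand-in for generate_rdd_ecommerce_data (not the real one)."""
    rng = np.random.default_rng(random_seed)

    # Running variable: right-skewed cart values, clipped to an assumed 5-200 range.
    raw = rng.lognormal(mean=4.0, sigma=0.45, size=n_sessions)
    cart_value = np.clip(raw, 5.0, 200.0).round(2)

    # Assumed baseline: completion probability rises smoothly with cart value.
    p0 = np.clip(0.35 + 0.002 * cart_value, 0.0, 1.0 - treatment_effect)

    # Potential outcomes without (Y0) and with (Y1) free shipping,
    # drawn independently for each session.
    y0 = (rng.random(n_sessions) < p0).astype(int)
    y1 = (rng.random(n_sessions) < p0 + treatment_effect).astype(int)

    # Sharp assignment at the cutoff (assumed inclusive).
    treatment = (cart_value >= cutoff).astype(int)

    # Observed outcome is the potential outcome matching treatment status.
    completed = np.where(treatment == 1, y1, y0)

    return pd.DataFrame(
        {
            "session_id": np.arange(1, n_sessions + 1),
            "cart_value": cart_value,
            "treatment": treatment,
            "completed_purchase": completed,
            "Y0": y0,
            "Y1": y1,
        }
    )


sketch_df = sketch_rdd_sessions(n_sessions=2_000)
print(sketch_df.head())
```

Because the baseline probability is smooth in cart value, the only jump in completion probability at the cutoff comes from `treatment_effect`. That jump is the quantity an RDD estimate targets.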

## Generate Data

In [2]:
# Generate synthetic shopping session data
df = generate_rdd_ecommerce_data(
    n_sessions=10000,
    cutoff=50.0,
    shipping_cost=5.95,
    treatment_effect=0.08,  # 8 percentage point increase
    random_seed=42
)

print(f"Dataset shape: {df.shape}")
df.head(10)

Dataset shape: (10000, 11)


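| | session_id | customer_age | account_tenure_days | previous_purchases | product_category | items_in_cart | cart_value | treatment | completed_purchase | Y0 | Y1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25-34 | 38 | 5 | Books & Media | 5 | 38.16 | 0 | 1 | 1 | 0 |
| 1 | 2 | 55+ | 59 | 3 | Fashion | 5 | 27.9 | 0 | 0 | 0 | 0 |
| 2 | 3 | 35-44 | 355 | 0 | Fashion | 4 | 105.3 | 1 | 1 | 0 | 1 |
| 3 | 4 | 35-44 | 157 | 9 | Electronics | 4 | 16.95 | 0 | 0 | 0 | 0 |
| 4 | 5 | 25-34 | 287 | 1 | Sports & Outdoors | 4 | 74.3 | 1 | 1 | 0 | 1 |
| 5 | 6 | 25-34 | 289 | 0 | Fashion | 6 | 11.01 | 0 | 0 | 0 | 1 |
| 6 | 7 | 18-24 | 140 | 0 | Books & Media | 4 | 18.53 | 0 | 1 | 1 | 0 |
| 7 | 8 | 45-54 | 246 | 2 | Electronics | 2 | 75.0 | 1 | 1 | 1 | 1 |
| 8 | 9 | 35-44 | 133 | 2 | Electronics | 4 | 51.03 | 1 | 1 | 1 | 1 |
| 9 | 10 | 35-44 | 114 | 3 | Home & Garden | 2 | 80.75 | 1 | 1 | 1 | 1 |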


## Quick Summary Statistics

In [3]:
print("Treatment distribution:")
print(df['treatment'].value_counts())
print("\nPurchase completion by treatment:")
print(df.groupby('treatment')['completed_purchase'].agg(['mean', 'sum', 'count']))
print("\nCart value distribution:")
print(df['cart_value'].describe())
print("\nProduct category distribution:")
print(df['product_category'].value_counts())

Treatment distribution:
treatment
1    5655
0    4345
Name: count, dtype: int64

Purchase completion by treatment:
               mean   sum  count
treatment                       
0          0.441887  1920   4345
1          0.577896  3268   5655

Cart value distribution:
count    10000.000000
mean        58.222083
std         27.014896
min          5.000000
25%         39.105000
50%         54.170000
75%         72.780000
max        200.000000
Name: cart_value, dtype: float64

Product category distribution:
product_category
Fashion              2953
Electronics          2478
Home & Garden        2045
Books & Media        1542
Sports & Outdoors     982
Name: count, dtype: int64
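
The completion rates above differ by more than the 8 percentage points built into the data. The arithmetic below reproduces that naive gap from the printed counts. A plausible reading, not confirmed by the generator source, is that completion also varies with cart value, so comparing all treated sessions with all untreated ones mixes the shipping effect with that trend.

```python
# Counts copied from the "Purchase completion by treatment" output above.
control_purchases, control_sessions = 1920, 4345
treated_purchases, treated_sessions = 3268, 5655

control_rate = control_purchases / control_sessions
treated_rate = treated_purchases / treated_sessions

# Naive estimate: difference in raw completion rates across the full sample.
naive_diff = treated_rate - control_rate

true_effect = 0.08  # from the setup section

print(f"Naive difference:   {naive_diff:.3f}")                 # 0.136
print(f"Gap to true effect: {naive_diff - true_effect:.3f}")   # 0.056
```

RDD avoids this by comparing sessions just below and just above the cutoff, where cart values are nearly identical.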


## Check Density Around Threshold

In [4]:
# How many sessions are near the €50 threshold?
window = df[(df['cart_value'] >= 40) & (df['cart_value'] <= 60)]
print(f"Sessions with cart value €40-60: {len(window)} ({len(window)/len(df)*100:.1f}%)")
print("\nTreatment distribution in this window:")
print(window['treatment'].value_counts())

Sessions with cart value €40-60: 3231 (32.3%)

Treatment distribution in this window:
treatment
0    1704
1    1527
Name: count, dtype: int64
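
A single €40-60 window gives only one count per side. A finer view, sketched below, counts sessions in narrow equal-width bins on either side of the cutoff, so that a pile-up just above €50 (a sign of customers adjusting their carts to qualify) would stand out. The helper name `counts_around_cutoff` and the bin settings are illustrative assumptions. The demo uses uniform random values rather than the notebook's data.

```python
import numpy as np


def counts_around_cutoff(values, cutoff=50.0, bin_width=2.0, n_bins=5):
    """Count observations in equal-width bins on each side of the cutoff."""
    # Edges are built so the cutoff is itself an edge: no bin straddles it.
    edges = cutoff + bin_width * np.arange(-n_bins, n_bins + 1)
    counts, _ = np.histogram(values, bins=edges)
    # First n_bins bins lie below the cutoff, the rest at or above it.
    return edges, counts[:n_bins], counts[n_bins:]


# Demo on hypothetical uniform data; in the notebook, pass df["cart_value"].
rng = np.random.default_rng(0)
demo_values = rng.uniform(0, 100, size=5_000)

edges, below, above = counts_around_cutoff(demo_values)
for left, n in zip(edges[:-1], np.concatenate([below, above])):
    print(f"[{left:5.1f}, {left + 2.0:5.1f}): {n}")
```

With smooth data the counts change gradually from bin to bin. An abrupt jump between the last bin below and the first bin above the cutoff would warrant a closer look.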


## Save Data

In [5]:
# Save to CSV
df.to_csv('../data/rdd_ecommerce.csv', index=False)
print("Data saved to ../data/rdd_ecommerce.csv")

Data saved to ../data/rdd_ecommerce.csv
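
`to_csv` raises an error if the target directory does not exist. A slightly more defensive variant, sketched below on a tiny stand-in frame and a temporary directory, creates the folder first and reloads the file to confirm the round trip. The stand-in frame and temporary path are placeholders for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-in frame; in the notebook this would be the generated df.
demo = pd.DataFrame(
    {"session_id": [1, 2], "cart_value": [38.16, 27.90], "treatment": [0, 0]}
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data" / "rdd_ecommerce.csv"
    # Create the output directory if it is missing.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    demo.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# index=False means no extra index column appears after reloading.
print(list(reloaded.columns) == list(demo.columns))
print(reloaded.shape == demo.shape)
```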
