# Netflix Incrementality Panel Construction

This notebook implements the continuous-time, event-sampled panel methodology from the Netflix incrementality paper.

## Key Differences from Standard Panel Approach

1. **Sampling Strategy**: Not a balanced panel, but strategic sampling of conversion and non-conversion moments
2. **Continuous-Time Features**: Ad stock calculated dynamically at each sampled timestamp
3. **Weighting Scheme**: Per-row sample weights (including negative weights on double-negative rows) that correct for the sampling design
4. **Unit of Analysis**: (user, vendor, timestamp), not (user, vendor, day)

## References

- Netflix Paper: "Incrementality Bidding & Attribution"
- Model: yᵢᵥ(t) = αᵢ + δₜ + γᵥ + Σₖ βₖxᵢᵥₖ(t) + Wᵢᵥ(t)'θ + εᵢᵥ(t)
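
In typeset form, the model line above reads as follows. The symbol meanings given in the comments are inferred from the features and fixed-effect identifiers built later in this notebook, not quoted from the paper, so treat them as assumptions.

```latex
% i = user, v = vendor, t = timestamp (assumed from the panel's unit of analysis)
% \alpha_i, \delta_t, \gamma_v: user, time (week), and vendor fixed effects (assumed)
% x_{ivk}(t): ad stock features; W_{iv}(t): other controls (assumed)
y_{iv}(t) = \alpha_i + \delta_t + \gamma_v
          + \sum_{k} \beta_k \, x_{ivk}(t)
          + W_{iv}(t)'\theta + \varepsilon_{iv}(t)
```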

In [16]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

print('Libraries loaded successfully')

Libraries loaded successfully


## 1. Load Raw Event Data

In [None]:
# Load all event streams
impressions = pd.read_parquet('data/raw_sample_impressions.parquet')
impressions['timestamp'] = pd.to_datetime(impressions['OCCURRED_AT'])
impressions['user_id'] = impressions['USER_ID']
impressions['vendor_id'] = impressions['VENDOR_ID']
print(f'Loaded {len(impressions):,} impressions')

clicks = pd.read_parquet('data/raw_sample_clicks.parquet')
clicks['timestamp'] = pd.to_datetime(clicks['OCCURRED_AT'])
clicks['user_id'] = clicks['USER_ID']
clicks['vendor_id'] = clicks['VENDOR_ID']
print(f'Loaded {len(clicks):,} clicks')

purchases = pd.read_parquet('data/raw_sample_purchases.parquet')
purchases['timestamp'] = pd.to_datetime(purchases['PURCHASED_AT'])
purchases['user_id'] = purchases['USER_ID']
# GMV calculation: prices are in CENTS, convert to dollars
purchases['gmv'] = (purchases['QUANTITY'] * purchases['UNIT_PRICE']) / 100
print(f'Loaded {len(purchases):,} purchases')

auctions = pd.read_parquet('data/raw_sample_auctions_users.parquet')
auctions['timestamp'] = pd.to_datetime(auctions['CREATED_AT'])
auctions['user_id'] = auctions['OPAQUE_USER_ID']
print(f'Loaded {len(auctions):,} auctions')

## 2. Identify Conversions with Vendor Attribution

Match each purchase to a vendor through the same user's impressions or clicks on the same product on the same calendar day.
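
A minimal sketch of the same-day join on toy data, to show one property of this matching rule: joining on calendar date alone also matches a click that happened after the purchase. The `toy_*` frames and the final ordering filter are illustrative assumptions and are not part of the notebook's pipeline.

```python
import pandas as pd

# Hypothetical toy data: one purchase at noon, one click before it, one after
toy_clicks = pd.DataFrame({
    'user_id': ['u1', 'u1'],
    'vendor_id': ['v1', 'v2'],
    'PRODUCT_ID': ['p1', 'p1'],
    'timestamp': pd.to_datetime(['2025-06-01 09:00', '2025-06-01 15:00']),
})
toy_purchases = pd.DataFrame({
    'user_id': ['u1'],
    'PRODUCT_ID': ['p1'],
    'timestamp': pd.to_datetime(['2025-06-01 12:00']),
    'gmv': [15.0],
})

toy_clicks['date'] = toy_clicks['timestamp'].dt.date
toy_purchases['date'] = toy_purchases['timestamp'].dt.date

# Same-day join: both clicks match the purchase
matched = toy_clicks.merge(
    toy_purchases,
    on=['user_id', 'PRODUCT_ID', 'date'],
    how='inner',
    suffixes=('_click', '_purchase'),
)

# Optional stricter rule (an assumption, not the notebook's rule):
# keep only clicks that precede the purchase
causal = matched[matched['timestamp_click'] <= matched['timestamp_purchase']]

print(len(matched), len(causal))  # 2 1
```

Whether to apply the ordering filter is a modeling choice; the cell below uses the plain same-day join.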

In [18]:
# Get impression/click data at product level for vendor attribution
imp_with_product = impressions[['user_id', 'vendor_id', 'PRODUCT_ID', 'timestamp']].copy()
click_with_product = clicks[['user_id', 'vendor_id', 'PRODUCT_ID', 'timestamp']].copy()

# Create purchase-product mapping
purchase_products = purchases[['user_id', 'PRODUCT_ID', 'timestamp', 'gmv']].copy()

# Match clicks to purchases (same calendar day; the click is not required to precede the purchase)
click_with_product['date'] = click_with_product['timestamp'].dt.date
purchase_products['date'] = purchase_products['timestamp'].dt.date

click_conversions = pd.merge(
    click_with_product,
    purchase_products[['user_id', 'PRODUCT_ID', 'date', 'gmv', 'timestamp']],
    on=['user_id', 'PRODUCT_ID', 'date'],
    how='inner',
    suffixes=('_click', '_purchase')
)

# Match impressions to purchases (same calendar day; the impression is not required to precede the purchase)
imp_with_product['date'] = imp_with_product['timestamp'].dt.date

imp_conversions = pd.merge(
    imp_with_product,
    purchase_products[['user_id', 'PRODUCT_ID', 'date', 'gmv', 'timestamp']],
    on=['user_id', 'PRODUCT_ID', 'date'],
    how='inner',
    suffixes=('_imp', '_purchase')
)

# Combine with click matches first, so that keep='first' in the dedup step prefers click attribution
all_conversions = pd.concat([
    click_conversions[['user_id', 'vendor_id', 'timestamp_purchase', 'gmv']],
    imp_conversions[['user_id', 'vendor_id', 'timestamp_purchase', 'gmv']]
])

# Keep unique user-vendor-timestamp conversions
all_conversions = all_conversions.drop_duplicates(
    subset=['user_id', 'vendor_id', 'timestamp_purchase'], 
    keep='first'
)

# Aggregate to user-vendor-timestamp level
conversions = all_conversions.groupby(
    ['user_id', 'vendor_id', 'timestamp_purchase']
).agg({'gmv': 'sum'}).reset_index()

conversions = conversions.rename(columns={'timestamp_purchase': 'timestamp'})

print(f'\nIdentified {len(conversions)} user-vendor-timestamp conversions')
print(f'Total GMV: ${conversions.gmv.sum():,.2f}')
print(f'Unique users: {conversions.user_id.nunique()}')
print(f'Unique vendors: {conversions.vendor_id.nunique()}')


Identified 157 user-vendor-timestamp conversions
Total GMV: $505,900.00
Unique users: 111
Unique vendors: 154


## 3. Create Positive Samples (+)

One row for each conversion event at the exact timestamp

In [19]:
# Create positive samples
positives = conversions.copy()
positives['outcome'] = 1
positives['sample_type'] = 'positive'
positives['sample_weight'] = 1.0

print(f'Created {len(positives)} positive samples')
print(f'Mean GMV: ${positives.gmv.mean():.2f}')
print(positives.head())

Created 157 positive samples
Mean GMV: $3222.29
                                     user_id  \
0  ext1:02910a27-2d21-4945-8b34-31a2bdba9ce7   
1  ext1:047ec1e3-142c-47d2-b492-9b3047c26591   
2  ext1:05ffe8c2-b009-4cac-b0d4-77688bb01774   
3  ext1:064d91e2-c3b0-47e9-8155-cecf17f5e84e   
4  ext1:08694c1c-3a35-47e1-8f36-25d7171e2403   

                          vendor_id           timestamp   gmv  outcome  \
0  064bd96336977540a524f04181b7c74b 2025-06-07 02:48:08  1500        1   
1  01951ac662f57100ba9c7b69a206b10d 2025-07-16 15:17:47   500        1   
2  0193c5e7b7487c2297ae709c7bac36e6 2025-09-18 09:20:17  6000        1   
3  06508dd4a57971b0b2242e04d0c73641 2025-04-19 10:51:19  2800        1   
4  01902b58f7da79b79f43cd00b4bd3051 2025-07-02 13:47:50  1500        1   

  sample_type  sample_weight  
0    positive            1.0  
1    positive            1.0  
2    positive            1.0  
3    positive            1.0  
4    positive            1.0  


## 4. Create Negative Samples (-)

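A random sample of (user, vendor, timestamp) points, with pairs drawn from the active user-vendor pairs and timestamps drawn uniformly over the observation window. Every sampled point is labeled as a non-conversion.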

In [20]:
# Get observation window
min_time = impressions['timestamp'].min()
max_time = impressions['timestamp'].max()
time_range_seconds = (max_time - min_time).total_seconds()

# Get unique user-vendor pairs that had any activity
active_pairs = pd.concat([
    impressions[['user_id', 'vendor_id']],
    clicks[['user_id', 'vendor_id']]
]).drop_duplicates()

print(f'Found {len(active_pairs):,} active user-vendor pairs')
print(f'Time window: {min_time} to {max_time}')

# Sample negatives
# Rule of thumb: 10-20x more negatives than positives for stable estimation
n_negatives = len(positives) * 15
print(f'\nSampling {n_negatives} negative observations...')

# Randomly sample user-vendor pairs with replacement
negative_pairs = active_pairs.sample(n=n_negatives, replace=True, random_state=42)

# Generate random timestamps for each (seeded so the draw is reproducible)
rng = np.random.default_rng(42)
random_seconds = rng.uniform(0, time_range_seconds, size=n_negatives)
negative_pairs['timestamp'] = min_time + pd.to_timedelta(random_seconds, unit='s')

# Mark as negatives
negative_pairs['outcome'] = 0
negative_pairs['gmv'] = 0.0
negative_pairs['sample_type'] = 'negative'

# Calculate negative weight
# w⁻ = (Total user-time space) / (# negative samples)
# Total user-time = # active user-vendor pairs × # days
n_days = (max_time - min_time).days

total_user_time = len(active_pairs) * n_days
negative_weight = total_user_time / n_negatives

negative_pairs['sample_weight'] = negative_weight

negatives = negative_pairs

print(f'Created {len(negatives)} negative samples')
print(f'Negative weight: {negative_weight:.2f}')
print(negatives.head())

Found 599,032 active user-vendor pairs
Time window: 2025-03-25 00:01:05 to 2025-09-20 23:59:56

Sampling 2355 negative observations...
Created 2355 negative samples
Negative weight: 45531.52
                                          user_id  \
149628  ext1:d0fcb300-57cf-4b03-91b8-4c9fb7f0f69f   
163660  ext1:910d10a3-aa04-49cc-bc70-89b0c91fe1f2   
568162  ext1:0ee7c369-683f-4aed-a0a4-08877570b015   
366479  ext1:4075d35d-a94e-40f6-bf0c-c955f8151a71   
133523  ext1:d380462b-180f-434d-91d7-d4571b0358a0   

                               vendor_id                     timestamp  \
149628  018f4e7e4cff72738f36359e65c2259e 2025-09-14 16:18:14.972595714   
163660  0190b91e03f47451af9a03dfe7f36a66 2025-07-01 15:21:30.206157375   
568162  01964ed8d81e75128a8f70c39a4b6a36 2025-07-09 14:51:40.247825669   
366479  018e9bcec3e37aa8ad15a89b912fb485 2025-07-14 03:04:45.539634172   
133523  01985c650bb57ed2b5baebcfc4f87e02 2025-07-11 18:51:20.900539825   

        outcome  gmv sample_type  sample_weig
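
The negative weight printed above can be reproduced by hand from the other printed figures. The sketch below plugs in the pair count, sample count, and time window from the output; the variable names mirror the cell but the block is standalone.

```python
import pandas as pd

# Figures taken from this notebook's printed output
n_active_pairs = 599_032
n_negatives = 2_355
min_time = pd.Timestamp('2025-03-25 00:01:05')
max_time = pd.Timestamp('2025-09-20 23:59:56')

# .days truncates the partial final day, as in the cell above
n_days = (max_time - min_time).days
negative_weight = n_active_pairs * n_days / n_negatives

print(n_days)                     # 179
print(round(negative_weight, 2))  # 45531.52
```

Each negative row therefore stands in for roughly 45,500 user-vendor-days of the sampled space.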

## 5. Create Double-Negative Samples (+0)

A copy of each positive with outcome = 0 and sample weight -1, so that each positive and its double-negative together add zero net weight to the panel.

In [21]:
# Create double-negatives
double_negatives = positives.copy()
double_negatives['outcome'] = 0
double_negatives['sample_type'] = 'double_negative'
double_negatives['sample_weight'] = -1.0

print(f'Created {len(double_negatives)} double-negative samples')
print(double_negatives.head())

Created 157 double-negative samples
                                     user_id  \
0  ext1:02910a27-2d21-4945-8b34-31a2bdba9ce7   
1  ext1:047ec1e3-142c-47d2-b492-9b3047c26591   
2  ext1:05ffe8c2-b009-4cac-b0d4-77688bb01774   
3  ext1:064d91e2-c3b0-47e9-8155-cecf17f5e84e   
4  ext1:08694c1c-3a35-47e1-8f36-25d7171e2403   

                          vendor_id           timestamp   gmv  outcome  \
0  064bd96336977540a524f04181b7c74b 2025-06-07 02:48:08  1500        0   
1  01951ac662f57100ba9c7b69a206b10d 2025-07-16 15:17:47   500        0   
2  0193c5e7b7487c2297ae709c7bac36e6 2025-09-18 09:20:17  6000        0   
3  06508dd4a57971b0b2242e04d0c73641 2025-04-19 10:51:19  2800        0   
4  01902b58f7da79b79f43cd00b4bd3051 2025-07-02 13:47:50  1500        0   

       sample_type  sample_weight  
0  double_negative           -1.0  
1  double_negative           -1.0  
2  double_negative           -1.0  
3  double_negative           -1.0  
4  double_negative           -1.0  


## 6. Combine All Samples

Stack positives, negatives, and double-negatives

In [22]:
# Combine all samples
panel = pd.concat([positives, negatives, double_negatives], ignore_index=True)

# Sort by user, vendor, timestamp
panel = panel.sort_values(['user_id', 'vendor_id', 'timestamp']).reset_index(drop=True)

print(f'\nCombined panel size: {len(panel):,} rows')
print(f'\nSample type breakdown:')
print(panel['sample_type'].value_counts())
print(f'\nOutcome distribution:')
print(panel['outcome'].value_counts())
print(f'\nWeighted outcome mean: {(panel["outcome"] * panel["sample_weight"]).sum() / panel["sample_weight"].sum():.6f}')


Combined panel size: 2,669 rows

Sample type breakdown:
sample_type
negative           2355
positive            157
double_negative     157
Name: count, dtype: int64

Outcome distribution:
outcome
0    2512
1     157
Name: count, dtype: int64

Weighted outcome mean: 0.000001
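
The weighted mean prints as 0.000001 only because of the six-decimal format. A short sketch of the arithmetic, using the counts printed above and 179 whole days for the window (an inference from the printed time range), shows the underlying value and how the +1 and -1 weights cancel in the denominator.

```python
n_positives = 157
n_negatives = 2_355
negative_weight = 599_032 * 179 / n_negatives  # pairs x days / negatives

# Numerator: only positives have outcome = 1, each with weight +1
weighted_conversions = n_positives * 1.0

# Denominator: +1 per positive, -1 per double-negative, negative_weight per negative
total_weight = (
    n_positives * 1.0
    + n_positives * (-1.0)
    + n_negatives * negative_weight
)

weighted_mean = weighted_conversions / total_weight
print(f'{weighted_mean:.6f}')  # 0.000001
print(f'{weighted_mean:.3e}')  # 1.464e-06
```

Printing with scientific notation keeps the rate readable at this scale.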


## 7. Calculate Continuous-Time Ad Stock Features

For each row at timestamp t, calculate vendor-specific ad stock from all prior impressions/clicks
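
With half-life h, an exposure that occurred Δ hours before t contributes exp(-ln(2) · Δ / h) = 2^(-Δ/h), and ad stock is the sum of these contributions over all strictly earlier exposures. The sketch below computes this for one user-vendor pair with NumPy broadcasting instead of a Python loop over timestamps. The function name `adstock_at` is hypothetical and is not used elsewhere in this notebook.

```python
import numpy as np

def adstock_at(panel_times, exposure_times, halflife_hours):
    """Sum of exponentially decayed weights over exposures strictly before each panel time."""
    decay_rate = np.log(2) / halflife_hours
    # Pairwise gaps in hours: rows are panel times, columns are exposures
    gaps = (panel_times[:, None] - exposure_times[None, :]) / np.timedelta64(1, 'h')
    # Future or simultaneous exposures (gap <= 0) contribute nothing
    weights = np.where(gaps > 0, np.exp(-decay_rate * np.clip(gaps, 0, None)), 0.0)
    return weights.sum(axis=1)

toy_panel_times = np.array(['2025-01-01T12:00'], dtype='datetime64[ns]')
toy_exposures = np.array(
    ['2025-01-01T10:00', '2025-01-01T11:00', '2025-01-01T13:00'],
    dtype='datetime64[ns]',
)

# Exposures 2h and 1h earlier with a 1h half-life: 0.25 + 0.5; the later one is ignored
toy_stock = adstock_at(toy_panel_times, toy_exposures, halflife_hours=1)
print(toy_stock)  # [0.75]
```

The pairwise matrix has one entry per (panel time, exposure) combination, so this form suits per-pair groups rather than the full exposure table at once.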

In [23]:
print('Calculating continuous-time ad stock features...')
print('Using optimized vectorized approach for performance.\n')

# Define decay half-lives (in hours)
decay_specs = [
    ('1hr', 1),
    ('3hr', 3),
    ('12hr', 12),
    ('1day', 24),
    ('3day', 72)
]

def calculate_adstock_optimized(panel_df, exposure_df, decay_halflife_hours, exposure_type='impression'):
    """
    Ad stock at each panel timestamp: the sum of exponentially decayed
    weights over this user-vendor pair's exposures strictly before it.
    Rows are processed one (user, vendor) group at a time.
    """
    decay_rate = np.log(2) / decay_halflife_hours
    results = []
    
    # Group both panel and exposures by user-vendor
    for (user_id, vendor_id), panel_group in tqdm(
        panel_df.groupby(['user_id', 'vendor_id']), 
        desc=f'{exposure_type} {decay_halflife_hours}hr'
    ):
        # Get exposures for this user-vendor pair
        user_vendor_exp = exposure_df[
            (exposure_df['user_id'] == user_id) & 
            (exposure_df['vendor_id'] == vendor_id)
        ].sort_values('timestamp')
        
        if len(user_vendor_exp) == 0:
            # No exposures - all zeros
            results.extend([0.0] * len(panel_group))
            continue
        
        # For each panel timestamp, calculate ad stock
        exposure_times = user_vendor_exp['timestamp'].values
        
        adstocks = []
        for panel_time in panel_group['timestamp'].values:
            # Find exposures before this time
            time_diffs_seconds = (panel_time - exposure_times).astype('timedelta64[s]').astype(float)
            
            # Only keep positive diffs (past exposures) and convert to hours
            prior_mask = time_diffs_seconds > 0
            if not np.any(prior_mask):
                adstocks.append(0.0)
                continue
                
            time_diffs_hours = time_diffs_seconds[prior_mask] / 3600
            
            # Calculate exponential decay weights
            weights = np.exp(-decay_rate * time_diffs_hours)
            adstocks.append(weights.sum())
        
        results.extend(adstocks)
    
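    # Alignment relies on panel_df already being sorted by (user_id, vendor_id),
    # so that groupby iteration order matches row order
    return pd.Series(results, index=panel_df.index)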

# Calculate impression ad stock for all decay rates
for name, halflife in decay_specs:
    panel[f'adstock_imp_{name}'] = calculate_adstock_optimized(
        panel, impressions, halflife, 'impression'
    )

# Calculate click ad stock for all decay rates  
for name, halflife in decay_specs:
    panel[f'adstock_click_{name}'] = calculate_adstock_optimized(
        panel, clicks, halflife, 'click'
    )

print('\nAd stock features calculated')
print('\nSample ad stock statistics:')
print(panel[['adstock_imp_1hr', 'adstock_imp_1day', 'adstock_click_1hr']].describe())

Calculating continuous-time ad stock features...
Using optimized vectorized approach for performance.



impression 1hr: 100%|██████████| 2503/2503 [02:18<00:00, 18.09it/s]
impression 3hr: 100%|██████████| 2503/2503 [02:19<00:00, 17.94it/s]
impression 12hr: 100%|██████████| 2503/2503 [02:18<00:00, 18.10it/s]
impression 24hr: 100%|██████████| 2503/2503 [02:19<00:00, 18.00it/s]
impression 72hr: 100%|██████████| 2503/2503 [02:20<00:00, 17.87it/s]
click 1hr: 100%|██████████| 2503/2503 [00:04<00:00, 533.19it/s]
click 3hr: 100%|██████████| 2503/2503 [00:04<00:00, 536.05it/s]
click 12hr: 100%|██████████| 2503/2503 [00:04<00:00, 534.58it/s]
click 24hr: 100%|██████████| 2503/2503 [00:04<00:00, 526.11it/s]
click 72hr: 100%|██████████| 2503/2503 [00:04<00:00, 537.52it/s]



Ad stock features calculated

Sample ad stock statistics:
       adstock_imp_1hr  adstock_imp_1day  adstock_click_1hr
count     2.669000e+03      2.669000e+03        2669.000000
mean      1.454700e-01      2.978965e-01           0.111182
std       7.066548e-01      1.520682e+00           0.427172
min       0.000000e+00      0.000000e+00           0.000000
25%       0.000000e+00      0.000000e+00           0.000000
50%       0.000000e+00      7.612156e-35           0.000000
75%      6.170078e-188      1.776555e-08           0.000000
max       1.504522e+01      3.010593e+01           4.762028


## 8. Calculate Event Stock (Retargeting Signals)

Similar to ad stock, but built from user actions such as auctions and clicks. The auction stock is computed per user rather than per user-vendor pair.

In [24]:
print('Calculating event stock features...')

def calculate_user_event_stock(panel_df, event_df, halflife_hours):
    """
    Calculate event stock at user level (not vendor-specific)
    Optimized version
    """
    decay_rate = np.log(2) / halflife_hours
    results = []
    
    for user_id, panel_group in tqdm(panel_df.groupby('user_id'), desc=f'Event stock {halflife_hours}hr'):
        # Get events for this user
        user_events = event_df[event_df['user_id'] == user_id].sort_values('timestamp')
        
        if len(user_events) == 0:
            results.extend([0.0] * len(panel_group))
            continue
        
        event_times = user_events['timestamp'].values
        
        stocks = []
        for panel_time in panel_group['timestamp'].values:
            time_diffs_seconds = (panel_time - event_times).astype('timedelta64[s]').astype(float)
            
            prior_mask = time_diffs_seconds > 0
            if not np.any(prior_mask):
                stocks.append(0.0)
                continue
            
            time_diffs_hours = time_diffs_seconds[prior_mask] / 3600
            weights = np.exp(-decay_rate * time_diffs_hours)
            stocks.append(weights.sum())
        
        results.extend(stocks)
    
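    # Alignment relies on panel_df already being sorted by user_id,
    # so that groupby iteration order matches row order
    return pd.Series(results, index=panel_df.index)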

# Calculate auction stock
panel['auction_stock_6hr'] = calculate_user_event_stock(panel, auctions, 6)

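# Reuse the vendor-specific 1-hour click ad stock from the previous cell.
# This column is an exact copy of adstock_click_1hr, so only one of the two
# should enter a regression.
panel['click_stock_1hr'] = panel['adstock_click_1hr']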

print('Event stock features calculated')
print(panel[['auction_stock_6hr', 'click_stock_1hr']].describe())

Calculating event stock features...


Event stock 6hr: 100%|██████████| 816/816 [00:05<00:00, 138.35it/s]

Event stock features calculated
       auction_stock_6hr  click_stock_1hr
count       2.669000e+03      2669.000000
mean        3.943215e+00         0.111182
std         1.023454e+01         0.427172
min         0.000000e+00         0.000000
25%         6.431615e-13         0.000000
50%         4.112866e-02         0.000000
75%         2.558479e+00         0.000000
max         1.047098e+02         4.762028





## 9. Add Time-of-Day Controls (Fourier Series)

Capture smooth diurnal and weekly patterns
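
The cell below uses the integer hour and a single sine/cosine pair per cycle. As a possible extension, the sketch here builds the phase from hour, minute, and second, and adds higher harmonics. The helper `fourier_features` and its column naming are hypothetical, not part of this notebook.

```python
import numpy as np
import pandas as pd

def fourier_features(phase, n_harmonics, prefix):
    """Sine/cosine pairs for harmonics 1..n_harmonics of a phase in [0, 1)."""
    columns = {}
    for k in range(1, n_harmonics + 1):
        columns[f'sin_{prefix}_{k}'] = np.sin(2 * np.pi * k * phase)
        columns[f'cos_{prefix}_{k}'] = np.cos(2 * np.pi * k * phase)
    return pd.DataFrame(columns)

toy_ts = pd.Series(pd.to_datetime(['2025-06-01 06:00:00', '2025-06-01 18:30:00']))

# Fraction of the day elapsed, using minutes and seconds as well as the hour
day_phase = (toy_ts.dt.hour + toy_ts.dt.minute / 60 + toy_ts.dt.second / 3600) / 24

toy_features = fourier_features(day_phase, n_harmonics=2, prefix='hour')
print(toy_features.round(3))
```

More harmonics allow sharper intraday shapes at the cost of extra regressors.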

In [25]:
# Extract time components
panel['hour'] = panel['timestamp'].dt.hour
panel['dayofweek'] = panel['timestamp'].dt.dayofweek

# Fourier series for hour of day (24-hour cycle)
panel['sin_hour'] = np.sin(2 * np.pi * panel['hour'] / 24)
panel['cos_hour'] = np.cos(2 * np.pi * panel['hour'] / 24)

# Fourier series for day of week (7-day cycle)
panel['sin_dow'] = np.sin(2 * np.pi * panel['dayofweek'] / 7)
panel['cos_dow'] = np.cos(2 * np.pi * panel['dayofweek'] / 7)

# Simple binary controls (is_evening covers 18:00-06:59, i.e. evening and overnight hours)
panel['is_weekend'] = (panel['dayofweek'] >= 5).astype(int)
panel['is_evening'] = ((panel['hour'] >= 18) | (panel['hour'] <= 6)).astype(int)

print('Time controls added')
print(panel[['sin_hour', 'cos_hour', 'sin_dow', 'cos_dow']].describe())

Time controls added
          sin_hour      cos_hour      sin_dow      cos_dow
count  2669.000000  2.669000e+03  2669.000000  2669.000000
mean     -0.021592  3.263586e-02     0.019455    -0.018009
std       0.701033  7.123174e-01     0.712164     0.701779
min      -1.000000 -1.000000e+00    -0.974928    -0.900969
25%      -0.707107 -7.071068e-01    -0.781831    -0.900969
50%       0.000000  6.123234e-17     0.000000    -0.222521
75%       0.707107  7.071068e-01     0.781831     0.623490
max       1.000000  1.000000e+00     0.974928     1.000000


## 10. Add Fixed Effects Identifiers

In [26]:
# Create week identifier for time fixed effects
panel['week_id'] = panel['timestamp'].dt.to_period('W').astype(str)

# Add diminishing returns features (squared terms for non-linear effects)
panel['adstock_imp_1day_sq'] = panel['adstock_imp_1day'] ** 2
panel['adstock_click_1hr_sq'] = panel['adstock_click_1hr'] ** 2

# User and vendor IDs are already present

print('Fixed effect identifiers added')
print(f'  Unique users: {panel["user_id"].nunique()}')
print(f'  Unique vendors: {panel["vendor_id"].nunique()}')
print(f'  Unique weeks: {panel["week_id"].nunique()}')
print('\nDiminishing returns features added')
print(f'  adstock_imp_1day_sq (mean): {panel["adstock_imp_1day_sq"].mean():.6f}')
print(f'  adstock_click_1hr_sq (mean): {panel["adstock_click_1hr_sq"].mean():.6f}')

Fixed effect identifiers added
  Unique users: 816
  Unique vendors: 2406
  Unique weeks: 26

Diminishing returns features added
  adstock_imp_1day_sq (mean): 2.400349
  adstock_click_1hr_sq (mean): 0.194769


## 11. Add Rolling Window Exposures

Counts of vendor-specific impressions and clicks in the 7 or 14 days before each panel timestamp.
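
For a single user-vendor pair with sorted exposure times, the same count can be obtained with two binary searches instead of a pairwise comparison. This is a sketch of an alternative, not the method used in the cell below; `count_in_window` is a hypothetical helper, and it compares at full timestamp resolution rather than whole seconds.

```python
import numpy as np

def count_in_window(panel_times, sorted_exposure_times, window_days):
    """Count exposures e with t - window <= e < t for each panel time t."""
    window = np.timedelta64(window_days, 'D')
    # Number of exposures strictly before t
    upper = np.searchsorted(sorted_exposure_times, panel_times, side='left')
    # Number of exposures strictly before the start of the window
    lower = np.searchsorted(sorted_exposure_times, panel_times - window, side='left')
    return upper - lower

toy_exposures = np.array(
    ['2025-01-01', '2025-01-05', '2025-01-08', '2025-01-10'],
    dtype='datetime64[ns]',
)
toy_panel_times = np.array(['2025-01-10', '2025-01-02'], dtype='datetime64[ns]')

toy_counts = count_in_window(toy_panel_times, toy_exposures, window_days=7)
print(toy_counts)  # [2 1]
```

The exposure at exactly t is excluded and one at exactly t minus the window is included, matching the `> 0` and `<= window_seconds` conditions in the cell below.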

In [27]:
print('Calculating rolling window features...')

def count_recent_exposures_optimized(panel_df, exposure_df, window_days):
    """
    Count exposures in recent window - optimized version
    """
    window_seconds = window_days * 24 * 3600
    results = []
    
    for (user_id, vendor_id), panel_group in tqdm(
        panel_df.groupby(['user_id', 'vendor_id']),
        desc=f'Rolling {window_days}d'
    ):
        # Get exposures for this user-vendor
        user_vendor_exp = exposure_df[
            (exposure_df['user_id'] == user_id) &
            (exposure_df['vendor_id'] == vendor_id)
        ].sort_values('timestamp')
        
        if len(user_vendor_exp) == 0:
            results.extend([0] * len(panel_group))
            continue
        
        exposure_times = user_vendor_exp['timestamp'].values
        
        counts = []
        for panel_time in panel_group['timestamp'].values:
            # Count exposures in window
            time_diffs_seconds = (panel_time - exposure_times).astype('timedelta64[s]').astype(float)
            
            # Count exposures within window (past exposures only)
            in_window = (time_diffs_seconds > 0) & (time_diffs_seconds <= window_seconds)
            counts.append(np.sum(in_window))
        
        results.extend(counts)
    
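    # Alignment relies on panel_df already being sorted by (user_id, vendor_id),
    # so that groupby iteration order matches row order
    return pd.Series(results, index=panel_df.index)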

# Calculate rolling windows
panel['impressions_7d'] = count_recent_exposures_optimized(panel, impressions, 7)
panel['impressions_14d'] = count_recent_exposures_optimized(panel, impressions, 14)
panel['clicks_7d'] = count_recent_exposures_optimized(panel, clicks, 7)

print('Rolling window features calculated')
print(panel[['impressions_7d', 'impressions_14d', 'clicks_7d']].describe())

Calculating rolling window features...


Rolling 7d: 100%|██████████| 2503/2503 [02:16<00:00, 18.38it/s]
Rolling 14d: 100%|██████████| 2503/2503 [02:17<00:00, 18.18it/s]
Rolling 7d: 100%|██████████| 2503/2503 [00:04<00:00, 536.07it/s]

Rolling window features calculated
       impressions_7d  impressions_14d    clicks_7d
count     2669.000000      2669.000000  2669.000000
mean         0.410641         0.507306     0.198202
std          1.924351         2.316998     0.653141
min          0.000000         0.000000     0.000000
25%          0.000000         0.000000     0.000000
50%          0.000000         0.000000     0.000000
75%          0.000000         0.000000     0.000000
max         32.000000        45.000000     5.000000





## 12. Final Panel Summary

In [28]:
print('='*80)
print('NETFLIX PANEL SUMMARY')
print('='*80)

print(f'\nPanel Structure:')
print(f'  Total observations: {len(panel):,}')
print(f'  Positives: {(panel["sample_type"] == "positive").sum():,}')
print(f'  Negatives: {(panel["sample_type"] == "negative").sum():,}')
print(f'  Double-negatives: {(panel["sample_type"] == "double_negative").sum():,}')

print(f'\nFixed Effects:')
print(f'  Users: {panel["user_id"].nunique():,}')
print(f'  Vendors: {panel["vendor_id"].nunique():,}')
print(f'  Weeks: {panel["week_id"].nunique():,}')

print(f'\nOutcome Variable:')
print(f'  Conversions: {panel["outcome"].sum():,}')
print(f'  Total GMV (positive samples only): ${panel[panel.sample_type == "positive"]["gmv"].sum():,.2f}')
print(f'  Weighted conversion rate: {(panel["outcome"] * panel["sample_weight"]).sum() / panel["sample_weight"].sum():.6f}')

print(f'\nFeature Summary:')
ad_stock_cols = [col for col in panel.columns if col.startswith('adstock_')]
print(f'  Ad stock features: {len(ad_stock_cols)}')
print(f'  Event stock features: 2')
print(f'  Time controls: 6')
print(f'  Rolling window features: 3')
print(f'  Diminishing returns features: 2')

print(f'\nKey Statistics:')
print(panel[['adstock_imp_1day', 'adstock_click_1day', 'auction_stock_6hr', 'gmv']].describe())

print(f'\nSample of panel data:')
display_cols = ['user_id', 'vendor_id', 'timestamp', 'outcome', 'sample_type', 
                'adstock_imp_1day', 'adstock_click_1hr', 'sample_weight', 'gmv']
print(panel[display_cols].head(10))

NETFLIX PANEL SUMMARY

Panel Structure:
  Total observations: 2,669
  Positives: 157
  Negatives: 2,355
  Double-negatives: 157

Fixed Effects:
  Users: 816
  Vendors: 2,406
  Weeks: 26

Outcome Variable:
  Conversions: 157
  Total GMV (positive samples only): $505,900.00
  Weighted conversion rate: 0.000001

Feature Summary:
  Ad stock features: 12
  Event stock features: 2
  Time controls: 6
  Rolling window features: 3
  Diminishing returns features: 2

Key Statistics:
       adstock_imp_1day  adstock_click_1day  auction_stock_6hr           gmv
count      2.669000e+03         2669.000000       2.669000e+03   2669.000000
mean       2.978965e-01            0.177854       3.943215e+00    379.093293
std        1.520682e+00            0.584582       1.023454e+01   1566.977128
min        0.000000e+00            0.000000       0.000000e+00      0.000000
25%        0.000000e+00            0.000000       6.431615e-13      0.000000
50%        7.612156e-35            0.000000       4.112866e-0

## 13. Save Panel Data

In [29]:
# Save to parquet
output_path = 'data/netflix_panel.parquet'
panel.to_parquet(output_path, index=False)

import os
file_size_mb = os.path.getsize(output_path) / (1024**2)

print(f'\nPanel saved to {output_path}')
print(f'File size: {file_size_mb:.2f} MB')
print(f'\nThis panel is ready for three-way fixed effects estimation using pyfixest.')


Panel saved to data/netflix_panel.parquet
File size: 0.27 MB

This panel is ready for three-way fixed effects estimation using pyfixest.
