# Comprehensive Budget Pacing Analysis

## Objective
Analyze the relationship between budget PACING, FINAL_BID, IS_WINNER, and RANKING in the marketplace's auction system.

## Research Questions
1. How does pacing affect bid amounts and win probability?
2. Can we distinguish throttling from discount pacing mechanisms?
3. What are the characteristics of campaigns at different pacing levels?
4. What are the implications for incrementality analysis?

## Data Period
September 2-8, 2025 (7 days), 0.1% user sample

## Note on Units
FINAL_BID and PRICE are in CENTS, not dollars.

In [7]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

print("Loading data with progress tracking...")
print("="*80)

df_auctions_results = pd.read_parquet('data/raw_auctions_results_20251011.parquet')
print(f"✓ Loaded AUCTIONS_RESULTS: {len(df_auctions_results):,} rows")

df_auctions_users = pd.read_parquet('data/raw_auctions_users_20251011.parquet')
print(f"✓ Loaded AUCTIONS_USERS: {len(df_auctions_users):,} rows")

df_impressions = pd.read_parquet('data/raw_impressions_20251011.parquet')
print(f"✓ Loaded IMPRESSIONS: {len(df_impressions):,} rows")

df_clicks = pd.read_parquet('data/raw_clicks_20251011.parquet')
print(f"✓ Loaded CLICKS: {len(df_clicks):,} rows")

df_catalog = pd.read_parquet('data/catalog_20251011.parquet')
print(f"✓ Loaded CATALOG: {len(df_catalog):,} rows")

print("\nMerging auction data with timestamps...")
# Drop CREATED_AT from auctions_results so the merge does not produce duplicate CREATED_AT columns
df_auctions_results_clean = df_auctions_results.drop(columns=['CREATED_AT'])
df = pd.merge(df_auctions_results_clean, df_auctions_users[['AUCTION_ID', 'CREATED_AT', 'PLACEMENT', 'OPAQUE_USER_ID']], 
              on='AUCTION_ID', how='left')
print(f"✓ Merged dataset: {len(df):,} rows")

print("\nCreating derived features...")
df['datetime'] = pd.to_datetime(df['CREATED_AT'])
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.dayofweek
df['date'] = df['datetime'].dt.date
df['FINAL_BID_DOLLARS'] = df['FINAL_BID'] / 100
df['PRICE_DOLLARS'] = df['PRICE'] / 100

print("\nData shape:", df.shape)
print("\nColumn dtypes:")
print(df.dtypes)
print("\n" + "="*80)

Loading data with progress tracking...
✓ Loaded AUCTIONS_RESULTS: 18,838,670 rows
✓ Loaded AUCTIONS_USERS: 413,457 rows
✓ Loaded IMPRESSIONS: 533,146 rows
✓ Loaded CLICKS: 16,706 rows
✓ Loaded CATALOG: 2,007,695 rows

Merging auction data with timestamps...
✓ Merged dataset: 18,840,598 rows

Creating derived features...

Data shape: (18840598, 20)

Column dtypes:
AUCTION_ID                   object
VENDOR_ID                    object
CAMPAIGN_ID                  object
PRODUCT_ID                   object
RANKING                       int64
IS_WINNER                      bool
QUALITY                     float64
FINAL_BID                     int64
PRICE                       float64
CONVERSION_RATE             float64
PACING                      float64
CREATED_AT           datetime64[ns]
PLACEMENT                    object
OPAQUE_USER_ID               object
datetime             datetime64[ns]
hour                          int32
day_of_week                   int32
date                  
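The merged row count above (18,840,598) exceeds the AUCTIONS_RESULTS row count (18,838,670). With a left join, that can only happen when some AUCTION_ID values appear more than once in the right-hand table, so some bids are duplicated downstream. A minimal sketch of one way to guard against this, using toy data; the helper name `merge_auction_users` and the choice to keep the first row per AUCTION_ID are assumptions, not something the source specifies:

```python
import pandas as pd

def merge_auction_users(results: pd.DataFrame, users: pd.DataFrame) -> pd.DataFrame:
    """Left-join users onto results while keeping exactly one users row per AUCTION_ID."""
    # Assumption: keeping the first row per AUCTION_ID is acceptable for this analysis.
    users_unique = users.drop_duplicates(subset='AUCTION_ID', keep='first')
    # validate='many_to_one' raises pandas.errors.MergeError if the right keys are not unique.
    merged = pd.merge(results, users_unique, on='AUCTION_ID', how='left',
                      validate='many_to_one')
    # A left join against unique keys must preserve the left row count.
    assert len(merged) == len(results)
    return merged

# Toy data: auction 'b' is duplicated on the users side.
results = pd.DataFrame({'AUCTION_ID': ['a', 'a', 'b'], 'FINAL_BID': [5, 7, 3]})
users = pd.DataFrame({'AUCTION_ID': ['a', 'b', 'b'], 'PLACEMENT': ['1', '2', '2']})

merged = merge_auction_users(results, users)
print(f"left rows: {len(results)}, merged rows: {len(merged)}")
```

Whether to deduplicate or to investigate the repeated AUCTION_ID rows first depends on why they repeat, which the data shown here does not reveal.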

## Section 1: Descriptive Statistics

In [8]:
print("="*80)
print("DESCRIPTIVE STATISTICS")
print("="*80)

print("\n1. PACING DISTRIBUTION")
print("-" * 80)
print(f"Total bids: {len(df):,}")
print(f"Mean pacing: {df['PACING'].mean():.4f}")
print(f"Median pacing: {df['PACING'].median():.4f}")
print(f"Std pacing: {df['PACING'].std():.4f}")
print(f"Min pacing: {df['PACING'].min():.6f}")
print(f"Max pacing: {df['PACING'].max():.4f}")

print("\nPacing Quantiles:")
for q in [0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99]:
    print(f"  {q*100:5.1f}%: {df['PACING'].quantile(q):.6f}")

print("\nPacing Categories:")
print(f"  High (0.9-1.0):   {(df['PACING'] >= 0.9).sum():,} ({(df['PACING'] >= 0.9).mean()*100:.2f}%)")
print(f"  Medium (0.5-0.9): {((df['PACING'] >= 0.5) & (df['PACING'] < 0.9)).sum():,} ({((df['PACING'] >= 0.5) & (df['PACING'] < 0.9)).mean()*100:.2f}%)")
print(f"  Low (<0.5):       {(df['PACING'] < 0.5).sum():,} ({(df['PACING'] < 0.5).mean()*100:.2f}%)")

print("\n2. BID DISTRIBUTION")
print("-" * 80)
print(f"Mean bid (cents): {df['FINAL_BID'].mean():.2f}")
print(f"Mean bid (dollars): ${df['FINAL_BID_DOLLARS'].mean():.4f}")
print(f"Median bid (cents): {df['FINAL_BID'].median():.2f}")
print(f"Median bid (dollars): ${df['FINAL_BID_DOLLARS'].median():.4f}")
print(f"Std bid (cents): {df['FINAL_BID'].std():.2f}")
print(f"Min bid (cents): {df['FINAL_BID'].min():.2f}")
print(f"Max bid (cents): {df['FINAL_BID'].max():.2f}")

print("\nBid Quantiles (dollars):")
for q in [0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99]:
    print(f"  {q*100:5.1f}%: ${df['FINAL_BID_DOLLARS'].quantile(q):.4f}")

print("\n3. WIN RATES BY PACING LEVEL")
print("-" * 80)
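# Note: pd.cut bins are right-closed by default, so PACING values of exactly 0.5 and 0.9
# fall in 'Low' and 'Med' here, while the category counts above use >= thresholds.
df['pacing_cat'] = pd.cut(df['PACING'], bins=[0, 0.5, 0.9, 1.0], labels=['Low', 'Med', 'High'])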
for cat in ['High', 'Med', 'Low']:
    cat_data = df[df['pacing_cat'] == cat]
    print(f"{cat} pacing:")
    print(f"  N bids: {len(cat_data):,}")
    print(f"  Win rate: {cat_data['IS_WINNER'].mean()*100:.2f}%")
    print(f"  Mean bid: ${cat_data['FINAL_BID_DOLLARS'].mean():.4f}")
    print(f"  Mean rank: {cat_data['RANKING'].mean():.2f}")

print("\n4. PACING × BID INTERACTION WIN RATES")
print("-" * 80)
df['bid_tercile'] = pd.qcut(df['FINAL_BID'], q=3, labels=['Low', 'Med', 'High'], duplicates='drop')
# Custom pacing bins to handle concentration at 1.0
df['pacing_tercile'] = pd.cut(df['PACING'], bins=[0, 0.5, 0.9, 1.0], labels=['Low', 'Med', 'High'])

cross_tab = pd.crosstab([df['bid_tercile']], [df['pacing_tercile']], 
                        values=df['IS_WINNER'], aggfunc='mean')
print("\nWin Rate Matrix (Bid × Pacing):")
print(cross_tab.round(4))

print("\n5. CORRELATIONS")
print("-" * 80)
print(f"FINAL_BID vs PACING: {df['FINAL_BID'].corr(df['PACING']):.4f}")
print(f"IS_WINNER vs PACING: {df['IS_WINNER'].corr(df['PACING']):.4f}")
print(f"RANKING vs PACING: {df['RANKING'].corr(df['PACING']):.4f}")
print(f"QUALITY vs PACING: {df['QUALITY'].corr(df['PACING']):.4f}")
print(f"FINAL_BID vs QUALITY: {df['FINAL_BID'].corr(df['QUALITY']):.4f}")

print("\n" + "="*80)

DESCRIPTIVE STATISTICS

1. PACING DISTRIBUTION
--------------------------------------------------------------------------------
Total bids: 18,840,598
Mean pacing: 0.8910
Median pacing: 1.0000
Std pacing: 0.2634
Min pacing: 0.006738
Max pacing: 1.0000

Pacing Quantiles:
    1.0%: 0.014909
    5.0%: 0.136717
   10.0%: 0.425693
   25.0%: 1.000000
   50.0%: 1.000000
   75.0%: 1.000000
   90.0%: 1.000000
   95.0%: 1.000000
   99.0%: 1.000000

Pacing Categories:
  High (0.9-1.0):   15,573,844 (82.66%)
  Medium (0.5-0.9): 1,169,905 (6.21%)
  Low (<0.5):       2,096,849 (11.13%)

2. BID DISTRIBUTION
--------------------------------------------------------------------------------
Mean bid (cents): 11.80
Mean bid (dollars): $0.1180
Median bid (cents): 6.00
Median bid (dollars): $0.0600
Std bid (cents): 14.44
Min bid (cents): 0.00
Max bid (cents): 100.00

Bid Quantiles (dollars):
   10.0%: $0.0100
   25.0%: $0.0300
   50.0%: $0.0600
   75.0%: $0.1600
   90.0%: $0.2900
   95.0%: $0.3900
   99.0%:

### Conversion Rate Analysis

Examine how CONVERSION_RATE relates to pacing, bids, quality, and outcomes.
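The correlations in this section are Pearson coefficients, which measure linear association only. As a hedged illustration on synthetic data (the variables `x` and `y` are hypothetical and are not drawn from the auction tables), a near-zero Pearson r can coexist with strong dependence, which binned means make visible:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic example: y depends on x exactly, but not linearly.
x = rng.uniform(-1, 1, 10_000)
toy = pd.DataFrame({'x': x, 'y': x ** 2})

# Pearson r is close to zero because the relationship is symmetric around x = 0.
pearson_r = toy['x'].corr(toy['y'])

# Binned means expose the U-shaped dependence that r misses.
toy['x_bin'] = pd.qcut(toy['x'], q=5, labels=['B1', 'B2', 'B3', 'B4', 'B5'])
bin_means = toy.groupby('x_bin', observed=True)['y'].mean()

print(f"Pearson r: {pearson_r:.4f}")
print(bin_means.round(4))
```

The same binned-mean check can be applied to CONVERSION_RATE by pacing level before reading a small r as absence of any relationship.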

In [17]:
print("="*80)
print("CONVERSION_RATE ANALYSIS")
print("="*80)

print("\n1. CONVERSION_RATE DISTRIBUTION")
print("-" * 80)
print(f"Mean conversion rate: {df['CONVERSION_RATE'].mean():.6f}")
print(f"Median conversion rate: {df['CONVERSION_RATE'].median():.6f}")
print(f"Std conversion rate: {df['CONVERSION_RATE'].std():.6f}")
print(f"Min conversion rate: {df['CONVERSION_RATE'].min():.6f}")
print(f"Max conversion rate: {df['CONVERSION_RATE'].max():.6f}")

print("\nConversion Rate Quantiles:")
for q in [0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99]:
    print(f"  {q*100:5.1f}%: {df['CONVERSION_RATE'].quantile(q):.6f}")

print("\n2. CONVERSION_RATE CORRELATIONS")
print("-" * 80)
print(f"CONVERSION_RATE vs PACING:     {df['CONVERSION_RATE'].corr(df['PACING']):.4f}")
print(f"CONVERSION_RATE vs FINAL_BID:  {df['CONVERSION_RATE'].corr(df['FINAL_BID']):.4f}")
print(f"CONVERSION_RATE vs QUALITY:    {df['CONVERSION_RATE'].corr(df['QUALITY']):.4f}")
print(f"CONVERSION_RATE vs RANKING:    {df['CONVERSION_RATE'].corr(df['RANKING']):.4f}")
print(f"CONVERSION_RATE vs IS_WINNER:  {df['CONVERSION_RATE'].corr(df['IS_WINNER']):.4f}")

print("\n3. CONVERSION_RATE BY PACING LEVEL")
print("-" * 80)
for cat in ['High', 'Med', 'Low']:
    cat_data = df[df['pacing_cat'] == cat]
    print(f"{cat} pacing:")
    print(f"  Mean CVR: {cat_data['CONVERSION_RATE'].mean():.6f}")
    print(f"  Median CVR: {cat_data['CONVERSION_RATE'].median():.6f}")
    print(f"  Std CVR: {cat_data['CONVERSION_RATE'].std():.6f}")

print("\n4. CONVERSION_RATE BY WIN STATUS")
print("-" * 80)
winners = df[df['IS_WINNER']]
losers = df[~df['IS_WINNER']]
print(f"Winners:")
print(f"  Mean CVR: {winners['CONVERSION_RATE'].mean():.6f}")
print(f"  Median CVR: {winners['CONVERSION_RATE'].median():.6f}")
print(f"  N: {len(winners):,}")
print(f"\nLosers:")
print(f"  Mean CVR: {losers['CONVERSION_RATE'].mean():.6f}")
print(f"  Median CVR: {losers['CONVERSION_RATE'].median():.6f}")
print(f"  N: {len(losers):,}")
print(f"\nDifference: {winners['CONVERSION_RATE'].mean() - losers['CONVERSION_RATE'].mean():.6f}")

print("\n5. CONVERSION_RATE BY QUALITY QUARTILE")
print("-" * 80)
df['quality_quartile'] = pd.qcut(df['QUALITY'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'], duplicates='drop')
for q in ['Q1', 'Q2', 'Q3', 'Q4']:
    q_data = df[df['quality_quartile'] == q]
    if len(q_data) > 0:
        print(f"Quality {q}:")
        print(f"  Mean CVR: {q_data['CONVERSION_RATE'].mean():.6f}")
        print(f"  Mean Quality: {q_data['QUALITY'].mean():.6f}")
        print(f"  Win rate: {q_data['IS_WINNER'].mean()*100:.2f}%")

print("\n6. CAMPAIGN-LEVEL CVR ANALYSIS")
print("-" * 80)
campaign_cvr = df.groupby('CAMPAIGN_ID').agg({
    'CONVERSION_RATE': ['mean', 'std'],
    'PACING': 'mean',
    'FINAL_BID': 'mean',
    'IS_WINNER': 'mean',
    'AUCTION_ID': 'count'
}).reset_index()
campaign_cvr.columns = ['_'.join(col).strip('_') for col in campaign_cvr.columns.values]
campaign_cvr = campaign_cvr[campaign_cvr['AUCTION_ID_count'] >= 50]

print(f"Campaigns with 50+ bids: {len(campaign_cvr):,}")
print(f"\nCampaign-level correlations:")
print(f"  Mean CVR vs Mean Pacing:   {campaign_cvr['CONVERSION_RATE_mean'].corr(campaign_cvr['PACING_mean']):.4f}")
print(f"  Mean CVR vs Mean Bid:      {campaign_cvr['CONVERSION_RATE_mean'].corr(campaign_cvr['FINAL_BID_mean']):.4f}")
print(f"  Mean CVR vs Win Rate:      {campaign_cvr['CONVERSION_RATE_mean'].corr(campaign_cvr['IS_WINNER_mean']):.4f}")

print("\n7. CVR × PACING INTERACTION")
print("-" * 80)
df['cvr_tercile'] = pd.qcut(df['CONVERSION_RATE'], q=3, labels=['Low', 'Med', 'High'], duplicates='drop')
cvr_pacing_matrix = pd.crosstab([df['cvr_tercile']], [df['pacing_cat']], 
                                values=df['IS_WINNER'], aggfunc='mean')
print("\nWin Rate Matrix (CVR × Pacing):")
print(cvr_pacing_matrix.round(4))

print("\n8. INTERPRETATION")
print("-" * 80)
corr_cvr_pacing = df['CONVERSION_RATE'].corr(df['PACING'])
corr_cvr_quality = df['CONVERSION_RATE'].corr(df['QUALITY'])
winner_cvr_diff = winners['CONVERSION_RATE'].mean() - losers['CONVERSION_RATE'].mean()

if abs(corr_cvr_pacing) < 0.05:
    print("• CVR shows negligible linear correlation with pacing (|r| < 0.05)")
    print("  → No evidence that budget allocation prioritizes high-CVR campaigns")
else:
    print(f"• CVR correlates with pacing (r={corr_cvr_pacing:.4f})")
    print("  → Budget allocation may consider conversion predictions")

if abs(corr_cvr_quality) > 0.3:
    print(f"\n• CVR strongly correlates with QUALITY (r={corr_cvr_quality:.4f})")
    print("  → Quality score incorporates conversion predictions")
elif abs(corr_cvr_quality) > 0.1:
    print(f"\n• CVR moderately correlates with QUALITY (r={corr_cvr_quality:.4f})")
    print("  → Quality score partially reflects conversion potential")
else:
    print(f"\n• CVR weakly correlates with QUALITY (r={corr_cvr_quality:.4f})")
    print("  → Quality and CVR are largely independent signals")

if abs(winner_cvr_diff) > 0.001:
    print(f"\n• Winners have {'HIGHER' if winner_cvr_diff > 0 else 'LOWER'} CVR than losers")
    print(f"  → Difference: {abs(winner_cvr_diff):.6f}")
    print("  → CVR is associated with auction outcomes")
else:
    print("\n• Winners and losers have similar CVR")
    print("  → CVR shows little association with auction outcomes")

print("\n" + "="*80)

CONVERSION_RATE ANALYSIS

1. CONVERSION_RATE DISTRIBUTION
--------------------------------------------------------------------------------
Mean conversion rate: 0.010004
Median conversion rate: 0.009010
Std conversion rate: 0.007716
Min conversion rate: 0.000001
Max conversion rate: 0.056500

Conversion Rate Quantiles:
   10.0%: 0.001456
   25.0%: 0.004257
   50.0%: 0.009010
   75.0%: 0.013267
   90.0%: 0.019160
   95.0%: 0.024458
   99.0%: 0.037425

2. CONVERSION_RATE CORRELATIONS
--------------------------------------------------------------------------------
CONVERSION_RATE vs PACING:     0.0275
CONVERSION_RATE vs FINAL_BID:  0.1960
CONVERSION_RATE vs QUALITY:    0.1578
CONVERSION_RATE vs RANKING:    0.0433
CONVERSION_RATE vs IS_WINNER:  0.0077

3. CONVERSION_RATE BY PACING LEVEL
--------------------------------------------------------------------------------
High pacing:
  Mean CVR: 0.010067
  Median CVR: 0.009010
  Std CVR: 0.007790
Med pacing:
  Mean CVR: 0.010074
  Median CVR: 0

In [None]:
print("="*80)
print("COMPREHENSIVE CORRELATION MATRIX")
print("="*80)

print("\nComputing pairwise correlations for all numeric variables...")

# Select numeric columns
numeric_cols = ['RANKING', 'QUALITY', 'FINAL_BID', 'PRICE', 'CONVERSION_RATE', 
                'PACING', 'IS_WINNER', 'hour', 'day_of_week']

# Create correlation matrix
corr_sample = df[numeric_cols].sample(min(100000, len(df)), random_state=42)
corr_matrix = corr_sample.corr()

print("\n1. FULL CORRELATION MATRIX")
print("-" * 80)
print("\nVariables: RANKING, QUALITY, FINAL_BID, PRICE, CONVERSION_RATE, PACING, IS_WINNER, hour, day_of_week")
print("\nCorrelation Matrix:")
print(corr_matrix.round(4).to_string())

print("\n2. STRONGEST CORRELATIONS (|r| > 0.10)")
print("-" * 80)

# Collect upper-triangle pairs with |r| > 0.10
strong_corrs = []

for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        corr_val = corr_matrix.iloc[i, j]
        if abs(corr_val) > 0.10:
            strong_corrs.append({
                'var1': corr_matrix.columns[i],
                'var2': corr_matrix.columns[j],
                'correlation': corr_val
            })

strong_corrs_df = pd.DataFrame(strong_corrs).sort_values('correlation', key=abs, ascending=False)

print(f"\nStrong correlations (n={len(strong_corrs_df)}):")
print("Variable 1          | Variable 2          | Correlation")
print("-" * 80)
for _, row in strong_corrs_df.iterrows():
    print(f"{row['var1']:18s} | {row['var2']:18s} | {row['correlation']:10.4f}")

print("\n3. VARIABLE-SPECIFIC CORRELATIONS")
print("-" * 80)

key_outcomes = ['IS_WINNER', 'RANKING', 'PRICE']
for outcome in key_outcomes:
    print(f"\n{outcome} correlations:")
    outcome_corrs = corr_matrix[outcome].drop(outcome).sort_values(key=abs, ascending=False)
    for var, corr in outcome_corrs.items():
        print(f"  {var:20s}: {corr:7.4f}")

print("\n4. INTERPRETATION OF KEY CORRELATIONS")
print("-" * 80)

# IS_WINNER correlations
winner_pacing = corr_matrix.loc['IS_WINNER', 'PACING']
winner_bid = corr_matrix.loc['IS_WINNER', 'FINAL_BID']
winner_quality = corr_matrix.loc['IS_WINNER', 'QUALITY']

print(f"\nIS_WINNER drivers:")
print(f"  PACING:     {winner_pacing:7.4f} {'(STRONG)' if abs(winner_pacing) > 0.2 else '(MODERATE)' if abs(winner_pacing) > 0.1 else '(WEAK)'}")
print(f"  FINAL_BID:  {winner_bid:7.4f} {'(STRONG)' if abs(winner_bid) > 0.2 else '(MODERATE)' if abs(winner_bid) > 0.1 else '(WEAK)'}")
print(f"  QUALITY:    {winner_quality:7.4f} {'(STRONG)' if abs(winner_quality) > 0.2 else '(MODERATE)' if abs(winner_quality) > 0.1 else '(WEAK)'}")

# Compare all three candidate drivers, not just PACING and FINAL_BID
drivers = {'PACING': winner_pacing, 'FINAL_BID': winner_bid, 'QUALITY': winner_quality}
strongest = max(drivers, key=lambda name: abs(drivers[name]))
print(f"\n  → {strongest} has the largest |r| with IS_WINNER among these three variables")
print("  → Pearson r measures linear association only, not causal importance")

print("\n" + "="*80)

### Comprehensive Correlation Matrix

All pairwise correlations between numeric variables.

### Ranking Function Analysis: Does RANKING = f(QUALITY × FINAL_BID)?

Test if ranking is determined by quality-adjusted bidding (GSP-style).
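A minimal sketch of one way to run this test on toy data: compute QUALITY × FINAL_BID per bid, rank it within each auction, and measure how often the implied rank equals RANKING. It assumes that RANKING = 1 is the top slot and that ties are broken by row order; neither is confirmed by the source, and the helper name `rank_agreement` is hypothetical.

```python
import pandas as pd

def rank_agreement(bids: pd.DataFrame) -> float:
    """Share of bids whose RANKING equals the rank implied by QUALITY * FINAL_BID."""
    scored = bids.assign(score=bids['QUALITY'] * bids['FINAL_BID'])
    # Rank scores within each auction, highest score first (assumed: rank 1 is best).
    scored['implied_rank'] = (
        scored.groupby('AUCTION_ID')['score']
        .rank(method='first', ascending=False)
        .astype(int)
    )
    return float((scored['implied_rank'] == scored['RANKING']).mean())

# Toy auctions constructed so that RANKING follows QUALITY * FINAL_BID exactly.
toy = pd.DataFrame({
    'AUCTION_ID': ['a', 'a', 'a', 'b', 'b'],
    'QUALITY':    [0.02, 0.05, 0.01, 0.03, 0.03],
    'FINAL_BID':  [10, 6, 12, 4, 8],
    'RANKING':    [2, 1, 3, 2, 1],
})

agreement = rank_agreement(toy)
print(f"Rank agreement: {agreement:.2%}")
```

An agreement rate well below 100% on real data would suggest that the ranking uses additional inputs (for example CONVERSION_RATE or PACING) or a different functional form.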

In [18]:
print("="*80)
print("AUCTION MECHANISM DETECTION: FIRST-PRICE VS SECOND-PRICE")
print("="*80)

print("\nQuestion: Does platform run first-price or second-price auctions?")
print("Method: Compare FINAL_BID (submitted bid) with PRICE (clearing price)")
print("\nExpected patterns:")
print("  • First-price:  FINAL_BID ≈ PRICE (winner pays their bid)")
print("  • Second-price: FINAL_BID > PRICE (winner pays second-highest bid)")

print("\n1. DATA AVAILABILITY")
print("-" * 80)
winners = df[df['IS_WINNER'] == True].copy()
winners_with_price = winners[winners['PRICE'].notna()].copy()  # copy so new columns can be added safely
losers = df[df['IS_WINNER'] == False].copy()

print(f"Total bids: {len(df):,}")
print(f"Winners: {len(winners):,} ({len(winners)/len(df)*100:.2f}%)")
print(f"Winners with PRICE populated: {len(winners_with_price):,} ({len(winners_with_price)/len(winners)*100:.2f}%)")
print(f"Losers: {len(losers):,} ({len(losers)/len(df)*100:.2f}%)")
print(f"Losers with PRICE populated: {losers['PRICE'].notna().sum():,} ({losers['PRICE'].notna().mean()*100:.2f}%)")

print("\n2. BID VS PRICE COMPARISON (WINNERS ONLY)")
print("-" * 80)
winners_with_price['bid_price_diff'] = winners_with_price['FINAL_BID'] - winners_with_price['PRICE']
winners_with_price['bid_price_ratio'] = winners_with_price['FINAL_BID'] / (winners_with_price['PRICE'] + 0.01)

print(f"\nBid vs Price statistics (cents):")
print(f"  Mean FINAL_BID: {winners_with_price['FINAL_BID'].mean():.2f}")
print(f"  Mean PRICE: {winners_with_price['PRICE'].mean():.2f}")
print(f"  Mean difference (BID - PRICE): {winners_with_price['bid_price_diff'].mean():.2f}")
print(f"  Median difference: {winners_with_price['bid_price_diff'].median():.2f}")
print(f"  Mean ratio (BID / PRICE): {winners_with_price['bid_price_ratio'].mean():.4f}")

print("\nDifference distribution:")
for q in [0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99]:
    print(f"  {q*100:5.1f}%: {winners_with_price['bid_price_diff'].quantile(q):.2f} cents")

print("\n3. AUCTION TYPE CLASSIFICATION")
print("-" * 80)
tolerance = 0.5  # cents tolerance for "approximately equal"
winners_with_price['auction_type'] = 'Unknown'
winners_with_price.loc[abs(winners_with_price['bid_price_diff']) <= tolerance, 'auction_type'] = 'First-Price'
winners_with_price.loc[winners_with_price['bid_price_diff'] > tolerance, 'auction_type'] = 'Second-Price'
winners_with_price.loc[winners_with_price['bid_price_diff'] < -tolerance, 'auction_type'] = 'Anomaly'

auction_type_counts = winners_with_price['auction_type'].value_counts()
print(f"\nAuction type distribution (tolerance = {tolerance} cents):")
for auction_type in ['First-Price', 'Second-Price', 'Anomaly']:
    if auction_type in auction_type_counts.index:
        count = auction_type_counts[auction_type]
        pct = count / len(winners_with_price) * 100
        print(f"  {auction_type:15s}: {count:,} ({pct:.2f}%)")

print("\n4. SECOND-PRICE DISCOUNT ANALYSIS")
print("-" * 80)
second_price = winners_with_price[winners_with_price['auction_type'] == 'Second-Price']
if len(second_price) > 0:
    print(f"\nSecond-price auctions (n={len(second_price):,}):")
    print(f"  Mean discount (BID - PRICE): ${second_price['bid_price_diff'].mean()/100:.4f}")
    print(f"  Median discount: ${second_price['bid_price_diff'].median()/100:.4f}")
    print(f"  Mean savings rate: {(1 - second_price['PRICE']/second_price['FINAL_BID']).mean()*100:.2f}%")
    
    print("\n  Discount distribution:")
    for q in [0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99]:
        print(f"    {q*100:5.1f}%: ${second_price['bid_price_diff'].quantile(q)/100:.4f}")
else:
    print("\nNo second-price auctions detected")

print("\n5. AUCTION TYPE BY PLACEMENT")
print("-" * 80)
auction_by_placement = pd.crosstab(winners_with_price['PLACEMENT'], 
                                   winners_with_price['auction_type'], 
                                   normalize='index') * 100
print("\nAuction type % by placement:")
print(auction_by_placement.round(2))

print("\n6. TEMPORAL VARIATION")
print("-" * 80)
# Left-closed bins so hours 0-5, 6-11, 12-17, 18-23 map to the four labels
winners_with_price['hour_bin'] = pd.cut(winners_with_price['hour'], 
                                         bins=[0, 6, 12, 18, 24], 
                                         labels=['Night', 'Morning', 'Afternoon', 'Evening'],
                                         right=False)
auction_by_time = pd.crosstab(winners_with_price['hour_bin'], 
                              winners_with_price['auction_type'], 
                              normalize='index') * 100
print("\nAuction type % by time of day:")
print(auction_by_time.round(2))

print("\n7. ANOMALY INVESTIGATION (PRICE > BID)")
print("-" * 80)
anomalies = winners_with_price[winners_with_price['auction_type'] == 'Anomaly']
if len(anomalies) > 0:
    print(f"\nAnomalies (PRICE > BID): {len(anomalies):,} cases")
    print(f"  Mean BID: ${anomalies['FINAL_BID'].mean()/100:.4f}")
    print(f"  Mean PRICE: ${anomalies['PRICE'].mean()/100:.4f}")
    print(f"  Mean overcharge: ${abs(anomalies['bid_price_diff'].mean())/100:.4f}")
    
    print("\n  Possible explanations:")
    print("    • PRICE includes platform fees/markup")
    print("    • Reserve prices or minimum bid floors")
    print("    • PRICE = commodity price, not clearing price")
    print("    • Data quality issues")
else:
    print("\nNo anomalies detected (all PRICE ≤ BID)")

print("\n8. INTERPRETATION")
print("-" * 80)
first_price_pct = (auction_type_counts.get('First-Price', 0) / len(winners_with_price)) * 100
second_price_pct = (auction_type_counts.get('Second-Price', 0) / len(winners_with_price)) * 100
anomaly_pct = (auction_type_counts.get('Anomaly', 0) / len(winners_with_price)) * 100

if first_price_pct > 90:
    print("\n✓ PREDOMINANTLY FIRST-PRICE AUCTION")
    print(f"  {first_price_pct:.1f}% of auctions have BID ≈ PRICE")
    print("  Winners pay their submitted bid")
elif second_price_pct > 90:
    print("\n✓ PREDOMINANTLY SECOND-PRICE AUCTION")
    print(f"  {second_price_pct:.1f}% of auctions have BID > PRICE")
    print("  Winners pay the next-highest bid (Vickrey/GSP-style)")
elif first_price_pct > 60:
    print("\n⚠ MIXED SYSTEM WITH FIRST-PRICE MAJORITY")
    print(f"  First-price: {first_price_pct:.1f}%")
    print(f"  Second-price: {second_price_pct:.1f}%")
    print("  System may vary by placement or time")
else:
    print("\n⚠ HYBRID OR COMPLEX AUCTION SYSTEM")
    print(f"  First-price: {first_price_pct:.1f}%")
    print(f"  Second-price: {second_price_pct:.1f}%")
    print(f"  Anomalies: {anomaly_pct:.1f}%")

# Check for placement variation
placement_range = auction_by_placement['First-Price'].max() - auction_by_placement['First-Price'].min()
if placement_range > 10:
    print(f"\n✓ Auction type VARIES BY PLACEMENT (range: {placement_range:.1f}pp)")
    print("  Different placements use different auction mechanisms")
else:
    print(f"\n✓ Auction type is CONSISTENT across placements (range: {placement_range:.1f}pp)")

# Check for temporal variation
time_range = auction_by_time['First-Price'].max() - auction_by_time['First-Price'].min()
if time_range > 10:
    print(f"\n⚠ Auction type VARIES BY TIME (range: {time_range:.1f}pp)")
    print("  May reflect time-of-day differences in competition or a staged rollout")
else:
    print(f"\n✓ Auction type is STABLE over time (range: {time_range:.1f}pp)")

if anomaly_pct > 5:
    print(f"\n⚠ SIGNIFICANT ANOMALIES ({anomaly_pct:.1f}%)")
    print("  PRICE may not represent clearing price")
    print("  Could be catalog commodity price or include fees")

print("\n" + "="*80)

AUCTION MECHANISM DETECTION: FIRST-PRICE VS SECOND-PRICE

Question: Does platform run first-price or second-price auctions?
Method: Compare FINAL_BID (submitted bid) with PRICE (clearing price)

Expected patterns:
  • First-price:  FINAL_BID ≈ PRICE (winner pays their bid)
  • Second-price: FINAL_BID > PRICE (winner pays second-highest bid)

1. DATA AVAILABILITY
--------------------------------------------------------------------------------
Total bids: 18,840,598
Winners: 15,510,672 (82.33%)
Winners with PRICE populated: 15,510,672 (100.00%)
Losers: 3,329,926 (17.67%)
Losers with PRICE populated: 0 (0.00%)

2. BID VS PRICE COMPARISON (WINNERS ONLY)
--------------------------------------------------------------------------------

Bid vs Price statistics (cents):
  Mean FINAL_BID: 12.54
  Mean PRICE: 11.47
  Mean difference (BID - PRICE): 1.07
  Median difference: 0.00
  Mean ratio (BID / PRICE): 1.0787

Difference distribution:
    1.0%: 0.00 cents
    5.0%: 0.00 cents
   10.0%: 0.00 c

### Auction Mechanism Detection: First-Price vs Second-Price

Compare FINAL_BID vs PRICE to infer auction type.
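FINAL_BID > PRICE is consistent with second-price payment but does not identify it. A more direct check, sketched here on toy data, compares each winner's PRICE with the FINAL_BID of the next-ranked bid in the same auction. It assumes a plain bid-ranked rule with no quality adjustment and that all bids of an auction are present in the data; both are assumptions, and the helper name `next_bid_match_rate` is hypothetical.

```python
import numpy as np
import pandas as pd

def next_bid_match_rate(bids: pd.DataFrame, tolerance: float = 0.5) -> float:
    """Share of priced bids whose PRICE is within `tolerance` cents of the next-ranked bid."""
    ordered = bids.sort_values(['AUCTION_ID', 'RANKING'])
    # FINAL_BID of the bid ranked immediately below, within the same auction.
    ordered['next_bid'] = ordered.groupby('AUCTION_ID')['FINAL_BID'].shift(-1)
    # Keep only rows that have both a price and a lower-ranked competitor.
    priced = ordered.dropna(subset=['PRICE', 'next_bid'])
    return float(((priced['PRICE'] - priced['next_bid']).abs() <= tolerance).mean())

# Toy auctions where every winner pays the next-ranked bid.
toy = pd.DataFrame({
    'AUCTION_ID': ['a', 'a', 'a', 'b', 'b'],
    'RANKING':    [1, 2, 3, 1, 2],
    'FINAL_BID':  [10, 7, 4, 9, 9],
    'PRICE':      [7.0, 4.0, np.nan, 9.0, np.nan],
})

match_rate = next_bid_match_rate(toy)
print(f"PRICE matches next-ranked bid: {match_rate:.2%}")
```

A high match rate would support a next-price rule; a low one leaves room for quality-weighted pricing, reserve prices, or first-price payment.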

## Section 2: Core Pacing Mechanisms

### Test 1: Is PACING a bid multiplier?
If FINAL_BID = BASE_BID × PACING, we should see a positive correlation between PACING and FINAL_BID within campaigns.
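The expected pattern can be illustrated on synthetic data. This sketch assumes each campaign has one fixed base bid, pacing varies independently across auctions, and a small multiplicative noise term applies; all simulated values are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_campaigns, n_bids = 50, 200
n_rows = n_campaigns * n_bids

# One fixed base bid per campaign (assumption), repeated for each of its bids.
base_bid = np.repeat(rng.uniform(5, 50, n_campaigns), n_bids)
pacing = rng.uniform(0.1, 1.0, n_rows)
noise = rng.normal(1.0, 0.05, n_rows)

sim = pd.DataFrame({
    'CAMPAIGN_ID': np.repeat(np.arange(n_campaigns), n_bids),
    'PACING': pacing,
    'FINAL_BID': base_bid * pacing * noise,  # the multiplier hypothesis
})

# Within-campaign Pearson correlation, computed with a single groupby.
within = sim.groupby('CAMPAIGN_ID')[['PACING', 'FINAL_BID']].apply(
    lambda g: g['PACING'].corr(g['FINAL_BID'])
)
print(f"Mean within-campaign correlation: {within.mean():.4f}")
print(f"Min within-campaign correlation:  {within.min():.4f}")
```

If base bids also change within a campaign (for example across products), the correlation in real data can be weaker than in this simulation even when pacing multiplies bids.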

In [9]:
print("="*80)
print("TEST 1: PACING AS BID MULTIPLIER")
print("="*80)

print("\nHypothesis: FINAL_BID = BASE_BID × PACING")
print("Expected: Positive within-campaign correlation")

campaign_stats = df.groupby('CAMPAIGN_ID').agg({
    'FINAL_BID': ['mean', 'std', 'count'],
    'PACING': ['mean', 'std']
}).reset_index()
campaign_stats.columns = ['_'.join(col).strip('_') for col in campaign_stats.columns.values]

variable_campaigns = campaign_stats[
    (campaign_stats['FINAL_BID_count'] >= 100) &
    (campaign_stats['PACING_std'] > 0.2)
]

print(f"\nCampaigns with variable pacing: {len(variable_campaigns):,}")
print(f"  (n≥100 bids, pacing_std>0.2)")

sample_campaigns = variable_campaigns.sample(min(500, len(variable_campaigns)), random_state=42)['CAMPAIGN_ID'].values
within_corrs = []
for campaign_id in tqdm(sample_campaigns, desc="Computing within-campaign correlations"):
    camp_data = df[df['CAMPAIGN_ID'] == campaign_id]
    if len(camp_data) >= 20:
        corr = camp_data['PACING'].corr(camp_data['FINAL_BID'])
        within_corrs.append(corr)

print(f"\nWithin-campaign correlations (n={len(within_corrs):,} campaigns):")
print(f"  Mean correlation: {np.mean(within_corrs):.4f}")
print(f"  Median correlation: {np.median(within_corrs):.4f}")
print(f"  Positive (>0.3): {(np.array(within_corrs) > 0.3).sum()} ({(np.array(within_corrs) > 0.3).mean()*100:.1f}%)")
print(f"  Negative (<-0.3): {(np.array(within_corrs) < -0.3).sum()} ({(np.array(within_corrs) < -0.3).mean()*100:.1f}%)")

print("\nInterpretation:")
if np.mean(within_corrs) > 0.3:
    print("  ✓ STRONG EVIDENCE: PACING appears to be a bid multiplier")
elif np.mean(within_corrs) < -0.3:
    print("  ✗ INVERSE RELATIONSHIP: Higher pacing → Lower bids (unexpected)")
else:
    print("  ⚠ WEAK CORRELATION: FINAL_BID likely already has pacing applied")
    print("    Data shows bids AFTER pacing discount, not before")

print("\n" + "="*80)

TEST 1: PACING AS BID MULTIPLIER

Hypothesis: FINAL_BID = BASE_BID × PACING
Expected: Positive within-campaign correlation

Campaigns with variable pacing: 6,651
  (n≥100 bids, pacing_std>0.2)


Computing within-campaign correlations: 100%|██████████| 500/500 [04:00<00:00,  2.08it/s]


Within-campaign correlations (n=500 campaigns):
  Mean correlation: -0.1454
  Median correlation: -0.1158
  Positive (>0.3): 54 (10.8%)
  Negative (<-0.3): 170 (34.0%)

Interpretation:
  ⚠ WEAK CORRELATION: FINAL_BID likely already has pacing applied
    Data shows bids AFTER pacing discount, not before

Note: the mean within-campaign correlation is negative (-0.1454), and 34.0% of sampled campaigns fall below -0.3 versus 10.8% above 0.3. This does not match the positive correlation expected under FINAL_BID = BASE_BID × PACING, so this test does not support the multiplier hypothesis. The result alone also does not establish that FINAL_BID already has pacing applied.






### Test 2: Temporal Pacing Patterns (Budget Depletion)

In [10]:
print("="*80)
print("TEST 2: TEMPORAL PACING PATTERNS")
print("="*80)

print("\nHypothesis: Pacing decreases during the day as budget depletes")

sample_campaigns_temporal = variable_campaigns.head(200)['CAMPAIGN_ID'].values
temporal_patterns = []

for campaign_id in tqdm(sample_campaigns_temporal, desc="Analyzing temporal patterns"):
    camp_data = df[df['CAMPAIGN_ID'] == campaign_id].copy()
    camp_data = camp_data.sort_values('datetime')
    
    for date in camp_data['date'].unique():
        day_data = camp_data[camp_data['date'] == date]
        if len(day_data) >= 10:
            day_data = day_data.copy()
            day_data['auction_order'] = range(len(day_data))
            time_corr = day_data['auction_order'].corr(day_data['PACING'])
            temporal_patterns.append({
                'CAMPAIGN_ID': campaign_id,
                'date': date,
                'time_pacing_corr': time_corr,
                'n_auctions': len(day_data),
                'first_pacing': day_data.iloc[0]['PACING'],
                'last_pacing': day_data.iloc[-1]['PACING'],
                'pacing_change': day_data.iloc[-1]['PACING'] - day_data.iloc[0]['PACING']
            })

temporal_df = pd.DataFrame(temporal_patterns)
print(f"\nCampaign-days analyzed: {len(temporal_df):,}")
print(f"  Mean time-pacing correlation: {temporal_df['time_pacing_corr'].mean():.4f}")
print(f"  Days with decreasing pacing: {(temporal_df['time_pacing_corr'] < -0.3).sum()} ({(temporal_df['time_pacing_corr'] < -0.3).mean()*100:.1f}%)")
print(f"  Days with increasing pacing: {(temporal_df['time_pacing_corr'] > 0.3).sum()} ({(temporal_df['time_pacing_corr'] > 0.3).mean()*100:.1f}%)")
print(f"  Mean pacing change (end - start): {temporal_df['pacing_change'].mean():.4f}")

print("\nInterpretation:")
if temporal_df['time_pacing_corr'].mean() < -0.2:
    print("  ✓ EVIDENCE: Pacing decreases during the day → budget depletion")
elif temporal_df['time_pacing_corr'].mean() > 0.2:
    print("  ✗ COUNTER-EVIDENCE: Pacing increases during the day")
else:
    print("  ⚠ NEUTRAL: No clear temporal pattern")

print("\n" + "="*80)

TEST 2: TEMPORAL PACING PATTERNS

Hypothesis: Pacing decreases during the day as budget depletes


Analyzing temporal patterns: 100%|██████████| 200/200 [01:31<00:00,  2.18it/s]


Campaign-days analyzed: 463
  Mean time-pacing correlation: -0.3235
  Days with decreasing pacing: 240 (51.8%)
  Days with increasing pacing: 60 (13.0%)
  Mean pacing change (end - start): -0.2169

Interpretation:
  ✓ EVIDENCE: Pacing decreases during the day → budget depletion
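
The per-campaign loop above took about a minute and a half for 200 campaigns because it filters the full frame once per campaign. A hedged sketch of an equivalent grouped computation, shown on toy data; the helper name `time_pacing_corr` is hypothetical and the column names follow the notebook:

```python
import numpy as np
import pandas as pd

def time_pacing_corr(bids: pd.DataFrame, min_auctions: int = 10) -> pd.Series:
    """Correlation between within-day auction order and PACING per campaign-day."""
    ordered = bids.sort_values('datetime').assign(date=lambda d: d['datetime'].dt.date)
    keys = ['CAMPAIGN_ID', 'date']
    # 0-based position of each auction within its campaign-day, in time order.
    ordered['auction_order'] = ordered.groupby(keys).cumcount()
    # Keep campaign-days with at least `min_auctions` bids, as in the loop version.
    group_size = ordered.groupby(keys)['PACING'].transform('count')
    kept = ordered[group_size >= min_auctions]
    return kept.groupby(keys)[['auction_order', 'PACING']].apply(
        lambda g: g['auction_order'].corr(g['PACING'])
    )

# Toy campaign-day whose pacing falls linearly over 12 auctions.
toy = pd.DataFrame({
    'CAMPAIGN_ID': ['c1'] * 12,
    'datetime': pd.date_range('2025-09-02 08:00', periods=12, freq='60min'),
    'PACING': np.linspace(1.0, 0.45, 12),
})

corrs = time_pacing_corr(toy)
print(corrs)
```

Because the grouping happens once, this form also makes it practical to analyze all variable-pacing campaigns instead of the first 200.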






### Test 3: Campaign-Level Characteristics

In [11]:
print("="*80)
print("TEST 3: CAMPAIGN-LEVEL PACING CHARACTERISTICS")
print("="*80)

all_campaign_stats = df.groupby('CAMPAIGN_ID').agg({
    'PACING': ['mean', 'std'],
    'FINAL_BID': ['mean', 'std'],
    'IS_WINNER': 'mean',
    'QUALITY': 'mean',
    'AUCTION_ID': 'count'
}).reset_index()
all_campaign_stats.columns = ['_'.join(col).strip('_') for col in all_campaign_stats.columns.values]
all_campaign_stats = all_campaign_stats[all_campaign_stats['AUCTION_ID_count'] >= 50]

all_campaign_stats['bid_cv'] = all_campaign_stats['FINAL_BID_std'] / (all_campaign_stats['FINAL_BID_mean'] + 0.01)
all_campaign_stats['pacing_cv'] = all_campaign_stats['PACING_std'] / (all_campaign_stats['PACING_mean'] + 0.001)

print(f"\nTotal campaigns: {len(all_campaign_stats):,}")

print("\nCampaign Segments:")
stable_high = all_campaign_stats[(all_campaign_stats['PACING_mean'] > 0.95) & (all_campaign_stats['PACING_std'] < 0.05)]
variable = all_campaign_stats[all_campaign_stats['PACING_std'] > 0.2]
stable_low = all_campaign_stats[(all_campaign_stats['PACING_mean'] < 0.7) & (all_campaign_stats['PACING_std'] < 0.15)]

print(f"  Stable high pacing (mean>0.95, std<0.05): {len(stable_high):,} ({len(stable_high)/len(all_campaign_stats)*100:.1f}%)")
print(f"    Mean bid CV: {stable_high['bid_cv'].mean():.4f}")
print(f"    Mean win rate: {stable_high['IS_WINNER_mean'].mean()*100:.2f}%")

print(f"\n  Variable pacing (std>0.2): {len(variable):,} ({len(variable)/len(all_campaign_stats)*100:.1f}%)")
print(f"    Mean bid CV: {variable['bid_cv'].mean():.4f}")
print(f"    Mean win rate: {variable['IS_WINNER_mean'].mean()*100:.2f}%")

print(f"\n  Stable low pacing (mean<0.7, std<0.15): {len(stable_low):,} ({len(stable_low)/len(all_campaign_stats)*100:.1f}%)")
print(f"    Mean bid CV: {stable_low['bid_cv'].mean():.4f}")
print(f"    Mean win rate: {stable_low['IS_WINNER_mean'].mean()*100:.2f}%")

print("\nKey Correlations (Campaign-Level):")
print(f"  Bid CV ~ Pacing Std: {all_campaign_stats['bid_cv'].corr(all_campaign_stats['PACING_std']):.4f}")
print(f"  Mean Bid ~ Mean Pacing: {all_campaign_stats['FINAL_BID_mean'].corr(all_campaign_stats['PACING_mean']):.4f}")
print(f"  Win Rate ~ Mean Pacing: {all_campaign_stats['IS_WINNER_mean'].corr(all_campaign_stats['PACING_mean']):.4f}")

print("\n" + "="*80)

TEST 3: CAMPAIGN-LEVEL PACING CHARACTERISTICS

Total campaigns: 42,767

Campaign Segments:
  Stable high pacing (mean>0.95, std<0.05): 8,095 (18.9%)
    Mean bid CV: 0.4441
    Mean win rate: 81.71%

  Variable pacing (std>0.2): 13,260 (31.0%)
    Mean bid CV: 0.7387
    Mean win rate: 79.73%

  Stable low pacing (mean<0.7, std<0.15): 312 (0.7%)
    Mean bid CV: 0.3560
    Mean win rate: 88.29%

Key Correlations (Campaign-Level):
  Bid CV ~ Pacing Std: 0.3012
  Mean Bid ~ Mean Pacing: -0.1974
  Win Rate ~ Mean Pacing: -0.0118
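
The campaign-level table in Test 3 can also be built with pandas named aggregation, which yields flat column names without joining MultiIndex levels. A minimal sketch on toy data; the frame `toy` and the output column names are illustrative assumptions, not the notebook's.

```python
import pandas as pd

# Toy bids (hypothetical values): campaign "a" never paces, campaign "b" does
toy = pd.DataFrame({
    "AUCTION_ID": [1, 2, 3, 4, 5, 6],
    "CAMPAIGN_ID": ["a", "a", "a", "b", "b", "b"],
    "PACING": [1.0, 1.0, 1.0, 0.2, 0.6, 1.0],
    "FINAL_BID": [10.0, 12.0, 14.0, 5.0, 20.0, 35.0],
    "IS_WINNER": [True, True, False, True, False, False],
})

# Named aggregation: output_name=(input_column, function)
stats = toy.groupby("CAMPAIGN_ID").agg(
    pacing_mean=("PACING", "mean"),
    pacing_std=("PACING", "std"),
    bid_mean=("FINAL_BID", "mean"),
    bid_std=("FINAL_BID", "std"),
    win_rate=("IS_WINNER", "mean"),
    n_bids=("AUCTION_ID", "count"),
).reset_index()

# Coefficient of variation with the same small stabilizer used in Test 3
stats["bid_cv"] = stats["bid_std"] / (stats["bid_mean"] + 0.01)
print(stats)
```

The result has one row per campaign, so segment filters can refer to `pacing_mean` and `pacing_std` directly.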



## Section 3: Ten Hypotheses Testing

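This section tests ten additional hypotheses about pacing behavior. Each verdict (SUPPORT, WEAK, MODERATE, or REJECT) comes from comparing a single summary statistic against a fixed heuristic threshold, not from a formal significance test.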
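
H7 in the cell below measures bidding concentration with a Gini coefficient computed from sorted counts. As a sanity check on that formula, here is a small sketch applying it to two limiting cases; `gini` is a hypothetical helper name introduced for illustration.

```python
import numpy as np

def gini(counts) -> float:
    """Gini coefficient from non-negative counts (rank-weighted formula on sorted data)."""
    x = np.sort(np.asarray(counts, dtype=float))  # ascending order
    n = len(x)
    ranks = np.arange(1, n + 1)
    return 2.0 * np.sum(ranks * x) / (n * np.sum(x)) - (n + 1) / n

equal = gini([5, 5, 5, 5])          # every product receives the same number of bids
concentrated = gini([0, 0, 0, 10])  # a single product receives every bid
print(equal, concentrated)
```

With this formula perfect equality gives 0 and full concentration gives (n - 1) / n rather than exactly 1. Because `value_counts()` only lists products with at least one bid, products that never appear cannot pull the coefficient up.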

In [12]:
print("="*80)
print("10 HYPOTHESES TESTING")
print("="*80)

results = []

print("\nH1: Vendor-level pacing coordination")
print("-" * 80)
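# Vendors running at least 3 campaigns (first 100 in index order, not a random draw)
vendor_campaign_counts = df.groupby('VENDOR_ID')['CAMPAIGN_ID'].nunique()
multi_campaign_vendors = vendor_campaign_counts[vendor_campaign_counts >= 3].index[:100]
# Spread of campaign-level mean pacing within each vendor (up to 5 campaigns per vendor)
within_vendor_stds = []
for vendor_id in multi_campaign_vendors:
    vendor_campaigns = df[df['VENDOR_ID'] == vendor_id]['CAMPAIGN_ID'].unique()[:5]
    campaign_pacings = [df[df['CAMPAIGN_ID'] == c]['PACING'].mean() for c in vendor_campaigns]
    if len(campaign_pacings) >= 3:
        within_vendor_stds.append(np.std(campaign_pacings))
# Baseline: spread within random groups of 5 campaigns.
# Caveats: vendor groups hold 3-5 campaigns while baseline groups hold 5, and the
# population std (ddof=0) is biased downward for smaller groups, so part of the gap
# may be mechanical. The sample is unseeded, so the baseline varies between runs.
random_pacings = df.groupby('CAMPAIGN_ID')['PACING'].mean().sample(min(500, len(df['CAMPAIGN_ID'].unique())))
random_groups = [random_pacings.values[i:i+5].std() for i in range(0, min(500, len(random_pacings)), 5)]
# A ratio below 1 means campaigns of the same vendor pace more alike than random campaigns
h1_ratio = np.mean(within_vendor_stds) / np.mean(random_groups) if within_vendor_stds else 1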
print(f"Within-vendor pacing std: {np.mean(within_vendor_stds):.4f}")
print(f"Random baseline std: {np.mean(random_groups):.4f}")
print(f"Ratio: {h1_ratio:.4f}")
print(f"Evidence: {'SUPPORT' if h1_ratio < 0.85 else 'REJECT'}")
results.append({'H': 'H1', 'Evidence': 'SUPPORT' if h1_ratio < 0.85 else 'REJECT'})

print("\nH2: Quality predicts pacing stability")
print("-" * 80)
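# After aggregation the 'PACING' column holds each campaign's pacing standard deviation
camp_stats = df.groupby('CAMPAIGN_ID').agg({'QUALITY': 'mean', 'PACING': 'std', 'AUCTION_ID': 'count'}).reset_index()
camp_stats = camp_stats[camp_stats['AUCTION_ID'] >= 50]
# Correlation between mean quality and pacing std; a negative value means
# higher-quality campaigns pace more stably
h2_corr = camp_stats['QUALITY'].corr(camp_stats['PACING'])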
print(f"Correlation (Quality, Pacing Std): {h2_corr:.4f}")
print(f"N campaigns: {len(camp_stats):,}")
print(f"Evidence: {'SUPPORT' if h2_corr < -0.1 else 'REJECT'}")
results.append({'H': 'H2', 'Evidence': 'SUPPORT' if h2_corr < -0.1 else 'REJECT'})

print("\nH3: Placement-specific pacing strategies")
print("-" * 80)
placement_pacing = df.groupby('PLACEMENT')['PACING'].mean()
h3_range = placement_pacing.max() - placement_pacing.min()
print(f"Pacing range across placements: {h3_range:.4f}")
for placement in placement_pacing.index:
    print(f"  Placement {placement}: {placement_pacing[placement]:.4f}")
print(f"Evidence: {'SUPPORT' if h3_range > 0.05 else 'REJECT'}")
results.append({'H': 'H3', 'Evidence': 'SUPPORT' if h3_range > 0.05 else 'REJECT'})

print("\nH4: Daily pacing resets")
print("-" * 80)
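# Hours are pooled across all days, so this compares late-night and early-morning
# pacing in general rather than consecutive-day transitions.
late_night = df[df['hour'].isin([22, 23])].groupby('CAMPAIGN_ID')['PACING'].mean()
early_morning = df[df['hour'].isin([0, 1, 2])].groupby('CAMPAIGN_ID')['PACING'].mean()
common = set(late_night.index) & set(early_morning.index)
if len(common) > 10:
    # The jump uses up to 200 campaigns present in both windows (set order is arbitrary)
    jumps = [early_morning[c] - late_night[c] for c in list(common)[:200]]
    h4_jump = np.mean(jumps)
else:
    h4_jump = 0
print(f"Mean pacing jump (early - late): {h4_jump:.4f}")
# The two means below cover every campaign in each window, not only the common
# campaigns, so their difference need not equal h4_jump.
print(f"Late night mean: {late_night.mean():.4f}")
print(f"Early morning mean: {early_morning.mean():.4f}")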
print(f"Evidence: {'SUPPORT' if h4_jump > 0.1 else 'WEAK'}")
results.append({'H': 'H4', 'Evidence': 'SUPPORT' if h4_jump > 0.1 else 'WEAK'})

print("\nH5: Low-pacing winners are exceptional")
print("-" * 80)
low_pac_winners = df[(df['IS_WINNER']) & (df['PACING'] < 0.5)]
high_pac_winners = df[(df['IS_WINNER']) & (df['PACING'] > 0.9)]
h5_quality_ratio = low_pac_winners['QUALITY'].mean() / high_pac_winners['QUALITY'].mean()
print(f"Low-pacing winner quality: {low_pac_winners['QUALITY'].mean():.6f}")
print(f"High-pacing winner quality: {high_pac_winners['QUALITY'].mean():.6f}")
print(f"Ratio: {h5_quality_ratio:.4f}")
print(f"Low-pacing winner bid: ${low_pac_winners['FINAL_BID_DOLLARS'].mean():.4f}")
print(f"High-pacing winner bid: ${high_pac_winners['FINAL_BID_DOLLARS'].mean():.4f}")
print(f"Evidence: {'SUPPORT' if h5_quality_ratio > 1.05 else 'REJECT'}")
results.append({'H': 'H5', 'Evidence': 'SUPPORT' if h5_quality_ratio > 1.05 else 'REJECT'})

print("\nH6: CVR predicts budget allocation")
print("-" * 80)
camp_cvr = df.groupby('CAMPAIGN_ID').agg({'CONVERSION_RATE': 'mean', 'PACING': 'mean', 'FINAL_BID_DOLLARS': 'mean', 'AUCTION_ID': 'count'}).reset_index()
# Same minimum-volume filter (at least 50 bids) as Test 3 and H2
camp_cvr = camp_cvr[camp_cvr['AUCTION_ID'] >= 50]
h6_corr_pacing = camp_cvr['CONVERSION_RATE'].corr(camp_cvr['PACING'])
h6_corr_bid = camp_cvr['CONVERSION_RATE'].corr(camp_cvr['FINAL_BID_DOLLARS'])
print(f"Correlation (CVR, Pacing): {h6_corr_pacing:.4f}")
print(f"Correlation (CVR, Bid): {h6_corr_bid:.4f}")
print(f"Evidence: {'SUPPORT' if abs(h6_corr_pacing) > 0.1 or abs(h6_corr_bid) > 0.1 else 'REJECT'}")
results.append({'H': 'H6', 'Evidence': 'SUPPORT' if abs(h6_corr_pacing) > 0.1 or abs(h6_corr_bid) > 0.1 else 'REJECT'})

print("\nH7: Product bidding concentration")
print("-" * 80)
product_counts = df['PRODUCT_ID'].value_counts()
sorted_counts = np.sort(product_counts.values)
n = len(sorted_counts)
h7_gini = (2 * np.sum(np.arange(1, n+1) * sorted_counts)) / (n * np.sum(sorted_counts)) - (n + 1) / n
top_10pct = int(len(product_counts) * 0.1)
h7_top10_share = product_counts.iloc[:top_10pct].sum() / len(df)
print(f"Gini coefficient: {h7_gini:.4f}")
print(f"Top 10% products share: {h7_top10_share*100:.2f}%")
print(f"Total unique products: {len(product_counts):,}")
print(f"Evidence: {'SUPPORT' if h7_gini > 0.7 else 'MODERATE'}")
results.append({'H': 'H7', 'Evidence': 'SUPPORT' if h7_gini > 0.7 else 'MODERATE'})

print("\nH8: Weekend pacing differs from weekday")
print("-" * 80)
df['is_weekend'] = df['day_of_week'].isin([5, 6])
weekend = df[df['is_weekend']]
weekday = df[~df['is_weekend']]
h8_diff = weekend['PACING'].mean() - weekday['PACING'].mean()
print(f"Weekend pacing: {weekend['PACING'].mean():.4f}")
print(f"Weekday pacing: {weekday['PACING'].mean():.4f}")
print(f"Difference: {h8_diff:.4f}")
print(f"Evidence: {'SUPPORT' if abs(h8_diff) > 0.03 else 'REJECT'}")
results.append({'H': 'H8', 'Evidence': 'SUPPORT' if abs(h8_diff) > 0.03 else 'REJECT'})

print("\nH9: Multi-product campaign pacing")
print("-" * 80)
camp_products = df.groupby('CAMPAIGN_ID')['PRODUCT_ID'].nunique()
camp_pacing = df.groupby('CAMPAIGN_ID')['PACING'].mean()
combined = pd.DataFrame({'n_products': camp_products, 'pacing': camp_pacing})
single = combined[combined['n_products'] == 1]
multi = combined[combined['n_products'] >= 5]
h9_diff = abs(single['pacing'].mean() - multi['pacing'].mean())
print(f"Single-product campaigns: {single['pacing'].mean():.4f}")
print(f"Multi-product campaigns (5+): {multi['pacing'].mean():.4f}")
print(f"Difference: {h9_diff:.4f}")
print(f"Evidence: {'SUPPORT' if h9_diff > 0.05 else 'REJECT'}")
results.append({'H': 'H9', 'Evidence': 'SUPPORT' if h9_diff > 0.05 else 'REJECT'})

print("\nH10: Vendor-level budget cascades")
print("-" * 80)
# Note: this measures the share of multi-campaign vendor-days whose mean pacing is
# below 0.5. It does not test whether one campaign's depletion precedes another's.
vendor_daily = df.groupby(['VENDOR_ID', 'date']).agg({'PACING': 'mean', 'CAMPAIGN_ID': 'nunique'}).reset_index()
multi_campaign_vendors = vendor_daily[vendor_daily['CAMPAIGN_ID'] >= 2]
low_pacing_days = len(multi_campaign_vendors[multi_campaign_vendors['PACING'] < 0.5])
total_days = len(multi_campaign_vendors)
h10_cascade = low_pacing_days / total_days if total_days > 0 else 0
print(f"Low-pacing vendor-days: {low_pacing_days:,}")
print(f"Total vendor-days: {total_days:,}")
print(f"Cascade rate: {h10_cascade*100:.2f}%")
print(f"Evidence: {'SUPPORT' if h10_cascade > 0.1 else 'WEAK'}")
results.append({'H': 'H10', 'Evidence': 'SUPPORT' if h10_cascade > 0.1 else 'WEAK'})

print("\n" + "="*80)
print("HYPOTHESIS SUMMARY")
print("="*80)
results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))
print(f"\nSupported: {(results_df['Evidence'] == 'SUPPORT').sum()}/10")
print(f"Weak/Moderate: {results_df['Evidence'].isin(['WEAK', 'MODERATE']).sum()}/10")
print(f"Rejected: {(results_df['Evidence'] == 'REJECT').sum()}/10")
print("\n" + "="*80)

10 HYPOTHESES TESTING

H1: Vendor-level pacing coordination
--------------------------------------------------------------------------------
Within-vendor pacing std: 0.0914
Random baseline std: 0.1604
Ratio: 0.5694
Evidence: SUPPORT

H2: Quality predicts pacing stability
--------------------------------------------------------------------------------
Correlation (Quality, Pacing Std): -0.1335
N campaigns: 42,767
Evidence: SUPPORT

H3: Placement-specific pacing strategies
--------------------------------------------------------------------------------
Pacing range across placements: 0.0752
  Placement 1: 0.8833
  Placement 2: 0.8633
  Placement 3: 0.9384
  Placement 4: 0.9114
  Placement 5: 0.8913
Evidence: SUPPORT

H4: Daily pacing resets
--------------------------------------------------------------------------------
Mean pacing jump (early - late): 0.0563
Late night mean: 0.9033
Early morning mean: 0.9699
Evidence: WEAK

H5: Low-pacing winners are exceptional
-----------------------
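
H1's verdict rests on a ratio of standard deviations against a single random draw. One way to attach a p-value is a permutation test that shuffles vendor labels while keeping group sizes fixed. The sketch below runs on synthetic data; the function names, the 40-vendor setup, and the noise levels are all assumptions made for illustration, not values from this dataset.

```python
import numpy as np
import pandas as pd

def mean_within_group_std(values: np.ndarray, groups: np.ndarray) -> float:
    """Average population std (ddof=0) of values within each group."""
    return pd.Series(values).groupby(groups).std(ddof=0).mean()

def permutation_pvalue(values, groups, n_perm=500, seed=0):
    """One-sided test: is the within-group spread smaller than under shuffled labels?"""
    rng = np.random.default_rng(seed)
    observed = mean_within_group_std(values, groups)
    # Shuffling labels keeps every group size intact, so the null is size-matched
    null = np.array([
        mean_within_group_std(values, rng.permutation(groups))
        for _ in range(n_perm)
    ])
    p_value = (np.sum(null <= observed) + 1) / (n_perm + 1)
    return observed, null.mean(), p_value

# Synthetic campaign-level mean pacing: 40 vendors x 4 campaigns, with a shared
# vendor-level pacing level plus small campaign-level noise (hypothetical numbers)
rng = np.random.default_rng(1)
vendor_ids = np.repeat(np.arange(40), 4)
vendor_level = rng.uniform(0.5, 1.0, size=40)
pacing = vendor_level[vendor_ids] + rng.normal(0.0, 0.02, size=vendor_ids.size)

observed, null_mean, p_value = permutation_pvalue(pacing, vendor_ids)
print(f"observed={observed:.4f}  null mean={null_mean:.4f}  p={p_value:.4f}")
```

Because the shuffled groups have the same sizes as the real vendor groups, this comparison avoids the group-size mismatch between vendor groups and fixed-size random groups.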

## Section 4: Statistical Models

### Model 1: Ranking Regression

In [13]:
print("="*80)
print("STATISTICAL MODELS")
print("="*80)

print("\nMODEL 1: RANKING REGRESSION")
print("-" * 80)
print("\nUnit of Analysis: Individual bid")
print("Dependent Variable: log(RANKING)")
print("Independent Variables: log(FINAL_BID), log(PACING), log(QUALITY)")
print("\nModel Equation:")
print("  log(RANKING) = β₀ + β₁·log(FINAL_BID) + β₂·log(PACING) + β₃·log(QUALITY) + ε")
print("\nPurpose: Decompose what drives ranking in auctions")
print("Coefficient Interpretation: Elasticities (% change in rank for 1% change in X)")
print("Error Term: Captures placement effects, time effects, unmeasured quality")

sample_df = df.sample(min(100000, len(df)), random_state=42)
sample_df = sample_df[(sample_df['FINAL_BID'] > 0) & (sample_df['PACING'] > 0) & (sample_df['QUALITY'] > 0)].copy()

sample_df['log_rank'] = np.log(sample_df['RANKING'])
sample_df['log_bid'] = np.log(sample_df['FINAL_BID'])
sample_df['log_pacing'] = np.log(sample_df['PACING'])
sample_df['log_quality'] = np.log(sample_df['QUALITY'])

X = sample_df[['log_bid', 'log_pacing', 'log_quality']].values
X = np.column_stack([np.ones(len(X)), X])
y = sample_df['log_rank'].values

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
r2 = 1 - (resid**2).sum() / ((y - y.mean())**2).sum()

print(f"\nSample Size: {len(sample_df):,}")
print(f"R²: {r2:.4f}")
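# Each coefficient is the elasticity of RANKING with respect to the regressor.
# The parenthetical prints -beta, i.e. the % improvement in position, assuming a
# lower RANKING value is a better position; it is not the raw change in RANKING.
print(f"\nCoefficients (Elasticities):")
print(f"  Intercept:        {beta[0]:8.4f}")
print(f"  log(FINAL_BID):   {beta[1]:8.4f}  (1% ↑ bid → {-beta[1]:.2f}% change in rank)")
print(f"  log(PACING):      {beta[2]:8.4f}  (1% ↑ pacing → {-beta[2]:.2f}% change in rank)")
print(f"  log(QUALITY):     {beta[3]:8.4f}  (1% ↑ quality → {-beta[3]:.2f}% change in rank)")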

print("\nInterpretation:")
if abs(beta[2]) > 0.1:
    print(f"  PACING has independent effect on ranking")
else:
    print(f"  PACING has weak effect - FINAL_BID likely already incorporates pacing")

print("\n" + "="*80)

STATISTICAL MODELS

MODEL 1: RANKING REGRESSION
--------------------------------------------------------------------------------

Unit of Analysis: Individual bid
Dependent Variable: log(RANKING)
Independent Variables: log(FINAL_BID), log(PACING), log(QUALITY)

Model Equation:
  log(RANKING) = β₀ + β₁·log(FINAL_BID) + β₂·log(PACING) + β₃·log(QUALITY) + ε

Purpose: Decompose what drives ranking in auctions
Coefficient Interpretation: Elasticities (% change in rank for 1% change in X)
Error Term: Captures placement effects, time effects, unmeasured quality

Sample Size: 95,940
R²: 0.0181

Coefficients (Elasticities):
  Intercept:          2.9033
  log(FINAL_BID):    -0.0765  (1% ↑ bid → 0.08% change in rank)
  log(PACING):        0.0552  (1% ↑ pacing → -0.06% change in rank)
  log(QUALITY):      -0.0533  (1% ↑ quality → 0.05% change in rank)

Interpretation:
  PACING has weak effect - FINAL_BID likely already incorporates pacing
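
Model 1 reports point estimates only, so it is hard to judge whether a coefficient differs from zero. A minimal sketch of classical OLS standard errors with NumPy follows, run on synthetic data; `ols_with_se` is a hypothetical helper, and the simulated coefficients are arbitrary, not estimates from this dataset.

```python
import numpy as np

def ols_with_se(X: np.ndarray, y: np.ndarray):
    """OLS coefficients with classical (homoskedastic) standard errors."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)        # unbiased residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)   # coefficient covariance matrix
    return beta, np.sqrt(np.diag(cov))

# Synthetic log-log data with arbitrary true coefficients (illustration only)
rng = np.random.default_rng(0)
n = 5000
log_bid = rng.normal(3.0, 1.0, n)
log_quality = rng.normal(-4.0, 0.5, n)
log_rank = 3.0 - 0.5 * log_bid - 0.3 * log_quality + rng.normal(0.0, 1.0, n)

X = np.column_stack([np.ones(n), log_bid, log_quality])
beta, se = ols_with_se(X, log_rank)
for name, b, s in zip(["Intercept", "log_bid", "log_quality"], beta, se):
    print(f"{name:12s} {b:8.4f}  (se {s:.4f}, t = {b / s:.1f})")
```

These standard errors assume independent errors with constant variance. Bids in the same auction are ranked against each other, so errors clustered by AUCTION_ID would likely be more appropriate for the real data.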



### Model 2: Win Probability Model

In [14]:
print("="*80)
print("MODEL 2: WIN PROBABILITY")
print("="*80)

print("\nUnit of Analysis: Individual bid")
print("Dependent Variable: IS_WINNER (binary)")
print("Model: Logistic Regression")
print("Independent Variables: PACING, FINAL_BID, QUALITY, PLACEMENT")
print("\nPurpose: Estimate how pacing affects win probability")
print("Coefficient Interpretation: Log-odds ratios")

from scipy.special import expit

sample_win = df.sample(min(50000, len(df)), random_state=42)
sample_win = sample_win[(sample_win['FINAL_BID'] > 0) & (sample_win['PACING'] > 0)].copy()

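# Standardize features to z-scores. PLACEMENT is listed as an independent variable
# in the header above but is not part of the feature matrix built here.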
sample_win['pacing_norm'] = (sample_win['PACING'] - sample_win['PACING'].mean()) / sample_win['PACING'].std()
sample_win['bid_norm'] = (sample_win['FINAL_BID'] - sample_win['FINAL_BID'].mean()) / sample_win['FINAL_BID'].std()
sample_win['quality_norm'] = (sample_win['QUALITY'] - sample_win['QUALITY'].mean()) / sample_win['QUALITY'].std()

X_win = sample_win[['pacing_norm', 'bid_norm', 'quality_norm']].values
X_win = np.column_stack([np.ones(len(X_win)), X_win])
y_win = sample_win['IS_WINNER'].values.astype(float)

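# Simple logistic regression via batch gradient descent. With only 10 iterations at
# lr = 0.01 from a zero start, the fit is far from converged: the coefficients below
# reflect a few early gradient steps, not maximum-likelihood estimates.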
beta_win = np.zeros(4)
lr = 0.01
for _ in range(10):
    pred = expit(X_win @ beta_win)
    gradient = X_win.T @ (pred - y_win) / len(y_win)
    beta_win -= lr * gradient

pred_final = expit(X_win @ beta_win)
accuracy = ((pred_final > 0.5) == y_win).mean()

print(f"\nSample Size: {len(sample_win):,}")
print(f"Accuracy: {accuracy*100:.2f}%")
print(f"Baseline (always predict win): {y_win.mean()*100:.2f}%")
print(f"\nCoefficients (Log-Odds):")
print(f"  Intercept:  {beta_win[0]:8.4f}")
print(f"  PACING:     {beta_win[1]:8.4f}")
print(f"  FINAL_BID:  {beta_win[2]:8.4f}")
print(f"  QUALITY:    {beta_win[3]:8.4f}")

print("\nMarginal Effects (at means):")
mean_pred = expit(beta_win[0])
for i, var in enumerate(['PACING', 'FINAL_BID', 'QUALITY']):
    marginal = beta_win[i+1] * mean_pred * (1 - mean_pred)
    print(f"  {var}: {marginal:8.4f} (1 std ↑ → {marginal*100:.2f}pp change in win prob)")

print("\n" + "="*80)

MODEL 2: WIN PROBABILITY

Unit of Analysis: Individual bid
Dependent Variable: IS_WINNER (binary)
Model: Logistic Regression
Independent Variables: PACING, FINAL_BID, QUALITY, PLACEMENT

Purpose: Estimate how pacing affects win probability
Coefficient Interpretation: Log-odds ratios

Sample Size: 47,997
Accuracy: 86.04%
Baseline (always predict win): 86.04%

Coefficients (Log-Odds):
  Intercept:    0.0356
  PACING:      -0.0004
  FINAL_BID:    0.0013
  QUALITY:      0.0010

Marginal Effects (at means):
  PACING:  -0.0001 (1 std ↑ → -0.01pp change in win prob)
  FINAL_BID:   0.0003 (1 std ↑ → 0.03pp change in win prob)
  QUALITY:   0.0003 (1 std ↑ → 0.03pp change in win prob)
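
The near-zero coefficients above are what ten small gradient steps from a zero start produce, so they say little about the true effect sizes. The sketch below contrasts that procedure with Newton-Raphson run to convergence on synthetic data; `fit_logit_newton` is a hypothetical helper and the true coefficients are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def fit_logit_newton(X: np.ndarray, y: np.ndarray, max_iter: int = 50, tol: float = 1e-8):
    """Maximum-likelihood logistic regression via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for iteration in range(1, max_iter + 1):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)                        # score vector
        hess = X.T @ (X * (p * (1 - p))[:, None])   # observed information
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta, iteration

# Synthetic standardized features with a high base win rate (illustration only)
rng = np.random.default_rng(0)
n = 20000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([1.8, 0.5, -0.3])
y = (rng.uniform(size=n) < sigmoid(X @ true_beta)).astype(float)

beta_newton, n_iter = fit_logit_newton(X, y)

# The same 10-step gradient descent used in the cell above, for comparison
beta_gd = np.zeros(3)
for _ in range(10):
    beta_gd -= 0.01 * X.T @ (sigmoid(X @ beta_gd) - y) / n

print(f"Newton ({n_iter} iterations): {np.round(beta_newton, 3)}")
print(f"10 GD steps:           {np.round(beta_gd, 3)}")
```

On data like this the Newton fit lands close to the simulated coefficients within a handful of iterations, while the ten-step gradient descent stays near zero. Refitting Model 2 to convergence would be needed before interpreting its coefficients or marginal effects.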

