# Two-Stage Least Squares (2SLS) Instrumental Variable Analysis

This notebook uses 2SLS regression to address potential endogeneity between validator centralization and market volatility.

**Why 2SLS?** Standard OLS regression may produce biased estimates if:
- Reverse causality exists (volatility affects centralization)
- Omitted variables influence both centralization and volatility

**Instrumental Variable:** We use **interest rates** (13-week Treasury Bill rate) as an instrument because:
1. **Relevance**: Interest rates affect validator economics and entry/exit decisions
2. **Exogeneity**: Interest rates are determined by macroeconomic policy, not ETH-specific factors
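3. **Exclusion Restriction**: Interest rates are assumed to affect volatility only through their impact on centralization; with a single instrument this assumption cannot be tested and must be argued on economic grounds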
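
The three conditions above can be illustrated on synthetic data. The sketch below does not use the notebook's data: the variables `z`, `x`, `y`, `u` and every coefficient are made up for illustration. An unobserved confounder `u` drives both the regressor and the outcome, so OLS is biased, while the simple IV (Wald) estimator cov(z, y) / cov(z, x) recovers the true effect because `z` is relevant and unrelated to `u`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
true_beta = 2.0                        # hypothetical causal effect of x on y

z = rng.normal(size=n)                 # instrument: independent of the confounder
u = rng.normal(size=n)                 # unobserved confounder (omitted variable)
x = 0.8 * z + u + rng.normal(size=n)   # endogenous regressor: depends on z and u
y = true_beta * x + 3.0 * u + rng.normal(size=n)

# OLS slope = cov(x, y) / var(x); biased because x is correlated with u
beta_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# IV (Wald) slope = cov(z, y) / cov(z, x); uses only the variation in x driven by z
beta_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"true beta: {true_beta:.2f}")
print(f"OLS:       {beta_ols:.2f}")
print(f"IV:        {beta_iv:.2f}")
```

If `z` were also allowed to enter the equation for `y` directly, the exclusion restriction would fail and the IV estimate would be biased as well.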

## Download Interest Rate Data

Download the 13-week Treasury Bill rate (`^IRX`) from Yahoo Finance to use as the instrumental variable.

In [21]:
# =============================================================================
# INSTRUMENTAL VARIABLE ANALYSIS USING INTEREST RATES (via yfinance)
# =============================================================================

import pandas as pd
import numpy as np
import yfinance as yf
from statsmodels.regression.linear_model import OLS
from statsmodels.tools import add_constant
from linearmodels.iv import IV2SLS

# Load the data
df = pd.read_csv('../data/processed/daily_regression_data.csv')

print("\n" + "="*80)
print("DOWNLOADING INTEREST RATE DATA")
print("="*80)

# Download interest rate data using yfinance
start_date = pd.to_datetime(df['date'].min())
end_date = pd.to_datetime(df['date'].max())

print(f"Downloading data from {start_date.date()} to {end_date.date()}")

# Download 13-week Treasury Bill rate
treasury_data = yf.download('^IRX', start=start_date, end=end_date, progress=False)

# The Close price is the yield in percentage
treasury_data = treasury_data[['Close']].copy()
treasury_data.columns = ['interest_rate']
treasury_data.index.name = 'date'
treasury_data = treasury_data.reset_index()

print(f"Downloaded {len(treasury_data)} observations of Treasury Bill rates")
print(f"Interest rate range: {treasury_data['interest_rate'].min():.2f}% to {treasury_data['interest_rate'].max():.2f}%")

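# Merge with the main dataset; the left join keeps every daily row of df
df_merged = df.copy()
df_merged['date'] = pd.to_datetime(df_merged['date'])
df_merged = df_merged.merge(treasury_data, on='date', how='left')

# Forward-fill non-trading days (weekends/holidays) with the last quoted rate.
# Assumes rows are in date order. Series.ffill() replaces the deprecated
# fillna(method='ffill').
df_merged['interest_rate'] = df_merged['interest_rate'].ffill()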

print(f"\nMerged dataset has {len(df_merged)} observations")
print(f"Missing interest rate values: {df_merged['interest_rate'].isna().sum()}")

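# Create the 30-day lagged interest rate. shift(30) lags by 30 rows, which
# equals 30 calendar days only because the dataset has one row per day.
df_merged['interest_rate_lag30'] = df_merged['interest_rate'].shift(30)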


DOWNLOADING INTEREST RATE DATA
Downloading data from 2020-07-13 to 2025-07-22
Downloaded 1262 observations of Treasury Bill rates
Interest rate range: 0.00% to 5.35%

Merged dataset has 1836 observations
Missing interest rate values: 0
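
The merge, forward-fill, and lag steps can be checked on a toy frame. The sketch below uses made-up dates and rates (nothing here comes from `^IRX`). It shows that a left join leaves `NaN` on non-trading days, that `ffill()` carries the last quoted rate forward, and that `shift(k)` lags by `k` rows, which matches `k` calendar days only when the frame has exactly one row per day.

```python
import pandas as pd

# Daily calendar (hypothetical dates) spanning a weekend: Fri 5th to Tue 9th
daily = pd.DataFrame({"date": pd.date_range("2024-01-05", "2024-01-09", freq="D")})

# Rates quoted on trading days only (made-up values)
rates = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-08", "2024-01-09"]),
    "interest_rate": [5.20, 5.22, 5.21],
})

merged = daily.merge(rates, on="date", how="left")
n_missing_before = int(merged["interest_rate"].isna().sum())  # weekend rows

# Carry the last quoted rate across the weekend
merged["interest_rate"] = merged["interest_rate"].ffill()

# Lag by 2 rows; with one row per calendar day this is a 2-day lag
merged["interest_rate_lag2"] = merged["interest_rate"].shift(2)

# Confirm the one-row-per-day assumption that makes row lags equal day lags
is_daily = bool((merged["date"].diff().dropna() == pd.Timedelta(days=1)).all())

print(merged)
print("missing before ffill:", n_missing_before, "| contiguous daily index:", is_daily)
```

Running the same contiguity check on `df_merged['date']` would confirm that `shift(30)` is a true 30-day lag.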




## 2SLS Analysis: Top20_mean

**Endogenous Variable:** `top20_mean` (average share of top 20 validators)

**Model:**
- First Stage: `top20_mean ~ interest_rate_lag30`
- Second Stage: `volatility ~ top20_mean_fitted + controls`, where `volatility` is the `prof_garman_klass_vol` column

**Controls:** market_return, eth_return, eth_turnover

In [22]:
# =============================================================================
# MANUAL TWO-STAGE LEAST SQUARES - TOP20_MEAN (CLEAN FIRST STAGE)
# =============================================================================

# Prepare data
iv_vars_manual = ['prof_garman_klass_vol', 'top20_mean', 'interest_rate_lag30',
                  'market_return', 'eth_return', 'eth_turnover']
iv_data_manual = df_merged[iv_vars_manual].dropna()

print(f"\nSample size: {len(iv_data_manual)}")

# ============================================================================
# FIRST STAGE: Regress endogenous variable on instrument only
# ============================================================================

print("\n" + "="*80)
print("FIRST STAGE REGRESSION (MANUAL - INSTRUMENT ONLY)")
print("Dependent Variable: top20_mean")
print("Instrument: interest_rate_lag30")
print("="*80)

X_first = add_constant(iv_data_manual[['interest_rate_lag30']])
y_first = iv_data_manual['top20_mean']
first_stage = OLS(y_first, X_first).fit(cov_type='HC3')

print(first_stage.summary())

# Get fitted values
top20_mean_fitted = first_stage.fittedvalues

# Report the first-stage F-statistic and its p-value
f_stat = first_stage.fvalue
f_pvalue = first_stage.f_pvalue
print(f"\nFirst-Stage F-Statistic: {f_stat:.2f}")
print(f"First-Stage F p-value: {f_pvalue:.4f}")

# ============================================================================
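# SECOND STAGE: Use fitted values with controls
# Note: this OLS treats the fitted values as observed data, so its standard
# errors do not account for first-stage estimation error.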
# ============================================================================

print("\n" + "="*80)
print("SECOND STAGE REGRESSION (MANUAL - WITH CONTROLS)")
print("Dependent Variable: prof_garman_klass_vol")
print("Endogenous Variable (instrumented): top20_mean")
print("Controls: market_return, eth_return, eth_turnover")
print("="*80)

X_second = add_constant(pd.concat([
    pd.Series(top20_mean_fitted, name='top20_mean', index=iv_data_manual.index),
    iv_data_manual[['market_return', 'eth_return', 'eth_turnover']]
], axis=1))
y_second = iv_data_manual['prof_garman_klass_vol']
second_stage = OLS(y_second, X_second).fit(cov_type='HC3')

print(second_stage.summary())

# ============================================================================
# COMPARE WITH OLS
# ============================================================================

X_ols_manual = add_constant(iv_data_manual[['top20_mean', 'market_return',
                                             'eth_return', 'eth_turnover']])
y_ols_manual = iv_data_manual['prof_garman_klass_vol']
ols_model_manual = OLS(y_ols_manual, X_ols_manual).fit(cov_type='HC3')

print("\n" + "="*80)
print("OLS vs MANUAL 2SLS COMPARISON")
print("="*80)

print("\nEffect of top20_mean on volatility:")
print(f"{'Method':<15} {'Coefficient':>12} {'Std Error':>12} {'p-value':>10}")
print("-"*52)
print(f"{'OLS':<15} {ols_model_manual.params['top20_mean']:>12.6f} {ols_model_manual.bse['top20_mean']:>12.6f} {ols_model_manual.pvalues['top20_mean']:>10.4f}")
print(f"{'Manual 2SLS':<15} {second_stage.params['top20_mean']:>12.6f} {second_stage.bse['top20_mean']:>12.6f} {second_stage.pvalues['top20_mean']:>10.4f}")

coef_diff_manual = second_stage.params['top20_mean'] - ols_model_manual.params['top20_mean']
pct_diff_manual = (coef_diff_manual / ols_model_manual.params['top20_mean']) * 100

print(f"\nDifference: {coef_diff_manual:.6f} ({pct_diff_manual:+.1f}%)")


Sample size: 1806

FIRST STAGE REGRESSION (MANUAL - INSTRUMENT ONLY)
Dependent Variable: top20_mean
Instrument: interest_rate_lag30
                            OLS Regression Results                            
Dep. Variable:             top20_mean   R-squared:                       0.074
Model:                            OLS   Adj. R-squared:                  0.074
Method:                 Least Squares   F-statistic:                     205.6
Date:                Tue, 16 Dec 2025   Prob (F-statistic):           3.02e-44
Time:                        17:00:59   Log-Likelihood:                 690.05
No. Observations:                1806   AIC:                            -1376.
Df Residuals:                    1804   BIC:                            -1365.
Df Model:                           1                                         
Covariance Type:                  HC3                                         
                          coef    std err          z      P>|z|      [0.025  
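
In textbook 2SLS the exogenous controls enter the first stage alongside the instrument, and standard errors are computed from residuals that use the observed regressor rather than its fitted values. The manual cell above does neither, so its output is best read as an approximation. The following is a minimal sketch of the textbook estimator on synthetic data, not the notebook's method: the helper `two_sls`, the variable names, and all coefficients are hypothetical, and the standard errors are the classical homoskedastic ones rather than HC3.

```python
import numpy as np

def two_sls(y, x_endog, controls, z):
    """Textbook 2SLS with one endogenous regressor (homoskedastic SEs)."""
    n = len(y)
    const = np.ones((n, 1))
    W = np.column_stack([const, controls])   # exogenous regressors
    Z = np.column_stack([W, z])              # instrument set includes the controls
    X = np.column_stack([x_endog, W])        # structural regressors

    # First stage: project x_endog on the instrument AND the controls
    pi, *_ = np.linalg.lstsq(Z, x_endog, rcond=None)
    x_hat = Z @ pi
    X_hat = np.column_stack([x_hat, W])

    # Second-stage point estimates
    beta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)

    # Residuals must use the observed x_endog, not x_hat
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    cov = sigma2 * np.linalg.inv(X_hat.T @ X_hat)
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(1)
n = 20_000
c = rng.normal(size=n)                       # observed control
z = 0.5 * c + rng.normal(size=n)             # instrument, correlated with the control
u = rng.normal(size=n)                       # unobserved confounder
x = 0.7 * z + 0.4 * c + u + rng.normal(size=n)
y = 1.5 * x - 1.0 * c + 2.0 * u + rng.normal(size=n)

beta, se = two_sls(y, x, c.reshape(-1, 1), z.reshape(-1, 1))
print(f"2SLS effect of x: {beta[0]:.3f} (SE {se[0]:.3f}); true value 1.5")
```

The `IV2SLS` class from `linearmodels`, which the notebook already imports, implements this estimator and offers robust covariance options.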

## 2SLS Analysis: Top20_std

**Endogenous Variable:** `top20_std` (standard deviation of top 20 validator shares)

**Model:**
- First Stage: `top20_std ~ interest_rate_lag30`
- Second Stage: `volatility ~ top20_std_fitted + controls`

**Controls:** market_return, eth_return, eth_turnover

In [23]:
# =============================================================================
# MANUAL TWO-STAGE LEAST SQUARES - TOP20_STD (CLEAN FIRST STAGE)
# =============================================================================

# Prepare data
iv_vars_std_manual = ['prof_garman_klass_vol', 'top20_std', 'interest_rate_lag30',
                      'market_return', 'eth_return', 'eth_turnover']
iv_data_std_manual = df_merged[iv_vars_std_manual].dropna()

print(f"\nSample size: {len(iv_data_std_manual)}")

# ============================================================================
# FIRST STAGE: Regress endogenous variable on instrument only
# ============================================================================

print("\n" + "="*80)
print("FIRST STAGE REGRESSION (MANUAL - INSTRUMENT ONLY)")
print("Dependent Variable: top20_std")
print("Instrument: interest_rate_lag30")
print("="*80)

X_first_std = add_constant(iv_data_std_manual[['interest_rate_lag30']])
y_first_std = iv_data_std_manual['top20_std']
first_stage_std = OLS(y_first_std, X_first_std).fit(cov_type='HC3')

print(first_stage_std.summary())

# Get fitted values
top20_std_fitted = first_stage_std.fittedvalues

# Calculate first-stage F-statistic
print(f"\nFirst-Stage F-Statistic: {first_stage_std.fvalue:.2f}")
print(f"First-Stage F p-value: {first_stage_std.f_pvalue:.4f}")

# ============================================================================
# SECOND STAGE: Use fitted values with controls
# ============================================================================

print("\n" + "="*80)
print("SECOND STAGE REGRESSION (MANUAL - WITH CONTROLS)")
print("Dependent Variable: prof_garman_klass_vol")
print("Endogenous Variable (instrumented): top20_std")
print("Controls: market_return, eth_return, eth_turnover")
print("="*80)

X_second_std = add_constant(pd.concat([
    pd.Series(top20_std_fitted, name='top20_std', index=iv_data_std_manual.index),
    iv_data_std_manual[['market_return', 'eth_return', 'eth_turnover']]
], axis=1))
y_second_std = iv_data_std_manual['prof_garman_klass_vol']
second_stage_std = OLS(y_second_std, X_second_std).fit(cov_type='HC3')

print(second_stage_std.summary())

# ============================================================================
# COMPARE WITH OLS
# ============================================================================

X_ols_std_manual = add_constant(iv_data_std_manual[['top20_std', 'market_return',
                                                     'eth_return', 'eth_turnover']])
y_ols_std_manual = iv_data_std_manual['prof_garman_klass_vol']
ols_model_std_manual = OLS(y_ols_std_manual, X_ols_std_manual).fit(cov_type='HC3')

print("\n" + "="*80)
print("OLS vs MANUAL 2SLS COMPARISON")
print("="*80)

print("\nEffect of top20_std on volatility:")
print(f"{'Method':<15} {'Coefficient':>12} {'Std Error':>12} {'p-value':>10}")
print("-"*52)
print(f"{'OLS':<15} {ols_model_std_manual.params['top20_std']:>12.6f} {ols_model_std_manual.bse['top20_std']:>12.6f} {ols_model_std_manual.pvalues['top20_std']:>10.4f}")
print(f"{'Manual 2SLS':<15} {second_stage_std.params['top20_std']:>12.6f} {second_stage_std.bse['top20_std']:>12.6f} {second_stage_std.pvalues['top20_std']:>10.4f}")

coef_diff_std_manual = second_stage_std.params['top20_std'] - ols_model_std_manual.params['top20_std']
pct_diff_std_manual = (coef_diff_std_manual / ols_model_std_manual.params['top20_std']) * 100

print(f"\nDifference: {coef_diff_std_manual:.6f} ({pct_diff_std_manual:+.1f}%)")


Sample size: 1806

FIRST STAGE REGRESSION (MANUAL - INSTRUMENT ONLY)
Dependent Variable: top20_std
Instrument: interest_rate_lag30
                            OLS Regression Results                            
Dep. Variable:              top20_std   R-squared:                       0.309
Model:                            OLS   Adj. R-squared:                  0.308
Method:                 Least Squares   F-statistic:                     1112.
Date:                Tue, 16 Dec 2025   Prob (F-statistic):          2.06e-190
Time:                        17:00:59   Log-Likelihood:                -3919.7
No. Observations:                1806   AIC:                             7843.
Df Residuals:                    1804   BIC:                             7854.
Df Model:                           1                                         
Covariance Type:                  HC3                                         
                          coef    std err          z      P>|z|      [0.025   

## 2SLS Analysis: Coefficient of Variation

**Endogenous Variable:** `coeff_var` (`top20_std / top20_mean`, a measure of relative variability)

**Model:**
- First Stage: `coeff_var ~ interest_rate_lag30`
- Second Stage: `volatility ~ coeff_var_fitted + controls`

**Controls:** market_return, eth_return, eth_turnover
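
As a quick illustration of why the coefficient of variation is scale-free, the sketch below computes it for two made-up sets of validator shares. The numbers are hypothetical, and using the sample standard deviation is an assumption, since the source does not say how `top20_std` was computed. Scaling every share by the same factor changes the mean and the standard deviation but leaves their ratio unchanged.

```python
import statistics

def coeff_var(shares):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(shares) / statistics.mean(shares)

shares = [0.10, 0.05, 0.03, 0.02]    # hypothetical validator shares
scaled = [2 * s for s in shares]     # same shape of distribution, doubled level

cv_original = coeff_var(shares)
cv_scaled = coeff_var(scaled)

print(f"CV original: {cv_original:.4f}")
print(f"CV scaled:   {cv_scaled:.4f}")
```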

In [24]:
# =============================================================================
# MANUAL TWO-STAGE LEAST SQUARES - Coefficient of Variation
# =============================================================================

df_merged['coeff_var'] = df_merged['top20_std'] / df_merged['top20_mean']

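# Prepare data (the *_std_manual names from the top20_std cell are reused
# here, so that cell's results are overwritten)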
iv_vars_std_manual = ['prof_garman_klass_vol', 'coeff_var', 'interest_rate_lag30',
                      'market_return', 'eth_return', 'eth_turnover']
iv_data_std_manual = df_merged[iv_vars_std_manual].dropna()

print(f"\nSample size: {len(iv_data_std_manual)}")

# ============================================================================
# FIRST STAGE: Regress endogenous variable on instrument only
# ============================================================================

print("\n" + "="*80)
print("FIRST STAGE REGRESSION (MANUAL - INSTRUMENT ONLY)")
print("Dependent Variable: coeff_var")
print("Instrument: interest_rate_lag30")
print("="*80)

X_first_std = add_constant(iv_data_std_manual[['interest_rate_lag30']])
y_first_std = iv_data_std_manual['coeff_var']
first_stage_std = OLS(y_first_std, X_first_std).fit(cov_type='HC3')

print(first_stage_std.summary())

# Get fitted values
coeff_var_fitted = first_stage_std.fittedvalues

# Calculate first-stage F-statistic
print(f"\nFirst-Stage F-Statistic: {first_stage_std.fvalue:.2f}")
print(f"First-Stage F p-value: {first_stage_std.f_pvalue:.4f}")

# ============================================================================
# SECOND STAGE: Use fitted values with controls
# ============================================================================

print("\n" + "="*80)
print("SECOND STAGE REGRESSION (MANUAL - WITH CONTROLS)")
print("Dependent Variable: prof_garman_klass_vol")
print("Endogenous Variable (instrumented): coeff_var")
print("Controls: market_return, eth_return, eth_turnover")
print("="*80)

X_second_std = add_constant(pd.concat([
    pd.Series(coeff_var_fitted, name='coeff_var', index=iv_data_std_manual.index),
    iv_data_std_manual[['market_return', 'eth_return', 'eth_turnover']]
], axis=1))
y_second_std = iv_data_std_manual['prof_garman_klass_vol']
second_stage_std = OLS(y_second_std, X_second_std).fit(cov_type='HC3')

print(second_stage_std.summary())

# ============================================================================
# COMPARE WITH OLS
# ============================================================================

X_ols_std_manual = add_constant(iv_data_std_manual[['coeff_var', 'market_return',
                                                     'eth_return', 'eth_turnover']])
y_ols_std_manual = iv_data_std_manual['prof_garman_klass_vol']
ols_model_std_manual = OLS(y_ols_std_manual, X_ols_std_manual).fit(cov_type='HC3')

print("\n" + "="*80)
print("OLS vs MANUAL 2SLS COMPARISON")
print("="*80)

print("\nEffect of coeff_var on volatility:")
print(f"{'Method':<15} {'Coefficient':>12} {'Std Error':>12} {'p-value':>10}")
print("-"*52)
print(f"{'OLS':<15} {ols_model_std_manual.params['coeff_var']:>12.6f} {ols_model_std_manual.bse['coeff_var']:>12.6f} {ols_model_std_manual.pvalues['coeff_var']:>10.4f}")
print(f"{'Manual 2SLS':<15} {second_stage_std.params['coeff_var']:>12.6f} {second_stage_std.bse['coeff_var']:>12.6f} {second_stage_std.pvalues['coeff_var']:>10.4f}")

coef_diff_std_manual = second_stage_std.params['coeff_var'] - ols_model_std_manual.params['coeff_var']
pct_diff_std_manual = (coef_diff_std_manual / ols_model_std_manual.params['coeff_var']) * 100

print(f"\nDifference: {coef_diff_std_manual:.6f} ({pct_diff_std_manual:+.1f}%)")


Sample size: 1806

FIRST STAGE REGRESSION (MANUAL - INSTRUMENT ONLY)
Dependent Variable: coeff_var
Instrument: interest_rate_lag30
                            OLS Regression Results                            
Dep. Variable:              coeff_var   R-squared:                       0.377
Model:                            OLS   Adj. R-squared:                  0.376
Method:                 Least Squares   F-statistic:                     1476.
Date:                Tue, 16 Dec 2025   Prob (F-statistic):          1.94e-236
Time:                        17:00:59   Log-Likelihood:                -917.55
No. Observations:                1806   AIC:                             1839.
Df Residuals:                    1804   BIC:                             1850.
Df Model:                           1                                         
Covariance Type:                  HC3                                         
                          coef    std err          z      P>|z|      [0.025   