# Econometrics — Overview

## Purpose
Econometrics is the application of statistical and mathematical methods to economic and financial data. It bridges economic theory with real-world observations, enabling us to:

- **Test hypotheses** — validate or reject theoretical predictions with data
- **Estimate relationships** — quantify how variables affect each other
- **Forecast future values** — predict asset prices, economic indicators, risk metrics
- **Identify causal effects** — distinguish correlation from causation

## Key Questions This Section Answers
1. How do we estimate the relationship between financial variables? (OLS Regression)
2. How do we handle time-dependent data? (Time Series Econometrics)
3. How do we analyze data with both cross-sectional and time dimensions? (Panel Data)
4. How do we address endogeneity problems? (Instrumental Variables)
5. How do we model changing volatility? (GARCH Models)
6. How do we model multiple interrelated time series? (VAR Models)

---

## 1. Ordinary Least Squares (OLS) Regression

OLS is the foundational econometric technique for estimating linear relationships.

### The Linear Regression Model
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + ... + \beta_k X_{ki} + \varepsilon_i$$

Or in matrix form:
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

### OLS Estimator
The OLS estimator minimizes the sum of squared residuals:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$
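
As a minimal sketch of the closed-form estimator above, the snippet below computes $\hat{\boldsymbol{\beta}}$ with NumPy on simulated data. The sample size and the values in `beta_true` are illustrative assumptions, not taken from the text. It solves the normal equations with `np.linalg.solve` instead of forming the inverse explicitly, which is usually the more numerically stable choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 regressors
beta_true = np.array([1.0, 2.0, -0.5])                       # illustrative values
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Normal equations: (X'X) beta_hat = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

Both routes should agree to floating-point precision; in practice a library such as statsmodels also supplies standard errors and diagnostics.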

### Gauss-Markov Assumptions (BLUE)
For OLS to be the **Best Linear Unbiased Estimator (BLUE)**, these assumptions must hold:

1. **Linearity**: The relationship between $Y$ and $X$ is linear in parameters
2. **Random Sampling**: Observations are independently drawn
3. **No Perfect Multicollinearity**: No exact linear relationships among regressors
4. **Zero Conditional Mean**: $E[\varepsilon | X] = 0$ (exogeneity)
5. **Homoskedasticity**: $Var(\varepsilon | X) = \sigma^2$ (constant variance)
6. **No Autocorrelation**: $Cov(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan, acorr_breusch_godfrey
from statsmodels.stats.stattools import durbin_watson
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
print("Libraries loaded successfully!")

In [None]:
# OLS Example: Simulating a Factor Model (CAPM-style)
# Y (excess return) = alpha + beta * market_return + epsilon

n = 500  # Number of observations
market_return = np.random.normal(0.08, 0.18, n)  # Market excess return
true_alpha = 0.02  # True alpha (manager skill)
true_beta = 1.2    # True beta (market sensitivity)
epsilon = np.random.normal(0, 0.10, n)  # Idiosyncratic risk

# Generate stock excess returns
stock_return = true_alpha + true_beta * market_return + epsilon

# Create DataFrame
df = pd.DataFrame({
    'stock_return': stock_return,
    'market_return': market_return
})

# Fit OLS model
X = sm.add_constant(df['market_return'])
model = sm.OLS(df['stock_return'], X).fit()

print("=" * 60)
print("OLS REGRESSION: CAPM Factor Model")
print("=" * 60)
print(f"\nTrue Alpha: {true_alpha:.4f}, Estimated Alpha: {model.params['const']:.4f}")
print(f"True Beta:  {true_beta:.4f}, Estimated Beta:  {model.params['market_return']:.4f}")
print(f"\nR-squared: {model.rsquared:.4f}")
print(f"Adjusted R-squared: {model.rsquared_adj:.4f}")
print("\n" + "=" * 60)
print(model.summary())

In [None]:
# Visualize OLS Regression Line
fig = go.Figure()

# Scatter plot of actual data
fig.add_trace(go.Scatter(
    x=df['market_return'],
    y=df['stock_return'],
    mode='markers',
    marker=dict(color='steelblue', size=6, opacity=0.6),
    name='Observations'
))

# Regression line
x_line = np.linspace(df['market_return'].min(), df['market_return'].max(), 100)
y_line = model.params['const'] + model.params['market_return'] * x_line

fig.add_trace(go.Scatter(
    x=x_line,
    y=y_line,
    mode='lines',
    line=dict(color='red', width=3),
    name=f'OLS Fit: α={model.params["const"]:.3f}, β={model.params["market_return"]:.3f}'
))

fig.update_layout(
    title='CAPM Regression: Stock Returns vs Market Returns',
    xaxis_title='Market Excess Return',
    yaxis_title='Stock Excess Return',
    template='plotly_white',
    legend=dict(x=0.02, y=0.98),
    hovermode='closest'
)
fig

## 2. OLS Diagnostics

### Key Diagnostic Tests

| Issue | Test | Null Hypothesis | Solution |
|-------|------|-----------------|----------|
| Heteroskedasticity | Breusch-Pagan, White | Constant variance | Robust SEs, WLS |
| Autocorrelation | Durbin-Watson, Breusch-Godfrey | No serial correlation | HAC SEs, GLS |
| Non-Normality | Jarque-Bera | Residuals are normal | Large samples (CLT) |
| Multicollinearity | VIF (rule of thumb, not a formal test) | n/a | Remove/combine vars |
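
The VIF entry in the table can be made concrete with a short sketch. For regressor $j$, $VIF_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that column on all the others. The helper `vif` and the simulated columns are hypothetical names introduced only for this illustration; a commonly quoted rule of thumb flags values above 10.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (X has no intercept column)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y_j = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y_j, rcond=None)
        resid = y_j - others @ coef
        r2 = 1.0 - resid.var() / y_j.var()   # R^2 of column j on the other columns
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)         # nearly collinear with x1
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

Here `x1` and `x3` should show large VIFs while `x2` stays close to 1.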

### Heteroskedasticity
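When $Var(\varepsilon_i | X_i) \neq \sigma^2$, the OLS coefficient estimates remain unbiased, but the conventional standard errors are biased, so t-tests and confidence intervals are unreliable.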

### Autocorrelation
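Autocorrelation arises when $Cov(\varepsilon_t, \varepsilon_{t-k}) \neq 0$ for some $k \neq 0$. It is common in time series data and, like heteroskedasticity, invalidates the conventional standard errors.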
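
The "Robust SEs" remedy in the table above rests on the White sandwich formula, $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,\mathrm{diag}(\hat{\varepsilon}_i^2)\,\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}$. The following is a rough NumPy sketch of that formula with the HC1 small-sample scaling; the data-generating process (error variance proportional to $x^2$) is an assumption chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)
eps = np.abs(x) * rng.normal(size=n)      # error variance grows with x^2
y = 1.0 + 0.5 * x + eps

X = np.column_stack([np.ones(n), x])
k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Classical covariance: s^2 (X'X)^{-1}, valid only under homoskedasticity
s2 = resid @ resid / (n - k)
se_classical = np.sqrt(np.diag(s2 * XtX_inv))

# White sandwich: (X'X)^{-1} X' diag(e_i^2) X (X'X)^{-1}, with HC1 scaling n/(n-k)
meat = (X * resid[:, None] ** 2).T @ X
cov_hc1 = n / (n - k) * XtX_inv @ meat @ XtX_inv
se_hc1 = np.sqrt(np.diag(cov_hc1))

print("classical:", se_classical, "HC1:", se_hc1)
```

The point estimate is the same under both approaches; only the standard errors change. In statsmodels the equivalent is requested with `fit(cov_type='HC1')`.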

In [None]:
# OLS Diagnostics
residuals = model.resid
fitted = model.fittedvalues

# 1. Breusch-Pagan Test for Heteroskedasticity
bp_test = het_breuschpagan(residuals, X)
print("=" * 60)
print("DIAGNOSTIC TESTS")
print("=" * 60)
print(f"\n1. Breusch-Pagan Test (Heteroskedasticity):")
print(f"   LM Statistic: {bp_test[0]:.4f}")
print(f"   p-value: {bp_test[1]:.4f}")
print(f"   Result: {'Reject H0 (Heteroskedasticity detected)' if bp_test[1] < 0.05 else 'Fail to reject H0 (Homoskedasticity)'}")

# 2. Durbin-Watson Test for Autocorrelation
dw_stat = durbin_watson(residuals)
print(f"\n2. Durbin-Watson Test (Autocorrelation):")
print(f"   DW Statistic: {dw_stat:.4f}")
print(f"   Result: {'No autocorrelation' if 1.5 < dw_stat < 2.5 else 'Possible autocorrelation'}")

# 3. Jarque-Bera Test for Normality
jb_stat, jb_pval = stats.jarque_bera(residuals)
print(f"\n3. Jarque-Bera Test (Normality):")
print(f"   JB Statistic: {jb_stat:.4f}")
print(f"   p-value: {jb_pval:.4f}")
print(f"   Result: {'Reject H0 (Non-normal)' if jb_pval < 0.05 else 'Fail to reject H0 (Normal)'}")

In [None]:
# Diagnostic Plots
fig = make_subplots(rows=2, cols=2,
                    subplot_titles=('Residuals vs Fitted', 'Q-Q Plot',
                                   'Scale-Location', 'Residual Distribution'))

# 1. Residuals vs Fitted
fig.add_trace(
    go.Scatter(x=fitted, y=residuals, mode='markers',
               marker=dict(color='steelblue', opacity=0.6),
               showlegend=False),
    row=1, col=1
)
fig.add_hline(y=0, line_dash='dash', line_color='red', row=1, col=1)

# 2. Q-Q Plot
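# Plotting positions (i - 0.5) / n give one theoretical quantile per ordered residual
theoretical_q = stats.norm.ppf((np.arange(1, len(residuals) + 1) - 0.5) / len(residuals))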
sample_q = np.sort(residuals)
fig.add_trace(
    go.Scatter(x=theoretical_q, y=sample_q, mode='markers',
               marker=dict(color='steelblue', opacity=0.6),
               showlegend=False),
    row=1, col=2
)
# Reference line through the origin with slope equal to the residual standard deviation
fig.add_trace(
    go.Scatter(x=[-3, 3], y=[-3 * residuals.std(), 3 * residuals.std()],
               mode='lines', line=dict(color='red', dash='dash'),
               showlegend=False),
    row=1, col=2
)

# 3. Scale-Location (sqrt of standardized residuals)
std_resid = np.sqrt(np.abs(residuals / residuals.std()))
fig.add_trace(
    go.Scatter(x=fitted, y=std_resid, mode='markers',
               marker=dict(color='steelblue', opacity=0.6),
               showlegend=False),
    row=2, col=1
)

# 4. Histogram of Residuals
fig.add_trace(
    go.Histogram(x=residuals, nbinsx=30,
                 marker_color='steelblue', opacity=0.7,
                 showlegend=False),
    row=2, col=2
)

fig.update_layout(height=600, title_text='OLS Diagnostic Plots', template='plotly_white')
fig.update_xaxes(title_text='Fitted Values', row=1, col=1)
fig.update_yaxes(title_text='Residuals', row=1, col=1)
fig.update_xaxes(title_text='Theoretical Quantiles', row=1, col=2)
fig.update_yaxes(title_text='Sample Quantiles', row=1, col=2)
fig.update_xaxes(title_text='Fitted Values', row=2, col=1)
fig.update_yaxes(title_text='√|Standardized Residuals|', row=2, col=1)
fig.update_xaxes(title_text='Residuals', row=2, col=2)
fig.update_yaxes(title_text='Frequency', row=2, col=2)
fig

## 3. Time Series Econometrics

### Stationarity
A time series is **(weakly) stationary** if its first two moments do not change over time:

1. **Constant mean**: $E[Y_t] = \mu$ for all $t$
2. **Constant variance**: $Var(Y_t) = \sigma^2$ for all $t$
3. **Autocovariance depends only on lag**: $Cov(Y_t, Y_{t-k}) = \gamma_k$

### Why Stationarity Matters
- Non-stationary series can lead to **spurious regressions** (high $R^2$ without real relationship)
- Standard inference (t-tests, F-tests) may be invalid
- Many econometric models require stationary data

### Unit Root
A series $Y_t$ has a **unit root** if:
$$Y_t = \rho Y_{t-1} + \varepsilon_t \quad \text{with } \rho = 1$$

This is a **random walk** — the series is non-stationary.

### Augmented Dickey-Fuller (ADF) Test
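The ADF regression tests for a unit root. The constant $\alpha$ and trend $\beta t$ are optional deterministic terms; `adfuller` in statsmodels includes only the constant by default (`regression='c'`):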
$$\Delta Y_t = \alpha + \beta t + \gamma Y_{t-1} + \sum_{i=1}^{p} \delta_i \Delta Y_{t-i} + \varepsilon_t$$

- **H₀**: $\gamma = 0$ (unit root exists, non-stationary)
- **H₁**: $\gamma < 0$ (no unit root, stationary)
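
To make the test regression concrete, here is a simplified sketch of the plain Dickey-Fuller case: constant only, no trend, and no lagged differences ($p = 0$). The helper `df_tstat` is a hypothetical name for this illustration. The resulting t-statistic does not follow a standard t distribution under H₀; it must be compared with Dickey-Fuller critical values (roughly -2.86 at the 5% level for the constant-only case in large samples).

```python
import numpy as np

def df_tstat(y):
    """t-statistic on gamma in: dy_t = alpha + gamma * y_{t-1} + e_t."""
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    coef, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ coef
    s2 = resid @ resid / (len(dy) - X.shape[1])
    se_gamma = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return coef[1] / se_gamma

rng = np.random.default_rng(3)
n = 500
shocks = rng.normal(size=n)

# Stationary AR(1) with coefficient 0.7
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = 0.7 * ar1[t - 1] + shocks[t]

# Random walk (unit root)
random_walk = np.cumsum(rng.normal(size=n))

t_ar1 = df_tstat(ar1)
t_rw = df_tstat(random_walk)
print(t_ar1, t_rw)
```

The AR(1) series should produce a strongly negative statistic, well below the critical value, while the random walk typically does not.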

In [None]:
from statsmodels.tsa.stattools import adfuller, kpss, coint
from statsmodels.tsa.arima.model import ARIMA

# Generate stationary vs non-stationary series
np.random.seed(42)
n = 500

# Stationary: AR(1) with |phi| < 1
phi = 0.7
stationary_series = np.zeros(n)
for t in range(1, n):
    stationary_series[t] = phi * stationary_series[t-1] + np.random.normal(0, 1)

# Non-stationary: Random Walk (unit root)
random_walk = np.cumsum(np.random.normal(0, 1, n))

# Visualize both series
fig = make_subplots(rows=1, cols=2,
                    subplot_titles=('Stationary Series (AR(1), φ=0.7)',
                                   'Non-Stationary Series (Random Walk)'))

fig.add_trace(go.Scatter(y=stationary_series, mode='lines',
                         line=dict(color='green', width=1), showlegend=False),
              row=1, col=1)
fig.add_trace(go.Scatter(y=random_walk, mode='lines',
                         line=dict(color='red', width=1), showlegend=False),
              row=1, col=2)

fig.update_layout(height=400, title_text='Stationary vs Non-Stationary Time Series',
                  template='plotly_white')
fig.update_xaxes(title_text='Time', row=1, col=1)
fig.update_xaxes(title_text='Time', row=1, col=2)
fig.update_yaxes(title_text='Value', row=1, col=1)
fig.update_yaxes(title_text='Value', row=1, col=2)
fig

In [None]:
# Augmented Dickey-Fuller Test
def run_adf_test(series, name):
    result = adfuller(series, autolag='AIC')
    print(f"\n{name}:")
    print(f"  ADF Statistic: {result[0]:.4f}")
    print(f"  p-value: {result[1]:.4f}")
    print(f"  Lags Used: {result[2]}")
    print(f"  Critical Values:")
    for key, value in result[4].items():
        print(f"    {key}: {value:.4f}")
    print(f"  Result: {'Stationary (Reject H0)' if result[1] < 0.05 else 'Non-Stationary (Fail to reject H0)'}")
    return result

print("=" * 60)
print("AUGMENTED DICKEY-FULLER TEST FOR UNIT ROOT")
print("H0: Unit root exists (non-stationary)")
print("H1: No unit root (stationary)")
print("=" * 60)

adf_stationary = run_adf_test(stationary_series, "Stationary AR(1) Series")
adf_rw = run_adf_test(random_walk, "Random Walk (Non-Stationary)")
adf_diff = run_adf_test(np.diff(random_walk), "First Difference of Random Walk")

### Cointegration

Two non-stationary series $X_t$ and $Y_t$ are **cointegrated** if there exists a linear combination that is stationary:

$$Y_t - \beta X_t = \varepsilon_t \quad \text{where } \varepsilon_t \sim I(0)$$

**Interpretation**: The series move together in the long run despite short-term deviations.

**Finance Applications**:
- Pairs trading strategies
- Long-run relationships between asset prices
- Error correction models
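
The definition above suggests a two-step check in the spirit of Engle-Granger: estimate $\beta$ by OLS, then inspect whether the residual spread mean-reverts. The sketch below uses simulated series with assumed parameters (a shared random-walk trend and a true coefficient of 0.6), and it compares AR(1) coefficients informally instead of running a formal residual-based test. The helper `ar1_coef` is hypothetical.

```python
import numpy as np

def ar1_coef(z):
    """Slope from regressing z_t on z_{t-1} after demeaning."""
    z = z - z.mean()
    return (z[1:] @ z[:-1]) / (z[:-1] @ z[:-1])

rng = np.random.default_rng(4)
n = 2000
trend = np.cumsum(rng.normal(size=n))            # shared stochastic trend
x = trend + rng.normal(scale=0.5, size=n)
y = 2.0 + 0.6 * trend + rng.normal(scale=0.5, size=n)

# Step 1: estimate the cointegrating regression y = a + b*x + u by OLS
X = np.column_stack([np.ones(n), x])
(a_hat, b_hat), *_ = np.linalg.lstsq(X, y, rcond=None)
spread = y - a_hat - b_hat * x

# Step 2 (informal): a mean-reverting spread has an AR(1) coefficient well below 1
rho_x = ar1_coef(x)
rho_spread = ar1_coef(spread)
print(b_hat, rho_x, rho_spread)
```

In a pairs-trading setting, `b_hat` plays the role of the hedge ratio and `spread` is the quantity that would be monitored for deviations from its mean.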

In [None]:
# Cointegration Example: Two cointegrated stock prices
np.random.seed(42)
n = 500

# Common stochastic trend
common_trend = np.cumsum(np.random.normal(0, 1, n))

# Two cointegrated series
stock_A = 50 + common_trend + np.random.normal(0, 2, n)  # Stock A
stock_B = 30 + 0.6 * common_trend + np.random.normal(0, 1.5, n)  # Stock B (cointegrated with A)

# Non-cointegrated series (independent random walk)
stock_C = 40 + np.cumsum(np.random.normal(0, 1, n))  # Independent

# Test for cointegration using Engle-Granger method
print("=" * 60)
print("COINTEGRATION TEST (Engle-Granger Method)")
print("H0: No cointegration")
print("H1: Cointegration exists")
print("=" * 60)

# Test Stock A and Stock B (should be cointegrated)
coint_result_AB = coint(stock_A, stock_B)
print(f"\nStock A & Stock B (Cointegrated):")
print(f"  Test Statistic: {coint_result_AB[0]:.4f}")
print(f"  p-value: {coint_result_AB[1]:.4f}")
print(f"  Result: {'Cointegrated' if coint_result_AB[1] < 0.05 else 'Not Cointegrated'}")

# Test Stock A and Stock C (should NOT be cointegrated)
coint_result_AC = coint(stock_A, stock_C)
print(f"\nStock A & Stock C (Not Cointegrated):")
print(f"  Test Statistic: {coint_result_AC[0]:.4f}")
print(f"  p-value: {coint_result_AC[1]:.4f}")
print(f"  Result: {'Cointegrated' if coint_result_AC[1] < 0.05 else 'Not Cointegrated'}")

In [None]:
# Visualize Cointegrated vs Non-Cointegrated Series
fig = make_subplots(rows=2, cols=2,
                    subplot_titles=('Cointegrated Stocks (A & B)',
                                   'Spread: Stock A - 1.67*Stock B',
                                   'Non-Cointegrated Stocks (A & C)',
                                   'Spread: Stock A - Stock C'))

# Cointegrated series
fig.add_trace(go.Scatter(y=stock_A, name='Stock A', line=dict(color='blue')), row=1, col=1)
fig.add_trace(go.Scatter(y=stock_B, name='Stock B', line=dict(color='green')), row=1, col=1)

# Spread of cointegrated series (stationary)
spread_AB = stock_A - 1.67 * stock_B
fig.add_trace(go.Scatter(y=spread_AB, name='Spread A-B',
                         line=dict(color='purple'), showlegend=False), row=1, col=2)
fig.add_hline(y=spread_AB.mean(), line_dash='dash', line_color='red', row=1, col=2)

# Non-cointegrated series
fig.add_trace(go.Scatter(y=stock_A, name='Stock A', line=dict(color='blue'),
                         showlegend=False), row=2, col=1)
fig.add_trace(go.Scatter(y=stock_C, name='Stock C', line=dict(color='orange'),
                         showlegend=False), row=2, col=1)

# Spread of non-cointegrated series (non-stationary)
spread_AC = stock_A - stock_C
fig.add_trace(go.Scatter(y=spread_AC, name='Spread A-C',
                         line=dict(color='red'), showlegend=False), row=2, col=2)

fig.update_layout(height=600, title_text='Cointegration: Mean-Reverting Spread',
                  template='plotly_white')
fig

## 4. Panel Data Models

Panel data has **two dimensions**: cross-sectional (individuals) and time series.

### General Panel Data Model
$$Y_{it} = \alpha + \mathbf{X}_{it}'\boldsymbol{\beta} + u_{it}$$

where $i = 1, ..., N$ (individuals) and $t = 1, ..., T$ (time periods).

### Fixed Effects (FE) Model
Controls for unobserved, time-invariant heterogeneity:
$$Y_{it} = \alpha_i + \mathbf{X}_{it}'\boldsymbol{\beta} + \varepsilon_{it}$$

- Each individual has its own intercept $\alpha_i$
- Uses **within-group variation** (demeaning)
- Allows $\alpha_i$ to be correlated with $\mathbf{X}_{it}$

### Random Effects (RE) Model
Treats individual effects as random:
$$Y_{it} = \alpha + \mathbf{X}_{it}'\boldsymbol{\beta} + u_i + \varepsilon_{it}$$

- $u_i$ is a random individual effect
- Uses **GLS estimation** (both within and between variation)
- Assumes $u_i$ is uncorrelated with $\mathbf{X}_{it}$

### Hausman Test
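Compares the FE and RE estimates to choose between them:
- **H₀**: individual effects are uncorrelated with the regressors, so both FE and RE are consistent and RE is efficient
- **H₁**: individual effects are correlated with the regressors, so only FE is consistent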
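
The within-group (demeaning) transformation behind the FE model can be sketched directly in NumPy. The simulation below is an assumption-laden illustration: the individual effect `alpha_i` is deliberately correlated with the regressor, which is exactly the case in which pooled OLS is biased and FE is not. The helper `group_demean` and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
N, T = 200, 10
beta_true = 2.0                                   # illustrative value

alpha_i = rng.normal(size=N)                      # unobserved firm effect
firm = np.repeat(np.arange(N), T)
x = alpha_i[firm] + rng.normal(size=N * T)        # regressor correlated with alpha_i
y = alpha_i[firm] + beta_true * x + rng.normal(size=N * T)

def group_demean(v, groups, n_groups):
    """Subtract each group's mean from its observations."""
    sums = np.bincount(groups, weights=v, minlength=n_groups)
    counts = np.bincount(groups, minlength=n_groups)
    return v - (sums / counts)[groups]

# Pooled OLS slope ignores alpha_i and picks up its correlation with x
xc, yc = x - x.mean(), y - y.mean()
beta_pooled = (xc @ yc) / (xc @ xc)

# Within (fixed effects) slope uses only deviations from firm means
xw = group_demean(x, firm, N)
yw = group_demean(y, firm, N)
beta_fe = (xw @ yw) / (xw @ xw)
print(beta_pooled, beta_fe)
```

Demeaning removes `alpha_i` from both sides of the equation, so the within slope should land near `beta_true` while the pooled slope is pushed upward.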

In [None]:
from linearmodels.panel import PanelOLS, RandomEffects, compare

# Simulate Panel Data: Returns for multiple firms over time
np.random.seed(42)
n_firms = 50
n_years = 10

# Create panel structure
firms = np.repeat(np.arange(n_firms), n_years)
years = np.tile(np.arange(n_years), n_firms)

# Firm-specific fixed effects (unobserved heterogeneity)
firm_effects = np.random.normal(0, 0.02, n_firms)
firm_effects_panel = np.repeat(firm_effects, n_years)

# Generate independent variables
market_beta = 0.8 + 0.4 * np.random.rand(n_firms * n_years)  # Market beta
size = np.random.normal(0, 1, n_firms * n_years)  # Size factor
value = np.random.normal(0, 1, n_firms * n_years)  # Value factor

# Generate returns
true_beta_mkt = 0.06
true_beta_size = 0.02
true_beta_value = 0.03

returns = (0.05 + firm_effects_panel +
           true_beta_mkt * market_beta +
           true_beta_size * size +
           true_beta_value * value +
           np.random.normal(0, 0.05, n_firms * n_years))

# Create Panel DataFrame
panel_df = pd.DataFrame({
    'firm': firms,
    'year': years,
    'returns': returns,
    'market_beta': market_beta,
    'size': size,
    'value': value
})
panel_df = panel_df.set_index(['firm', 'year'])

print("Panel Data Structure:")
print(f"  Number of firms (N): {n_firms}")
print(f"  Number of years (T): {n_years}")
print(f"  Total observations: {len(panel_df)}")
print("\nFirst few rows:")
print(panel_df.head(10))

In [None]:
# Fixed Effects Model
fe_model = PanelOLS(panel_df['returns'],
                    panel_df[['market_beta', 'size', 'value']],
                    entity_effects=True)
fe_results = fe_model.fit()

# Random Effects Model
re_model = RandomEffects(panel_df['returns'],
                         sm.add_constant(panel_df[['market_beta', 'size', 'value']]))
re_results = re_model.fit()

print("=" * 60)
print("PANEL DATA MODELS COMPARISON")
print("=" * 60)
print("\nTrue Coefficients:")
print(f"  Market Beta: {true_beta_mkt:.4f}")
print(f"  Size: {true_beta_size:.4f}")
print(f"  Value: {true_beta_value:.4f}")

print("\n" + "=" * 60)
print("FIXED EFFECTS MODEL")
print("=" * 60)
print(fe_results.summary)

In [None]:
print("=" * 60)
print("RANDOM EFFECTS MODEL")
print("=" * 60)
print(re_results.summary)

In [None]:
# Hausman Test: FE vs RE
from scipy.stats import chi2

# Extract coefficients (excluding constant for RE)
b_fe = fe_results.params
b_re = re_results.params[['market_beta', 'size', 'value']]

# Covariance matrices
var_fe = fe_results.cov
var_re = re_results.cov.loc[['market_beta', 'size', 'value'], ['market_beta', 'size', 'value']]

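# Hausman statistic: (b_FE - b_RE)' [Var(b_FE) - Var(b_RE)]^{-1} (b_FE - b_RE)
# Note: in finite samples var_diff need not be positive definite, so the
# statistic can come out negative; treat such a result as uninformative.
diff = (b_fe - b_re).values
var_diff = (var_fe - var_re).values
hausman_stat = float(diff @ np.linalg.inv(var_diff) @ diff)
hausman_pval = chi2.sf(hausman_stat, df=len(b_fe))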

print("=" * 60)
print("HAUSMAN TEST: Fixed Effects vs Random Effects")
print("=" * 60)
print(f"\nH0: Random Effects is consistent (use RE)")
print(f"H1: Fixed Effects is needed (use FE)")
print(f"\nHausman Statistic: {hausman_stat:.4f}")
print(f"p-value: {hausman_pval:.4f}")
print(f"Degrees of freedom: {len(b_fe)}")
print(f"\nConclusion: {'Use Fixed Effects (Reject H0)' if hausman_pval < 0.05 else 'Use Random Effects (Fail to reject H0)'}")

## 5. Instrumental Variables and 2SLS

### The Endogeneity Problem
Endogeneity occurs when $Cov(X, \varepsilon) \neq 0$, causing OLS estimates to be **biased** and **inconsistent**.

**Sources of Endogeneity**:
1. **Omitted Variable Bias**: A relevant variable that affects $Y$ and is correlated with $X$ is left out of the model
2. **Measurement Error**: Noise in measuring $X$
3. **Simultaneity**: $X$ affects $Y$ and $Y$ affects $X$

### Instrumental Variables (IV)
An instrument $Z$ must satisfy:
1. **Relevance**: $Cov(Z, X) \neq 0$ — instrument is correlated with endogenous variable
2. **Exogeneity**: $Cov(Z, \varepsilon) = 0$ — instrument is uncorrelated with error

### Two-Stage Least Squares (2SLS)

**Stage 1**: Regress endogenous $X$ on instrument $Z$:
$$X = \pi_0 + \pi_1 Z + v$$
and obtain the fitted values $\hat{X}$.

**Stage 2**: Regress $Y$ on $\hat{X}$:
$$Y = \beta_0 + \beta_1 \hat{X} + \varepsilon$$

With one endogenous regressor and one instrument, the 2SLS estimator reduces to the ratio:
$$\hat{\beta}_{IV} = \frac{Cov(Z, Y)}{Cov(Z, X)}$$
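
The two stages and the ratio formula can be checked against each other with a small NumPy sketch. The data-generating process below is an assumption for illustration (a confounder drives both `x` and `y`, and the instrument `z` affects `y` only through `x`); the helper `ols` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5000
beta_true = 1.5                                   # illustrative value

confounder = rng.normal(size=n)                   # unobserved, drives both x and y
z = rng.normal(size=n)                            # instrument
x = 2.0 + 0.5 * z + 0.7 * confounder + rng.normal(scale=0.5, size=n)
y = 1.0 + beta_true * x + 0.8 * confounder + rng.normal(size=n)

def ols(X, y):
    """OLS coefficients via least squares."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

ones = np.ones(n)

# Naive OLS slope, contaminated by the confounder
beta_ols = ols(np.column_stack([ones, x]), y)[1]

# Stage 1: project x on the instrument (with intercept)
Z = np.column_stack([ones, z])
x_hat = Z @ ols(Z, x)

# Stage 2: regress y on the fitted values
beta_2sls = ols(np.column_stack([ones, x_hat]), y)[1]

# Ratio form: Cov(Z, Y) / Cov(Z, X)
beta_ratio = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
print(beta_ols, beta_2sls, beta_ratio)
```

The manual two-stage route gives the right point estimate, but the standard errors from the second-stage regression are not valid because they are computed from residuals based on $\hat{X}$ and not $X$. A dedicated 2SLS routine handles this correctly.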

In [None]:
from linearmodels.iv import IV2SLS

# Simulate Endogeneity Problem
np.random.seed(42)
n = 1000

# Unobserved confounder (affects both X and Y)
confounder = np.random.normal(0, 1, n)

# Instrument (affects X but NOT Y directly)
instrument = np.random.normal(0, 1, n)

# Endogenous variable X (affected by instrument and confounder)
true_pi = 0.5  # Effect of instrument on X
X = 2 + true_pi * instrument + 0.7 * confounder + np.random.normal(0, 0.5, n)

# Outcome Y (affected by X and confounder, but NOT directly by instrument)
true_beta = 1.5  # True causal effect of X on Y
Y = 1 + true_beta * X + 0.8 * confounder + np.random.normal(0, 1, n)

# Create DataFrame
iv_df = pd.DataFrame({'Y': Y, 'X': X, 'Z': instrument})

print("=" * 60)
print("INSTRUMENTAL VARIABLES EXAMPLE")
print("=" * 60)
print(f"\nTrue causal effect (β): {true_beta}")
print("\nProblem: X is endogenous (correlated with unobserved confounder)")
print("Solution: Use instrument Z that affects Y only through X")

In [None]:
# Compare OLS vs 2SLS

# OLS (biased due to endogeneity)
ols_model = sm.OLS(iv_df['Y'], sm.add_constant(iv_df['X'])).fit()
print("\n" + "=" * 60)
print("OLS ESTIMATION (BIASED)")
print("=" * 60)
print(f"OLS estimate of β: {ols_model.params['X']:.4f}")
print(f"True β: {true_beta}")
print(f"Bias: {ols_model.params['X'] - true_beta:.4f}")

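# 2SLS (consistent); the intercept enters as an exogenous regressor,
# matching the constant in the data-generating process and in the OLS fit
iv_exog = pd.DataFrame({'const': np.ones(len(iv_df))}, index=iv_df.index)
iv_model = IV2SLS(iv_df['Y'], exog=iv_exog, endog=iv_df['X'], instruments=iv_df['Z']).fit()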
print("\n" + "=" * 60)
print("2SLS ESTIMATION (CONSISTENT)")
print("=" * 60)
print(f"2SLS estimate of β: {iv_model.params['X']:.4f}")
print(f"True β: {true_beta}")
print(f"Difference: {iv_model.params['X'] - true_beta:.4f}")

print("\n" + "=" * 60)
print("FIRST STAGE REGRESSION")
print("=" * 60)
first_stage = sm.OLS(iv_df['X'], sm.add_constant(iv_df['Z'])).fit()
print(f"Instrument coefficient: {first_stage.params['Z']:.4f}")
print(f"F-statistic: {first_stage.fvalue:.2f}")
print(f"Rule of thumb: F > 10 indicates strong instrument")
print(f"Result: {'Strong instrument' if first_stage.fvalue > 10 else 'Weak instrument warning'}")

In [None]:
# Visualize OLS vs IV estimates
fig = go.Figure()

# Data points
fig.add_trace(go.Scatter(
    x=iv_df['X'], y=iv_df['Y'],
    mode='markers', marker=dict(color='steelblue', size=5, opacity=0.4),
    name='Data'
))

# OLS line (biased)
x_range = np.linspace(iv_df['X'].min(), iv_df['X'].max(), 100)
fig.add_trace(go.Scatter(
    x=x_range, y=ols_model.params['const'] + ols_model.params['X'] * x_range,
    mode='lines', line=dict(color='red', width=3),
    name=f'OLS (β={ols_model.params["X"]:.3f}) - BIASED'
))

# IV line (consistent)
fig.add_trace(go.Scatter(
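    # Include the intercept when the fitted model has one (falls back to 0 otherwise)
    x=x_range, y=iv_model.params.get('const', 0.0) + iv_model.params['X'] * x_range,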
    mode='lines', line=dict(color='green', width=3),
    name=f'2SLS (β={iv_model.params["X"]:.3f}) - CONSISTENT'
))

# True relationship
fig.add_trace(go.Scatter(
    x=x_range, y=1 + true_beta * x_range,
    mode='lines', line=dict(color='black', width=2, dash='dash'),
    name=f'True (β={true_beta})'
))

fig.update_layout(
    title='OLS vs 2SLS: Correcting for Endogeneity',
    xaxis_title='X (Endogenous Variable)',
    yaxis_title='Y',
    template='plotly_white',
    legend=dict(x=0.02, y=0.98)
)
fig

## 6. GARCH Models for Volatility

Financial returns exhibit **volatility clustering**: periods of high volatility tend to be followed by high volatility, and calm periods by calm periods.

### ARCH (Autoregressive Conditional Heteroskedasticity)
$$\sigma_t^2 = \omega + \alpha_1 \varepsilon_{t-1}^2$$

Volatility depends on past squared shocks.

### GARCH(1,1) (Generalized ARCH)
$$\sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2$$

Where:
- $\omega$ = constant term in the variance equation (must be positive)
- $\alpha$ = reaction to recent shocks
- $\beta$ = persistence of past conditional variance
- $\alpha + \beta$ = overall persistence (must be < 1 for covariance stationarity)

### Key Properties
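- **Volatility Clustering**: Captured jointly by $\alpha$ (reaction to shocks) and $\beta$ (carry-over of past variance)
- **Fat Tails**: Even with normal innovations, GARCH returns have a leptokurtic unconditional distribution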
- **Mean Reversion**: Long-run variance = $\frac{\omega}{1 - \alpha - \beta}$
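
The mean-reversion property can be illustrated with the multi-step variance forecast implied by GARCH(1,1): $E_t[\sigma_{t+h}^2] = \bar{\sigma}^2 + (\alpha + \beta)^{h-1}(\sigma_{t+1}^2 - \bar{\sigma}^2)$, where $\bar{\sigma}^2 = \omega / (1 - \alpha - \beta)$. The sketch below is a minimal illustration; the function name `garch_variance_forecast` and the starting variance are assumptions, and the parameter values match the simulation cell in this section.

```python
import numpy as np

def garch_variance_forecast(sigma2_next, omega, alpha, beta, horizon):
    """Expected conditional variance for steps 1..horizon ahead under GARCH(1,1)."""
    long_run = omega / (1.0 - alpha - beta)
    h = np.arange(horizon)                        # index 0 is the one-step-ahead variance
    return long_run + (alpha + beta) ** h * (sigma2_next - long_run)

omega, alpha, beta = 1e-5, 0.10, 0.85
long_run_var = omega / (1.0 - alpha - beta)

# Start from a variance four times the long-run level and let it decay
path = garch_variance_forecast(4 * long_run_var, omega, alpha, beta, horizon=250)

# Number of periods for the gap to the long-run variance to halve
half_life = np.log(0.5) / np.log(alpha + beta)
print(long_run_var, path[:3], half_life)
```

The closer $\alpha + \beta$ is to 1, the longer the half-life, which is why highly persistent GARCH fits imply slow decay of volatility shocks.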

In [None]:
from arch import arch_model

# Simulate GARCH(1,1) process
np.random.seed(42)
n = 2000

# GARCH parameters
omega = 0.00001  # Constant term in the variance equation
alpha = 0.1      # Shock impact
beta = 0.85      # Persistence

# Initialize
returns = np.zeros(n)
sigma2 = np.zeros(n)
sigma2[0] = omega / (1 - alpha - beta)  # Unconditional variance

# Simulate
for t in range(1, n):
    sigma2[t] = omega + alpha * returns[t-1]**2 + beta * sigma2[t-1]
    returns[t] = np.sqrt(sigma2[t]) * np.random.normal()

# Create time series
returns_series = pd.Series(returns * 100, name='returns')  # Convert to percentage

print("=" * 60)
print("GARCH(1,1) SIMULATION")
print("=" * 60)
print(f"\nTrue Parameters:")
print(f"  ω (omega): {omega:.6f}")
print(f"  α (alpha): {alpha:.4f}")
print(f"  β (beta):  {beta:.4f}")
print(f"  α + β:     {alpha + beta:.4f}")
print(f"  Long-run variance: {omega/(1-alpha-beta):.6f}")
print(f"  Long-run volatility (daily %): {np.sqrt(omega/(1-alpha-beta))*100:.4f}%")

In [None]:
# Visualize Volatility Clustering
fig = make_subplots(rows=2, cols=1,
                    subplot_titles=('Returns (%)', 'Conditional Volatility'),
                    row_heights=[0.5, 0.5])

# Returns
fig.add_trace(
    go.Scatter(y=returns_series, mode='lines',
               line=dict(color='steelblue', width=0.8),
               showlegend=False),
    row=1, col=1
)

# Volatility
fig.add_trace(
    go.Scatter(y=np.sqrt(sigma2) * 100, mode='lines',
               line=dict(color='red', width=1),
               showlegend=False),
    row=2, col=1
)

fig.update_layout(height=500, title_text='GARCH(1,1): Volatility Clustering',
                  template='plotly_white')
fig.update_xaxes(title_text='Time', row=2, col=1)
fig.update_yaxes(title_text='Return (%)', row=1, col=1)
fig.update_yaxes(title_text='Volatility (%)', row=2, col=1)
fig

In [None]:
# Fit GARCH(1,1) Model to Simulated Data
garch_model = arch_model(returns_series, vol='Garch', p=1, q=1, rescale=False)
garch_fit = garch_model.fit(disp='off')

print("\n" + "=" * 60)
print("GARCH(1,1) ESTIMATION RESULTS")
print("=" * 60)
print(garch_fit.summary())

In [None]:
# Compare True vs Estimated Volatility
estimated_vol = garch_fit.conditional_volatility
true_vol = np.sqrt(sigma2) * 100

fig = go.Figure()

fig.add_trace(go.Scatter(
    y=true_vol[100:], name='True Volatility',
    line=dict(color='blue', width=1.5)
))

fig.add_trace(go.Scatter(
    y=estimated_vol[100:], name='Estimated Volatility',
    line=dict(color='red', width=1.5, dash='dot')
))

fig.update_layout(
    title='GARCH(1,1): True vs Estimated Conditional Volatility',
    xaxis_title='Time',
    yaxis_title='Volatility (%)',
    template='plotly_white',
    legend=dict(x=0.02, y=0.98)
)
fig.show()

# Correlation between true and estimated
corr = np.corrcoef(true_vol[100:], estimated_vol[100:])[0, 1]
print(f"\nCorrelation between true and estimated volatility: {corr:.4f}")

## 7. Vector Autoregression (VAR)

VAR models capture **dynamic relationships** among multiple time series.

### VAR(p) Model
$$\mathbf{Y}_t = \mathbf{c} + \mathbf{A}_1 \mathbf{Y}_{t-1} + \mathbf{A}_2 \mathbf{Y}_{t-2} + ... + \mathbf{A}_p \mathbf{Y}_{t-p} + \boldsymbol{\varepsilon}_t$$

For two variables:
$$\begin{bmatrix} Y_{1t} \\ Y_{2t} \end{bmatrix} = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} Y_{1,t-1} \\ Y_{2,t-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix}$$

### Key Concepts

**Granger Causality**: $X$ Granger-causes $Y$ if past values of $X$ help predict $Y$.

**Impulse Response Functions (IRF)**: Track how a shock to one variable propagates through the system.

**Forecast Error Variance Decomposition (FEVD)**: Shows what fraction of forecast error variance is due to each shock.
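
For a VAR(1), two of these ideas have simple closed forms that can be sketched in NumPy: the system is stable when every eigenvalue of $\mathbf{A}_1$ lies inside the unit circle, and the non-orthogonalized impulse response at horizon $h$ is $\mathbf{A}_1^h$. The matrix below is the one used in this section's simulation; treating shocks as unit-sized and uncorrelated is a simplifying assumption of the sketch.

```python
import numpy as np

A = np.array([[0.5, -0.2],
              [0.1,  0.6]])

# Stability: all eigenvalues of A must lie strictly inside the unit circle
eig_moduli = np.abs(np.linalg.eigvals(A))
is_stable = bool(np.all(eig_moduli < 1))

# Non-orthogonalized impulse responses: Psi_h = A^h, so irfs[h, i, j] is the
# response of variable i at horizon h to a unit shock in variable j at time 0
horizons = 10
irfs = np.stack([np.linalg.matrix_power(A, h) for h in range(horizons + 1)])
print(eig_moduli, is_stable)
print(irfs[1])
```

Because the eigenvalue moduli are below 1, the responses die out as the horizon grows. Orthogonalized IRFs additionally require a decomposition of the shock covariance matrix, which this sketch does not cover.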

### Finance Applications
- Stock-bond return relationships
- Macroeconomic forecasting
- Risk spillover analysis

In [None]:
from statsmodels.tsa.api import VAR
from statsmodels.tsa.stattools import grangercausalitytests

# Simulate VAR(1) process: Stock Returns and Bond Returns
np.random.seed(42)
n = 500

# VAR coefficient matrix
A = np.array([[0.5, -0.2],   # Stock: AR(1) + negative bond effect
              [0.1, 0.6]])   # Bond: slight stock spillover + AR(1)

# Simulate
stock_returns = np.zeros(n)
bond_returns = np.zeros(n)

for t in range(1, n):
    stock_returns[t] = A[0,0] * stock_returns[t-1] + A[0,1] * bond_returns[t-1] + np.random.normal(0, 1)
    bond_returns[t] = A[1,0] * stock_returns[t-1] + A[1,1] * bond_returns[t-1] + np.random.normal(0, 0.5)

# Create DataFrame
var_df = pd.DataFrame({
    'Stock_Returns': stock_returns,
    'Bond_Returns': bond_returns
})

# Visualize the series
fig = make_subplots(rows=2, cols=1, subplot_titles=('Stock Returns', 'Bond Returns'))

fig.add_trace(go.Scatter(y=stock_returns, mode='lines',
                         line=dict(color='blue', width=1), showlegend=False),
              row=1, col=1)
fig.add_trace(go.Scatter(y=bond_returns, mode='lines',
                         line=dict(color='green', width=1), showlegend=False),
              row=2, col=1)

fig.update_layout(height=400, title_text='Simulated Stock and Bond Returns',
                  template='plotly_white')
fig

In [None]:
# Fit VAR Model
var_model = VAR(var_df)

# Select optimal lag using information criteria
lag_order = var_model.select_order(maxlags=10)
print("=" * 60)
print("VAR LAG ORDER SELECTION")
print("=" * 60)
print(lag_order.summary())

In [None]:
# Fit VAR(1)
var_results = var_model.fit(1)
print("\n" + "=" * 60)
print("VAR(1) ESTIMATION RESULTS")
print("=" * 60)
print(var_results.summary())

In [None]:
# Granger Causality Tests
print("\n" + "=" * 60)
print("GRANGER CAUSALITY TESTS")
print("=" * 60)

print("\n--- Does Bond Granger-cause Stock? ---")
gc_bond_stock = grangercausalitytests(var_df[['Stock_Returns', 'Bond_Returns']], maxlag=2, verbose=True)

print("\n--- Does Stock Granger-cause Bond? ---")
gc_stock_bond = grangercausalitytests(var_df[['Bond_Returns', 'Stock_Returns']], maxlag=2, verbose=True)

In [None]:
# Impulse Response Functions
irf = var_results.irf(periods=20)

# Plot IRFs
fig = make_subplots(rows=2, cols=2,
                    subplot_titles=('Stock → Stock', 'Bond → Stock',
                                   'Stock → Bond', 'Bond → Bond'))

periods = np.arange(21)

# Stock shock on Stock
fig.add_trace(go.Scatter(x=periods, y=irf.irfs[:, 0, 0],
                         line=dict(color='blue', width=2), showlegend=False),
              row=1, col=1)

# Bond shock on Stock
fig.add_trace(go.Scatter(x=periods, y=irf.irfs[:, 0, 1],
                         line=dict(color='green', width=2), showlegend=False),
              row=1, col=2)

# Stock shock on Bond
fig.add_trace(go.Scatter(x=periods, y=irf.irfs[:, 1, 0],
                         line=dict(color='blue', width=2), showlegend=False),
              row=2, col=1)

# Bond shock on Bond
fig.add_trace(go.Scatter(x=periods, y=irf.irfs[:, 1, 1],
                         line=dict(color='green', width=2), showlegend=False),
              row=2, col=2)

# Add zero lines
for row in [1, 2]:
    for col in [1, 2]:
        fig.add_hline(y=0, line_dash='dash', line_color='gray', row=row, col=col)

fig.update_layout(height=500, title_text='Impulse Response Functions',
                  template='plotly_white')
fig.update_xaxes(title_text='Periods', row=2, col=1)
fig.update_xaxes(title_text='Periods', row=2, col=2)
fig

## 8. Practical Finance Applications

### Fama-French Factor Model
The classic multi-factor model:
$$R_{it} - R_{ft} = \alpha_i + \beta_{i,mkt}(R_{mt} - R_{ft}) + \beta_{i,smb} SMB_t + \beta_{i,hml} HML_t + \varepsilon_{it}$$

Where:
- $R_{mt} - R_{ft}$ = Market excess return
- $SMB_t$ = Small Minus Big (size factor)
- $HML_t$ = High Minus Low (value factor)

In [None]:
# Simulate Fama-French 3-Factor Model
np.random.seed(42)
n = 252 * 10  # 10 years of daily data

# Factor returns (simulated)
mkt_rf = np.random.normal(0.0003, 0.01, n)  # Market excess return
smb = np.random.normal(0.0001, 0.005, n)    # Size factor
hml = np.random.normal(0.0001, 0.006, n)    # Value factor

# True factor loadings for a hypothetical stock
true_alpha = 0.0002  # Daily alpha (~5% annual)
true_beta_mkt = 1.2
true_beta_smb = 0.5   # Small-cap tilt
true_beta_hml = -0.3  # Growth tilt (negative HML)

# Generate stock excess returns
stock_excess = (true_alpha + 
                true_beta_mkt * mkt_rf +
                true_beta_smb * smb +
                true_beta_hml * hml +
                np.random.normal(0, 0.008, n))  # Idiosyncratic risk

# Create DataFrame
ff_df = pd.DataFrame({
    'excess_return': stock_excess,
    'mkt_rf': mkt_rf,
    'smb': smb,
    'hml': hml
})

# Fit Fama-French model
X_ff = sm.add_constant(ff_df[['mkt_rf', 'smb', 'hml']])
ff_model = sm.OLS(ff_df['excess_return'], X_ff).fit(cov_type='HC1')  # Robust SEs

print("=" * 60)
print("FAMA-FRENCH 3-FACTOR MODEL")
print("=" * 60)
print("\nTrue Factor Loadings:")
print(f"  Alpha (daily): {true_alpha:.6f} (~{true_alpha*252*100:.2f}% annual)")
print(f"  Market Beta:   {true_beta_mkt:.4f}")
print(f"  SMB Beta:      {true_beta_smb:.4f}")
print(f"  HML Beta:      {true_beta_hml:.4f}")
print("\n" + "=" * 60)
print(ff_model.summary())

In [None]:
# Visualize Factor Exposures
# Alpha is annualized (x252) so it is visible next to the betas, which are unitless loadings
factor_names = ['Alpha (annualized)', 'Market', 'SMB', 'HML']
true_values = [true_alpha * 252, true_beta_mkt, true_beta_smb, true_beta_hml]
estimated = [ff_model.params['const'] * 252,
             ff_model.params['mkt_rf'],
             ff_model.params['smb'],
             ff_model.params['hml']]
# Approximate 95% confidence half-widths from the robust (HC1) standard errors
errors = [1.96 * ff_model.bse['const'] * 252,
          1.96 * ff_model.bse['mkt_rf'],
          1.96 * ff_model.bse['smb'],
          1.96 * ff_model.bse['hml']]

fig = go.Figure()

fig.add_trace(go.Bar(
    name='Estimated',
    x=factor_names,
    y=estimated,
    error_y=dict(type='data', array=errors),
    marker_color='steelblue'
))

fig.add_trace(go.Scatter(
    name='True Value',
    x=factor_names,
    y=true_values,
    mode='markers',
    marker=dict(color='red', size=12, symbol='diamond')
))

fig.update_layout(
    title='Fama-French Factor Exposures: Estimated vs True',
    yaxis_title='Factor Loading',
    template='plotly_white',
    barmode='group',
    legend=dict(x=0.75, y=0.95)
)
fig.add_hline(y=0, line_dash='dash', line_color='gray')
fig

### Risk Model: Value at Risk with GARCH

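GARCH-based VaR accounts for time-varying volatility:
$$VaR_{t+1}^{\alpha} = -\mu_{t+1} + \sigma_{t+1} \cdot z_{\alpha}$$

Where $\mu_{t+1}$ is the forecasted conditional mean, $\sigma_{t+1}$ is the GARCH-forecasted volatility, and $z_{\alpha}$ is the standard normal quantile at confidence level $\alpha$ (about 1.645 for 95%), so VaR is reported as a positive loss. For daily returns the mean is small relative to volatility and is often set to zero.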
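
As a minimal worked sketch of the VaR formula, assuming normally distributed returns and reading $z_{\alpha}$ as the positive quantile at the confidence level: the helper `parametric_var` and the inputs below are hypothetical illustrations, not output from the GARCH fit. The quantile comes from the standard library's `NormalDist` so the snippet runs on its own.

```python
from statistics import NormalDist

def parametric_var(mu, sigma, confidence, portfolio_value=1.0):
    """One-step parametric VaR: (-mu + sigma * z) * portfolio_value.

    mu and sigma are the forecast mean and volatility in decimal return units.
    """
    z = NormalDist().inv_cdf(confidence)  # positive quantile, about 1.645 at 95%
    return (-mu + sigma * z) * portfolio_value

# Hypothetical inputs: zero mean, 1% daily volatility, $1,000,000 portfolio
for cl in (0.90, 0.95, 0.99):
    var = parametric_var(mu=0.0, sigma=0.01, confidence=cl, portfolio_value=1_000_000)
    print(f"{cl:.0%} 1-day VaR: ${var:,.0f}")
```

A positive forecast mean lowers the VaR and a negative one raises it, which is the role of the $-\mu_{t+1}$ term.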

In [None]:
# GARCH-based VaR

# Use our previously simulated returns
# Forecast volatility
forecast = garch_fit.forecast(horizon=1)
forecasted_vol = np.sqrt(forecast.variance.values[-1, 0])

# VaR at different confidence levels
confidence_levels = [0.90, 0.95, 0.99]
z_scores = [stats.norm.ppf(1 - cl) for cl in confidence_levels]

print("=" * 60)
print("VALUE AT RISK (VaR) - GARCH APPROACH")
print("=" * 60)
print(f"\nForecasted 1-day volatility: {forecasted_vol:.4f}%")
print(f"\nFor $1,000,000 portfolio:")
print("-" * 40)

portfolio_value = 1_000_000

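for cl, z in zip(confidence_levels, z_scores):
    # forecasted_vol is assumed to be in percent (returns scaled by 100), hence the / 100;
    # the mean return is taken as zero
    var = forecasted_vol * abs(z) / 100 * portfolio_value
    print(f"  {int(cl*100)}% VaR: ${var:,.0f}")
    print(f"    Interpretation: {int(cl*100)}% confident the 1-day loss won't exceed ${var:,.0f}")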

In [None]:
# Rolling VaR Visualization
window = 252  # 1 year rolling window

# Calculate rolling VaR using historical simulation and GARCH
rolling_var_95 = []
garch_var_95 = []

for i in range(window, len(returns_series)):
    # Historical simulation VaR
    hist_var = -np.percentile(returns_series[i-window:i], 5)
    rolling_var_95.append(hist_var)
    
    # GARCH VaR
    garch_vol = estimated_vol[i]
    garch_var_95.append(garch_vol * 1.645)  # 95% z-score

fig = make_subplots(rows=2, cols=1,
                    subplot_titles=('Returns with VaR Breaches', '95% VaR: Historical vs GARCH'),
                    row_heights=[0.5, 0.5])

# Returns
fig.add_trace(
    go.Scatter(y=returns_series[window:], mode='lines',
               line=dict(color='steelblue', width=0.8), name='Returns'),
    row=1, col=1
)

# Negative VaR line
fig.add_trace(
    go.Scatter(y=[-v for v in garch_var_95], mode='lines',
               line=dict(color='red', width=1.5), name='-VaR (95%)'),
    row=1, col=1
)

# VaR comparison
fig.add_trace(
    go.Scatter(y=rolling_var_95, mode='lines',
               line=dict(color='blue', width=1.5), name='Historical VaR'),
    row=2, col=1
)
fig.add_trace(
    go.Scatter(y=garch_var_95, mode='lines',
               line=dict(color='red', width=1.5), name='GARCH VaR'),
    row=2, col=1
)

fig.update_layout(height=500, title_text='Value at Risk: Historical vs GARCH',
                  template='plotly_white')
fig.update_xaxes(title_text='Time', row=2, col=1)
fig.update_yaxes(title_text='Return (%)', row=1, col=1)
fig.update_yaxes(title_text='VaR (%)', row=2, col=1)
fig

## 9. Summary: Econometric Toolkit for Finance

| Method | Use Case | Key Assumptions | Python Package |
|--------|----------|-----------------|----------------|
| **OLS** | Cross-sectional relationships, factor models | Exogeneity, homoskedasticity | `statsmodels` |
| **ADF/Unit Root** | Stationarity testing | Adequate lag order, correct deterministic terms | `statsmodels.tsa` |
| **Cointegration** | Long-run relationships, pairs trading | Series integrated of the same order (typically I(1)) | `statsmodels.tsa` |
| **Panel FE/RE** | Cross-sectional + time series | FE: strict exogeneity; RE: entity effects uncorrelated with regressors | `linearmodels` |
| **2SLS/IV** | Endogeneity correction | Relevant and exogenous instruments | `linearmodels` |
| **GARCH** | Volatility modeling, VaR | Volatility clustering, $\alpha + \beta < 1$ | `arch` |
| **VAR** | Multivariate dynamics, forecasting | Stationarity, adequate lag order | `statsmodels.tsa` |

### Best Practices

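1. **Always check assumptions**: Run diagnostic tests before interpreting results
2. **Use robust standard errors**: Use heteroskedasticity-robust (HC) errors when heteroskedasticity is suspected, and HAC (Newey-West) errors when autocorrelation is suspected as well
3. **Test for stationarity**: Run unit root tests before time series regressions to avoid spurious results
4. **Be cautious with causality**: Regression coefficients are correlational by default, and Granger causality only measures predictive content; causal claims need an identification strategy such as valid instruments
5. **Test out of sample**: Validate models on data not used for estimation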
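
To illustrate the second practice, here is a minimal NumPy sketch that compares classical and HC1 standard errors on simulated data whose error variance grows with the regressor. The data-generating process and variable names are assumptions made for illustration; the HC1 formula is the usual sandwich estimator with an $n/(n-k)$ correction, the same variant requested earlier with `cov_type='HC1'`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
# Error standard deviation is |x|, so homoskedasticity fails by construction
y = 1.0 + 2.0 * x + x * rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Classical covariance: s^2 (X'X)^{-1}
s2 = resid @ resid / (n - k)
se_classical = np.sqrt(np.diag(s2 * XtX_inv))

# HC1 sandwich: n/(n-k) * (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}
meat = X.T @ (X * resid[:, None] ** 2)
se_hc1 = np.sqrt(np.diag(n / (n - k) * XtX_inv @ meat @ XtX_inv))

print("slope SE, classical:", se_classical[1])
print("slope SE, HC1:      ", se_hc1[1])
```

In this setup the classical formula understates the slope's standard error, so t-statistics based on it would look more significant than they should.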

## 10. Key Formulas Reference

### OLS Estimator
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$
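
A short sketch of this estimator in NumPy, using simulated data with hypothetical coefficients. It solves the normal equations $(\mathbf{X}'\mathbf{X})\boldsymbol{\beta} = \mathbf{X}'\mathbf{Y}$ directly instead of forming the inverse, which is algebraically the same but numerically more stable, and cross-checks against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 regressors
true_beta = np.array([0.5, 1.5, -2.0])  # hypothetical coefficients
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Normal equations: (X'X) beta = X'Y
beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", beta_normal_eq)
print("lstsq:           ", beta_lstsq)
```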

### R-squared
$$R^2 = 1 - \frac{SSR}{SST} = 1 - \frac{\sum(Y_i - \hat{Y}_i)^2}{\sum(Y_i - \bar{Y})^2}$$
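
The formula translates directly into code. In this sketch, `r_squared` is a hypothetical helper and the toy values are chosen so the two boundary cases are easy to verify by hand.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SSR / SST, where SSR is the sum of squared residuals."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ssr = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - ssr / sst

y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))                          # perfect fit: SSR = 0
print(r_squared(y, np.full_like(y, y.mean())))  # mean-only prediction: SSR = SST
```

A perfect fit gives $R^2 = 1$, and predicting the sample mean for every observation gives $R^2 = 0$.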

### GARCH(1,1)
$$\sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2$$
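
The recursion can be run by hand for a given shock sequence. The sketch below assumes the parameters are already known (in practice they are estimated, for example with `arch`) and starts the recursion at the unconditional variance $\omega / (1 - \alpha - \beta)$, which requires $\alpha + \beta < 1$. The helper name `garch11_variance` and the inputs are hypothetical.

```python
import numpy as np

def garch11_variance(eps, omega, alpha, beta, sigma2_0=None):
    """Conditional variances from the GARCH(1,1) recursion for shocks eps."""
    eps = np.asarray(eps, dtype=float)
    if sigma2_0 is None:
        sigma2_0 = omega / (1.0 - alpha - beta)  # unconditional variance
    sigma2 = np.empty(len(eps))
    sigma2[0] = sigma2_0
    for t in range(1, len(eps)):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2

# Hypothetical shocks and parameters
eps = np.array([1.0, 0.0, 2.0])
sigma2 = garch11_variance(eps, omega=0.1, alpha=0.1, beta=0.8)
print(sigma2)
```

Here a zero shock at $t-1$ lets the variance decay below its long-run level, while a large shock pushes the next period's variance up.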

### Instrumental Variables (one regressor, one instrument)
$$\hat{\beta}_{IV} = \frac{Cov(Z, Y)}{Cov(Z, X)}$$
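
A small simulation shows why this ratio helps. In the sketch below, all variables and the data-generating process are assumptions for illustration: an unobserved confounder `u` drives both `x` and `y`, so OLS is biased upward, while the instrument `z` affects `y` only through `x`.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000
z = rng.normal(size=n)            # instrument: relevant for x, unrelated to u
u = rng.normal(size=n)            # unobserved confounder
x = z + u + rng.normal(size=n)    # endogenous regressor
y = 2.0 * x + u + rng.normal(size=n)  # true slope is 2.0

# OLS slope: Cov(x, y) / Var(x), contaminated by Cov(x, u) > 0
beta_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# IV slope: Cov(z, y) / Cov(z, x)
beta_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"OLS estimate: {beta_ols:.3f}")
print(f"IV estimate:  {beta_iv:.3f}")
```

Under this setup the OLS slope converges to $2 + Cov(x, u)/Var(x) = 2 + 1/3$, whereas the IV ratio converges to the true value of 2.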

### Hausman Statistic
$$H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})'[Var(\hat{\beta}_{FE}) - Var(\hat{\beta}_{RE})]^{-1}(\hat{\beta}_{FE} - \hat{\beta}_{RE}) \sim \chi^2_k$$
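
The statistic is a quadratic form and can be computed from the two coefficient vectors and their covariance matrices. The numbers below are hypothetical and do not come from a fitted model; in practice they would be taken from the FE and RE estimation results. Note that in finite samples the variance difference may fail to be positive definite, in which case the statistic is not reliable.

```python
import numpy as np

def hausman_statistic(b_fe, b_re, v_fe, v_re):
    """H = d' [Var(b_FE) - Var(b_RE)]^{-1} d, with d = b_FE - b_RE."""
    diff = np.asarray(b_fe, dtype=float) - np.asarray(b_re, dtype=float)
    v_diff = np.asarray(v_fe, dtype=float) - np.asarray(v_re, dtype=float)
    # Solve v_diff @ w = diff instead of forming the inverse explicitly
    return float(diff @ np.linalg.solve(v_diff, diff))

# Hypothetical estimates for k = 2 slope coefficients
b_fe = [1.0, 0.5]
b_re = [0.9, 0.6]
v_fe = np.diag([0.02, 0.02])
v_re = np.diag([0.01, 0.01])

H = hausman_statistic(b_fe, b_re, v_fe, v_re)
print(f"H = {H:.3f}")
```

With $k = 2$ the 5% critical value of $\chi^2_2$ is 5.991, so a statistic of about 2 would not reject the random effects specification.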

### Durbin-Watson
$$DW = \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n}e_t^2}$$

- DW ≈ 2: No first-order autocorrelation
- DW well below 2 (toward 0): Positive autocorrelation
- DW well above 2 (toward 4): Negative autocorrelation
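
A direct implementation of the statistic, as a sketch on hand-picked residuals (the helper name and inputs are illustrative; `statsmodels.stats.stattools.durbin_watson` provides the same calculation for fitted models).

```python
import numpy as np

def durbin_watson(resid):
    """Sum of squared first differences over the sum of squared residuals."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Residuals that flip sign every period: strong negative autocorrelation
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # 12 / 4 = 3.0

# Residuals that never change: extreme positive autocorrelation
print(durbin_watson([1.0, 1.0, 1.0, 1.0]))    # 0 / 4 = 0.0
```

The statistic always lies between 0 and 4, which is why 2 marks the no-autocorrelation midpoint.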

In [None]:
# Required packages summary
print("=" * 60)
print("REQUIRED PACKAGES FOR ECONOMETRICS")
print("=" * 60)
print("""
# Core packages
pip install numpy pandas scipy

# Econometrics
pip install statsmodels
pip install linearmodels
pip install arch

# Visualization
pip install plotly
""")