# causers Basic Examples

This notebook demonstrates the four main API functions in the `causers` package:

1. **Linear Regression** with clustered standard errors
2. **Logistic Regression** with clustered standard errors  
3. **Synthetic Control** for a single treated unit
4. **Synthetic Difference-in-Differences** for panel data

All examples use synthetic data with fixed random seeds for reproducibility.

**Requirements**: `causers`, `polars`, `numpy`

In [1]:
"""Import required packages."""
import numpy as np
import polars as pl
import causers

print(f"causers version: {causers.__version__}")

causers version: 0.6.0


## 1. Linear Regression with Clustered Standard Errors

This example demonstrates:
- Multiple covariate regression (3 predictors)
- Clustered standard errors for panel/grouped data
- Non-trivial R² (between 0 and 1)

**Data Generating Process (DGP)**:
- 100 observations across 10 clusters (10 obs per cluster)
- True coefficients: β₁=2.0, β₂=-1.5, β₃=0.8, intercept=5.0
- Within-cluster correlation via cluster-specific intercepts
- Gaussian noise with σ=2.0
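
The cluster-specific intercepts in the code cell for this example are drawn with a standard deviation of 1.5 (a value taken from that cell, not from the list above). Together with the σ=2.0 noise, this fixes how strongly errors are correlated within a cluster. A minimal sketch of that arithmetic, using illustrative variable names:

```python
# Intraclass correlation implied by the DGP: the share of error variance
# that is common to all observations in the same cluster.
sigma_cluster = 1.5   # SD of the cluster-specific intercepts (from the code cell)
sigma_noise = 2.0     # SD of the idiosyncratic noise

icc = sigma_cluster**2 / (sigma_cluster**2 + sigma_noise**2)
print(f"Implied intraclass correlation: {icc:.2f}")  # prints 0.36
```

An intraclass correlation of this size is large enough that ignoring the clustering would typically understate the standard errors.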

In [2]:
"""Linear regression with clustered standard errors."""
# Set seed for reproducibility
np.random.seed(42)

# Data parameters
n_obs = 100
n_clusters = 10
obs_per_cluster = n_obs // n_clusters

# True coefficients
beta_1, beta_2, beta_3, intercept = 2.0, -1.5, 0.8, 5.0

# Generate cluster IDs and cluster-specific intercepts
cluster_ids = np.repeat(np.arange(n_clusters), obs_per_cluster)
cluster_effects = np.random.normal(0, 1.5, n_clusters)  # Within-cluster correlation

# Generate covariates
x1 = np.random.normal(0, 1, n_obs)
x2 = np.random.normal(0, 1, n_obs)
x3 = np.random.normal(0, 1, n_obs)

# Generate outcome with cluster effects and noise
y = (intercept 
     + beta_1 * x1 
     + beta_2 * x2 
     + beta_3 * x3 
     + cluster_effects[cluster_ids]  # Cluster-specific intercepts
     + np.random.normal(0, 2.0, n_obs))  # Idiosyncratic noise

# Create DataFrame
df = pl.DataFrame({
    "x1": x1,
    "x2": x2,
    "x3": x3,
    "y": y,
    "cluster_id": cluster_ids.astype(int)
})

print(f"Data shape: {df.shape}")
print(f"Clusters: {df['cluster_id'].n_unique()}")
print(f"Observations per cluster: {obs_per_cluster}")
print()

# Run regression with clustered standard errors
result = causers.linear_regression(
    df, 
    x_cols=["x1", "x2", "x3"], 
    y_col="y", 
    cluster="cluster_id",
    seed=42
)

# Display results
print("=" * 50)
print("LINEAR REGRESSION RESULTS")
print("=" * 50)
print(f"Observations: {result.n_samples}")
print(f"Clusters: {result.n_clusters}")
print(f"SE Type: {result.cluster_se_type}")
print()
print("Coefficients:")
print(f"  x1: {result.coefficients[0]:8.4f} ± {result.standard_errors[0]:.4f}  (true: {beta_1})")
print(f"  x2: {result.coefficients[1]:8.4f} ± {result.standard_errors[1]:.4f}  (true: {beta_2})")
print(f"  x3: {result.coefficients[2]:8.4f} ± {result.standard_errors[2]:.4f}  (true: {beta_3})")
print(f"  Intercept: {result.intercept:8.4f} ± {result.intercept_se:.4f}  (true: {intercept})")
print()
print(f"R-squared: {result.r_squared:.4f}")

# Verify constraints
assert 0.0 < result.r_squared < 1.0, "R² should be strictly between 0 and 1"
assert all(se > 0 for se in result.standard_errors), "All SEs should be > 0"
assert result.n_clusters >= 5, "Should have at least 5 clusters"
print("\n✅ All validation checks passed")

Data shape: (100, 5)
Clusters: 10
Observations per cluster: 10

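==================================================
LINEAR REGRESSION RESULTS
==================================================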
Observations: 100
Clusters: 10
SE Type: analytical

Coefficients:
  x1:   1.5363 ± 0.1723  (true: 2.0)
  x2:  -1.6280 ± 0.2134  (true: -1.5)
  x3:   0.9113 ± 0.1609  (true: 0.8)
  Intercept:   5.7515 ± 0.3909  (true: 5.0)

R-squared: 0.6339

✅ All validation checks passed
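
To make the idea of clustered standard errors concrete, the following is a minimal NumPy sketch of the textbook cluster-robust "sandwich" estimator for OLS. It is not the `causers` implementation, which this notebook does not show. The helper name `cluster_robust_se` is hypothetical, and the small-sample correction factor is an assumption (a common CR1-style choice), so the numbers need not match `causers` exactly.

```python
import numpy as np

def cluster_robust_se(X, y, clusters):
    """OLS coefficients with cluster-robust (sandwich) standard errors."""
    n, k = X.shape
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y
    resid = y - X @ beta

    # "Meat": sum over clusters of the outer product of within-cluster scores
    meat = np.zeros((k, k))
    groups = np.unique(clusters)
    for g in groups:
        in_g = clusters == g
        score = X[in_g].T @ resid[in_g]
        meat += np.outer(score, score)

    # Small-sample correction (assumed CR1-style factor, not confirmed for causers)
    G = len(groups)
    correction = (G / (G - 1)) * ((n - 1) / (n - k))
    cov = correction * bread @ meat @ bread
    return beta, np.sqrt(np.diag(cov))

# Toy data in the spirit of the example: 10 clusters of 10 observations
rng = np.random.default_rng(0)
clusters = np.repeat(np.arange(10), 10)
features = rng.normal(size=(100, 3))
X = np.column_stack([np.ones(100), features])  # intercept column first
y = (5.0 + features @ np.array([2.0, -1.5, 0.8])
     + rng.normal(0, 1.5, 10)[clusters]        # cluster-level shocks
     + rng.normal(0, 2.0, 100))                # idiosyncratic noise

beta_hat, se_hat = cluster_robust_se(X, y, clusters)
print("coefficients:", np.round(beta_hat, 3))
print("cluster SEs: ", np.round(se_hat, 3))
```

The point estimates are ordinary OLS; only the covariance changes, because residuals are summed within each cluster before being squared.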



## 2. Logistic Regression with Clustered Standard Errors

This example demonstrates:
- Binary outcome regression (0/1)
- Multiple covariates (2 predictors)
- Clustered standard errors
- Roughly balanced classes (30-70% positive)

**Data Generating Process (DGP)**:
- 100 observations across 10 clusters
- True coefficients: β₁=1.0, β₂=-0.5, intercept=-0.2
- Outcome via logistic transformation: P(y=1|x) = 1/(1+exp(-xβ))
- Class balance kept roughly even by choosing a small intercept (-0.2)
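
As a quick check on the class balance, the logistic transformation can be evaluated at the covariate means. Both covariates are standard normal in this DGP, so the mean point is x1 = x2 = 0. This is a minimal sketch; the helper `sigmoid` is defined here for illustration only.

```python
import math

def sigmoid(z):
    """Logistic function mapping a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

intercept = -0.2
p_at_mean = sigmoid(intercept)  # both covariates at their mean of 0
print(f"P(y=1 | x1=0, x2=0) = {p_at_mean:.3f}")  # prints 0.450
```

A probability of about 0.45 at the mean is why the simulated share of positives lands inside the 30-70% target band.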

In [3]:
"""Logistic regression with clustered standard errors."""
# Set seed for reproducibility
np.random.seed(42)

# Data parameters
n_obs = 100
n_clusters = 10
obs_per_cluster = n_obs // n_clusters

# True coefficients (chosen for ~balanced classes)
beta_1, beta_2, intercept = 1.0, -0.5, -0.2

# Generate cluster IDs
cluster_ids = np.repeat(np.arange(n_clusters), obs_per_cluster)

# Generate covariates
x1 = np.random.normal(0, 1, n_obs)
x2 = np.random.normal(0, 1, n_obs)

# Compute linear predictor
linear_pred = intercept + beta_1 * x1 + beta_2 * x2

# Generate binary outcome via logistic model
prob = 1 / (1 + np.exp(-linear_pred))
y = (np.random.uniform(0, 1, n_obs) < prob).astype(float)

# Create DataFrame
df = pl.DataFrame({
    "x1": x1,
    "x2": x2,
    "y": y,
    "cluster_id": cluster_ids.astype(int)
})

# Check class balance
class_1_pct = y.mean() * 100
print(f"Data shape: {df.shape}")
print(f"Clusters: {df['cluster_id'].n_unique()}")
print(f"Class balance: {class_1_pct:.1f}% positive (target: 30-70%)")
print()

# Run logistic regression with clustered standard errors
result = causers.logistic_regression(
    df,
    x_cols=["x1", "x2"],
    y_col="y",
    cluster="cluster_id",
    seed=42
)

# Display results
print("=" * 50)
print("LOGISTIC REGRESSION RESULTS")
print("=" * 50)
print(f"Observations: {result.n_samples}")
print(f"Clusters: {result.n_clusters}")
print(f"SE Type: {result.cluster_se_type}")
print(f"Converged: {result.converged} ({result.iterations} iterations)")
print()
print("Coefficients (log-odds):")
print(f"  x1: {result.coefficients[0]:8.4f} ± {result.standard_errors[0]:.4f}  (true: {beta_1})")
print(f"  x2: {result.coefficients[1]:8.4f} ± {result.standard_errors[1]:.4f}  (true: {beta_2})")
print(f"  Intercept: {result.intercept:8.4f} ± {result.intercept_se:.4f}  (true: {intercept})")
print()
print(f"Pseudo R-squared (McFadden): {result.pseudo_r_squared:.4f}")
print(f"Log-likelihood: {result.log_likelihood:.4f}")

# Verify constraints
assert result.converged, "Model should converge"
assert all(se > 0 for se in result.standard_errors), "All SEs should be > 0"
assert result.n_clusters >= 5, "Should have at least 5 clusters"
assert 30 <= class_1_pct <= 70, "Classes should be balanced (30-70%)"
print("\n✅ All validation checks passed")

Data shape: (100, 4)
Clusters: 10
Class balance: 44.0% positive (target: 30-70%)

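==================================================
LOGISTIC REGRESSION RESULTS
==================================================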
Observations: 100
Clusters: 10
SE Type: analytical
Converged: True (6 iterations)

Coefficients (log-odds):
  x1:   1.2492 ± 0.4258  (true: 1.0)
  x2:  -1.0098 ± 0.2344  (true: -0.5)
  Intercept:  -0.2660 ± 0.2005  (true: -0.2)

Pseudo R-squared (McFadden): 0.2402
Log-likelihood: -52.1204

✅ All validation checks passed
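
The McFadden pseudo R-squared can be cross-checked by hand from the reported figures, assuming the standard definition: one minus the ratio of the fitted log-likelihood to that of an intercept-only model. Whether `causers` uses exactly this definition is an assumption. The sketch below uses only numbers printed above.

```python
import math

n_obs = 100
n_positive = 44              # 44.0% positive, from the output above
log_likelihood = -52.1204    # fitted model, from the output above

# Log-likelihood of an intercept-only model that always predicts the base rate
p = n_positive / n_obs
ll_null = n_positive * math.log(p) + (n_obs - n_positive) * math.log(1 - p)

pseudo_r2 = 1 - log_likelihood / ll_null
print(f"Null log-likelihood: {ll_null:.4f}")
print(f"McFadden pseudo R-squared: {pseudo_r2:.4f}")
```

Under that assumption the result should agree with the reported 0.2402 up to rounding of the inputs.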



## 3. Synthetic Control (Single Treated Unit)

This example demonstrates:
- Panel data structure (units × time periods)
- Single treated unit with multiple controls
- Pre-treatment fit quality (RMSE > 0)
- In-space placebo standard errors

**Data Generating Process (DGP)**:
- 10 units (1 treated, 9 controls) observed over 8 periods
- Treatment begins in the sixth period (`time` = 5, since time is 0-indexed), leaving 3 post-treatment periods
- Common time trend + unit-specific levels + noise
- Treatment effect: ATT = 5.0

In [4]:
"""Synthetic control with single treated unit."""
# Set seed for reproducibility
np.random.seed(42)

# Panel dimensions
n_units = 10
n_periods = 8
n_pre = 5       # Pre-treatment periods (time 0-4)
n_post = 3      # Post-treatment periods (time 5-7)
treated_unit = 0  # Unit 0 is treated

# True treatment effect
true_att = 5.0

# Unit-specific fixed effects (different levels)
unit_effects = np.random.uniform(5, 15, n_units)

# Time trend (common to all units)
time_trend = np.arange(n_periods) * 0.5

# Generate panel data
data = {"unit": [], "time": [], "outcome": [], "treated": []}

for unit in range(n_units):
    for t in range(n_periods):
        outcome = unit_effects[unit] + time_trend[t] + np.random.normal(0, 0.5)
        
        # Add treatment effect for treated unit in post-period
        is_treated = (unit == treated_unit) and (t >= n_pre)
        if is_treated:
            outcome += true_att
        
        data["unit"].append(unit)
        data["time"].append(t)
        data["outcome"].append(outcome)
        data["treated"].append(1 if is_treated else 0)

df = pl.DataFrame(data)

# Summary stats
n_control = n_units - 1
print(f"Panel dimensions: {n_units} units × {n_periods} periods = {len(df)} obs")
print(f"Treated units: 1, Control units: {n_control}")
print(f"Pre-treatment periods: {n_pre}, Post-treatment periods: {n_post}")
print()

# Run synthetic control
result = causers.synthetic_control(
    df,
    unit_col="unit",
    time_col="time",
    outcome_col="outcome",
    treatment_col="treated",
    method="traditional",
    seed=42
)

# Display results
print("=" * 50)
print("SYNTHETIC CONTROL RESULTS")
print("=" * 50)
print(f"Method: {result.method}")
print(f"Control units: {result.n_units_control}")
print(f"Pre-treatment periods: {result.n_periods_pre}")
print(f"Post-treatment periods: {result.n_periods_post}")
print()
print(f"ATT: {result.att:.4f} ± {result.standard_error:.4f}  (true: {true_att})")
print(f"Pre-treatment RMSE: {result.pre_treatment_rmse:.4f}")
print()
print("Unit weights (top 3):")
# idx is a position in result.unit_weights, not a unit ID (unit 0 is the treated unit)
weights_sorted = sorted(enumerate(result.unit_weights), key=lambda x: -x[1])
for idx, weight in weights_sorted[:3]:
    print(f"  Control unit {idx}: {weight:.4f}")
print()
print(f"Solver converged: {result.solver_converged} ({result.solver_iterations} iterations)")

# Verify constraints
assert result.standard_error > 0, "SE should be > 0"
assert result.pre_treatment_rmse > 0, "Pre-treatment RMSE should be > 0"
assert result.n_units_control >= 3, "Should have at least 3 control units"
assert result.n_periods_pre >= 3, "Should have at least 3 pre-treatment periods"
print("\n✅ All validation checks passed")

Panel dimensions: 10 units × 8 periods = 80 obs
Treated units: 1, Control units: 9
Pre-treatment periods: 5, Post-treatment periods: 3

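==================================================
SYNTHETIC CONTROL RESULTS
==================================================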
Method: traditional
Control units: 9
Pre-treatment periods: 5
Post-treatment periods: 3

ATT: 4.5243 ± 0.8091  (true: 5.0)
Pre-treatment RMSE: 0.1575

Unit weights (top 3):
  Control unit 5: 0.3733
  Control unit 0: 0.2903
  Control unit 3: 0.2736

Solver converged: True (1000 iterations)

✅ All validation checks passed
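
For intuition about what the unit weights represent, here is a minimal sketch of the classic synthetic control problem: choose non-negative weights that sum to one so that the weighted controls track the treated unit before treatment. The solver below (projected gradient descent) and the helper names `project_to_simplex` and `fit_sc_weights` are illustrative assumptions. The notebook does not show how `causers` solves this problem internally.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def fit_sc_weights(Y_control_pre, y_treated_pre, n_iter=5000):
    """Simplex-constrained least squares via projected gradient descent."""
    n_controls = Y_control_pre.shape[1]
    w = np.full(n_controls, 1.0 / n_controls)
    # Step size 1/L, with L the largest eigenvalue of Y'Y
    step = 1.0 / np.linalg.norm(Y_control_pre, 2) ** 2
    for _ in range(n_iter):
        grad = Y_control_pre.T @ (Y_control_pre @ w - y_treated_pre)
        w = project_to_simplex(w - step * grad)
    return w

# Toy panel: rows are periods, columns are control units
rng = np.random.default_rng(0)
n_pre, n_post, n_controls = 8, 3, 5
Y_control = rng.normal(size=(n_pre + n_post, n_controls))
y_treated = Y_control @ np.array([0.5, 0.3, 0.2, 0.0, 0.0])
y_treated[n_pre:] += 5.0  # treatment effect in the post-period

w = fit_sc_weights(Y_control[:n_pre], y_treated[:n_pre])
att = np.mean(y_treated[n_pre:] - Y_control[n_pre:] @ w)
pre_rmse = np.sqrt(np.mean((y_treated[:n_pre] - Y_control[:n_pre] @ w) ** 2))
print("weights:", np.round(w, 3))
print(f"ATT estimate: {att:.3f}, pre-treatment RMSE: {pre_rmse:.4f}")
```

The ATT is the average post-treatment gap between the treated unit and its synthetic counterpart, and the pre-treatment RMSE measures how well the weighted controls reproduce the treated unit before treatment.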


## 4. Synthetic Difference-in-Differences

This example demonstrates:
- Panel data with multiple treated units
- Time and unit weight optimization
- Placebo bootstrap standard errors
- Pre-treatment fit quality (RMSE > 0)

**Data Generating Process (DGP)**:
- 12 units (3 treated, 9 controls) observed over 6 periods
- Treatment begins in the fourth period (`time` = 3, since time is 0-indexed), leaving 3 post-treatment periods
- Heterogeneous unit trends + common shocks + noise
- Treatment effect: ATT = 3.0

In [5]:
"""Synthetic Difference-in-Differences with multiple treated units."""
# Set seed for reproducibility
np.random.seed(42)

# Panel dimensions
n_units = 12
n_periods = 6
n_pre = 3       # Pre-treatment periods (0-2)
n_post = 3      # Post-treatment periods (3-5)
n_treated = 3   # Units 0, 1, 2 are treated
n_control = n_units - n_treated

# True treatment effect
true_att = 3.0

# Unit-specific trends (heterogeneous growth rates)
unit_trends = np.random.uniform(0.3, 0.7, n_units)

# Unit-specific intercepts
unit_intercepts = np.random.uniform(10, 20, n_units)

# Common time shocks (affects all units)
time_shocks = np.random.normal(0, 0.3, n_periods)

# Generate panel data
data = {"unit": [], "time": [], "outcome": [], "treated": []}

for unit in range(n_units):
    for t in range(n_periods):
        # Base outcome: intercept + trend + time shock + noise
        outcome = (unit_intercepts[unit] 
                   + unit_trends[unit] * t 
                   + time_shocks[t]
                   + np.random.normal(0, 0.5))
        
        # Add treatment effect for treated units in post-period
        is_treated = (unit < n_treated) and (t >= n_pre)
        if is_treated:
            outcome += true_att
        
        data["unit"].append(unit)
        data["time"].append(t)
        data["outcome"].append(outcome)
        data["treated"].append(1 if is_treated else 0)

df = pl.DataFrame(data)

# Summary stats
print(f"Panel dimensions: {n_units} units × {n_periods} periods = {len(df)} obs")
print(f"Treated units: {n_treated}, Control units: {n_control}")
print(f"Pre-treatment periods: {n_pre}, Post-treatment periods: {n_post}")
print()

# Run Synthetic DID
result = causers.synthetic_did(
    df,
    unit_col="unit",
    time_col="time",
    outcome_col="outcome",
    treatment_col="treated",
    bootstrap_iterations=200,
    seed=42
)

# Display results
print("=" * 50)
print("SYNTHETIC DID RESULTS")
print("=" * 50)
print(f"Control units: {result.n_units_control}")
print(f"Treated units: {result.n_units_treated}")
print(f"Pre-treatment periods: {result.n_periods_pre}")
print(f"Post-treatment periods: {result.n_periods_post}")
print()
print(f"ATT: {result.att:.4f} ± {result.standard_error:.4f}  (true: {true_att})")
print(f"Pre-treatment fit (RMSE): {result.pre_treatment_fit:.4f}")
print(f"Bootstrap iterations used: {result.bootstrap_iterations_used}")
print()
print(f"Solver converged: {result.solver_converged}")
print(f"Solver iterations: {result.solver_iterations}")

# Show weight distributions
print()
print(f"Unit weights (control): min={min(result.unit_weights):.4f}, max={max(result.unit_weights):.4f}")
print(f"Time weights (pre-period): min={min(result.time_weights):.4f}, max={max(result.time_weights):.4f}")

# Verify constraints
assert result.standard_error > 0, "SE should be > 0"
assert result.pre_treatment_fit > 0, "Pre-treatment RMSE should be > 0"
assert result.n_units_control >= 3, "Should have at least 3 control units"
assert result.n_periods_pre >= 3, "Should have at least 3 pre-treatment periods"
assert result.bootstrap_iterations_used >= 100, "Should use at least 100 bootstrap iterations"
print("\n✅ All validation checks passed")

Panel dimensions: 12 units × 6 periods = 72 obs
Treated units: 3, Control units: 9
Pre-treatment periods: 3, Post-treatment periods: 3

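==================================================
SYNTHETIC DID RESULTS
==================================================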
Control units: 9
Treated units: 3
Pre-treatment periods: 3
Post-treatment periods: 3

ATT: 2.9655 ± 0.1958  (true: 3.0)
Pre-treatment fit (RMSE): 0.0780
Bootstrap iterations used: 200

Solver converged: True
Solver iterations: (200, 200)

Unit weights (control): min=0.0483, max=0.2270
Time weights (pre-period): min=0.0000, max=1.0000

✅ All validation checks passed
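
Given a set of unit and time weights, the synthetic DID point estimate for a block design like this one can be written as a weighted double difference. The sketch below illustrates that formula only. The helper `weighted_did` and its row and column ordering are assumptions made for this example, and the weight optimization and placebo bootstrap performed by `causers` are not reproduced.

```python
import numpy as np

def weighted_did(Y, n_treated, n_post, unit_weights, time_weights):
    """Weighted double difference on a units x periods outcome matrix.

    Rows must be ordered controls first, then treated units; columns must be
    ordered pre-treatment periods first, then post-treatment periods.
    """
    n_control = Y.shape[0] - n_treated
    n_pre = Y.shape[1] - n_post
    assert len(unit_weights) == n_control and len(time_weights) == n_pre
    # Row contrast: average of treated units minus weighted controls
    row = np.concatenate([-np.asarray(unit_weights),
                          np.full(n_treated, 1 / n_treated)])
    # Column contrast: average of post periods minus weighted pre periods
    col = np.concatenate([-np.asarray(time_weights),
                          np.full(n_post, 1 / n_post)])
    return row @ Y @ col

# Noise-free toy panel with additive unit and time effects and an effect of 3.0
n_control, n_treated, n_pre, n_post = 9, 3, 3, 3
rng = np.random.default_rng(0)
unit_fe = rng.uniform(10, 20, n_control + n_treated)
time_fe = 0.5 * np.arange(n_pre + n_post)
Y = unit_fe[:, None] + time_fe[None, :]
Y[n_control:, n_pre:] += 3.0  # treated units (last rows) in post periods

uniform_units = np.full(n_control, 1 / n_control)
uniform_time = np.full(n_pre, 1 / n_pre)
att = weighted_did(Y, n_treated, n_post, uniform_units, uniform_time)
print(f"Double-difference estimate: {att:.4f}")  # prints 3.0000
```

With uniform weights this reduces to an ordinary difference-in-differences. Because the toy panel has purely additive unit and time effects, any weights that sum to one recover the effect, including time weights concentrated on a single pre-period as in the output above.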

