# Homework Reflection 5-8

## Week 5: Simulated Dataset

- Lightning storms frighten away deer and bears, decreasing both populations, and cause flowers to grow, increasing the flower population.

- Bears eat deer, decreasing the deer population.

- Deer eat flowers, decreasing the flower population.

- Write a dataset that simulates this situation.  (Show the code.) Include noise / randomness in all cases.

- Identify a backdoor path with one or more confounders for the relationship between deer and flowers.


In [2]:
import numpy as np
import pandas as pd

np.random.seed(42)  # reproducibility
n = 1000  # number of observations

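# Lightning is exogenous and drives every other variable
lightning = np.random.normal(loc=5, scale=2, size=n)
# Lightning frightens bears away
bears = 50 - 2*lightning + np.random.normal(0, 3, n)
# Lightning and bears both reduce the deer population
deer = 100 - 1.5*lightning - 0.5*bears + np.random.normal(0, 5, n)
# Lightning helps flowers grow; deer eat them
flowers = 200 + 3*lightning - 2*deer + np.random.normal(0, 10, n)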

df = pd.DataFrame({
    "lightning": lightning,
    "bears": bears,
    "deer": deer,
    "flowers": flowers
})

print(df.head())

   lightning      bears       deer     flowers
0   5.993428  42.211210  66.528361   65.845487
1   4.723471  43.326958  70.528720   64.509123
2   6.295377  37.588137  67.800766   79.148543
3   8.046060  31.967070  70.407568  102.199920
4   4.531693  43.031283  62.218745   94.723121
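
One backdoor path between deer and flowers runs through lightning: deer <- lightning -> flowers. A longer one is deer <- bears <- lightning -> flowers. Conditioning on lightning closes both. The sketch below is an illustration rather than part of the original notebook: it regenerates data from the same structural equations (with its own seed, a larger `n`, and a hypothetical helper `ols_coefs`) and compares the deer coefficient with and without lightning as a control. The true coefficient in the simulation is -2.

```python
import numpy as np

rng = np.random.default_rng(0)  # separate seed; an assumption of this sketch
n = 5000

# Same structural equations as the simulation above
lightning = rng.normal(5, 2, n)
bears = 50 - 2 * lightning + rng.normal(0, 3, n)
deer = 100 - 1.5 * lightning - 0.5 * bears + rng.normal(0, 5, n)
flowers = 200 + 3 * lightning - 2 * deer + rng.normal(0, 10, n)

def ols_coefs(y, *regressors):
    """Hypothetical helper: least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(y)), *regressors])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs

naive = ols_coefs(flowers, deer)[1]                # backdoor path open
adjusted = ols_coefs(flowers, deer, lightning)[1]  # backdoor path closed

print(f"deer coefficient, lightning omitted:  {naive:.3f}")
print(f"deer coefficient, lightning included: {adjusted:.3f}")
```

Because lightning lowers the deer population and raises the flower population, leaving it out makes the effect of deer on flowers look more negative than -2.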


## Week 6

1. What is a potential problem with computing the Marginal Treatment Effect simply by comparing each untreated item to its counterfactual and taking the maximum difference?  (Hint: think of statistics here.  Consider that only the most extreme item ends up being used to estimate the MTE.  That's not necessarily a bad thing; the MTE is supposed to come from the untreated item that will produce the maximum effect.  But there is nevertheless a problem.)
Possible answer: We are likely to find the item with the most extreme difference, which may be high simply due to randomness.
(Please explain / justify this answer, or give a different one if you can think of one.)

2. Propose a solution that remedies this problem and write some code that implements your solution.  It's very important here that you clearly explain what your solution will do.
Possible answer: maybe we could take the 90th percentile of the treatment effect and use it as a proxy for the Marginal Treatment Effect.
(Either code this answer or choose a different one.)
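
To see why the maximum is a fragile estimator, consider a case where every item has the same true effect and the estimated differences vary only because of noise. The sketch below uses made-up numbers (a true effect of 1.0 and standard normal noise are both assumptions): the average maximum keeps growing with the sample size, while the 90th percentile stays roughly where it is.

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect = 1.0   # assumed: identical for every item
n_reps = 200        # repeated samples per sample size

def mean_max_and_q90(n):
    """Average the sample maximum and 90th percentile over n_reps draws."""
    # Each row is one simulated study with n noisy effect estimates
    estimates = true_effect + rng.normal(0, 1, size=(n_reps, n))
    avg_max = estimates.max(axis=1).mean()
    avg_q90 = np.quantile(estimates, 0.90, axis=1).mean()
    return avg_max, avg_q90

results = {n: mean_max_and_q90(n) for n in (100, 10_000)}
for n, (avg_max, avg_q90) in results.items():
    print(f"n={n:>6}: mean max = {avg_max:.2f}, mean 90th percentile = {avg_q90:.2f}")
```

The maximum mostly measures how lucky the luckiest item was, so it depends on how many items were examined. A high quantile still targets the items with the largest effects but is far less sensitive to a single extreme draw.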


In [9]:
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

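def untreated_effects(df):
    # Estimate a treatment effect for each untreated item by matching it
    # to the treated item with the closest Z (1-nearest-neighbor matching)
    treated = df[df['X'] == 1].reset_index(drop=True)
    untreated = df[df['X'] == 0].reset_index(drop=True)

    nn = NearestNeighbors(n_neighbors=1)
    nn.fit(treated[['Z']])
    distances, indices = nn.kneighbors(untreated[['Z']])

    matched_treated = treated.iloc[indices.flatten()].reset_index(drop=True)
    te_untreated = matched_treated['Y'].values - untreated['Y'].values  # TE for untreated items
    return te_untreated

def mte_quantile(te_array, q=0.90):
    # Use a high quantile instead of the maximum so that a single
    # extreme (possibly noisy) item does not determine the estimate
    return np.quantile(te_array, q)

def bootstrap_ci(te_array, q=0.90, n_boot=1000, seed=42):
    # Resample the effects with replacement and recompute the quantile
    # each time to see how much the estimate varies
    rng = np.random.default_rng(seed)
    n = len(te_array)
    qs = []
    for _ in range(n_boot):
        sample = te_array[rng.integers(0, n, size=n)]
        qs.append(np.quantile(sample, q))
    # Central 90% interval of the bootstrap distribution
    lower, upper = np.quantile(qs, [0.05, 0.95])
    return lower, upper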

# --- Run estimation ---
df = pd.read_csv("homework_6.1.csv")
te_untreated = untreated_effects(df)
mte_90 = mte_quantile(te_untreated, q=0.90)
ci_low, ci_high = bootstrap_ci(te_untreated, q=0.90, n_boot=1000, seed=42)

print(f"MTE proxy (90th percentile): {mte_90:.4f}")
print(f"Bootstrap 90th percentile CI (90% central): [{ci_low:.4f}, {ci_high:.4f}]")

MTE proxy (90th percentile): 1.9280
Bootstrap 90th percentile CI (90% central): [1.8962, 1.9566]


## Week 7 

1. Create a linear regression model involving a confounder that is left out of the model.  Show whether the true correlation between X and Y is overestimated, underestimated, or neither.  Explain in words why this is the case for the given coefficients you have chosen.

2. Perform a linear regression analysis in which one of the coefficients is zero, e.g.
    - W = [noise]
    - X = [noise]
    - Y = 2 * X + [noise]
    - Compute the p-value of a coefficient, in this case the coefficient of W.
    - (This is the probability of obtaining an estimated coefficient at least as extreme as the one observed, given that the actual coefficient is zero.)
    - If the p-value is less than 0.05, this ordinarily means that we judge the coefficient to be nonzero (incorrectly, in this case).
    - Run the analysis 1000 times and report the best (smallest) p-value.
    - If the p-value is less than 0.05, does this mean the coefficient actually is nonzero? What is the problem with repeating the analysis?


In [12]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

np.random.seed(42)

# Generate data
n = 10000
Z = np.random.normal(0, 1, n)
u = np.random.normal(0, 1, n)
X = 1.5 * Z + u
error = np.random.normal(0, 1, n)
Y = 2 * X + 3 * Z + error

df = pd.DataFrame({'Y': Y, 'X': X, 'Z': Z})

# Regression omitting Z
X_only = sm.add_constant(df[['X']])
model_omit = sm.OLS(df['Y'], X_only).fit()

# Regression including Z
XZ = sm.add_constant(df[['X', 'Z']])
model_full = sm.OLS(df['Y'], XZ).fit()

print("Omitting Z (biased):", model_omit.params)
print("Including Z (unbiased):", model_full.params)

Omitting Z (biased): const   -0.033439
X        3.410282
dtype: float64
Including Z (unbiased): const   -0.012768
X        2.026782
Z        2.986459
dtype: float64
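
The size of the bias follows from the omitted-variable formula: with Z left out, the coefficient on X converges to `beta_X + beta_Z * Cov(X, Z) / Var(X)`. With the coefficients chosen above, Cov(X, Z) = 1.5 and Var(X) = 1.5^2 + 1 = 3.25, so the expected estimate is 2 + 3 * 1.5 / 3.25, about 3.38, which is close to the 3.41 in the output. The effect of X is overestimated because Z raises both X and Y. A minimal check of this arithmetic, using plain NumPy instead of statsmodels and a fresh seed (both are assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
beta_x, beta_z, gamma = 2.0, 3.0, 1.5   # same coefficients as the cell above

Z = rng.normal(0, 1, n)
X = gamma * Z + rng.normal(0, 1, n)
Y = beta_x * X + beta_z * Z + rng.normal(0, 1, n)

# Slope of Y on X alone (simple regression): Cov(X, Y) / Var(X)
slope_omit = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)

# Omitted-variable formula: beta_x + beta_z * Cov(X, Z) / Var(X)
cov_xz = gamma * 1.0          # Cov(X, Z) = gamma * Var(Z)
var_x = gamma**2 * 1.0 + 1.0  # Var(X) = gamma^2 * Var(Z) + Var(u)
expected = beta_x + beta_z * cov_xz / var_x

print(f"simulated slope with Z omitted: {slope_omit:.3f}")
print(f"slope predicted by the formula:  {expected:.3f}")
```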


## Week 7 (continued)

- Perform a linear regression analysis in which one of the coefficients is zero, e.g.
    - W = [noise]
    - X = [noise]
    - Y = 2 * X + [noise]
    - Compute the p-value of a coefficient, in this case the coefficient of W.
    - (This is the probability of obtaining an estimated coefficient at least as extreme as the one observed, given that the actual coefficient is zero.)
    - If the p-value is less than 0.05, this ordinarily means that we judge the coefficient to be nonzero (incorrectly, in this case).
    - Run the analysis 1000 times and report the best (smallest) p-value.
    - If the p-value is less than 0.05, does this mean the coefficient actually is nonzero? What is the problem with repeating the analysis?


In [13]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

np.random.seed(42)

def run_once(n=500):
    W = np.random.normal(0, 1, n)
    X = np.random.normal(0, 1, n)
    eps = np.random.normal(0, 1, n)
    Y = 2 * X + eps  # W does not affect Y

    df = pd.DataFrame({'Y': Y, 'X': X, 'W': W})
    XW = sm.add_constant(df[['X', 'W']])
    model = sm.OLS(df['Y'], XW).fit()
    p_w = model.pvalues['W']  # p-value for coefficient on W
    return p_w

# Run 1000 simulations and record the smallest p-value
p_values = [run_once(n=500) for _ in range(1000)]
min_p = np.min(p_values)

print(f"Smallest p-value across 1000 runs: {min_p:.6f}")
# Optional: how many runs are "significant" at 0.05
sig_count = np.sum(np.array(p_values) < 0.05)
print(f"Number of runs with p < 0.05: {sig_count} (out of 1000)")

Smallest p-value across 1000 runs: 0.000245
Number of runs with p < 0.05: 55 (out of 1000)
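
A p-value below 0.05 somewhere among 1000 runs does not mean the coefficient is nonzero. When the true coefficient is zero, each run has a 5% chance of falling below 0.05, so about 50 of 1000 runs are expected to do so (55 did here), and the smallest of 1000 p-values is almost certain to be small. The arithmetic can be sketched as follows. The Bonferroni correction shown is one common remedy and is not something the assignment asks for.

```python
alpha = 0.05
n_runs = 1000
min_p = 0.000245   # smallest p-value reported in the output above

# Expected number of false positives when the true coefficient is zero
expected_false_positives = alpha * n_runs

# Probability that at least one independent run has p < alpha by chance alone
prob_at_least_one = 1 - (1 - alpha) ** n_runs

# Bonferroni-corrected threshold for the smallest of n_runs p-values
bonferroni_alpha = alpha / n_runs

print(f"Expected false positives: {expected_false_positives:.0f}")
print(f"P(at least one p < {alpha}): {prob_at_least_one:.6f}")
print(f"Bonferroni threshold: {bonferroni_alpha:.6f}")
print(f"Smallest p-value passes the corrected threshold: {min_p < bonferroni_alpha}")
```

The smallest p-value, 0.000245, is above 0.05 / 1000 = 0.00005, so it would not count as significant after the correction.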


## Week 8

I just copy-pasted all my code from my Homework Notebook here.

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn.linear_model import LogisticRegression
from scipy.spatial.distance import mahalanobis

In [None]:
df1 = pd.read_csv('homework_8.1.csv', index_col=0)
df2 = pd.read_csv('homework_8.2.csv', index_col=0)

In [None]:
# Estimate propensity scores P(X=1 | Z)
logit = LogisticRegression(solver='liblinear')
logit.fit(df1[['Z']], df1['X'])
df1['propensity'] = logit.predict_proba(df1[['Z']])[:, 1]

# Compute the inverse probability weights
df1['weight'] = np.where(df1['X'] == 1,
                         1 / df1['propensity'],
                         1 / (1 - df1['propensity']))

# Weighted means of Y for treated and control groups
treated_mean = np.average(df1.loc[df1['X'] == 1, 'Y'], weights=df1.loc[df1['X'] == 1, 'weight'])
control_mean = np.average(df1.loc[df1['X'] == 0, 'Y'], weights=df1.loc[df1['X'] == 0, 'weight'])

# ATE
ate = treated_mean - control_mean
print(f'Estimated ATE using IPW: {ate}')

Estimated ATE using IPW: 2.2743275711428588


In [None]:
print("Propensity scores of the first three items:")
print(df1['propensity'].head(3).values)


Propensity scores of the first three items:
[0.84011371 0.58464597 0.71108245]


In [None]:
# Split into treated and control groups
treated = df2[df2['X'] == 1].reset_index(drop=True)
control = df2[df2['X'] == 0].reset_index(drop=True)

# Build the covariance matrix for Mahalanobis distance
Z = df2[['Z1', 'Z2']].values
cov = np.cov(Z.T)
inv_cov = np.linalg.inv(cov)

# Find the nearest control for each treated unit
matches = []

for i in range(len(treated)):
    z_t = treated.loc[i, ['Z1', 'Z2']].values
    dists = control[['Z1', 'Z2']].apply(lambda row: mahalanobis(z_t, row.values, inv_cov), axis=1)
    j = dists.idxmin()
    matched_control = control.loc[j]
    matches.append({'treated_Y': treated.loc[i, 'Y'],
                    'control_Y': matched_control['Y'],
                    'diff': treated.loc[i, 'Y'] - matched_control['Y']})
    
matches_df = pd.DataFrame(matches)

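# Each treated unit is matched to a control, so this averages effects over
# the treated units only (strictly an ATT; it equals the ATE only if the
# effect is the same in both groups)
ate = matches_df['diff'].mean()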
print(f'Estimated ATE using Nearest Neighbor Matching: {ate}')

Estimated ATE using Nearest Neighbor Matching: 3.4376789979126094
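
The row-by-row `apply` in the loop above computes distances one pair at a time, which gets slow as the data grows. A vectorized alternative is sketched below on made-up covariates (the arrays `treated_Z` and `control_Z` are hypothetical stand-ins for the `Z1` and `Z2` columns). It computes all treated-to-control Mahalanobis distances at once with NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical covariates: 5 treated and 8 control units, 2 covariates each
treated_Z = rng.normal(0.5, 1, size=(5, 2))
control_Z = rng.normal(0.0, 1, size=(8, 2))

# Inverse covariance of the pooled covariates, as in the cell above
inv_cov = np.linalg.inv(np.cov(np.vstack([treated_Z, control_Z]).T))

# diffs[i, j] is the covariate difference between treated i and control j
diffs = treated_Z[:, None, :] - control_Z[None, :, :]

# Mahalanobis distance sqrt(d^T @ inv_cov @ d) for every (i, j) pair
dists = np.sqrt(np.einsum('ijk,kl,ijl->ij', diffs, inv_cov, diffs))

nearest_control = dists.argmin(axis=1)  # closest control for each treated unit
min_distance = dists.min(axis=1)        # distance to that control

print(nearest_control)
print(min_distance)
```

The `dists` matrix has one row per treated unit and one column per control unit, so the same array also gives the largest minimum distance needed for a common-support check.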


In [None]:
import pandas as pd
import numpy as np
from scipy.spatial.distance import mahalanobis

# Load your dataset
df2 = pd.read_csv("homework_8.2.csv")

# Split into treated and control groups
treated = df2[df2['X'] == 1].reset_index(drop=True)
control = df2[df2['X'] == 0].reset_index(drop=True)

# Build the covariate matrix (Z1, Z2)
Z = df2[['Z1', 'Z2']].values
cov = np.cov(Z.T)
inv_cov = np.linalg.inv(cov)

# For each treated unit, find its nearest control and record the distance
results = []
for i in range(len(treated)):
    z_t = treated.loc[i, ['Z1', 'Z2']].values
    dists = control[['Z1', 'Z2']].apply(
        lambda row: mahalanobis(z_t, row.values, inv_cov), axis=1
    )
    j = dists.idxmin()
    results.append({
        'treated_index': i,
        'treated_Z1': treated.loc[i, 'Z1'],
        'treated_Z2': treated.loc[i, 'Z2'],
        'control_index': j,
        'control_Z1': control.loc[j, 'Z1'],
        'control_Z2': control.loc[j, 'Z2'],
        'min_distance': dists.min()
    })

results_df = pd.DataFrame(results)

# Find the treated unit with the *largest* minimum distance (least common support)
worst_match = results_df.loc[results_df['min_distance'].idxmax()]

print("Treated unit with least common support:")
print(f"  Z1 = {worst_match['treated_Z1']}, Z2 = {worst_match['treated_Z2']}")
print("Nearest control unit:")
print(f"  Z1 = {worst_match['control_Z1']}, Z2 = {worst_match['control_Z2']}")
print(f"Mahalanobis distance = {worst_match['min_distance']}")

Treated unit with least common support:
  Z1 = 2.69622405256358, Z2 = 0.5381554886023228
Nearest control unit:
  Z1 = 1.5199948607657727, Z2 = -1.2822079376259403
Mahalanobis distance = 1.3830045328325056
