# Linear Regression via Ordinary Least Squares (OLS)

The following thought experiment underlies the standard linear regression model:
1. For each case $i$, Mother Nature establishes the values of each of the predictors
2. Mother Nature then multiplies each predictor by its corresponding coefficient
3. Mother Nature sums the products from (2) and adds the value of the constant
4. Mother Nature then adds a random perturbation

The goal of the data analyst is to back out the coefficients that Mother Nature used to generate the data. OLS is the estimator used to accomplish this task.
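The four steps of the thought experiment can be sketched directly in code. This is a minimal illustration, not part of the original notebook: the coefficient values, the seed, and the sample size are arbitrary assumptions, and the estimate is obtained with NumPy's least-squares solver rather than a regression library.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, fixed for reproducibility

n = 5_000
# Step 1: the predictor values are established for each case
x = rng.normal(size=(n, 2))
# Steps 2-3: multiply by the (hypothetical) coefficients, sum, add the constant
true_constant = 2.0
true_coefs = np.array([1.5, -0.5])
signal = true_constant + x @ true_coefs
# Step 4: add a random perturbation with expectation 0
y = signal + rng.normal(scale=1.0, size=n)

# The analyst's side: prepend a column of ones for the constant and
# solve the least-squares problem to back out the coefficients
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # estimates of [constant, coef_1, coef_2]
```

With a sample this large the printed estimates should sit close to the values chosen in steps 2-3, though they will never match them exactly because of the perturbation.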

First Order Conditions

- The predictors used by Mother Nature are known
- The transformations used by Mother Nature are known
- The predictors are known to be combined in a linear fashion
- The predictors are available in the data

Second Order Conditions

- Each perturbation is realized independently of all other perturbations, and all come from the same distribution with expectation 0: $$\epsilon_i \sim NIID(0, \sigma^2)$$

The distribution of the estimated coefficients is approximately normal when the sample size is large. How large is large, though?
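One way to get a feel for "how large is large" is a small simulation. The sketch below is an illustration under stated assumptions: it deliberately uses skewed (centered exponential) perturbations instead of normal ones, an unbalanced design with 20% of cases treated, and the helper names `treatment_coef_draws` and `skewness` are hypothetical. It reports how the skewness of the estimated treatment coefficient, which is 0 for a normal distribution, changes with the sample size.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

def treatment_coef_draws(sample_size, n_reps=2000, treated_share=0.2):
    """Simulate n_reps data sets and return the OLS treatment coefficient from each."""
    n_treated = int(round(treated_share * sample_size))
    treated = np.zeros(sample_size)
    treated[:n_treated] = 1.0
    X = np.column_stack([np.ones(sample_size), treated])
    # Centered exponential perturbations: expectation 0, but strongly right-skewed
    eps = rng.exponential(scale=1.0, size=(sample_size, n_reps)) - 1.0
    Y = 75.0 + 10.0 * treated[:, None] + eps
    # One least-squares solve handles all replications (one column of Y each)
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta_hat[1]

def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    centered = values - values.mean()
    return np.mean(centered**3) / np.mean(centered**2) ** 1.5

results = {}
for sample_size in (10, 50, 200, 1000):
    draws = treatment_coef_draws(sample_size)
    results[sample_size] = (draws.mean(), draws.std(), skewness(draws))
    print(sample_size, [round(v, 3) for v in results[sample_size]])
```

Under these assumptions the skewness should shrink toward 0 as the sample size grows, which is the sense in which the normal approximation improves. How quickly it does so depends on how far the perturbations are from normal and on how unbalanced the design is.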

## Treatment Only

In [None]:
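import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from scipy.stats import ttest_ind
from sklearn.datasets import make_spd_matrix

# Simulate test scores: baseline of 75, treatment effect of 10, noise with sd 5
sample_size = 1000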
treatment = np.random.choice([0, 1], sample_size)
test_scores = 75 + 10*treatment + np.random.normal(scale=5.0, size=sample_size)

treatment_only_df = pd.DataFrame()
treatment_only_df["treated"] = treatment
treatment_only_df["score"] = test_scores

fig, ax = plt.subplots()
# histplot/rugplot replace distplot, which is deprecated in recent seaborn
sns.histplot(test_scores, stat="density", ax=ax)
sns.rugplot(test_scores, ax=ax)
ax.axvline(np.mean(test_scores), 0, 1, color="r")
ax.set_xlim(50, 100)
ax.set_title("Distribution of Test Scores")
ax.set_xlabel("Score")
ax.set_ylabel("Density")

In [None]:
# Use the DataFrame column name ("treated") rather than the global array
model = smf.ols(formula="score ~ 1 + treated", data=treatment_only_df)
result = model.fit(use_t=True)
result.summary()

In [None]:
treated = test_scores[np.where(treatment == 1)]
untreated = test_scores[np.where(treatment == 0)]

t_stat, p_val = ttest_ind(treated, untreated)

fig, ax = plt.subplots()
# histplot replaces distplot, which is deprecated in recent seaborn
sns.histplot(treated, label="Treated", ax=ax)
sns.histplot(untreated, label="Non-Treated", ax=ax)
ax.axvline(np.mean(treated), 0, 1, color="black")
ax.axvline(np.mean(untreated), 0, 1, color="black")
ax.set_xlim(50, 100)
ax.set_title("Distribution of Test Scores: t = {} ({})".format(round(t_stat, 3), round(p_val, 3)))
ax.set_xlabel("Score")
ax.set_ylabel("Count")
ax.legend()
ax.annotate('{}'.format(round(np.mean(treated) - np.mean(untreated), 2)),
            xy=(np.mean(untreated),50), xytext=(np.mean(treated) + 0.4,49.4), 
            arrowprops=dict(arrowstyle='<->', color="black"))

- $H_0$: The mean test scores for treated and non-treated individuals are the same
- $H_A$: The mean test scores for treated and non-treated individuals are different
- The test statistic from a two-sample t-test of the means (assuming equal variances) is identical to the t-statistic reported for the treatment coefficient in the summary of a regression containing only treatment and an intercept
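The equivalence in the last bullet can be checked by computing both statistics by hand. The sketch below is a self-contained illustration using only NumPy; the seed and sample size are arbitrary assumptions, and the two-sample statistic uses the pooled-variance (equal variances) form.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed

n = 200
treated = rng.integers(0, 2, size=n)
score = 75 + 10 * treated + rng.normal(scale=5.0, size=n)

# Two-sample t statistic with pooled variance (equal variances assumed)
y1, y0 = score[treated == 1], score[treated == 0]
n1, n0 = len(y1), len(y0)
pooled_var = ((n1 - 1) * y1.var(ddof=1) + (n0 - 1) * y0.var(ddof=1)) / (n1 + n0 - 2)
t_two_sample = (y1.mean() - y0.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n0))

# OLS t statistic for the treatment coefficient
X = np.column_stack([np.ones(n), treated])
beta_hat = np.linalg.solve(X.T @ X, X.T @ score)
residuals = score - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - X.shape[1])  # residual variance, n - 2 df
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)
t_ols = beta_hat[1] / np.sqrt(cov_beta[1, 1])

print(t_two_sample, t_ols)
```

The two printed values agree up to floating-point error, and the treatment coefficient itself equals the difference between the two group means. The agreement relies on the pooled-variance form of the test; a test that allows unequal variances would generally give a different statistic.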

## Omitted Variables

In [None]:
sample_size = 1000
treatment = np.random.choice([0, 1], sample_size)

random_cov = make_spd_matrix(2)
random_a, random_b = np.random.multivariate_normal([0, 0], random_cov, sample_size).T

test_scores = 75 + 5*treatment + random_a + random_b + np.random.normal(scale=5.0, size=sample_size)

ovb_df = pd.DataFrame()
ovb_df["treatment"] = treatment
ovb_df["a"] = random_a
ovb_df["b"] = random_b
ovb_df["score"] = test_scores

fig, ax = plt.subplots()
# histplot/rugplot replace distplot, which is deprecated in recent seaborn
sns.histplot(test_scores, stat="density", ax=ax)
sns.rugplot(test_scores, ax=ax)
ax.axvline(np.mean(test_scores), 0, 1, color="r")
ax.set_xlim(50, 100)
ax.set_title("Distribution of Test Scores")
ax.set_xlabel("Score")
ax.set_ylabel("Density")

In [None]:
# Omit whichever covariate has the larger variance; refer to DataFrame columns by name
if random_cov[0, 0] > random_cov[1, 1]:
    formula = "score ~ 1 + treatment + b"
else:
    formula = "score ~ 1 + treatment + a"

model = smf.ols(formula=formula, data=ovb_df)
result = model.fit(use_t=True)
result.summary()

In [None]:
# Full model: both covariates included, referenced by their DataFrame column names
model = smf.ols(formula="score ~ 1 + treatment + a + b", data=ovb_df)
result = model.fit(use_t=True)
result.summary()
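The effect of the omission can be made concrete with a fixed covariance matrix. The sketch below is an illustration, not the notebook's own analysis: the covariance values, seed, sample size, and the helper `ols` are assumptions. Because treatment is assigned independently of `a` and `b` in this setup, the omission mainly shows up in the coefficient on the included covariate, which absorbs part of the omitted one through their covariance.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed

n = 20_000
cov = np.array([[1.0, 0.6],
                [0.6, 2.0]])  # hypothetical covariance of (a, b)
a, b = rng.multivariate_normal([0, 0], cov, size=n).T
treatment = rng.integers(0, 2, size=n)
score = 75 + 5 * treatment + a + b + rng.normal(scale=5.0, size=n)

def ols(y, *columns):
    """Return OLS coefficients [constant, *columns] via least squares."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

short = ols(score, treatment, b)    # a omitted
full = ols(score, treatment, a, b)  # both covariates included

# Omitted-variable formula: the coefficient on b picks up
# coef_a * cov(a, b) / var(b) in addition to its own value of 1
expected_b_short = 1 + 1 * cov[0, 1] / cov[1, 1]

print("short model [const, treatment, b]:  ", short.round(3))
print("full model  [const, treatment, a, b]:", full.round(3))
print("expected coefficient on b when a is omitted:", expected_b_short)
```

In the short model the coefficient on `b` should land near the value given by the formula rather than near 1, while the treatment coefficient stays near 5 in both models. If treatment were correlated with the omitted covariate, its coefficient would be affected as well.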