<a href="https://colab.research.google.com/github/vijaygwu/causal/blob/main/OLS2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This code demonstrates a simple linear regression analysis with a small simulated dataset. Here's a breakdown:

1. **Library Imports**:
   - `numpy`: For numerical operations
   - `pandas`: For data manipulation and analysis
   - `statsmodels`: For statistical modeling (both the general API and formula API)
   - `itertools.combinations`: For generating combinations (though not used in this code)
   - `plotnine`: For data visualization (though not used in this code)

2. **SSL Configuration**:
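   - Replaces the default HTTPS context with an unverified one, which disables SSL certificate verification for the rest of the session
   - This is a workaround for certificate errors when downloading data; it has no effect on the analysis here because nothing is downloaded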

3. **Data Loading Function**:
   - Defines a function `read_data` to load Stata files from a GitHub repository
   - Though defined, this function isn't actually called in the code

4. **Data Generation**:
   - Creates a small DataFrame with only 10 observations
   - Generates random values from normal distributions for:
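     - `x`: standard normal draws multiplied by 9, giving a standard deviation of 9 (variance 81)
     - `u`: standard normal draws multiplied by 36, giving a standard deviation of 36 (variance 1,296)
   - Creates a dependent variable `y` using the formula: y = 3x + 2u
   - This simulates a relationship where `y` depends on `x` with slope 3 and no intercept, and where 2u (standard deviation 72) is the error term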

5. **Regression Analysis**:
   - Fits an Ordinary Least Squares (OLS) regression model with `y` as the dependent variable and `x` as the independent variable
   - The data-generating process is y = 3x + 2u, so in the fitted model y = β₀ + β₁x + e the true values are β₀ = 0 and β₁ = 3, and 2u plays the role of the unobserved error e

6. **Model Outputs**:
   - Adds predicted values to the DataFrame as `yhat1`
   - Adds residuals (the difference between actual and predicted y values) as `uhat1`

7. **Summary Statistics**:
   - Displays descriptive statistics for all columns with `tb.describe()`
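
Steps 4 through 6 can be sketched with NumPy alone, which makes the arithmetic behind the fitted values and residuals explicit. This is a minimal sketch, not the notebook's own code: the seeded generator `rng` and the closed-form slope and intercept are assumptions introduced here for illustration, whereas the notebook uses unseeded `np.random.normal` and `sm.OLS.from_formula`.

```python
import numpy as np

# Assumption: a seeded generator so the sketch is reproducible.
# The notebook itself draws without a seed.
rng = np.random.default_rng(0)
n = 10

x = 9 * rng.normal(size=n)    # regressor, sd 9
u = 36 * rng.normal(size=n)   # unobserved term, sd 36
y = 3 * x + 2 * u             # data-generating process

# Closed-form OLS for a single regressor with an intercept
x_centered = x - x.mean()
slope = (x_centered @ (y - y.mean())) / (x_centered @ x_centered)
intercept = y.mean() - slope * x.mean()

yhat = intercept + slope * x  # fitted values (yhat1 in the notebook)
uhat = y - yhat               # residuals (uhat1 in the notebook)

print("slope:", slope, "intercept:", intercept)
print("sum of residuals:", uhat.sum())            # zero up to rounding
print("residuals dotted with x:", uhat @ x)       # zero up to rounding
```

The last two printed values are zero up to floating-point error by construction of OLS with an intercept, which is why the mean of `uhat1` in the summary output is on the order of 1e-14 rather than exactly zero.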

The key thing to note is that this example does not exhibit omitted variable bias, even though `u` enters the data-generating process and is left out of the regression. Omitted variable bias arises only when the omitted term is correlated with an included regressor. Here `u` is drawn independently of `x`, so it acts as an ordinary error term: the OLS estimator of the coefficient on `x` is unbiased, but it is imprecise because the `2u` term contributes far more variance to `y` (5,184) than the `3x` term does (729).
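
The role of independence can be checked with a small Monte Carlo experiment. The sketch below repeats the 10-observation simulation many times and averages the estimated slopes. The second case, in which `u` has correlation 0.5 with `x`, is a hypothetical contrast that does not appear in the notebook; it is included only to show what the bias looks like when independence fails. The helper `ols_slopes`, the seed, and the replication count are likewise assumptions.

```python
import numpy as np

def ols_slopes(x, y):
    """Closed-form simple-regression slope for each row of x and y."""
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    return (xc * yc).sum(axis=1) / (xc ** 2).sum(axis=1)

rng = np.random.default_rng(42)  # assumption: fixed seed for reproducibility
reps, n = 5000, 10               # assumption: 5000 replications of the n=10 design

x = 9 * rng.normal(size=(reps, n))
noise = rng.normal(size=(reps, n))

# Case 1: u independent of x, as in the notebook
u_indep = 36 * noise
slopes_indep = ols_slopes(x, 3 * x + 2 * u_indep)

# Case 2 (hypothetical contrast): u still has sd 36 but correlation 0.5 with x
u_corr = 36 * (0.5 * x / 9 + np.sqrt(0.75) * noise)
slopes_corr = ols_slopes(x, 3 * x + 2 * u_corr)

print("independent u: mean slope", slopes_indep.mean(), "sd", slopes_indep.std())
print("correlated u:  mean slope", slopes_corr.mean(), "sd", slopes_corr.std())
```

In the independent case the average slope should sit close to 3 while individual estimates scatter widely. In the correlated case `u` equals 2x plus independent noise, so the average slope is pulled toward 3 + 2·2 = 7.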

With only 10 observations and an error term this noisy, the estimated coefficient on `x` can land far from the true value of 3. Because no random seed is set, the estimate also changes every time the code is run.
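
How much the sample size matters can be illustrated by simulation. The sketch below measures the spread of the slope estimate across repeated samples for a few sample sizes. The sizes 100 and 1000, the seed, the replication count, and the helper `slope_sd` are assumptions chosen for illustration; the notebook only uses n = 10.

```python
import numpy as np

rng = np.random.default_rng(7)  # assumption: fixed seed for reproducibility
reps = 2000                     # assumption: replications per sample size

def slope_sd(n):
    """Standard deviation of the OLS slope across `reps` simulated samples of size n."""
    x = 9 * rng.normal(size=(reps, n))
    u = 36 * rng.normal(size=(reps, n))
    y = 3 * x + 2 * u
    xc = x - x.mean(axis=1, keepdims=True)
    # Centering y is unnecessary here because each row of xc sums to zero
    slopes = (xc * y).sum(axis=1) / (xc ** 2).sum(axis=1)
    return slopes.std()

sd_by_n = {n: slope_sd(n) for n in (10, 100, 1000)}
for n, sd in sd_by_n.items():
    print(f"n={n:5d}  sd of slope estimate: {sd:.3f}")
```

The spread should shrink roughly in proportion to the square root of the sample size, so an estimate that is off by several units at n = 10 becomes unlikely at n = 1000.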

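```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from itertools import combinations
import plotnine as p

# read data
import ssl
# Disable HTTPS certificate verification (workaround for certificate errors)
ssl._create_default_https_context = ssl._create_unverified_context
def read_data(file):
    """Load a Stata file from the mixtape GitHub repository (not called below)."""
    return pd.read_stata("https://github.com/scunning1975/mixtape/raw/master/" + file)


# Simulate 10 observations: x has sd 9, u has sd 36
tb = pd.DataFrame({
    'x': 9*np.random.normal(size=10),
    'u': 36*np.random.normal(size=10)})
# Data-generating process: slope 3 on x, error term 2u
tb['y'] = 3*tb['x'].values + 2*tb['u'].values

# Fit y = b0 + b1*x by OLS (the formula adds the intercept automatically)
reg_tb = sm.OLS.from_formula('y ~ x', data=tb).fit()

tb['yhat1'] = reg_tb.predict(tb)  # fitted values
tb['uhat1'] = reg_tb.resid        # residuals

tb.describe()
```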

|       | x | u | y | yhat1 | uhat1 |
|-------|---|---|---|-------|-------|
| count | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |
| mean  | -3.547534 | 15.891298 | 21.139994 | 21.139994 | 9.947598e-15 |
| std   | 7.389319 | 37.074314 | 79.493196 | 29.605061 | 73.77471 |
| min   | -18.264138 | -23.87783 | -102.548074 | -37.821595 | -75.14413 |
| 25%   | -5.80847 | -16.454372 | -33.85976 | 12.081628 | -62.68513 |
| 50%   | -2.406128 | 9.197679 | 5.619352 | 25.713 | -20.48416 |
| 75%   | -1.281618 | 45.591828 | 79.30296 | 30.218312 | 61.47343 |
| max   | 8.222042 | 82.522376 | 155.414603 | 68.294413 | 132.9225 |
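
The summary above does not print the regression coefficients, but they can be recovered from it, because the fitted values are a linear function of `x`. The sketch below does that arithmetic using numbers copied from the table. It assumes the slope is positive, which the agreement between the two slope calculations supports. The constant 72 is the true error standard deviation implied by y = 3x + 2u, not an estimate from the data.

```python
import math

# Values copied from the describe() output above
n = 10
x_mean, x_std, x_min, x_max = -3.547534, 7.389319, -18.264138, 8.222042
y_mean, y_std = 21.139994, 79.493196
yhat_std, yhat_min, yhat_max = 29.605061, -37.821595, 68.294413

# yhat = b0 + b1*x, so with a positive slope the range of yhat is b1 times the range of x
slope = (yhat_max - yhat_min) / (x_max - x_min)
# Cross-check: the sd of yhat is |b1| times the sd of x
slope_abs_from_std = yhat_std / x_std
# The OLS line passes through the point of means
intercept = y_mean - slope * x_mean
# Share of the sample variance of y captured by the fitted values
r_squared = (yhat_std / y_std) ** 2

# Sampling sd of the slope given these x values, using the true error sd of 72
sxx = (n - 1) * x_std ** 2           # describe() reports the sample sd (ddof=1)
slope_sd = 72 / math.sqrt(sxx)

print(f"slope {slope:.3f} (check {slope_abs_from_std:.3f}), intercept {intercept:.2f}")
print(f"R-squared {r_squared:.3f}, sd of slope given x {slope_sd:.2f}")
```

For this run the implied slope is about 4.0 and the implied intercept about 35.4, against true values of 3 and 0. The slope misses by roughly one unit while its sampling standard deviation is about 3.2, so a gap of this size is unremarkable. The fitted values account for only about 14% of the variance in `y`.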
