# Static



By now you're familiar with the typical use of linear regression: you're given a dataframe with a bunch of features and a set of observations for each feature. You then take one feature (your y) and try to predict it based on other features (your x's). 

This generally works to a certain extent. This morning, though, we're going to set up a rather strange problem for linear regression. We will see what happens when y and the x's are totally unrelated.

Recall that regression follows the formula 
$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \epsilon
$$ 
When we "fit" an OLS model we are giving the model all of the x's and the y and asking it to find the best betas.
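
To make "finding the best betas" concrete, here is a minimal sketch that builds data from known betas and recovers them by least squares. The sample size, the `true_betas` values, and the noise scale are made up for illustration, and `np.linalg.lstsq` stands in for whatever OLS routine you prefer.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

n = 500                                   # number of observations (illustrative)
x = rng.standard_normal((n, 2))           # two x features
true_betas = np.array([1.0, 2.0, -3.0])   # beta_0, beta_1, beta_2 (made up)

# Design matrix: a column of ones for the intercept, then the x's
design = np.column_stack([np.ones(n), x])

# y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + epsilon
epsilon = 0.1 * rng.standard_normal(n)
y = design @ true_betas + epsilon

# "Fitting" means choosing the betas that minimize the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
print(beta_hat)
```

The estimates should land close to `true_betas`, with the gap shrinking as the noise gets smaller or `n` gets larger.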

This time, however, we're going to generate data in which all the true betas are zero
$$
\beta_0=\beta_1=\dots=0
$$
so that $y$ is nothing but the noise term $\epsilon$, and then fit an OLS model to it.

Before we do this, stop and think for a second:

- What do you expect the model to do? What will the betas/r-squared/p-values that it finds look like?
> All the betas will be 0.
- What do you think the model *should* do? Is that different from what you think it *will* do?
> Straight line

![](https://upload.wikimedia.org/wikipedia/commons/5/5a/No_Signal_23.JPG)

# Part 1

Generate simulation data. We want to have 200 points (observations) for the y feature and for 20 x features. In other words, make sure `y.shape == (200,1)` and `x.shape == (200,20)`. The x's should be randomly generated independently of each other, and the y should be randomly generated independently of the x's.

Use statsmodels to fit an OLS model to your data. Are the results as you expected? Do you have any betas with a $p<0.05$? If not, re-run the model until you do.
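
A quick back-of-the-envelope calculation shows why re-running usually pays off quickly. When y is unrelated to the x's, each beta has roughly a 5% chance of landing below $p=0.05$ by luck alone. The sketch below treats the 20 tests as independent, which is only approximately true, so read the numbers as rough guides.

```python
n_features = 20   # number of x features in Part 1
alpha = 0.05      # significance threshold

# Expected number of "significant" betas when there is no real signal
expected_false_positives = n_features * alpha

# Chance that at least one beta dips below the threshold in a single run
# (assumes the tests are independent, which is an approximation)
p_at_least_one = 1 - (1 - alpha) ** n_features

print(f"Expected false positives per run: {expected_false_positives:.2f}")
print(f"P(at least one p < {alpha}): {p_at_least_one:.3f}")
```

Under these assumptions you would expect about one spurious "significant" beta per run, and a bit under two runs in three to contain at least one.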

In [44]:
# Import necessary packages
import numpy as np
import pandas as pd

## Creating the DataFrames

In [63]:
n_points = 200
n_features = 20
y = np.random.randn(n_points, 1)
x = np.random.randn(n_points, n_features)

In [45]:
# Creating y DataFrame
y = np.random.randint(low=1, high=150, size=200)
df_y = pd.DataFrame({'Y': y})
df_y.head()

| | Y |
|---|---|
| 0 | 4 |
| 1 | 80 |
| 2 | 122 |
| 3 | 109 |
| 4 | 126 |


In [57]:
df_x2 = np.random.randint(0,1000,size=(200,20))



In [46]:
# Creating x DataFrame
df_x = pd.DataFrame()
for i in range(1, 21):
    df_x[f'X{i}'] = np.random.randint(low=1, high=200, size=200)  # Alternatively, generate every column at once with size=(200, 20)
df_x

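| | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 | X13 | X14 | X15 | X16 | X17 | X18 | X19 | X20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 117 | 63 | 63 | 169 | 169 | 180 | 115 | 114 | 116 | 36 | 3 | 161 | 159 | 158 | 137 | 170 | 195 | 186 | 120 | 38 |
| 1 | 149 | 7 | 162 | 148 | 78 | 162 | 77 | 90 | 110 | 197 | 48 | 53 | 87 | 164 | 59 | 157 | 168 | 8 | 5 | 87 |
| 2 | 175 | 148 | 120 | 92 | 165 | 14 | 112 | 27 | 178 | 41 | 138 | 78 | 37 | 144 | 127 | 103 | 98 | 90 | 134 | 30 |
| 3 | 53 | 164 | 128 | 61 | 63 | 40 | 48 | 147 | 14 | 49 | 169 | 28 | 92 | 102 | 95 | 177 | 103 | 197 | 52 | 118 |
| 4 | 163 | 109 | 85 | 188 | 67 | 143 | 56 | 182 | 169 | 65 | 10 | 141 | 195 | 108 | 146 | 139 | 65 | 22 | 104 | 52 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | 137 | 106 | 136 | 186 | 59 | 33 | 53 | 39 | 168 | 61 | 71 | 33 | 56 | 67 | 6 | 185 | 44 | 158 | 168 | 176 |
| 196 | 166 | 142 | 113 | 139 | 49 | 180 | 84 | 17 | 62 | 16 | 167 | 72 | 156 | 6 | 196 | 87 | 77 | 147 | 184 | 181 |
| 197 | 87 | 94 | 136 | 111 | 171 | 97 | 120 | 180 | 81 | 70 | 179 | 75 | 186 | 191 | 124 | 10 | 23 | 54 | 42 | 186 |
| 198 | 51 | 25 | 82 | 76 | 37 | 99 | 142 | 39 | 25 | 76 | 116 | 145 | 187 | 110 | 38 | 155 | 188 | 99 | 198 | 133 |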
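
As the comment in the cell above suggests, the whole x table can be drawn in one call instead of column by column. Here is a minimal sketch of that approach; the seeded `default_rng` generator and the `values` and `columns` names are choices made for this example rather than something the notebook uses.

```python
import numpy as np
import pandas as pd

n_points, n_features = 200, 20
rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only

# Draw every column at once; like np.random.randint, `high` is exclusive
values = rng.integers(low=1, high=200, size=(n_points, n_features))

# Name the columns X1 ... X20 to match the loop-based version
columns = [f"X{i}" for i in range(1, n_features + 1)]
df_x = pd.DataFrame(values, columns=columns)

print(df_x.shape)
print(df_x.head())
```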


In [50]:
# Combining the DataFrames
df = df_y.join(df_x, how='left')
df

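| | Y | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | ... | X11 | X12 | X13 | X14 | X15 | X16 | X17 | X18 | X19 | X20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 117 | 63 | 63 | 169 | 169 | 180 | 115 | 114 | 116 | ... | 3 | 161 | 159 | 158 | 137 | 170 | 195 | 186 | 120 | 38 |
| 1 | 80 | 149 | 7 | 162 | 148 | 78 | 162 | 77 | 90 | 110 | ... | 48 | 53 | 87 | 164 | 59 | 157 | 168 | 8 | 5 | 87 |
| 2 | 122 | 175 | 148 | 120 | 92 | 165 | 14 | 112 | 27 | 178 | ... | 138 | 78 | 37 | 144 | 127 | 103 | 98 | 90 | 134 | 30 |
| 3 | 109 | 53 | 164 | 128 | 61 | 63 | 40 | 48 | 147 | 14 | ... | 169 | 28 | 92 | 102 | 95 | 177 | 103 | 197 | 52 | 118 |
| 4 | 126 | 163 | 109 | 85 | 188 | 67 | 143 | 56 | 182 | 169 | ... | 10 | 141 | 195 | 108 | 146 | 139 | 65 | 22 | 104 | 52 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | 124 | 137 | 106 | 136 | 186 | 59 | 33 | 53 | 39 | 168 | ... | 71 | 33 | 56 | 67 | 6 | 185 | 44 | 158 | 168 | 176 |
| 196 | 129 | 166 | 142 | 113 | 139 | 49 | 180 | 84 | 17 | 62 | ... | 167 | 72 | 156 | 6 | 196 | 87 | 77 | 147 | 184 | 181 |
| 197 | 116 | 87 | 94 | 136 | 111 | 171 | 97 | 120 | 180 | 81 | ... | 179 | 75 | 186 | 191 | 124 | 10 | 23 | 54 | 42 | 186 |
| 198 | 73 | 51 | 25 | 82 | 76 | 37 | 99 | 142 | 39 | 25 | ... | 116 | 145 | 187 | 110 | 38 | 155 | 188 | 99 | 198 | 133 |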


## OLS with StatsModels

In [51]:
import statsmodels.api as sm                 # Array/DataFrame interface (sm.OLS)
import statsmodels.formula.api as smf        # Formula interface (smf.ols), an alternative to calling patsy and sm.OLS by hand
import patsy

In [53]:
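# Note: X17 and X20 are missing from this formula, so the model is fit on
# 18 of the 20 features (hence "Df Model: 18" in the summary).
y, X = patsy.dmatrices("""
Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10 + X11 + X12 + X13 + X14 + X15 + X16 + X18 + X19
""", data=df, return_type="dataframe")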


# Create your model
model = sm.OLS(y, X)

# Fit your model to the full dataset (there is no train/test split here)
fit = model.fit()

# Print summary statistics of the model's performance
fit.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Y | R-squared: | 0.096 |
| Model: | OLS | Adj. R-squared: | 0.006 |
| Method: | Least Squares | F-statistic: | 1.069 |
| Date: | Thu, 17 Oct 2019 | Prob (F-statistic): | 0.386 |
| Time: | 09:56:51 | Log-Likelihood: | -1026.8 |
| No. Observations: | 200 | AIC: | 2092.0 |
| Df Residuals: | 181 | BIC: | 2154.0 |
| Df Model: | 18 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 62.9356 | 25.452 | 2.473 | 0.014 | 12.715 | 113.157 |
| X1 | 0.1449 | 0.053 | 2.711 | 0.007 | 0.039 | 0.250 |
| X2 | 0.0660 | 0.055 | 1.190 | 0.236 | -0.043 | 0.175 |
| X3 | -0.0117 | 0.057 | -0.205 | 0.838 | -0.125 | 0.101 |
| X4 | 0.0202 | 0.054 | 0.373 | 0.710 | -0.087 | 0.127 |
| X5 | 0.0688 | 0.055 | 1.245 | 0.215 | -0.040 | 0.178 |
| X6 | -0.0080 | 0.055 | -0.145 | 0.885 | -0.116 | 0.100 |
| X7 | -0.0620 | 0.057 | -1.089 | 0.278 | -0.174 | 0.050 |
| X8 | -0.0448 | 0.055 | -0.821 | 0.412 | -0.153 | 0.063 |

| | | | |
|---|---|---|---|
| Omnibus: | 30.979 | Durbin-Watson: | 2.044 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 8.003 |
| Skew: | -0.015 | Prob(JB): | 0.0183 |
| Kurtosis: | 2.02 | Cond. No. | 3630.0 |
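
The formula in the cell above lists its terms by hand and leaves out X17 and X20. One way to avoid that kind of slip is to build the formula string from the column names. The sketch below regenerates its own random data with an arbitrary seed, so its numbers will not match the summary above; it also uses `smf.ols` in place of the `patsy.dmatrices` plus `sm.OLS` route, and pulls the p-values out of the fitted result instead of reading them off the table.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)  # arbitrary seed; results differ from the run above
n_points, n_features = 200, 20

# Rebuild a DataFrame shaped like `df`: a Y column followed by X1 ... X20
columns = [f"X{i}" for i in range(1, n_features + 1)]
df = pd.DataFrame(rng.integers(1, 200, size=(n_points, n_features)), columns=columns)
df.insert(0, "Y", rng.integers(1, 150, size=n_points))

# Build "Y ~ X1 + X2 + ... + X20" from the column names so nothing is skipped
formula = "Y ~ " + " + ".join(columns)
fit = smf.ols(formula, data=df).fit()

# Keep only the x coefficients whose p-value falls below 0.05
x_pvalues = fit.pvalues.drop("Intercept")
significant = x_pvalues[x_pvalues < 0.05]

print(fit.df_model, fit.rsquared, fit.rsquared_adj)
print(significant)
```

If `significant` comes back empty, change the seed (or drop it) and run again, as the exercise suggests.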


# Part 2

Now, automate the process! Run the above analysis but vary the number of x's from 1 to 200. Record the r-squared and adjusted r-squared for each case and plot them.
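
One possible way to set this up is sketched below. It is an outline under stated assumptions, not a reference solution: it uses standard normal data, fits with `np.linalg.lstsq` instead of statsmodels to keep the loop light, and computes adjusted r-squared as $1 - (1 - R^2)\frac{n-1}{n-k-1}$ for $k$ features. The helper names `r2_scores` and `plot_scores` are made up for this example. With 200 observations, adjusted r-squared is undefined once $n-k-1 \le 0$, so those cases are recorded as NaN.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
n_points, max_features = 200, 200
y = rng.standard_normal(n_points)
x = rng.standard_normal((n_points, max_features))

def r2_scores(y, x_sub):
    """Fit OLS with an intercept; return (r-squared, adjusted r-squared)."""
    n, k = x_sub.shape
    design = np.column_stack([np.ones(n), x_sub])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    df_resid = n - k - 1
    # Adjusted r-squared is undefined when no residual degrees of freedom remain
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / df_resid if df_resid > 0 else np.nan
    return r2, r2_adj

# Use the first k columns of x for k = 1 ... 200
results = [r2_scores(y, x[:, :k]) for k in range(1, max_features + 1)]
r2 = np.array([r for r, _ in results])
r2_adj = np.array([a for _, a in results])

def plot_scores(r2, r2_adj):
    """Plot both scores against the number of x features."""
    import matplotlib.pyplot as plt  # imported here so the scores can be computed without it
    ks = np.arange(1, len(r2) + 1)
    plt.plot(ks, r2, label="r-squared")
    plt.plot(ks, r2_adj, label="adjusted r-squared")
    plt.xlabel("Number of x features")
    plt.ylabel("Score")
    plt.legend()
    plt.show()
```

Call `plot_scores(r2, r2_adj)` to draw the figure. Reusing the first `k` columns of one fixed `x` keeps the models nested, which makes the r-squared curve easier to read than drawing fresh data for every `k`.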