# Lab: Simple Linear Regression
## CMSE 381 - Fall 2022
## Lecture 5 - Sept 12, 2022

In today's lecture, we start on simple linear regression, that is, fitting models of the form 
$$
Y =  \beta_0 +  \beta_1 X_1 + \varepsilon
$$
In this lab, we will use two different tools for linear regression. 
- [Scikit-learn](https://scikit-learn.org/stable/index.html) is arguably the most used tool for machine learning in Python 
- [Statsmodels](https://www.statsmodels.org) provides many of the statistical tests we've been learning in class
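
For a single predictor, the least squares estimates have a closed form: $\hat\beta_1 = \sum_i (x_i - \bar x)(y_i - \bar y) / \sum_i (x_i - \bar x)^2$ and $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$. Here is a minimal sketch of that computation on synthetic data. The true coefficients (2 and 5), the seed, the sample size, and the variable names are illustrative assumptions, not part of the lab data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
x = rng.uniform(-2, 2, size=200)
y = 2 + 5 * x + rng.normal(size=200)  # assumed true line plus noise

x_bar, y_bar = x.mean(), y.mean()

# Closed-form least squares estimates for one predictor
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

print(f"beta0_hat = {beta0_hat:.3f}, beta1_hat = {beta1_hat:.3f}")
```

Both tools used in this lab should return these same two numbers when handed the same data; they differ mainly in how much extra statistical output they report.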

# 0. The Dataset

In [None]:
# As always, we start with our favorite standard imports. 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns


In this module, we will again use the `Diabetes` data set, which you explored in the last class. In case you've forgotten, there is information about the data set [in the documentation](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset).

In [None]:
from sklearn.datasets import load_diabetes
diabetes = load_diabetes(as_frame=True)
diabetes_df = pd.DataFrame(diabetes.data, columns = diabetes.feature_names)
diabetes_df['target'] = pd.Series(diabetes.target)

diabetes_df

------

# 1. Simple Linear Regression

Like last time, we're going to fit simple linear regression models of the form
$$
\texttt{target} = \beta_0 + \beta_1 \cdot\texttt{s1}
$$
and 
$$
\texttt{target} = \beta_0 + \beta_1 \cdot\texttt{s5}
$$
where the variables are 
- $\texttt{s1}$: tc, total serum cholesterol

- $\texttt{s5}$: ltg, possibly log of serum triglycerides level. 

Let's start by looking at using `s5` to predict `target`.




In [None]:
from sklearn.linear_model import LinearRegression


# sklearn actually likes being handed numpy arrays more than 
# pandas dataframes, so we'll extract the bits we want and just pass it that. 
X = diabetes_df['s5'].values
X = X.reshape([len(X),1])
y = diabetes_df['target'].values
y = y.reshape([len(y),1])

# This code works by first creating an instance of 
# the linear regression class
reg = LinearRegression()
# Then we pass in the data we want it to use to fit.
reg.fit(X,y)

# and we can get the coefficients we want out of the model from the following code.

print(reg.coef_)
print(reg.intercept_)

# I can do some fancy printing if I really want to
lineString = str(round(reg.coef_[0,0],4)) +  "x_1 + " +  str(round(reg.intercept_[0],4))
print( 'y = ', lineString)
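
Once a model is fitted, `predict` evaluates the fitted line at new inputs. The sketch below is self-contained and checks that `predict` agrees with plugging into $\hat\beta_0 + \hat\beta_1 x$ by hand. The inputs in `x_new` are hypothetical values chosen only for illustration. Note that here `y` is left one-dimensional, so `coef_` has shape `(1,)` and `intercept_` is a scalar, unlike the reshaped version in the cell above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes(as_frame=True)
X = diabetes.data[['s5']].values   # shape (n, 1)
y = diabetes.target.values         # shape (n,)

reg = LinearRegression().fit(X, y)

# Hypothetical new s5 values, for illustration only
x_new = np.array([[-0.05], [0.0], [0.05]])
y_pred = reg.predict(x_new)

# predict() should match evaluating the fitted line by hand
y_manual = reg.intercept_ + reg.coef_[0] * x_new[:, 0]
print(y_pred)
print(np.allclose(y_pred, y_manual))
```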

# 2. Assessing Coefficient Estimate Accuracy

To get the statistical test information, we will use the `statsmodels` package. You can take a look at the documentation here: www.statsmodels.org

In [None]:
import statsmodels.formula.api as smf

In [None]:
# Notice that the code is intentionally written to look
# more like R than like python, but it still works!
# Double check..... the coefficients here should be
# about the same as those found by scikit-learn
est = smf.ols('target ~ s5', diabetes_df).fit()
est.summary().tables[1]
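
The standard errors in that table can also be computed from the formulas from class,
$$
SE(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_i (x_i - \bar x)^2}, \qquad
SE(\hat\beta_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar x^2}{\sum_i (x_i - \bar x)^2} \right],
$$
with $\sigma$ estimated by the residual standard error $\sqrt{RSS/(n-2)}$. The following is a rough sketch using only `numpy`; the variable names are my own. If the formulas are applied correctly, the printed values should agree with the `std err` column above (also available as `est.bse`).

```python
import numpy as np
from sklearn.datasets import load_diabetes

diabetes = load_diabetes(as_frame=True)
x = diabetes.data['s5'].values
y = diabetes.target.values
n = len(x)

# Least squares fit
sxx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x.mean()

# Residual standard error: estimate of sigma with n - 2 degrees of freedom
rss = np.sum((y - beta0_hat - beta1_hat * x) ** 2)
sigma_hat = np.sqrt(rss / (n - 2))

se_beta1 = sigma_hat / np.sqrt(sxx)
se_beta0 = sigma_hat * np.sqrt(1 / n + x.mean() ** 2 / sxx)
print("SE(beta0_hat) =", se_beta0)
print("SE(beta1_hat) =", se_beta1)
```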

&#9989; **<font color=red>Q:</font>** What is $SE(\hat \beta_0)$ and $SE(\hat \beta_1)$?

Your answer here. 

&#9989; **<font color=red>Q:</font>** If we instead use `s1` to predict the target, are $SE(\hat \beta_0)$ and $SE(\hat \beta_1)$ higher or lower than what you found for the `s5` prediction? Is this reasonable? Try plotting your predictions against scatter plots of the data to compare. 

In [None]:
# Your code here. 

&#9989; **<font color=red>Q:</font>** What are the confidence intervals for  $\hat \beta_1$ in the two cases (the prediction using `s1` and the prediction using `s5`)? Which is wider and why?  

Your answer here. 

&#9989; **<font color=red>Q:</font>** What is the conclusion of the hypothesis test 
$$H_0: \text{ There is no relationship between $X$ and $Y$}$$
$$H_a: \text{ There is some relationship between $X$ and $Y$}$$
at a significance level of $\alpha = 0.05$?

Your answer here
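
One way to read the intervals and test results off programmatically, rather than from the printed table, is through the fitted results object. The sketch below assumes the `conf_int`, `tvalues`, and `pvalues` members of a `statsmodels` OLS fit; the `results` dictionary and the `df` name are illustrative. Treat it as a way to check your answers, not a replacement for interpreting them.

```python
import statsmodels.formula.api as smf
from sklearn.datasets import load_diabetes

diabetes = load_diabetes(as_frame=True)
df = diabetes.data.copy()
df['target'] = diabetes.target

results = {}
for feature in ['s1', 's5']:
    est = smf.ols(f'target ~ {feature}', df).fit()
    # 95% confidence interval for the slope (alpha = 0.05)
    ci_low, ci_high = est.conf_int(alpha=0.05).loc[feature]
    results[feature] = {
        't': est.tvalues[feature],
        'p': est.pvalues[feature],
        'ci': (ci_low, ci_high),
    }
    print(f"{feature}: t = {est.tvalues[feature]:.2f}, "
          f"p = {est.pvalues[feature]:.3g}, "
          f"95% CI = [{ci_low:.1f}, {ci_high:.1f}]")
```

We reject $H_0$ at level $\alpha$ when the p-value is below $\alpha$, which is equivalent to the corresponding confidence interval for $\beta_1$ not containing zero.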

In [None]:
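# Upper end of an approximate 95% confidence interval for beta_1,
# estimate + 1.96 * SE, using rounded values from the `s5` table above
916.13 + 1.96 * 63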

Oh hey look, there's another table with information stored in the `statsmodels` results object. 

In [None]:
est.summary().tables[0]

&#9989; **<font color=red>Q:</font>** What is $R^2$ for the two models?
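
As a cross-check on the table, $R^2$ can be computed directly from its definition, $R^2 = 1 - RSS/TSS$. The sketch below does this for both predictors and compares against scikit-learn's `score` method, which returns $R^2$ for regressors. Variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes(as_frame=True)
y = diabetes.target.values

r2 = {}
for feature in ['s1', 's5']:
    X = diabetes.data[[feature]].values
    reg = LinearRegression().fit(X, y)

    rss = np.sum((y - reg.predict(X)) ** 2)   # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)         # total sum of squares
    r2[feature] = 1 - rss / tss

    # The manual value should match reg.score up to floating point error
    print(feature, round(r2[feature], 4), np.isclose(r2[feature], reg.score(X, y)))
```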

![Stop Icon](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/Vienna_Convention_road_sign_B2a.svg/180px-Vienna_Convention_road_sign_B2a.svg.png)

Great, you got to here! Hang out for a bit, there's more lecture before we go on to the next portion. 

# 3.  Simulating data 
OK, let's run an example like the one shown in class, where we see the distribution of possible coefficient estimates. 

In [None]:
# Here's code that decides on my function 
def myFunc(x, b0 = 2, b1 = 5): 
    return b0 + b1 * x


# Here's a function that generates n (default 100) random data points from f(x) + epsilon
def makeData(n = 100):
    X = np.random.uniform(-2,2,n)
    y = myFunc(X) + np.random.normal(size = n)
    return X,y

In [None]:
# Every time you run this cell, you get slightly different data

X,y = makeData()

plt.scatter(X,y)

In [None]:
# Which means that every time you run this cell, you get a slightly different choice of coefficients
# for the model learned

X,y = makeData()
X = X.reshape([len(X),1])
y = y.reshape([len(y),1])
reg = LinearRegression()
reg.fit(X,y)
print( 'y=' + str(round(reg.coef_[0,0],4)) +  "x_1 + " +  str(round(reg.intercept_[0],4)) )


In [None]:
# So now, let's train our linear model repeatedly (100 times in the loop below; increase
# for a smoother estimate) and use CIs to check how often the CI covers the true b1.
# You can use [b1 - 2 SE(b1), b1 + 2 SE(b1)], or another CI.

for i in range(100):
    pass  # Your code here
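
If you get stuck, here is one possible sketch of the coverage experiment, not the only way to do it. It is self-contained and uses the closed-form estimates and standard error rather than `LinearRegression`. The seed, the number of trials (1000), and all variable names are assumptions made for illustration; the true coefficients match the defaults of `myFunc`.

```python
import numpy as np

rng = np.random.default_rng(381)   # arbitrary seed, for repeatability
b0_true, b1_true = 2, 5            # same defaults as myFunc
n, n_trials = 100, 1000

covered = 0
for _ in range(n_trials):
    # Fresh data set each trial
    x = rng.uniform(-2, 2, n)
    y = b0_true + b1_true * x + rng.normal(size=n)

    # Least squares fit
    sxx = np.sum((x - x.mean()) ** 2)
    b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0_hat = y.mean() - b1_hat * x.mean()

    # Standard error of the slope, with sigma estimated from the residuals
    rss = np.sum((y - b0_hat - b1_hat * x) ** 2)
    se_b1 = np.sqrt(rss / (n - 2) / sxx)

    # Does the approximate 95% interval contain the true slope?
    if b1_hat - 2 * se_b1 <= b1_true <= b1_hat + 2 * se_b1:
        covered += 1

coverage = covered / n_trials
print(f"CI covered the true b1 in {coverage:.1%} of trials")
```

With an interval of roughly two standard errors, the coverage should land near 95%, with some run-to-run variation if you change the seed.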

# 4. Multiple linear regression 
Next we get some code up and running that can do linear regression with multiple input variables, that is, when the model is of the form
$$
Y =  \beta_0 +  \beta_1 X_1 +  \beta_2 X_2 + \cdots +  \beta_pX_p + \varepsilon
$$

We first model `target = beta_0 + beta_1 *s1 + beta_2 * s5` using `scikit-learn`.

In [None]:
X = diabetes_df[['s1','s5']].values
y = diabetes_df['target'].values

multireg = LinearRegression() #<----- notice I'm using exactly the same command as above
multireg.fit(X,y)

print(multireg.coef_)
print(multireg.intercept_)
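
To see what the fit is doing, the same coefficients can be recovered by solving the least squares problem directly on a design matrix with a leading column of ones. This is a sketch using `np.linalg.lstsq`; the names `X_design` and `beta_hat` are illustrative, and no claim is made that scikit-learn uses this exact routine internally.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes(as_frame=True)
X = diabetes.data[['s1', 's5']].values
y = diabetes.target.values

# Prepend a column of ones so the first coefficient plays the role of beta_0
X_design = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

multireg = LinearRegression().fit(X, y)

print(beta_hat)  # [beta_0, beta_1, beta_2]
print(np.allclose(beta_hat[0], multireg.intercept_),
      np.allclose(beta_hat[1:], multireg.coef_))
```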

&#9989; **<font color=red>Q:</font>** What are the values for $\beta_0$, $\beta_1$, and $\beta_2$? Write an interpretation for the $\beta_2$ value in this data set. 

Your answer here

In [None]:
diabetes_df.var()

We next model `target = beta_0 + beta_1 *s1 + beta_2 * s5` using `statsmodels`. Do you get the same model?

In [None]:
# multiple least squares with statsmodels
multiple_est = smf.ols('target ~ s1 + s5', diabetes_df).fit()
multiple_est.summary()

&#9989; **<font color=red>Q:</font>** What is the predicted model? How much trust can we place in the estimates?

*Your answer here*

&#9989; **<font color=red>Q:</font>** Run the linear regression to predict `target` using all the other variables. What do you notice about the different terms? Are some more related than others? 

*Your answer here*
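
If typing out all ten predictors is tedious, the formula string can be assembled from the column names. This is a small sketch under the assumption that every column other than `target` should enter the model linearly; `df`, `formula`, and `full_est` are illustrative names.

```python
import statsmodels.formula.api as smf
from sklearn.datasets import load_diabetes

diabetes = load_diabetes(as_frame=True)
df = diabetes.data.copy()
df['target'] = diabetes.target

# Build 'target ~ age + sex + ...' from the feature names
formula = 'target ~ ' + ' + '.join(diabetes.feature_names)
full_est = smf.ols(formula, df).fit()

print(formula)
print(full_est.pvalues.round(4))
```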

&#9989; **<font color=red>Q:</font>** Earlier you determined the p-value for the `s1` variable when we only used `s1` to predict `target`. What changed about the p-value for `s1` now that it is part of a regression using all the variables? Why?

In [None]:
# Your answer here



-----
### Congratulations, we're done!

Written by Dr. Liz Munch, Michigan State University
<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.

In [None]:
A = np.array([[1, 2, 3], [4, 5, 6]])
B = np.array([[7, 8], [10, 11]])
A1 = np.concatenate([A,B],axis = 1)

#Same number of rows w/ result [[1, 2, 3, 7, 8],[4, 5, 6, 10, 11]]
print('shape of A is ', np.shape(A), '; shape of B is ', np.shape(B))

In [None]:
C = np.array([[1, 2, 3],[4, 5, 6]])
D = np.array([3, 3, 3])
# np.concatenate([C, D], axis=0) would raise a ValueError here: C is 2-D but D is 1-D
print('shape of C is ', np.shape(C), '; shape of D is ', np.shape(D))

In [None]:
C = np.array([[1, 2, 3],[4, 5, 6]])
D = np.array([[3, 3, 3]])

print('shape of C is ', np.shape(C), '; shape of D is ', np.shape(D))
np.concatenate([C,D], axis = 0) 

In [None]:
A[0]