## Introduction to Linear Regression

Previously we looked at linear regression as a way of summarizing the relationship between two quantitative variables.  Now we are going to take a deeper dive into linear regression as a model.

We are going to use a new package called 'sklearn', though if you need to install it the package name is 'scikit-learn'.  This package has many of the models that we will use going forward.

In [69]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import scipy.stats
import statsmodels.api as sm 
import pylab as py 

# sklearn is new; if you need to install it, the command is
# pip3 install scikit-learn
from sklearn.linear_model import LinearRegression

We are going to start with the monkey data.

In [None]:
# read in the monkey data
monkey = pd.read_csv("https://webpages.charlotte.edu/mschuck1/classes/DTSC2301/Data/monkey.csv")
# get info about these data
monkey.info()



Reminder that these data are the age of the monkeys in years (*age*) and the number of primordial follicles that a female monkey has (*pf*).

Next we will plot the data.

In [None]:
plt.scatter(monkey['age'], monkey['pf'], color="blue")

# Add labels and title
plt.xlabel('Age in years')
plt.ylabel('Number of primordial follicles')
plt.title('Plot of age versus number of primordial follicles for monkeys')
plt.show()

The relationship here is negative and seems linear.

### Fitting the model

Below we have the code for specifying the model, then fitting the model to the data.

In [None]:
# In sklearn we first create a model object;
# here it is a linear regression
model = LinearRegression()
# x needs to be two-dimensional (rows by columns), so we use
# double brackets to get a one-column DataFrame
x = monkey[['age']]
# y can be one-dimensional, so single brackets work
y = monkey['pf']
# fit the model to the data
model.fit(x, y)
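
Once a model has been fit, the estimated line is stored on the model object. The sketch below shows where to find the intercept and slope, using a small made-up data set (the numbers and the names `toy` and `toy_model` are for illustration only and are not the monkey data). It also cross-checks the result against numpy's own least-squares line fit.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up illustration data (NOT the monkey data): pf falls as age rises
toy = pd.DataFrame({
    "age": [1, 2, 3, 4, 5, 6, 7, 8],
    "pf": [95, 88, 84, 71, 66, 60, 49, 41],
})

toy_model = LinearRegression()
toy_model.fit(toy[["age"]], toy["pf"])

# intercept_ is a single number; coef_ holds one slope per predictor column
intercept = toy_model.intercept_
slope = toy_model.coef_[0]
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")

# Cross-check with numpy's least-squares fit of a degree-1 polynomial,
# which returns the slope first and then the intercept
np_slope, np_intercept = np.polyfit(toy["age"], toy["pf"], deg=1)
print(f"numpy:     intercept = {np_intercept:.3f}, slope = {np_slope:.3f}")
```

A negative slope here matches a scatterplot that trends downward from left to right.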

### Assessment 
There are two types of assessment for a model.  First, we assess whether or not the data are appropriate for the model's conditions (also called requirements or assumptions).  Second, we evaluate how well the model performs.  

For the former, when we are using the linear model, the relationship between our variables should be linear and the variability about the line should be consistent.  There is another condition that the errors/residuals should be approximately Gaussian, or Normally distributed.  This last condition matters mostly when the number of observations is small.

#### Model Conditions

To evaluate the model conditions above we use the residuals.  The word residual means leftover, and in regression we use it to mean the part of the target/response, $y$, that is left over after subtracting the predicted value, $\hat{y}$.  Mathematically, the
residuals, usually denoted by $e$, are calculated as $e = y -\hat{y}$.  
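
As a small sketch of that calculation, the code below fits a least-squares line to made-up numbers (not the monkey data; `x_toy` and `y_toy` are hypothetical names) and computes the residuals by hand. It uses `np.polyfit` instead of sklearn only to keep the example short. For a least-squares line with an intercept, the residuals should add up to zero apart from rounding error.

```python
import numpy as np

# Made-up illustration data (NOT the monkey data)
x_toy = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y_toy = np.array([95, 88, 84, 71, 66, 60, 49, 41], dtype=float)

# Least-squares line via numpy: returns the slope first, then the intercept
slope, intercept = np.polyfit(x_toy, y_toy, deg=1)

y_hat_toy = intercept + slope * x_toy  # predicted values
e = y_toy - y_hat_toy                  # residuals: observed minus predicted

print(np.round(e, 2))
print("sum of residuals:", round(float(e.sum()), 10))
```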

The first plot we look at is a plot of the residuals versus the fitted values.  In this plot we want to see no relationship between the residuals (y-axis) and the predicted values (x-axis), because that means no pattern is left over after we fit the model.  It is also important that the variability in the vertical direction is roughly the same across the different predicted values.  
  

In [None]:
from sklearn.metrics import PredictionErrorDisplay
# the code below gets the predicted values for all of the values in x
y_hat = model.predict(x)
# below makes a plot of the residuals versus the predicted values
display = PredictionErrorDisplay(y_true=y, y_pred=y_hat)
display.plot()
plt.show()

In the graph above the only trend seems to be a flat one and the variability seems to be the same across the predicted values.


The second plot to consider is a *qqplot*, which is short for quantile-quantile plot.  This plot allows us to see whether the residuals roughly follow a Normal distribution.  The details are that we take the quantiles we would expect if the residuals *perfectly* followed a Normal distribution ('Theoretical Quantiles') and plot them against the quantiles of the actual residuals ('Sample Quantiles').  We want the relationship to be roughly linear, which would imply 'Normality' of the residuals.  As with many statistical things, the 'Normality' matters less as the sample size increases.  

In [None]:
# this is code for making the qqplot

# get the predicted values from the model
y_hat = model.predict(x)  
# calculate the residuals 
residuals = y -y_hat
# generate the qq plot; line='s' adds a reference line scaled to the
# mean and standard deviation of the residuals
sm.qqplot(residuals, line='s')
py.show()
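
To make the two axes of a qq plot more concrete, here is a rough sketch that builds the pairs of quantiles by hand for a handful of made-up residuals (`resid_toy` is a hypothetical name, and these values do not come from the monkey model). The probabilities $(i - 0.5)/n$ are one common convention and are an assumption here; plotting libraries may use a slightly different formula, so the numbers need not match `sm.qqplot` exactly.

```python
import numpy as np
from scipy.stats import norm

# Made-up residuals for illustration (NOT from the monkey model)
resid_toy = np.array([1.2, -0.4, 3.1, -2.5, 0.3, -1.1, 0.8, -1.4])
n = len(resid_toy)

# Sample quantiles: the residuals sorted from smallest to largest
sample_q = np.sort(resid_toy)

# Theoretical quantiles: standard Normal quantiles at evenly spaced
# probabilities; (i - 0.5) / n is one common choice of probabilities
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical_q = norm.ppf(probs)

for t, s in zip(theoretical_q, sample_q):
    print(f"{t:7.3f}  {s:7.3f}")

# A correlation near 1 between the two columns means the points
# would fall close to a straight line on the qq plot
qq_corr = np.corrcoef(theoretical_q, sample_q)[0, 1]
print("correlation of quantile pairs:", round(float(qq_corr), 3))
```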


### Model Summary

The other way that we assess a model is by how well it fits the data.  There are several different measurements that tell us how well the model fits.  

#### Correlation
The first of these is one we've already seen: the correlation.  Python has several ways to calculate the correlation, and below we'll see three of them.  Note that the usual correlation is sometimes called Pearson's correlation coefficient and the usual notation is $r$.  

In [None]:
# here we are using the numpy package
r = np.corrcoef(monkey['age'], monkey['pf'])[0, 1]
print(r)

In [None]:
# here we are using the scipy package
corr, pvalue = scipy.stats.pearsonr(monkey['age'], monkey['pf'])
print(corr)

In [None]:
# here we are using pandas
monkey['age'].corr(monkey['pf'])

#### R-squared
The next measure of model fit is the 'coefficient of determination', or more colloquially 'r-squared', because for a linear regression with a single predictor we can calculate it by taking the correlation, $r$, and squaring it.  That it works out this way is a mathematical nicety.  The important thing is the interpretation of $r^2$: 'the percent of the variation in the target that is explained by the linear model with x'.  

In [82]:
print(corr*corr)

0.8716326106866872


So taking the 'monkey' data we get an $r^2$ of $0.872$ or $87.2\%$ which means that 87.2 percent of the variation in the number of primordial follicles that a monkey has can be explained by their age.  

In [81]:
#Here's another method from the sklearn package 
from sklearn.metrics import r2_score
r2_score(y, y_hat)

0.8716326106866871
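
The 'variation explained' interpretation comes from writing $r^2$ as one minus the ratio of the leftover variation (the sum of squared residuals) to the total variation in the target. The sketch below, on made-up numbers rather than the monkey data (`x_toy` and `y_toy` are hypothetical names), computes $r^2$ that way and compares it with the squared correlation. For a least-squares line with one predictor the two should agree.

```python
import numpy as np

# Made-up illustration data (NOT the monkey data)
x_toy = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y_toy = np.array([95, 88, 84, 71, 66, 60, 49, 41], dtype=float)

# Least-squares line and its predicted values
slope, intercept = np.polyfit(x_toy, y_toy, deg=1)
y_hat_toy = intercept + slope * x_toy

sse = np.sum((y_toy - y_hat_toy) ** 2)     # variation left over after the fit
sst = np.sum((y_toy - y_toy.mean()) ** 2)  # total variation in the target
r2_from_sums = 1 - sse / sst

# Squared Pearson correlation between x and y
r = np.corrcoef(x_toy, y_toy)[0, 1]
r2_from_corr = r ** 2

print(r2_from_sums, r2_from_corr)
```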

#### Root Mean Squared Error
The next measure of how well a model does is the 'root mean squared error' or RMSE.  To understand this metric, we need to go back to the calculation of the standard deviation.  That calculation is
$$s =\sqrt{ \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2}. $$

We interpret that quantity, $s$, as roughly the typical distance of an observation from the mean.  

For a linear regression with a single predictor, the root mean squared error is
$$s_e =\sqrt{\frac{1}{n-2}\sum_{i=1}^n (y_i-\hat{y}_i)^2}. $$
Three things to note here.  First, the part that is being squared is the residual.  Second, the part under the square root is a sum that we divide by $n-2$, which is usually close to $n$, so it is like the mean of the squared errors.  Third, we take the square root.  Putting those three together, we get the 'root mean squared error' or RMSE.

We interpret the RMSE as roughly the average difference between the observed values and the predicted values from our regression line.  So this is a measure of the typical size of a residual.  Note that `root_mean_squared_error` from sklearn, used below, divides by $n$ rather than $n-2$, so its value is slightly smaller than $s_e$.

In [83]:
from sklearn.metrics import root_mean_squared_error
root_mean_squared_error(y, y_hat)

5.05583596603833
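
To see how much the choice of denominator matters, here is a small sketch on made-up data (not the monkey data; `x_toy` and `y_toy` are hypothetical names) that computes the quantity both ways with numpy: once dividing by $n-2$, as in the formula for $s_e$, and once dividing by $n$, which is a plain mean of the squared residuals.

```python
import numpy as np

# Made-up illustration data (NOT the monkey data)
x_toy = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y_toy = np.array([95, 88, 84, 71, 66, 60, 49, 41], dtype=float)
n = len(y_toy)

# Least-squares line and its residuals
slope, intercept = np.polyfit(x_toy, y_toy, deg=1)
residuals_toy = y_toy - (intercept + slope * x_toy)
sse = np.sum(residuals_toy ** 2)

s_e = np.sqrt(sse / (n - 2))  # divides by n - 2, as in the formula for s_e
rmse_n = np.sqrt(sse / n)     # divides by n: a plain mean of squared residuals

print("dividing by n - 2:", round(float(s_e), 4))
print("dividing by n:    ", round(float(rmse_n), 4))
```

With only eight observations the gap is noticeable, and it shrinks as $n$ grows.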

### Predictions


In [None]:
# make a dataframe for predictions at ages 2, 3 and 5
x_pred = pd.DataFrame({'age': [ 2, 3, 5]})
# code to have the model give us the predicted values at the ages in x_pred
model.predict(x_pred)
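
A prediction from this kind of model is just the fitted line evaluated at the new value: intercept plus slope times age. The sketch below checks that on made-up data (the numbers and the names `toy`, `toy_model`, and `new_ages` are for illustration only, not the monkey data) by comparing `predict` with the same calculation done by hand.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up illustration data (NOT the monkey data)
toy = pd.DataFrame({
    "age": [1, 2, 3, 4, 5, 6, 7, 8],
    "pf": [95, 88, 84, 71, 66, 60, 49, 41],
})
toy_model = LinearRegression().fit(toy[["age"]], toy["pf"])

# New ages to predict at; the column name must match the one used in fit
new_ages = pd.DataFrame({"age": [2, 3, 5]})
pred_from_model = toy_model.predict(new_ages)

# The same predictions by hand: intercept + slope * age
pred_by_hand = (toy_model.intercept_
                + toy_model.coef_[0] * new_ages["age"].to_numpy())

print(np.round(pred_from_model, 2))
print(np.round(pred_by_hand, 2))
```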