## Introduction to Linear Regression

Previously we looked at linear regression as a way of summarizing a relationship between two quantitative variables.  Now we are going to take a deeper dive into linear regression as a model.

We are going to use a new package called `sklearn` (pronounced 's k learn'), though if you need to install it, the package name is `scikit-learn`.  This package contains many of the models that we will use going forward.

In [516]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 
import scipy.stats as st
import statsmodels.api as sm 
import pylab as py 

# sklearn is new and you may have to install it,  the code is 
# pip3 install scikit-learn
from sklearn.linear_model import LinearRegression

We are going to start with the monkey data.

In [None]:
# read in the monkey data
monkey = pd.read_csv("https://webpages.charlotte.edu/mschuck1/classes/DTSC2301/Data/monkey.csv")
# get info about these data
monkey.info()



Reminder that these data are the age of the monkeys in years (*age*) and the number of primordial follicles that a female monkey has (*pf*).

Next we will plot the data.

In [None]:
plt.scatter( monkey['age'],monkey['pf'], color="blue")

# Add labels and title
plt.xlabel('Age in years')
plt.ylabel('Number of primordial follicles')
plt.title('Plot of age versus number of primordial follicles for monkeys')
plt.show()

The relationship here is negative and seems linear.

### Fitting the model

Below we have the code for specifying the model, then fitting the model to the data.

In [None]:
# In sklearn we first need to create a model object 
# and here it is a linear regression
model= LinearRegression()
# note below that the x needs to be a two dimensional array so we 
# need the double brackets here
x=monkey[['age']]
# y needs to be a one dimensional array so single brackets work
y=monkey['pf']
model.fit(x, y)
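
Once a model has been fit, the estimated y-intercept and slope are stored on the model object.  The sketch below uses a small made-up dataset (the name `toy` and its values are hypothetical, not the monkey data) so that we know the answer in advance: the points fall exactly on the line $y = 10 - 2x$.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# hypothetical toy data (NOT the monkey data): y = 10 - 2x exactly
toy = pd.DataFrame({"age": [1.0, 2.0, 3.0, 4.0, 5.0]})
toy["pf"] = 10.0 - 2.0 * toy["age"]

toy_model = LinearRegression()
toy_model.fit(toy[["age"]], toy["pf"])

# intercept_ is a single number; coef_ holds one slope per feature column
toy_intercept = toy_model.intercept_
toy_slope = toy_model.coef_[0]
print(round(toy_intercept, 3), round(toy_slope, 3))
```

The same two attributes, `intercept_` and `coef_`, should be available on `model` after the `fit` call above.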

### Assessment 
There are two types of assessment for a model.  First, we assess whether or not the data are appropriate for the model requirements (also called conditions or assumptions).  Second, we evaluate how well the model performs.  

For the former, when we are using the linear model, the relationship between our variables should be linear and the variability about the line should be consistent.  There is another condition that the errors/residuals should be approximately Gaussian (Normally distributed).  This last condition is only important if the number of observations is small.

#### Model Conditions

To evaluate the model conditions above we use the residuals.  The word residual means leftover, and in regression we use it to mean the part of the target/response, $y$, that is left over after subtracting the predicted value, $\hat{y}$. So mathematically, the residuals, usually denoted by $e$, are calculated as $e = y - \hat{y}$.  
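
As a small illustration of the residual calculation, the sketch below fits a line to five made-up points with `np.polyfit` (used here only to keep the example short; the data values are hypothetical) and computes $e = y - \hat{y}$ directly.

```python
import numpy as np

# hypothetical data, not from the monkey dataset
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([9.8, 8.1, 6.3, 3.9, 2.2])

# least squares line: polyfit returns the slope first, then the intercept
slope, intercept = np.polyfit(x_toy, y_toy, 1)
y_hat_toy = intercept + slope * x_toy

# residual = observed minus predicted
e = y_toy - y_hat_toy
print(e)
print(e.sum())  # essentially zero, up to rounding error
```

For a least squares line with an intercept, the residuals always sum to zero (up to rounding), which is why we look at their pattern and spread rather than their total.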

The first plot we will look at is a plot of residuals versus fitted values. In this plot we are looking for no relationship between the residuals (y-axis) and the predicted values (x-axis).  Having no relationship means that no pattern remains in what is left over after we fit the model.  Additionally, the variability in the vertical direction should be roughly the same across the different predicted values.  
  

In [None]:
from sklearn.metrics import PredictionErrorDisplay
# the code below gets the predicted values for all of the values in x
y_hat = model.predict(x)
# below makes a plot of the residuals versus the predicted values
display = PredictionErrorDisplay(y_true=y, y_pred=y_hat)
display.plot()
plt.show()

In the graph above the only trend seems to be a flat one and the variability seems to be the same across the predicted values.  This suggests that the linearity and constant variability conditions are met, so a linear model is appropriate for these data.


The second plot to consider is called a *qqplot*, which is short for quantile-quantile plot.  This plot allows us to see if the residuals roughly follow a Normal distribution.  The details are that we take the quantiles we would expect if the residuals *perfectly* followed a Normal distribution ('Theoretical Quantiles') and plot those against the quantiles from the actual residuals ('Sample Quantiles').  We want the points, the blue dots below, to fall roughly along a straight line, which would imply 'Normality' of the residuals.  Small deviations from a straight line (among the points) in a qqplot are not a big deal.  As with many statistical things, the 'Normality' matters less as the sample size increases.   

In [None]:
# this is code for making the qqplot

# get the predicted values from the model
y_hat = model.predict(x)  
# calculate the residuals 
residuals = y -y_hat
# generate the qq plot and put a line through the points to help us visualize the relationship here    
sm.qqplot(residuals, line ='s') 
# display the plot
py.show() 
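
To see what a qqplot is built from, the sketch below computes the two sets of quantiles without drawing anything.  It uses `scipy.stats.probplot` on simulated residuals (the name `fake_residuals` and the simulation settings are assumptions for illustration); the exact plotting positions may differ slightly from those used by `sm.qqplot`.

```python
import numpy as np
import scipy.stats as st

# simulated residuals that really are Normal (hypothetical, for illustration)
rng = np.random.default_rng(0)
fake_residuals = rng.normal(loc=0, scale=5, size=200)

# probplot returns the theoretical quantiles, the sorted sample values,
# and a least squares line through those points along with its correlation r
(theoretical_q, sample_q), (qq_slope, qq_intercept, qq_r) = st.probplot(
    fake_residuals, dist="norm"
)

print(theoretical_q[:3], sample_q[:3])
print(qq_r)  # close to 1 when the points fall near a straight line
```

When the residuals are far from Normal (for example, strongly skewed), the correlation between the two sets of quantiles drops and the points visibly bend away from the line.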


### Model Summary

So the other way that we assess a model is how well does it fit the data.  There are several different measurements that tell us how well the model fits.  

#### Correlation
The first of these is one we've already seen: the correlation.  Python has several ways to calculate the correlation; below we'll see three of them.   Note that the usual correlation is sometimes called Pearson's correlation coefficient and the usual notation is $r$.  

In [None]:
#here we are using the numpy package
r= np.corrcoef(monkey['age'], monkey['pf'])[0, 1]
print (r)

In [None]:
# here we are using the scipy package (imported above as st)
corr, pvalue = st.pearsonr(monkey['age'], monkey['pf'])
print(corr)

In [None]:
# here we are using pandas
monkey['age'].corr(monkey['pf'])
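
All three functions compute the same quantity.  As a rough check on what that quantity is, the sketch below computes $r$ from standardized values on a few made-up points (the arrays are hypothetical) and compares the result with `np.corrcoef`.

```python
import numpy as np

# hypothetical data with a negative relationship
x_toy = np.array([1.3, 2.5, 4.0, 6.2, 8.4])
y_toy = np.array([50.0, 44.0, 31.0, 20.0, 9.0])
n = len(x_toy)

# standardize each variable using the sample standard deviation (ddof=1)
zx = (x_toy - x_toy.mean()) / x_toy.std(ddof=1)
zy = (y_toy - y_toy.mean()) / y_toy.std(ddof=1)

# r is the sum of the products of the z-scores divided by n - 1
r_manual = (zx * zy).sum() / (n - 1)
r_numpy = np.corrcoef(x_toy, y_toy)[0, 1]
print(r_manual, r_numpy)
```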

#### R-squared
The next measure of model fit is the 'coefficient of determination', or more colloquially 'r-squared', because for a regression with a single predictor it can be calculated by taking the correlation, $r$, and squaring it.  That it works out this way is a mathematical nicety.  More importantly, $r^2$ has a useful interpretation: 'the percent of the variation in the target that is explained by the linear model with x'.  

In [None]:
print(corr*corr)

So taking the 'monkey' data we get an $r^2$ of $0.872$ or $87.2\%$ which means that 87.2 percent of the variation in the number of primordial follicles that a monkey has can be explained by their age.  

In [None]:
#Here's another method from the sklearn package 
from sklearn.metrics import r2_score
r2_score(y, y_hat)
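
The 'percent of variation explained' interpretation can be checked numerically.  The sketch below, on hypothetical data, computes $1 - SSE/SST$ (one minus the residual sum of squares over the total sum of squares) and compares it with the squared correlation and with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# hypothetical data, not the monkey data
x_toy = np.array([1.3, 2.5, 4.0, 6.2, 8.4])
y_toy = np.array([50.0, 44.0, 31.0, 20.0, 9.0])

# least squares line and its predictions
slope, intercept = np.polyfit(x_toy, y_toy, 1)
y_hat_toy = intercept + slope * x_toy

sse = ((y_toy - y_hat_toy) ** 2).sum()      # variation left over after the fit
sst = ((y_toy - y_toy.mean()) ** 2).sum()   # total variation in y
r2_manual = 1 - sse / sst

r_toy = np.corrcoef(x_toy, y_toy)[0, 1]
print(r2_manual, r_toy ** 2, r2_score(y_toy, y_hat_toy))
```

All three numbers agree for a least squares line with one predictor.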

#### Root Mean Squared Error
The next measure of how well a model does is the 'root mean squared error' or RMSE.  To understand this metric, we need to go back to the calculation of the standard deviation.  That calculation is
$$s =\sqrt{ \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2}. $$

And that quantity, $s$, we interpret as the typical distance of an observation from the mean.  

For a linear regression with a single predictor, the root mean squared error is $$s_e =\sqrt{\frac{1}{n-2}\sum_{i=1}^n (y_i-\hat{y}_i)^2}. $$  A couple of things here: First, the part that is being squared is the residual.  Second, the part under the square root is a sum that we are dividing by $n-2$, which is usually close to $n$, so it is like the mean of the squared errors.  Third, we are taking the square root.  Putting those three together we get the 'root mean squared error' or RMSE.

We interpret the RMSE as the typical difference between the observed values and the predicted values from our regression line.  So this is a measure of the typical size of a residual, that is, of how far an observation tends to fall from the prediction line.

In [None]:
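from sklearn.metrics import root_mean_squared_error
# note: this function divides by n rather than n-2, so it will be
# slightly smaller than the s_e formula above
root_mean_squared_error(y, y_hat)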

So, on average, the difference between the predicted number of primordial follicles and the observed number of primordial follicles is 5.06.  
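
The formula above divides by $n-2$, while `root_mean_squared_error` in sklearn divides by $n$.  The sketch below, on hypothetical data, computes both versions with numpy so the difference is visible; the two are related by a factor of $\sqrt{n/(n-2)}$.

```python
import numpy as np

# hypothetical data, not the monkey data
x_toy = np.array([1.3, 2.5, 4.0, 6.2, 8.4, 3.1, 5.5, 7.0])
y_toy = np.array([50.0, 44.0, 31.0, 20.0, 9.0, 40.0, 22.0, 17.0])
n = len(y_toy)

slope, intercept = np.polyfit(x_toy, y_toy, 1)
resid = y_toy - (intercept + slope * x_toy)

rmse_n = np.sqrt((resid ** 2).sum() / n)      # divides by n (as sklearn does)
s_e = np.sqrt((resid ** 2).sum() / (n - 2))   # divides by n - 2 (formula above)
print(rmse_n, s_e)
```

With a large number of observations the two values are nearly identical; with a small sample the $n-2$ version is noticeably larger.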

### Predictions

Above we saw how to get the predicted values for the data that we observed.

In [None]:
# get the predicted values from the model
monkey['y_hat'] = model.predict(x) 
print(monkey)

Now suppose we want to get the predicted *pf* for a monkey that is 2.1, 3.0 or 5.9 years of age.

In [None]:
# make a dataframe for predictions at ages 2.1, 3.0 and 5.9
x_pred = pd.DataFrame({'age': [ 2.1, 3.0, 5.9]})
# code to have the model give us the predicted values at the ages in x_pred
model.predict(x_pred)

These are the predicted values for age=2.1, 3.0 and 5.9 respectively.  

#### Extrapolation

One thing to be careful of is that linear regression, like all models, is not intelligent.  We can get the model to give us predictions that are not reasonable.  Extrapolation is the idea that we are extending the model beyond the range of our features.  In particular, we don't know that the linear relationship that we have when age is between 1.3 and 8.4 continues to hold for values of age outside that range.  Our predictions may still be reasonable when we move slightly beyond that range, but they become less trustworthy the farther we go.  

In [None]:
x_pred2=pd.DataFrame({'age':[32,712,-4]})
model.predict(x_pred2)

First the model gives us predictions for all three of the values for age.  The first value, $32$, might be a large age but it is clearly outside the range of our data and so that prediction is one that we should approach with skepticism.  An age of $712$ and an age of $-4$ both seem to be impossible for a monkey, and yet, the model gives us a value. 
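
Since the model will not warn us, one option is to check new values against the observed range ourselves.  The sketch below is one possible helper; the function name `flag_extrapolation` and the example ages are hypothetical, with the endpoints 1.3 and 8.4 taken from the range quoted above.

```python
import numpy as np
import pandas as pd

def flag_extrapolation(train_values, new_values):
    """Return True for each new value outside the range of the training values."""
    lo, hi = np.min(train_values), np.max(train_values)
    new_values = np.asarray(new_values, dtype=float)
    return (new_values < lo) | (new_values > hi)

# hypothetical observed ages spanning 1.3 to 8.4 years
observed_age = pd.Series([1.3, 2.0, 4.5, 6.1, 8.4])

flags = flag_extrapolation(observed_age, [2.1, 32, 712, -4])
print(flags)
```

A flagged value does not mean the prediction is wrong, only that the data give us no evidence about whether the linear relationship holds there.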



### Inference about predictors

Making confidence intervals and doing hypothesis tests on the slope and y-intercept of our model are sometimes important for a linear regression with a single predictor.  

In [None]:
# we need another package to get this output
import statsmodels.api as sm

# for this particular model formulation we need to add a 
# column of 1's to the feature array
#add constant to predictor variables
x2 = sm.add_constant(x)

#fit linear regression model
model2 = sm.OLS(y, x2).fit()

#view model summary
print(model2.summary())

There is quite a bit of output here but we only are interested in some pieces.  Note that the 'R-squared' here is what
we saw above.  The number of observations, $20$, is given by 'No. Observations.'  

The other things we will care a good deal about, for now, are in the table in the middle, between the "====" lines, that starts with *coef* and *std err*.  This table is a summary of the slope and intercept. The row labels are 'const' and 'age', which correspond to the y-intercept and the slope respectively.  The column headings are 'coef', 'std err', 't', 'P>|t|', '[0.025', and '0.975]'.  

These are:

- *coef* is the estimate of the parameter
- *std err* is the standard error of that estimate
- *t* is the test statistic for a hypothesis test that the parameter $=0$ vs $\neq 0$
- *P>|t|* is the p-value for the hypothesis test above
- *[0.025* is the lower end of a $95\%$ confidence interval for the parameter
- *0.975]* is the upper end of a $95\%$ confidence interval for the parameter

So that a $95\%$ confidence interval for the slope would be $(-8.13,-5.53)$.


Below is the code for making a general confidence interval for the slope:

- for `df` we use the *Df Residuals* from the output above
- for `loc` we use the *coef* for age from above
- for `scale` we use the *std err* for age from above



In [None]:

# confidence interval for a slope
lower, upper = st.t.interval(confidence=0.99, 
              df=18, 
              loc=-6.8301,  
              scale= 0.618) 
print(round(lower,2), round(upper,2))

Our interpretation is that, for each additional year of age, we predict with 99% confidence that the number of primordial follicles a monkey has will drop by between 5.05 and 8.61 follicles.

Some notes:

- *Df Residuals* stands for degrees of freedom for residuals
- *coef* is short for coefficient, which is a mathematical term for the quantity in front of a variable
- *std err* is short for standard error, which is the estimated standard deviation of the estimate
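
The interval from `st.t.interval` is just the estimate plus or minus a t multiplier times the standard error.  The sketch below redoes the 99% interval by hand with `st.t.ppf`, using the slope, standard error and degrees of freedom quoted above; the variable names are our own.

```python
import scipy.stats as st

# values for the monkey slope taken from the text above
coef, se, df = -6.8301, 0.618, 18
confidence = 0.99

# t multiplier that leaves (1 - confidence)/2 in each tail
t_star = st.t.ppf(1 - (1 - confidence) / 2, df)

lower_manual = coef - t_star * se
upper_manual = coef + t_star * se

# the same interval from the built-in function
lower_fn, upper_fn = st.t.interval(confidence, df=df, loc=coef, scale=se)

print(round(lower_manual, 2), round(upper_manual, 2))
print(round(lower_fn, 2), round(upper_fn, 2))
```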

### Blue Jays

We'll now look at some data about Blue Jays, the birds.  

Details on the data can be found at this link:
<https://rdrr.io/rforge/Stat2Data/man/BlueJays.html>

We'll focus on predicting Blue Jay body mass in grams (*Mass*) from skull size in mm (*Skull*). 

In [None]:
bluejay = pd.read_csv("https://webpages.charlotte.edu/mschuck1/classes/DTSC2301/Data/BlueJays.csv", na_values=['NA'])
# remove rows with missing data
bluejay.dropna(inplace=True)
bluejay.head()



In [None]:
# In sklearn we first need to create a model object 
# and here it is a linear regression
bluejay_model1= LinearRegression()
# note below that the x needs to be a two dimensional array so we 
# need the double brackets here
bluejay_x=bluejay[['Skull']]
# y needs to be a one dimensional array so single brackets work
bluejay_y=bluejay['Mass']
bluejay_model1.fit(bluejay_x, bluejay_y)

bluejay_y_hat = bluejay_model1.predict(bluejay_x)
# below makes a plot of the residuals versus the predicted values
display = PredictionErrorDisplay(y_true=bluejay_y, y_pred=bluejay_y_hat)
display.plot()
plt.show()

The above plot is pretty good.    So we can continue to use and evaluate this model.   


In [None]:
# this is code for making the qqplot

# get the predicted values from the model
bluejay_y_hat = bluejay_model1.predict(bluejay_x)  
# calculate the residuals 
bluejay_residuals = bluejay_y -bluejay_y_hat
# generate the qq plot and put a line through the points to help us visualize the relationship here    
sm.qqplot(bluejay_residuals, line ='s') 
# display the plot
py.show() 

From the above qqplot, the points fall pretty closely along the line so that condition for using the model seems to be met.

So next we will get the slope and y-intercept.

In [None]:
# for this particular model formulation we need to add a 
# column of 1's to the feature array
#add constant to predictor variables
bluejay_x2 = sm.add_constant(bluejay_x)

#fit linear regression model
bluejay_model2 = sm.OLS(bluejay_y, bluejay_x2).fit()

#view model summary
print(bluejay_model2.summary())

So from the above output for our regression, we get some useful information.

First, the prediction equation is $\hat{y} = -17.20 + 2.88 \times$ Skull.  So our estimated slope is 2.88 and our estimated y-intercept is -17.20.  

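This means that we would predict the body mass of a blue jay with a skull size of zero mm to be -17.2g.  That value is not physically meaningful, since a skull size of zero is far outside the range of our data (an extrapolation); the intercept mainly serves to position the line.  And for each additional millimeter of skull size that a blue jay has, we would predict that its body mass would be 2.88 grams larger.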
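
The prediction equation can be applied directly.  The sketch below wraps it in a small function (the name `predict_mass` and the skull sizes 30 mm and 31 mm are hypothetical choices for illustration) using the intercept and slope quoted above.

```python
# estimates taken from the prediction equation above
intercept = -17.20
slope = 2.88

def predict_mass(skull_mm):
    """Predicted body mass in grams for a given skull size in mm."""
    return intercept + slope * skull_mm

mass_30 = predict_mass(30)   # hypothetical skull size of 30 mm
mass_31 = predict_mass(31)   # one millimeter larger

print(mass_30, mass_31)
print(mass_31 - mass_30)     # the difference is the slope, about 2.88 grams
```

Using `bluejay_model1.predict` with a one-column dataframe of skull sizes should give essentially the same answers, up to rounding of the coefficients.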

The $r^2$ value here is 0.306 which indicates that $30.6\%$ of the variability in body mass of a blue jay can be explained by the relationship with their skull size.  

A couple of other things to highlight here: The p-value for the hypothesis test that the y-intercept is zero is $0.160$ which is large and so we can reasonably conclude that the y-intercept is not discernibly different from zero.  

Turning to the hypothesis test for the slope, we can reject the null hypothesis that the slope is zero since the p-value is small: it is reported as $0.000$, which means it is less than $0.0005$ rather than exactly zero.  Thus, we can conclude that the slope is statistically discernible from zero.
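
The *t* and *P>|t|* columns can be reproduced from *coef* and *std err*.  Because the blue jay standard error is not quoted in the text, the sketch below uses the monkey slope values from the earlier section (coefficient -6.8301, standard error 0.618, 18 residual degrees of freedom) as the worked example.

```python
import scipy.stats as st

# monkey slope values quoted in the earlier section
coef, se, df = -6.8301, 0.618, 18

# test statistic for H0: slope = 0 versus Ha: slope != 0
t_stat = coef / se

# two-sided p-value: area in both tails beyond |t|
p_value = 2 * st.t.sf(abs(t_stat), df)

print(round(t_stat, 3))
print(p_value)
```

A p-value this small would be printed as 0.000 in the summary table, which is why a reported 0.000 should be read as 'very small' rather than 'zero'.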




### Another example with Ames Housing Data

Let's return to the Ames Housing data.

In [None]:
# read in the data to dataframe called ames
ames = pd.read_csv("https://webpages.charlotte.edu/mschuck1/classes/DTSC2301/Data/Ames_house_prices.csv", na_values=['?'])
# also replace any ' ?' entries (with a leading space) with NaN;
# replace returns a new dataframe, so we assign the result back
ames = ames.replace([' ?'], np.nan)
# show information about the dataframe
#ames.info()

In [None]:
plt.scatter( ames['GrLivArea'],ames['SalePrice'], color="green")

# Add labels and title
plt.xlabel('Above Ground Living Area')
plt.ylabel('Sale Price')
plt.title('Sale Price vs Living Area for Ames Iowa')

# Show the plot
plt.show()

Before we described this 

In [None]:
# In sklearn we first need to create a model object 
# and here it is a linear regression
ames_model= LinearRegression()
# note below that the x needs to be a two dimensional array so we 
# need the double brackets here
ames_x=ames[['GrLivArea']]
# y needs to be a one dimensional array so single brackets work
ames_y=ames['SalePrice']
ames_model.fit(ames_x, ames_y)

ames_y_hat = ames_model.predict(ames_x)
# below makes a plot of the residuals versus the predicted values
display = PredictionErrorDisplay(y_true=ames_y, y_pred=ames_y_hat)
display.plot()
plt.show()

From the residual plot above we can see that the distribution of the residuals changes substantially for different 'Predicted values'.  This would suggest that this type of model is not appropriate, since one of the conditions for a linear regression model is to have consistent variability.  

This type of changing variability is called 'heteroskedastic'.  If the variability is non-changing as it was for the monkey data, then we say the residuals are 'homoskedastic'.
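
To make the two terms concrete, the sketch below simulates one homoskedastic and one heteroskedastic dataset and compares the spread of the residuals in the upper half of the fitted values to the spread in the lower half.  The simulation settings and the helper `spread_ratio` are assumptions for illustration, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(42)
x_sim = np.linspace(1, 10, 400)

# constant noise standard deviation (homoskedastic)
y_homo = 3 + 2 * x_sim + rng.normal(0, 1.0, size=400)
# noise standard deviation grows with x (heteroskedastic)
y_hetero = 3 + 2 * x_sim + rng.normal(0, 1.0, size=400) * x_sim

def spread_ratio(x, y):
    """Residual SD for large fitted values divided by residual SD for small ones."""
    slope, intercept = np.polyfit(x, y, 1)
    fitted = intercept + slope * x
    resid = y - fitted
    upper_half = fitted > np.median(fitted)
    return resid[upper_half].std(ddof=1) / resid[~upper_half].std(ddof=1)

ratio_homo = spread_ratio(x_sim, y_homo)
ratio_hetero = spread_ratio(x_sim, y_hetero)
print(ratio_homo, ratio_hetero)
```

A ratio near 1 is what we hope to see; a ratio well above 1 corresponds to the fan shape in a residual plot like the one for the Ames data.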

Since the conditions are not appropriate for this model, we will not use it further.

### Tasks

1. Using the Blue Jays data, fit a regression model to predict body mass using head size, and make the residual plot and the qqplot.  What do they tell you about the regression model?

2. Find and interpret the slope and intercept for this regression model in the context of these data.

3. Predict the body mass of a blue jay with a head size of 57 mm and of one with a head size of 53 mm.

4. Find $r^2$ for this model and interpret in the context of these data.

5. Create a $99\%$ confidence interval for the slope and interpret it.

6. Find the p-value for the hypothesis test of the slope.  What do you conclude from that?

7. Find and interpret the $RMSE$ for this model.