## Quadratic Regression and Multicollinearity

This Jupyter notebook covers two topics in multiple regression.  The first is quadratic regression.  The second is multicollinearity, which arises when the features/predictors are correlated with one another.


### Quadratic Regression

When the relationship between the target/response and a feature/predictor is non-linear, a quadratic model can be useful for prediction.

The model for a quadratic regression is:

$$Y= \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \epsilon.$$
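
Although the model contains $X_1^2$, it is still linear in the coefficients, so ordinary least squares can fit it. Below is a minimal sketch of that idea on synthetic data. The data, the seed and the "true" coefficients are assumptions made up for illustration; they are not the penguins data used later.

```python
import numpy as np

# Synthetic data (assumed for illustration): a known quadratic plus noise
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.3, size=x.size)

# Design matrix with columns [1, x, x^2]; the model is linear in the betas
A = np.column_stack([np.ones_like(x), x, x**2])

# Least squares estimates of beta_0, beta_1, beta_2
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

print("Estimated beta_0, beta_1, beta_2:", beta_hat)
```

The estimates should land close to the values used to generate the data, which is the same thing we do below when we add a squared column and fit a linear regression.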



In [31]:
# reading in the libraries and functions that we will need as we do this work.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 
import scipy.stats as st
import statsmodels.api as sm 
import pylab as py 

# here are some of the tools we will use for our analyses
from sklearn.linear_model import LinearRegression
from sklearn.metrics import PredictionErrorDisplay
from sklearn.metrics import root_mean_squared_error
from sklearn.metrics import r2_score

from statsmodels.stats.outliers_influence import variance_inflation_factor



We are going to start with the penguins data and a relationship that we looked at previously, the relationship between flipper length (*flipper_length_mm*) and body mass (*body_mass_g*).  

In [None]:
penguins = pd.read_csv("https://webpages.charlotte.edu/mschuck1/classes/DTSC2301/Data/penguins.csv", na_values=['NA'])
# remove rows with missing data
penguins.dropna(inplace=True)
penguins.head()

In [None]:
plt.scatter(penguins['flipper_length_mm'], penguins['body_mass_g'], color="blue")

# Add labels and title
plt.xlabel('flipper length in mm')
plt.ylabel('body mass in g')
plt.title('Plot of flipper length vs body mass')

# Show the plot
plt.show()

When we looked at this plot before, we decided it was roughly linear, positive and moderate to strong.  
Let's fit a linear model to these data and look at the residual plot.

In [None]:
X = penguins[['flipper_length_mm']]  
y = penguins['body_mass_g']  


# Create a linear regression model
p_model = LinearRegression()

# Fit the model on the  data
p_model.fit(X, y)

# Make predictions on the  data
y_hat = p_model.predict(X)

# Evaluate the model performance
rmse = root_mean_squared_error(y, y_hat)
print('Root Mean Squared Error:', rmse)

# Get the coefficients and intercept
print('Coefficients:', p_model.coef_)
print('Intercept:', p_model.intercept_)

In [None]:
x2 = sm.add_constant(X)

#fit linear regression model
model2 = sm.OLS(y, x2).fit()

#view model summary
print(model2.summary())

In [None]:

# below makes a residual vs predicted values plot
display = PredictionErrorDisplay(y_true=y, y_pred=y_hat)
display.plot()
plt.show()

Looking at this plot, there are slight indications of a pattern in the residuals.  
Looking carefully, I see a slight downward trend followed by an upward trend, which makes me 
want to consider adding a quadratic term to this model.

In [None]:
# add the squared flipper length as a new column
penguins['flipper_length_sq'] = penguins['flipper_length_mm'] ** 2
X = penguins[['flipper_length_mm', 'flipper_length_sq']]  
y = penguins['body_mass_g']  


# Create a linear regression model
p_model = LinearRegression()

# Fit the model on the  data
p_model.fit(X, y)

# Make predictions on the  data
y_hat = p_model.predict(X)

# Evaluate the model performance
rmse = root_mean_squared_error(y, y_hat)
print('Root Mean Squared Error:', rmse)

# Get the coefficients and intercept
print('Coefficients:', p_model.coef_)
print('Intercept:', p_model.intercept_)

In [None]:
x2 = sm.add_constant(X)

#fit linear regression model
model2 = sm.OLS(y, x2).fit()

#view model summary
print(model2.summary())

In [None]:
# below makes a residual vs predicted values plot
display = PredictionErrorDisplay(y_true=y, y_pred=y_hat)
display.plot()
plt.show()

That is a better residual plot.  The RMSE has dropped slightly and the $r^2$ is slightly higher.
Further, the model output shows that the coefficient on the quadratic term is discernibly different from zero.
Together, these suggest that the quadratic model is better.  

To summarize, using a quadratic model is useful 1) when there is a clear quadratic pattern to the relationship
between the target/response and the feature/predictor or 2) when there is 
a quadratic pattern in the residuals vs fitted plot after fitting a linear model.
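
The second situation can be sketched on synthetic data: fit a line, check the residuals for curvature, then compare the RMSE of the linear and quadratic fits. Everything here is an assumption for illustration (the data, the seed, and the `rmse` helper); the curvature check is just one rough numerical stand-in for eyeballing the residual plot.

```python
import numpy as np

# Synthetic data (assumed for illustration): a curved relationship plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=300)
y = 5.0 + 1.5 * x + 0.4 * x**2 + rng.normal(scale=2.0, size=x.size)

def rmse(y_true, y_pred):
    """Root mean squared error (hypothetical helper for this sketch)."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Fit a straight line and a quadratic with least squares
lin_coefs = np.polyfit(x, y, deg=1)
quad_coefs = np.polyfit(x, y, deg=2)

rmse_lin = rmse(y, np.polyval(lin_coefs, x))
rmse_quad = rmse(y, np.polyval(quad_coefs, x))

# Rough curvature check: correlate the linear-fit residuals
# with the squared, centered predictor
resid_lin = y - np.polyval(lin_coefs, x)
curvature_corr = np.corrcoef(resid_lin, (x - x.mean()) ** 2)[0, 1]

print("RMSE linear:   ", rmse_lin)
print("RMSE quadratic:", rmse_quad)
print("Residual curvature correlation:", curvature_corr)
```

A correlation well away from zero says the residuals from the line still carry a U-shaped (or inverted U-shaped) pattern, which is what we look for by eye in the residuals vs fitted plot.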

### Multicollinearity

Multicollinearity occurs when there is substantial correlation between the features/predictors.

We will use some old cigarette data.

[<http://jse.amstat.org/v2n1/datasets.mcintyre.html>]

In [None]:
cigs = pd.read_csv("https://webpages.charlotte.edu/mschuck1/classes/DTSC2301/Data/cigarettes.csv", na_values=['NA'])
cigs.head()

Our goal in this analysis is to predict carbon monoxide (CO) from the other variables.  We will start by building three
separate linear regressions, one each with Tar, Nicotine and Weight as the only predictor.  

In [None]:
X=cigs[['Tar']]
y=cigs['CO']
x2 = sm.add_constant(X)

#fit linear regression model
model2 = sm.OLS(y, x2).fit()

#view model summary
print(model2.summary())

In [None]:
X=cigs[['Nicotine']]
y=cigs['CO']
x2 = sm.add_constant(X)

#fit linear regression model
model2 = sm.OLS(y, x2).fit()

#view model summary
print(model2.summary())

In [None]:
X=cigs[['Weight']]
y=cigs['CO']
x2 = sm.add_constant(X)

#fit linear regression model
model2 = sm.OLS(y, x2).fit()

#view model summary
print(model2.summary())

So all of the predictors are significant on their own.  Tar and Nicotine are particularly strong predictors of CO.

Next we will put them all into a model and look at the result.

In [None]:
X=cigs[['Tar', 'Nicotine','Weight']]
y=cigs['CO']
x2 = sm.add_constant(X)

#fit linear regression model
model2 = sm.OLS(y, x2).fit()

#view model summary
print(model2.summary())

A couple things to note here:
1.  $r^2$ is not appreciably higher than it was for the model with Tar alone.
2.  In each of the individual models, all of the predictors were significant.
3.  For the full model, only Tar is significant.  
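
One way to see why this can happen is that correlated predictors inflate the standard errors of the coefficient estimates. The sketch below uses synthetic data, not the cigarette data; the variables `x1` and `x2`, the seed, and the helper `ols_with_se` are all hypothetical names introduced for illustration.

```python
import numpy as np

def ols_with_se(X, y):
    """OLS coefficients and standard errors; X must already include a constant column."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)          # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)     # covariance of the estimates
    return beta, np.sqrt(np.diag(cov))

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)       # x2 is nearly a copy of x1
y = 2.0 + 1.0 * x1 + rng.normal(size=n)

const = np.ones(n)
_, se_alone = ols_with_se(np.column_stack([const, x1]), y)
_, se_both = ols_with_se(np.column_stack([const, x1, x2]), y)

print("SE of x1 coefficient, x1 alone: ", se_alone[1])
print("SE of x1 coefficient, x1 and x2:", se_both[1])
```

With the near-duplicate predictor in the model, the standard error for `x1` is much larger, so its t-statistic shrinks even though `x1` is still related to `y`.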

The code below computes the pairwise correlations among our predictors.

In [None]:
# make a correlation matrix of the predictors
r = X.corr()
print(r)

The next set of code is for calculating VIF, variance inflation factor.  VIF is a way of detecting if multicollinearity is a problem. 

In [None]:


# VIF dataframe
vif_data = pd.DataFrame()
vif_data["feature"] = x2.columns

# calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(x2.values, i)
                          for i in range(len(x2.columns))]

# skip row 0, which is the constant term
print(vif_data.iloc[1:])

The output above suggests that there is substantial multicollinearity, and we should be worried about Tar and Nicotine since their VIF values are greater than 10.
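
For intuition about what VIF measures: the VIF for predictor $j$ is $1/(1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on all of the other predictors. Below is a minimal sketch that computes this by hand on synthetic data. The function name `vif_by_hand`, the data and the seed are assumptions for illustration; in practice `variance_inflation_factor` applied to a design matrix that includes the constant should give matching values.

```python
import numpy as np

def vif_by_hand(X):
    """VIF for each column of X (predictors only, no constant column)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # regress predictor j on a constant plus the other predictors
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Synthetic predictors (assumed for illustration)
rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # strongly correlated with x1
x3 = rng.normal(size=n)                   # unrelated to the others

vifs = vif_by_hand(np.column_stack([x1, x2, x3]))
print("VIF for x1, x2, x3:", vifs)
```

The two correlated predictors get large VIF values, while the unrelated one stays near 1, which is the smallest value a VIF can take.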

### Tasks

1. Repeat the above multicollinearity analysis for the penguins data.  Fit a multiple regression model for predicting penguin body mass using flipper length, bill length and bill depth.  Determine if multicollinearity is a problem with this model.  (For now, ignore the quadratic term for flipper length.)