In [None]:
# reading in the libraries and functions that we will need as we do this work.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 
import scipy.stats as st
import statsmodels.api as sm 
import pylab as py 

# here are some of the tools we will use for our analyses
from sklearn.linear_model import LinearRegression
from sklearn.metrics import PredictionErrorDisplay
from sklearn.metrics import root_mean_squared_error
from sklearn.metrics import r2_score

from statsmodels.stats.outliers_influence import variance_inflation_factor

### Outliers, Leverage, Influence

In this Jupyter Notebook we will look at outliers, leverage points, and influential points.  An outlier
is an observation with an unusually large residual.  Leverage is a function of where an observation's
predictor values fall relative to those of the other data.  If an observation is far from the others,
then it potentially has leverage to change our prediction equation.

An influential point is one that has both a large leverage and a large residual (when that point 
is removed from the fit), so that it is *influencing* the prediction equation.
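
To make the idea of leverage concrete, here is a minimal sketch on made-up data (the `x` values are hypothetical, not from the blue jay file). It computes leverage as the diagonal of the hat matrix $H = X(X^TX)^{-1}X^T$, which depends only on the predictors and not on the response.

```python
import numpy as np

# Hypothetical toy data: five x-values close together and one far away.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])

# Design matrix with an intercept column, as sm.add_constant would build.
X = np.column_stack([np.ones_like(x), x])

# Hat matrix H = X (X'X)^{-1} X'; its diagonal holds the leverage values.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

print(np.round(leverage, 3))
# The leverages sum to the number of fitted coefficients (here 2).
print(round(leverage.sum(), 6))
```

The last observation sits far from the others in `x`, so it should carry by far the largest leverage, whatever its response value turns out to be.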

We'll start with the blue jay data below.



In [None]:
# read in the blue jay data
bluejay = pd.read_csv("https://webpages.charlotte.edu/mschuck1/classes/DTSC2301/Data/BlueJays.csv", na_values=['NA'])
# remove rows with missing data
bluejay.dropna(inplace=True)
bluejay.head()

In [None]:

# below we build a multiple regression model with three predictors
#  Predictors here are Head, BillDepth, and BillLength
# Our target variable will be the Mass of the blue jay 

X = bluejay[['Head', 'BillDepth', 'BillLength']]  
y = bluejay['Mass']  


# Create a linear regression model
blue_model = LinearRegression()

# Fit the model on the  data
blue_model.fit(X, y)

# Make predictions on the  data
y_hat = blue_model.predict(X)

# Evaluate the model performance
rmse = root_mean_squared_error(y, y_hat)
print('Root Mean Squared Error:', rmse)

# Get the coefficients and intercept
print('Coefficients:', blue_model.coef_)
print('Intercept:', blue_model.intercept_)

In [None]:
# for this particular model formulation we need to add a 
# column of 1's to the feature array
#add constant to predictor variables
x2 = sm.add_constant(X)

#fit linear regression model using OLS
blue_model2 = sm.OLS(y, x2).fit()

#view model summary
print(blue_model2.summary())

In [None]:
# VIF dataframe
vif_data = pd.DataFrame()
vif_data["feature"] = x2.columns

# calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(x2.values, i)
                          for i in range(len(x2.columns))]

print(vif_data[1:len(x2.columns)])

Multicollinearity does not seem to be a problem here.
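
As a rough illustration of what a VIF measures, the sketch below computes $1/(1-R_j^2)$ by regressing each predictor on the others. The helper name `vif` and the simulated columns are assumptions for illustration, and this is not a claim about how `variance_inflation_factor` is implemented. A commonly quoted rule of thumb treats values above roughly 5 to 10 as a warning sign.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (predictors only, no constant)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        # Regress column j on the remaining columns plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1.0 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.01 * rng.normal(size=200)   # nearly a copy of a
c = rng.normal(size=200)              # unrelated to a and b
v = vif(np.column_stack([a, b, c]))
print(np.round(v, 1))
```

The two nearly identical columns should show very large VIFs, while the unrelated column should stay close to 1.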

In [None]:
# Calculate leverage statistics
leverage = blue_model2.get_influence().hat_matrix_diag
print(leverage)



The above are our leverage values.  We're only going to worry if any of them are more
than $2(k+1)/n = 2 \cdot 4/123 \approx 0.0650$, where $k$ is the number of predictors and $n$ is the number of observations.  
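
The $2(k+1)/n$ rule can be sketched in a few lines. The data here are simulated, and the sample size, predictor count, and planted far-away row are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 3                      # hypothetical sample size and predictor count
X = rng.normal(size=(n, k))
X[0] = [6.0, 6.0, 6.0]            # plant one observation far from the rest

Xc = np.column_stack([np.ones(n), X])          # add the constant column
leverage = np.diag(Xc @ np.linalg.inv(Xc.T @ Xc) @ Xc.T)

cutoff = 2 * (k + 1) / n          # the 2(k+1)/n rule from the text
flagged = np.flatnonzero(leverage > cutoff)

print("cutoff:", cutoff)
print("flagged rows:", flagged)
# The average leverage is always (k+1)/n, so the cutoff is twice the average.
print(np.isclose(leverage.mean(), (k + 1) / n))
```

Because the leverages always average to $(k+1)/n$, this cutoff amounts to flagging any observation with more than twice the average leverage.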

In [None]:
# Calculate Cook's distance
cook_distance = blue_model2.get_influence().cooks_distance[0]
print(np.round(cook_distance,4))

The above are our Cook's distance values.  We will be concerned if any of them are more than
$0.5$ and we will be very concerned if any are more than $1.0$.
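
To see what Cook's distance is measuring, the sketch below computes it two ways on simulated data: with the usual closed form, $D_i = \frac{e_i^2}{p\,s^2}\cdot\frac{h_i}{(1-h_i)^2}$ with $p = k+1$, and by actually refitting without each row and measuring how far the fitted values move. The data, the planted point, and the helper `fit` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
x[0], y[0] = 5.0, -5.0                      # planted point: far away and off the line

X = np.column_stack([np.ones(n), x])
p = X.shape[1]                              # number of coefficients, k + 1

def fit(X, y):
    """Least-squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta = fit(X, y)
resid = y - X @ beta
s2 = resid @ resid / (n - p)                # residual variance estimate
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Closed form: D_i = e_i^2 / (p s^2) * h_i / (1 - h_i)^2
cooks = resid**2 / (p * s2) * h / (1 - h) ** 2

# Brute force: refit without row i and measure how far the fitted values move.
cooks_loo = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    shift = X @ (fit(X[keep], y[keep]) - beta)
    cooks_loo[i] = shift @ shift / (p * s2)

print(np.allclose(cooks, cooks_loo))
print("largest Cook's distance at row", np.argmax(cooks))
```

The two calculations should agree, which is why a point needs both a sizeable residual and sizeable leverage to earn a large Cook's distance.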

In [None]:
# Calculate studentized residuals
# Recall that standardized residuals are residuals divided by an estimate of their
# standard deviation computed from all of the residuals.
#
# Externally studentized residuals divide each residual by an estimate of its standard
# deviation computed with that particular observation left out of the fit.

studentized_residuals = blue_model2.get_influence().resid_studentized_external

print(studentized_residuals)

In [None]:

# here we will identify any that are of concern


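# Identify large leverage points.
# Note: this flags leverage more than 2 standard deviations above the mean leverage,
# which is a different rule from the 2(k+1)/n cutoff described earlier.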
leverage_points = np.where(leverage > np.mean(leverage) + 2 * np.std(leverage))
print(leverage_points)

# Identify influential observations based on Cook's distance
influential_observations = np.where(cook_distance > 0.5)
print(influential_observations)

x=bluejay['Head']
y=bluejay['Mass']


The leverage points are at positions 7, 17, 69 and 81, counting from 0 over the rows that remain after the missing data were dropped.

There do not appear to be any influential points.  Let's look at the leverage points.
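
One caution when looking rows up: `np.where` reports 0-based positions, while a DataFrame that has had rows dropped keeps its original index labels. The small frame below is hypothetical and only illustrates the difference between selecting by position (`iloc`) and recovering the original labels.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in row label 1.
df = pd.DataFrame({"a": [10.0, np.nan, 30.0, 40.0]})
df = df.dropna()                 # remaining index labels: 0, 2, 3

values = df["a"].to_numpy()
positions = np.where(values > 25)[0]     # positions in the filtered data

print(positions)                 # positions count from 0 over the kept rows
print(df.iloc[positions])        # position-based: picks the matching rows
print(df.index[positions])       # the original row labels of those rows
```

An alternative, if the original labels are not needed, is to call `reset_index(drop=True)` after `dropna` so that labels and positions coincide.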

In [None]:
# np.where returns 0-based positions, so select the rows by position with iloc
print(bluejay.iloc[leverage_points[0]])

It is hard to tell why these are leverage points without some additional information.  Let's look at the mean 
and standard deviation of the variables that are predictors/features.

In [None]:
print("Mean of Bill Depth is ", np.round(np.mean(bluejay['BillDepth']),2))
print("Standard deviation of Bill Depth is ", np.round(np.std(bluejay['BillDepth']),2))
print("Mean of Head is ", np.round(np.mean(bluejay['Head']),2))
print("Standard deviation of Head is ", np.round(np.std(bluejay['Head']),2))
print("Mean of Bill Length is ", np.round(np.mean(bluejay['BillLength']),2))
print("Standard deviation of Bill Length is ", np.round(np.std(bluejay['BillLength']),2))

Now we can get a sense of why these four observations were flagged as leverage points.  Below, large (small) is measured
in number of standard deviations above (below) the mean.

For the first one, the head size is very large as is the bill length.

For the second one, head size seems to be very large.

For the third one, all of the predictors are very small.

For the fourth one, bill length seems to be the variable that likely is leading to a large leverage.
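
The comparisons above amount to computing z-scores for each predictor. Here is a minimal sketch with made-up measurements (the numbers and the 1.5 standard deviation cutoff are assumptions chosen for the example, not values from the blue jay data).

```python
import numpy as np

# Hypothetical measurements: rows are birds, columns are three predictors.
X = np.array([
    [55.0, 8.0, 24.0],
    [56.0, 8.2, 25.0],
    [57.0, 8.4, 26.0],
    [56.0, 8.2, 25.0],
    [61.0, 8.2, 29.0],    # large first and third predictor
])

# Standardize each column: (value - column mean) / column standard deviation.
z = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.round(z, 2))

# Columns in which the last row is more than 1.5 standard deviations from the mean.
flagged_cols = np.flatnonzero(np.abs(z[-1]) > 1.5)
print(flagged_cols)
```

Reading across a flagged row's z-scores shows which predictors are pulling it away from the bulk of the data.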

In [None]:
# read in the data to dataframe called ames
ames = pd.read_csv("https://webpages.charlotte.edu/mschuck1/classes/DTSC2301/Data/Ames_house_prices.csv", na_values=['?'])
# replace any remaining ' ?' entries with NaN for missing values
# (replace returns a new DataFrame, so assign the result back)
ames = ames.replace([' ?'], np.nan)
# show information about the dataframe
ames.info()

In [None]:

model1=LinearRegression()

X = ames[['LotArea', 'GrLivArea', 'BsmtFinSF1']]
# going to make a transformation of the SalePrice by
# taking the natural logarithm of it.
y = np.log(ames['SalePrice'])

# fit the linear regression to the data.
model1.fit(X,y)

# make the residual vs fitted plot
y_hat = model1.predict(X)
# below makes a residuals vs. predicted values plot (the default kind)
display = PredictionErrorDisplay(y_true=y, y_pred=y_hat)
display.plot()
plt.show()


In [None]:
x2 = sm.add_constant(X)

#fit linear regression model using OLS
model1 = sm.OLS(y, x2).fit()

In [None]:
# Calculate leverage, Cook's distance, and studentized residuals
influence = model1.get_influence()
leverage = influence.hat_matrix_diag
cook_distance = influence.cooks_distance[0]
studentized_residuals = influence.resid_studentized_external



# Identify large leverage points
leverage_points = np.where(leverage > np.mean(leverage) + 2 * np.std(leverage))
print("These are the leverage points")
print(leverage_points)

# Identify influential observations based on Cook's distance
influential_observations = np.where(cook_distance > 0.5)
print("These are the influential points")
print(influential_observations)


So here we have eight leverage points and one influential point, in row 1298.

In [None]:
# np.where returns 0-based positions, so select the rows by position with iloc
print(ames.iloc[influential_observations[0]])


That SalePrice seems low, but it is hard to tell from the output we got.  Let's make it a bit more 
targeted.

In [None]:
print(ames[['LotArea', 'GrLivArea', 'BsmtFinSF1','SalePrice']].iloc[influential_observations[0]])

So the SalePrice seems low for such a large house.  Let's look at the studentized residual for this.

In [None]:
print(studentized_residuals[influential_observations])

That's quite a negative residual and, no doubt, part of the reason that this observation is flagged as influential.
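
For reference, an externally studentized residual can be computed without refitting the model for every row. The sketch below uses simulated data with one planted low response and the textbook shortcut $s_{(i)}^2 = \frac{(n-p)s^2 - e_i^2/(1-h_i)}{n-p-1}$, then checks it against an actual refit for one row. The data are assumptions, and this is a sketch of the standard formula rather than a statement about the internals of `resid_studentized_external`.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 25
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=n)
y[0] -= 6.0                                   # planted unusually low response

X = np.column_stack([np.ones(n), x])
p = X.shape[1]
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
s2 = resid @ resid / (n - p)

# Leave-one-out variance without refitting:
# s_(i)^2 = ((n - p) s^2 - e_i^2 / (1 - h_i)) / (n - p - 1)
s2_loo = ((n - p) * s2 - resid**2 / (1 - h)) / (n - p - 1)
t_ext = resid / np.sqrt(s2_loo * (1 - h))

# Check the shortcut for row 0 by actually refitting without it.
keep = np.arange(n) != 0
b0 = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
r0 = y[keep] - X[keep] @ b0
s2_refit = r0 @ r0 / (n - 1 - p)

print(np.isclose(s2_loo[0], s2_refit))
print("row 0 studentized residual:", round(t_ext[0], 2))
```

Leaving the observation out of the variance estimate keeps an extreme point from inflating its own yardstick, which is why a large negative value here is a strong signal.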

### Tasks

1. Fit a multiple regression to the penguins data with body mass as the response and flipper length and 
bill length as features.

2. Determine if there are any unusual residuals, any leverage points or any influential points in the regression
that you made in the previous task.