# Day 21 - Multivariate Stats (continued)

Last week we covered the basics of multiple linear regression (MLR). Here are some remaining assumptions and considerations to keep in mind when fine-tuning your regression models.

In [2]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('data/insurance.csv')

y = df['charges']                    
X = df.select_dtypes(np.number).assign(const=1)
X = X.drop(columns=['charges'])
X.head()

model = sm.OLS(y, X)
results = model.fit()

print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                charges   R-squared:                       0.120
Model:                            OLS   Adj. R-squared:                  0.118
Method:                 Least Squares   F-statistic:                     60.69
Date:                Sun, 07 Apr 2024   Prob (F-statistic):           8.80e-37
Time:                        22:56:45   Log-Likelihood:                -14392.
No. Observations:                1338   AIC:                         2.879e+04
Df Residuals:                    1334   BIC:                         2.881e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
age          239.9945     22.289     10.767      0.0

### Multicollinearity

Multicollinearity refers to correlation among the features. MLR assumes that there is little to no correlation among features. High correlations among features do not bias the coefficient estimates, but they inflate the standard errors, which makes the estimates unstable and hard to interpret. When multicollinearity is high, the coefficient estimates are very sensitive to (i.e., fluctuate widely with) very small variations in the data. The condition number (Cond. No.) is one diagnostic of multicollinearity that is included in the StatsModels MLR results. Smaller numbers are better here. The condition number can never be below 1; values close to 1 are ideal, and values above roughly 30 are commonly read as a sign of problematic multicollinearity. But that rule of thumb assumes that the data are standardized, which ours are not (yet).
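To make the condition number concrete, here is a minimal sketch on synthetic data (not the insurance data). It assumes the 2-norm condition number from `np.linalg.cond`, the ratio of the largest to the smallest singular value of the design matrix, which should be close to what StatsModels reports. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 500

# Synthetic features on a standardized scale (hypothetical data)
x1 = rng.normal(size=n)
x2_independent = rng.normal(size=n)
x2_collinear = x1 + rng.normal(scale=0.01, size=n)  # almost a copy of x1

const = np.ones(n)
X_independent = np.column_stack([x1, x2_independent, const])
X_collinear = np.column_stack([x1, x2_collinear, const])

# Condition number: largest singular value divided by smallest singular value
cond_independent = np.linalg.cond(X_independent)
cond_collinear = np.linalg.cond(X_collinear)

print(f"independent features:    {cond_independent:.1f}")
print(f"near-collinear features: {cond_collinear:.1f}")
```

The independent design should land near 1, while the near-collinear design should be larger by orders of magnitude.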

Regardless, the condition number is not the best test of multicollinearity available today. The variance inflation factor (VIF) is a much better indicator of multicollinearity. It measures how much of each X feature can be explained by the other X features. We calculate VIF by ignoring the label and treating each X feature in turn as the label in a model based on all remaining X features. The R² of each model is plugged into a simple equation, VIF = 1 / (1 - R²).

VIF scores below 3 (great), 5 (good), and 10 (okay) have each been argued to indicate acceptable levels of multicollinearity, depending on the strictness desired. O'Brien (2007) provides a helpful discussion of which level is best. However, if your goal is a predictive model where parsimony and limited over-fitting are desired, I would recommend using 3 as the cutoff.
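It can help to translate those cutoffs back into R². The sketch below simply inverts the VIF formula; the helper names `vif_from_r2` and `r2_from_vif` are made up for illustration.

```python
def vif_from_r2(r_squared):
    """VIF = 1 / (1 - R^2) for the regression of one feature on the others."""
    return 1 / (1 - r_squared)


def r2_from_vif(vif):
    """Invert the formula: R^2 = 1 - 1 / VIF."""
    return 1 - 1 / vif


for cutoff in (3, 5, 10):
    print(f"VIF = {cutoff:>2} corresponds to R^2 = {r2_from_vif(cutoff):.3f}")
```

In other words, a cutoff of 3 tolerates a feature being about two-thirds explained by the other features, while a cutoff of 10 tolerates 90%.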

In [6]:
# Calculate the VIF score for each X feature and store the scores in a DataFrame

df_vif = pd.DataFrame(columns=['VIF'])

# Loop through the X features only (skip the constant) to generate a VIF score for each
for col in X.drop(columns=['const']):
    vif_y = X[col]  # Each X feature takes a turn being the label; a separate name avoids overwriting y
    # All remaining X features are used to predict that label (const is already in X)
    vifX = X.drop(columns=[col]).assign(const=1)

    r_squared = sm.OLS(vif_y, vifX).fit().rsquared  # Record the R squared from the model

    if r_squared < 1:  # Prevent division by zero runtime error
        vif = 1 / (1 - r_squared)
    else:
        vif = 100  # Arbitrary large value for a perfectly collinear feature
    df_vif.loc[col] = vif

# Print out the list of VIF scores sorted from highest (worst) to lowest (best)
df_vif.sort_values(by=['VIF'], ascending=False)

| Feature | VIF |
|---|---|
| age | 1.013816 |
| bmi | 1.012152 |
| children | 1.001874 |
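As a cross-check on the loop-based approach, the VIFs also equal the diagonal of the inverse of the features' correlation matrix. The sketch below compares the two routes on synthetic data (not the insurance data), using NumPy only; the function name `vif_by_regression` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Synthetic features (hypothetical): b is built to overlap with a, c is independent
a = rng.normal(size=n)
b = 0.6 * a + rng.normal(size=n)
c = rng.normal(size=n)
features = np.column_stack([a, b, c])


def vif_by_regression(features, j):
    """Regress column j on the other columns plus a constant, then apply 1 / (1 - R^2)."""
    target = features[:, j]
    others = np.delete(features, j, axis=1)
    design = np.column_stack([others, np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    r_squared = 1 - residuals.var() / target.var()
    return 1 / (1 - r_squared)


vif_loop = np.array([vif_by_regression(features, j) for j in range(features.shape[1])])

# Shortcut: diagonal of the inverse correlation matrix
vif_shortcut = np.diag(np.linalg.inv(np.corrcoef(features, rowvar=False)))

print(vif_loop)
print(vif_shortcut)
```

Both arrays should agree to floating-point precision, with the independent feature `c` sitting closest to 1.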


### Feature Scaling

Feature scaling is a method used to adjust the range of feature values to the same scale. Although some algorithms (e.g., MLR fit by ordinary least squares) do not depend on feature scaling to produce results, other algorithms will produce distorted results if features have different ranges. Another reason to scale is that many iterative algorithms converge faster (e.g., find the minimum sum of squared residuals in fewer steps) if all features are on the same scale. Methods that commonly benefit from scaling include MLR fit by gradient descent, logistic regression, nearest neighbors, neural networks, support vector machines, principal components analysis, and linear discriminant analysis. You may not know what all of those mean yet, but just know that scaling is a good idea.
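The convergence point can be illustrated with a small sketch. It assumes the regression is fit by plain gradient descent on mean squared error rather than by the closed-form OLS solution that StatsModels uses. The data are synthetic and the function name `gradient_descent_steps` is hypothetical.

```python
import numpy as np


def gradient_descent_steps(X, y, tol=1e-6, max_steps=20_000):
    """Count gradient descent steps on mean squared error until the gradient is tiny."""
    n = len(y)
    hessian = 2 / n * X.T @ X
    learning_rate = 1 / np.linalg.eigvalsh(hessian).max()  # safe fixed step: 1 / largest curvature
    weights = np.zeros(X.shape[1])
    for step in range(1, max_steps + 1):
        gradient = 2 / n * X.T @ (X @ weights - y)
        if np.linalg.norm(gradient) < tol:
            return step
        weights -= learning_rate * gradient
    return max_steps  # did not converge within the step budget


def min_max(column):
    return (column - column.min()) / (column.max() - column.min())


rng = np.random.default_rng(2)
n = 300
x1 = rng.uniform(18, 64, size=n)      # hypothetical feature with a small range
x2 = rng.uniform(0, 50_000, size=n)   # hypothetical feature with a large range
y = 0.3 * min_max(x1) + 0.5 * min_max(x2) + rng.normal(scale=0.05, size=n)

const = np.ones(n)
X_raw = np.column_stack([x1, x2, const])
X_scaled = np.column_stack([min_max(x1), min_max(x2), const])

steps_raw = gradient_descent_steps(X_raw, y)
steps_scaled = gradient_descent_steps(X_scaled, y)
print(f"raw features:    {steps_raw} steps")
print(f"scaled features: {steps_scaled} steps")
```

With features on very different ranges, the raw version should exhaust its step budget, while the scaled version should stop far earlier.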

It is important to note that scaling does not change the shape of the distribution of the data, only the range of the values. What is the implication of that? It means that problems with normality (e.g., skewness, kurtosis) and heteroscedasticity will not change after scaling. Those issues are resolved prior to scaling with mathematical adjustments (e.g., a log transformation).
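A quick way to convince yourself is to compare skewness before and after scaling. This is a minimal sketch on synthetic right-skewed data; the `skewness` helper is a hand-rolled population-style estimate, not a library function.

```python
import numpy as np


def skewness(values):
    """Skewness as the mean cubed z-score (population form)."""
    z = (values - values.mean()) / values.std()
    return (z ** 3).mean()


rng = np.random.default_rng(3)
raw = rng.lognormal(mean=9, sigma=1, size=2000)  # synthetic right-skewed values

scaled = (raw - raw.min()) / (raw.max() - raw.min())  # min-max scaling to [0, 1]
logged = np.log(raw)                                  # a nonlinear adjustment

print(f"skewness raw:    {skewness(raw):.3f}")
print(f"skewness scaled: {skewness(scaled):.3f}")
print(f"skewness logged: {skewness(logged):.3f}")
```

The raw and scaled values should report the same skewness, while the log-transformed values should be far closer to zero. That is why the transformation has to come before the scaling.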

In [7]:
from sklearn import preprocessing

The scikit-learn preprocessing module includes several built-in scalers, among them StandardScaler (z-score), MinMaxScaler, RobustScaler, and Normalizer (which rescales each row rather than each column). The first two are the most common. We will use MinMaxScaler.
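For reference, the two most common scalers boil down to one-line formulas. The sketch below reproduces them with NumPy on a handful of illustrative values. It assumes the default settings of each scaler, where StandardScaler divides by the population standard deviation (`ddof=0`).

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 6.0, 14.0])  # illustrative values only

# z-score: subtract the mean, divide by the population standard deviation
z_scores = (values - values.mean()) / values.std(ddof=0)

# min-max: subtract the minimum, divide by the range
min_max = (values - values.min()) / (values.max() - values.min())

print(z_scores)
print(min_max)
```

The z-scores end up with mean 0 and standard deviation 1, while the min-max values run from exactly 0 to exactly 1.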

In [11]:
# Min-Max Normalization
# Other scalers live in the same module (e.g., preprocessing.StandardScaler)

df2 = df.select_dtypes(np.number)

df_minmax = pd.DataFrame(preprocessing.MinMaxScaler().fit_transform(df2), columns=df2.columns)

df_minmax.describe()

| Statistic | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 0.461022 | 0.395572 | 0.218984 | 0.193916 |
| std | 0.305434 | 0.164062 | 0.241099 | 0.193301 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.195652 | 0.27808 | 0.0 | 0.057757 |
| 50% | 0.456522 | 0.388485 | 0.2 | 0.131849 |
| 75% | 0.717391 | 0.504002 | 0.4 | 0.2477 |
| max | 1.0 | 1.0 | 1.0 | 1.0 |


In [12]:
df2.describe()

| Statistic | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |
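The two summaries are linked by the min-max formula, (value - min) / (max - min). As a sanity check, the sketch below plugs the unscaled minimum, 25th percentile, and maximum of three columns into that formula. The numbers are copied from the unscaled summary, and the helper name `min_max` is illustrative.

```python
def min_max(value, col_min, col_max):
    """Min-max formula for a single value."""
    return (value - col_min) / (col_max - col_min)


# (min, 25%, max) for each column, copied from the unscaled describe() output
raw_summary = {
    "age": (18.0, 27.0, 64.0),
    "bmi": (15.96, 26.29625, 53.13),
    "charges": (1121.8739, 4740.28715, 63770.42801),
}

scaled_q1 = {
    name: min_max(q1, lo, hi) for name, (lo, q1, hi) in raw_summary.items()
}
for name, value in scaled_q1.items():
    print(f"{name}: {value:.6f}")
```

The results (0.195652, 0.278080, 0.057757) match the 25% row of the scaled summary. A linear rescaling moves each percentile to the corresponding position on the new scale.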


Rerunning the MLR with the new min-max scaled variables.

In [15]:
y = df_minmax['charges']                    
X = df_minmax.assign(const=1)
X = X.drop(columns=['charges'])

model2 = sm.OLS(y, X)
results2 = model2.fit()

print(results2.summary())

                            OLS Regression Results                            
Dep. Variable:                charges   R-squared:                       0.120
Model:                            OLS   Adj. R-squared:                  0.118
Method:                 Least Squares   F-statistic:                     60.69
Date:                Sun, 07 Apr 2024   Prob (F-statistic):           8.80e-37
Time:                        23:37:38   Log-Likelihood:                 386.57
No. Observations:                1338   AIC:                            -765.1
Df Residuals:                    1334   BIC:                            -744.3
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
age            0.1762      0.016     10.767      0.0
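Comparing the two summaries, R-squared, the F-statistic, and the t-statistic for age are identical. Only the units of the coefficient changed. Because both age and charges were min-max scaled, the scaled coefficient should equal the raw coefficient multiplied by the range of age and divided by the range of charges. A quick arithmetic check, using only numbers printed above:

```python
age_coef_raw = 239.9945   # age coefficient from the unscaled model
age_se_raw = 22.289       # its standard error

age_range = 64.0 - 18.0                   # max - min of age
charges_range = 63770.42801 - 1121.8739   # max - min of charges

rescale = age_range / charges_range
age_coef_scaled = age_coef_raw * rescale
age_se_scaled = age_se_raw * rescale

print(round(age_coef_scaled, 4))  # 0.1762
print(round(age_se_scaled, 3))    # 0.016
```

Both values match the scaled model's output, which is consistent with scaling changing the units of an OLS fit but not its quality.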