Calculate the Variance Inflation Factor (VIF) to estimate multicollinearity

You need to import two functions: variance_inflation_factor and add_constant (both from the statsmodels package)

Start by constructing a pandas data frame with the following structure:

- Attribute 'a': [1, 1, 2, 3, 4],
- Attribute 'b': [2, 2, 3, 2, 1],
- Attribute 'c': [4, 6, 7, 8, 9],
- Attribute 'd': [4, 3, 4, 5, 4]

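Add a constant column to your data frame (see add_constant above). variance_inflation_factor does not add an intercept on its own, so the constant must be part of the matrix you pass in; otherwise the VIFs are not meaningful.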

Calculate the VIF for each column (a, b, c, d and the constant)


In [None]:
#Calculate Variance Inflation Factor (VIF) to estimate multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
import pandas as pd

df = pd.DataFrame(
    {'a': [1, 1, 2, 3, 4],
     'b': [2, 2, 3, 2, 1],
     'c': [4, 6, 7, 8, 9],
     'd': [4, 3, 4, 5, 4]}
)

x = add_constant(df)

vif_factors= pd.Series([variance_inflation_factor(x.values, i) for i in range(x.shape[1])], index=x.columns)
print("VIF: \n",vif_factors)
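
To see what the numbers mean, the VIF of a feature can be recomputed by hand as 1 / (1 - R^2), where R^2 comes from regressing that feature on all other features plus an intercept. The following is a minimal sketch using only numpy; the helper name manual_vif is made up for this illustration and is not part of statsmodels. For the non-constant columns its results should agree with the statsmodels output up to rounding.

```python
import numpy as np

def manual_vif(features, idx):
    """VIF of column idx: 1 / (1 - R^2) from regressing it on the other columns plus an intercept."""
    features = np.asarray(features, dtype=float)
    y = features[:, idx]
    others = np.delete(features, idx, axis=1)
    # Design matrix: intercept column followed by the remaining features
    design = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = residuals @ residuals
    ss_tot = ((y - y.mean()) ** 2).sum()
    r_squared = 1.0 - ss_res / ss_tot
    return 1.0 / (1.0 - r_squared)

# Same values as attributes a, b, c, d of the lab data frame (no constant column here,
# because manual_vif adds the intercept itself)
data = np.array([
    [1, 2, 4, 4],
    [1, 2, 6, 3],
    [2, 3, 7, 4],
    [3, 2, 8, 5],
    [4, 1, 9, 4],
])

manual_vifs = [manual_vif(data, i) for i in range(data.shape[1])]
for name, value in zip("abcd", manual_vifs):
    print(name, value)
```

A VIF of 1 means the feature is not linearly explained by the others at all; the larger the value, the more of its variance the other features already account for.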

Now we slightly change our data frame to create some structural multicollinearity: the values of attributes a, b and c remain the same, but attribute d is now calculated as the product of the a and b values.

Check the effect of the adapted values on the VIFs!

In [None]:
# Construct a data frame with some structural multicollinearity:
# d = a * b

df_sm = pd.DataFrame(
    {'a': [1, 1, 2, 3, 4],
     'b': [2, 2, 3, 2, 1],
     'c': [4, 6, 7, 8, 9]}
)

df_sm['d']=df_sm['a']*df_sm['b']

x = add_constant(df_sm)
vif_factors= pd.Series([variance_inflation_factor(x.values, i) for i in range(x.shape[1])], index=x.columns)
print("VIF: \n",vif_factors)


Use the same data frame with some structural multicollinearity you just created.

Center the values of the features to decrease structural multicollinearity

Hint: there is a very helpful pandas function called "mean" that could be useful :)

In [None]:
# Center the values of the features to decrease structural multicollinearity:
# d = a * b
print("Original values: \n",df_sm)
#print("Mean values: \n",df_sm.mean())

df_centered = df_sm-df_sm.mean()

print("\n Centered values: \n",df_centered)
    
x = add_constant(df_centered)
vif_factors= pd.Series([variance_inflation_factor(x.values, i) for i in range(x.shape[1])], index=x.columns)
print("\nVIF - centered: \n",vif_factors)

Any effect on the VIF values? If you centered attribute "d" the same way as the other ones (subtracting the column mean from each row), there should be no effect: as long as the model contains a constant, shifting a column by a fixed amount does not change its VIF.
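
The reason plain centering changes nothing can be checked directly: subtracting a column mean only shifts the column, and an intercept absorbs any shift, so the R^2 behind each VIF stays the same. Below is a small numpy-only sketch of that check; the helper vif_with_intercept is a hypothetical name used for illustration, and it adds the intercept itself rather than relying on add_constant.

```python
import numpy as np

def vif_with_intercept(features, idx):
    """1 / (1 - R^2) for column idx regressed on the other columns plus an intercept."""
    y = features[:, idx]
    others = np.delete(features, idx, axis=1)
    design = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    r_squared = 1.0 - (residuals @ residuals) / ((y - y.mean()) ** 2).sum()
    return 1.0 / (1.0 - r_squared)

a = np.array([1.0, 1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 2.0, 3.0, 2.0, 1.0])
c = np.array([4.0, 6.0, 7.0, 8.0, 9.0])
d = a * b  # structural multicollinearity: d is built from a and b

raw = np.column_stack([a, b, c, d])
shifted = raw - raw.mean(axis=0)  # every column centered, including d

vif_raw = [vif_with_intercept(raw, i) for i in range(raw.shape[1])]
vif_shifted = [vif_with_intercept(shifted, i) for i in range(shifted.shape[1])]

print("raw:    ", vif_raw)
print("shifted:", vif_shifted)
```

Both lists should be identical up to floating-point noise, which is why d has to be rebuilt from the centered a and b rather than just shifted.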

What we need to do with attribute "d" is recalculate its values from the centered a and b values!

Do this and check the resulting VIF again!

In [None]:
df_centered['d']=df_centered['a']*df_centered['b']

x = add_constant(df_centered)
vif_factors= pd.Series([variance_inflation_factor(x.values, i) for i in range(x.shape[1])], index=x.columns)
print("\nVIF - centered: \n",vif_factors)
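
Why rebuilding the product from centered values helps is easiest to see on a deliberately balanced toy example. The values below are invented for this sketch and are not the lab data: a and b form a small 2x2 design with means far from zero. The raw product a*b is visibly correlated with a, while the product of the centered columns is not.

```python
import numpy as np

# Hypothetical balanced design (not the lab data): every combination of two levels
a = np.array([9.0, 9.0, 11.0, 11.0])
b = np.array([4.0, 6.0, 4.0, 6.0])

# Product term built from the raw values
raw_product = a * b

# Product term built from the centered values
a_centered = a - a.mean()
b_centered = b - b.mean()
centered_product = a_centered * b_centered

corr_raw = np.corrcoef(a, raw_product)[0, 1]
corr_centered = np.corrcoef(a_centered, centered_product)[0, 1]

print("corr(a, a*b) with raw values:     ", corr_raw)
print("corr(a, a*b) with centered values:", corr_centered)
```

In this balanced case the centered correlation vanishes completely. In small or skewed samples it need not drop to zero, so the resulting VIFs should always be checked rather than assumed.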

Another solution to the multicollinearity problem is the use of standardized values (remember the feature scaling lab).

Convert your data frame to a numpy array and standardize the values

Again you need to recalculate the values for attribute "d".

Check the effect on the VIFs!

In [None]:
from sklearn.preprocessing import StandardScaler

array_sm=df_sm.values

std_scaler = StandardScaler()
std_scaler.fit(array_sm)

scaled_array_sm=std_scaler.transform(array_sm)

df_standardized=pd.DataFrame(scaled_array_sm,columns=df_sm.columns)
df_standardized['d']=df_standardized['a']*df_standardized['b']

x = add_constant(df_standardized)
vif_factors= pd.Series([variance_inflation_factor(x.values, i) for i in range(x.shape[1])], index=x.columns)
print("\nVIF - standardized: \n",vif_factors)
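
For comparison, the same standardization can be written out by hand: subtract the column mean and divide by the column standard deviation. The sketch below uses only numpy and assumes the population standard deviation (ddof=0), which, as far as I know, is what StandardScaler uses; note that pandas' DataFrame.std defaults to ddof=1 and would give slightly different values. The variable names are chosen for this illustration only.

```python
import numpy as np

# Attributes a, b, c of the lab data frame as columns
values = np.array([
    [1, 2, 4],
    [1, 2, 6],
    [2, 3, 7],
    [3, 2, 8],
    [4, 1, 9],
], dtype=float)

# z-score per column: zero mean, unit (population) standard deviation
z = (values - values.mean(axis=0)) / values.std(axis=0, ddof=0)

# Rebuild the product term from the standardized a and b columns
interaction = z[:, 0] * z[:, 1]
standardized = np.column_stack([z, interaction])

print(standardized)
```

The last column plays the role of attribute d; it is computed after scaling, just like in the cell above, instead of being scaled itself.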