## Interpreting your coefficients with multicollinearity

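**Multicollinearity** occurs when one predictor variable in your regression model can be accurately predicted from the other predictors. When that happens, the model cannot cleanly separate the predictors' individual effects, so the coefficient estimates become unstable and their standard errors grow.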
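To make the definition concrete, here is a minimal sketch on synthetic data (not the iris data used below). The variables `x1`, `x2`, and `x3` are made up for illustration: `x3` is built as roughly the sum of the other two, so regressing it on them should give an R² close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two independent predictors and a third that is nearly their sum
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.1, size=n)

# Regress x3 on x1 and x2 (with an intercept) using least squares
A = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(A, x3, rcond=None)
fitted = A @ coef

# R^2 of this auxiliary regression: close to 1 means x3 is almost redundant
ss_res = np.sum((x3 - fitted) ** 2)
ss_tot = np.sum((x3 - x3.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"R^2 of x3 on x1 and x2: {r_squared:.3f}")
```

An R² this high means `x3` carries almost no information beyond `x1` and `x2`, which is the situation the definition describes.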


In [1]:
import seaborn as sns
import pandas as pd
import numpy as np

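# Load the iris data set
iris = sns.load_dataset("iris")

# Keep only versicolor and virginica so the target is binary;
# .copy() avoids pandas' SettingWithCopyWarning on the assignment below
iris = iris[iris['species'] != 'setosa'].copy()

# Encode the target: versicolor = 1, virginica = 0
iris['species'] = (iris['species'] == 'versicolor').astype(int)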



In [2]:
y = iris['species']
X = iris[['sepal_length','sepal_width','petal_length','petal_width']]

In [3]:
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant

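# Add an intercept column; statsmodels does not include one by default
X = add_constant(X)

# Fit a logistic regression of species on the four measurements
model = sm.Logit(y, X)
result = model.fit()

print(result.summary())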

Optimization terminated successfully.
         Current function value: 0.059493
         Iterations 12
                           Logit Regression Results                           
Dep. Variable:                species   No. Observations:                  100
Model:                          Logit   Df Residuals:                       95
Method:                           MLE   Df Model:                            4
Date:                Thu, 16 Jul 2020   Pseudo R-squ.:                  0.9142
Time:                        15:15:24   Log-Likelihood:                -5.9493
converged:                       True   LL-Null:                       -69.315
Covariance Type:            nonrobust   LLR p-value:                 1.947e-26
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
const           42.6378     25.708      1.659      0.097      -7.748      93.024
sepal_length     2.465

**Variance Inflation Factor (VIF)** measures how much multicollinearity inflates the variance of a coefficient estimate. A common rule of thumb treats a VIF above 5 (or, less strictly, above 10) as high, which tells us to be wary of interpreting the model coefficients.
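For a predictor j, the usual definition is VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all the other predictors. The sketch below computes that by hand with NumPy on synthetic data. The helper name `vif_by_hand` and the columns `a`, `c`, and `d` are hypothetical. It is meant to show the arithmetic, not to stand in for the statsmodels routine used in this notebook.

```python
import numpy as np

def vif_by_hand(predictors):
    """Return 1 / (1 - R^2) for each column regressed on the others."""
    predictors = np.asarray(predictors, dtype=float)
    n, k = predictors.shape
    vifs = []
    for j in range(k):
        target = predictors[:, j]
        others = np.delete(predictors, j, axis=1)
        # Auxiliary regression of column j on the other columns plus an intercept
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r_squared = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic data (not the iris set): c is almost a copy of a, d is unrelated
rng = np.random.default_rng(1)
a = rng.normal(size=500)
c = a + rng.normal(scale=0.2, size=500)
d = rng.normal(size=500)

demo_vifs = vif_by_hand(np.column_stack([a, c, d]))
print(demo_vifs)
```

The two near-duplicate columns should come out with large VIFs, while the unrelated column should stay close to 1, the minimum possible value.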

In [4]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute one VIF per column of X (the constant added earlier is column 0)
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

In [5]:
# Pair each column name with its VIF
pd.DataFrame(list(zip(X.columns, vif)))

              0           1
0         const  125.170277
1  sepal_length    3.990113
2   sepal_width    1.721954
3  petal_length    7.252447
4   petal_width    3.948354
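One way to read this table against the rule of thumb is sketched below. The VIF values are copied from the output above. The column names `feature` and `vif` and the cutoff of 5 are assumptions made for this illustration. The `const` row is set aside because it describes the intercept rather than a predictor.

```python
import pandas as pd

# VIF values copied from the output above
vif_table = pd.DataFrame({
    "feature": ["const", "sepal_length", "sepal_width", "petal_length", "petal_width"],
    "vif": [125.170277, 3.990113, 1.721954, 7.252447, 3.948354],
})

# Set aside the intercept row; it is not one of the predictors
predictors_only = vif_table[vif_table["feature"] != "const"]

# Flag predictors at or above the assumed cutoff (lower end of the 5-to-10 rule of thumb)
threshold = 5
flagged = predictors_only[predictors_only["vif"] >= threshold]
print(flagged)
```

With this cutoff, only `petal_length` is flagged, so its coefficient is the one to interpret most cautiously.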
