# Advanced logistic regression: the definitive guide to collinearity

Among the [most popular machine learning algorithms](https://www.kdnuggets.com/2016/08/10-algorithms-machine-learning-engineers.html), [Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression) is definitely one you should have a handle on.  There are a lot of online courses pushing neural networks and deep learning, but chances are you won't encounter either of these in interviews.

Instead, you should have a strong handle on the basics.  This includes models such as logistic regression and linear regression.

In this article we'll focus on logistic regression, a linear classifier.  More specifically, we'll focus on one of the major problems with linear models, collinearity.

We'll walk through a number of examples, along with some Python code.  After reading through this article you'll have a better grasp on collinearity, so that you can ace your next data science interview.

## Logistic Regression Overview
To start, I'm going to give a brief review of logistic regression.  I won't get into too much detail, because I want to focus on advanced topics you might encounter in interviews.  

Logistic regression is a supervised learning model.  The model predicts the probability of an outcome using the logistic function.

<img src="log_function3.jpg">

The model "learns", or estimates, the parameters, typically using [maximum likelihood estimation](https://stats.stackexchange.com/questions/112451/maximum-likelihood-estimation-mle-in-layman-terms).  I'll leave these details for another post, just be aware that this is happening behind the scenes.

It's typically used for classification problems with binary outcomes, but can also be used for classification problems with multiple outcomes.

Here are some [typical applications](https://www.quora.com/What-are-applications-of-linear-and-logistic-regression):
1. Customer churn 
2. Geographic image processing 
3. Handwriting recognition 
4. Healthcare analytics, such as risk of heart attack

Logistic regression is ["linear"](https://stats.stackexchange.com/questions/93569/why-is-logistic-regression-a-linear-classifier) because the decision boundary separating the classes is linear.  

<img src="linsep_new.png">

This is a common point of confusion, because when you look at the S shaped logistic function, it's certainly not linear.  

<img src="log_fun_lin.jpg">

What makes logistic regression "linear" is the linear combination of the features. 
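To make the point concrete, here is a minimal sketch.  The coefficients `b0` and `b1` are made up for illustration and are not fitted to any data.  It shows that the predicted probability is an S-shaped function of the feature, while the log-odds recover the straight line.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical coefficients, chosen only for illustration
b0, b1 = -1.0, 0.5

x = np.linspace(-10, 10, 5)
linear_part = b0 + b1 * x          # linear combination of the feature
p = logistic(linear_part)          # predicted probability (non-linear in x)
log_odds = np.log(p / (1 - p))     # log-odds (linear in x)

# the log-odds and the linear combination are the same numbers
print(np.allclose(log_odds, linear_part))
```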

<img src="log_odds.jpg">

Logistic regression is a commonly used model, but it shouldn't be used for highly non-linear data.  

This is a classic example of the [bias-variance tradeoff](https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/).  Because logistic regression is a linear classifier, it has a high bias towards this type of fit, so it won't be able to predict non-linear outcomes very well.  On the flip side, its lack of flexibility means that it has low variance: the fit doesn't change much from one training sample to the next.

What we gain in model interpretability from a simple linear model, we lose in flexibility to fit to non-linear data.  This bias-variance tradeoff is always a consideration.


## Collinearity 

Let's get started with some advanced topics using a hypothetical example.

You work in the risk department for a mortgage lender, and it's your job to pre-screen people for risk of default on loans.  

<img src="accept_reject.jpg">

There are a number of factors you're looking at, including income, credit rating, and credit limit.

Looking at all of these factors, you're trying to distinguish which ones are providing useful, somewhat unique, information.  You decide to take a closer look at credit score, and credit limit.  Here's what the data looks like:

<img src="rating_limit.jpg">

Obviously this data is highly correlated.  A person with a higher credit score usually has a higher credit limit.  

People with higher credit scores are probably less risky.  Likewise, people with high credit limits are probably less risky.  

Both of these measures seem to be providing useful information about credit, but it's essentially the same information.  The question is, which one is really contributing information to my model?

This is a classic case of collinearity.  Now let's dive deeper into some details, and see just exactly why it can be a problem.

## Model interpretability

Linear models are a great tool because they're easy to interpret.  If we can estimate the coefficients of the features, we can interpret how a change in each feature will affect the outcome of our model.  

Let's run through an example to illustrate what I mean.  I'll be using Python's [statsmodels](http://www.statsmodels.org/stable/index.html) library to fit the logistic regression models in this section.

First, start by downloading the dataset, default.csv (link to github).  Now import the libraries and read in the csv file.

In [4]:
import pandas as pd
import statsmodels.api as sm
import numpy as np

# Reading in and taking a look at the data
df = pd.read_csv("default.csv")
print(df.head())
print(df.tail())

   income  limit  rating  default
0   17456   5154     495        0
1   19103   5658     527        0
2   21531   4214     502        0
3   21711   6341     555        0
4   25455   4862     452        0
     income  limit  rating  default
342   55480    852      39        1
343   62202   3022     335        1
344   64063   2906     350        1
345   64663   2844     350        1
346   76462   4661     489        1


We'll also need to add a constant column of 1.0 so that a coefficient can be fit for the intercept.

In [5]:
# adding a column of ones to fit the intercept
df['intercept'] = 1.0
print(df.head())
print(df.tail())

   income  limit  rating  default  intercept
0   17456   5154     495        0        1.0
1   19103   5658     527        0        1.0
2   21531   4214     502        0        1.0
3   21711   6341     555        0        1.0
4   25455   4862     452        0        1.0
     income  limit  rating  default  intercept
342   55480    852      39        1        1.0
343   62202   3022     335        1        1.0
344   64063   2906     350        1        1.0
345   64663   2844     350        1        1.0
346   76462   4661     489        1        1.0


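Next I'm going to split my data into three folds for training and testing the model.  This is the same split used in k-fold cross-validation, which in my case is three-fold cross-validation.  Cross-validation is useful because the model is evaluated on data it didn't see during training, which helps reveal overfitting.  In this section, though, I'll fit a separate model on each fold so we can compare the coefficient estimates.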

*image of 3 fold cross validation*
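As a rough sketch of how the folds rotate in k-fold cross-validation, here is a version that only works with row positions.  It assumes 347 rows, based on the row labels 0 through 346 printed above.  Note that `np.array_split` produces folds of 116, 116, and 115 rows, which differs slightly from the hand-written slices used below.

```python
import numpy as np

n_rows = 347  # the dataset above has row labels 0 through 346
k = 3

# split the row positions into k nearly equal, contiguous folds
folds = np.array_split(np.arange(n_rows), k)

for i, test_idx in enumerate(folds):
    # every fold takes one turn as the held-out test set
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print("fold %d: train on %d rows, test on %d rows"
          % (i + 1, len(train_idx), len(test_idx)))
```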

In [6]:
# feature variables
features = ['intercept','income','limit','rating']

# creating 3 folds
fold1 = df[features][0:116]
y_fold1 = df['default'][0:116]

fold2 = df[features][116:231]
y_fold2 = df['default'][116:231]

fold3 = df[features][231:]
y_fold3 = df['default'][231:]

print("Fold 1 features:")
print(fold1.head())
print("Fold 1 outcomes:")
print(y_fold1.head())

Fold 1 features:
   intercept  income  limit  rating
0        1.0   17456   5154     495
1        1.0   19103   5658     527
2        1.0   21531   4214     502
3        1.0   21711   6341     555
4        1.0   25455   4862     452
Fold 1 outcomes:
0    0
1    0
2    0
3    0
4    0
Name: default, dtype: int64


Let's start by fitting a model to fold 1.  We'll keep all of the features to start.  

In [7]:
# fit the model on fold 1
logit = sm.Logit(y_fold1, fold1)
result = logit.fit()

# summary
result.summary()

Optimization terminated successfully.
         Current function value: 0.479695
         Iterations 7


| | | | |
|---|---|---|---|
| Dep. Variable: | default | No. Observations: | 116 |
| Model: | Logit | Df Residuals: | 112 |
| Method: | MLE | Df Model: | 3 |
| Date: | Wed, 27 Dec 2017 | Pseudo R-squ.: | 0.3078 |
| Time: | 10:42:29 | Log-Likelihood: | -55.645 |
| converged: | True | LL-Null: | -80.388 |
| | | LLR p-value: | 1.028e-10 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 5.6647 | 1.202 | 4.712 | 0.000 | 3.309 | 8.021 |
| income | -4.94e-05 | 1.23e-05 | -4.007 | 0.000 | -7.36e-05 | -2.52e-05 |
| limit | -0.0005 | 0.000 | -1.746 | 0.081 | -0.001 | 6.64e-05 |
| rating | -0.0026 | 0.004 | -0.739 | 0.460 | -0.009 | 0.004 |


Now we can take a look at the results and see what's going on.  

To assess our model, we'll use the [p-value](https://en.wikipedia.org/wiki/P-value#Basic_concepts).  The p-value is the probability of seeing an estimate at least as extreme as ours if the null hypothesis were true.  We compare it against a threshold, the [level of significance](http://blog.minitab.com/blog/adventures-in-statistics-2/how-to-interpret-regression-analysis-results-p-values-and-coefficients), to decide whether we can reject the null hypothesis.  Typically we reject the null hypothesis if p < 0.05, but we can set the threshold even tighter, to 0.01 or 0.005.

In terms of our model, the null hypothesis is that the regression coefficient is equal to zero.  Basically this means that if we don't have a low p-value, we don't have strong enough statistical evidence that the variable belongs in our model.  
*show an image... null hypothesis*
*p < 0.05, reject the null... include the variable*

Both the intercept and income term have a p-value below 0.05, but limit and rating do not.  In this case, we don't have strong enough statistical evidence to justify keeping limit and rating in the model at the same time.
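For readers who want to see where the `P>|z|` column comes from, here is a small sketch.  The `z` column is the coefficient divided by its standard error, and the p-value is the two-sided tail probability of that `z` under a standard normal distribution.  The helper name `two_sided_p_value` is my own, not part of statsmodels.

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value for a z statistic under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# z statistics copied from the fold 1 summary above
for name, z in [("income", -4.007), ("limit", -1.746), ("rating", -0.739)]:
    print("%-7s z = %6.3f  p = %.3f" % (name, z, two_sided_p_value(z)))
```

The values for `limit` and `rating` should land on the 0.081 and 0.460 shown in the summary, up to rounding.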

Now let's take a look at a summary of results for the second fold.

In [8]:
# fit the model on the second fold
logit = sm.Logit(y_fold2, fold2)
result = logit.fit()

# summary
result.summary()

Optimization terminated successfully.
         Current function value: 0.452194
         Iterations 7


| | | | |
|---|---|---|---|
| Dep. Variable: | default | No. Observations: | 115 |
| Model: | Logit | Df Residuals: | 111 |
| Method: | MLE | Df Model: | 3 |
| Date: | Wed, 27 Dec 2017 | Pseudo R-squ.: | 0.3476 |
| Time: | 10:42:31 | Log-Likelihood: | -52.002 |
| converged: | True | LL-Null: | -79.708 |
| | | LLR p-value: | 5.612e-12 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 5.8389 | 1.264 | 4.621 | 0.000 | 3.362 | 8.316 |
| income | -5.85e-05 | 1.42e-05 | -4.126 | 0.000 | -8.63e-05 | -3.07e-05 |
| limit | -4.223e-05 | 0.000 | -0.116 | 0.908 | -0.001 | 0.001 |
| rating | -0.0077 | 0.005 | -1.637 | 0.102 | -0.017 | 0.002 |


Taking a look at these new results, we notice that the coefficient for limit has changed by an order of magnitude, and rating has changed by a factor of 3.  

On the other hand, intercept and income, which still have p-values well below 0.05, have not changed much.  Finally, let's take a look at the estimates on the third fold.

In [9]:
# fit the model on the third fold
logit = sm.Logit(y_fold3, fold3)
result = logit.fit()

# summary
result.summary()

Optimization terminated successfully.
         Current function value: 0.451895
         Iterations 7


| | | | |
|---|---|---|---|
| Dep. Variable: | default | No. Observations: | 116 |
| Model: | Logit | Df Residuals: | 112 |
| Method: | MLE | Df Model: | 3 |
| Date: | Wed, 27 Dec 2017 | Pseudo R-squ.: | 0.3479 |
| Time: | 10:42:32 | Log-Likelihood: | -52.42 |
| converged: | True | LL-Null: | -80.388 |
| | | LLR p-value: | 4.335e-12 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 6.2255 | 1.239 | 5.024 | 0.000 | 3.797 | 8.654 |
| income | -6.343e-05 | 1.42e-05 | -4.457 | 0.000 | -9.13e-05 | -3.55e-05 |
| limit | 2.61e-05 | 0.000 | 0.081 | 0.936 | -0.001 | 0.001 |
| rating | -0.0091 | 0.004 | -2.086 | 0.037 | -0.018 | -0.001 |


Here's a summary of the coefficient estimates on all three folds:

| fold | income     | limit      | rating  |
|------|------------|------------|---------|
| 1    | -0.0000494 | -0.0005    | -0.0026 |
| 2    | -0.0000585 | -0.0000422 | -0.0077 |
| 3    | -0.0000634 | 0.0000261  | -0.0091 |

Based on this experiment, we can see that there's considerable instability in the coefficient estimates of limit and rating.  The numbers are jumping around all over the place.  If you were trying to interpret how a change in limit or rating affected your outcome, it would be nearly impossible.

Income, on the other hand, is fairly stable.  The value changes slightly from fold to fold, but not nearly as drastically as limit and rating.  This feature is much more interpretable.
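One rough way to put a number on "jumping around" is to compare the spread of each coefficient across folds with its typical size.  The `relative_range` measure below is an ad hoc choice of mine for illustration, not a standard diagnostic.

```python
import numpy as np

# coefficient estimates copied from the three-fold summary table
coefs = {
    "income": np.array([-0.0000494, -0.0000585, -0.0000634]),
    "limit":  np.array([-0.0005, -0.0000422, 0.0000261]),
    "rating": np.array([-0.0026, -0.0077, -0.0091]),
}

def relative_range(values):
    """Spread of the estimates (max - min) relative to their average magnitude."""
    return (values.max() - values.min()) / np.abs(values).mean()

spread = {name: relative_range(v) for name, v in coefs.items()}
for name, value in spread.items():
    print("%-7s relative range = %.2f" % (name, value))
```

By this measure `income` varies the least, `rating` more, and `limit` the most, which also changes sign between folds.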

This example illustrates the issue of collinearity.  When two collinear variables are included in a model, it's very difficult to interpret either one's effect on the outcome of the model.

In the next section, we'll look into detecting collinearity.

## Detecting collinearity
Now that we know that collinearity can cause issues with the coefficient estimates, let's take a look at two common techniques for [detecting collinearity](https://stats.idre.ucla.edu/stata/webbooks/logistic/chapter3/lesson-3-logistic-regression-diagnostics/).

### Correlation
One way to potentially detect collinearity is with correlation.  Let's take a look at the correlation matrix of the features from the example in the previous section.

In [12]:
# correlation
print(np.corrcoef([df['income'],df['limit'],df['rating']]))

[[ 1.          0.18344745  0.24482012]
 [ 0.18344745  1.          0.84179196]
 [ 0.24482012  0.84179196  1.        ]]


*picture of the correlation matrix*
We can see that our collinear variables, rating and limit, have a correlation of 0.84.  Typically anything above 0.7 is considered high.

On the other hand, income, which is not collinear with either of the other variables, does not have a high correlation.  

If variables have a high correlation, there is likely collinearity.  Unfortunately the reverse doesn't hold: a variable can be nearly a combination of several other variables without being highly correlated with any single one of them, so correlation alone should not be the only method for detecting collinearity.
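The caveat about collinear variables without a high correlation can be illustrated with synthetic data.  In this sketch, which uses made-up random features rather than the credit data, a fifth column is the exact sum of four independent ones.  No pair of columns is strongly correlated, yet the fifth column is perfectly predictable from the rest.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# four independent synthetic features, plus a fifth that is their exact sum
base = rng.normal(size=(1000, 4))
total = base.sum(axis=1)
X = np.column_stack([base, total])

# largest pairwise correlation between distinct columns
corr = np.corrcoef(X, rowvar=False)
off_diagonal = corr[~np.eye(5, dtype=bool)]
print("largest pairwise correlation: %.2f" % np.abs(off_diagonal).max())

# yet the fifth column is perfectly predicted by the other four
design = np.column_stack([np.ones(len(base)), base])
fitted = design @ np.linalg.lstsq(design, total, rcond=None)[0]
r_squared = 1 - ((total - fitted) ** 2).sum() / ((total - total.mean()) ** 2).sum()
print("R-squared of the fifth column on the others: %.6f" % r_squared)
```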

### Variance inflation factor
Another more reliable way to detect collinearity is the [variance inflation factor](https://en.wikipedia.org/wiki/Variance_inflation_factor) (VIF).  The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all of the other features.

If there's no collinearity at all, the VIF is equal to 1.  There isn't any hard and fast rule for VIF, but typically a value above 5 or 10 indicates pretty strong collinearity.  Let's take a look at the VIFs for our model.  We'll need to add in the intercept term since statsmodels doesn't do this by default.  We'll also drop the `default` variable since this is the outcome that we're trying to predict.
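To show what the VIF measures, here is a minimal NumPy sketch that regresses one column on the others and converts the resulting R² into 1 / (1 - R²).  The function `vif` is my own illustration, not the statsmodels implementation.  Unlike `variance_inflation_factor`, it adds the intercept itself, so the input should not contain a column of ones.

```python
import numpy as np

def vif(X, i):
    """VIF of column i: regress it on the other columns plus an intercept."""
    target = X[:, i]
    others = np.delete(X, i, axis=1)
    design = np.column_stack([np.ones(len(target)), others])
    fitted = design @ np.linalg.lstsq(design, target, rcond=None)[0]
    ss_res = ((target - fitted) ** 2).sum()
    ss_tot = ((target - target.mean()) ** 2).sum()
    r_squared = 1 - ss_res / ss_tot
    return 1.0 / (1.0 - r_squared)

# two synthetic, exactly uncorrelated features: each VIF should be 1
X = np.array([[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])
print([round(vif(X, i), 6) for i in range(X.shape[1])])
```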

In [13]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Reading in the data
df = pd.read_csv("default.csv")

# adding a column of ones to fit the intercept, and dropping the 'default' feature
df['intercept'] = 1.0
df.pop('default')
df.head()

# changing to ndarray type for statsmodels library
df = df.values

# variance inflation factors of the features
# computed for every column, but the intercept's value isn't reported
vif = [variance_inflation_factor(df, i) for i in range(df.shape[1])] # vif 
print("income:  %s \n limit:  %s \n rating: %s" % (vif[0], vif[1], vif[2]))

income:  1.06575266304 
 limit:  3.43830427058 
 rating: 3.53443837822


Income is close to one, so there isn't strong collinearity with the other features.  Limit and rating are around 3, so there is probably some mild collinearity.  

To show strong collinearity, let's create a new variable called income_squared, where we'll square the income variable.

In [19]:
# Reading in the data
df = pd.read_csv("default.csv")

# adding a column of ones to fit the intercept, and dropping the 'default' feature
df['intercept'] = 1.0
df.pop('default')

# creating income_squared variable
df['income_squared'] = df['income']**2
df.head()

| | income | limit | rating | intercept | income_squared |
|---|---|---|---|---|---|
| 0 | 17456 | 5154 | 495 | 1.0 | 304711936 |
| 1 | 19103 | 5658 | 527 | 1.0 | 364924609 |
| 2 | 21531 | 4214 | 502 | 1.0 | 463583961 |
| 3 | 21711 | 6341 | 555 | 1.0 | 471367521 |
| 4 | 25455 | 4862 | 452 | 1.0 | 647957025 |


Now let's take a look at the variance inflation factors.

In [20]:
# changing to ndarray type for statsmodels library
df = df.values

# variance inflation factors of the features
# computed for every column, but the intercept's value isn't reported
vif = [variance_inflation_factor(df, i) for i in range(df.shape[1])] # vif 
print("income:  %s \n limit:  %s \n rating: %s \n income_squared: %s" % (vif[0], vif[1], vif[2], vif[4]))

income:  12.3889199368 
 limit:  3.46314456614 
 rating: 3.54893993398 
 income_squared: 12.3267258883


The VIF for income has jumped from around 1 to 12.  The new variable income_squared also has a VIF around 12.  Since this new variable is based on the original variable we would expect there to be strong collinearity.

Sometimes we'll want to transform a variable by squaring it, or using some other function.  It's good practice to keep the original variable as well, so there will likely be collinearity.  This isn't always a bad thing though, as we'll see later on.

In this final example, we'll take a look at perfect collinearity.  Let's add limit and rating together to create a new variable, limit_rating.

In [21]:
# Reading in the data
df = pd.read_csv("default.csv")

# adding a column of ones to fit the intercept, and dropping the 'default' feature
df['intercept'] = 1.0
df.pop('default')

# creating limit_rating variable
df['limit_rating'] = df['limit'] + df['rating']
df.head()

| | income | limit | rating | intercept | limit_rating |
|---|---|---|---|---|---|
| 0 | 17456 | 5154 | 495 | 1.0 | 5649 |
| 1 | 19103 | 5658 | 527 | 1.0 | 6185 |
| 2 | 21531 | 4214 | 502 | 1.0 | 4716 |
| 3 | 21711 | 6341 | 555 | 1.0 | 6896 |
| 4 | 25455 | 4862 | 452 | 1.0 | 5314 |


In [22]:
# changing to ndarray type for statsmodels library
df = df.values

# variance inflation factors of the features
# computed for every column, but the intercept's value isn't reported
vif = [variance_inflation_factor(df, i) for i in range(df.shape[1])] # vif 
print("income:  %s \n limit:  %s \n rating: %s \n limit_rating: %s" % (vif[0], vif[1], vif[2], vif[4]))

income:  1.06575266304 
 limit:  inf 
 rating: inf 
 limit_rating: inf


The result for limit, rating, and limit_rating is `inf`.  Each of these variables can be predicted exactly from the other two, so R² is 1 and the VIF, 1 / (1 - R²), divides by zero.  This is perfect collinearity, and it's typical when one variable is a sum of two others.
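Another way to see the perfect collinearity is to check that some non-zero combination of the columns cancels out exactly.  The sketch below uses only the first five rows printed above, and the `weights` vector is simply the combination limit + rating - limit_rating.

```python
import numpy as np

# first five rows of limit and rating, copied from the table above
limit = np.array([5154.0, 5658.0, 4214.0, 6341.0, 4862.0])
rating = np.array([495.0, 527.0, 502.0, 555.0, 452.0])
limit_rating = limit + rating

X = np.column_stack([np.ones(5), limit, rating, limit_rating])

# a non-zero set of weights that combines the columns into all zeros
weights = np.array([0.0, 1.0, 1.0, -1.0])
print(X @ weights)

# the matrix therefore has fewer independent columns than actual columns
print("columns:", X.shape[1], "rank:", np.linalg.matrix_rank(X))
```

When such a combination exists, the model cannot tell the columns apart, which is why the coefficients are not uniquely determined.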

## Sources of collinearity

Up to this point, we've seen a couple of different [sources of collinearity](https://stats.stackexchange.com/questions/221902/what-is-an-example-of-perfect-multicollinearity).  The first was a squared transform of the `income` variable.  Another example was the summation of the `limit` and `rating` features to create a new feature `limit_rating`.

There are quite a few different examples of collinearity, but here's a short list of some of the more common ones.

1. Multiple of another variable (x2 = 2x1)
2. Add a constant to another variable (x2 = x1 + 100)
3. Transformation of another variable (sqrt, ^2, log)
4. "Dummy variable trap": encoding red, blue, green with 3 dummy variables instead of n-1, which should be 2
5. Multicollinearity: one variable is a combination of several others (x3 = x2 + x1)
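The dummy variable trap from item 4 is easy to reproduce.  In this sketch the `colors` array is a hypothetical categorical feature, not part of the credit data.  With one indicator column per level, the columns always add up to the intercept column, and dropping one level removes that redundancy.

```python
import numpy as np

# hypothetical categorical feature with three levels
colors = np.array(["red", "blue", "green", "blue", "red", "green"])
levels = np.unique(colors)  # sorted: blue, green, red

# one indicator column per level (the full, redundant encoding)
dummies = (colors[:, None] == levels[None, :]).astype(float)

# the indicator columns always sum to the intercept column of ones
print(dummies.sum(axis=1))

# dropping one level leaves n - 1 = 2 columns and removes the redundancy
reduced = dummies[:, 1:]
print(reduced.shape)
```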

We've already seen that collinearity can make it difficult to interpret a model, but sometimes collinearity is unavoidable.  

When we transform a variable, we should keep the original variable in our model.  This is always going to cause some level of collinearity, but this might not be an issue.

In the next section we'll look at several ways to deal with collinearity.

## Dealing with collinearity
If our model has collinear variables, we have several options:

1. Remove collinear variables
2. Center or standardize collinear variables
3. Ridge regression
4. Do nothing (more on this later...)

Let's take a look at the first three in more detail.  I'll leave option four for discussion later on.

### Remove collinear variables 
Going back to our original example, let's take a look at what happens when we remove collinear variables.  

Let's start by reading in the data again, and slicing it into three folds.

In [23]:
# Reading in the data
df = pd.read_csv("default.csv")

# adding a column of ones to fit the intercept
df['intercept'] = 1.0

# feature variables
features = ['intercept','income','limit','rating']

# creating 3 folds
fold1 = df[features][0:116]
y_fold1 = df['default'][0:116]

fold2 = df[features][116:231]
y_fold2 = df['default'][116:231]

fold3 = df[features][231:]
y_fold3 = df['default'][231:]

Let's take another quick look at a summary of the model.

In [24]:
# fit the model on fold 1
logit = sm.Logit(y_fold1, fold1)
result = logit.fit()

# summary
result.summary()

Optimization terminated successfully.
         Current function value: 0.479695
         Iterations 7


| | | | |
|---|---|---|---|
| Dep. Variable: | default | No. Observations: | 116 |
| Model: | Logit | Df Residuals: | 112 |
| Method: | MLE | Df Model: | 3 |
| Date: | Wed, 27 Dec 2017 | Pseudo R-squ.: | 0.3078 |
| Time: | 10:43:07 | Log-Likelihood: | -55.645 |
| converged: | True | LL-Null: | -80.388 |
| | | LLR p-value: | 1.028e-10 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 5.6647 | 1.202 | 4.712 | 0.000 | 3.309 | 8.021 |
| income | -4.94e-05 | 1.23e-05 | -4.007 | 0.000 | -7.36e-05 | -2.52e-05 |
| limit | -0.0005 | 0.000 | -1.746 | 0.081 | -0.001 | 6.64e-05 |
| rating | -0.0026 | 0.004 | -0.739 | 0.460 | -0.009 | 0.004 |


From the results we see that the p-values for both `limit` and `rating` are above our threshold of 0.05.  Let's take a look at what happens when we remove `rating` from our model.

In [25]:
# feature variables
features = ['intercept','income','limit']

# creating 3 folds
fold1 = df[features][0:116]
y_fold1 = df['default'][0:116]

fold2 = df[features][116:231]
y_fold2 = df['default'][116:231]

fold3 = df[features][231:]
y_fold3 = df['default'][231:]


# fit the model on fold 1
logit = sm.Logit(y_fold1, fold1)
result = logit.fit()

# summary
result.summary()

Optimization terminated successfully.
         Current function value: 0.482046
         Iterations 7


| | | | |
|---|---|---|---|
| Dep. Variable: | default | No. Observations: | 116 |
| Model: | Logit | Df Residuals: | 113 |
| Method: | MLE | Df Model: | 2 |
| Date: | Wed, 27 Dec 2017 | Pseudo R-squ.: | 0.3044 |
| Time: | 10:43:22 | Log-Likelihood: | -55.917 |
| converged: | True | LL-Null: | -80.388 |
| | | LLR p-value: | 2.358e-11 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 5.3920 | 1.122 | 4.805 | 0.000 | 3.193 | 7.591 |
| income | -5.002e-05 | 1.23e-05 | -4.077 | 0.000 | -7.41e-05 | -2.6e-05 |
| limit | -0.0007 | 0.000 | -3.627 | 0.000 | -0.001 | -0.000 |


The `limit` feature now has a p-value well below 0.05.  This means that we can safely include this variable in the model, and we can expect far less instability of the coefficient estimate.  

If we include `rating` and remove `limit` we achieve a similar result.  This shows that removing a collinear variable will improve the stability of the coefficient estimates.

### Center the collinear variables
Another method to reduce collinearity is to center the collinear variables.  This is done by subtracting the mean from a feature before performing a transform, and it helps when the collinearity comes from including both a variable and a transform of it, such as its square.  I'll walk us through another example.

Let's take a look at our previous example where we created the `income_squared` feature.  Again, we'll read in the data, and create the new feature.

In [26]:
# Reading in the data
df = pd.read_csv("default.csv")

# adding a column of ones to fit the intercept, and dropping the 'default' feature
df['intercept'] = 1.0
df.pop('default')

# creating income_squared variable
df['income_squared'] = df['income']**2
df.head()

| | income | limit | rating | intercept | income_squared |
|---|---|---|---|---|---|
| 0 | 17456 | 5154 | 495 | 1.0 | 304711936 |
| 1 | 19103 | 5658 | 527 | 1.0 | 364924609 |
| 2 | 21531 | 4214 | 502 | 1.0 | 463583961 |
| 3 | 21711 | 6341 | 555 | 1.0 | 471367521 |
| 4 | 25455 | 4862 | 452 | 1.0 | 647957025 |


Taking another look at the VIFs, we can see that `income` and `income_squared` are collinear (VIF>10).

In [27]:
# changing to ndarray type for statsmodels library
df = df.values

# variance inflation factors of the features
# computed for every column, but the intercept's value isn't reported
vif = [variance_inflation_factor(df, i) for i in range(df.shape[1])] # vif 
print("income:  %s \n limit:  %s \n rating: %s \n income_squared: %s" % (vif[0], vif[1], vif[2], vif[4]))

income:  12.3889199368 
 limit:  3.46314456614 
 rating: 3.54893993398 
 income_squared: 12.3267258883


This time I'll center the `income` feature before creating `income_squared`.  Let's take another look at the VIFs.

In [28]:
# Reading in the data
df = pd.read_csv("default.csv")

# adding a column of ones to fit the intercept, and dropping the 'default' feature
df['intercept'] = 1.0
df.pop('default')

# centering the income variable 
df['income'] = df['income'] - np.mean(df['income'])

# creating income_squared variable
df['income_squared'] = df['income']**2

# changing to ndarray type for statsmodels library
df = df.values

# variance inflation factors of the features
# computed for every column, but the intercept's value isn't reported
vif = [variance_inflation_factor(df, i) for i in range(df.shape[1])] # vif 
print("income:  %s \n limit:  %s \n rating: %s \n income_squared: %s" % (vif[0], vif[1], vif[2], vif[4]))

income:  2.75699200552 
 limit:  3.46314456614 
 rating: 3.54893993398 
 income_squared: 2.69714751516


Centering the `income` variable before the transform has reduced the collinearity to an acceptable level.  If we were to fit models using these features, the coefficient estimates for `income` and `income_squared` would be much more stable.
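To get some intuition for why centering helps, here is a sketch on synthetic, evenly spaced incomes (not the values in default.csv).  When every value is positive, a variable and its square rise together.  After centering, the square is large at both ends of the range, so the straight-line relationship mostly disappears.

```python
import numpy as np

# synthetic, evenly spaced incomes (not the values from default.csv)
income = np.linspace(17000, 76000, 50)

# raw variable vs. its square: almost perfectly correlated
raw_corr = np.corrcoef(income, income ** 2)[0, 1]

# centered variable vs. its square
centered = income - income.mean()
centered_corr = np.corrcoef(centered, centered ** 2)[0, 1]

print("raw:      %.3f" % raw_corr)
print("centered: %.3f" % centered_corr)
```

The centered correlation is essentially zero here only because the synthetic incomes are symmetric around their mean.  Real incomes are usually skewed, which is consistent with the VIF above dropping to around 2.7 rather than all the way to 1.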

Another option similar to centering is [standardizing](https://stats.stackexchange.com/questions/29781/when-conducting-multiple-regression-when-should-you-center-your-predictor-varia).  In addition to subtracting the mean, we would also divide by the standard deviation.  This is another option if we want to reduce the VIF of transformed variables.

### Ridge regression 

Another way to deal with collinearity is with a technique called Ridge Regression.  This technique adds a penalty on the size of the coefficients (the sum of their squares), which shrinks the coefficient estimates towards zero.  

This shrinkage penalty reduces the effects of the collinear variables.  Let's take a look at an example.  

This time we'll use the [sklearn implementation of Logistic Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).  This uses Ridge Regression by default (penalty: 'l2').

I'll use the perfectly collinear variable `limit_rating` which is a sum of the `limit` and `rating` variables.  The parameter `C` is the inverse of the regularization strength, so to effectively turn off the Ridge Regression, change `C` to a large number, such as `1e9`.
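Before running the model, here is a sketch of why a squared-coefficient penalty helps with perfectly collinear features.  It uses a tiny synthetic dataset in which the second column is a copy of the first, and the helper names `log_loss_sum` and `l2_penalty` are my own.  As I understand the scikit-learn documentation, the fitted objective is `C` times the logistic loss plus this kind of penalty, so a very large `C` makes the penalty negligible.

```python
import numpy as np

def log_loss_sum(w, X, y):
    """Unpenalized logistic loss, summed over observations (y is 0 or 1)."""
    z = X @ w
    return np.sum(np.logaddexp(0.0, z) - y * z)

def l2_penalty(w):
    """Ridge penalty: half the sum of squared coefficients."""
    return 0.5 * np.sum(w ** 2)

# synthetic data: the second column is an exact copy of the first
x = np.array([-2.0, -1.0, 0.5, 1.0, 3.0])
X = np.column_stack([x, x])
y = np.array([0, 0, 1, 0, 1])

w_lopsided = np.array([1.0, 0.0])  # all weight on one copy
w_shared = np.array([0.5, 0.5])    # weight split across both copies

for w in (w_lopsided, w_shared):
    print(w, "loss = %.4f" % log_loss_sum(w, X, y), "penalty = %.3f" % l2_penalty(w))
```

Both weight vectors fit the data equally well, so without a penalty the model has no reason to prefer one over the other.  The penalty breaks the tie in favor of the smaller, shared coefficients.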

In [103]:
from sklearn import linear_model

# Reading in the data
df = pd.read_csv("default.csv")

# adding a column of ones to fit the intercept
df['intercept'] = 1.0

# creating a collinear variable limit_rating
df['limit_rating'] = df['limit'] + df['rating']

# feature variables
features = ['intercept','income','limit','rating','limit_rating']

# creating 3 folds
fold1 = df[features][0:116]
y_fold1 = df['default'][0:116]

fold2 = df[features][116:231]
y_fold2 = df['default'][116:231]

fold3 = df[features][231:]
y_fold3 = df['default'][231:]

# reshaping for sklearn model
X = fold1
y = y_fold1
X.head()

# logistic regression model, ridge regression turned off
logreg = linear_model.LogisticRegression(fit_intercept=False, C=1e9, solver='newton-cg')

# coefficient of limit_rating variable
logreg.fit(X, y).coef_[0][4]

-0.0010443457910360868

The coefficient for the `limit_rating` variable is -0.001.  Now let's turn Ridge Regression on by changing the parameter `C` to `1`.

In [104]:
# logistic regression model, ridge regression turned on
logreg = linear_model.LogisticRegression(fit_intercept=False, C=1, solver='newton-cg')

# coefficient of limit_rating variable
logreg.fit(X, y).coef_[0][4]

1.4651660917476825e-05

The coefficient estimate for `limit_rating` is now about 0.000015.  Ridge Regression has shrunk the magnitude of the parameter by approximately two orders of magnitude.  This shrinking of the coefficient reduces the effect that this variable has on the model.  The sklearn implementation will use Ridge Regression by default, which is a nice feature.

## Ignore collinearity
Up to this point we've seen how collinearity can make a model difficult to interpret.

We've also looked at how to detect collinearity, and some of the potential sources of collinearity in your models.  

Finally, we learned different techniques for dealing with collinearity.

Now I'm going to tell you that collinearity doesn't matter, and you can just ignore it.  Well, sort of.  

Depending on what your goals are, you might be able to [safely ignore collinearity](http://blog.minitab.com/blog/adventures-in-statistics-2/what-are-the-effects-of-multicollinearity-and-when-can-i-ignore-them).

If the only goal of your logistic regression models is to make accurate predictions, then you can probably go ahead and ignore collinearity.  On the other hand, if your goal is to interpret your model and understand how your feature variables affect your outcome, then you can't ignore collinearity.  We went in-depth on this topic earlier, so I won't revisit that here.

To show you that collinearity has little effect on your model's predictive accuracy, let's run through an example.

I'll be using the same `credit` dataset that we've been working with all along.  First I'll fit a model with the `income`, `limit`, and `rating` features.

In [139]:
from sklearn import linear_model
from sklearn.metrics import accuracy_score

# Reading in the data
df = pd.read_csv("default.csv")

# adding a column of ones to fit the intercept
df['intercept'] = 1.0

# feature variables
features = ['intercept','income','limit','rating']

# creating 3 folds
fold1 = df[features][0:116]
y_fold1 = df['default'][0:116]

fold2 = df[features][116:231]
y_fold2 = df['default'][116:231]

fold3 = df[features][231:]
y_fold3 = df['default'][231:]


# training data
X = fold1
y = y_fold1
X.head()

# test data
X_test = fold2
y_test = y_fold2

# logistic regression model, ridge regression turned off
logreg = linear_model.LogisticRegression(fit_intercept=False, C=1e9, solver='newton-cg')

# fit the model on the training fold
logreg.fit(X, y)

# making predictions 
y_pred = logreg.predict(X_test)

# accuracy score
accuracy_score(y_test, y_pred)

0.79130434782608694

The accuracy of the model is 79%.  We saw previously that `limit` and `rating` have some mild collinearity.  Let's remove the rating feature and see what happens.

In [141]:
# feature variables
features = ['intercept','income','limit']

# creating 3 folds
fold1 = df[features][0:116]
y_fold1 = df['default'][0:116]

fold2 = df[features][116:231]
y_fold2 = df['default'][116:231]

fold3 = df[features][231:]
y_fold3 = df['default'][231:]


# training data
X = fold1
y = y_fold1
X.head()

# test data
X_test = fold2
y_test = y_fold2

# logistic regression model, ridge regression turned off
logreg = linear_model.LogisticRegression(fit_intercept=False, C=1e9, solver='newton-cg')

# fit the model on the training fold
logreg.fit(X, y)

# making predictions 
y_pred = logreg.predict(X_test)

# accuracy score
accuracy_score(y_test, y_pred)

0.77391304347826084

The accuracy has now changed slightly to 77%.  Removing the collinear variable had virtually no effect on our model.  We can also test this out by removing the `limit` feature and including the other collinear variable `rating`.

In [148]:
# feature variables
features = ['intercept','income','rating']

# creating 3 folds
fold1 = df[features][0:116]
y_fold1 = df['default'][0:116]

fold2 = df[features][116:231]
y_fold2 = df['default'][116:231]

fold3 = df[features][231:]
y_fold3 = df['default'][231:]


# training data
X = fold1
y = y_fold1
X.head()

# test data
X_test = fold2
y_test = y_fold2

# logistic regression model, ridge regression turned off
logreg = linear_model.LogisticRegression(fit_intercept=False, C=1e9, solver='newton-cg')

# fit the model on the training fold
logreg.fit(X, y)

# making predictions 
y_pred = logreg.predict(X_test)

# accuracy score
accuracy_score(y_test, y_pred)

0.77391304347826084

As before, the accuracy is still 77%.
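To put the difference between 79% and 77% in perspective, we can convert the accuracy scores back into counts.  This sketch only uses numbers already shown above: the test fold has 115 rows, and accuracy is the fraction of correct predictions.

```python
# fold 2 is the test set and holds 115 observations
n_test = 115

# accuracy scores reported above
acc_all_features = 0.79130434782608694
acc_one_removed = 0.77391304347826084

# accuracy = correct predictions / total predictions, so invert it
correct_all = round(acc_all_features * n_test)
correct_removed = round(acc_one_removed * n_test)

print(correct_all, correct_removed, correct_all - correct_removed)
```

In other words, the two models disagree with the true labels on only a couple more or fewer test rows out of 115.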

These results illustrate an important point.  In this example, the existence of collinearity had very little effect on the predictive accuracy of the model.  If your goal is purely one of making accurate predictions, then it's probably safe to ignore collinearity.  This is more of a machine learning perspective.  

On the other hand, if your goal is to interpret your model, you shouldn't ignore collinearity.  Use the methods from this article to identify and deal with collinearity, and you'll be able to better interpret your results.