# Review of Logistic Regression

Logistic regression is a way for us to predict binary (two outcome) variables.

It is similar to linear regression because we enter predictor variables into our model and find a coefficient for each of those variables that describes how a 1 unit increase in the variable relates to the outcome.

The key difference is that the outcome isn't continuous. Instead, our model predicts the log odds of the outcome variable occurring.

From the log odds of our outcome occurring, we can work out the odds and the probability of our outcome occurring, based on the level of our predictor variables.

Let's look at a hypothetical scenario.

Imagine we had the following data on whether people studied (1=yes, 0=no) and whether they passed their test (1=yes, 0=no). There are 20 cases, and 10 of those were people who studied:

| Studied | PassedTest |
|---------|------------|
| 1          | 1       |
| 1          | 1       |
| 1          | 1       |
| 1          | 1       |
| 1          | 1       |
| 1          | 1       |
| 1          | 0       |
| 1          | 0       |
| 1          | 0       |
| 1          | 0       |
| 0          | 0       |
| 0          | 0       |
| 0          | 0       |
| 0          | 0       |
| 0          | 0       |
| 0          | 0       |
| 0          | 0       |
| 0          | 1       |
| 0          | 1       |
| 0          | 1       |

We can enter this exact data into a pandas dataframe, where one column is our predictor ("studied") and the other column is our outcome ("passedtest").

In [62]:
import pandas as pd
data = [
[1,1],
[1,1],
[1,1],
[1,1],
[1,1],
[1,1],
[1,0],
[1,0],
[1,0],
[1,0],
[0,0],
[0,0],
[0,0],
[0,0],
[0,0],
[0,0],
[0,0],
[0,1],
[0,1],
[0,1],
]
df = pd.DataFrame(data,columns=["studied","passedtest"])
df

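|   | studied | passedtest |
|---|---------|------------|
| 0 | 1       | 1          |
| 1 | 1       | 1          |
| 2 | 1       | 1          |
| 3 | 1       | 1          |
| 4 | 1       | 1          |
| 5 | 1       | 1          |
| 6 | 1       | 0          |
| 7 | 1       | 0          |
| 8 | 1       | 0          |
| 9 | 1       | 0          |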


Let's look at the difference in passing rate between those who studied and those who didn't.

The 'groupby' function for a dataframe can show the results of a descriptive statistic (for example, the mean) broken down by groups of interest.

In this example, we want to examine differences between those who studied and those who didn't, so "studied" is our groupby variable.

We want to know the average passing rate for each group, so we'll take the mean of how many people passed the test in that group.

In [58]:
df.groupby("studied")['passedtest'].mean()

studied
0    0.3
1    0.6
Name: passedtest, dtype: float64

What do you notice?

Are people who studied (group 1) any more likely to pass the test than those who did not (group 0)?


Hopefully you said "yes" because people who studied had a 60% chance of passing the test, while people who didn't study had only a 30% chance of passing the test.
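
The group means above can also be reproduced without pandas, which may make it clearer what the groupby is doing. This is only a sketch: the `rows` list simply re-enters the same 20 cases as (studied, passedtest) pairs, and the variable names are made up for illustration.

```python
from collections import defaultdict

# (studied, passedtest) pairs, re-entered from the data above
rows = [(1, 1)] * 6 + [(1, 0)] * 4 + [(0, 0)] * 7 + [(0, 1)] * 3

# Tally how many people are in each group and how many of them passed
totals = defaultdict(int)
passes = defaultdict(int)
for studied, passed in rows:
    totals[studied] += 1
    passes[studied] += passed

# Pass rate per group = number who passed / group size
pass_rate = {group: passes[group] / totals[group] for group in sorted(totals)}
print(pass_rate)  # {0: 0.3, 1: 0.6}
```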

# Odds

You can express how much more likely one group is to pass than another group by using the odds.

For example, if one group has a .75 chance of passing, then the odds of passing are: .75 / (1 - .75) = 3

If one group has a .25 chance of passing, then the odds of passing are: .25 / (1 - .25) = 1/3
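
The same conversion can be wrapped in a pair of small helper functions. These helpers are hypothetical (they are not part of pandas or sklearn) and are only meant to show the arithmetic in both directions.

```python
def odds_from_probability(p):
    """Convert a probability (0 <= p < 1) into odds."""
    return p / (1 - p)


def probability_from_odds(odds):
    """Convert odds back into a probability."""
    return odds / (1 + odds)


print(odds_from_probability(0.75))  # odds of 3 to 1
print(odds_from_probability(0.25))  # odds of 1 to 3
print(probability_from_odds(3.0))   # back to a .75 probability
```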

What are the odds of passing the test if you studied?

In [131]:
0.60 / (1 - .60)

1.4999999999999998

What are the odds of passing if you did not study?

In [132]:
.30 / (1 - .30)

0.4285714285714286

Logistic regression is useful because it can figure out these odds for us, and tell us how much one variable increases the log odds of the outcome occurring. We can then transform the log odds into simple odds to get a clearer picture.
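
Before fitting anything, it can help to compute by hand the two quantities that connect odds and log odds for this data: the ratio of the two odds above, and the natural log of that ratio. This is a sketch using only the pass rates from the groupby output. For a single yes/no predictor, the log of the odds ratio is roughly what an unpenalized logistic regression coefficient should come out to.

```python
import math

odds_studied = 0.60 / (1 - 0.60)      # about 1.5
odds_not_studied = 0.30 / (1 - 0.30)  # about 0.4286

# How many times larger are the odds of passing for people who studied?
odds_ratio = odds_studied / odds_not_studied

# The same comparison on the log odds scale
log_odds_ratio = math.log(odds_ratio)

print("Odds ratio: %.4f" % odds_ratio)
print("Log odds ratio: %.4f" % log_odds_ratio)
```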

# Logistic Regression

We can have a logistic regression figure out these odds for us.

Logistic regression tries to determine how the different variables relate to the log odds of the outcome.

Each coefficient for each variable represents how a 1 unit increase in the variable corresponds to a change in the log odds of the outcome.
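
Written out for a single predictor, the model being described is the following, where $p$ is the probability of the outcome and $x$ is the predictor (the symbols are chosen here for illustration):

$$ \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x $$

Increasing $x$ by 1 adds $\beta_1$ to the log odds, which is the same as multiplying the odds by $\exp(\beta_1)$.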


First, we'll create a logistic regression classifier object.

In [147]:
from sklearn.linear_model import LogisticRegression

#Create a logistic regression classifier object
#We use C=100000 so that our model does not penalize the predictor (make it artificially smaller)
# and lets its natural relationship with the outcome be described.
lm = LogisticRegression(C=100000)

Next, we'll specify our predictor variables and our outcome variable.

In [148]:
#Set our X variable to be equal to the column that indicates whether or not people studied
X = df[['studied']]
#Set our y variable to be equal to the column that indicates whether or not people passed
y = df['passedtest']


Now we can fit our logistic regression model, which models how our predictor variable relates to the log odds of the outcome variable.

In [149]:
lm.fit(X,y)

LogisticRegression(C=100000, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0)

Now that the model is fit, we can see what the coefficient is for 'studied'.

Notice that the coefficient is positive.

This means that the variable has a positive relationship with the outcome.

When studied is 1, it increases the log odds of the outcome occurring.

Specifically, when people study, the log odds of passing increase by 1.2527 compared with when people don't study.

In [107]:
print(list(zip(["studied"], lm.coef_[0])))

[('studied', 1.2527477116304988)]


Because describing the coefficient in terms of the 'log odds' of our outcome is not clear, we can exponentiate our coefficient and interpret it in terms of increasing the 'odds' of our outcome occurring.

In [108]:
import numpy as np
np.exp(lm.coef_[0])

array([ 3.4999466])

Because 3.4999 is the exponentiation of our coefficient, we can say that studying multiplies the odds of passing the test by 3.5 compared with not studying.

To make this point more clear, let's break the predictions down:

First, we can compute the odds of passing if we didn't study by looking at the odds when studying is 0. Remember, we have to exponentiate everything to express it in terms of odds.

The equation is $$ \text{odds} = \exp(\beta_0 + \beta_1 \cdot \text{studying}) $$

Because studying is equal to 0,  the formula for the odds when not studying is exp(intercept)


In [124]:
print("Odds without studying: %f" % np.exp(lm.intercept_[0]))

Odds without studying: 0.428576


Notice that .428576 is essentially the same value we got for our odds when we computed them by hand earlier in the lesson.

Let's compare that to the odds of passing when we DID study.

The equation is $$ \text{odds} = \exp(\beta_0 + \beta_1 \cdot \text{studying}) $$

Because studying is equal to 1, we have to multiply our studying coefficient by 1 and add that to our intercept


In [126]:
print("Odds with studying: %f" % np.exp(lm.intercept_[0] + lm.coef_[0][0]))

Odds with studying: 1.499992


So the odds of passing when not studying are .428576.
The odds of passing when studying are 1.499992.

Therefore, the ratio of 1.499992 to .428576 tells us how much the odds multiply when studying.

This value, 3.5, is what we got when we exponentiated the coefficient for studying. Thus, from the logistic regression model, we can know how much studying multiplies our odds of passing the test just by exponentiating its coefficient.

In [128]:
odds_ratio = np.exp(lm.intercept_[0] + lm.coef_[0][0]) / np.exp(lm.intercept_[0])
print("Relative Increase of Odds if Studying vs. not Studying: %f" % odds_ratio)

Relative Increase of Odds if Studying vs. not Studying: 3.499947
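
The reason this ratio equals the exponentiated coefficient can be seen by writing the two odds out and cancelling the intercept term. This is a short derivation using the same $\beta_0$ and $\beta_1$ as the equation above:

$$ \frac{\exp(\beta_0 + \beta_1 \cdot 1)}{\exp(\beta_0 + \beta_1 \cdot 0)} = \frac{\exp(\beta_0)\,\exp(\beta_1)}{\exp(\beta_0)} = \exp(\beta_1) $$

With the fitted coefficient, $\exp(1.2527) \approx 3.5$, which matches the ratio printed above.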


# Determining Probabilities

We can have our logistic regression model give us the specific probability of a person passing, if we know their value for the predictor variables.

For example, if we know that the person studied, we can ask our model to predict the probability of that person passing using the "predict_proba" method.

Our model gives that person a 60% chance of passing the test, and a 40% chance of not passing the test.

Notice that this probability corresponds to exactly what we saw in our data, where 6 out of the 10 people who studied passed the test.

In [134]:
#set our predictor to just be one row of a person who studied
#(a list of rows, because predict_proba expects 2D input: one inner list per observation)
observation = [[1]]
#tell the trained logistic regression object to predict the probability of that observation 
#belonging to class 0 and the probability of that observation belonging to class 1
predictions = lm.predict_proba(observation)
predictions

array([[ 0.40000125,  0.59999875]])

Likewise, we can also get the probability of passing the test if someone did not study.

This person has a 30% chance of passing the test, which is exactly what we saw in the data.

Out of the 10 people who did not study, only 3 passed the test.

In [136]:
#set our predictor to just be one row of a person who did not study
#(a list of rows, because predict_proba expects 2D input: one inner list per observation)
observation = [[0]]
#tell the trained logistic regression object to predict the probability of that observation 
#belonging to class 0 and the probability of that observation belonging to class 1
predictions = lm.predict_proba(observation)
predictions

array([[ 0.69999789,  0.30000211]])
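
These probabilities can also be recovered by hand from the fitted intercept and coefficient, by pushing the log odds through the logistic (sigmoid) function. The sketch below hard-codes the values printed earlier in this lesson rather than reading them from `lm`; the intercept is taken as the log of the printed "odds without studying", so it is only as precise as that printout. The function name is made up for illustration.

```python
import math

# Values copied from the model output earlier in this lesson
intercept = math.log(0.428576)     # exp(intercept) was printed as 0.428576
coef_studied = 1.2527477116304988  # coefficient for 'studied'


def probability_of_passing(studied):
    """Turn the log odds for one person into a probability."""
    log_odds = intercept + coef_studied * studied
    return 1.0 / (1.0 + math.exp(-log_odds))


print("P(pass | studied):       %.4f" % probability_of_passing(1))
print("P(pass | did not study): %.4f" % probability_of_passing(0))
```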

# Adjusting for class weight

What if we knew that passing the test was extremely unlikely? We could give weights to the classes to let our model know that there is a small chance of passing the test.

Let's say historically, only 20% of people pass the test. Therefore, for our class weight parameter, we are going to give class 0 a weight of .80 and class 1 a weight of .20, to reflect an 80% probability of being in class 0 and a 20% probability of being in class 1.

Now, when our model makes a prediction, even if you've studied, you only have a 27% chance of passing.

In [155]:
from sklearn.linear_model import LogisticRegression

lm = LogisticRegression(class_weight={0:.80,1:.20})
lm.fit(X,y)
observation = [[1]]
predictions = lm.predict_proba(observation)
predictions

array([[ 0.73160795,  0.26839205]])
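
To see roughly where a figure near 27% comes from, the weighting can be approximated by hand. This sketch assumes that each class weight acts as a multiplier on how much every observation of that class counts, and it ignores regularization entirely. The fit above uses the default C, which is presumably why its 0.268 differs slightly from the number below.

```python
# Counts for the people who studied, from the data above
passed_studied, failed_studied = 6, 4

# Class weights from the example: class 0 (fail) = 0.80, class 1 (pass) = 0.20
weight_fail, weight_pass = 0.80, 0.20

# Each observation counts in proportion to the weight of its class
weighted_pass = passed_studied * weight_pass
weighted_fail = failed_studied * weight_fail

weighted_probability = weighted_pass / (weighted_pass + weighted_fail)
print("Weighted P(pass | studied): %.4f" % weighted_probability)
```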

# Regularizing the model (Prevent Overfitting)

To prevent overfitting (having too many coefficients in the model that don't add anything), we can regularize the model by making the coefficients smaller. This regularization means that coefficients have to **really** contribute to the model (be large) to actually influence the model because they are going to be decreased by the regularization.

The C parameter is the inverse regularization strength. Small values of C will make the model more regularized and will make the coefficients smaller.
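
One way to see why C is an *inverse* strength is to write out the kind of objective an L2-penalized logistic regression minimizes. The form below is a sketch along the lines of what the scikit-learn documentation gives for the L2 penalty, with the outcome recoded as $\tilde{y}_i \in \{-1, +1\}$. Solver-specific details, such as how the intercept is treated, are left out.

$$ \min_{\beta_0, \beta} \; \frac{1}{2}\lVert \beta \rVert_2^2 + C \sum_{i=1}^{n} \ln\left(1 + \exp\left(-\tilde{y}_i (\beta_0 + x_i^\top \beta)\right)\right) $$

C multiplies the data-fit term, so a large C lets the data dominate and a small C lets the penalty on the coefficients dominate.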

Let's fit a model with three different values for C: 100000 (mild regularization), 1 (moderate regularization), and .0001 (strong regularization).

Notice how the coefficient gets smaller, and is biased downwards, under stronger regularization.

In [159]:
lm = LogisticRegression(C=100000)
lm.fit(X,y)
print("Coefficient under mild regularization: %f" % lm.coef_[0][0])

lm = LogisticRegression(C=1)
lm.fit(X,y)
print("Coefficient under moderate regularization: %f" % lm.coef_[0][0])

lm = LogisticRegression(C=.0001)
lm.fit(X,y)
print("Coefficient under strong regularization: %f" % lm.coef_[0][0])

Coefficient under mild regularization: 1.252748
Coefficient under moderate regularization: 0.578903
Coefficient under strong regularization: 0.000100


# Finding the best C value

In principle, regularization should make the model cross-validate better because it is not overfit to the specifics of our sample data. However, it only really makes a difference when there are too many predictors.

If we only have one predictor, then we don't need to regularize because overfitting happens as more predictors are added.

We can use the grid search function of sklearn to go through different C values and see which model cross-validates the best.

Notice how our model actually gets worse with regularization.

In [176]:
from sklearn import grid_search
# Let's see which C value cross-validates the best (10000, 1, .01, or .001)
param_grid={'C': [10000,1,.01,.001]}

gs = grid_search.GridSearchCV(
    estimator=LogisticRegression(),
    param_grid=param_grid,
    cv=9,
    scoring='accuracy'
    )


gs.fit(X, y)
gs.grid_scores_

[mean: 0.65000, std: 0.28808, params: {'C': 10000},
 mean: 0.65000, std: 0.28808, params: {'C': 1},
 mean: 0.45000, std: 0.06929, params: {'C': 0.01},
 mean: 0.45000, std: 0.06929, params: {'C': 0.001}]
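
For context on these scores, it can help to compute what a few very simple prediction rules would score on the full data set. This is a sketch on the whole sample, not a cross-validation, so it only gives reference points for the means above. It does not show what the regularized models actually predict. The `rows` list re-enters the same 20 cases as (studied, passedtest) pairs.

```python
# (studied, passedtest) pairs, re-entered from the start of the lesson
rows = [(1, 1)] * 6 + [(1, 0)] * 4 + [(0, 0)] * 7 + [(0, 1)] * 3


def accuracy(predict):
    """Share of rows where predict(studied) equals the actual outcome."""
    correct = sum(1 for studied, passed in rows if predict(studied) == passed)
    return correct / len(rows)


print("Predict pass only if studied: %.2f" % accuracy(lambda studied: studied))
print("Always predict pass:          %.2f" % accuracy(lambda studied: 1))
print("Always predict fail:          %.2f" % accuracy(lambda studied: 0))
```

With a single yes/no predictor, "pass only if studied" is the most accurate of these rules, at 13 correct out of 20.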