# Lesson 8: Logistic Regression
## Starter code for guided practice & demos

In [8]:
# Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import math
from pathlib import Path
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
%matplotlib inline

# Config
DATA_DIR = Path('.')
np.random.seed(1)

## Slide: "Wager those odds!"

### Guided Practice: Logit Function and Odds

In [23]:
def logit_func(odds):
    # takes odds (a float) and returns the log odds (logit)
    return np.log(odds)

def sigmoid_func(logit):
    # takes a logit (a float) and returns the corresponding probability
    return 1 / (1 + np.exp(-logit))

odds_set = [
    4./1,   # AlphaGo : Sedol,   4:1
    20./1,  # Chelsea : Leicester City,   20:1
    1.1/1,  # England : Wales,   1.1:1
    7.0/4,  # Brexit : Remain,   7:4
    17.0/3  # Not President Trump : President Trump,   17:3
]

In [14]:
print(odds_set)

[4.0, 20.0, 1.1, 1.75, 5.666666666666667]


In [21]:
# Print the log odds (logit) for each case above
for odds in odds_set:
    print(logit_func(odds))
  

1.38629436112
2.99573227355
0.0953101798043
0.559615787935
1.73460105539


In [22]:
# Print the probability of the (predicted) better team winning in each case above
for odds in odds_set:
    print(sigmoid_func(logit_func(odds)))
  

0.8
0.952380952381
0.52380952381
0.636363636364
0.85


In [24]:
[sigmoid_func(logit_func(odds)) for odds in odds_set]

[0.80000000000000004,
 0.95238095238095233,
 0.52380952380952384,
 0.63636363636363635,
 0.84999999999999998]
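
The round trip through `logit_func` and `sigmoid_func` simplifies algebraically: 1 / (1 + exp(-log(odds))) = odds / (1 + odds). A minimal sketch of that shortcut follows; the helper names `odds_to_probability` and `probability_to_odds` are illustrative and not part of the starter code.

```python
import numpy as np

def odds_to_probability(odds):
    # Closed form of sigmoid(log(odds)): p = odds / (1 + odds)
    return odds / (1.0 + odds)

def probability_to_odds(p):
    # Inverse mapping: odds = p / (1 - p)
    return p / (1.0 - p)

# Same odds as in odds_set above
odds_set = np.array([4.0, 20.0, 1.1, 1.75, 17.0 / 3])

probs = odds_to_probability(odds_set)
# The long way round, via the log odds and the sigmoid
round_trip = 1 / (1 + np.exp(-np.log(odds_set)))

print(np.allclose(probs, round_trip))
print(np.allclose(probability_to_odds(probs), odds_set))
```

Both checks should agree, which is why odds of 4:1 map straight to a probability of 4/5.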

## Slide: "Logistic regression implementation"
Use the data in `titanic.csv` and the `LogisticRegression` estimator in sklearn to predict the target variable `survived`.

1. What is the bias, or prior probability, of the dataset?
2. Build a simple model with one feature and explore the `coef_` value. Does it represent the odds or the logit (log odds)?
3. Build a more complicated model using multiple features. Interpreting the odds, which features have the most impact on survival? Which features have the least?
4. What is the accuracy of your model?
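
Before working on the real data, questions 1 and 2 can be rehearsed on synthetic data. The sketch below is an illustration under stated assumptions, not the intended solution: the feature, the outcome rates (0.2 and 0.7) and the name `lr_demo` are all made up, and a very large `C` is used so that regularisation barely affects the fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Synthetic stand-in for a binary feature (NOT the Titanic data)
x = rng.randint(0, 2, size=1000)
# Made-up outcome rates: 0.2 when x == 1, 0.7 when x == 0
y = (rng.rand(1000) < np.where(x == 1, 0.2, 0.7)).astype(int)

# Share of positive outcomes in the data (the prior, or base rate)
prior = y.mean()

# Large C means very weak regularisation
lr_demo = LogisticRegression(C=1e6)
lr_demo.fit(x.reshape(-1, 1), y)

# coef_ lives on the logit (log odds) scale
log_odds_coef = lr_demo.coef_[0][0]
# Exponentiating gives the odds ratio between x == 1 and x == 0
odds_ratio = np.exp(log_odds_coef)

# Odds ratio computed directly from the group rates, for comparison
p1, p0 = y[x == 1].mean(), y[x == 0].mean()
empirical_odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))

print(prior, log_odds_coef, odds_ratio, empirical_odds_ratio)
```

With one binary feature and almost no regularisation, the exponentiated coefficient should land close to the odds ratio computed directly from the two group rates.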

N.B. `age` will need some work (since it is missing for a significant portion), and other data cleanup simplifies the data problem a little.
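
One common way to handle the missing ages is to fill them with the median and keep a flag for the rows that were missing. The sketch below uses a toy frame with made-up values; the column names `age_missing` and `age_filled` are hypothetical, and median filling is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are made up
toy = pd.DataFrame({'age': [22.0, np.nan, 38.0, np.nan, 54.0]})

# Flag the rows that were missing before filling them in
toy['age_missing'] = toy['age'].isnull().astype(int)

# Fill the gaps with the median of the observed ages
toy['age_filled'] = toy['age'].fillna(toy['age'].median())

print(toy)
```

The same two lines should carry over to the `titanic` frame, assuming its `age` column holds numeric values with NaN for the gaps.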

In [None]:
titanic = pd.read_csv(DATA_DIR / 'titanic.csv')

# Transform male/female to 1/0
titanic['is_male'] = titanic.sex.apply(lambda x: 1 if x == 'male' else 0)

titanic.head()

In [None]:
lr = LogisticRegression()
X = titanic[['is_male']]  # try putting other feature(s) in here
y = titanic['survived']
lr.fit(X, y)

In [None]:
# Find out how to print out the log-reg coefficients, intercept and mean survival rate
# Docs: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html


In [None]:
# Print out the odds for each coefficient


## Slide: "Evaluating logistic regression with alternative metrics"
This Titanic dataset comes from [Kaggle](https://www.kaggle.com/c/titanic).

Spend a few minutes determining which data would be most important to use in the prediction problem. You may need to create new features based on the data available. Consider using a feature selection aid in sklearn. At a minimum, identify one or two strong features that would be useful to include in this model.


1. Spend 1-2 minutes considering which metric makes the most sense to optimise. Accuracy? FPR or TPR? AUC? Given the "business problem" of understanding survival rate aboard the Titanic, why should you use this metric?

2. Build a tuned logistic regression model. Be prepared to explain your design (including regularisation), choice of metric, and your chosen feature set in predicting survival using any tools necessary (such as fit charts). Use the starter code to get you going.
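
One way to approach the tuning step is to compare a few values of the regularisation strength `C` by cross-validated AUC. The sketch below is a hedged illustration on synthetic data from `make_classification`, not the Titanic data; the grid of `C` values and the choice of `roc_auc` as the metric are assumptions, and `sklearn.model_selection` requires scikit-learn 0.18 or later.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic classification problem standing in for the real features
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0)

results = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    # Smaller C means stronger regularisation
    model = LogisticRegression(C=C, max_iter=1000)
    # 5-fold cross-validation, scored by area under the ROC curve
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='roc_auc')
    results[C] = scores.mean()

# Pick the C with the highest mean cross-validated AUC
best_C = max(results, key=results.get)
print(results, best_C)
```

Swapping `scoring` for another metric (for example `'accuracy'`) lets you see whether the preferred `C` changes with the metric you chose in step 1.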

N.B. If you haven't done it yet, `age` will need some work (since it is missing for a significant portion), and other data cleanup simplifies the data problem a little.

In [None]:
# Here's some code for fitting a model and plotting an ROC curve
lr = LogisticRegression()
X = titanic[['is_male']]  # put your other feature(s) in here
y = titanic['survived']
lr.fit(X, y)

predictions = lr.predict(X)
probabilities = lr.predict_proba(X)
# roc_curve returns (fpr, tpr, thresholds); compute it once and reuse it
fpr, tpr, thresholds = roc_curve(y, probabilities[:, 1])
plt.plot(fpr, tpr)

In [None]:
# To understand this a little further, try printing these in turn
#titanic[['survived']]
#probabilities
#probabilities[:,1]
roc_curve(titanic[['survived']], probabilities[:,1])
#print(roc_curve(titanic[['survived']], probabilities[:,1])[0])
#print(roc_curve(titanic[['survived']], probabilities[:,1])[1])

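The ROC curve above is built by sweeping the probability threshold. With `is_male` as the only feature, the model produces just two distinct probabilities, so there is only one threshold that changes anything: hence a single point, joined to (0,0) and (1,1). This will become clearer if you substitute a feature with more values, e.g. age (after you've cleaned it up!).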
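
To see how the number of thresholds depends on the feature, the sketch below compares a binary feature with a continuous one on synthetic data. Everything in it is an assumption made for illustration: the data-generating coefficients, the names `binary_feature` and `continuous_feature`, and the helper `roc_points`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

rng = np.random.RandomState(0)
n = 500

binary_feature = rng.randint(0, 2, size=n)
continuous_feature = rng.normal(size=n)

# Made-up outcome that depends on both features
logit = 1.5 * binary_feature + 1.0 * continuous_feature - 0.75
y_demo = (rng.rand(n) < 1 / (1 + np.exp(-logit))).astype(int)

def roc_points(feature):
    # Fit a one-feature model and return the ROC curve of its probabilities
    X_demo = feature.reshape(-1, 1)
    model = LogisticRegression().fit(X_demo, y_demo)
    return roc_curve(y_demo, model.predict_proba(X_demo)[:, 1])

_, _, t_binary = roc_points(binary_feature)
_, _, t_continuous = roc_points(continuous_feature)

# The binary feature yields only a handful of thresholds; the continuous one many more
print(len(t_binary), len(t_continuous))
```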

In [None]:
# Same plot, but built from the hard 0/1 predictions instead of probabilities
fpr_hard, tpr_hard, _ = roc_curve(y, predictions)
plt.plot(fpr_hard, tpr_hard)

This chart, which does not vary the threshold, shows the single TPR/FPR point produced by the default predictions, joined to (0,0) and (1,1).

The first chart will be more useful as you compare models and decide where the decision threshold should sit for the data. The second is a simplified version of the first, in case the idea of thresholds is confusing.

In [None]:
# Finally, you can use the `roc_auc_score` function to calculate the area under a ROC curve (AUC).
# Hard predictions give the AUC of the second chart; pass probabilities[:, 1] for the first.
roc_auc_score(titanic['survived'], lr.predict(X))

In [None]:
# ...