### Guided Practice: Logit Function and Odds

In [None]:
import pandas as pd
import numpy as np

# The Logit function

The logit function is the function/transformation we apply to our outcome variable in a logistic regression.

Our predictors, each weighted by its coefficient and summed, give us the log odds of our outcome occurring: the logit.

We can then convert the logit our model predicts into a probability using the sigmoid function.

In [None]:
def logit_func(odds):
    # takes a float (odds) and returns the log odds (logit)
    return np.log(odds)

def sigmoid_func(logit):
    # takes a float (logit) and returns the probability
    return 1. / (1 + np.exp(-logit))

Odds can be any positive number from 0 to infinity, and are typically expressed as a fraction or ratio.

Odds = 1 : 1 means that either outcome (success or failure) is equally likely.

Odds = 2 : 1 means that the first outcome (success) is twice as likely as the second outcome (failure).

Odds = 1 : 2 means that the second outcome (failure) is twice as likely as the first outcome (success).

In [None]:

odds_set = [
    1.0 / 1.0,
    2.0 / 1.0,
    1.0 / 2.0,
    6.0 / 6.0,
    99999999999999.0 / 1.0,
    1.0 / 99999999999999.0
]

Notice below how the sigmoid function can take the log odds and transform them into a probability equivalent to what the odds indicate.

When the odds are 1:1, and each outcome is equally likely, we get the probability of success as .50 or 50%.

Also, no matter how large or small our odds are, the probability returned can never be above 1.0 or below 0.0.

In [None]:
for odds in odds_set:
    print(sigmoid_func(logit_func(odds)))
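As a cross-check on the loop above, the round trip `sigmoid(log(odds))` simplifies algebraically to `odds / (1 + odds)`. The sketch below uses only the standard library to compare the two routes. The helper name `odds_to_probability` is hypothetical and not part of the notebook.

```python
import math

def sigmoid_func(logit):
    # Same formula as the notebook's sigmoid, written with the math module
    return 1.0 / (1.0 + math.exp(-logit))

def odds_to_probability(odds):
    # Hypothetical helper: direct conversion from odds to probability
    return odds / (1.0 + odds)

for odds in [1.0, 2.0, 0.5]:
    via_logit = sigmoid_func(math.log(odds))
    direct = odds_to_probability(odds)
    # Both columns should agree up to floating point rounding
    print(odds, round(via_logit, 4), round(direct, 4))
```

Odds of 2 : 1 should therefore correspond to a probability of about two thirds, and odds of 1 : 2 to about one third.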

# Logistic regression

Logistic regression uses the same form as linear regression, where our outcome is equal to:

$$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$

The outcome, however, is the log odds of y, where p is the probability that y = 1:

$$\log{\frac{p}{1-p}} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$

The coefficients represent the increase in the log odds of our outcome occurring for a 1 unit increase in the corresponding X.
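To make the coefficient interpretation concrete, here is a minimal sketch with a single predictor. The coefficient values `beta_0` and `beta_1` are hypothetical (they are not fitted to any data) and the function names are illustrative only.

```python
import math

# Hypothetical coefficients for a one-predictor model (not fitted to any data)
beta_0, beta_1 = -1.0, 0.5

def log_odds(x):
    # Linear predictor: the right-hand side of the logistic regression equation
    return beta_0 + beta_1 * x

def probability(x):
    # Invert the logit to recover p
    return 1.0 / (1.0 + math.exp(-log_odds(x)))

def odds(x):
    p = probability(x)
    return p / (1.0 - p)

# Increasing x by 1 adds beta_1 to the log odds...
print(log_odds(3.0) - log_odds(2.0))
# ...which multiplies the odds by exp(beta_1)
print(odds(3.0) / odds(2.0), math.exp(beta_1))
```

The additive change in log odds is constant, but the resulting change in probability depends on where along the sigmoid curve you start.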

In [None]:
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Load in the college admissions data

We're going to read in the college admissions data, which contains information on a student's GRE score, their GPA, and their school rank.

Again:

'admit' is a binary variable. It indicates whether a candidate was admitted (admit = 1) or not (admit = 0)

'gre' is GRE score

'gpa' stands for Grade Point Average

'rank' is the rank of an applicant's undergraduate alma mater, with 1 being the highest and 4 as the lowest

In [None]:
df = pd.read_csv('collegeadmissions.csv')

In [None]:
df.head()

Rank currently runs from 1 to 4, but we can treat it as categorical because the difference between adjacent ranks may not be the same.

The following code converts rank into dummy variables, one per rank, and joins them to the dataset. When we fit the model we will use ranks 1, 2, and 3 as predictors, leaving rank 4 as the reference category.

In [None]:
df = df.join(pd.get_dummies(df['rank'], prefix='rank'))

Look at how our dataframe now has the 4 additional dummy variables corresponding to the four ranks.

In [None]:
df.head()
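A small sketch of what `pd.get_dummies` produces, using a hypothetical toy column rather than the admissions file. It also shows why one dummy is usually left out of the model: each row has exactly one active dummy, so the full set is perfectly collinear with an intercept.

```python
import pandas as pd

# Hypothetical toy data standing in for the admissions 'rank' column
toy = pd.DataFrame({'rank': [1, 2, 3, 4, 2]})

dummies = pd.get_dummies(toy['rank'], prefix='rank')
print(list(dummies.columns))

# Each row has exactly one active dummy, so the four columns always sum to 1
print(dummies.sum(axis=1).tolist())

# Dropping one column (here rank_4) makes it the reference category
reduced = dummies.drop(columns='rank_4')
print(list(reduced.columns))
```

With `rank_4` dropped, the coefficients on the remaining dummies are read relative to applicants from rank 4 schools.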

Let's enter all of our predictors into a model that tries to predict admission.

In [None]:
predictors = ['gre', 'gpa', 'rank_1', 'rank_2', 'rank_3']
lm = LogisticRegression()
lm.fit(df[predictors], df['admit'])

We can view the coefficients for each predictor. These coefficients represent the increase in log odds of admission for a 1 unit increase in the variable, holding all of the other variables constant.


In [None]:
list(zip(predictors, lm.coef_[0]))

Notice how, as a person's GRE score increases by 1 unit, the log odds of being admitted increase by 0.001639.

Because it might be easier to think in terms of odds and not log odds, we can exponentiate the coefficients to interpret them as increasing the odds of being admitted.

In [None]:
# You can easily convert these into odds ratios using numpy.exp()
import numpy as np
list(zip(predictors, np.exp(lm.coef_)[0]))

We can see that the odds of being admitted multiply by 1.0016 for a 1 unit increase in GRE score.

The exponentiated coefficients also make it clearer that the odds of being admitted decrease as a school's rank approaches 4.
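A single GRE point is a very small step, so it can help to rescale the odds ratio to a larger change. The sketch below takes the GRE coefficient quoted above (0.001639) and computes the multiplier for a 100 point increase. The choice of 100 points is only an illustration.

```python
import math

gre_coef = 0.001639  # log-odds change per GRE point, as quoted above

# Odds multiplier for a 1 point increase
odds_ratio_1_point = math.exp(gre_coef)

# Log odds add, so odds multiply: a 100 point increase scales the coefficient by 100
odds_ratio_100_points = math.exp(gre_coef * 100)

print(round(odds_ratio_1_point, 4))
print(round(odds_ratio_100_points, 3))
```

Under this coefficient, a 100 point increase multiplies the odds of admission by roughly 1.18, which is the per-point multiplier raised to the 100th power.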

We can evaluate the accuracy of the model by using the score function.

In [None]:
feature_set = df[predictors]
print(lm.score(feature_set, df['admit']))

The accuracy of the model with all features (leaving out one rank dummy as the reference category) is ~70%.

# Visualizing and tuning a model

Let's examine the Titanic dataset, which contains information on whether a passenger died or survived when the ship sank. The dataset also contains information on variables associated with each passenger.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

In [None]:
titanic = pd.read_csv('titanic.csv')
titanic.set_index('PassengerId', inplace=True)

We can create variables that match predictions of interest.

For example, we might believe that a person's class on the ship predicts whether that person survived.

We might also predict that being male decreased a person's chance of surviving.

Age may also have an effect: perhaps younger passengers were more likely to survive, as were people with parents or siblings aboard the ship.

In [None]:
titanic = titanic.join(pd.get_dummies(titanic.Pclass))
titanic['is_male'] = titanic.Sex.apply(lambda x: 1 if x == 'male' else 0)
titanic['Age'] = titanic.groupby(["Sex", 'Pclass']).Age.transform(lambda x: x.fillna(x.mean()))
titanic['had_parents'] = titanic.Parch.apply(lambda x: 1 if x > 0 else 0)
titanic['had_siblings'] = titanic.SibSp.apply(lambda x: 1 if x > 0 else 0)

In [None]:
titanic.head(5)

Before creating the model, we can visualize differences in survival by plotting two histograms of a variable (here, age): one for each outcome.

Notice the slight mean difference between the histograms, indicating that age may indeed be related to survival probability.

In [None]:
%matplotlib inline
titanic.groupby('Survived').Age.hist()

We will fit the model using a cross-validated grid search over its parameters. For logistic regression, we can tune a regularization parameter, C (the inverse of the regularization strength, so smaller values regularize more strongly and help prevent overfitting), and a class weight to bias our predictions towards one particular class (if we think the base-rate probability of each outcome isn't 50/50).
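For reference, scikit-learn documents the `'balanced'` class weight as `n_samples / (n_classes * count_of_that_class)`. A minimal sketch of that arithmetic, using hypothetical labels rather than the Titanic data:

```python
from collections import Counter

# Hypothetical labels: 6 negatives and 4 positives
labels = [0] * 6 + [1] * 4

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Rarer classes receive proportionally larger weights
balanced_weights = {
    cls: n_samples / (n_classes * count)
    for cls, count in counts.items()
}
print(balanced_weights)
```

When the classes are exactly 50/50, both weights come out to 1 and the option has no effect.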

In [None]:
from sklearn.model_selection import GridSearchCV, KFold

feature_set = titanic[['is_male', 1, 2, 'Fare', 'Age', 'had_parents', 'had_siblings']]
# Recent scikit-learn versions require all column names to be strings
feature_set.columns = feature_set.columns.astype(str)

gs = GridSearchCV(
    estimator=LogisticRegression(),
    param_grid={'C': [10**-i for i in range(-5, 5)], 'class_weight': [None, 'balanced']},
    cv=KFold(n_splits=10),
    scoring='roc_auc'
)

gs.fit(feature_set, titanic['Survived'])
gs.cv_results_

We'll take the best performing estimator from the grid search and save it to a variable called "lm"

In [None]:
lm = gs.best_estimator_
print(lm)

The ROC curve below is traced out by sweeping the decision threshold: near the bottom left, a false positive rate (x-axis) of ~0 comes with a true positive rate (y-axis) of ~0, and near the top right both rates approach 1.

The second chart, which does not vary the threshold, shows the single TPR and FPR point produced by the model's default predictions, joined to (0, 0) and (1, 1).

The first chart will be more effective as you compare models and determine where the decision line should exist for the data. The second simplifies the first in case this idea of thresholds is confusing.
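To see how each threshold produces one point on the ROC curve, here is a minimal pure-Python sketch. The labels and probabilities are hypothetical (not output from the Titanic model), and `roc_point` is an illustrative helper rather than a scikit-learn function.

```python
# Hypothetical true labels and predicted probabilities (not from the Titanic model)
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]

def roc_point(threshold):
    # Classify as positive when the predicted probability meets the threshold
    preds = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives  # (FPR, TPR)

# Sweep from a threshold nothing can reach down to one everything meets
for threshold in [1.01, 0.7, 0.5, 0.3, 0.0]:
    print(threshold, roc_point(threshold))
```

A very high threshold predicts no positives, giving the (0, 0) corner. A threshold of 0 predicts everything positive, giving the (1, 1) corner. The curve is the path between them.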

In [None]:
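predictions = lm.predict(feature_set)  # hard 0/1 predictions at the default threshold
probas = lm.predict_proba(feature_set)  # column 1 holds P(Survived = 1)
print(len(probas))
fpr, tpr, thresholds = roc_curve(titanic['Survived'], probas[:, 1])
plt.plot(fpr, tpr)  # first chart: one point per threshold
plt.figure()
fpr_hard, tpr_hard, _ = roc_curve(titanic['Survived'], predictions)
plt.plot(fpr_hard, tpr_hard)  # second chart: a single TPR/FPR point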

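Finally, you can use the `roc_auc_score` function to calculate the area under these curves (AUC). Passing predicted probabilities gives the area under the threshold-based curve; passing hard 0/1 predictions gives the area under the single-threshold curve.

In [None]:
print(roc_auc_score(titanic['Survived'], lm.predict_proba(feature_set)[:, 1]))  # threshold-based curve
print(roc_auc_score(titanic['Survived'], lm.predict(feature_set)))  # single-threshold curve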
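AUC can also be read as the probability that a randomly chosen positive case is scored higher than a randomly chosen negative case, with ties counting as half. The sketch below computes it that way on hypothetical labels and scores. `pairwise_auc` is an illustrative helper, not a scikit-learn function. It shows why hard 0/1 predictions generally give a different value than probabilities.

```python
def pairwise_auc(y_true, scores):
    # Fraction of (positive, negative) pairs where the positive scores higher;
    # ties count as half
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]
# Hard predictions at a 0.5 threshold discard the ranking within each group
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

print(pairwise_auc(y_true, y_prob))
print(pairwise_auc(y_true, y_pred))
```

Thresholding collapses the scores into two values, so many pairs become ties and the AUC reflects only that one operating point.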

### Cancer Prediction Problem

The following data set is taken from actual breast cancer screening images.

It records whether the breast mass was malignant (cancer) or benign.

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.  A few of the images can be found at: http://www.cs.wisc.edu/~street/images/

Your goal is to try to predict the status of the mass, with special consideration to the desired classification metric.

Here is a full data dictionary for the data:

1) ID number

2) Diagnosis (M = malignant, B = benign)

3-32) Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)

b) texture (standard deviation of gray-scale values)

c) perimeter

d) area

e) smoothness (local variation in radius lengths)

f) compactness (perimeter^2 / area - 1.0)

g) concavity (severity of concave portions of the contour)

h) concave points (number of concave portions of the contour)

i) symmetry

j) fractal dimension ("coastline approximation" - 1)

**Goals**

1. Spend a few minutes determining which data would be most important to use in the prediction problem. You may need to create new features based on the data available. Consider using a feature selection aid in sklearn. At a minimum, identify one or two strong features that would be useful to include in the model.
2. Spend 1-2 minutes considering which _metric_ makes the most sense to optimize. Accuracy? FPR or TPR? AUC?
3. Build a tuned logistic regression model.


In [None]:
cancer = pd.read_csv('cancer.csv')
cancer.head()










