In [1]:
import seaborn
import pandas as pd
import numpy as np


In [2]:
titanic = seaborn.load_dataset('titanic')
titanic.sample()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
515,0,1,male,47.0,0,0,34.0208,S,First,man,True,D,Southampton,no,True


# Task 1

In [3]:
# Split the target 'survived' off from the predictors
X = titanic.drop('survived', axis=1)
y = titanic['survived']

In [8]:
# convert the 'sex' column to a binary where 1 represents female
X['sex'] = X['sex'].map({'female' : 1, 'male' : 0})

In [9]:
X.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,3,0,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,3,1,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,3,0,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


# Task 2

In [10]:
# Implement a train-test split using all default arguments and random_state=42
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Task 3

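We will fit a logistic regression model using only `sex` and `fare`.  Remember that scikit-learn's `LogisticRegression` applies L2 regularization by default.  Because the penalty depends on the size of the coefficients, the scale of the features affects the fitted model and therefore its scores.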
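To make the point about scaling concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the data is randomly generated rather than taken from the Titanic set, the names `X_demo`, `y_demo`, `raw_model`, and `scaled_model` are hypothetical, and `max_iter=1000` is set only so the unscaled fit has room to converge. It fits the same regularized model with and without standardization and measures how far the predicted probabilities drift apart.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the Titanic set):
# one binary feature and one wide-ranging, fare-like feature.
rng = np.random.default_rng(0)
n = 60
binary = rng.integers(0, 2, size=n)
wide = rng.normal(loc=30.0, scale=50.0, size=n)
X_demo = np.column_stack([binary, wide]).astype(float)

# Labels depend on both features plus some noise.
signal = 1.5 * (binary - 0.5) + (wide - 30.0) / 50.0
y_demo = (signal + rng.normal(scale=1.0, size=n) > 0).astype(int)

# Same default L2 penalty, fitted on the raw features...
raw_model = LogisticRegression(random_state=42, max_iter=1000)
raw_model.fit(X_demo, y_demo)

# ...and on the standardized features.
X_demo_sc = StandardScaler().fit_transform(X_demo)
scaled_model = LogisticRegression(random_state=42, max_iter=1000)
scaled_model.fit(X_demo_sc, y_demo)

# Compare the class-1 probabilities the two fits assign to the same rows.
raw_proba = raw_model.predict_proba(X_demo)[:, 1]
scaled_proba = scaled_model.predict_proba(X_demo_sc)[:, 1]
max_gap = np.abs(raw_proba - scaled_proba).max()
print(max_gap)
```

Without a penalty the two fits would describe the same model in different units. With the penalty, each coefficient is shrunk according to its numeric size, so the predictions generally do not match.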

In [12]:
# Fit a standard scaler to train columns of interest, and transform both train and test
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train[['sex', 'fare']])
X_test_sc = ss.transform(X_test[['sex', 'fare']])

In [15]:
# Instantiate a logistic regression model with the default arguments and random_state=42
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=42)

# Fit on the scaled training data. 
lr.fit(X_train_sc, y_train)

LogisticRegression(random_state=42)

# Task 4

The `predict()` method applies a decision threshold to the predicted probabilities.

In [16]:
# Model test predictions
y_hat = lr.predict(X_test_sc)

Under the hood, the model applies a decision threshold to assign a 0 or 1 prediction.
For the exercise below, apply a decision threshold of .5 to the predicted probabilities so that you recreate the 0 or 1 predictions stored in `y_hat` above.

In [44]:
# Convert these predicted probabilities to 0/1 predictions
y_hat_proba = lr.predict_proba(X_test_sc)

# Apply threshold here
y_hat_by_hand = [0 if val[1] < 0.5 else 1 for val in y_hat_proba]

In [36]:
assert (y_hat == y_hat_by_hand).sum() == len(y_hat)

# Stretch Task

For the final task, we will recreate the predicted probabilities using the coefficients and intercept of the fitted model.  The parameters that the `fit` method calculates are fed into a linear equation whose output represents log odds.  That can be confusing, but it's easier to follow if you think about what results from feeding the log odds into the sigmoid function.

In [38]:
def sigmoid(log_odd):
    
    '''
    The sigmoid (the inverse of the logit link function)
    takes a log-odds prediction and returns the probability
    of class 1, a number between 0 and 1.
    '''
    
    return 1/(1+np.exp(-log_odd))
    

The sigmoid takes a log-odds input and translates it to a number between 0 and 1.  Any negative result of the linear equation (log odds) translates into a probability closer to 0 than 1; in other words, below .5.  A positive log odds results in a probability closer to 1 than 0; in other words, above .5.  A log odds of exactly 0 maps to .5.
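The behavior described above can be checked on a few hand-picked inputs. This is a small sketch: `sigmoid` is redefined only so the snippet runs on its own, and the input values and the name `recovered` are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(log_odd):
    # Map log odds to a probability between 0 and 1.
    return 1 / (1 + np.exp(-log_odd))

# Negative, zero, and positive log odds (illustrative values).
for log_odd in [-2.0, 0.0, 2.0]:
    print(log_odd, sigmoid(log_odd))

# Going the other way: the log of the odds p / (1 - p) recovers the input.
p = sigmoid(2.0)
recovered = np.log(p / (1 - p))
```

The negative input lands below .5, zero lands exactly on .5, and the positive input lands above .5, which is why a threshold of .5 on the probability is the same as a threshold of 0 on the log odds.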

In the cell below, use the `coef_` and `intercept_` attributes of the fitted model to calculate the log odds for each record in the test set. Then pass these log odds into the sigmoid function above.

In [40]:
lr.coef_

array([[1.14069272, 0.5478949 ]])

In [39]:
lr.intercept_

array([-0.58133268])
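As a worked example for a single record, the sketch below plugs the coefficient and intercept values printed above into the linear equation. The record itself is hypothetical: `scaled_record` holds made-up standardized values for `sex` and `fare`, not a row from the test set.

```python
import numpy as np

# Values copied from the lr.coef_ and lr.intercept_ outputs above.
coef = np.array([1.14069272, 0.5478949])
intercept = -0.58133268

# Hypothetical standardized record: [sex, fare] (made-up values).
scaled_record = np.array([1.0, -0.5])

# Linear equation: intercept + coef_sex * sex + coef_fare * fare.
log_odd = intercept + np.dot(scaled_record, coef)

# Sigmoid turns the log odds into a class-1 probability.
probability = 1 / (1 + np.exp(-log_odd))
print(log_odd, probability)
```

Here the log odds come out slightly positive (about 0.285), so the probability is a little above .5 (about 0.571) and the record would be assigned class 1 at a threshold of .5.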

In [None]:
# Log odds are the dot product of the scaled test features (X_test_sc)
# with the flattened coef_ attribute, plus the intercept_ attribute.
log_odds = X_test_sc @ lr.coef_.ravel() + lr.intercept_

# Use a list comprehension to apply the sigmoid function to each log odd.
predict_proba_by_hand = [sigmoid(log_odd) for log_odd in log_odds]

In [None]:
# Every hand-computed probability should match the model's (223 test records)
assert np.isclose(predict_proba_by_hand, y_hat_proba[:,1]).sum() == len(y_hat_proba)