In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

from cm import plot_confusion_matrix

import warnings

warnings.filterwarnings('ignore')

%matplotlib inline

# Build and Evaluate a Logistic Regression Model

Logistic regression is a type of model that can be used to predict a categorical outcome. Specifically, it outputs a predicted probability that an observation belongs to one class rather than the other.
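As a rough sketch of where that probability comes from: the model computes a weighted sum of the features (the log-odds) and passes it through the logistic (sigmoid) function. The input values below are made up purely for illustration and are not taken from the Titanic model.

```python
import math

def sigmoid(z):
    """Map a log-odds value z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical log-odds values, chosen only for illustration.
for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3))
```

A log-odds of 0 corresponds to a probability of exactly 0.5, and larger log-odds push the probability toward 1.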

In this notebook, you'll see how to build a logistic regression model to predict whether a passenger on the Titanic would have lived or died.

In [None]:
titanic = pd.read_csv('../data/titanic.csv')

In [None]:
titanic.head()

In [None]:
titanic.Survived.value_counts()

We can see that 342 passengers survived and 545 died.

In [None]:
titanic.Survived.value_counts(normalize = True)

61% of passengers died.

Now, let's see how survival relates to some of our variables.

Starting with the passenger class, we can use the `pandas` `crosstab` function to get a count or proportion of survival by passenger class.

In [None]:
pd.crosstab(titanic['Pclass'], titanic['Survived'])

In [None]:
pd.crosstab(titanic['Pclass'], titanic['Survived'], normalize='index')

We can turn this table into a bar plot.

In [None]:
pd.crosstab(titanic['Pclass'], titanic['Survived'], normalize='index').plot(kind = 'bar', edgecolor = 'black')
plt.xticks(rotation = 0)
plt.legend(loc = 'upper left', labels = ['died', 'survived']);

In [None]:
pd.crosstab(titanic['Sex'], titanic['Survived'], normalize='index').plot(kind = 'bar', edgecolor = 'black')
plt.xticks(rotation = 0)
plt.legend(loc = 'upper left', labels = ['died', 'survived']);

In [None]:
sns.boxplot(x = 'Survived', y = 'Fare', data = titanic);

In [None]:
sns.boxplot(x = 'Survived', y = 'Age', data = titanic);

Currently, sex is encoded as a string. You need to encode it numerically before building your model. Also, passenger class is encoded as a number, but it behaves more like a categorical variable, so it should be converted to _dummy variables_.
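To make the idea of dummy variables concrete, here is a minimal pure-Python sketch. The helper `dummy_encode` is hypothetical and only mimics the idea behind `pd.get_dummies(..., drop_first = True)`; it is not how pandas implements it.

```python
def dummy_encode(values, drop_first=True):
    """Build one 0/1 column per category, optionally dropping the first category."""
    categories = sorted(set(values))
    if drop_first:
        # The dropped category becomes the baseline: all of its dummies are 0.
        categories = categories[1:]
    return {cat: [int(v == cat) for v in values] for cat in categories}

pclass = [3, 1, 2, 3]  # hypothetical passenger classes
encoded = dummy_encode(pclass)
print(encoded)
```

With `drop_first`, first class has no column of its own. A first-class passenger is represented by zeros in both the class 2 and class 3 columns.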

In [None]:
titanic = titanic.drop(columns = 'Name')
titanic = pd.get_dummies(titanic, columns = ['Sex', 'Pclass'], drop_first = True)

In [None]:
titanic.head()

In [None]:
X = titanic.drop(columns = 'Survived')
y = titanic.Survived

When doing data science, the goal of building models is almost always to make predictions. The main objective is to have a model that makes good predictions on _unseen_ data. Therefore, in order to evaluate how good a model is, it is necessary to set aside some data as a test set for evaluation purposes. This can be accomplished using the `train_test_split` function.
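Conceptually, a train/test split shuffles the rows and holds some of them out. The sketch below assumes a 25% hold-out, which matches the default of `train_test_split` when no size is given. The helper `simple_train_test_split` is hypothetical and does not reproduce scikit-learn's exact shuffling.

```python
import random

def simple_train_test_split(rows, test_fraction=0.25, seed=321):
    """Shuffle a copy of rows and hold out test_fraction of them as a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = simple_train_test_split(range(20))
print(len(train), len(test))
```

Fixing the seed plays the same role as `random_state`: rerunning the cell yields the same split.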

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321)

In [None]:
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)

In [None]:
fig, ax = plt.subplots(figsize = (7,5))
sns.barplot(x = logistic_model.coef_[0], y = list(X.columns), ax = ax, edgecolor = 'black')
plt.title('Logistic Regression Coefficients')

ymin, ymax = plt.ylim()
plt.vlines(x = 0, ymin = ymin, ymax = ymax);

For this model, a positive coefficient for a variable means that a higher value for that variable (holding all others constant) corresponds to a higher probability of surviving and a negative coefficient indicates the opposite. So, for example, a male passenger has a lower probability of surviving, all else being equal.
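One common way to read a coefficient is to exponentiate it, which gives an odds ratio: the factor by which the odds of surviving are multiplied when that variable increases by one unit. The coefficient value below is hypothetical, not the one fitted in this notebook.

```python
import math

def odds_ratio(coefficient):
    """Convert a logistic regression coefficient into an odds ratio."""
    return math.exp(coefficient)

coef_male = -2.5  # hypothetical coefficient for a male indicator
print(round(odds_ratio(coef_male), 3))
```

An odds ratio below 1 means the odds of surviving shrink, and a coefficient of 0 gives an odds ratio of exactly 1 (no effect).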

## Analyzing Predicted Probabilities

Recall that a logistic regression model outputs predicted probabilities. You can generate these predicted probabilities using the `predict_proba` method of your model. This outputs the predicted probability of both classes (died and survived), but you only really need the probability of surviving, which you can access by slicing the output.

In [None]:
y_pred_prob = logistic_model.predict_proba(X_test)[:,1]

In [None]:
i = 9

print('Passenger Information:\n{}'.format(X_test.iloc[i]))
print('---------------------------------')
print('Predicted Probability of Survival: {}'.format(y_pred_prob[i]))

One way to analyze your model is to look at the receiver operating characteristic (ROC) curve. This shows how the true positive rate and false positive rate change as the prediction threshold changes.
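Each point on the ROC curve comes from one threshold. The sketch below computes a single point by hand on a tiny made-up example. The helper `roc_point` is hypothetical, and it treats scores at or above the threshold as positive predictions.

```python
def roc_point(y_true, y_score, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

# Toy labels and scores, not taken from the Titanic data.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
for threshold in (0.9, 0.5, 0.3, 0.0):
    print(threshold, roc_point(y_true, y_score, threshold))
```

Lowering the threshold moves the point up and to the right: more survivors are caught, but more passengers who died are flagged as survivors too.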

In [None]:
fp_rate, tp_rate, thresholds = metrics.roc_curve(y_test, y_pred_prob)

plt.plot(fp_rate, tp_rate)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for Titanic Survival classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)

### AUC - percentage of the ROC plot that is under the curve

This corresponds to the probability that the model assigns a higher probability of surviving to a random survivor than to a random passenger who died.

A perfect model would have AUC = 1.
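The pairwise interpretation can be checked directly on a small example. This sketch compares every (survivor, non-survivor) pair and counts how often the survivor gets the higher score, counting ties as half. The helper `pairwise_auc` is hypothetical and the data is made up.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs where the positive is scored higher."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy labels and scores, not taken from the Titanic data.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_score))
```

Here three of the four pairs are ordered correctly, so the result is 0.75.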

In [None]:
print(metrics.roc_auc_score(y_test, y_pred_prob))

## Analyzing Predictions

To generate predictions, you can use the `predict` method of your model.
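For a binary logistic regression, these predictions should generally agree with labeling a passenger as a survivor when the predicted probability of surviving is above 0.5. The sketch below illustrates that thresholding step with made-up probabilities. The helper `classify` is hypothetical.

```python
def classify(probabilities, threshold=0.5):
    """Label each probability 1 (survived) if it exceeds the threshold, else 0."""
    return [int(p > threshold) for p in probabilities]

probs = [0.12, 0.50, 0.51, 0.93]  # hypothetical predicted probabilities
print(classify(probs))
print(classify(probs, threshold=0.9))
```

Choosing a different threshold trades sensitivity against specificity, which is exactly what the ROC curve visualizes.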

In [None]:
y_pred = logistic_model.predict(X_test)

The predictions can be summarized using a confusion matrix.
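A confusion matrix is just a tally of the four possible (actual, predicted) combinations. The sketch below counts them by hand on made-up labels, with 1 meaning survived and 0 meaning died. The helper `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Tally true negatives, false positives, false negatives, and true positives."""
    counts = {'tn': 0, 'fp': 0, 'fn': 0, 'tp': 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1:
            counts['tp' if predicted == 1 else 'fn'] += 1
        else:
            counts['fp' if predicted == 1 else 'tn'] += 1
    return counts

# Toy labels, not taken from the Titanic data.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
print(confusion_counts(y_true, y_pred))
```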

In [None]:
plot_confusion_matrix(y_test, y_pred, labels = ['Died', 'Survived'])

There are a number of metrics you can use to evaluate your model. The first is accuracy, which is the proportion of predictions that are correct.

In [None]:
print(metrics.accuracy_score(y_test, y_pred))

In [None]:
plot_confusion_matrix(y_test, y_pred, labels = ['Died', 'Survived'], metric = 'accuracy')

A naive model that predicts that all passengers died would be correct only 62% of the time, so our model is a clear improvement over that baseline.
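The baseline accuracy is simply the share of the most common class. A minimal sketch on made-up labels follows. The helper `majority_baseline_accuracy` is hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

y_true = [0, 0, 0, 1, 1]  # toy labels: three died, two survived
print(majority_baseline_accuracy(y_true))
```

Any model worth keeping should beat this number on the test set.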

**Sensitivity/True Positive Rate:** The proportion of true positives (in our case, people who survived) that are identified as such.

In [None]:
print('sensitivity:')
print(metrics.recall_score(y_test, y_pred))

In [None]:
plot_confusion_matrix(y_test, y_pred, labels = ['Died', 'Survived'], metric = 'sensitivity')

**Specificity/True Negative Rate:** The proportion of true negatives (in our case, people who died) that are identified as such.

In [None]:
print('specificity:')
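# Specificity is the recall of the negative class (0 = died),
# computed from the predictions rather than from hard-coded counts.
print(metrics.recall_score(y_test, y_pred, pos_label = 0))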

In [None]:
plot_confusion_matrix(y_test, y_pred, labels = ['Died', 'Survived'], metric = 'specificity')

**Precision:** The proportion of predicted positives that are actually positive (survived).
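Sensitivity, specificity, and precision differ only in which confusion matrix counts they divide. The sketch below computes all three from a set of hypothetical counts (not the ones from this notebook's model). The helper `rates` is an illustrative name.

```python
def rates(tn, fp, fn, tp):
    """Compute sensitivity, specificity, and precision from confusion matrix counts."""
    return {
        'sensitivity': tp / (tp + fn),  # survivors correctly identified
        'specificity': tn / (tn + fp),  # non-survivors correctly identified
        'precision': tp / (tp + fp),    # predicted survivors who actually survived
    }

# Hypothetical counts, chosen only for illustration.
result = rates(tn=50, fp=10, fn=5, tp=35)
for name, value in result.items():
    print(name, round(value, 3))
```

Note that sensitivity and precision share the same numerator but divide by different totals: actual survivors versus predicted survivors.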

In [None]:
print('precision:')
print(metrics.precision_score(y_test, y_pred))

In [None]:
plot_confusion_matrix(y_test, y_pred, labels = ['Died', 'Survived'], metric = 'precision')