**Logistic Regression Review**
(10-15 minute review of the Beginner Track slides: the intuition, how the model outputs a probability vector, and how a prediction is made)
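
As a quick refresher on the intuition, here is a minimal sketch of how a binary logistic regression model turns features into a probability vector and then into a prediction. The weights, bias, and feature values are made up for illustration; they are not fitted to the Titanic data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias, chosen only for illustration
w = np.array([0.8, -0.5])
b = 0.1

# One hypothetical example with two features
x = np.array([2.0, 1.0])

# Linear score, then squash it into a probability for class 1
z = np.dot(w, x) + b
p_class1 = sigmoid(z)

# Probability vector over the two classes: [P(y=0), P(y=1)]
proba = np.array([1.0 - p_class1, p_class1])

# Predict class 1 when its probability is at least 0.5
prediction = int(p_class1 >= 0.5)

print(proba, prediction)
```

Training the model amounts to finding the weights and bias that make these probabilities match the observed labels as closely as possible.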

In [0]:
#importing libraries
import matplotlib as mpl
import matplotlib.pyplot as plt
import timeit
import numpy as np

from sklearn.linear_model import LogisticRegression

from sklearn import metrics
import pandas as pd

You can import files from your Google Drive. Very convenient!


In [0]:
from google.colab import drive
drive.mount('drive')

Mounted at drive


In [0]:
titanic_data = pd.read_csv('/content/drive/My Drive/new_workshop_data/cleaned_titanic_data.csv', sep=',')
titanic_data.head()

,Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,port_Q,port_S
0,0,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,0,1
1,1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,0,0
2,2,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,0,1
3,3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,0,1
4,4,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,0,1


An extra 'Unnamed: 0' column shows up when we read from our previously saved CSV file. Let's drop it!

In [0]:
titanic_data.drop(columns=['Unnamed: 0'], inplace=True)
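
The 'Unnamed: 0' column is most likely the old row index, which pandas writes out by default when saving with `to_csv`. Here is a minimal sketch of two ways to avoid it, using a small made-up DataFrame and an in-memory buffer instead of the workshop file:

```python
import io
import pandas as pd

# Small made-up frame standing in for the real data
df = pd.DataFrame({'Survived': [0, 1], 'Pclass': [3, 1]})

# Default behaviour: the index is written as an unnamed first column
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)
print(list(with_index.columns))  # includes 'Unnamed: 0'

# Option 1: do not write the index when saving
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
clean_on_save = pd.read_csv(buf)

# Option 2: tell read_csv that the first column is the index
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
clean_on_read = pd.read_csv(buf, index_col=0)

print(list(clean_on_save.columns), list(clean_on_read.columns))
```

Either option keeps the column from appearing in the first place, so no `drop` is needed afterwards.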

In [0]:
titanic_data.head()

,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,port_Q,port_S
0,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,0,1
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,0,0
2,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,0,1
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,0,1
4,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,0,1


In [0]:
titanic_data.drop(columns=['Name','Ticket'], inplace=True)

**Training and Test Split**


It's a good idea to split our data into a training and testing set. The idea is that once we build our model using the training data, we can evaluate its performance on the test data. We can use this information to check if our model is underfitting/overfitting or if it generalizes well to unseen data.

One simple way to do this is a 75:25 split between training and test data. The downside, of course, is that our model has access to less data while training. On the bright side, the test data gives us an indication of how well our model might do in the real world, so we're not taken by surprise later!
*Why is training error not a good estimate of the model's performance?*
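
One way to see the answer: a flexible model can memorize its training data. The sketch below uses synthetic data and a fully grown decision tree (not the Titanic data or the logistic regression model from this notebook) purely to illustrate the gap between training and test accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Synthetic, noisy data: flip_y assigns random labels to about 20% of the samples
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# An unconstrained tree can fit the training set, noise included
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = metrics.accuracy_score(y_train, tree.predict(X_train))
test_acc = metrics.accuracy_score(y_test, tree.predict(X_test))
print('train acc %.3f, test acc %.3f' % (train_acc, test_acc))
```

The training accuracy here mostly reflects memorized noise, so only the test accuracy says anything about unseen data.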

In [0]:
from sklearn.model_selection import train_test_split
# Seed NumPy's global random state so the split below is reproducible
np.random.seed(1)

train, test = train_test_split(titanic_data, test_size=0.25)

Before we can run logistic regression, we need to split both our training and test data into 'X' and 'y': the features and the labels.
(Warning: Do not shuffle X or y on their own after this split! The rows would no longer line up with their labels, and the data would be meaningless.)

In [0]:
trainX = train.drop('Survived', axis=1)
trainY = train['Survived']

testX = test.drop('Survived', axis=1)
testY = test['Survived']
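
An alternative that avoids any risk of misalignment is to separate features and labels first and pass both to `train_test_split`, which shuffles them with the same permutation. This is a minimal sketch on a small made-up DataFrame, not the workshop data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small made-up frame standing in for titanic_data
df = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0, 0, 1],
    'Pclass':   [3, 1, 3, 1, 2, 3, 2, 1],
    'Sex':      [0, 1, 1, 0, 1, 0, 0, 1],
})

X = df.drop('Survived', axis=1)
y = df['Survived']

# Passing X and y together keeps each row paired with its label
trainX, testX, trainY, testY = train_test_split(X, y, test_size=0.25, random_state=1)

# Features and labels still refer to the same rows
print(trainX.index.equals(trainY.index))
print(testX.index.equals(testY.index))
```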

In [0]:
trainX.head()

,Pclass,Sex,Age,SibSp,Parch,Fare,port_Q,port_S
125,2,1,30.0,0,0,13.0,0,1
644,3,0,24.0,0,0,7.7958,0,1
627,2,0,27.0,0,0,13.0,0,1
218,3,0,31.0,0,0,7.775,0,1
173,1,0,51.0,0,1,61.3792,0,0


In [0]:
trainY.head()

125    1
644    0
627    0
218    0
173    0
Name: Survived, dtype: int64

**Implementing Logistic Regression!**

In [0]:
LR_Model = LogisticRegression(max_iter=300)  # allow up to 300 solver iterations (the default is 100)
LR_Model.fit(trainX,trainY)
trainY_pred = LR_Model.predict(trainX)
acc = metrics.accuracy_score(trainY, trainY_pred, normalize=True)
print('Logistic Regression:\t-- train acc %.3f' % acc)

Logistic Regression:	-- train acc 0.802
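
Behind each predicted label is a probability vector. The sketch below, on synthetic data rather than the Titanic features, shows how `predict_proba` exposes those probabilities and how they relate to the output of `predict`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data standing in for the Titanic features
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression(max_iter=300)
model.fit(X, y)

# One row per example, one column per class: [P(class 0), P(class 1)]
proba = model.predict_proba(X[:5])
print(proba)

# predict() returns the class with the highest probability
labels_from_proba = model.classes_[np.argmax(proba, axis=1)]
print(labels_from_proba, model.predict(X[:5]))
```

In this notebook the equivalent call would be `LR_Model.predict_proba(trainX)`.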


Of course, as we mentioned before, the training accuracy doesn't tell us all that much. Let's see how the model does on the test set! (Don't worry, the code is pretty much the same.)

In [0]:
testY_pred = LR_Model.predict(testX)
acc = metrics.accuracy_score(testY, testY_pred, normalize=True)
print('Logistic Regression:\t-- test acc %.3f' % acc)

Logistic Regression:	-- test acc 0.798


This is pretty good! Our test accuracy (0.798) is very close to our training accuracy (0.802). This is an indication that our model has generalized well. We have a good ML model on our hands! Or do we?

(Discussion about biased data sets, and always predicting the positive class when 99% of the data is from the positive class) 
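
To make the discussion concrete, here is a minimal sketch of a baseline that always predicts the majority class. The labels are made up (99% positive) and `DummyClassifier` stands in for a "model" that has learned nothing:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn import metrics

# Made-up, heavily imbalanced labels: 990 positives and 10 negatives
y = np.array([1] * 990 + [0] * 10)
X = np.zeros((len(y), 1))  # features are ignored by this baseline

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)
y_pred = baseline.predict(X)

acc = metrics.accuracy_score(y, y_pred)
# Recall on the negative class: how many of the 10 negatives were found
neg_recall = metrics.recall_score(y, y_pred, pos_label=0)
print('accuracy %.3f, recall on class 0 %.3f' % (acc, neg_recall))
```

The accuracy looks excellent even though the baseline never identifies a single negative example. For the Titanic data, `testY.value_counts(normalize=True)` shows the class balance to compare our accuracy against.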

To verify that our model is actually doing something, it's helpful to think about precision and recall.
Precision = true positives / (true positives + false positives)
Recall = true positives / (true positives + false negatives)
(Discussion about why these are useful metrics)
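
The two formulas above can be checked by hand from a confusion matrix. This is a minimal sketch with made-up labels and predictions:

```python
import numpy as np
from sklearn import metrics

# Made-up true labels and predictions, for illustration only
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print('precision %.3f, recall %.3f' % (precision, recall))

# The library functions compute the same quantities
print(metrics.precision_score(y_true, y_pred), metrics.recall_score(y_true, y_pred))
```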

In [0]:
prec = metrics.precision_score(testY,testY_pred)
print('Precision is:\t %.3f' % prec)
rec = metrics.recall_score(testY,testY_pred)
print('Recall is:\t %.3f' % rec)

Precision is:	 0.744
Recall is:	 0.736


In [0]:
#Discussion about evaluation metrics and how to choose the best one
#Things to do: use CV to find the best regularization setting (scikit-learn's default penalty is the L2 norm)
#K-fold CV
#End session by introducing decision trees
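
As a starting point for the K-fold item above, here is a minimal sketch of 5-fold cross-validation used to compare values of `C`, the inverse regularization strength of `LogisticRegression`. The data is synthetic and the grid of `C` values is a hypothetical choice, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for trainX / trainY
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

results = {}
# Hypothetical grid of C values; smaller C means stronger regularization
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=300)
    # cv=5 splits the data into 5 folds; each fold is held out once
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    results[C] = scores.mean()
    print('C=%-5s mean CV accuracy %.3f' % (C, results[C]))

best_C = max(results, key=results.get)
print('best C:', best_C)
```

Cross-validation like this should be run on the training data only, so that the test set remains an honest final check.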