**Logistic Regression Review**
(10-15 minutes review from Beginner Track slides. Intuition, outputs a probability vector, making a prediction)

In [None]:
#importing libraries
import matplotlib as mpl
import matplotlib.pyplot as plt
import timeit
import numpy as np

from sklearn.linear_model import LogisticRegression

from sklearn import metrics
import pandas as pd

You can import files from your Google Drive. Very convenient!


In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
# Make sure to change the path if you stored the data
# in a different location last time!
titanic_data = pd.read_csv('/content/gdrive/My Drive/cleaned_titanic_data.csv', sep=',')
titanic_data.head()

Unnamed: 0.1,Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,port_Q,port_S
0,0,0,3,0,22.0,1,0,7.25,0,1
1,1,1,1,1,38.0,1,0,71.2833,0,0
2,2,1,3,1,26.0,0,0,7.925,0,1
3,3,1,1,1,35.0,1,0,53.1,0,1
4,4,0,3,0,35.0,0,0,8.05,0,1


An annoying extra column shows up when we read from our previously saved CSV file (most likely the old index, which was written out as a column when the file was saved). Let's drop it!

In [None]:
titanic_data.drop(columns=['Unnamed: 0'], inplace=True)

# Alternatively, reassign the result instead of modifying in place
# (drop returns a new DataFrame by default):
# titanic_data = titanic_data.drop(columns=['Unnamed: 0'])

In [None]:
titanic_data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,port_Q,port_S
0,0,3,0,22.0,1,0,7.25,0,1
1,1,1,1,38.0,1,0,71.2833,0,0
2,1,3,1,26.0,0,0,7.925,0,1
3,1,1,1,35.0,1,0,53.1,0,1
4,0,3,0,35.0,0,0,8.05,0,1


**Training and Test Split**


It's a good idea to split our data into a training and testing set. The idea is that once we build our model using the training data, we can evaluate its performance on the test data. We can use this information to check if our model is underfitting/overfitting or if it generalizes well to unseen data.

One simple way is to use a 75:25 split between training and test data. The downside, of course, is that our model has access to less data while training. On the bright side, the test data gives us an indication of how well our model might do in the real world, so we're not taken by surprise later!
*Why is training error not a good estimate of the model's performance?*
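
One way to get a feel for the answer is a small experiment on made-up data. The sketch below is not part of the Titanic workflow: the `noise_*` names and the random data are assumptions for illustration only. The labels are coin flips, so there is nothing real to learn, yet a model with many features and few training rows can still fit its own training set very well.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

rng = np.random.default_rng(0)

# Made-up data: 200 random features and coin-flip labels, so there is
# no real pattern to learn. Only 40 training rows, but 1000 test rows.
noise_trainX = rng.normal(size=(40, 200))
noise_trainY = rng.integers(0, 2, size=40)
noise_testX = rng.normal(size=(1000, 200))
noise_testY = rng.integers(0, 2, size=1000)

noise_model = LogisticRegression(max_iter=1000)
noise_model.fit(noise_trainX, noise_trainY)

# Accuracy on the rows the model was fit on vs. rows it has never seen
noise_train_acc = metrics.accuracy_score(noise_trainY, noise_model.predict(noise_trainX))
noise_test_acc = metrics.accuracy_score(noise_testY, noise_model.predict(noise_testX))
print('train acc %.3f, test acc %.3f' % (noise_train_acc, noise_test_acc))
```

The training accuracy should come out far higher than the test accuracy, which should sit near chance. The model has memorized noise, and only the held-out rows reveal that.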

In [None]:
from sklearn.model_selection import train_test_split
# Seed NumPy's global random generator so the split below is reproducible
np.random.seed(2)

# Create a train-test split of 75-25 called train and test

It's now important to split both our training and test data into 'X' and 'y': the features and the labels, before we can run logistic regression.
(Warning: Do not shuffle your data after this split! Can anyone answer why this would be a bad idea?)

In [None]:
train, test = train_test_split(titanic_data, test_size=0.25) 

# Split the training data into trainX and trainY
trainX = train.drop('Survived', axis=1)
trainY = train['Survived']

# Split the test data into testX and testY
testX = test.drop('Survived', axis = 1)
testY = test['Survived']

In [None]:
trainX.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,port_Q,port_S
199,3,0,28.0,0,0,9.5,0,1
129,3,0,33.0,0,0,7.8958,0,0
90,3,0,20.0,0,0,7.8542,0,1
230,3,0,29.0,0,0,7.775,0,1
126,3,0,24.0,0,0,7.1417,0,1


Can anyone guess what the numbers in the first column are? Why?

In [None]:
trainY.head()

199    0
129    0
90     0
230    0
126    1
Name: Survived, dtype: int64

**Data Leakage**

Let's talk a little bit about data leakage, and how we can avoid it. Does anyone have any ideas what data leakage is?
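
To make the discussion concrete, here is a minimal sketch of one common form of leakage: computing a preprocessing statistic on all rows before splitting. The toy DataFrame and the `toy_*` names are made up for illustration (our cleaned Titanic file may not need this step at all). The idea is that anything learned from the data, even a simple mean used to fill in missing ages, should come from the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (made up for illustration): Age has missing values
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0, np.nan, 54.0, 2.0, 27.0],
    'Survived': [0, 1, 1, 1, 0, 0, 0, 1],
})

toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)
toy_train = toy_train.copy()
toy_test = toy_test.copy()

# Leaky: a mean computed on ALL rows has already "seen" the test rows
leaky_mean = toy['Age'].mean()

# Safer: compute the mean on the training rows only, then reuse that
# same value to fill gaps in both the training and the test set
train_mean = toy_train['Age'].mean()
toy_train['Age'] = toy_train['Age'].fillna(train_mean)
toy_test['Age'] = toy_test['Age'].fillna(train_mean)
```

The same pattern applies to scaling, encoding, and feature selection: fit on the training data, then apply to the test data.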

**Implementing Logistic Regression!**

We want to create a Logistic Regression model. For this, we will be using Scikit Learn. The first step is to visit the documentation page.

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [None]:
# Create an instance of the Logistic Regression class
from sklearn.linear_model import LogisticRegression

LR_Model = LogisticRegression(max_iter = 300)

# Can someone try to explain what an instance of this class means?

In [None]:
# We now need to feed the model with our training data
# Just one line! That's how easy it is with Scikit Learn

LR_Model.fit(trainX,trainY)

# (If there's an error message, see if you can fix it!)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=300,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
# Our model is now trained on the training data
# Let's now make a prediction on the training data (Again, just one line!)
trainY_pred = LR_Model.predict(trainX)

In [None]:
# Let's now try to measure the training accuracy
from sklearn import metrics


acc = metrics.accuracy_score(trainY, trainY_pred,normalize=True)
print('Logistic Regression:\t-- train acc %.3f' % acc)

Logistic Regression:	-- train acc 0.796


Logistic regression outputs probabilities. Above, scikit-learn took care of the decision for us: predict() returns the class with the highest predicted probability. We can also look at the probabilities themselves and make the decision on our own terms if necessary.

In [None]:
# How do we customize the prediction probabilities? Let's discuss
# Hint: predict_proba function
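
One possible answer, sketched on synthetic data so it runs on its own: the `X_demo`, `y_demo`, and `demo_model` names and the 0.7 threshold are assumptions for illustration, not part of the Titanic workflow. `predict_proba` returns one column per class, and we can apply whatever cutoff we like to the positive-class column.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (stand-in for trainX / trainY)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_model = LogisticRegression(max_iter=300).fit(X_demo, y_demo)

# One row per sample, one column per class (order given by demo_model.classes_)
proba = demo_model.predict_proba(X_demo)
p_positive = proba[:, 1]

# predict() behaves like a 0.5 cutoff on the positive-class probability
default_pred = demo_model.predict(X_demo)
pred_at_half = (p_positive > 0.5).astype(int)

# A stricter, hand-picked cutoff: only predict 1 when the model is quite sure
threshold = 0.7
strict_pred = (p_positive >= threshold).astype(int)
print('positives at 0.5:', default_pred.sum(), ' positives at 0.7:', strict_pred.sum())
```

Raising the threshold can only reduce the number of positive predictions, which usually trades recall for precision.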



Of course, as we mentioned before, the training accuracy doesn't tell us all that much. Let's see how the model does on the test set! (Don't worry, the code is pretty much the same.)

In [None]:
# Let's now make predictions on the test data and measure the accuracy. 
# It's pretty similar to what we did above
testY_pred = LR_Model.predict(testX)
acc = metrics.accuracy_score(testY, testY_pred,normalize=True)
print('Logistic Regression:\t-- test acc %.3f' % acc)

Logistic Regression:	-- test acc 0.825


This is pretty good! Our test accuracy is pretty close to our training accuracy. This is an indication that our model has generalized well. We have a good ML model on our hands! Or do we?

(Discussion about biased data sets, and always predicting the positive class when 99% of the data is from the positive class) 
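
A tiny sketch of that scenario, with made-up labels (the `y_true` and `y_always_pos` arrays are illustrative, not Titanic data): a "model" that ignores its input and always predicts the positive class still scores 99% accuracy when 99% of the labels are positive.

```python
import numpy as np
from sklearn import metrics

# Made-up labels: 99 positives and a single negative
y_true = np.array([1] * 99 + [0] * 1)

# A "model" that always predicts the positive class
y_always_pos = np.ones_like(y_true)

baseline_acc = metrics.accuracy_score(y_true, y_always_pos)

# Recall on the negative class: how many of the true 0s did we catch?
negative_recall = metrics.recall_score(y_true, y_always_pos, pos_label=0)

print('accuracy %.2f, recall on class 0 %.2f' % (baseline_acc, negative_recall))
```

High accuracy here says nothing about whether the model can recognize the rare class, which is why we look at other metrics too.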

To verify that our model is actually doing something, it's helpful to think about precision and recall.
Precision = true positives / (true positives + false positives)
Recall = true positives / (true positives + false negatives)
(Discussion about why these are useful metrics)
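
The two formulas above can be checked by hand on a small example. The labels below are made up for illustration. The sketch counts true positives, false positives, and false negatives directly and compares the result with scikit-learn's `precision_score` and `recall_score`.

```python
import numpy as np
from sklearn import metrics

# Made-up labels and predictions for 10 samples
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # predicted 1, truly 1
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # predicted 1, truly 0
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # predicted 0, truly 1

precision_by_hand = tp / (tp + fp)
recall_by_hand = tp / (tp + fn)

# scikit-learn computes the same quantities
precision_sklearn = metrics.precision_score(y_true, y_pred)
recall_sklearn = metrics.recall_score(y_true, y_pred)
print('precision %.2f, recall %.2f' % (precision_by_hand, recall_by_hand))
```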

In [None]:
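# Precision and recall on the test set (Survived = 1 is the positive class)
prec = metrics.precision_score(testY, testY_pred)
print('Precision is:\t precision %.3f' % prec)
rec = metrics.recall_score(testY, testY_pred)
print('Recall is:\t recall %.3f' % rec)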

Question: How do you determine which metric is best for evaluating your model?

What is K-fold cross validation?

In [None]:
from IPython.display import Image
img_url = "https://ethen8181.github.io/machine-learning/model_selection/img/kfolds.png"

Image(url=img_url)

In [None]:
#K Fold CV

from sklearn.model_selection import KFold
 
titanic_data_X = titanic_data.drop('Survived', axis=1)
titanic_data_Y = titanic_data['Survived']
 
kf = KFold(n_splits=5,shuffle=True, random_state=33)
 
train_accs = []
test_accs = []
 
 
for train_index, test_index in kf.split(titanic_data_X):
  trainX_k, testX_k = titanic_data_X.iloc[train_index], titanic_data_X.iloc[test_index]
  trainY_k, testY_k = titanic_data_Y.iloc[train_index], titanic_data_Y.iloc[test_index]

  LR_Model = LogisticRegression(max_iter = 300)
  LR_Model.fit(trainX_k,trainY_k)

  trainY_k_pred = LR_Model.predict(trainX_k)
  train_acc = metrics.accuracy_score(trainY_k, trainY_k_pred, normalize=True)
  train_accs.append(train_acc)
 
  testY_k_pred = LR_Model.predict(testX_k)
  test_acc = metrics.accuracy_score(testY_k, testY_k_pred, normalize=True)
  test_accs.append(test_acc)
 
mean_train_acc = sum(train_accs)/len(train_accs)
mean_test_acc = sum(test_accs)/len(test_accs)
 
print("Mean training accuracy:\t",mean_train_acc)
print("Mean testing accuracy:\t",mean_test_acc)

Mean training accuracy:	 0.8028650895241707
Mean testing accuracy:	 0.7997460801117249
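
For comparison, scikit-learn can run the same kind of loop for us. The sketch below uses `cross_validate` on synthetic data so it runs on its own. The `X_cv` and `y_cv` names and the generated data are assumptions for illustration, and with the real data you would pass `titanic_data_X` and `titanic_data_Y` instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for titanic_data_X / titanic_data_Y
X_cv, y_cv = make_classification(n_samples=300, n_features=8, random_state=0)

# Same splitter settings as the manual loop
cv_splitter = KFold(n_splits=5, shuffle=True, random_state=33)

# Fits a fresh copy of the model on each fold and scores it
cv_results = cross_validate(
    LogisticRegression(max_iter=300),
    X_cv, y_cv,
    cv=cv_splitter,
    scoring='accuracy',
    return_train_score=True,
)

cv_mean_train = cv_results['train_score'].mean()
cv_mean_test = cv_results['test_score'].mean()
print('Mean training accuracy:\t', cv_mean_train)
print('Mean testing accuracy:\t', cv_mean_test)
```

Writing the loop by hand, as we did, is still worthwhile once: it shows exactly which rows are used for fitting and which for scoring in each fold.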


That's all for today! Thanks for coming out! Please fill out our feedback form here: https://tinyurl.com/applymltwofeedback