# Logistic Regression

## Exercises
In these exercises, we'll continue working with the Titanic dataset and building logistic regression models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model.

For all of the models you create, choose a threshold that optimizes for accuracy.
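
Choosing a threshold for accuracy can be sketched as a scan over candidate cutoffs applied to the predicted probability of the positive class. The example below is a minimal illustration on synthetic data (the arrays `X` and `y` and the grid of thresholds are assumptions, not the Titanic data); in the exercise the scan would run on train or validate, never on test.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: NOT the titanic dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

model = LogisticRegression().fit(X, y)
proba_pos = model.predict_proba(X)[:, 1]  # P(class == 1)

# Score every candidate threshold by accuracy and keep the best one
thresholds = np.linspace(0.05, 0.95, 19)
accuracies = np.array(
    [((proba_pos >= t).astype(int) == y).mean() for t in thresholds]
)
best_threshold = thresholds[accuracies.argmax()]
best_accuracy = accuracies.max()

print(f'best threshold: {best_threshold:.2f}, accuracy: {best_accuracy:.3f}')
```

`predict` uses a cutoff of 0.5, so the scan only matters when another cutoff scores higher on the data used for selection.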

Create a new notebook, logistic_regression, and use it to answer the following questions:

## Imports

In [1]:
import acquire
import prepare

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

from pydataset import data

#### Acquire, Prep, Split and find the baseline

In [2]:
# Get your titanic data
titanic2_df = acquire.get_titanic_data()

# Clean the new dataset using the new function called prep_titanic
titanic2_df = prepare.prep_titanic2(titanic2_df)

# Prepare and Split my data
train2, validate2, test2 = prepare.split_function(titanic2_df, 'survived')
print(f'train: {train2.head(2)}\n')
print(f'validate: {validate2.head(2)}\n')
print(f'test: {test2.head(2)}\n')

# ---------------------------------------------------------------------------------
# prep_titanic2 keeps only 'survived', 'pclass', 'age', and 'fare' (see the output below).
# Drop the 'survived' column from the features because it is our TARGET.
# Do this for train, validate and test.
X_train2 = train2.drop(columns = ['survived'])
X_validate2 = validate2.drop(columns = ['survived'])
X_test2 = test2.drop(columns = ['survived'])
print(f'X_train: {X_train2.head(2)}\n')
# ---------------------------------------------------------------------------------
# Set a target
target = 'survived'

# 'y' variable are series
y_train2 = train2[target]
y_validate2 = validate2[target]
y_test2 = test2[target]

# Check the shape
print(f'X_train: {X_train2.shape}, X_validate: {X_validate2.shape}, X_test: {X_test2.shape}')

# ---------------------------------------------------------------------------------
# calculate baseline accuracy
def establish_baseline(y_train2):
    # establish the value we will predict for all observations
    baseline_prediction2 = y_train2.mode()

    # create a series of predictions with that value,
    # the same length as our training set
    y_train_pred2 = pd.Series((baseline_prediction2[0]), range(len(y_train2)))

    # compute accuracy of baseline from the confusion matrix
    cm2 = confusion_matrix(y_train2, y_train_pred2)
    tn, fp, fn, tp = cm2.ravel()

    accuracy = (tp+tn)/(tn+fp+fn+tp)
    return accuracy

    
print(f'Baseline accuracy: {establish_baseline(y_train2)}')

csv file found and loaded
train:      survived  pclass   age      fare
455         1       3  29.0    7.8958
380         1       1  42.0  227.5250

validate:      survived  pclass   age     fare
176         0       3   0.0  25.4667
372         0       3  19.0   8.0500

test:      survived  pclass   age     fare
561         0       3  40.0   7.8958
641         1       1  24.0  69.3000

X_train:      pclass   age      fare
455       3  29.0    7.8958
380       1  42.0  227.5250

X_train: (534, 3), X_validate: (178, 3), X_test: (179, 3)
Baseline accuracy: 0.6161048689138576
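
The baseline is simply the share of the most common class, so it can also be computed without a confusion matrix. The sketch below assumes a stand-in target Series built from the class counts reported for the train split (329 non-survivors, 205 survivors); the Series itself is hypothetical, not the real `y_train2`.

```python
import pandas as pd

# Hypothetical target with the train split's class counts (329 zeros, 205 ones)
y = pd.Series([0] * 329 + [1] * 205, name='survived')

# Predict the most common class for everyone and measure how often that is right
most_common = y.mode()[0]
baseline_accuracy = (y == most_common).mean()

print(f'Most common class: {most_common}')
print(f'Baseline accuracy: {baseline_accuracy}')
```

The result is 329/534, roughly 0.616, which agrees with the value printed above.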


### 1. Create a model that includes only age, fare, and pclass. Does this model perform better than your baseline?

#### a. Create it

In [3]:
logit = LogisticRegression()
logit

#### b. Fit it

In [4]:
logit.fit(X_train2, y_train2)

#### c. Use it

In [5]:
logit.score(X_train2, y_train2)

0.6835205992509363

>This model performs better than the baseline.

#### d. Take a look at predictions

In [6]:
logit.predict(X_train2)

array([0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,

#### e. View raw probabilities (output from the model)

In [7]:
logit.predict_proba(X_train2).round(2)[:5]

array([[0.77, 0.23],
       [0.28, 0.72],
       [0.46, 0.54],
       [0.28, 0.72],
       [0.76, 0.24]])
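
Each row of `predict_proba` holds one probability per class, ordered as in the model's `classes_` attribute, and `predict` returns the class with the larger probability. The sketch below checks that relationship on synthetic data (the arrays `X` and `y` are assumptions for illustration only).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: NOT the titanic dataset)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] - X[:, 1] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X)

# Column order follows model.classes_, so column 1 is P(class == 1) here
print(model.classes_)

# Picking the most probable class per row reproduces model.predict
manual_pred = model.classes_[proba.argmax(axis=1)]
print((manual_pred == model.predict(X)).all())
```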

#### f. Classification report

In [8]:
print(classification_report(y_train2, logit.predict(X_train2)))

              precision    recall  f1-score   support

           0       0.70      0.85      0.77       329
           1       0.63      0.42      0.50       205

    accuracy                           0.68       534
   macro avg       0.67      0.63      0.64       534
weighted avg       0.67      0.68      0.67       534



#### g. Coefficients

In [9]:
logit.coef_

array([[-0.85407869, -0.01409236,  0.00300774]])

#### h. Columns

In [10]:
X_train2.columns

Index(['pclass', 'age', 'fare'], dtype='object')
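
The coefficients are easier to read when paired with their column names and converted to odds ratios. The sketch below copies the coefficient values and column names printed above into a Series (the name `coefs` is chosen here for illustration) and exponentiates them.

```python
import numpy as np
import pandas as pd

# Values copied from logit.coef_ and X_train2.columns shown above
coefs = pd.Series(
    [-0.85407869, -0.01409236, 0.00300774],
    index=['pclass', 'age', 'fare'],
)

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in that feature, holding the others fixed
odds_ratios = np.exp(coefs)
print(odds_ratios.round(3))
```

An odds ratio below 1 lowers the odds of survival. For `pclass` the ratio is roughly 0.43, so moving one class number higher (for example from 1st to 2nd) is associated with odds of survival a little under half as large under this model.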

### 2. Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.

In [11]:
# Get your titanic data
titanic3_df = acquire.get_titanic_data()

# Clean the dataset using the prep_titanic3 function
titanic3_df = prepare.prep_titanic3(titanic3_df)

# Prepare and Split my data
train, validate, test = prepare.split_function(titanic3_df, 'survived')
print(f'train: {train.head(2)}\n')
print(f'validate: {validate.head(2)}\n')
print(f'test: {test.head(2)}\n')

# ---------------------------------------------------------------------------------
# prep_titanic3 keeps 'survived', 'pclass', 'age', 'fare', and the encoded 'sex_male' (see the output below).
# Drop the 'survived' column from the features because it is our TARGET.
# Do this for train, validate and test.
X_train = train.drop(columns = ['survived'])
X_validate = validate.drop(columns = ['survived'])
X_test = test.drop(columns = ['survived'])
print(f'X_train: {X_train.head(2)}\n')
# ---------------------------------------------------------------------------------
# Set a target
target = 'survived'

# 'y' variable are series
y_train = train[target]
y_validate = validate[target]
y_test = test[target]

# Check the shape
print(f'X_train: {X_train.shape}, X_validate: {X_validate.shape}, X_test: {X_test.shape}')

# ---------------------------------------------------------------------------------
# calculate baseline accuracy
def establish_baseline(y_train):
    # establish the value we will predict for all observations
    baseline_prediction = y_train.mode()

    # create a series of predictions with that value,
    # the same length as our training set
    y_train_pred = pd.Series((baseline_prediction[0]), range(len(y_train)))

    # compute accuracy of baseline from the confusion matrix
    cm = confusion_matrix(y_train, y_train_pred)
    tn, fp, fn, tp = cm.ravel()

    accuracy = (tp+tn)/(tn+fp+fn+tp)
    return accuracy

    
print(f'Baseline accuracy: {establish_baseline(y_train)}')

csv file found and loaded
train:      survived  pclass   age      fare  sex_male
455         1       3  29.0    7.8958         1
380         1       1  42.0  227.5250         0

validate:      survived  pclass   age     fare  sex_male
176         0       3   0.0  25.4667         1
372         0       3  19.0   8.0500         1

test:      survived  pclass   age     fare  sex_male
561         0       3  40.0   7.8958         1
641         1       1  24.0  69.3000         0

X_train:      pclass   age      fare  sex_male
455       3  29.0    7.8958         1
380       1  42.0  227.5250         0

X_train: (534, 4), X_validate: (178, 4), X_test: (179, 4)
Baseline accuracy: 0.6161048689138576
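
The `sex_male` column in the output above is a dummy variable. The body of `prepare.prep_titanic3` is not shown, so the following is only a sketch of one common way to produce such a column with `pd.get_dummies`; the small DataFrame is made up for illustration.

```python
import pandas as pd

# Tiny made-up frame (assumption: stands in for the raw titanic data)
df = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male'],
    'pclass': [3, 1, 2, 3],
})

# drop_first=True drops the 'female' column, leaving a single 0/1 indicator
dummies = pd.get_dummies(df[['sex']], drop_first=True).astype(int)

# Replace the text column with its encoded version
encoded = pd.concat([df.drop(columns='sex'), dummies], axis=1)
print(encoded)
```

Keeping one indicator instead of two avoids a redundant column, since `sex_female` would always equal `1 - sex_male`.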


In [12]:
logit3 = LogisticRegression()
logit3.fit(X_train, y_train)
logit3.score(X_train, y_train)

0.7790262172284644

In [13]:
print(classification_report(y_train, logit3.predict(X_train)))

              precision    recall  f1-score   support

           0       0.81      0.83      0.82       329
           1       0.72      0.69      0.71       205

    accuracy                           0.78       534
   macro avg       0.77      0.76      0.76       534
weighted avg       0.78      0.78      0.78       534



In [14]:
X_train.columns

Index(['pclass', 'age', 'fare', 'sex_male'], dtype='object')

### 3. Try out other combinations of features and models.

In [15]:
# ONLY THESE COLUMNS 'pclass', 'age', 'fare', 'sex_male', 'embarked_Q', 'embarked_S'

# Get your titanic data
titanic4_df = acquire.get_titanic_data()

# Clean the new dataset using the new function called prep_titanic
titanic4_df = prepare.prep_titanic4(titanic4_df)

# Prepare and Split my data
train, validate, test = prepare.split_function(titanic4_df, 'survived')
print(f'train: {train.head(2)}\n')

# ---------------------------------------------------------------------------------
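# Remove the 'survived' column because it is our TARGET.
# NOTE: the features are stored in X_train4, but every later cell uses X_train,
# which still holds the four columns from question 2 (see the printed shapes below).
X_train4 = train.drop(columns = ['survived'])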
X_validate = validate.drop(columns = ['survived'])
X_test = test.drop(columns = ['survived'])
print(f'X_train: {X_train.head(2)}\n')
# ---------------------------------------------------------------------------------
# Set a target
target = 'survived'

# 'y' variable are series
y_train = train[target]
y_validate = validate[target]
y_test = test[target]

# Check the shape
print(f'X_train: {X_train.shape}, X_validate: {X_validate.shape}, X_test: {X_test.shape}')

# ---------------------------------------------------------------------------------
# calculate baseline accuracy
def establish_baseline(y_train):
    # establish the value we will predict for all observations
    baseline_prediction = y_train.mode()

    # create a series of predictions with that value,
    # the same length as our training set
    y_train_pred = pd.Series((baseline_prediction[0]), range(len(y_train)))

    # compute accuracy of baseline from the confusion matrix
    cm = confusion_matrix(y_train, y_train_pred)
    tn, fp, fn, tp = cm.ravel()

    accuracy = (tp+tn)/(tn+fp+fn+tp)
    return accuracy

    
print(f'Baseline accuracy: {establish_baseline(y_train)}')

csv file found and loaded
train:      survived  pclass   age      fare  sex_male  embarked_Q  embarked_S
455         1       3  29.0    7.8958         1           0           0
380         1       1  42.0  227.5250         0           0           0

X_train:      pclass   age      fare  sex_male
455       3  29.0    7.8958         1
380       1  42.0  227.5250         0

X_train: (534, 4), X_validate: (178, 6), X_test: (179, 6)
Baseline accuracy: 0.6161048689138576


In [16]:
X_train.columns

Index(['pclass', 'age', 'fare', 'sex_male'], dtype='object')

In [17]:
logit4 = LogisticRegression()
logit4.fit(X_train, y_train)
logit4.score(X_train, y_train)

0.7790262172284644

In [18]:
print(classification_report(y_train, logit4.predict(X_train)))

              precision    recall  f1-score   support

           0       0.81      0.83      0.82       329
           1       0.72      0.69      0.71       205

    accuracy                           0.78       534
   macro avg       0.77      0.76      0.76       534
weighted avg       0.78      0.78      0.78       534



### 4. Use your best 3 models to predict and evaluate on your validate sample.

In [19]:
# Change hyperparameter C = 0.01
logit5 = LogisticRegression(C=0.01)
logit5

In [20]:
logit5.fit(X_train, y_train)

In [21]:
logit5.predict(X_train)

array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,

In [22]:
logit5.score(X_train, y_train)

0.6928838951310862

In [23]:
print(classification_report(y_train,logit5.predict(X_train)))

              precision    recall  f1-score   support

           0       0.68      0.95      0.79       329
           1       0.77      0.29      0.42       205

    accuracy                           0.69       534
   macro avg       0.72      0.62      0.60       534
weighted avg       0.71      0.69      0.65       534
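
The cells above score the models on train only. Comparing candidates on the validate sample can be sketched as follows; the data is synthetic and the split is made with `train_test_split` purely for illustration, so the numbers say nothing about the Titanic models. The values of `C` mirror the ones tried in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the titanic dataset)
rng = np.random.default_rng(2)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + X[:, 3] + rng.normal(size=600) > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=123, stratify=y
)

# Fit one model per regularization strength; record train and validate accuracy
results = {}
for c in [0.01, 0.08, 1.0]:
    model = LogisticRegression(C=c).fit(X_tr, y_tr)
    results[c] = (model.score(X_tr, y_tr), model.score(X_val, y_val))

for c, (train_acc, val_acc) in results.items():
    print(f'C={c}: train={train_acc:.3f}, validate={val_acc:.3f}')

# Pick the candidate with the highest validate accuracy
best_c = max(results, key=lambda c: results[c][1])
print(f'best C on validate: {best_c}')
```

A large gap between the train and validate columns would point to overfitting; the model chosen here is the only one that should then be scored on test.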



### 5. Choose your best model from the validation performance, and evaluate it on the test dataset. How do the performance metrics compare to validate? To train?

In [24]:
# ONLY THESE COLUMNS 'pclass', 'age', 'fare'
logit.score(X_train2, y_train2)

0.6835205992509363

In [25]:
# ONLY THESE COLUMNS 'pclass', 'age', 'fare', 'sex_male'
logit3.score(X_train, y_train)

0.7790262172284644

In [26]:
# INTENDED COLUMNS 'pclass', 'age', 'fare', 'sex_male', 'embarked_Q', 'embarked_S'
# X_train still holds only the first four of these, so this score repeats logit3
logit4.score(X_train, y_train)

0.7790262172284644

In [27]:
# Change hyperparameter C = 0.01
logit5.score(X_train, y_train)

0.6928838951310862

In [28]:
# Change hyperparameter C = 0.08
logit6 = LogisticRegression(C=0.08)
logit6.fit(X_train, y_train)
logit6.score(X_train, y_train)

0.7808988764044944

### Bonus 1. How do different strategies for handling the missing values in the age column affect model performance?
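
One way to approach this question is to fit the same model after several imputation strategies and compare validate accuracy. The sketch below does that on synthetic data with artificially removed ages; the data, the 20% missing rate, and the three strategies are all assumptions for illustration. The `0.0` age in the validate preview earlier may indicate that missing ages were filled with zero, though `prepare` is not shown, so that is only a guess.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the titanic dataset)
rng = np.random.default_rng(3)
n = 500
age = rng.uniform(1, 70, size=n)
fare = rng.exponential(30, size=n)
y = (fare / 30 - age / 40 + rng.normal(size=n) > 0).astype(int)

# Knock out about 20% of the ages to imitate missing values
age_missing = age.copy()
age_missing[rng.random(n) < 0.2] = np.nan
X = pd.DataFrame({'age': age_missing, 'fare': fare})

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=123, stratify=y
)

strategies = {
    'mean': SimpleImputer(strategy='mean'),
    'median': SimpleImputer(strategy='median'),
    'zero': SimpleImputer(strategy='constant', fill_value=0),
}

scores = {}
for name, imputer in strategies.items():
    # Learn the fill value on train only, then apply it to validate
    X_tr_imp = imputer.fit_transform(X_tr)
    X_val_imp = imputer.transform(X_val)
    model = LogisticRegression(max_iter=1000).fit(X_tr_imp, y_tr)
    scores[name] = model.score(X_val_imp, y_val)

print(scores)
```

Fitting the imputer on train and only transforming validate keeps information from the validate sample out of the fill value.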