# Logistic Regression Exercises

In these exercises, we'll continue working with the titanic dataset and building logistic regression models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model.

For all of the models you create, choose a threshold that optimizes for accuracy.
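
By default, `LogisticRegression.predict` uses a 0.5 probability cutoff. The following is a minimal sketch of one way to search for an accuracy-maximizing threshold instead. It assumes synthetic data from `make_classification` in place of the titanic features, and `best_threshold` is a hypothetical helper name. The threshold is chosen on train and then checked on validate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the titanic features (assumption, not the real data).
X, y = make_classification(n_samples=500, n_features=4, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = LogisticRegression(random_state=42).fit(X_train, y_train)


def best_threshold(probs, y_true, candidates):
    """Return (threshold, accuracy) for the candidate with the highest accuracy."""
    accuracies = [np.mean((probs >= t).astype(int) == y_true) for t in candidates]
    best = int(np.argmax(accuracies))
    return float(candidates[best]), float(accuracies[best])


# Column 1 of predict_proba is the probability of the positive class.
train_probs = model.predict_proba(X_train)[:, 1]
candidates = np.linspace(0.05, 0.95, 19)
threshold, train_acc = best_threshold(train_probs, y_train, candidates)

# Apply the chosen threshold to validate to see whether it generalizes.
val_probs = model.predict_proba(X_val)[:, 1]
val_acc = float(np.mean((val_probs >= threshold).astype(int) == y_val))
print(f"threshold={threshold:.2f} train_acc={train_acc:.3f} val_acc={val_acc:.3f}")
```

If the best threshold on train gives a noticeably lower accuracy on validate, the default 0.5 cutoff may be the safer choice.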

Create a new notebook, logistic_regression, and use it to answer the following questions:

In [201]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import acquire
import prepare

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')

In [202]:
seed = 42

## Model 1 

### Create a model that includes only age, fare, and pclass. Does this model perform better than your baseline?



In [250]:
titanic = acquire.get_titanic_data()

In [251]:
titanic = titanic.drop(columns=['Unnamed: 0', 'passenger_id', 'sex', 'sibsp', 'parch',
                      'embarked', 'class', 'deck', 'embark_town', 'alone'])

In [252]:
titanic = titanic.dropna()

In [253]:
titanic.head()

   survived  pclass   age     fare
0         0       3  22.0     7.25
1         1       1  38.0  71.2833
2         1       3  26.0    7.925
3         1       1  35.0     53.1
4         0       3  35.0     8.05


In [254]:
train, val, test = prepare.split_train_test(titanic, 'survived')

train.shape, val.shape, test.shape

((428, 4), (171, 4), (115, 4))
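
The `prepare.split_train_test` helper is not shown in this notebook. As a hypothetical sketch only, a stratified 60/24/16 split built from two `train_test_split` calls produces the same row counts as the shapes above on a 714-row frame. The function name `split_train_val_test`, its parameters, and the synthetic frame are all assumptions, not the author's code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_train_val_test(df, target, seed=42):
    """Hypothetical stand-in for prepare.split_train_test.

    Holds out 40% of the rows, then splits that holdout 60/40 into
    validate and test, stratifying on the target both times.
    """
    train, holdout = train_test_split(
        df, test_size=0.4, random_state=seed, stratify=df[target]
    )
    val, test = train_test_split(
        holdout, test_size=0.4, random_state=seed, stratify=holdout[target]
    )
    return train, val, test


# Synthetic frame with 714 rows, matching the total of the three shapes above.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "survived": rng.integers(0, 2, size=714),
    "pclass": rng.integers(1, 4, size=714),
    "age": rng.uniform(1, 80, size=714),
    "fare": rng.uniform(5, 100, size=714),
})

train, val, test = split_train_val_test(demo, "survived")
print(train.shape, val.shape, test.shape)
```

Stratifying on the target keeps the survival rate roughly equal across the three splits, which makes accuracy comparisons between them more meaningful.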

In [255]:
X_train1 = train.drop(columns='survived')
y_train1 = train['survived']

X_val1 = val.drop(columns='survived')
y_val1 = val['survived']

X_test1 = test.drop(columns='survived')
y_test1 = test['survived']

In [256]:
logit1 = LogisticRegression(random_state=seed)

In [257]:
logit1.fit(X_train1, y_train1)

In [258]:
logit1.score(X_train1, y_train1)

0.7126168224299065

In [259]:
logit1.score(X_val1, y_val1)

0.7076023391812866

In [260]:
y1_preds = logit1.predict(X_train1)

In [261]:
print(classification_report(y_train1, y1_preds))

              precision    recall  f1-score   support

           0       0.72      0.85      0.78       254
           1       0.70      0.51      0.59       174

    accuracy                           0.71       428
   macro avg       0.71      0.68      0.68       428
weighted avg       0.71      0.71      0.70       428
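
The question above asks for a comparison against a baseline. A minimal sketch follows, assuming the baseline is "always predict the most common class" and rebuilding the train target from the support column of the report (254 non-survivors, 174 survivors):

```python
import pandas as pd

# Rebuild the train target from the report's support counts (class 0: 254, class 1: 174).
y_train = pd.Series([0] * 254 + [1] * 174, name="survived")

# Baseline: always predict the most frequent class.
baseline_class = y_train.mode()[0]
baseline_accuracy = (y_train == baseline_class).mean()

# Train accuracy copied from logit1.score(X_train1, y_train1) above.
logit1_train_accuracy = 0.7126168224299065

print(f"baseline class: {baseline_class}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"improvement over baseline: {logit1_train_accuracy - baseline_accuracy:.4f}")
```

Under that assumption the baseline accuracy is 254/428, or about 0.59, so the model's train accuracy of about 0.71 beats it by roughly 12 percentage points.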



### Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.



In [262]:
titanic = acquire.get_titanic_data()

In [263]:
titanic = titanic.drop(columns=['Unnamed: 0', 'passenger_id', 'sibsp', 'parch',
                      'embarked', 'class', 'deck', 'embark_town', 'alone'])

In [264]:
titanic = titanic.dropna()

In [265]:
dummies = pd.get_dummies(titanic[['sex']], drop_first=True)


In [266]:
titanic = pd.concat([titanic, dummies], axis=1)

In [267]:
titanic = titanic.drop(columns='sex')

In [268]:
titanic.head()

   survived  pclass   age     fare  sex_male
0         0       3  22.0     7.25         1
1         1       1  38.0  71.2833         0
2         1       3  26.0    7.925         0
3         1       1  35.0     53.1         0
4         0       3  35.0     8.05         1
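
To make the encoding step concrete, here is a small sketch on a toy frame that mirrors the first five rows above. With `drop_first=True`, `pd.get_dummies` drops the alphabetically first category (`female`) and leaves a single `sex_male` column. Passing `dtype=int` is an addition here, so that the column holds 0/1 rather than the booleans that newer pandas versions return by default.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0],
    "sex": ["male", "female", "female", "female", "male"],
})

# One column per category minus the first ('female'), so sex_male == 1 means male.
dummies = pd.get_dummies(toy[["sex"]], drop_first=True, dtype=int)

# Attach the dummy column and drop the original string column.
encoded = pd.concat([toy, dummies], axis=1).drop(columns="sex")
print(encoded)
```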


In [269]:
train, val, test = prepare.split_train_test(titanic, 'survived')

train.shape, val.shape, test.shape

((428, 5), (171, 5), (115, 5))

In [270]:
X_train2 = train.drop(columns='survived')
y_train2 = train['survived']

X_val2 = val.drop(columns='survived')
y_val2 = val['survived']

X_test2 = test.drop(columns='survived')
y_test2 = test['survived']

In [271]:
logit2 = LogisticRegression(random_state=seed)

In [272]:
logit2.fit(X_train2, y_train2)

In [273]:
logit2.score(X_train2, y_train2)

0.8037383177570093

In [274]:
logit2.score(X_val2, y_val2)

0.8245614035087719

In [275]:
y2_preds = logit2.predict(X_train2)

In [276]:
print(classification_report(y_train2, y2_preds))

              precision    recall  f1-score   support

           0       0.82      0.86      0.84       254
           1       0.78      0.72      0.75       174

    accuracy                           0.80       428
   macro avg       0.80      0.79      0.79       428
weighted avg       0.80      0.80      0.80       428
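
Each model in this notebook repeats the same fit and score steps. As an optional sketch, a small helper can bundle them so that train and validate accuracy are always reported together. `fit_and_score` is a hypothetical name, and the data here is synthetic rather than the titanic splits.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def fit_and_score(X_train, y_train, X_val, y_val, seed=42):
    """Fit a logistic regression and return it with train/validate accuracy."""
    model = LogisticRegression(random_state=seed).fit(X_train, y_train)
    scores = {
        "train": model.score(X_train, y_train),
        "validate": model.score(X_val, y_val),
    }
    return model, scores


# Synthetic stand-in data (assumption): 4 numeric features, binary target.
X, y = make_classification(n_samples=600, n_features=4, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model, scores = fit_and_score(X_train, y_train, X_val, y_val)
print(scores)
```

Returning both scores from one call makes it harder to forget the validate check when a new feature set is tried.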



## Model 2

In [277]:
titanic = acquire.get_titanic_data()

In [278]:
titanic = prepare.prep_titanic(titanic)

In [279]:
titanic.head()

   survived  pclass  sibsp  parch     fare  alone  sex_male  embark_town_Queenstown  embark_town_Southampton
0         0       3      1      0     7.25      0         1                       0                        1
1         1       1      1      0  71.2833      0         0                       0                        0
2         1       3      0      0    7.925      1         0                       0                        1
3         1       1      1      0     53.1      0         0                       0                        1
4         0       3      0      0     8.05      1         1                       0                        1


In [280]:
train, val, test = prepare.split_train_test(titanic, 'survived')

train.shape, val.shape, test.shape

((534, 9), (214, 9), (143, 9))

In [281]:
X_train3 = train.drop(columns='survived')
y_train3 = train['survived']

X_val3 = val.drop(columns='survived')
y_val3 = val['survived']

X_test3 = test.drop(columns='survived')
y_test3 = test['survived']

In [282]:
logit3 = LogisticRegression(random_state=seed)

In [283]:
logit3.fit(X_train3, y_train3)

In [155]:
logit3.score(X_train3, y_train3)

0.799625468164794

In [284]:
logit3.score(X_val3, y_val3)

0.7429906542056075

In [285]:
y3_preds = logit3.predict(X_train3)

In [286]:
print(classification_report(y_train3, y3_preds))

              precision    recall  f1-score   support

           0       0.83      0.84      0.84       329
           1       0.74      0.73      0.74       205

    accuracy                           0.80       534
   macro avg       0.79      0.79      0.79       534
weighted avg       0.80      0.80      0.80       534
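
The reports in this notebook are computed on train predictions. A minimal sketch of producing the same metrics on validate, and of how recall relates to the confusion matrix, is shown here on synthetic data (an assumption; the titanic splits are not used):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 4 numeric features, binary target.
X, y = make_classification(n_samples=600, n_features=4, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = LogisticRegression(random_state=42).fit(X_train, y_train)
val_preds = model.predict(X_val)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_val, val_preds)
tn, fp, fn, tp = cm.ravel()

# Recall for class 1: share of actual positives that were predicted positive.
recall_1 = tp / (tp + fn)

# output_dict=True returns the report as a nested dict keyed by class label.
report = classification_report(y_val, val_preds, output_dict=True)
print(cm)
print(f"recall (class 1): {recall_1:.2f}")
```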



## Model 3

In [287]:
titanic = acquire.get_titanic_data()

In [288]:
titanic = prepare.prep_titanic(titanic)

In [289]:
titanic = titanic.drop(columns=['embark_town_Queenstown', 'embark_town_Southampton'])

In [290]:
titanic.head()

   survived  pclass  sibsp  parch     fare  alone  sex_male
0         0       3      1      0     7.25      0         1
1         1       1      1      0  71.2833      0         0
2         1       3      0      0    7.925      1         0
3         1       1      1      0     53.1      0         0
4         0       3      0      0     8.05      1         1


In [291]:
train, val, test = prepare.split_train_test(titanic, 'survived')

train.shape, val.shape, test.shape

((534, 7), (214, 7), (143, 7))

In [292]:
X_train4 = train.drop(columns='survived')
y_train4 = train['survived']

X_val4 = val.drop(columns='survived')
y_val4 = val['survived']

X_test4 = test.drop(columns='survived')
y_test4 = test['survived']

In [293]:
logit4 = LogisticRegression(random_state=seed)

In [294]:
logit4.fit(X_train4, y_train4)

In [295]:
logit4.score(X_train4, y_train4)

0.799625468164794

In [296]:
y4_preds = logit4.predict(X_train4)

In [297]:
print(classification_report(y_train4, y4_preds))

              precision    recall  f1-score   support

           0       0.83      0.84      0.84       329
           1       0.74      0.73      0.74       205

    accuracy                           0.80       534
   macro avg       0.79      0.79      0.79       534
weighted avg       0.80      0.80      0.80       534



## Model 4

In [298]:
titanic = acquire.get_titanic_data()

In [299]:
titanic = prepare.prep_titanic(titanic)

In [300]:
titanic = titanic.drop(columns=['sibsp', 'parch', 'alone'])

In [301]:
titanic.head()

   survived  pclass     fare  sex_male  embark_town_Queenstown  embark_town_Southampton
0         0       3     7.25         1                       0                        1
1         1       1  71.2833         0                       0                        0
2         1       3    7.925         0                       0                        1
3         1       1     53.1         0                       0                        1
4         0       3     8.05         1                       0                        1


In [302]:
train, val, test = prepare.split_train_test(titanic, 'survived')

train.shape, val.shape, test.shape

((534, 6), (214, 6), (143, 6))

In [303]:
X_train5 = train.drop(columns='survived')
y_train5 = train['survived']

X_val5 = val.drop(columns='survived')
y_val5 = val['survived']

X_test5 = test.drop(columns='survived')
y_test5 = test['survived']

In [304]:
logit5 = LogisticRegression(random_state=seed)

In [305]:
logit5.fit(X_train5, y_train5)

In [306]:
logit5.score(X_train5, y_train5)

0.799625468164794

In [307]:
y5_preds = logit5.predict(X_train5)

In [308]:
print(classification_report(y_train5, y5_preds))

              precision    recall  f1-score   support

           0       0.83      0.85      0.84       329
           1       0.75      0.71      0.73       205

    accuracy                           0.80       534
   macro avg       0.79      0.78      0.79       534
weighted avg       0.80      0.80      0.80       534



## Model 5

In [309]:
titanic = acquire.get_titanic_data()

In [310]:
titanic = prepare.prep_titanic(titanic)

In [311]:
titanic = titanic.drop(columns=['sibsp', 'parch', 'fare', 'alone', 
                                'embark_town_Queenstown', 'embark_town_Southampton'])

In [312]:
titanic.head()

   survived  pclass  sex_male
0         0       3         1
1         1       1         0
2         1       3         0
3         1       1         0
4         0       3         1


In [313]:
train, val, test = prepare.split_train_test(titanic, 'survived')

train.shape, val.shape, test.shape

((534, 3), (214, 3), (143, 3))

In [314]:
X_train6 = train.drop(columns='survived')
y_train6 = train['survived']

X_val6 = val.drop(columns='survived')
y_val6 = val['survived']

X_test6 = test.drop(columns='survived')
y_test6 = test['survived']

In [315]:
logit6 = LogisticRegression(random_state=seed)

In [316]:
logit6.fit(X_train6, y_train6)

In [317]:
logit6.score(X_train6, y_train6)

0.799625468164794

In [318]:
y6_preds = logit6.predict(X_train6)

In [319]:
print(classification_report(y_train6, y6_preds))

              precision    recall  f1-score   support

           0       0.82      0.86      0.84       329
           1       0.76      0.70      0.73       205

    accuracy                           0.80       534
   macro avg       0.79      0.78      0.79       534
weighted avg       0.80      0.80      0.80       534



## Use your best 3 models to predict and evaluate on your validate sample.



Model 1

In [326]:
logit1.score(X_train1, y_train1)

0.7126168224299065

In [320]:
logit1.score(X_val1, y_val1)

0.7076023391812866

Model 2

In [327]:
logit2.score(X_train2, y_train2)

0.8037383177570093

In [321]:
logit2.score(X_val2, y_val2)

0.8245614035087719

Model 3

In [328]:
logit3.score(X_train3, y_train3)

0.799625468164794

In [322]:
logit3.score(X_val3, y_val3)

0.7429906542056075

Model 4

In [329]:
logit4.score(X_train4, y_train4)

0.799625468164794

In [323]:
logit4.score(X_val4, y_val4)

0.7383177570093458

Model 5

In [330]:
logit5.score(X_train5, y_train5)

0.799625468164794

In [324]:
logit5.score(X_val5, y_val5)

0.7616822429906542

Model 6

In [331]:
logit6.score(X_train6, y_train6)

0.799625468164794

In [325]:
logit6.score(X_val6, y_val6)

0.7616822429906542

By validate accuracy, my three best models are #2, #5, and #6 (`logit2`, `logit5`, and `logit6`).
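
As a sketch of how that ranking can be read off in one place, the accuracy values from the score cells above can be collected into a single frame and sorted by validate accuracy. The `gap` column is an addition for illustration.

```python
import pandas as pd

# Accuracy values copied from the score cells above.
scores = pd.DataFrame(
    {
        "train": [0.7126168224299065, 0.8037383177570093, 0.799625468164794,
                  0.799625468164794, 0.799625468164794, 0.799625468164794],
        "validate": [0.7076023391812866, 0.8245614035087719, 0.7429906542056075,
                     0.7383177570093458, 0.7616822429906542, 0.7616822429906542],
    },
    index=["logit1", "logit2", "logit3", "logit4", "logit5", "logit6"],
)

# A large positive gap suggests the model fits train better than it generalizes.
scores["gap"] = scores["train"] - scores["validate"]

ranked = scores.sort_values("validate", ascending=False)
print(ranked.round(4))
```

Note that `logit1` and `logit2` were fit on the 714 rows with a known age, while `logit3` through `logit6` used all 891 rows, so their validate sets differ (171 vs. 214 rows) and the accuracies are not strictly like-for-like.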

## Choose your best model from the validation performance, and evaluate it on the test dataset. How do the performance metrics compare to validate? To train?



In [333]:
logit5.score(X_test5, y_test5)

0.7762237762237763

I chose my fifth model (`logit5`, which uses pclass, fare, sex, and embark town) to evaluate on test. Its test accuracy (0.776) was slightly higher than its validate accuracy (0.762) and slightly lower than its train accuracy (0.800).

# Class Review