## Exercises

In these exercises, we'll continue working with the titanic dataset and building logistic regression models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model.

For all of the models you create, choose a threshold that optimizes for accuracy.

Create a new notebook, logistic_regression, and use it to answer the following questions:

1. Create a model that includes only age, fare, and pclass. Does this model perform better than your baseline?

2. Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.

3. Try out other combinations of features and models.

4. Use your best 3 models to predict and evaluate on your validate sample.

5. Choose your best model based on validation performance, and evaluate it on the test dataset. How do the performance metrics compare to validate? To train?
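The instruction to choose a threshold that optimizes for accuracy can be sketched in plain Python. This is a minimal illustration, not the notebook's method: the helper names `accuracy_at` and `best_threshold` and the toy probabilities are hypothetical. In practice the probabilities would come from something like `predict_proba(X_train)[:, 1]`, and the search would be run on train or validate, never on test. Note that scikit-learn's `predict` for a binary logistic regression effectively uses a threshold of 0.5.

```python
def accuracy_at(threshold, probabilities, labels):
    """Accuracy when predicting 1 for every probability >= threshold."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(pred == actual for pred, actual in zip(predictions, labels))
    return correct / len(labels)


def best_threshold(probabilities, labels, candidates):
    """Return (threshold, accuracy) for the first candidate with the highest accuracy."""
    scored = [(accuracy_at(t, probabilities, labels), t) for t in candidates]
    best_accuracy, best_t = max(scored, key=lambda pair: pair[0])
    return best_t, best_accuracy


# Hypothetical toy data, NOT taken from the titanic splits.
toy_probabilities = [0.10, 0.35, 0.40, 0.55, 0.65, 0.80]
toy_labels = [0, 0, 1, 0, 1, 1]
candidates = [round(0.05 * i, 2) for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

threshold, accuracy = best_threshold(toy_probabilities, toy_labels, candidates)
print(threshold, round(accuracy, 2))
```

Several thresholds can tie for the best accuracy; this sketch simply keeps the lowest one.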

In [1]:
# custom modules for data prep:
import acquire as a
import prepare as p
import model as m

# tabular manipulation
import numpy as np
import pandas as pd

# ML stuff:
from sklearn.metrics import (accuracy_score, confusion_matrix, classification_report,
                             recall_score, precision_score, f1_score)
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression #logistic not linear!
from sklearn.neighbors import KNeighborsClassifier #pick the classifier one

In [2]:
df=a.get_titanic_data()
df.head(3)

this file exists, reading csv


Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1


In [3]:
def clean_titanic(df):
    """
    students - write docstring
    """
    # drop unnecessary columns
    df = df.drop(columns=['embarked','deck', 'class'])

    # drop the rows with null values
    df = df.dropna()

    # cast pclass to object so it is treated as categorical
    df.pclass = df.pclass.astype(object)

    # fill missing embark_town with the mode
    # (a no-op here: dropna() above already removed every row with a null)
    df.embark_town = df.embark_town.fillna('Southampton')
    
    return df


In [4]:
df=clean_titanic(df)
df.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,Southampton,1


In [5]:
train,val,test=p.splitting_data(df,'survived')

In [6]:
train_en,val_en,test_en=m.preprocess_titanic(train,val,test)

In [7]:
# separate the independent features from the target
X_train, y_train = train_en.drop(columns='survived'), train_en.survived
X_validate, y_validate = val_en.drop(columns='survived'), val_en.survived
X_test, y_test = test_en.drop(columns='survived'), test_en.survived

In [8]:
X_train.head(3)

Unnamed: 0,pclass,age,sibsp,parch,fare,alone,embark_town_Queenstown,embark_town_Southampton,sex_male
702,3,18.0,0,1,14.4542,0,0,0,0
199,2,24.0,0,0,13.0,1,0,1,0
108,3,38.0,0,0,7.8958,1,0,1,1


## baseline

In [9]:
y_train.value_counts()

survived
0    254
1    173
Name: count, dtype: int64

In [10]:
y_train.mode()

0    0
Name: survived, dtype: int64

In [11]:
# baseline accuracy
y_train.value_counts(normalize=True)[0].round(2)

0.59

In [12]:
254/427

0.594847775175644

In [13]:
# another way to compute the baseline
print(train['survived'].value_counts())
baseline_accuracy = round((train.survived == 0).mean(), 2)
baseline_accuracy

survived
0    254
1    173
Name: count, dtype: int64


0.59
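The baseline here is the accuracy of always predicting the most common class. As a worked check of that arithmetic, the sketch below rebuilds the label list from the class counts printed above (254 and 173) and uses only the standard library. The variable names are illustrative.

```python
from collections import Counter

# Class counts copied from the y_train.value_counts() output above.
y_train_labels = [0] * 254 + [1] * 173

counts = Counter(y_train_labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority class.
baseline = majority_count / len(y_train_labels)

print(majority_class, round(baseline, 2))
```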

> Q1. Create a model that includes only age, fare, and pclass. Does this model perform better than your baseline?

In [14]:
features = ['age','fare','pclass']
X_train[features].head()

Unnamed: 0,age,fare,pclass
702,18.0,14.4542,3
199,24.0,13.0,2
108,38.0,7.8958,3
872,33.0,5.0,1
827,1.0,37.0042,2


In [15]:
# create object
lr1 = LogisticRegression()
# fit it
lr1.fit(X_train[features], y_train)

In [16]:
train_acc1=lr1.score(X_train[features],y_train)

# compare this model with baseline
print(f'Train Accuracy: {train_acc1}')
print(f'Baseline Accuracy: {baseline_accuracy}')


Train Accuracy: 0.7353629976580797
Baseline Accuracy: 0.59


The model using only age, fare, and pclass reaches a train accuracy of about 0.74, which beats the baseline accuracy of 0.59.

> Q2) Include sex in your model as well.

Note that you'll need to encode or create a dummy variable of this feature before including it in a model.
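The internals of `m.preprocess_titanic` are not shown in this notebook, so the following is only a plain-Python sketch of what a single dummy column for sex means: one 0/1 column, `sex_male`, replaces the text column. The helper name `encode_sex` is hypothetical. With pandas, `pd.get_dummies(..., drop_first=True)` is a common way to get a similar result.

```python
def encode_sex(rows):
    """Replace the text `sex` key with a 0/1 `sex_male` key in each row."""
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != "sex"}
        new_row["sex_male"] = 1 if row["sex"] == "male" else 0
        encoded.append(new_row)
    return encoded


# First three passengers from df.head() above.
passengers = [
    {"age": 22.0, "fare": 7.25, "pclass": 3, "sex": "male"},
    {"age": 38.0, "fare": 71.2833, "pclass": 1, "sex": "female"},
    {"age": 26.0, "fare": 7.925, "pclass": 3, "sex": "female"},
]

encoded_passengers = encode_sex(passengers)
for row in encoded_passengers:
    print(row)
```

One column is enough for a two-level feature: a 0 in `sex_male` already implies female, so a second `sex_female` column would carry no extra information.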

In [17]:
features = ['age','fare','pclass', 'sex_male']
X_train[features].head()

Unnamed: 0,age,fare,pclass,sex_male
702,18.0,14.4542,3,0
199,24.0,13.0,2,0
108,38.0,7.8958,3,1
872,33.0,5.0,1,1
827,1.0,37.0042,2,1


In [18]:
lr2 = LogisticRegression()
lr2.fit(X_train[features], y_train)

In [19]:
train_acc2=lr2.score(X_train[features],y_train)

# compare this model with baseline
print(f'Train Accuracy: {train_acc2}')
print(f'Baseline Accuracy: {baseline_accuracy}')

Train Accuracy: 0.8032786885245902
Baseline Accuracy: 0.59


> Q3) Try out other combinations of features and models.

In [20]:
# Test model with all features

# create algorithm object
lr3 = LogisticRegression(C=1, random_state=42, intercept_scaling=1, solver='liblinear')

# fit model with all features
lr3.fit(X_train, y_train)

# compute accuracy
train_acc3 = lr3.score(X_train, y_train)

# compare this model with baseline
print(f'Train Accuracy: {train_acc3}')
print(f'Baseline Accuracy: {baseline_accuracy}')

Train Accuracy: 0.810304449648712
Baseline Accuracy: 0.59


In [21]:
# Try changing 'solver' to 'lbfgs'

# create algorithm object
lr4 = LogisticRegression(C=1, random_state=42, intercept_scaling=1, solver='lbfgs')

# fit model with all features
lr4.fit(X_train, y_train)

# compute accuracy
train_acc4 = lr4.score(X_train, y_train)

# compare this model with baseline
print(f'Train Accuracy: {train_acc4}')
print(f'Baseline Accuracy: {baseline_accuracy}')

Train Accuracy: 0.810304449648712
Baseline Accuracy: 0.59


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [22]:
# Try changing 'class_weight' to 'balanced'

# create algorithm object
lr5 = LogisticRegression(C=1, class_weight='balanced', random_state=42, intercept_scaling=1, solver='lbfgs')

# fit model with all features
lr5.fit(X_train, y_train)

# compute accuracy
train_acc5 = lr5.score(X_train, y_train)

# compare this model with baseline
print(f'Train Accuracy: {train_acc5}')
print(f'Baseline Accuracy: {baseline_accuracy}')

Train Accuracy: 0.810304449648712
Baseline Accuracy: 0.59


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [23]:
# Try changing C (inverse of regularization strength) from 1 to 0.1

# create algorithm object
lr6 = LogisticRegression(C=0.1, random_state=123, intercept_scaling=1, solver='lbfgs')

# fit model with all features
lr6.fit(X_train, y_train)

# compute accuracy
train_acc6 = lr6.score(X_train, y_train)

# compare this model with baseline
print(f'Train Accuracy: {train_acc6}')
print(f'Baseline Accuracy: {baseline_accuracy}')

Train Accuracy: 0.8056206088992974
Baseline Accuracy: 0.59


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


> Q4) Use your best 3 models to predict and evaluate on your validate sample.

In [27]:
print(f'train accuracy 1  :{train_acc1}')
print(f'train accuracy 2  :{train_acc2}')
print(f'train accuracy 3  :{train_acc3}')
print(f'train accuracy 4  :{train_acc4}')
print(f'train accuracy 5  :{train_acc5}')
print(f'train accuracy 6  :{train_acc6}')
print(f'baseline accuracy :{baseline_accuracy}')

train accuracy 1  :0.7353629976580797
train accuracy 2  :0.8032786885245902
train accuracy 3  :0.810304449648712
train accuracy 4  :0.810304449648712
train accuracy 5  :0.810304449648712
train accuracy 6  :0.8056206088992974
baseline accuracy :0.59


In [29]:
# select model 3
# use logit to make predictions for the X_validate observations
y_val_pred3 = lr3.predict(X_validate)
# compute accuracy
val_acc3 = lr3.score(X_validate, y_validate)
# create a list and add to a dataframe at the end comparing all the models. 
model3 = [3, train_acc3, val_acc3]


# select model 4
y_val_pred4 = lr4.predict(X_validate)
val_acc4 = lr4.score(X_validate, y_validate) 
model4 = [4, train_acc4, val_acc4]

# select model 5
y_val_pred5 = lr5.predict(X_validate)
val_acc5 = lr5.score(X_validate, y_validate) 
model5 = [5, train_acc5, val_acc5]

pd.DataFrame([model3, model4, model5], columns=['model', 'in-sample accuracy', 'out-of-sample accuracy'])

Unnamed: 0,model,in-sample accuracy,out-of-sample accuracy
0,3,0.810304,0.795775
1,4,0.810304,0.795775
2,5,0.810304,0.795775


> Q5) Choose your best model based on validation performance, and evaluate it on the test dataset. How do the performance metrics compare to validate? To train?

Let's choose model 3, although all three models have the same accuracy.

In [33]:
# Test Model 3

y_pred3 = lr3.predict(X_test)
y_pred_proba = lr3.predict_proba(X_test)
print("Model 3: solver = liblinear, C = 1")
print('Accuracy: {:.2f}'.format(lr3.score(X_test, y_test)))
print(confusion_matrix(y_test, y_pred3))
print(classification_report(y_test, y_pred3))

Model 3: solver = liblinear, C = 1
Accuracy: 0.76
[[67 18]
 [17 41]]
              precision    recall  f1-score   support

           0       0.80      0.79      0.79        85
           1       0.69      0.71      0.70        58

    accuracy                           0.76       143
   macro avg       0.75      0.75      0.75       143
weighted avg       0.76      0.76      0.76       143
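The numbers in the classification report can be reproduced by hand from the confusion matrix. The sketch below does this for the positive class (survived = 1), assuming scikit-learn's convention that rows are actual classes and columns are predicted classes.

```python
# Confusion matrix copied from the output above.
# Rows are actual classes, columns are predicted classes.
tn, fp = 67, 18  # actual 0: predicted 0, predicted 1
fn, tp = 17, 41  # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # of those predicted to survive, how many did
recall = tp / (tp + fn)     # of those who survived, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.2f}")
print(f"precision: {precision:.2f}")
print(f"recall:    {recall:.2f}")
print(f"f1-score:  {f1:.2f}")
```

Test accuracy comes out at about 0.76, compared with about 0.80 on validate and 0.81 on train, so the model does somewhat worse on unseen data than the earlier splits suggested.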

