# End-to-End ML (Part 3): Model Selection and Deployment


Goals:
- Use k-fold cross validation on the one-hot data
- Show model ensembling of the LR and RF models
- Do some MLOps / model analysis on the final ensembled model
- Show how to deploy this model to "production" using Streamlit (maybe on HF Spaces?)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix, accuracy_score, precision_score, recall_score

seed = 123
np.random.seed(seed)

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
def get_scores(y, yhat):
    """Print common classification metrics for true labels y and predicted labels yhat."""
    print('accuracy: ', round(accuracy_score(y, yhat), 4))
    print('precision: ', round(precision_score(y, yhat), 4))
    print('recall: ', round(recall_score(y, yhat), 4))
    print('f1: ', round(f1_score(y, yhat), 4))
    # AUC here is computed from hard 0/1 predictions, not predicted probabilities
    print('auc: ', round(roc_auc_score(y, yhat), 4))
    print('confusion matrix:\n', confusion_matrix(y, yhat))

In [5]:
file = Path().cwd().parent / 'resources' / 'titanic_cleaned.csv'
df = pd.read_csv(file)
df.head()

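| Unnamed: 0.1 | Unnamed: 0 | Survived | Sex | Age | SibSp | Parch | Fare | Age_missing | Letter_missing | Pclass_1 | ... | Prefix_4 | Prefix_5 | Prefix_6 | Letter_0 | Letter_1 | Letter_2 | Letter_3 | Letter_4 | Letter_5 | Letter_6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0.271174 | 1 | 0 | 0.76014 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 0.472229 | 1 | 0 | 0.888896 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 2 | 1 | 1 | 0.321438 | 0 | 0 | 0.765155 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 3 | 1 | 1 | 0.434531 | 1 | 0 | 0.872307 | 0 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 4 | 0 | 0 | 0.434531 | 0 | 0 | 0.766037 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |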


In [7]:
X = df.drop('Survived', axis=1).values
y = df['Survived'].values
X.shape, y.shape

((891, 33), (891,))
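
The `Unnamed: 0.1` and `Unnamed: 0` columns in the output above look like row indices that were written out by earlier `to_csv` calls and read back as ordinary columns (an assumption; how the file was produced is not shown here). Since `X` is built from every column except `Survived`, they would be counted among the 33 features. A minimal sketch of filtering them out, using a tiny made-up frame (`demo`, `X_demo`, `y_demo` are hypothetical names) rather than the real file:

```python
import pandas as pd

# Made-up stand-in for the first few columns of titanic_cleaned.csv
demo = pd.DataFrame({
    'Unnamed: 0.1': [0, 1, 2],
    'Unnamed: 0': [0, 1, 2],
    'Survived': [0, 1, 1],
    'Sex': [0, 1, 1],
    'Fare': [0.76, 0.89, 0.77],
})

# Collect leftover index columns by name, then drop them together with the label
leftover = [c for c in demo.columns if c.startswith('Unnamed')]
features = demo.drop(columns=leftover + ['Survived'])

X_demo = features.values
y_demo = demo['Survived'].values
print(leftover, X_demo.shape, y_demo.shape)
```

Writing the cleaned file with `to_csv(..., index=False)` in the earlier notebook would avoid the extra columns in the first place.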

## Part 3 Notes

- Changing the seed strongly affects the results, most likely because the dataset is so small. Cross validation could help here.
- There is a danger of overfitting to this particular test set, when what we really want is a model that generalizes well to unseen data. Each score should carry an error band of roughly 2%; once you do that, a lot of these models are basically equivalent.
- Which metric to use? It's often dangerous to optimize only one, since weird edge cases can slip through if you ignore the others.
- We're doing about as well as we can expect with this data. Even Kaggle [discussions](https://www.kaggle.com/code/carlmcbrideellis/titanic-leaderboard-a-score-0-8-is-great) consider 77-85% a good score here, so more effort may not be worth it.
- Think about the use case. What are you using this model for? How good does it have to be? What value does it provide? Don't just mindlessly fall into optimizing it. Real life isn't a Kaggle competition.
- Selecting the best model isn't about optimizing a metric, but about finding the best overall fit. Which one is "good enough", in the sense that it's accurate enough, fast enough, easy to implement and maintain, (where necessary) easy to interpret, etc.
- Possible improvements:
  - Tune the hyperparameters of the above models more.
  - Use cross validation for stable metric estimates.
  - Use other models.
  - Use more advanced resampling techniques like SMOTE/ADASYN.
  - Take the unlabeled "test" set from Kaggle, label it with your best model, and use that as new training data on top of what you've already got.
  - Try more advanced categorical encodings like learned embeddings.
  - Better yet, turn all your features into categorical features by thresholding them.
- **You need to use k-fold CV. Scores fluctuate far too much across seeds (over 5%).**
- **Thinking: Make this one about data cleaning. Do one after this about cross val, pipelines, and deployment.**
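
One way to put an error band on each score and look at several metrics at once is repeated stratified k-fold with `cross_validate`. The sketch below is only an illustration under stated assumptions: it runs on synthetic data from `make_classification` sized like the Titanic features (891 x 33), not on the real file, and the 5 x 3 fold/repeat setup and the names `X_demo`, `y_demo`, and `summary` are arbitrary choices made here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# Synthetic stand-in for the Titanic features (assumption: similar size and class balance)
X_demo, y_demo = make_classification(n_samples=891, n_features=33, n_informative=8,
                                     weights=[0.62, 0.38], random_state=123)

model = RandomForestClassifier(n_estimators=100, max_depth=6, min_samples_leaf=2,
                               class_weight='balanced', random_state=123)

# 5 folds repeated 3 times with different shuffles gives 15 scores per metric
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=123)
metrics = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
results = cross_validate(model, X_demo, y_demo, cv=cv, scoring=metrics)

# Report mean and standard deviation so models can be compared with an error band
summary = {}
for name in metrics:
    scores = results[f'test_{name}']
    summary[name] = (scores.mean(), scores.std())
    print(f'{name:>9}: {scores.mean():.3f} +/- {scores.std():.3f}')
```

If two models' bands overlap heavily, the difference between them is probably within the noise of the split.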

In [None]:
from sklearn.model_selection import KFold

# X (defined above) already holds the one-hot features
kf = KFold(n_splits=5)

print('Onehot\n')

print('Logistic Regression\n')
accs = []
for i_train, i_test in kf.split(X):
    model = LogisticRegressionCV(random_state=seed, class_weight='balanced')
    model.fit(X[i_train], y[i_train])
    acc_train = model.score(X[i_train], y[i_train])
    acc_test = model.score(X[i_test], y[i_test])
    accs.append(acc_test)
    print(acc_train, acc_test)

print()
print(f'avg acc: {sum(accs) / len(accs)}')
print()

print('Random Forest\n')
accs = []
for i_train, i_test in kf.split(X):
    model = RandomForestClassifier(n_estimators=100, random_state=seed, class_weight='balanced',
                                   max_depth=6, min_samples_leaf=2)
    model.fit(X[i_train], y[i_train])
    acc_train = model.score(X[i_train], y[i_train])
    acc_test = model.score(X[i_test], y[i_test])
    accs.append(acc_test)
    print(acc_train, acc_test)

print()
print(f'avg acc: {sum(accs) / len(accs)}')

In [None]:
from sklearn.ensemble import VotingClassifier

ensemble = VotingClassifier(estimators=[
    ('lr', LogisticRegressionCV(random_state=seed, class_weight='balanced')),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=seed, class_weight='balanced',
                                  max_depth=6, min_samples_leaf=2)),
], voting='hard')

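# Hold out a test split; test_size=0.2 and stratify=y are assumptions, the original split is not shown
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed, stratify=y)

ensemble.fit(X_train, y_train)
ensemble.score(X_test, y_test)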

In [None]:
# X (defined above) already holds the one-hot features
kf = KFold(n_splits=5)

# Factories so that every fold trains a fresh, unfitted model
def make_lr():
    return LogisticRegressionCV(random_state=seed, class_weight='balanced')

def make_rf():
    return RandomForestClassifier(n_estimators=100, random_state=seed, class_weight='balanced',
                                  max_depth=6, min_samples_leaf=2)

def make_ensemble():
    return VotingClassifier(estimators=[('lr', make_lr()), ('rf', make_rf())], voting='hard')

print('Onehot\n')

for name, make_model in [('Logistic Regression', make_lr),
                         ('Random Forest', make_rf),
                         ('Ensemble', make_ensemble)]:
    print(f'{name}\n')
    accs = []
    for i_train, i_test in kf.split(X):
        model = make_model()
        model.fit(X[i_train], y[i_train])
        acc_train = model.score(X[i_train], y[i_train])
        acc_test = model.score(X[i_test], y[i_test])
        accs.append(acc_test)
        print(acc_train, acc_test)

    print()
    print(f'avg acc: {sum(accs) / len(accs)}')
    print()
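
For the deployment goal, a common first step is to persist the fitted ensemble so that an app can load it without retraining. The sketch below uses `joblib` on synthetic data. The file name `ensemble.joblib`, the temporary directory, `max_iter=1000`, and the `*_demo` names are placeholders chosen here, and the Streamlit app that would consume the saved file is not shown.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in for the one-hot Titanic features (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=123)

ensemble_demo = VotingClassifier(estimators=[
    ('lr', LogisticRegressionCV(random_state=123, class_weight='balanced', max_iter=1000)),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=123, class_weight='balanced',
                                  max_depth=6, min_samples_leaf=2)),
], voting='hard')
ensemble_demo.fit(X_demo, y_demo)

# Save the fitted model to disk and load it back, as a serving app would at startup
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / 'ensemble.joblib'
    joblib.dump(ensemble_demo, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(ensemble_demo.predict(X_demo), restored.predict(X_demo))
print('restored model matches original:', same_predictions)
```

The loaded model expects the same feature columns, in the same order, as the ones it was trained on, so the preprocessing has to be shipped alongside it.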