<img src="https://teaching.bowyer.ai/sdsai/resources/0/img/IMPERIAL_logo_RGB_Blue_2024.svg" alt="Imperial Logo" width="500"/><br /><br />

Supervised Learning - Tutorial Exercise Solutions
==============
### SURG70098 - Surgical Data Science and AI
### Stuart Bowyer

# Setup

In [None]:
# Install and import
%pip install pandas
%pip install matplotlib
%pip install pandas-gbq
import pandas as pd
import matplotlib.pyplot as plt
import pandas_gbq

# @markdown Enter your Google Cloud Project ID:
project_id = 'mimic-test-476513'  # @param {type:"string"}

df_day1_vitalsign = pandas_gbq.read_gbq("""
  SELECT *
  FROM `physionet-data.mimiciv_3_1_derived.first_day_vitalsign`
  LEFT JOIN (
    SELECT
      subject_id,
      stay_id,
      gender,
      race,
      admission_age,
      dod IS NOT NULL AS mortality
    FROM
      `physionet-data.mimiciv_3_1_derived.icustay_detail`
  )
  USING(subject_id, stay_id)
  LEFT JOIN (
    SELECT
      stay_id,
      AVG(weight) as weight
    FROM
      `physionet-data.mimiciv_3_1_derived.weight_durations`
    GROUP BY
      stay_id
  )
  USING(stay_id)
  LEFT JOIN (
    SELECT
      stay_id,
      CAST(AVG(height) AS FLOAT64) AS height
    FROM
      `physionet-data.mimiciv_3_1_derived.height`
    GROUP BY
      stay_id
  )
  USING(stay_id)
  WHERE heart_rate_mean IS NOT NULL
  LIMIT 100000
""", project_id=project_id)

# Exercise 4.1


## Build the simple model on all data

In [None]:
# Import the linear regression function from sklearn
from sklearn.linear_model import LinearRegression

# Prepare our data
#  - first remove any NaN values, as linear regression cannot work with them (if not already cleaned)
#  - engineer a boolean male/female feature
data = df_day1_vitalsign.dropna(subset=['height', 'weight']).copy()
data['is_male'] = data['gender'] == 'M'
X = data[['height', 'admission_age', 'is_male']]
Y = data['weight']

# Create the model
model = LinearRegression()

# Train (fit) the model
model.fit(X,Y)

# Print model coefficients
print(f'Slope:     {model.coef_}')
print(f'Intercept: {model.intercept_}')

Slope:     [ 0.85093662 -0.30093496  1.16209297]
Intercept: -39.58135167336545


Here we can see that both height and being male have positive slopes (coefficients), so they increase the predicted weight of the individual. Admission age has a negative slope, so increasing age reduces the predicted weight.
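To make the coefficients concrete, the fitted model can be evaluated by hand as the intercept plus the weighted sum of the features. The sketch below plugs the printed values into that formula for a hypothetical patient. The patient values are made up for illustration, and the sketch assumes the coefficients follow the column order of `X` (`height`, `admission_age`, `is_male`) with height recorded in centimetres.

```python
# Coefficients and intercept copied from the printed output above,
# assumed to be in the column order of X: height, admission_age, is_male
coef = [0.85093662, -0.30093496, 1.16209297]
intercept = -39.58135167336545

# Hypothetical patient (illustrative values, not taken from the dataset)
patient = [170.0, 60.0, 1.0]  # height (assumed cm), age in years, is_male

# Linear regression prediction: intercept + sum(coefficient * feature)
predicted_weight = intercept + sum(c * x for c, x in zip(coef, patient))
print(f'Predicted weight:           {predicted_weight:.2f} kg')

# Effect of ten extra years of age, holding the other features fixed
print(f'Change per 10 years of age: {10 * coef[1]:.2f} kg')
```

Under these assumptions the prediction is roughly 88 kg, and each additional decade of age lowers it by about 3 kg.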

## Predict the Y values for the training data X values

In [None]:
# Make predictions on the entire range of X values
Y_pred = model.predict(X)

## Compute metrics

In [None]:
## Compute Metrics

from sklearn.metrics import mean_squared_error, root_mean_squared_error

mse = mean_squared_error(Y, Y_pred)
print(f'Mean Squared Error:      {mse} kg2')

rmse = root_mean_squared_error(Y, Y_pred)
print(f'Root Mean Squared Error: {rmse} kg')

Mean Squared Error:      491.3781288938699 kg2
Root Mean Squared Error: 22.167050523104557 kg


This makes the model very slightly better than the original, which used only height.
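For reference, both metrics are simple to compute by hand: the MSE is the mean of the squared errors, and the RMSE is its square root, which puts the error back in the units of the target (kg). The sketch below uses a few made-up weights rather than the MIMIC data.

```python
import math

# Toy values in kg (hypothetical, for illustration only)
y_true = [70.0, 80.0, 90.0, 100.0]
y_pred = [75.0, 78.0, 95.0, 90.0]

# Mean squared error: average of the squared differences
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
mse = sum(squared_errors) / len(squared_errors)

# Root mean squared error: square root of the MSE, back in kg
rmse = math.sqrt(mse)

print(f'Mean Squared Error:      {mse} kg^2')
print(f'Root Mean Squared Error: {rmse:.3f} kg')
```

The same relationship holds for the values printed above: the square root of 491.378 is approximately 22.167.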

## Cross validation

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, Y, cv=5, scoring='neg_root_mean_squared_error')

print(f'Cross-Validation Scores: {scores}')
print(f'Mean Score:              {scores.mean()}')

Cross-Validation Scores: [-26.49507213 -22.68719541 -21.56510978 -19.91907557 -19.71820154]
Mean Score:              -22.07693088699745


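Again, this makes the model slightly better than the original. Note that the `neg_root_mean_squared_error` scorer returns the negated RMSE, so values closer to zero are better.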
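As a rough illustration of what `cv=5` does for a regressor, the sketch below splits row indices into five contiguous, unshuffled test folds, which is how sklearn's `KFold` behaves by default as far as we are aware. The helper `kfold_indices` is a hypothetical name written for this example, not an sklearn function.

```python
def kfold_indices(n_samples, n_splits):
    """Split range(n_samples) into n_splits contiguous, unshuffled test folds."""
    base, extra = divmod(n_samples, n_splits)
    folds = []
    start = 0
    for i in range(n_splits):
        # The first `extra` folds get one additional sample
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

n_samples = 12
folds = kfold_indices(n_samples, 5)
for i, test_idx in enumerate(folds):
    # Every row not in the test fold is used for training
    train_idx = [j for j in range(n_samples) if j not in test_idx]
    print(f'Fold {i}: test={test_idx} train={train_idx}')
```

Because the folds are contiguous, any ordering in the source table carries over into the folds, which is one possible reason the first fold scores differently from the rest.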

# Exercise 4.2

## Build the simple model on all data

In [None]:
from sklearn.linear_model import LogisticRegression

# Prepare our data
data = df_day1_vitalsign.dropna(subset=['admission_age', 'heart_rate_mean', 'sbp_mean', 'glucose_max', 'mortality'])
X = data[['admission_age', 'heart_rate_mean', 'sbp_mean', 'glucose_max']]
Y = data['mortality']

# Create the model
model = LogisticRegression()

# Train (fit) the model
model.fit(X, Y)

## Predict the Y values

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Predict the classes
Y_pred = model.predict(X)

## Baseline metrics

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(Y, Y_pred)
precision = precision_score(Y, Y_pred)
recall = recall_score(Y, Y_pred)
f1 = f1_score(Y, Y_pred)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1 Score:  {f1:.2f}")

Accuracy:  0.64
Precision: 0.61
Recall:    0.45
F1 Score:  0.52


These are all slightly higher than for the simple model.
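The four metrics all derive from the counts in the confusion matrix. The sketch below computes them by hand from hypothetical counts, chosen only so that the rounded results resemble the values printed above; they are not the notebook's actual confusion matrix.

```python
# Hypothetical confusion-matrix counts (illustrative only)
tp, fp, fn, tn = 45, 29, 55, 105

# Accuracy: fraction of all predictions that are correct
accuracy = (tp + tn) / (tp + fp + fn + tn)
# Precision: fraction of predicted positives that are truly positive
precision = tp / (tp + fp)
# Recall: fraction of true positives that the model finds
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1 Score:  {f1:.2f}")
```

With recall well below precision, a model like this misses more than half of the positive cases even though most of its positive predictions are correct.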

## Cross validation
Here we use the `cross_validate` function, as it allows multiple scoring functions to be used, so we can compare against the metrics above. The output is a `dict`, so it takes a little formatting to print the scores.

In [None]:
from sklearn.model_selection import cross_validate

logreg_scores = cross_validate(model, X, Y, cv=5, scoring=['f1','accuracy','precision','recall','roc_auc'])

print(f'Cross-Validation Accuracy Scores: {logreg_scores["test_accuracy"]}')
print(f'Cross-Validation Accuracy Mean:   {logreg_scores["test_accuracy"].mean()}')

print(f'Cross-Validation Precision Scores: {logreg_scores["test_precision"]}')
print(f'Cross-Validation Precision Mean:   {logreg_scores["test_precision"].mean()}')

print(f'Cross-Validation Recall Scores: {logreg_scores["test_recall"]}')
print(f'Cross-Validation Recall Mean:   {logreg_scores["test_recall"].mean()}')

print(f'Cross-Validation F1 Scores: {logreg_scores["test_f1"]}')
print(f'Cross-Validation F1 Mean:   {logreg_scores["test_f1"].mean()}')

print(f'Cross-Validation ROC AUC Scores: {logreg_scores["test_roc_auc"]}')
print(f'Cross-Validation ROC AUC Mean:   {logreg_scores["test_roc_auc"].mean()}')

Cross-Validation Accuracy Scores: [0.67062468 0.6484254  0.59886422 0.63584711 0.63842975]
Cross-Validation Accuracy Mean:   0.6384382319937536
Cross-Validation Precision Scores: [0.61538462 0.62354892 0.54657293 0.62449799 0.61538462]
Cross-Validation Precision Mean:   0.6050778159534251
Cross-Validation Recall Scores: [0.61686747 0.45301205 0.3746988  0.37515078 0.41495778]
Cross-Validation Recall Mean:   0.44693737555771945
Cross-Validation F1 Scores: [0.61612515 0.5247732  0.44460329 0.46872645 0.49567723]
Cross-Validation F1 Mean:   0.5099810651249
Cross-Validation ROC AUC Scores: [0.71805814 0.68773849 0.60275247 0.65573938 0.68082157]
Cross-Validation ROC AUC Mean:   0.6690220117281035


These are similar to the scores obtained when we trained and tested on the same data.
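Since `cross_validate` returns a `dict` keyed by `test_<metric>`, the repeated print statements can be replaced by a small loop. The sketch below shows one way to do this; `summarise_scores` is a hypothetical helper name, and the per-fold values are copied from the output above rather than recomputed.

```python
def summarise_scores(scores, prefix='test_'):
    """Return {metric: mean score} for every key that starts with prefix."""
    summary = {}
    for key, values in scores.items():
        # Skip entries that are not test scores, such as timing information
        if key.startswith(prefix):
            summary[key[len(prefix):]] = sum(values) / len(values)
    return summary

# Per-fold values copied from the logistic regression output above
example_scores = {
    'test_accuracy': [0.67062468, 0.6484254, 0.59886422, 0.63584711, 0.63842975],
    'test_recall': [0.61686747, 0.45301205, 0.3746988, 0.37515078, 0.41495778],
}

summary = summarise_scores(example_scores)
for metric, mean in summary.items():
    print(f'Cross-Validation {metric} mean: {mean:.4f}')
```

The same helper should work on the arrays returned by `cross_validate`, as it only relies on `sum` and `len`.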

# Exercise 4.3

## SVM (Linear kernel)

In [None]:
from sklearn.svm import SVC

model = SVC(kernel='linear')

svm_lin_scores = cross_validate(model, X, Y, cv=5, scoring=['f1','accuracy','precision','recall','roc_auc'])

print(f'Cross-Validation Accuracy Scores: {svm_lin_scores["test_accuracy"]}')
print(f'Cross-Validation Accuracy Mean:   {svm_lin_scores["test_accuracy"].mean()}')

print(f'Cross-Validation Precision Scores: {svm_lin_scores["test_precision"]}')
print(f'Cross-Validation Precision Mean:   {svm_lin_scores["test_precision"].mean()}')

print(f'Cross-Validation Recall Scores: {svm_lin_scores["test_recall"]}')
print(f'Cross-Validation Recall Mean:   {svm_lin_scores["test_recall"].mean()}')

print(f'Cross-Validation F1 Scores: {svm_lin_scores["test_f1"]}')
print(f'Cross-Validation F1 Mean:   {svm_lin_scores["test_f1"].mean()}')

print(f'Cross-Validation ROC AUC Scores: {svm_lin_scores["test_roc_auc"]}')
print(f'Cross-Validation ROC AUC Mean:   {svm_lin_scores["test_roc_auc"].mean()}')

Cross-Validation Accuracy Scores: [0.67062468 0.6442953  0.61125452 0.63068182 0.63378099]
Cross-Validation Accuracy Mean:   0.6381274613123302
Cross-Validation Precision Scores: [0.62030075 0.63031423 0.57418112 0.62555066 0.62295082]
Cross-Validation Precision Mean:   0.6146595165561068
Cross-Validation Recall Scores: [0.59638554 0.41084337 0.35903614 0.34258142 0.36670688]
Cross-Validation Recall Mean:   0.4151106718793146
Cross-Validation F1 Scores: [0.60810811 0.49744712 0.44180875 0.44271239 0.46165528]
Cross-Validation F1 Mean:   0.4903463288387848
Cross-Validation ROC AUC Scores: [0.71760865 0.68506329 0.60463099 0.65229818 0.67808321]
Cross-Validation ROC AUC Mean:   0.6675368644676221


## SVM (RBF kernel)

In [None]:
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

model = SVC(kernel='rbf')

svm_rbf_scores = cross_validate(model, X, Y, cv=5, scoring=['f1','accuracy','precision','recall','roc_auc'])

print(f'Cross-Validation Accuracy Scores: {svm_rbf_scores["test_accuracy"]}')
print(f'Cross-Validation Accuracy Mean:   {svm_rbf_scores["test_accuracy"].mean()}')

print(f'Cross-Validation Precision Scores: {svm_rbf_scores["test_precision"]}')
print(f'Cross-Validation Precision Mean:   {svm_rbf_scores["test_precision"].mean()}')

print(f'Cross-Validation Recall Scores: {svm_rbf_scores["test_recall"]}')
print(f'Cross-Validation Recall Mean:   {svm_rbf_scores["test_recall"].mean()}')

print(f'Cross-Validation F1 Scores: {svm_rbf_scores["test_f1"]}')
print(f'Cross-Validation F1 Mean:   {svm_rbf_scores["test_f1"].mean()}')

print(f'Cross-Validation ROC AUC Scores: {svm_rbf_scores["test_roc_auc"]}')
print(f'Cross-Validation ROC AUC Mean:   {svm_rbf_scores["test_roc_auc"].mean()}')

Slightly better than the logistic regression, but (probably) slower to train

## SVM (Polynomial Kernel)

In [None]:
from sklearn.svm import SVC

model = SVC(kernel='poly')

svm_poly_scores = cross_validate(model, X, Y, cv=5, scoring=['f1','accuracy','precision','recall','roc_auc'])

print(f'Cross-Validation Accuracy Scores: {svm_poly_scores["test_accuracy"]}')
print(f'Cross-Validation Accuracy Mean:   {svm_poly_scores["test_accuracy"].mean()}')

print(f'Cross-Validation Precision Scores: {svm_poly_scores["test_precision"]}')
print(f'Cross-Validation Precision Mean:   {svm_poly_scores["test_precision"].mean()}')

print(f'Cross-Validation Recall Scores: {svm_poly_scores["test_recall"]}')
print(f'Cross-Validation Recall Mean:   {svm_poly_scores["test_recall"].mean()}')

print(f'Cross-Validation F1 Scores: {svm_poly_scores["test_f1"]}')
print(f'Cross-Validation F1 Mean:   {svm_poly_scores["test_f1"].mean()}')

print(f'Cross-Validation ROC AUC Scores: {svm_poly_scores["test_roc_auc"]}')
print(f'Cross-Validation ROC AUC Mean:   {svm_poly_scores["test_roc_auc"].mean()}')


Cross-Validation Accuracy Scores: [0.67681982 0.57150232 0.57150232 0.57179752 0.57179752]
Cross-Validation Accuracy Mean:   0.5926839024306993
Cross-Validation Precision Scores: [0.65789474 0.         0.         0.         0.        ]
Cross-Validation Precision Mean:   0.13157894736842107
Cross-Validation Recall Scores: [0.51204819 0.         0.         0.         0.        ]
Cross-Validation Recall Mean:   0.10240963855421688
Cross-Validation F1 Scores: [0.57588076 0.         0.         0.         0.        ]
Cross-Validation F1 Mean:   0.1151761517615176
Cross-Validation ROC AUC Scores: [0.71881347 0.64119241 0.35280635 0.58102567 0.57377278]
Cross-Validation ROC AUC Mean:   0.5735221344743276


## Decision Trees

In [None]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=5)

dt_scores = cross_validate(model, X, Y, cv=5, scoring=['f1','accuracy','precision','recall','roc_auc'])

print(f'Cross-Validation Accuracy Scores: {dt_scores["test_accuracy"]}')
print(f'Cross-Validation Accuracy Mean:   {dt_scores["test_accuracy"].mean()}')

print(f'Cross-Validation Precision Scores: {dt_scores["test_precision"]}')
print(f'Cross-Validation Precision Mean:   {dt_scores["test_precision"].mean()}')

print(f'Cross-Validation Recall Scores: {dt_scores["test_recall"]}')
print(f'Cross-Validation Recall Mean:   {dt_scores["test_recall"].mean()}')

print(f'Cross-Validation F1 Scores: {dt_scores["test_f1"]}')
print(f'Cross-Validation F1 Mean:   {dt_scores["test_f1"].mean()}')

print(f'Cross-Validation ROC AUC Scores: {dt_scores["test_roc_auc"]}')
print(f'Cross-Validation ROC AUC Mean:   {dt_scores["test_roc_auc"].mean()}')

Cross-Validation Accuracy Scores: [0.64481156 0.6293237  0.58079504 0.61880165 0.63326446]
Cross-Validation Accuracy Mean:   0.6213992840594427
Cross-Validation Precision Scores: [0.58179724 0.58163265 0.51359517 0.60756501 0.61247637]
Cross-Validation Precision Mean:   0.5794132873156272
Cross-Validation Recall Scores: [0.60843373 0.48072289 0.40963855 0.31001206 0.39083233]
Cross-Validation Recall Mean:   0.439927914311044
Cross-Validation F1 Scores: [0.59481743 0.52638522 0.45576408 0.41054313 0.47717231]
Cross-Validation F1 Mean:   0.4929364349657934
Cross-Validation ROC AUC Scores: [0.6893977  0.67497197 0.5791181  0.64706011 0.66747412]
Cross-Validation ROC AUC Mean:   0.6516043990345775


It will take a bit of playing with the parameters, but you can get the decision tree to be almost as good as logistic regression.
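One way to "play with the parameters" more systematically is to loop over candidate values and compare the cross-validated scores. The sketch below does this for `max_depth` on synthetic data from `make_classification`, so it runs without access to MIMIC. The candidate depths and the choice of ROC AUC as the comparison metric are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's X and Y (not the MIMIC data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Evaluate each candidate depth with 5-fold cross-validation
depth_scores = {}
for depth in [2, 3, 5, 8, None]:  # None lets the tree grow without a depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X_demo, y_demo, cv=5, scoring='roc_auc')
    depth_scores[depth] = scores.mean()
    print(f'max_depth={depth}: mean ROC AUC = {scores.mean():.3f}')

# Pick the depth with the highest mean score
best_depth = max(depth_scores, key=depth_scores.get)
print(f'Best max_depth: {best_depth}')
```

In the notebook, the same loop could be run with `X` and `Y` in place of the synthetic arrays.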

## KNN


In [None]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=10)

knn_scores = cross_validate(model, X, Y, cv=5, scoring=['f1','accuracy','precision','recall','roc_auc'])

print(f'Cross-Validation Accuracy Scores: {knn_scores["test_accuracy"]}')
print(f'Cross-Validation Accuracy Mean:   {knn_scores["test_accuracy"].mean()}')

print(f'Cross-Validation Precision Scores: {knn_scores["test_precision"]}')
print(f'Cross-Validation Precision Mean:   {knn_scores["test_precision"].mean()}')

print(f'Cross-Validation Recall Scores: {knn_scores["test_recall"]}')
print(f'Cross-Validation Recall Mean:   {knn_scores["test_recall"].mean()}')

print(f'Cross-Validation F1 Scores: {knn_scores["test_f1"]}')
print(f'Cross-Validation F1 Mean:   {knn_scores["test_f1"].mean()}')

print(f'Cross-Validation ROC AUC Scores: {knn_scores["test_roc_auc"]}')
print(f'Cross-Validation ROC AUC Mean:   {knn_scores["test_roc_auc"].mean()}')

Cross-Validation Accuracy Scores: [0.64068147 0.61538462 0.58853898 0.62293388 0.6089876 ]
Cross-Validation Accuracy Mean:   0.6153053093946932
Cross-Validation Precision Scores: [0.59203297 0.57657658 0.53374233 0.61647059 0.57142857]
Cross-Validation Precision Mean:   0.5780502069123505
Cross-Validation Recall Scores: [0.51927711 0.38554217 0.31445783 0.31604343 0.34740651]
Cross-Validation Recall Mean:   0.37654540962402083
Cross-Validation F1 Scores: [0.55327343 0.46209386 0.39575436 0.41786284 0.43210803]
Cross-Validation F1 Mean:   0.4522185031144755
Cross-Validation ROC AUC Scores: [0.66350878 0.63090574 0.5817748  0.62773741 0.63158397]
Cross-Validation ROC AUC Mean:   0.6271021361886655


Again, with careful parameter choice, we can bring the performance approximately level with the logistic regression.

# Exercise 4.4

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = KNeighborsClassifier(n_neighbors=10)

knn_std_scores = cross_validate(model, X_scaled, Y, cv=5, scoring=['f1','accuracy','precision','recall','roc_auc'])

print(f'Cross-Validation Accuracy Scores: {knn_std_scores["test_accuracy"]}')
print(f'Cross-Validation Accuracy Mean:   {knn_std_scores["test_accuracy"].mean()}')

print(f'Cross-Validation Precision Scores: {knn_std_scores["test_precision"]}')
print(f'Cross-Validation Precision Mean:   {knn_std_scores["test_precision"].mean()}')

print(f'Cross-Validation Recall Scores: {knn_std_scores["test_recall"]}')
print(f'Cross-Validation Recall Mean:   {knn_std_scores["test_recall"].mean()}')

print(f'Cross-Validation F1 Scores: {knn_std_scores["test_f1"]}')
print(f'Cross-Validation F1 Mean:   {knn_std_scores["test_f1"].mean()}')

print(f'Cross-Validation ROC AUC Scores: {knn_std_scores["test_roc_auc"]}')
print(f'Cross-Validation ROC AUC Mean:   {knn_std_scores["test_roc_auc"].mean()}')

Cross-Validation Accuracy Scores: [0.63397006 0.61590088 0.60144553 0.61725207 0.61828512]
Cross-Validation Accuracy Mean:   0.6173707317697555
Cross-Validation Precision Scores: [0.57766367 0.57733813 0.55846774 0.59482759 0.59453782]
Cross-Validation Precision Mean:   0.5805669888276779
Cross-Validation Recall Scores: [0.54216867 0.38674699 0.33373494 0.33293124 0.34137515]
Cross-Validation Recall Mean:   0.38739139913090237
Cross-Validation F1 Scores: [0.55935364 0.46320346 0.41779789 0.42691415 0.43371648]
Cross-Validation F1 Mean:   0.4601971231232511
Cross-Validation ROC AUC Scores: [0.67855269 0.62759874 0.61228981 0.63332091 0.64688794]
Cross-Validation ROC AUC Mean:   0.6397300184356307


By standardising the data we get a small increase in performance. Looking at the data, glucose has a much larger range than the other features. Without scaling, the distances used to search for neighbours are dominated by glucose, so the algorithm is biased towards this feature and the others contribute very little.
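The effect of feature range on the neighbour search can be illustrated with a little arithmetic. The sketch below takes two patients who differ by one standard deviation on every feature and compares how much of the squared Euclidean distance comes from glucose, before and after scaling. The standard deviations are hypothetical round numbers chosen for illustration, not values computed from the data.

```python
# Hypothetical standard deviations (illustrative, not computed from the data)
feature_sd = {
    'admission_age': 17.0,
    'heart_rate_mean': 15.0,
    'sbp_mean': 17.0,
    'glucose_max': 90.0,
}

# Two patients one standard deviation apart on every feature:
# the raw difference per feature equals that feature's standard deviation
raw_diffs = dict(feature_sd)
# After standardising, each difference is divided by the standard deviation
scaled_diffs = {name: diff / feature_sd[name] for name, diff in raw_diffs.items()}

def share_of_squared_distance(diffs, name):
    """Fraction of the squared Euclidean distance contributed by one feature."""
    total = sum(d ** 2 for d in diffs.values())
    return diffs[name] ** 2 / total

raw_share = share_of_squared_distance(raw_diffs, 'glucose_max')
scaled_share = share_of_squared_distance(scaled_diffs, 'glucose_max')

print(f'Glucose share of squared distance (raw):    {raw_share:.2f}')
print(f'Glucose share of squared distance (scaled): {scaled_share:.2f}')
```

Under these assumptions glucose accounts for about 91% of the raw squared distance, but only 25% once every feature is on the same scale.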