<img src="https://teaching.bowyer.ai/sdsai/resources/0/img/IMPERIAL_logo_RGB_Blue_2024.svg" alt="Imperial Logo" width="500"/><br /><br />

ML Foundations and Data Preparation - Tutorial Exercises
==============
### SURG70098 - Surgical Data Science and AI
### Stuart Bowyer

# Setup

In [None]:
# Install and import
%pip install pandas
%pip install matplotlib
%pip install pandas-gbq
import pandas as pd
import matplotlib.pyplot as plt
import pandas_gbq

# @markdown Enter your Google Cloud Project ID:
project_id = 'mimic-test-476513'  # @param {type:"string"}

df_day1_vitalsign = pandas_gbq.read_gbq("""
 SELECT *
 FROM `physionet-data.mimiciv_3_1_derived.first_day_vitalsign`
 LEFT JOIN (
 SELECT
 subject_id,
 stay_id,
 gender,
 race,
 admission_age,
 dod IS NOT NULL AS mortality
 FROM
 `physionet-data.mimiciv_3_1_derived.icustay_detail`
 )
 USING(subject_id, stay_id)
 WHERE heart_rate_mean IS NOT NULL
 LIMIT 10000
""", project_id=project_id)

df_day1_lab = pandas_gbq.read_gbq("""
  SELECT *
  FROM `physionet-data.mimiciv_3_1_derived.first_day_lab`
  LIMIT 1000
""", project_id=project_id)



# Exercise 3.1
## EDA

Perform an EDA of some of the remaining observation columns in the `df_day1_vitalsign` dataset.

You should perform the analysis as though you have been given this data to include in a model; however, here are some things to explore:
*   What are the associations between different continuous variables and `mortality`?
*   Which continuous variables are most/least correlated with `admission_age`?
*   Based on appropriate measures of spread (e.g., standard deviation, IQR), which continuous variables appear to vary the most and the least?
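
As a starting point, the three questions above could be approached along the following lines. This is only a sketch on a small synthetic data frame: the values are randomly generated, and `sbp_mean` is an assumed column name used for illustration, so swap in the real columns of `df_day1_vitalsign` when you run it yourself.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_day1_vitalsign (values are made up for illustration)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "admission_age": rng.uniform(18, 90, n),
    "heart_rate_mean": rng.normal(85, 15, n),
    "sbp_mean": rng.normal(120, 20, n),   # assumed column name
    "mortality": rng.random(n) < 0.3,
})
continuous = ["heart_rate_mean", "sbp_mean"]

# 1. Association with mortality: compare the group means of each variable
by_mortality = toy.groupby("mortality")[continuous].mean()

# 2. Correlation of each continuous variable with admission_age (Pearson)
age_corr = toy[continuous].corrwith(toy["admission_age"]).sort_values()

# 3. Spread: standard deviation and interquartile range per variable
spread = pd.DataFrame({
    "std": toy[continuous].std(),
    "iqr": toy[continuous].quantile(0.75) - toy[continuous].quantile(0.25),
})

print(by_mortality, age_corr, spread, sep="\n\n")
```

Group means are only one way to look at an association; box plots or histograms split by `mortality` would usually be worth drawing as well.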

# Exercise 3.2
## Data Cleaning Missingness

Spend the next 5 minutes exploring/addressing the missing data in the `df_day1_lab` MIMIC dataset

Hints
*   Start by identifying how many values are missing for each observation (i.e. per column)
*   Try using each of the methods we have covered
*   Consider the data types, as some methods are not compatible with every type
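
The hints above might translate into something like the sketch below. It assumes that deletion and simple median/mode imputation are among the methods covered, and it uses a tiny made-up data frame with hypothetical column names rather than the real `df_day1_lab`.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame; the column names are hypothetical
toy = pd.DataFrame({
    "glucose_max": [110.0, np.nan, 95.0, 180.0],
    "lactate_max": [np.nan, np.nan, 1.2, 2.4],
    "specimen": ["blood", None, "blood", "blood"],
})

# Count and fraction of missing values in each column
missing_per_column = toy.isna().sum()
missing_fraction = toy.isna().mean()

# Option 1: deletion (drop every row that has any missing value)
dropped = toy.dropna()

# Option 2: simple imputation, chosen by data type
imputed = toy.copy()
numeric_cols = imputed.select_dtypes(include="number").columns
# The median only makes sense for numeric columns
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].median())
# For a text column, fall back to the most frequent value (mode)
imputed["specimen"] = imputed["specimen"].fillna(imputed["specimen"].mode()[0])

print(missing_per_column)
```

Note how much data deletion discards here: half the rows are lost even though most individual values are present.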

# Exercise 3.3
## Data Cleaning Types

Spend the next 5 minutes cleaning types and inconsistencies in the `Temp`, `Gender`, and `BloodGlucose` columns of the toy dataset in `df_messy`

Hints
*   The conversion from temp in 'F' to 'C' is `(Temp°F − 32) × 5/9 = Temp°C`
*   Also look at the `BGMethod` column when considering the `BloodGlucose` values
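
To illustrate the temperature hint, here is a minimal sketch of the Fahrenheit to Celsius conversion on a toy column. The layout of the values (a number with an optional `F` or `C` suffix) is an assumption made for this example, as is treating values without a unit as Celsius, so inspect the real `Temp` column in `df_messy` before reusing any of it.

```python
import pandas as pd

# Toy data; the string layout is an assumption, not the real df_messy format
toy = pd.DataFrame({"Temp": ["98.6F", "37.0C", "36.5", "101.3 F"]})

# Split each entry into its numeric part and an optional unit letter
parts = toy["Temp"].str.strip().str.extract(r"^([0-9.]+)\s*([FfCc])?$")
value = pd.to_numeric(parts[0], errors="coerce")
is_fahrenheit = parts[1].str.upper().eq("F").fillna(False).astype(bool)

# Apply (F - 32) * 5/9 only to the Fahrenheit rows; other rows are kept as-is
# (assumption: a value with no unit is already in Celsius)
toy["temp_c"] = value.where(~is_fahrenheit, (value - 32) * 5 / 9).round(1)

print(toy)
```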

In [None]:
df_messy = pd.read_csv("https://teaching.bowyer.ai/sdsai/resources/3/data/messy_data.csv")

# Exercise 3.4
## Data Cleaning Complete

Clean as much of the `df_messy` data frame as you can. Copy from the previous exercises/examples where useful.

Some of the columns in the data do not have an obvious imputation method; use your judgement to decide what to do with these.

In [None]:
df_messy = pd.read_csv("https://teaching.bowyer.ai/sdsai/resources/3/data/messy_data.csv")

# Exercise 3.5
## Normalisation and standardisation

Spend the next five minutes applying one of normalisation, standardisation or log-transformation to the heart-rate data in `df_day1_vitalsign`.
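
For reference, the three transformations can each be written in a line of pandas/NumPy. The sketch below uses a short made-up series in place of `df_day1_vitalsign.heart_rate_mean`, and the choice of the population standard deviation (`ddof=0`) is an assumption; the sample version (`ddof=1`) is equally common.

```python
import numpy as np
import pandas as pd

# Made-up heart rates standing in for df_day1_vitalsign.heart_rate_mean
hr = pd.Series([60.0, 75.0, 90.0, 120.0], name="heart_rate_mean")

# Min-max normalisation: rescale to the range [0, 1]
hr_minmax = (hr - hr.min()) / (hr.max() - hr.min())

# Standardisation (z-score): zero mean and unit standard deviation
hr_z = (hr - hr.mean()) / hr.std(ddof=0)

# Log transformation: natural log, only valid for strictly positive values
hr_log = np.log(hr)

print(pd.DataFrame({"minmax": hr_minmax, "z": hr_z, "log": hr_log}))
```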

In [None]:
df_day1_vitalsign.heart_rate_mean

# Exercise 3.6
## Data Encoding

Try applying ordinal encoding to the `gender` and `race` columns in `df_day1_vitalsign`

Can you think of a more efficient way of encoding `gender`?
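
One possible approach is sketched below on a toy frame with made-up category values. It uses the pandas `category` dtype, whose integer codes follow the alphabetical order of the categories. The boolean `is_male` column is one possible answer to the question above and assumes that `gender` takes only two values.

```python
import pandas as pd

# Toy data; the category values are made up for illustration
toy = pd.DataFrame({
    "gender": ["F", "M", "M", "F"],
    "race": ["WHITE", "ASIAN", "UNKNOWN", "WHITE"],
})

# Ordinal encoding: each category becomes an integer code
# (categories are sorted alphabetically, so the order is arbitrary)
for col in ["gender", "race"]:
    toy[f"{col}_code"] = toy[col].astype("category").cat.codes

# A two-valued column can be stored as a single boolean instead
toy["is_male"] = toy["gender"] == "M"

print(toy)
```

Bear in mind that ordinal codes imply an ordering that `race` does not have, which can mislead some models.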

In [None]:
df_day1_vitalsign

# Exercise 3.7
## Feature Engineering

[MEWS](https://www.mdcalc.com/calc/1875/modified-early-warning-score-mews-clinical-deterioration) is an early warning risk score for patient deterioration.

Engineer a new feature for the `df_day1_vitalsign` dataset called `mews` that computes this value based on systolic BP, HR, RR and temperature.

Then engineer another new feature called `mews_category` that bins the MEWS score into categories using the thresholds >= 3 and >= 5 (i.e. below 3, 3 to 4, and 5 or above).
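
The binning step could be done with `pd.cut`, as in the sketch below. It starts from a made-up series of already-computed integer scores (the scoring itself is left to you, using the thresholds in the linked calculator), and the category labels are hypothetical names chosen for this example.

```python
import pandas as pd

# Made-up, already-computed MEWS scores (assumed to be integers)
mews = pd.Series([0, 2, 3, 4, 5, 9], name="mews")

# Bin edges are right-inclusive: (-inf, 2], (2, 4], (4, inf)
# For integer scores this gives: below 3, 3 to 4, and 5 or above
mews_category = pd.cut(
    mews,
    bins=[-float("inf"), 2, 4, float("inf")],
    labels=["<3", ">=3", ">=5"],  # hypothetical label names
)

print(pd.DataFrame({"mews": mews, "mews_category": mews_category}))
```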

In [None]:
df_day1_vitalsign

# Exercise 3.8
## The Complete Preprocessing Pipeline

Consider that you want to build a model to predict which patients undergoing a Caesarean section will require a long admission.

You will define a long admission as one that contains a Caesarean section procedure and that lasts more than 7 days.

You have been given the following data sets to build your model from.

*   `df_admissions`
*   `df_age`
*   `df_diagnoses_icd`
*   `df_procedures_icd`

For the model, you hypothesise that you will need the following features (all of which are available in the data sets above)

1.  Age at admission
1.  A diagnosis of diabetes mellitus
1.  Admission type (i.e. emergency/urgent/routine)
1.  Ethnicity

## Suggested Steps
Follow the steps below to build a data preparation pipeline for this model

1.  Briefly explore the raw data sets you have been given to understand what they contain and represent
1.  Identify your patient/admission cohort for the model (i.e. admissions where patients have had a Caesarean section)
1.  Establish which admissions are/are not longer than 7 days
1.  Extract/engineer each of the features above for the patient/admissions in your cohort
1.  Perform a simple EDA of each feature to understand what it represents and potentially whether it has an association with long admissions
1.  Combine all of your features, and the output labels for long admissions, into a single feature table
1.  Perform any further data preparation necessary to get your data into a (numerical/boolean) format that can be used with a basic machine learning model
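
For the final step, one common option is one-hot encoding of the categorical features, sketched below with `pd.get_dummies`. The feature table here is entirely made up, and its column names (`has_diabetes`, `long_admission`, and so on) are hypothetical; your own table may be laid out differently.

```python
import pandas as pd

# Made-up combined feature table; all names and values are hypothetical
features = pd.DataFrame({
    "hadm_id": [1, 2, 3],
    "age": [31, 27, 40],
    "has_diabetes": [False, True, False],
    "admission_type": ["URGENT", "ELECTIVE", "URGENT"],
    "race": ["WHITE", "ASIAN", "WHITE"],
    "long_admission": [False, True, False],
})

# One-hot encode the text columns so every column is numeric or boolean
model_table = pd.get_dummies(
    features.set_index("hadm_id"),
    columns=["admission_type", "race"],
    dtype=bool,
)

print(model_table.dtypes)
```

Keeping `hadm_id` as the index rather than a column stops the identifier from being treated as a feature.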

## Hints
*   The ICD-9 procedure codes for Caesarean section are 740 to 749
*   The ICD-9 diagnosis codes for diabetes mellitus are 25000 to 25099
*   The `hadm_id` key allows you to link data between tables based on an admission, i.e. rows that share a `hadm_id` were recorded during the same admission.
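
A rough sketch of how these hints might be used is shown below, on toy tables with made-up rows. It rests on several assumptions you should check against the real data: that `icd_code` is stored as a string without a decimal point, that the procedure range refers to the first three characters of the code, and that the admissions table has `admittime` and `dischtime` columns.

```python
import pandas as pd

# Toy stand-ins for the real tables (rows are made up)
toy_admissions = pd.DataFrame({
    "hadm_id": [1, 2, 3],
    "admittime": pd.to_datetime(["2150-01-01", "2150-02-01", "2150-03-01"]),
    "dischtime": pd.to_datetime(["2150-01-04", "2150-02-12", "2150-03-03"]),
})
toy_procedures = pd.DataFrame({
    "hadm_id": [1, 2, 3],
    "icd_code": ["741", "7499", "3893"],
    "icd_version": [9, 9, 9],
})
toy_diagnoses = pd.DataFrame({
    "hadm_id": [1, 2, 2],
    "icd_code": ["25000", "64891", "V270"],
    "icd_version": [9, 9, 9],
})

def icd9_prefix_between(df, low, high):
    """Flag ICD-9 rows whose leading characters lie between low and high."""
    prefix = df["icd_code"].str.strip().str[:len(low)]
    return (df["icd_version"] == 9) & prefix.between(low, high)

# Cohort: admissions with at least one procedure code in the 740-749 range
cohort_ids = toy_procedures.loc[
    icd9_prefix_between(toy_procedures, "740", "749"), "hadm_id"
].unique()
cohort = toy_admissions[toy_admissions["hadm_id"].isin(cohort_ids)].copy()

# Label: length of stay longer than 7 days
los_days = (cohort["dischtime"] - cohort["admittime"]).dt.total_seconds() / 86400
cohort["long_admission"] = los_days > 7

# Feature: any diabetes diagnosis (25000-25099) linked by hadm_id
diabetic_ids = toy_diagnoses.loc[
    icd9_prefix_between(toy_diagnoses, "25000", "25099"), "hadm_id"
].unique()
cohort["has_diabetes"] = cohort["hadm_id"].isin(diabetic_ids)

print(cohort)
```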

## Data Retrieval Code
The following Python code pulls the datasets from BigQuery/MIMIC-IV for you and stores them in pandas data frames

To avoid making unnecessary requests to the database, you should only run this block once and then do your processing on the local data frames.

In [None]:
df_admissions = pandas_gbq.read_gbq("""
  SELECT *
  FROM `physionet-data.mimiciv_3_1_hosp.admissions`
  WHERE MOD(subject_id, 10) = 0
""", project_id=project_id)

df_age = pandas_gbq.read_gbq("""
  SELECT subject_id, hadm_id, age
  FROM `physionet-data.mimiciv_3_1_derived.age`
  WHERE MOD(subject_id, 10) = 0
""", project_id=project_id)

df_diagnoses_icd = pandas_gbq.read_gbq("""
  SELECT subject_id, hadm_id, icd_code, icd_version
  FROM `physionet-data.mimiciv_3_1_hosp.diagnoses_icd`
  WHERE MOD(subject_id, 10) = 0
""", project_id=project_id)

df_procedures_icd = pandas_gbq.read_gbq("""
  SELECT subject_id, hadm_id, icd_code, icd_version
  FROM `physionet-data.mimiciv_3_1_hosp.procedures_icd`
  WHERE MOD(subject_id, 10) = 0
""", project_id=project_id)

In [None]:
# YOUR CODE HERE...