<img src="https://teaching.bowyer.ai/sdsai/resources/0/img/IMPERIAL_logo_RGB_Blue_2024.svg" alt="Imperial Logo" width="500"/><br /><br />

Programming, Manipulating and Visualising Data - Challenge Exercise 1
==============
### SURG70098 - Surgical Data Science and AI
### Stuart Bowyer

# Setup

In [None]:
%pip install pandas_gbq --quiet
import pandas_gbq
import pandas as pd
import matplotlib.pyplot as plt

# @markdown Enter your Google Cloud Project ID:
project_id = 'mimic-test-12345'  # @param {type:"string"}

# Part 1 - Exploring the Patient Table

## Part 1a - Load the `patients` table

* Load the table (using the function from the lecture notes)
* Print it to check you have the right data

## Part 1b - Check the data types for each column

* Use the MIMIC documentation to understand what each column is
* What do you notice about the dates - unlike when loading a CSV?
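
For comparison, here is a minimal sketch of what happens to dates when loading from a CSV. It uses a tiny made-up CSV (the column names follow the `patients` table, but the values are invented) and makes no claim about what BigQuery returns; check that yourself with `.dtypes` on your loaded table.

```python
import io

import pandas as pd

# Toy CSV with invented values; this is NOT real MIMIC data
csv_text = "subject_id,gender,dod\n1,F,2150-03-01\n2,M,2160-11-15\n"

# Default read: the date column is left as plain text
df_raw = pd.read_csv(io.StringIO(csv_text))
raw_is_datetime = pd.api.types.is_datetime64_any_dtype(df_raw["dod"])

# Asking pandas to parse the column gives a proper datetime dtype
df_parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["dod"])
parsed_is_datetime = pd.api.types.is_datetime64_any_dtype(df_parsed["dod"])

print(df_raw.dtypes)
print(df_parsed.dtypes)
```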

## Part 1c - Compute the gender rates

* Start by exploring which values are in the column (`.unique()` might help)
* Compute the number of patients for each
* Compute the percentages of total
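
The three steps above can be sketched on a toy table as follows. The DataFrame `df_toy` and its values are invented for illustration; swap in your own `patients` DataFrame.

```python
import pandas as pd

# Toy stand-in for the patients table (invented values)
df_toy = pd.DataFrame(
    {"subject_id": [1, 2, 3, 4], "gender": ["F", "M", "F", "F"]}
)

# Which values appear in the column?
values = df_toy["gender"].unique()

# Number of patients per value
counts = df_toy["gender"].value_counts()

# Share of the total, as a percentage
percentages = df_toy["gender"].value_counts(normalize=True) * 100

print(values)
print(counts)
print(percentages)
```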

## Part 1d - Visualise the gender rates

* Use an appropriate visualisation method of your choice to display the gender rates

# Part 2 - Exploring Diagnosis Data

## Part 2a - Load the diagnosis table

* Load the table (using the function from the lecture notes)
* Print it to check you have the right data
* **NOTE - this is a big table**, so you should just work with a small part of it for now (use `LIMIT 10000` in your SQL)

## Part 2b - Check the structure and types

* Use the MIMIC documentation to understand what each column is
* How can we link a diagnosis to a patient?
* How can we link a diagnosis to a date/time?
* How are diagnoses encoded?

## Part 2c - Count number of atrial fibrillation diagnoses in the dataset

* The ICD-9 code for atrial fibrillation is 427.31, **coded in MIMIC as 42731**
* You can use your `LIMIT`ed dataset for this, but remember that you are not actually looking at all the data

## Part 2d - Count number of patients with atrial fibrillation diagnoses

* You might want to use `.nunique()` to find the number of unique values in a given column
* What does comparing this result to that in the previous section tell you?

If there are fewer patients with atrial fibrillation than atrial fibrillation diagnoses, some patients have been given the diagnosis more than once, for example across several admissions. Keep this in mind whenever you count diagnoses rather than patients.
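
The difference between counting rows and counting patients can be seen in a small sketch. The table `df_toy_dx` is invented; only the code 42731 comes from the exercise.

```python
import pandas as pd

# Toy stand-in for diagnoses_icd (invented rows)
df_toy_dx = pd.DataFrame(
    {
        "subject_id": [10, 10, 11, 12, 12],
        "hadm_id": [100, 101, 102, 103, 104],
        "icd_code": ["42731", "42731", "4019", "42731", "25000"],
    }
)

# Keep only the atrial fibrillation rows
af_rows = df_toy_dx[df_toy_dx["icd_code"] == "42731"]

# Number of diagnoses = number of rows
n_diagnoses = len(af_rows)

# Number of patients = number of distinct subject_id values
n_patients = af_rows["subject_id"].nunique()

print(n_diagnoses, n_patients)
```

Here patient 10 has two atrial fibrillation rows, so the diagnosis count exceeds the patient count.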

## Part 2e - Compute and visualise the gender breakdown of patients with atrial fibrillation

* This requires combining the diagnosis and patients tables
* There are many ways to achieve this, but here is a suggested set of steps:
    * Get a `subject_id` column that defines all the patients with atrial fibrillation
    * Use `.merge` to combine the `patients` table with this list of patients
    * Use the analysis methods from part 1 on the resulting table
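
One possible shape for these steps, shown on toy tables with invented values (the names `df_toy_patients`, `df_toy_dx` and `af_subjects` are illustrative only):

```python
import pandas as pd

# Toy stand-ins for the patients and diagnoses tables (invented values)
df_toy_patients = pd.DataFrame(
    {"subject_id": [10, 11, 12, 13], "gender": ["F", "M", "M", "F"]}
)
df_toy_dx = pd.DataFrame(
    {
        "subject_id": [10, 10, 12, 13],
        "icd_code": ["42731", "42731", "42731", "4019"],
    }
)

# One row per patient with an atrial fibrillation code
af_subjects = df_toy_dx.loc[
    df_toy_dx["icd_code"] == "42731", ["subject_id"]
].drop_duplicates()

# An inner merge keeps only the patients that appear in af_subjects
df_af_patients = df_toy_patients.merge(af_subjects, on="subject_id", how="inner")

# Same analysis as in Part 1
af_gender_counts = df_af_patients["gender"].value_counts()
print(af_gender_counts)
```

Dropping duplicates before the merge matters: without it, a patient with two diagnoses would be counted twice in the gender breakdown.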

# Part 3 - Defining a Sepsis Cohort

In this exercise, we will compute several characteristics of patients in the MIMIC-IV dataset with sepsis.

## Part 3a - Identifying a patient cohort list

* Explore ways to identify the list of `subject_id`s for patients with sepsis
* There are multiple ways to do this; ask and we can discuss the options

### Hints

This step is very similar to what you did in Part 2. You will query the `diagnoses_icd` table and then filter it in Pandas.

1.  Load a *sample* of the `diagnoses_icd` table. Remember, it's a big table, so use `LIMIT`. `LIMIT 100000` is a good starting point.
    * *Query:* `SELECT * FROM physionet-data.mimiciv_3_1_hosp.diagnoses_icd LIMIT 100000`
    * Load this into a DataFrame, e.g., `df_diagnoses`.
    * You might already have this loaded so can skip this
2.  The ICD-9 code for 'Sepsis' is 99591 and for 'Septic shock' is 78552.
3.  You need to filter your `df_diagnoses` to find rows where the `icd_code` is one of these two values.
    * *Hint:* The `.isin()` method is perfect for this:
        `df_diagnoses[df_diagnoses['icd_code'].isin(['99591', '78552'])]`
4.  Store this filtered data in a new DataFrame, e.g., `df_sepsis`.
5.  You might want to get a unique list of admissions by using `.drop_duplicates(subset=['hadm_id'])`. This DataFrame is your sepsis cohort.
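
Steps 3 to 5 can be sketched on a toy table as follows. The rows are invented; the two sepsis codes are the ones given above.

```python
import pandas as pd

# Toy stand-in for df_diagnoses (invented rows)
df_toy_dx = pd.DataFrame(
    {
        "subject_id": [1, 1, 2, 3],
        "hadm_id": [100, 100, 200, 300],
        "icd_code": ["99591", "78552", "4019", "99591"],
    }
)

# Keep rows whose code is one of the two sepsis codes
sepsis_codes = ["99591", "78552"]
df_toy_sepsis = df_toy_dx[df_toy_dx["icd_code"].isin(sepsis_codes)]

# Admission 100 has both codes, so keep one row per admission
df_toy_sepsis = df_toy_sepsis.drop_duplicates(subset=["hadm_id"])

print(df_toy_sepsis)
```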

## Part 3b - Identifying an associated set of admissions

* Knowing which patients have sepsis is only half the challenge; you also need to identify **when** they had sepsis
* Start by looking at the `admissions` table and identify which admissions are associated with a sepsis event

### Hints

Now your goal is to get more details about the *admissions* for the cohort you just identified in `df_sepsis`. This will help you find out *when* they were admitted.

1.  First, load a sample of the `admissions` table. This table is not too big, but using a `LIMIT 100000` is still a good, safe practice.
    * *Query:* `SELECT * FROM physionet-data.mimiciv_3_1_hosp.admissions LIMIT 100000`
    * Load this into a new DataFrame, e.g., `df_admissions`.
2.  Now, `merge` your `df_sepsis` (from 3a) with `df_admissions`.
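    * Merge on both `subject_id` and `hadm_id`, as both tables share these columns. Merging on `hadm_id` alone would leave you with duplicated `subject_id_x` and `subject_id_y` columns, which breaks the later steps that use `subject_id`.
    * `df_sepsis_admissions = pd.merge(df_sepsis, df_admissions, on=['subject_id', 'hadm_id'])`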
3.  This new `df_sepsis_admissions` DataFrame now contains all the rows from `df_admissions` that correspond to a sepsis diagnosis *in your sample*. It will have the `admittime` and `dischtime` columns you need for the next step.
    * **Note:** Your result might be small if your `LIMIT`ed samples didn't overlap much. This is expected and is fine for the exercise.
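
A minimal sketch of the merge on toy tables (all values invented). Both tables carry a `subject_id` column, so merging on `hadm_id` alone makes pandas rename the duplicates; the sketch shows both variants for contrast.

```python
import pandas as pd

# Toy stand-ins for df_sepsis and df_admissions (invented values)
df_toy_sepsis = pd.DataFrame(
    {"subject_id": [1, 3], "hadm_id": [100, 300], "icd_code": ["99591", "99591"]}
)
df_toy_adm = pd.DataFrame(
    {
        "subject_id": [1, 2, 3],
        "hadm_id": [100, 200, 301],
        "admittime": pd.to_datetime(["2150-01-01", "2150-01-05", "2150-02-01"]),
        "dischtime": pd.to_datetime(["2150-01-03", "2150-01-06", "2150-02-02"]),
    }
)

# Merging on both shared keys keeps a single subject_id column
merged = pd.merge(df_toy_sepsis, df_toy_adm, on=["subject_id", "hadm_id"])

# Merging on hadm_id alone produces subject_id_x and subject_id_y
merged_hadm_only = pd.merge(df_toy_sepsis, df_toy_adm, on="hadm_id")

print(merged.columns.tolist())
print(merged_hadm_only.columns.tolist())
```

Only admission 100 appears in both toy tables, so the result has a single row. This mirrors the small overlap you may see between two `LIMIT`ed samples.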

## Part 3c - Basic sepsis cohort characteristics

* Explore and visualise the gender split and length of stay distribution for your sepsis cohort

### Hints

You will use the DataFrames you've already created.

**Gender Analysis**

1.  You need two DataFrames:
    * `df_sepsis_admissions` (from Part 3b)
    * `df_patients` (from Part 1)
2.  `merge` these two DataFrames on `subject_id`.
3.  The resulting DataFrame will have both `gender` and admission data. You can now analyze the `gender` column (e.g., `.value_counts()`) and visualize it.

**Length of Stay (LOS) Analysis**

1.  This is even easier! The DataFrame `df_sepsis_admissions` (which you created in Part 3b) *already* has the `admittime` and `dischtime` columns.
2.  Check the data types with `.info()`. The `admittime` and `dischtime` columns should be datetime objects.
3.  You can create a new `los` column by subtracting the `admittime` from the `dischtime`:
    ```python
    df_sepsis_admissions['los'] = df_sepsis_admissions['dischtime'] - df_sepsis_admissions['admittime']
    ```
4.  This will give you a `timedelta` object. To get the LOS in days (as a number), you can use the `.dt.total_seconds()` accessor and divide:
    ```python
    df_sepsis_admissions['los_days'] = df_sepsis_admissions['los'].dt.total_seconds() / (60*60*24)
    ```
5.  Now you can plot a histogram of this `los_days` column.
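
The length-of-stay calculation can be checked on a toy table before running it on real data. The admission times below are invented; dividing by `pd.Timedelta(days=1)` is shown as an equivalent alternative to the `total_seconds()` route.

```python
import pandas as pd

# Toy stand-in for df_sepsis_admissions (invented times)
df_toy = pd.DataFrame(
    {
        "hadm_id": [100, 300],
        "admittime": pd.to_datetime(["2150-01-01 00:00", "2150-02-01 12:00"]),
        "dischtime": pd.to_datetime(["2150-01-03 12:00", "2150-02-02 12:00"]),
    }
)

# Subtracting two datetime columns gives a timedelta column
df_toy["los"] = df_toy["dischtime"] - df_toy["admittime"]

# Convert to a plain number of days
df_toy["los_days"] = df_toy["los"].dt.total_seconds() / (60 * 60 * 24)

# Equivalent: divide the timedelta column by a one-day Timedelta
los_days_alt = df_toy["los"] / pd.Timedelta(days=1)

print(df_toy[["hadm_id", "los", "los_days"]])
```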

## Part 3d - Plot a single patient's heart rate

* Pick any single patient from your cohort and plot their heart rate observations during the admission
* You will need to use the `chartevents` and `d_items` tables
* `d_items` is a 'dictionary' that lets you lookup event codes for specific types of observation

### Hints

The `chartevents` table is **massive** (it has hundreds of millions of rows).

**DO NOT** try to load this table without a `LIMIT` (or a selective `WHERE` filter). The full table is far too large to analyse in Pandas, so you **must** query a small part of it.

**Step 1: Find the `itemid` for Heart Rate**
1.  Load the `d_items` (dictionary) table.
    * *Query:* `SELECT * FROM physionet-data.mimiciv_3_1_icu.d_items`
    * Load into `df_d_items`.
2.  Filter this DataFrame in Pandas to find the `itemid` for 'Heart Rate'.
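    * `hr_itemid = df_d_items.loc[df_d_items['label'] == 'Heart Rate', 'itemid'].iloc[0]`
    * The `.iloc[0]` extracts a single number (it should be 220045) rather than a one-row DataFrame, so it can be compared against the `itemid` column in Step 2.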
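
The comparison in Step 2 needs a single `itemid` value, not a filtered DataFrame. A minimal sketch of the difference on a toy dictionary table (the placeholder rows are invented; 220045 is the heart rate `itemid` used later in this notebook):

```python
import pandas as pd

# Toy stand-in for df_d_items; the placeholder rows are invented
df_toy_items = pd.DataFrame(
    {
        "itemid": [900001, 220045, 900002],
        "label": ["Placeholder A", "Heart Rate", "Placeholder B"],
    }
)

# Boolean filtering returns a DataFrame, even when only one row matches
matches = df_toy_items[df_toy_items["label"] == "Heart Rate"]

# Pull the single itemid out as a scalar value
hr_itemid_toy = matches["itemid"].iloc[0]

print(type(matches).__name__, hr_itemid_toy)
```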

**Step 2: Load a Sample of `chartevents` and filter for Heart Rate**
1.  Load a **sample** of the `chartevents` table. A `LIMIT 1000000` (one million) is a good start. This will still be a large query.
    * *Query:* `SELECT * FROM physionet-data.mimiciv_3_1_icu.chartevents LIMIT 1000000`
    * Load into `df_chartevents`.
2.  Filter this `df_chartevents` to *only* keep rows for Heart Rate.
    * `df_hr = df_chartevents[df_chartevents['itemid'] == hr_itemid].copy()`
    * This `df_hr` DataFrame now contains all heart rate measurements *from your sample*.

**Step 3: Plot a single patient's heart rate (Part 3d)**
1.  Look at your `df_sepsis_admissions` (from 3b) and pick any single `hadm_id`.
2.  Filter your `df_hr` (from Step 2) to get the rows for *only* that one `hadm_id`.
    * `df_one_patient_hr = df_hr[df_hr['hadm_id'] == <your_chosen_hadm_id>]`
3.  This `df_one_patient_hr` will be small. You can now plot `valuenum` (the HR value) vs. `charttime` (the time) using a line plot.
    * *Hint:* You might want to `sort_values('charttime')` first.
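
A minimal sketch of the filter-and-sort step on toy heart rate data (all values invented, and `chosen_hadm_id` is a stand-in for whichever admission you pick):

```python
import pandas as pd

# Toy stand-in for df_hr; rows are deliberately out of time order
df_toy_hr = pd.DataFrame(
    {
        "hadm_id": [100, 100, 100, 300],
        "charttime": pd.to_datetime(
            [
                "2150-01-01 10:00",
                "2150-01-01 08:00",
                "2150-01-01 09:00",
                "2150-02-01 13:00",
            ]
        ),
        "valuenum": [92.0, 88.0, 95.0, 70.0],
    }
)

chosen_hadm_id = 100  # hypothetical choice

# Keep one admission and put its observations in time order
df_one = df_toy_hr[df_toy_hr["hadm_id"] == chosen_hadm_id].sort_values("charttime")

# Index by time so that a line plot uses charttime on the x-axis
hr_series = df_one.set_index("charttime")["valuenum"]

print(hr_series)
```

Calling `hr_series.plot()` afterwards would draw the line plot. Without the sort, the line would zigzag back and forth in time.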

## Part 3e - Analyse and visualise the distribution of heart rate values for patients in your cohort

* Extract heart rate values for every sepsis admission in your cohort
* Visualise the distribution of these values in an appropriate way

### Hints

1.  You need to find all the heart rates that belong to your sepsis cohort.
2.  `merge` your `df_sepsis_admissions` (from 3b) with your `df_hr` (from Step 2).
    * You should merge on `hadm_id`.
    * `df_sepsis_hr = pd.merge(df_sepsis_admissions, df_hr, on='hadm_id')`
3.  This `df_sepsis_hr` DataFrame now *only* contains heart rate measurements for the sepsis patients found in your samples.
4.  You can now plot a histogram of the `valuenum` column from `df_sepsis_hr`.
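
A minimal sketch of this merge on toy tables (all values invented). Selecting only the `hadm_id` column from the cohort before merging is an optional choice made here to avoid duplicated columns in the result.

```python
import pandas as pd

# Toy stand-ins for df_sepsis_admissions and df_hr (invented values)
df_toy_sa = pd.DataFrame({"subject_id": [1, 3], "hadm_id": [100, 300]})
df_toy_hr = pd.DataFrame(
    {
        "subject_id": [1, 1, 2, 3],
        "hadm_id": [100, 100, 200, 300],
        "valuenum": [88.0, 95.0, 60.0, 110.0],
    }
)

# Keep only heart rates that belong to a cohort admission
df_toy_sepsis_hr = pd.merge(df_toy_sa[["hadm_id"]], df_toy_hr, on="hadm_id")

# A quick numeric summary before plotting a histogram
summary = df_toy_sepsis_hr["valuenum"].describe()

print(summary)
```

The heart rate for admission 200 is dropped because that admission is not in the toy cohort.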

## Supplementary SQL

You might find that with the `LIMIT`s on SQL queries you end up with little or no data for 3d and 3e. To help with that, you can use the code below to selectively load all the HR chartevents for your sepsis patients. To use this, you need to have a dataframe called `df_sepsis_admissions` that has a column called `subject_id` with your sepsis patient IDs. This should load all data in about a minute, depending on how many patients you've identified.

This works by selecting only chartevents `WHERE`:

* `itemid` is 220045 (which is HR), AND
* `subject_id` is in your list of sepsis patients
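
If your cohort turns out to be empty, the `IN (...)` list would be empty too, which would not be valid SQL. One way to guard against that is a small helper that builds the query string. This is a sketch only: the function name `build_hr_query` is hypothetical, and it only builds the string; it does not contact BigQuery.

```python
def build_hr_query(subject_ids, itemid=220045):
    """Build a chartevents query for one itemid and a list of patients."""
    # De-duplicate and force plain integers so the SQL text is clean
    ids = sorted({int(s) for s in subject_ids})
    if not ids:
        raise ValueError("No subject IDs supplied; IN () would be invalid SQL")
    id_list = ", ".join(str(i) for i in ids)
    return (
        "SELECT * FROM `physionet-data.mimiciv_3_1_icu.chartevents` "
        f"WHERE itemid = {itemid} AND subject_id IN ({id_list})"
    )


# Example with invented patient IDs
query = build_hr_query([3, 1, 3])
print(query)
```

The resulting string could then be passed to `pandas_gbq.read_gbq(query, project_id=project_id)` in the same way as the cell below.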

In [None]:
# Get the list of subject IDs from sepsis admissions
sepsis_subjects = df_sepsis_admissions['subject_id'].unique().tolist()
sepsis_subjects_str = ','.join(map(str, sepsis_subjects))

# Now load only the heart rate data for sepsis patients
df_chartevents = pandas_gbq.read_gbq(f"""
SELECT
    *
FROM
    `physionet-data.mimiciv_3_1_icu.chartevents`
WHERE
    itemid = 220045
    AND subject_id IN ({sepsis_subjects_str})
""", project_id=project_id)

## Bonus 1 - Repeat above using an alternative sepsis definition

* There is a `microbiologyevents` table