<img src="https://teaching.bowyer.ai/sdsai/resources/0/img/IMPERIAL_logo_RGB_Blue_2024.svg" alt="Imperial Logo" width="500"/><br /><br />

Accessing Clinical Data and Applied Challenge 1
==============
### SURG70098 - Surgical Data Science and AI
### Stuart Bowyer

## Intended Learning Outcomes
1.  Understand the benefits and challenges of the MIMIC-IV dataset
1.  Gain hands-on practice programming with, manipulating, and visualising clinical data
1.  Understand and run basic SQL against clinical datasets
1.  Be able to connect your Python environment to clinical data on BigQuery


## Session Outline
1.  [MIMIC - Medical Information Mart for Intensive Care](#mimic)
1.  [BigQuery - Cloud Database](#bigquery)
1.  [Database Queries (SQL)](#sql)
1.  [Revision and Consolidation Challenge](#challenge)
1.  [Wrap Up](#wrap_up)

# MIMIC
## Medical Information Mart for Intensive Care
https://www.nature.com/articles/s41597-022-01899-x

* Large, freely available database of de-identified health data from patients admitted to the Beth Israel Deaconess Medical Center ICU in Boston, USA
* Focused on the ICU, but MIMIC-IV also includes hospital-wide data
* Used for learning, clinical informatics, clinical decision making, and ML
* Data from:
    * **364,627 unique patients**
    * **546,028 admissions**
    * **94,458 ICU stays**



## Data Contents

* **Patient demographics:** Age, gender, admission/discharge info.
* **Clinical observations:** Vital signs, lab results.
* **Diagnoses:** ICD codes.
* **Interventions:** Procedures, Medications.
* **Outcome data:** Length of stay, mortality.
* **Free-text notes:** Nursing notes, discharge summaries (not accessed separately)

## Challenges of Using MIMIC

* Dates are shifted into the future (e.g., “2100-05-12”), so absolute calendar dates are meaningless
    * Time intervals are consistent within a patient
    * Dates cannot be compared between patients
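* Ages > 89 are masked: in MIMIC-IV, `anchor_age` is recorded as 91 for these patients (the older MIMIC-III shifted them to appear around 300 years old)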
* Text data are redacted
* All the challenges of using real data - missingness, formatting issues, alignment
* 100+ GB of data
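
Because of the date shifting, it is usually safer to work with intervals than with absolute dates. The sketch below illustrates this with pandas on a small, invented table that borrows the `subject_id`, `hadm_id`, `admittime` and `dischtime` column names from the `admissions` table; the rows themselves are made up for illustration and are not real MIMIC data.

```python
import pandas as pd

# Toy rows invented for illustration; not real MIMIC-IV records.
admissions = pd.DataFrame(
    {
        "subject_id": [1, 1, 2],
        "hadm_id": [101, 102, 201],
        "admittime": pd.to_datetime(
            ["2100-05-12 08:00", "2100-09-01 14:00", "2187-01-03 20:00"]
        ),
        "dischtime": pd.to_datetime(
            ["2100-05-15 08:00", "2100-09-02 02:00", "2187-01-10 20:00"]
        ),
    }
)

# Within one admission, the difference between timestamps is meaningful.
admissions["los_days"] = (
    admissions["dischtime"] - admissions["admittime"]
).dt.total_seconds() / 86400

# Within one patient, the gap between successive admissions is also meaningful.
admissions = admissions.sort_values(["subject_id", "admittime"])
previous_discharge = admissions.groupby("subject_id")["dischtime"].shift()
admissions["days_since_last_discharge"] = (
    admissions["admittime"] - previous_discharge
).dt.total_seconds() / 86400

print(admissions[["subject_id", "hadm_id", "los_days", "days_since_last_discharge"]])
```

Comparing `admittime` between patient 1 and patient 2 would tell us nothing, since each patient's dates are shifted by a different amount.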

# BigQuery - Cloud Database

* Google Cloud’s data warehouse – run SQL on huge datasets
* No servers to manage – fully managed and scalable
* Fast – queries on millions of records in seconds
* Pay-as-you-go – pay only for storage and queries
* Works with Python & notebooks – easy for analysis
* Great for healthcare data – secure, reproducible, and handles large EHR datasets

## Setup
**Instructions are on Blackboard**

* Request access to MIMIC-IV 3.1 through BigQuery by clicking on the link in the files section of the PhysioNet page - https://physionet.org/content/mimiciv/3.1/#files
* If you do not see a 'Request access using Google BigQuery' option, check you have added and enabled your Gmail address in PhysioNet
* Check that you get a response like the following:
    * *Access to the GCP BigQuery has been granted to your.email@gmail.com for project: MIMIC-IV v3.1*
* BigQuery is a complex and powerful platform; however, we will only use it in a simple way for this module
* Sign into BigQuery - https://console.cloud.google.com/bigquery

### In BigQuery
* Create a project:
    * Press 'Select a project'
    * Press 'NEW PROJECT'
    * Add a project name, e.g. "MIMIC Project"
    * Press 'CREATE'
* Take note of the full project ID (e.g. 'mimic-project-12345'), as you will need it for your Python code later on
* Ensure billing is turned off
* Add the PhysioNet datasets so you can access them:
    * Click the '+ Add data' button (ensure you're in the 'Explorer' tab)
    * Click the 'Star a project by name' option
    * Type 'physionet-data'

You should now see the PhysioNet datasets and tables

# Database Queries (SQL)

* Database structures and query languages are a topic of their own
* For this course, all data manipulation will be in Python; however, we use a small amount of SQL to extract the initial data
* Structured Query Language (SQL) is widely used to work with databases
* Lets you select, filter, and join data from tables
* Great for exploring and cleaning data before analysis in Python
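
To make "select, filter, and join" concrete without needing BigQuery access, here is a minimal sketch using Python's built-in `sqlite3` module and an in-memory database. The two tiny tables reuse a few column names from the `patients` and `admissions` tables, but every row is invented for illustration, and SQLite's dialect differs from BigQuery's in places (for example, BigQuery wraps full table names in backticks).

```python
import sqlite3

# In-memory database with two toy tables; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE patients (subject_id INTEGER, gender TEXT, anchor_age INTEGER);
    CREATE TABLE admissions (hadm_id INTEGER, subject_id INTEGER, admission_type TEXT);
    INSERT INTO patients VALUES (1, 'F', 52), (2, 'M', 71), (3, 'F', 34);
    INSERT INTO admissions VALUES
        (101, 1, 'URGENT'), (102, 1, 'ELECTIVE'), (201, 2, 'URGENT');
    """
)

query = """
SELECT p.subject_id, p.gender, p.anchor_age, a.hadm_id, a.admission_type  -- select columns
FROM patients AS p
JOIN admissions AS a ON a.subject_id = p.subject_id                       -- join two tables
WHERE p.anchor_age >= 50                                                  -- filter rows
ORDER BY a.hadm_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

Patient 3 does not appear in the result: they fail the age filter and also have no matching row in `admissions`, so the inner join would drop them either way.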

## Example SQL

* Let's look at basic data for 100 patients (`patients` table; `LIMIT 100` caps the number of rows returned)
* Try running this in your BigQuery project
* Now try it for the hospital admissions (`admissions` table)

In [None]:
SELECT * FROM `physionet-data.mimiciv_3_1_hosp.patients` LIMIT 100

## From SQL to Python

* In this course, we will use the Python library `pandas_gbq` [[github link]](https://github.com/googleapis/python-bigquery-pandas)
* Try running the code below to check you have everything installed and linked

In [None]:
%pip install pandas_gbq --quiet
import pandas_gbq

# @markdown Enter your Google Cloud Project ID:
project_id = 'mimic-project-12345'  # @param {type:"string"}

# Test query to load data from BigQuery
df_patients = pandas_gbq.read_gbq("""
SELECT * FROM `physionet-data.mimiciv_3_1_hosp.patients` LIMIT 1000
""", project_id=project_id)

# Now print the first few rows of the dataframe
print(df_patients.head())
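
If you find yourself previewing several tables, one possible pattern is to wrap the query string in a small helper. The sketch below is only an illustration: `build_preview_query` and `load_preview` are hypothetical names introduced here (they are not part of `pandas_gbq`), and the default dataset is assumed to be the `physionet-data.mimiciv_3_1_hosp` dataset used in the query above.

```python
def build_preview_query(
    table: str,
    limit: int = 100,
    dataset: str = "physionet-data.mimiciv_3_1_hosp",
) -> str:
    """Return a SELECT * preview query for one table (hypothetical helper)."""
    # Only accept simple names so arbitrary text cannot end up in the SQL string.
    if not table.isidentifier():
        raise ValueError(f"Unexpected table name: {table!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    return f"SELECT * FROM `{dataset}.{table}` LIMIT {int(limit)}"


def load_preview(table: str, project_id: str, limit: int = 100):
    """Run the preview query on BigQuery and return a pandas DataFrame."""
    # Imported here so the query builder can be used without pandas_gbq installed.
    import pandas_gbq

    return pandas_gbq.read_gbq(build_preview_query(table, limit), project_id=project_id)


# Building the query needs no BigQuery access, so it is easy to inspect first.
print(build_preview_query("admissions", limit=5))
```

With your own project ID set, `df_admissions = load_preview("admissions", project_id)` would then run the query through `pandas_gbq.read_gbq` in the same way as the cell above.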

## ❓ Applied Challenge Exercise 1

This exercise will walk you through some deeper exploration of the MIMIC-IV clinical dataset using the skills you have learnt during the first section of the course.

[Colab notebook](https://colab.research.google.com/github/stuartbowyer/sdsai-lecture-notes/blob/main/Examples_Exercises/Challenge01.ipynb)

# Wrap Up
* First coursework is set today - check the due date on Blackboard
* Now that we can load, manipulate, and visualise clinical data, next week we will start looking at machine learning methods
* Ensure you are fully caught up on all exercises and additional reading
* Check the reading list!