<a href="https://colab.research.google.com/github/xborrat/NEFRoHack/blob/main/notebooks/intro-to-sql-mimic-iv.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SQL Exercises

The general learning objectives of these exercises are for you to:

- understand the structure of MIMIC, and where to find more info
- understand derived concepts already created for MIMIC

Our first few cells will be a bit of setup.

In [None]:
# Import libraries
from datetime import timedelta
import os
from pathlib import Path
import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Display helpers for rendering dataframes, HTML, and images in the notebook
from IPython.display import display, HTML, Image
%matplotlib inline

plt.style.use('ggplot')
plt.rcParams.update({'font.size': 20})

# Access data using Google BigQuery.
from google.colab import auth
from google.cloud import bigquery

# Authenticate this Colab session so that BigQuery queries can run.
auth.authenticate_user()

In [None]:
# Set up environment variables: set project_id to your own GCP project
project_id = 'lcp-internal'
if project_id == 'CHANGE-ME':
  raise ValueError('You must change project_id to your GCP project.')
os.environ["GOOGLE_CLOUD_PROJECT"] = project_id

# Read data from BigQuery into pandas dataframes.
def run_query(query, project_id=project_id):
  return pd.io.gbq.read_gbq(
      query,
      project_id=project_id,
      dialect='standard')

# test it works
df = run_query("SELECT * FROM physionet-data.mimiciv_hosp.patients LIMIT 5")
df.head()

If the above raises an error, double-check that you have set your project correctly, and ensure that you have [requested access to MIMIC-IV on Google BigQuery via the PhysioNet project page](https://mimic-iv.mit.edu/docs/access/cloud/#accessing-mimic-iv-on-the-cloud).

## Exercise questions

We will be examining data for a single individual `subject_id`.


In [None]:
subject_id = 10000032

You will see f-string syntax frequently, as it is used to insert the above `subject_id` into queries. Here's an example:

In [None]:
query = f"Curly brackets are used to insert variables: {subject_id}."
print(query)

You can [read more about f-strings here](https://docs.python.org/3/tutorial/inputoutput.html).
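
The queries in this notebook span several lines, so their f-strings are written with triple quotes. The sketch below builds such a query string without sending it to BigQuery. The table and column names are the ones used in the queries later in this notebook; `example_itemid` is a hypothetical variable introduced only for this illustration.

```python
# Build a multi-line SQL string with an f-string (nothing is sent to BigQuery here).
subject_id = 10000032
example_itemid = 220045  # hypothetical variable, used only for this illustration

query = f"""
SELECT ce.charttime, ce.valuenum
FROM physionet-data.mimiciv_icu.chartevents ce
WHERE ce.subject_id = {subject_id}
AND ce.itemid = {example_itemid}
"""

# Printing the string is a quick way to check the SQL before running it.
print(query)
```

If a query ever needs a literal curly bracket, double it (`{{` or `}}`) so that Python does not treat it as a placeholder.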

# Questions

Question 1: Run the query below. How many rows are returned?

In [None]:
df = run_query(f"""
SELECT
    ce.subject_id
  , ce.stay_id
  , ce.charttime
  , ce.valuenum
FROM physionet-data.mimiciv_icu.chartevents ce
WHERE ce.subject_id = {subject_id}
AND ce.itemid = 220045
""")
display(df)

Answer 1:

Question 2: Write a query which counts the number of rows.


In [None]:
# Answer 2
df = run_query(f"""

""")
display(df)

Question 3: Write a query which calculates the average of the `valuenum` column.

In [None]:
# Answer 3
df = run_query(f"""

""")
display(df)

Question 4: Rewrite the above query to use the `value` column. It won't run. What is the error message you receive? What is the reason why this query won't run?

In [None]:
# Answer 4 - rewrite the query and demonstrate the error
df = run_query(f"""

""")
display(df)

Answer 4 (explain the error):

Question 5: Write a query which identifies the *label* for the given `itemid`. What is the *label*?

In [None]:
# Answer 5
df = run_query(f"""

""")
display(df)

## Vital signs

Question 6: Run the query below. What is `ce`? Where is it defined? Why is it used?

In [None]:
df = run_query(f"""
SELECT
  ce.subject_id
  , ce.charttime
  , ROUND(
      AVG(
        CASE WHEN itemid IN (223761) AND valuenum > 70 AND valuenum < 120 THEN (valuenum-32)/1.8
             WHEN itemid IN (223762) AND valuenum > 10 AND valuenum < 50  THEN valuenum
        ELSE NULL END
      )
    , 2) AS merged_value
  , MAX(CASE WHEN itemid = 224642 THEN value ELSE NULL END) AS merged_site
FROM physionet-data.mimiciv_icu.chartevents ce
WHERE ce.subject_id = {subject_id}
AND ce.itemid IN (223761, 223762, 224642)
GROUP BY ce.subject_id, ce.charttime
""")
display(df)
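
The pattern in the query above, a `CASE` expression placed inside an aggregate function, can be tried out on toy data without BigQuery access. The following is a minimal sketch that uses Python's built-in `sqlite3` module rather than BigQuery; the table `toy_events`, the itemids 1 and 2, and all values are made up for illustration and do not come from MIMIC.

```python
import sqlite3

# Toy in-memory table; itemids 1 and 2 and all values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_events (charttime TEXT, itemid INTEGER, valuenum REAL)")
conn.executemany(
    "INSERT INTO toy_events VALUES (?, ?, ?)",
    [
        ("2020-01-01 08:00", 1, 98.6),   # itemid 1: temperature in Fahrenheit
        ("2020-01-01 08:00", 2, 37.0),   # itemid 2: temperature in Celsius
        ("2020-01-01 12:00", 1, 500.0),  # outside the accepted range, so CASE yields NULL
        ("2020-01-01 12:00", 2, 38.0),
    ],
)

# One output row per charttime: each value is converted or discarded by the CASE,
# then the surviving values are averaged.
rows = conn.execute("""
SELECT
    charttime
  , ROUND(AVG(
      CASE WHEN itemid = 1 AND valuenum > 70 AND valuenum < 120 THEN (valuenum - 32) / 1.8
           WHEN itemid = 2 AND valuenum > 10 AND valuenum < 50 THEN valuenum
      ELSE NULL END
    ), 2) AS temp_c
FROM toy_events
GROUP BY charttime
ORDER BY charttime
""").fetchall()
conn.close()

print(rows)
```

Aggregate functions such as `AVG` skip NULL values, which is why returning NULL from the `ELSE` branch removes a row from the average instead of distorting it.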

Answer 6:


Question 7: Explain what the CASE statement in the above query is doing.

Answer 7:


Question 8: Copy the above query into the next cell and remove the `charttime` column from the SELECT and GROUP BY statements. Re-run the query. How many rows are returned?


In [None]:
df = run_query(f"""

""")
display(df)


Answer 8:

Question 9: Did we end up with a different number of rows in Question 8 versus when we originally ran the query (Question 6)? Why or why not?

Answer 9:

Question 10 (bonus): What would change if we used `MIN()` instead of `MAX()` in the above query? Explain your reasoning.

Answer 10:

## Extracting Height

The following query extracts the height of our subject as documented in *chartevents*.

In [None]:
df = run_query(f"""
-- prep height
WITH ht_in AS
(
  SELECT
    c.subject_id, c.stay_id, c.charttime
    -- Convert inches to centimeters so that all heights share one unit
    , ROUND(c.valuenum * 2.54, 2) AS height
    , c.valuenum AS height_orig
  FROM physionet-data.mimiciv_icu.chartevents c
  WHERE c.valuenum IS NOT NULL
  -- Height (measured in inches)
  AND c.itemid = 226707
  AND c.subject_id = {subject_id}
)
, ht_cm AS
(
  SELECT
    c.subject_id, c.stay_id, c.charttime
    -- Already in centimeters, so no conversion is needed
    , ROUND(c.valuenum, 2) AS height
  FROM physionet-data.mimiciv_icu.chartevents c
  WHERE c.valuenum IS NOT NULL
  -- Height cm
  AND c.itemid = 226730
  AND c.subject_id = {subject_id}
)
-- merge cm/height, only take 1 value per charted row
, ht_stg0 AS
(
  SELECT
    COALESCE(h1.subject_id, h2.subject_id) AS subject_id
    , COALESCE(h1.stay_id, h2.stay_id) AS stay_id
    , COALESCE(h1.charttime, h2.charttime) AS charttime
    , COALESCE(h1.height, h2.height) AS height
  FROM ht_cm h1
  FULL OUTER JOIN ht_in h2
    ON h1.subject_id = h2.subject_id
    AND h1.charttime = h2.charttime
)
SELECT subject_id, stay_id, charttime, height
FROM ht_stg0
WHERE height IS NOT NULL
-- filter out bad heights
AND height > 120 AND height < 230;
""")
display(df)
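
After a `FULL OUTER JOIN`, a row that matched on only one side carries NULLs in the columns of the other side. The sketch below mimics such a joined result with a single toy table, since older SQLite versions do not support `FULL OUTER JOIN`. It uses Python's built-in `sqlite3` module rather than BigQuery, and the table `toy_heights`, its column names, and its values are assumptions made up for illustration.

```python
import sqlite3

# Toy table standing in for the result of a full outer join; all values are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_heights (charttime TEXT, height_cm REAL, height_in_as_cm REAL)"
)
conn.executemany(
    "INSERT INTO toy_heights VALUES (?, ?, ?)",
    [
        ("2020-01-01 08:00", 170.0, None),    # only the centimeter measurement exists
        ("2020-01-02 08:00", None, 172.72),   # only the converted inch measurement exists
        ("2020-01-03 08:00", 171.0, 170.18),  # both exist
    ],
)

# COALESCE returns its first argument that is not NULL.
rows = conn.execute("""
SELECT charttime, COALESCE(height_cm, height_in_as_cm) AS height
FROM toy_heights
ORDER BY charttime
""").fetchall()
conn.close()

print(rows)
```

Comparing the printed rows with the inserted values shows which column supplies the height in each of the three cases.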

Question 11: What is the purpose of the `COALESCE` statements?

Answer 11:

## Writing SQL

Question 12: Write a query which lists all hospital admissions for the patient; specifically their `hadm_id`, `admittime`, and `dischtime`.


In [None]:
df = run_query(f"""

""")
display(df)


Question 13: Write a query which extracts the lowest and highest heart rate for the given `subject_id`.


In [None]:
df = run_query(f"""

""")
display(df)


Question 14: Write a query to extract the *first* heart rate for the given `subject_id`.


In [None]:
df = run_query(f"""

""")
display(df)


Question 15: Write a query which returns all the INR values for the given `subject_id`. INR is a lab value routinely measured for critically ill patients.


In [None]:
df = run_query(f"""

""")
display(df)


Question 16: Write a query to extract all the medications prescribed to the given `subject_id`, and a count of how many times they were prescribed.

In [None]:
df = run_query(f"""

""")
display(df)