[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/shengpu1126/BDSI2019-ML/blob/master/1%20-%20Explore%20the%20Dataset.ipynb)

### Note: Extra setup steps for Colab users

- The cells below download the data and helper files.
- You should choose "Reset all runtimes before running" for first-time use to clear your temporary workspace. 
- If you already have the data/helpers in your workspace, you can deselect "Reset all runtimes before running" and skip these steps. 
- To save the notebook, use the menu "File > Save a copy in Drive...".

In [None]:
!pip install -U wget
!rm -rf data.zip data lib
!mkdir lib

In [None]:
import wget
wget.download('https://github.com/shengpu1126/BDSI2019-ML/raw/master/lib/config.yaml', 'lib/config.yaml')
wget.download('https://github.com/shengpu1126/BDSI2019-ML/raw/master/lib/helper.py', 'lib/helper.py')
wget.download('https://github.com/shengpu1126/BDSI2019-ML/raw/master/data.zip', 'data.zip')

import zipfile
with zipfile.ZipFile("data.zip","r") as zip_ref:
    zip_ref.extractall(".")

# 1 - Explore the Dataset

This dataset contains data for a total of **12,000 patient admissions**. For the tutorials, we will use data from the **first 2,500 patients** to explore several modeling and hyperparameter selection techniques. 

Each admission is identified by a RecordID, which is also the name of its csv file under `data/files/`. These csv files contain timestamped observations for several variables. Outcomes for each patient admission are listed in `data/labels.csv`, whose first column is RecordID. The second column, In-hospital mortality, contains binary labels: 1 means the patient died in the hospital, and −1 means the patient survived and was discharged. The third column, 30-day survival, also contains binary labels: 1 means the patient died within 30 days, and −1 means the patient survived past 30 days.
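
As a small illustration of the ±1 label encoding described above, the sketch below builds a toy stand-in for `data/labels.csv` and converts the labels to 0/1. The values are made up, and the column name is copied from the description above, so the exact header in the real file is an assumption.

```python
import pandas as pd

# Toy stand-in for data/labels.csv; values are made up for illustration only.
# The column name follows the prose description and may differ in the real file.
labels = pd.DataFrame({
    "RecordID": [1, 2, 3, 4],
    "In-hospital mortality": [-1, 1, -1, -1],
})

# Map the {-1, 1} encoding to {0, 1}, where 1 still means the patient died
labels["died"] = (labels["In-hospital mortality"] == 1).astype(int)

# With 0/1 labels, the mean is the fraction of positive outcomes
mortality_rate = labels["died"].mean()
print(mortality_rate)
```

Some libraries and metrics expect 0/1 labels while others accept ±1, so it is worth checking which encoding a given tool assumes.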

For each patient admission, you are given data representing the first 48 hours of the ICU admission, as a csv file with three columns: `Time`, `Variable` and `Value`. Each row in the csv file represents a single observation. Each observation has an associated timestamp indicating the time of the observation relative to the start of the ICU admission, in hours and minutes. For example, a timestamp of 35:19 means that the associated observation was made 35 hours and 19 minutes after the patient was admitted to the ICU.
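
The timestamp format can be turned into fractional hours with a few lines of Python. This is a minimal sketch; `to_hours` is a hypothetical helper name and is not part of `lib/helper.py`.

```python
def to_hours(timestamp: str) -> float:
    """Convert an 'HH:MM' timestamp into fractional hours since ICU admission."""
    hours, minutes = timestamp.split(":")
    return int(hours) + int(minutes) / 60


# 35 hours and 19 minutes is a little over 35.3 hours
print(to_hours("35:19"))
```

Working in fractional hours makes it straightforward to sort observations or use time as the x-axis of a plot.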

There are two main types of variables: time-invariant and time-varying. Their definitions are specified in `config.yaml`. Time-invariant variables are collected when the patient is admitted to the ICU, so their timestamps are set to 00:00 and they appear at the beginning of each patient’s record. Unknown values are explicitly encoded as −1. Time-varying variables (e.g., heart rate), on the other hand, might be measured once, many times, or not at all for a given patient. In addition to being time-invariant or time-varying, each variable is either numeric or categorical. See the documentation for more details regarding the meaning of each variable.
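
Because unknown time-invariant values are stored as −1, they would distort summary statistics if treated as real numbers. The sketch below shows one way to handle this on a toy record. The rows are made up, and the set `time_invariant` is a hypothetical stand-in for the definitions in `config.yaml`.

```python
import numpy as np
import pandas as pd

# Toy patient record in the Time / Variable / Value layout; values are made up
record = pd.DataFrame({
    "Time": ["00:00", "00:00", "00:07", "01:07"],
    "Variable": ["Age", "Gender", "HR", "HR"],
    "Value": [54.0, -1.0, 73.0, 77.0],
})

# Hypothetical set of time-invariant variable names (normally read from config.yaml)
time_invariant = {"Age", "Gender"}

# Select the time-invariant rows and mark unknown values (-1) as missing
static = record[record["Variable"].isin(time_invariant)].copy()
static["Value"] = static["Value"].replace(-1, np.nan)

# Count how many time-invariant values are unknown for this patient
n_unknown = int(static["Value"].isna().sum())
print(n_unknown)
```

Using `NaN` lets pandas skip the unknown entries automatically in operations such as `mean()`.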

## Tasks:
- Load the dataset
- Calculate mean heart rate of a patient
- Plot heart rate trajectory for a patient
- Calculate correlation between mean systolic and diastolic blood pressures across patients
- Summarize population characteristics

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from lib.helper import load_data

In [None]:
# Load the dataset
# `raw_data` is a dictionary mapping patient ID to the data associated with that patient
raw_data, df_labels = load_data()

In [None]:
# Extract patient IDs
IDs = sorted(raw_data.keys())

## Heart rate of the first patient

In [None]:
# Get the first patient's ID and their data
ID = IDs[0]
df = raw_data[ID]

In [None]:
print(ID)

In [None]:
df

In [None]:
# Extracting heart rate
df_HR = df[df['Variable'] == 'HR'].copy()

In [None]:
# Convert time to fractional hours
df_HR['Time'] = df_HR['Time'].apply(lambda s: int(s.split(':')[0]) + int(s.split(':')[1])/60)

In [None]:
## TODO:
# Calculate mean HR


In [None]:
## TODO:
# Plot HR trajectory


## Blood pressure measurements

In [None]:
## TODO:
# Extract blood pressures for all patients
# Calculate mean blood pressures for each patient


In [None]:
## TODO:
# Calculate correlation between mean systolic and diastolic blood pressures


## Population characteristics

In [None]:
# Gender: what's the percentage of females?


In [None]:
# Age: mean and interquartile range (IQR)?
# Hint: np.percentile(ages, [25, 50, 75]), where `ages` holds each patient's age


In [None]:
# ICUType: what is the fraction of patients in each ICU type?
from collections import Counter
