<a href="https://colab.research.google.com/github/supanat-tht/HDAT9910/blob/main/HDAT9910_main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HDAT 9910 Capstone Project

**Authors**: Supanat Thitipatarakorn

**zID**: z5383184

**Creation date**: 22 February 2024

**Purpose of this notebook**: To demonstrate data wrangling of MIMIC-III data and answer two research questions: 1) mortality prediction in the ICU and 2) the weekend effect in the ICU.

---
## Introduction


Patients admitted to hospitals are sent to the appropriate wards depending on their conditions. Patients with critical conditions in need of close monitoring and medical care will be admitted or transferred to intensive care units (ICUs). Because of the nature of their conditions, patients in ICUs often have a higher mortality rate than patients in regular wards.

MIMIC-III is a large database containing deidentified health-related data for over forty thousand patients who stayed in the intensive care units of the Beth Israel Deaconess Medical Center in Boston, Massachusetts, between 2001 and 2012. The data were gathered from the Philips CareVue Clinical Information System and iMDsoft MetaVision ICU. MIMIC-III is available to researchers worldwide for study, including a range of data science tasks.

This notebook applies data science procedures to the MIMIC-III data to 1) build a predictive model of ICU mortality based on data from the first 24 hours in the ICU and 2) investigate whether admission to the ICU on weekends increases the risk of ICU mortality.
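To make the two aims concrete, here is a minimal Python sketch of the two time-based definitions they rely on. It assumes that "weekend" means Saturday or Sunday and that the 24-hour window starts at ICU admission; the function names and the timestamps are hypothetical and are not taken from the dataset. As far as I understand the MIMIC-III documentation, the date shifting preserves the day of the week, which is what makes a weekend flag meaningful on shifted dates.

```python
from datetime import datetime, timedelta


def is_weekend_admission(intime: datetime) -> bool:
    """Return True when the ICU admission time falls on a Saturday or Sunday."""
    # datetime.weekday() numbers Monday as 0, so Saturday is 5 and Sunday is 6.
    return intime.weekday() >= 5


def within_first_24_hours(intime: datetime, charttime: datetime) -> bool:
    """Return True when a measurement was charted in the first 24 hours of the stay."""
    elapsed = charttime - intime
    return timedelta(0) <= elapsed < timedelta(hours=24)


# Hypothetical timestamps, for illustration only.
admitted = datetime(2024, 2, 24, 8, 0)   # a Saturday
charted = datetime(2024, 2, 25, 7, 59)   # just under 24 hours later
print(is_weekend_admission(admitted), within_first_24_hours(admitted, charted))
```

Other definitions of the weekend effect exist (for example, counting Friday night admissions), so the cut-off used here is only one possible choice.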

## Data Manipulation

### Setup

In [1]:
# Autosave every 60 seconds
%autosave 60

Autosaving every 60 seconds


In [2]:
# Check that the required libraries are installed; if any are missing, call pip to install them
import sys
import subprocess
import pkg_resources

required = {'numpy', 'pandas', 'plotnine', 'matplotlib', 'seaborn',
            'grid', 'shap', 'scikit-learn'}
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed

if missing:
    print('Installing: ', missing)
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)
# Delete unwanted variables
del required
del installed
del missing

# Load the rpy2 package to use R alongside Python
%load_ext rpy2.ipython

Installing:  {'shap', 'grid'}
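Since `pkg_resources` is deprecated in recent versions of setuptools, the same check could be written with the standard-library `importlib.metadata` module. The following is a minimal sketch rather than a drop-in replacement: the helper name `find_missing` is hypothetical, and the lookup uses distribution names (for example `scikit-learn`), which can differ from import names (for example `sklearn`).

```python
from importlib import metadata


def find_missing(distributions):
    """Return the subset of distribution names that are not installed."""
    missing = set()
    for name in distributions:
        try:
            # metadata.version() raises PackageNotFoundError for unknown distributions.
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.add(name)
    return missing


required = {'numpy', 'pandas', 'plotnine', 'matplotlib', 'seaborn',
            'grid', 'shap', 'scikit-learn'}
print('Missing:', find_missing(required))
```

The returned set can then be passed to `pip install` through `subprocess.check_call`, as in the cell above.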


In [3]:
# Mount Google Drive

if 'google.colab' in str(get_ipython()):
    from google.colab import drive # import drive from Google Colab
    root = '/content/drive'     # default location for the drive
    # print(root)                 # print content of ROOT (Optional)
    drive.mount(root)
else:
    print('Not running on CoLab')

Mounted at /content/drive


In [4]:
# Assign the project path

from pathlib import Path

if 'google.colab' in str(get_ipython()):
    project_path = Path(root) / 'MyDrive' / 'HDAT9910'
else:
    project_path = Path()

---
### Visualize the raw data

In [None]:
%%R

# Load libraries
library(dplyr)
library(ggplot2)
library(tidyr)
library(purrr)
# Install Hmisc, which is called below as Hmisc::describe()
install.packages("Hmisc")


# Set working directory
setwd('/content/drive/MyDrive/HDAT9910/mimic_data/')

In [6]:
%%R

# Load all files and assign file names to variable names
file_list <- c("patients", "pt_icu_outcome")
for (i in seq_along(file_list)){
    file_name <- paste0(file_list[i], ".csv")
    read_file <- read.csv(file_name, na.strings=c(""))
    assign(file_list[i], read_file)
}

In [7]:
%%R

# Overview of the file
Hmisc::describe(patients)

patients 

 8  Variables      46520  Observations
--------------------------------------------------------------------------------
row_id 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
   46520        0    46520        1    23260    15507     2327     4653 
     .25      .50      .75      .90      .95 
   11631    23260    34890    41868    44194 

lowest :     1     2     3     4     5, highest: 46516 46517 46518 46519 46520
--------------------------------------------------------------------------------
subject_id 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
   46520        0    46520        1    34426    31040     2458     4910 
     .25      .50      .75      .90      .95 
   12287    24650    55478    82100    91062 

lowest :     2     3     4     5     6, highest: 99985 99991 99992 99995 99999
--------------------------------------------------------------------------------
gender 
       n  missing distinct 
   46520       

There are 46,520 distinct subject IDs, matching the number given in the [documentation](https://mimic.mit.edu/docs/iii/tables/patients/). Dates of birth in the 1800s appear to be a result of the date shifting applied to patients older than 89 years. The date of death is missing for 30,761 patients, meaning that no death had been recorded for them by the end of data collection. I will use the `dod` column as the main source of date of death, as it combines `dod_hosp` and `dod_ssn`.

As indicated [here](https://physionet.org/content/mimiciii/1.4/), all dates in the dataset were shifted, so hospital stays occur between the years 2100 and 2200. However, the maximum dates of birth and death appear to be later than 2200, which needs to be explored.
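One way to explore these shifted dates is to compute each patient's age at admission and flag implausible values. The sketch below is a minimal Python illustration under stated assumptions: the function names, the 120-year plausibility threshold, and the example dates are all hypothetical. It uses the standard-library `date` type, which covers years 1 to 9999, so dates beyond 2200 do not overflow.

```python
from datetime import date


def age_in_years(dob: date, at: date) -> int:
    """Whole years elapsed between a date of birth and a reference date."""
    # Subtract one year when the birthday has not yet occurred in the reference year.
    before_birthday = (at.month, at.day) < (dob.month, dob.day)
    return at.year - dob.year - int(before_birthday)


def looks_age_shifted(dob: date, admitted: date, max_plausible_age: int = 120) -> bool:
    """Flag a date of birth that gives an implausible age at admission.

    The 120-year threshold is an assumption chosen for illustration.
    """
    return age_in_years(dob, admitted) > max_plausible_age


# Hypothetical dates, for illustration only.
print(age_in_years(date(2075, 3, 13), date(2140, 3, 12)))        # birthday not yet reached
print(looks_age_shifted(date(1844, 7, 18), date(2144, 7, 18)))   # date of birth in the 1800s
```

Patients flagged this way would need their age handled separately, since the true age cannot be recovered from a shifted date of birth.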

In [8]:
%%R

as.Date(patients$dob[1])

[1] "2075-03-13"


In [10]:
%%R

# Create a new variable indicating whether dod is missing
patients$dod_missing <- ifelse(is.na(patients$dod), "missing", "not missing")

# Cross-tabulate dod missingness against expire_flag
table(patients$dod_missing, patients$expire_flag)

             
                  0     1
  missing     30761     0
  not missing     0 15759
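The table shows that `expire_flag` is 1 exactly when `dod` is present. The same consistency check can be sketched in Python with the standard library, assuming each row is a dictionary with `dod` and `expire_flag` keys and that a missing date is stored as an empty string or `None`. The helper names and the example rows are hypothetical.

```python
from collections import Counter


def crosstab_dod_vs_flag(rows):
    """Count rows by (dod missingness, expire_flag)."""
    counts = Counter()
    for row in rows:
        status = "missing" if row["dod"] in ("", None) else "not missing"
        counts[(status, int(row["expire_flag"]))] += 1
    return counts


def is_consistent(counts):
    """True when no row has a missing dod with flag 1, or a recorded dod with flag 0."""
    return counts[("missing", 1)] == 0 and counts[("not missing", 0)] == 0


# Hypothetical rows, for illustration only.
rows = [
    {"dod": "", "expire_flag": "0"},
    {"dod": "2155-02-03", "expire_flag": "1"},
    {"dod": None, "expire_flag": "0"},
]
counts = crosstab_dod_vs_flag(rows)
print(dict(counts), is_consistent(counts))
```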
