<a href="https://colab.research.google.com/github/supanat-tht/HDAT9910/blob/main/HDAT9910_main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HDAT 9910 Capstone Project

**Authors**: Supanat Thitipatarakorn

**zID**: z5383184

**Creation date**: 22 February 2024

**Purpose of this notebook**: To demonstrate data wrangling of MIMIC-III data and answer two research questions: 1) mortality prediction in the ICU and 2) the weekend effect in the ICU.

---
## Introduction


Patients admitted to hospitals are sent to the appropriate wards depending on their conditions. Patients with critical conditions in need of close monitoring and medical care will be admitted or transferred to intensive care units (ICUs). Because of the nature of their conditions, patients in ICUs often have a higher mortality rate than patients in regular wards.

MIMIC-III is a large database containing deidentified health-related data associated with over forty thousand patients who stayed in the intensive care units of the Beth Israel Deaconess Medical Center in Boston, Massachusetts, between 2001 and 2012. The data were gathered from the Philips CareVue Clinical Information System and iMDsoft MetaVision ICU. MIMIC-III is available to researchers worldwide for a range of data science tasks.

This notebook applies data science procedures to the MIMIC-III data to 1) build a predictive algorithm for ICU mortality based on data from the first 24 hours in the ICU and 2) investigate whether admission to the ICU on weekends increases the risk of ICU mortality.
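
The weekend question depends on classifying each ICU admission time as weekday or weekend. The sketch below shows one way to do that in Python. It assumes the admission time is available as an ISO-formatted timestamp string and that "weekend" means Saturday or Sunday; both are assumptions, and `is_weekend_admission` is a hypothetical helper rather than part of this notebook's pipeline.

```python
from datetime import datetime


def is_weekend_admission(intime: str) -> bool:
    """Return True if the timestamp falls on a Saturday or Sunday.

    Hypothetical helper: the timestamp format and the Saturday/Sunday
    definition of "weekend" are assumptions made for illustration.
    """
    # datetime.weekday(): Monday == 0, ..., Saturday == 5, Sunday == 6
    return datetime.fromisoformat(intime).weekday() >= 5


# Illustrative timestamps only (not MIMIC-III records)
print(is_weekend_admission("2024-02-24 13:45:00"))  # Saturday -> True
print(is_weekend_admission("2024-02-26 08:00:00"))  # Monday   -> False
```

MIMIC-III shifts dates into the future for deidentification, but the shift is documented as preserving the day of the week, so a flag of this kind should remain meaningful on the shifted timestamps.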

## Data Manipulation

### Setup

In [2]:
# Autosave every 60 seconds
%autosave 60

Autosaving every 60 seconds


In [3]:
# Check that the required libraries are installed; install any that are missing
import sys
import subprocess
import pkg_resources

required = {'numpy', 'pandas', 'plotnine', 'matplotlib', 'seaborn',
            'grid', 'shap', 'scikit-learn'}
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed

if missing:
    print('Installing: ', missing)
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)
# Delete unwanted variables
del required
del installed
del missing

# Load the rpy2 package to use R alongside Python
%load_ext rpy2.ipython

Installing:  {'shap', 'grid'}
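
`pkg_resources` is deprecated in recent setuptools releases, so the same check can also be sketched with the standard-library `importlib.metadata` module. This is a minimal alternative under that assumption, not the cell that produced the output above; `missing_packages` and the placeholder package name are hypothetical.

```python
from importlib import metadata


def missing_packages(required):
    """Return the distribution names in `required` that are not installed."""
    missing = set()
    for name in required:
        try:
            metadata.version(name)  # raises if the distribution is absent
        except metadata.PackageNotFoundError:
            missing.add(name)
    return missing


# The second name is a made-up placeholder that should never be installed
print(missing_packages({"pip", "definitely-not-a-real-package-xyz"}))
```

The returned set can be passed to `pip install` through `subprocess.check_call`, exactly as in the cell above.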


In [4]:
# Mount Google Drive

if 'google.colab' in str(get_ipython()):
    from google.colab import drive  # import drive from Google Colab
    root = '/content/drive'         # default mount point for the drive
    # print(root)                   # print the mount point (optional)
    drive.mount(root)
else:
    print('Not running on CoLab')

Mounted at /content/drive


In [5]:
# Assign the project path

from pathlib import Path

if 'google.colab' in str(get_ipython()):
    project_path = Path(root) / 'MyDrive' / 'HDAT9910'
else:
    project_path = Path()

---
### Data cleaning and exploration

In [6]:
%%R

# Load libraries
library(dplyr)
library(ggplot2)
library(tidyr)
library(purrr)
install.packages("Hmisc")


# Set working directory
setwd('/content/drive/MyDrive/HDAT9910/')

Attaching package: ‘dplyr’

    filter, lag

    intersect, setdiff, setequal, union

https://r-graphics.org

(as ‘lib’ is unspecified)

	‘/tmp/RtmpksM1AC/downloaded_packages’


In [7]:
%%R

# Load each CSV file and assign it to a variable named after the file
file_list <- c("pt_icu_outcome")
for (name in file_list) {
    file_name <- paste0("mimic_data/", name, ".csv")
    read_file <- read.csv(file_name, na.strings = c(""))  # treat empty strings as NA
    assign(name, read_file)
}
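
For comparison, the same loading pattern can be sketched on the Python side with only the standard library. This is an illustration, not part of the notebook's pipeline: `load_tables` is a hypothetical helper, and it stores each table in a dictionary keyed by file name instead of creating variables dynamically.

```python
import csv
from pathlib import Path


def load_tables(data_dir, names):
    """Read <data_dir>/<name>.csv for each name into a list of row dicts.

    Empty strings are converted to None, mirroring na.strings = c("") in R.
    """
    tables = {}
    for name in names:
        with open(Path(data_dir) / f"{name}.csv", newline="") as handle:
            tables[name] = [
                {col: (val if val != "" else None) for col, val in row.items()}
                for row in csv.DictReader(handle)
            ]
    return tables


# Usage (the path follows the R cell above):
# tables = load_tables("mimic_data", ["pt_icu_outcome"])
```

Keeping the tables in one dictionary makes it easy to loop over them later, at the cost of writing `tables["pt_icu_outcome"]` instead of a bare variable name.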

#### `pt_icu_outcome` data

In [8]:
%%R

# Overview of the file
Hmisc::describe(pt_icu_outcome)

pt_icu_outcome 

 17  Variables      61533  Observations
--------------------------------------------------------------------------------
row_id 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
   61533        0    61532        1    30766    20511     3078     6154 
     .25      .50      .75      .90      .95 
   15383    30766    46149    55379    58455 

lowest :     1     2     3     4     5, highest: 61528 61529 61530 61531 61532
--------------------------------------------------------------------------------
subject_id 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
   61533        0    46476        1    33888    30698     2430     4864 
     .25      .50      .75      .90      .95 
   12046    24280    54191    81491    90901 

lowest :     2     3     4     5     6, highest: 99985 99991 99992 99995 99999
--------------------------------------------------------------------------------
dob 
       n  missing distinct 
   61533   

There are 61,533 ICU stays for 46,476 distinct patients because some patients had more than one ICU stay. The ICU stay ID should be unique to each row; however, one ID is duplicated (`icustay_id` count = 61,533; distinct count = 61,532). Ages of patients older than 89 years have been shifted to 91.4, as indicated [here](https://mimic.mit.edu/docs/iii/tables/patients/). The `icu_expire_flag` variable shows that 289 patients died during their ICU stays.

In [9]:
%%R

# Flag duplicated `icustay_id` values (TRUE marks a repeated ID)
duplicated(pt_icu_outcome$icustay_id)

Streaming output truncated to the last 5000 lines.
 [1549] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [1561] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [1573] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [1585] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [1597] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [1609] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [1621] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [1633] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [1645] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [1657] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [1669] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [1681] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
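
Printing the full logical vector makes the single duplicate hard to spot. In R, indexing with the vector (for example `pt_icu_outcome$icustay_id[duplicated(pt_icu_outcome$icustay_id)]`) would return only the repeated ID. The Python sketch below shows the same idea with `collections.Counter`; the IDs are toy values for illustration, not MIMIC-III data, and `duplicated_values` is a hypothetical helper.

```python
from collections import Counter


def duplicated_values(values):
    """Return each value that occurs more than once, with its count."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n > 1}


# Toy IDs for illustration only (not MIMIC-III data)
icustay_ids = [200001, 200003, 200006, 200003, 200007]
print(duplicated_values(icustay_ids))  # {200003: 2}
```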

In [8]:
%%R

# Convert the first date of birth to a Date object
as.Date(patients$dob[1])

[1] "2075-03-13"


In [10]:
%%R

# Create a new variable indicating whether the date of death (`dod`) is missing
patients$dod_missing <- ifelse(is.na(patients$dod), "missing", "not missing")

# Cross-tabulate `dod` missingness against `expire_flag`
table(patients$dod_missing, patients$expire_flag)

             
                  0     1
  missing     30761     0
  not missing     0 15759
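
The table shows that `dod` is missing exactly when `expire_flag` is 0, so the two variables carry the same information. A minimal Python sketch of the same consistency check follows; the records are made-up examples rather than MIMIC-III rows, and `crosstab_dod` is a hypothetical helper.

```python
from collections import Counter


def crosstab_dod(records):
    """Count records by (dod missingness, expire_flag) pairs."""
    return Counter(
        ("missing" if rec["dod"] is None else "not missing", rec["expire_flag"])
        for rec in records
    )


# Made-up records for illustration only
records = [
    {"dod": None, "expire_flag": 0},
    {"dod": "2101-10-20", "expire_flag": 1},
    {"dod": None, "expire_flag": 0},
]
table = crosstab_dod(records)

# The variables agree when both off-diagonal cells are empty
consistent = table[("missing", 1)] == 0 and table[("not missing", 0)] == 0
print(table[("missing", 0)], table[("not missing", 1)], consistent)  # 2 1 True
```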
