<a href="https://colab.research.google.com/github/xborrat/NEFRoHack/blob/main/notebooks/summary_stats.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MIMIC-IV

# Summary statistics

This notebook shows how summary statistics can be computed for a patient cohort using the `tableone` package. Usage instructions for tableone are at: https://pypi.org/project/tableone/


## Load libraries and connect to the database

In [1]:
# Import libraries
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt

# Make pandas dataframes prettier
from IPython.display import display, HTML

# Access data using Google BigQuery.
from google.colab import auth
from google.cloud import bigquery

In [2]:
# authenticate
auth.authenticate_user()

In [3]:
# Set up environment variables
project_id='lcp-internal'
os.environ["GOOGLE_CLOUD_PROJECT"]=project_id

## Install and load the `tableone` package

The `tableone` package computes summary statistics for a patient cohort. Unlike the packages imported above, it is not installed by default in Colab, so we need to install it first.

In [4]:
!pip install tableone

Collecting tableone
  Downloading tableone-0.8.0-py3-none-any.whl (33 kB)
Installing collected packages: tableone
Successfully installed tableone-0.8.0


In [5]:
# Import the tableone class
from tableone import TableOne

## Load the patient cohort

In this example, we load hospital admissions from the `admissions` table and link them to the derived `kdigo_stages` table, keeping the rows with a creatinine-based AKI stage of 1. The query is limited to 1,000 rows.

In [8]:
%%bigquery cohort
-- Link the admissions and kdigo_stages tables on hadm_id.
-- The WHERE filter on k drops unmatched admissions, so the
-- LEFT JOIN behaves like an inner join.

WITH tmp1 AS (
  SELECT a.hadm_id, a.admittime, a.dischtime, a.deathtime,
    a.insurance, a.language, a.marital_status, a.race,
    a.hospital_expire_flag, k.aki_stage_creat, k.charttime AS kdigo_time,
    DENSE_RANK() OVER (PARTITION BY a.hadm_id ORDER BY k.charttime ASC) AS seq
  FROM `physionet-data.mimiciv_hosp.admissions` a
  LEFT JOIN `physionet-data.mimiciv_derived.kdigo_stages` k
  ON a.hadm_id = k.hadm_id
  WHERE k.aki_stage_creat = 1
  ORDER BY a.subject_id, a.hadm_id, a.admittime, k.charttime
  LIMIT 1000)
SELECT *
FROM tmp1;

In [9]:
cohort.head()


| | hadm_id | admittime | dischtime | deathtime | insurance | language | marital_status | race | hospital_expire_flag | aki_stage_creat | kdigo_time | seq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26184834 | 2131-01-07 20:39:00 | 2131-01-20 05:15:00 | 2131-01-20 05:15:00 | Medicare | ENGLISH | MARRIED | BLACK/AFRICAN AMERICAN | 1 | 1 | 2131-01-13 08:23:00 | 1 |
| 1 | 23822395 | 2129-08-04 12:44:00 | 2129-08-18 16:53:00 | NaT | Other | ENGLISH | MARRIED | WHITE | 0 | 1 | 2129-08-05 05:01:00 | 1 |
| 2 | 23822395 | 2129-08-04 12:44:00 | 2129-08-18 16:53:00 | NaT | Other | ENGLISH | MARRIED | WHITE | 0 | 1 | 2129-08-06 06:05:00 | 2 |
| 3 | 28662225 | 2156-04-12 14:16:00 | 2156-04-29 16:26:00 | NaT | Medicare | ENGLISH | WIDOWED | WHITE | 0 | 1 | 2156-04-14 04:38:00 | 1 |
| 4 | 28662225 | 2156-04-12 14:16:00 | 2156-04-29 16:26:00 | NaT | Medicare | ENGLISH | WIDOWED | WHITE | 0 | 1 | 2156-04-14 04:38:00 | 1 |
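
The `seq` column comes from `DENSE_RANK()` in the query: within each `hadm_id`, rows are ranked by `charttime`, and rows with identical timestamps share a rank. The following is a minimal pandas sketch of the same ranking, using only the `hadm_id` and `kdigo_time` values printed above. The variable names `rows`, `first_events`, and `one_per_admission` are illustrative and not part of the notebook.

```python
import pandas as pd

# Values copied from the cohort.head() output above
rows = pd.DataFrame({
    "hadm_id": [26184834, 23822395, 23822395, 28662225, 28662225],
    "kdigo_time": pd.to_datetime([
        "2131-01-13 08:23:00",
        "2129-08-05 05:01:00",
        "2129-08-06 06:05:00",
        "2156-04-14 04:38:00",
        "2156-04-14 04:38:00",
    ]),
})

# Dense rank of the chart time within each admission; ties share a rank
rows["seq"] = (
    rows.groupby("hadm_id")["kdigo_time"]
        .rank(method="dense")
        .astype(int)
)

# Keep the earliest stage 1 measurement(s) of each admission
first_events = rows[rows["seq"] == 1]

# Tied timestamps can leave several rows with seq == 1,
# so de-duplicate to get exactly one row per admission
one_per_admission = first_events.drop_duplicates(subset="hadm_id")
```

Filtering on `seq == 1` alone is not enough to get one row per admission, because tied timestamps (as in the last two rows) all receive rank 1.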


## Calculate summary statistics

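Before summarizing the data, we need to convert the ages to numerical values. Note that the cells in this section and the next use columns (`age`, `gender`, `apachescore`, and so on) that the cohort query above does not return, so the query must be extended or the column names adapted before they will run.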

In [None]:
cohort['agenum'] = pd.to_numeric(cohort['age'], errors='coerce')
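
To illustrate what `errors='coerce'` does in the cell above: entries that cannot be parsed as numbers become `NaN` instead of raising an error. A minimal sketch with made-up age strings (the values are hypothetical and are not taken from the database):

```python
import pandas as pd

# Hypothetical age strings; "> 89" and "unknown" stand in for
# entries that cannot be parsed as numbers
ages = pd.Series(["52", "67", "> 89", "unknown"])

# Unparseable entries are converted to NaN
agenum = pd.to_numeric(ages, errors="coerce")

# Count how many values were lost in the conversion
n_missing = int(agenum.isna().sum())
```

Checking `n_missing` after the conversion is a quick way to see how many values were silently turned into missing data.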

In [None]:
columns = ['unitadmitsource', 'gender', 'agenum', 'ethnicity',
          'admissionweight','unittype','unitstaytype',
          'acutephysiologyscore','apachescore']

In [None]:
table = TableOne(cohort, columns=columns, rename={'agenum': 'age'},
                 groupby='actualhospitalmortality',
                 label_suffix=True, limit=4, pval=False)

print(table.tabulate(tablefmt = "fancy_grid"))
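
As a rough illustration of what a grouped summary reports for a categorical variable, here is a pandas-only sketch using two columns that the cohort query does return, `insurance` and `hospital_expire_flag`. The rows and `hadm_id` values are made up, and this is not how `tableone` builds its table internally. Because the cohort can hold several rows per admission, the sketch de-duplicates on `hadm_id` first.

```python
import pandas as pd

# Made-up rows; the column names match those returned by the cohort query
demo = pd.DataFrame({
    "hadm_id": [1, 1, 2, 3, 4, 5],
    "insurance": ["Medicare", "Medicare", "Other", "Other", "Medicare", "Medicaid"],
    "hospital_expire_flag": [1, 1, 0, 0, 1, 0],
})

# One row per admission, so repeated measurements are not double counted
admissions = demo.drop_duplicates(subset="hadm_id")

# Number of admissions per insurance type in each outcome group
counts = pd.crosstab(admissions["insurance"], admissions["hospital_expire_flag"])

# The same table as column percentages (each outcome group sums to 100)
percents = pd.crosstab(
    admissions["insurance"],
    admissions["hospital_expire_flag"],
    normalize="columns",
) * 100
```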

## Questions

- Are the severity-of-illness measures higher in the survival group or the non-survival group?

## Visualizing the data

Plotting the distribution of each variable by group level via histograms, kernel density estimates and boxplots is a crucial component of data analysis pipelines. In many real-life scenarios, visualization is the only way to detect problematic variables. We'll review a couple of the variables.

In [None]:
# Plot distributions
cohort[['acutephysiologyscore','agenum']].dropna().plot.kde(figsize=[12,8])
plt.legend(['APS Score', 'Age (years)'])
plt.xlim([-30,250])