## **MIMIC-III**: Pre-processing

This notebook contains the pre-processing of the [MIMIC-III](https://physionet.org/content/mimiciii/1.4/) dataset. 

In [2]:
from data import *
from utils import *
import os
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image
import pandas as pd

from data import *

MIMIC_PATH = '../data/mimic-iii/'

## 1. **Tabular data**

[MIMIC-III Documentation](https://mimic.mit.edu/docs/iii/tables/). 

>**MIMIC-III Data Description**
>
>MIMIC-III is a relational database consisting of 26 tables. Tables are linked by identifiers which usually have the suffix ‘ID’. For example, SUBJECT_ID refers to a unique patient, HADM_ID refers to a unique admission to the hospital, and ICUSTAY_ID refers to a unique admission to an intensive care unit.
>
>Charted events such as notes, laboratory tests, and fluid balance are stored in a series of ‘events’ tables. For example the OUTPUTEVENTS table contains all measurements related to output for a given patient, while the LABEVENTS table contains laboratory test results for a patient.
>
>Tables prefixed with ‘D_’ are dictionary tables and provide definitions for identifiers. For example, every row of CHARTEVENTS is associated with a single ITEMID which represents the concept measured, but it does not contain the actual name of the measurement. By joining CHARTEVENTS and D_ITEMS on ITEMID, it is possible to identify the concept represented by a given ITEMID.
>
>Developing the MIMIC data model involved balancing simplicity of interpretation against closeness to ground truth. As such, the model is a reflection of underlying data sources, modified over iterations of the MIMIC database in response to user feedback. Care has been taken to avoid making assumptions about the underlying data when carrying out transformations, so MIMIC-III closely represents the raw hospital data.
>
>Broadly speaking, five tables are used to define and track patient stays: ADMISSIONS; PATIENTS; ICUSTAYS; SERVICES; and TRANSFERS. Another five tables are dictionaries for cross-referencing codes against their respective definitions: D_CPT; D_ICD_DIAGNOSES; D_ICD_PROCEDURES; D_ITEMS; and D_LABITEMS. The remaining tables contain data associated with patient care, such as physiological measurements, caregiver observations, and billing information.
>
>In some cases it would be possible to merge tables—for example, the D_ICD_PROCEDURES and CPTEVENTS tables both contain detail relating to procedures and could be combined—but our approach is to keep the tables independent for clarity, since the data sources are significantly different. Rather than combining the tables within MIMIC data model, we suggest researchers develop database views and transforms as appropriate.



We wish to convert these 26 tables into a single table, where each row represents a single patient's features. This will be done by joining the tables on the `SUBJECT_ID` column and adding features such as age, mortality, etc.

The main 5 tables contain the following columns:

| Table | Columns |
| --- | --- |
| `ADMISSIONS` | **ROW_ID**, **SUBJECT_ID**, **HADM_ID**, **ADMITTIME**, **DISCHTIME**, **DEATHTIME**, **ADMISSION_TYPE**, **ADMISSION_LOCATION**, **DISCHARGE_LOCATION**, **INSURANCE**, **LANGUAGE**, **RELIGION**, **MARITAL_STATUS**, **ETHNICITY**, **EDREGTIME**, **EDOUTTIME**, **DIAGNOSIS**, **HOSPITAL_EXPIRE_FLAG**, **HAS_CHARTEVENTS_DATA** |
| `PATIENTS` | ROW_ID, **SUBJECT_ID**, **GENDER**, **DOB**, **DOD**, DOD_HOSP, DOD_SSN, EXPIRE_FLAG |
| `ICUSTAYS` | ROW_ID, **SUBJECT_ID**, **HADM_ID**, **ICUSTAY_ID**, **DBSOURCE**, FIRST_CAREUNIT, LAST_CAREUNIT, FIRST_WARDID, LAST_WARDID, **INTIME**, **OUTTIME**, **LOS** |
| `SERVICES` | ROW_ID, **SUBJECT_ID**, **HADM_ID**, TRANSFERTIME, PREV_SERVICE, CURR_SERVICE |
| `TRANSFERS` | ROW_ID, **SUBJECT_ID**, **HADM_ID**, **ICUSTAY_ID**, **DBSOURCE**, EVENTTYPE, PREV_CAREUNIT, CURR_CAREUNIT, PREV_WARDID, CURR_WARDID, INTIME, OUTTIME, LOS |

[This repo](https://github.com/bmmalone/mimic-preprocessing) pre-processes MIMIC by keeping certain features. All features in bold above were kept. The following features are missing from the main 5 tables: 
- EPISODE = (ADMITTIME - DISCHTIME) ?
- EPISODE_NAME = 
- stay
- AGE = (ADMITTIME - DOB)
- MORTALITY_INUNIT = 
- MORTALITY = 1 if DOD is not null, 0 otherwise
- MORTALITY_INHOSPITAL = 1 if DOD_HOSP is not null, 0 otherwise
- HEIGHT
- WEIGHT

In [20]:
# Load admissions, patients, and icustays
admissions = pd.read_csv(MIMIC_PATH+'ADMISSIONS.csv')
patients = pd.read_csv(MIMIC_PATH+'PATIENTS.csv')
icustays = pd.read_csv(MIMIC_PATH+'ICUSTAYS.csv')

# Merge patients and admissions
patients = patients[['SUBJECT_ID', 'DOB', 'DOD']]
table = admissions.join(patients.set_index('SUBJECT_ID'), on='SUBJECT_ID', how='left', lsuffix='_adm', rsuffix='_pat')

# Merge table with ICU stays on subject_id and hadm_id
icustays = icustays[['SUBJECT_ID', 'HADM_ID', 'ICUSTAY_ID', 'DBSOURCE', 'INTIME', 'OUTTIME', 'LOS']]
table = table.join(icustays.set_index(['SUBJECT_ID', 'HADM_ID']), on=['SUBJECT_ID', 'HADM_ID'], how='left', rsuffix='_icu')

table



Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,ADMITTIME,DISCHTIME,DEATHTIME,ADMISSION_TYPE,ADMISSION_LOCATION,DISCHARGE_LOCATION,INSURANCE,...,DIAGNOSIS,HOSPITAL_EXPIRE_FLAG,HAS_CHARTEVENTS_DATA,DOB,DOD,ICUSTAY_ID,DBSOURCE,INTIME,OUTTIME,LOS
0,21,22,165315,2196-04-09 12:26:00,2196-04-10 15:54:00,,EMERGENCY,EMERGENCY ROOM ADMIT,DISC-TRAN CANCER/CHLDRN H,Private,...,BENZODIAZEPINE OVERDOSE,0,1,2131-05-07 00:00:00,,204798.0,carevue,2196-04-09 12:27:00,2196-04-10 15:54:00,1.1438
1,22,23,152223,2153-09-03 07:15:00,2153-09-08 19:10:00,,ELECTIVE,PHYS REFERRAL/NORMAL DELI,HOME HEALTH CARE,Medicare,...,CORONARY ARTERY DISEASE\CORONARY ARTERY BYPASS...,0,1,2082-07-17 00:00:00,,227807.0,carevue,2153-09-03 09:38:55,2153-09-04 15:59:11,1.2641
2,23,23,124321,2157-10-18 19:34:00,2157-10-25 14:00:00,,EMERGENCY,TRANSFER FROM HOSP/EXTRAM,HOME HEALTH CARE,Medicare,...,BRAIN MASS,0,1,2082-07-17 00:00:00,,234044.0,metavision,2157-10-21 11:40:38,2157-10-22 16:08:48,1.1862
3,24,24,161859,2139-06-06 16:14:00,2139-06-09 12:48:00,,EMERGENCY,TRANSFER FROM HOSP/EXTRAM,HOME,Private,...,INTERIOR MYOCARDIAL INFARCTION,0,1,2100-05-31 00:00:00,,262236.0,carevue,2139-06-06 16:15:36,2139-06-07 04:33:25,0.5124
4,25,25,129635,2160-11-02 02:06:00,2160-11-05 14:55:00,,EMERGENCY,EMERGENCY ROOM ADMIT,HOME,Private,...,ACUTE CORONARY SYNDROME,0,1,2101-11-21 00:00:00,,203487.0,carevue,2160-11-02 03:16:23,2160-11-05 16:23:27,3.5466
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58971,58594,98800,191113,2131-03-30 21:13:00,2131-04-02 15:02:00,,EMERGENCY,CLINIC REFERRAL/PREMATURE,HOME,Private,...,TRAUMA,0,1,2111-11-05 00:00:00,,210188.0,metavision,2131-03-30 21:14:14,2131-03-31 18:18:14,0.8778
58972,58595,98802,101071,2151-03-05 20:00:00,2151-03-06 09:10:00,2151-03-06 09:10:00,EMERGENCY,CLINIC REFERRAL/PREMATURE,DEAD/EXPIRED,Medicare,...,SAH,1,1,2067-09-21 00:00:00,2151-03-06 00:00:00,294783.0,metavision,2151-03-05 20:01:18,2151-03-06 10:54:24,0.6202
58973,58596,98805,122631,2200-09-12 07:15:00,2200-09-20 12:08:00,,ELECTIVE,PHYS REFERRAL/NORMAL DELI,HOME HEALTH CARE,Private,...,RENAL CANCER/SDA,0,1,2158-03-02 00:00:00,,203155.0,metavision,2200-09-12 19:25:51,2200-09-14 18:17:01,1.9522
58974,58597,98813,170407,2128-11-11 02:29:00,2128-12-22 13:11:00,,EMERGENCY,EMERGENCY ROOM ADMIT,SNF,Private,...,S/P FALL,0,0,2068-02-04 00:00:00,,283274.0,carevue,2128-11-11 02:30:42,2128-11-19 03:22:46,8.0362


## 2. **Images data**


The mimic-cxr-2.0.0-metadata.csv.gz file contains useful meta-data derived from the original DICOM files in MIMIC-CXR. The columns are:

* **dicom_id** - An identifier for the DICOM file. The stem of each JPG image filename is equal to the dicom_id.
* **PerformedProcedureStepDescription** - The type of study performed ("CHEST (PA AND LAT)", "CHEST (PORTABLE AP)", etc).
* **ViewPosition** - The orientation in which the chest radiograph was taken ("AP", "PA", "LATERAL", etc).
* **Rows** - The height of the image in pixels.
* **Columns** - The width of the image in pixels.
* **StudyDate** - An anonymized date for the radiographic study. All images from the same study will have the same date and time. Dates are anonymized, but chronologically consistent for each patient. Intervals between two scans have not been modified during de-identification.
* **StudyTime** - The time of the study in hours, minutes, seconds, and fractional seconds. The time of the study was not modified during de-identification.
* **ProcedureCodeSequence_CodeMeaning** - The human readable description of the coded procedure (e.g. "CHEST (PA AND LAT)". Descriptions follow Simon-Leeming codes [11].
* **ViewCodeSequence_CodeMeaning** - The human readable description of the coded view orientation for the image (e.g. "postero-anterior", "antero-posterior", "lateral").
* **PatientOrientationCodeSequence_CodeMeaning** - The human readable description of the patient orientation during the image acquisition. Three values are possible: "Erect", "Recumbent", or a null value (missing).

In [5]:
# Info from the images
info_jpg = pd.read_csv('../fake_data/mimic-cxr-2.0.0-metadata.csv')
info_jpg.head()

Unnamed: 0,dicom_id,subject_id,study_id,PerformedProcedureStepDescription,ViewPosition,Rows,Columns,StudyDate,StudyTime,ProcedureCodeSequence_CodeMeaning,ViewCodeSequence_CodeMeaning,PatientOrientationCodeSequence_CodeMeaning
0,02aa804e-bde0afdd-112c0b34-7bc16630-4e384014,10000032,50414267,CHEST (PA AND LAT),PA,3056,2544,21800506,213014.531,CHEST (PA AND LAT),postero-anterior,Erect
1,174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962,10000032,50414267,CHEST (PA AND LAT),LATERAL,3056,2544,21800506,213014.531,CHEST (PA AND LAT),lateral,Erect
2,2a2277a9-b0ded155-c0de8eb9-c124d10e-82c5caab,10000032,53189527,CHEST (PA AND LAT),PA,3056,2544,21800626,165500.312,CHEST (PA AND LAT),postero-anterior,Erect
3,e084de3b-be89b11e-20fe3f9f-9c8d8dfe-4cfd202c,10000032,53189527,CHEST (PA AND LAT),LATERAL,3056,2544,21800626,165500.312,CHEST (PA AND LAT),lateral,Erect
4,68b5c4b1-227d0485-9cc38c3f-7b84ab51-4b472714,10000032,53911762,CHEST (PORTABLE AP),AP,2705,2539,21800723,80556.875,CHEST (PORTABLE AP),antero-posterior,


Note that "No Finding" is the absence of any of the 13 descriptive labels and a check that the text does not mention a specified set of other common findings beyond those covered by the descriptive labels. Each label column contains one of four values: 1.0, -1.0, 0.0, or missing. These labels have the following interpretation:

* **1.0** - The label was positively mentioned in the associated study, and is present in one or more of the corresponding images e.g. "A large pleural effusion"
* **0.0** - The label was negatively mentioned in the associated study, and therefore should not be present in any of the corresponding images e.g. "No pneumothorax."
* **-1.0** - The label was either: (1) Explicit uncertainty or (2) Ambiguous language
* **Missing** (empty element) - No mention of the label was made in the report

In [6]:
labels_data = pd.read_csv('../fake_data/mimic-cxr-2.0.0-chexpert.csv')
labels_data.head()

Unnamed: 0,subject_id,study_id,Atelectasis,Cardiomegaly,Consolidation,Edema,Enlarged Cardiomediastinum,Fracture,Lung Lesion,Lung Opacity,No Finding,Pleural Effusion,Pleural Other,Pneumonia,Pneumothorax,Support Devices
0,10000032,50414267,,,,,,,,,1.0,,,,,
1,10000032,53189527,,,,,,,,,1.0,,,,,
2,10000032,53911762,,,,,,,,,1.0,,,,,
3,10000032,56699142,,,,,,,,,1.0,,,,,
4,10000764,57375967,,,1.0,,,,,,,,,-1.0,,


In [7]:
def list_images(base_path):
    """
    Recursively lists all image files starting from the base path.
    Assumes that images have extensions typical for image files (e.g., .jpg, .jpeg, .png).
    """
    image_files = []
    for subdir, dirs, files in os.walk(base_path):
        for file in files:
            if file.lower().endswith(('.png', '.jpg', '.jpeg')):
                image_files.append(os.path.join(subdir, file))
    return image_files

image_files = list_images('../fake_data/files')
image_files

['../fake_data/files/p10/p10000764/s57375967/dcfeeac4-1597e318-d0e6736a-8b2c2238-47ac3f1b.jpg',
 '../fake_data/files/p10/p10000764/s57375967/b79e55c3-735ce5ac-64412506-cdc9ea79-f1af521f.jpg',
 '../fake_data/files/p10/p10000764/s57375967/096052b7-d256dc40-453a102b-fa7d01c6-1b22c6b4.jpg',
 '../fake_data/files/p10/p10000032/s56699142/ea030e7a-2e3b1346-bc518786-7a8fd698-f673b44c.jpg',
 '../fake_data/files/p10/p10000032/s53189527/e084de3b-be89b11e-20fe3f9f-9c8d8dfe-4cfd202c.jpg',
 '../fake_data/files/p10/p10000032/s53189527/2a2277a9-b0ded155-c0de8eb9-c124d10e-82c5caab.jpg',
 '../fake_data/files/p10/p10000032/s53911762/68b5c4b1-227d0485-9cc38c3f-7b84ab51-4b472714.jpg',
 '../fake_data/files/p10/p10000032/s53911762/fffabebf-74fd3a1f-673b6b41-96ec0ac9-2ab69818.jpg',
 '../fake_data/files/p10/p10000032/s50414267/174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962.jpg',
 '../fake_data/files/p10/p10000032/s50414267/02aa804e-bde0afdd-112c0b34-7bc16630-4e384014.jpg',
 '../fake_data/files/p11/p11000011/s5102

In [38]:
image_labels_mapping = create_image_labels_mapping(image_files, labels_data, info_jpg)
        
image_labels_mapping[image_files[0]]    

{'subject_id': 10000764.0,
 'study_id': 57375967.0,
 'Atelectasis': nan,
 'Cardiomegaly': nan,
 'Consolidation': 1.0,
 'Edema': nan,
 'Enlarged Cardiomediastinum': nan,
 'Fracture': nan,
 'Lung Lesion': nan,
 'Lung Opacity': nan,
 'No Finding': nan,
 'Pleural Effusion': nan,
 'Pleural Other': nan,
 'Pneumonia': -1.0,
 'Pneumothorax': nan,
 'Support Devices': nan,
 'ViewPosition': 'LATERAL'}

In [46]:
import numpy as np
import torch
from torch.utils.data import Dataset
from torchvision import transforms
from PIL import Image, ImageOps

class MedicalImagesDataset(Dataset):
    def __init__(self, data_dict, size=224, transform=None):
        self.data_dict = data_dict
        self.transform = transform
        self.size = size
        self.classes = ['Atelectasis', 'Cardiomegaly', 'Consolidation', 'Edema', 
                        'Enlarged Cardiomediastinum', 'Fracture', 'Lung Lesion', 
                        'Lung Opacity', 'No Finding', 'Pleural Effusion', 
                        'Pleural Other', 'Pneumonia', 'Pneumothorax', 'Support Devices']
        # Organize paths by subject_id and study_id
        self.organized_paths = self._organize_paths()
        # Filter out pairs where both images are None
        self.organized_paths = {k: v for k, v in self.organized_paths.items() if v['PA'] is not None or v['Lateral'] is not None}

    def _organize_paths(self):
        organized = {}
        for path in self.data_dict.keys():
            parts = path.split(os.sep)
            subject_id = parts[-3][1:]
            study_id = parts[-2][1:]
            key = (subject_id, study_id)

            if key not in organized:
                organized[key] = {'PA': None, 'Lateral': None}

            view_position = self.data_dict[path]['ViewPosition']
            if view_position in ['PA', 'Lateral']:
                organized[key][view_position] = path

        return organized

    def __len__(self):
        return len(self.organized_paths)

    def _load_and_process_image(self, path):
        if path:
            image = Image.open(path).convert('RGB')
        else:
            # Create a blank (black) image if path is None
            image = Image.new('RGB', (self.size, self.size))

        image = image.resize((self.size, self.size))

        if self.transform:
            image = self.transform(image)
        else:
            image = transforms.ToTensor()(image)

        return image

    def __getitem__(self, idx):
        subject_study_pair = list(self.organized_paths.keys())[idx]
        pa_path = self.organized_paths[subject_study_pair]['PA']
        lateral_path = self.organized_paths[subject_study_pair]['Lateral']

        # Load and process PA and Lateral images
        pa_image = self._load_and_process_image(pa_path)
        lateral_image = self._load_and_process_image(lateral_path)

        # Use one of the available paths to get labels (assuming they are the same for both views)
        labels_path = pa_path if pa_path else lateral_path

        if not labels_path:
            # Skip this patient if both PA and Lateral images are missing
            return None

        labels = self.data_dict[labels_path]
        label_values = [labels[class_name] if not np.isnan(labels[class_name]) else 0 for class_name in self.classes]
        label_tensor = torch.tensor(label_values, dtype=torch.float32)

        return pa_image, lateral_image, label_tensor

# Transform definition (if needed)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

dataset = MedicalImagesDataset(image_labels_mapping, transform=transform)
dataloaders = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0)

for pa_images, lateral_images, labels in dataloaders:
    print(pa_images.shape)
    print(lateral_images.shape)
    print(labels.shape)
    break


torch.Size([2, 3, 224, 224])
torch.Size([2, 3, 224, 224])
torch.Size([2, 14])


In [37]:
# Iterate over batches of image,label pairs
for i, (images, labels, view) in enumerate(dataloader):
    print(images.shape)
    print(labels.shape)
    print(view)
    break

torch.Size([13, 3, 224, 224])
torch.Size([13, 14])
('AP', 'PA', 'AP', 'LATERAL', 'PA', 'LATERAL', 'AP', 'LATERAL', 'AP', 'LATERAL', 'AP', 'AP', 'AP')
