<a href="https://colab.research.google.com/github/ufbfung/mimic/blob/main/mimic_cxr.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MIMIC-CXR
This notebook explores MIMIC-CXR, the new chest X-ray repository from MIMIC.

## Setup environment


### Mount Google Drive & Map File Paths
This section connects to your Google Drive and defines the file paths used in later cells.

In [11]:
# Import relevant libraries
from google.colab import drive # to mount google drive
import os # to deal with file paths
import numpy as np # to deal with data
import pandas as pd # to deal with data
import cv2 # to read images
from matplotlib import pyplot as plt # to plot figures

# Mount drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [19]:
# Define file paths
root_path = '/content/drive/My Drive/coding/mimic/cxr'
train_split_csv = os.path.join(root_path, 'mimic-cxr-2.0.0-split.csv')

In [22]:
# Read the CSV file into a dataframe
split = pd.read_csv(train_split_csv)

# Summarize the values in the 'split' column
split_summary = split['split'].value_counts()

# Display the summary
print(split_summary)

train       368960
test          5159
validate      2991
Name: split, dtype: int64
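
The counts above put about 97.8% of images in the training split (368960 of 377110). Before relying on the split for modeling, it can be worth confirming that no patient appears in more than one split. The sketch below shows one way to check this with pandas; the `toy_split` frame and its values are made up for illustration and stand in for the `split` dataframe loaded above.

```python
import pandas as pd

# Toy stand-in for mimic-cxr-2.0.0-split.csv (values are made up for illustration)
toy_split = pd.DataFrame({
    "dicom_id": ["a1", "a2", "b1", "c1", "c2"],
    "study_id": [1, 1, 2, 3, 4],
    "subject_id": [100, 100, 200, 300, 300],
    "split": ["train", "train", "validate", "test", "test"],
})

# Share of images in each split
split_share = toy_split["split"].value_counts(normalize=True)

# Number of distinct splits each patient appears in; more than 1 would mean leakage
splits_per_patient = toy_split.groupby("subject_id")["split"].nunique()
leaking_patients = splits_per_patient[splits_per_patient > 1]

print(split_share)
print("Patients in more than one split:", len(leaking_patients))
```

Replacing `toy_split` with the real `split` dataframe should run the same check on the full file.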


In [25]:
# Preview the dataframe (pandas prints the first and last five rows);
# call split.head() instead to print only the first five
print(split)

                                            dicom_id  study_id  subject_id  \
0       02aa804e-bde0afdd-112c0b34-7bc16630-4e384014  50414267    10000032   
1       174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962  50414267    10000032   
2       2a2277a9-b0ded155-c0de8eb9-c124d10e-82c5caab  53189527    10000032   
3       e084de3b-be89b11e-20fe3f9f-9c8d8dfe-4cfd202c  53189527    10000032   
4       68b5c4b1-227d0485-9cc38c3f-7b84ab51-4b472714  53911762    10000032   
...                                              ...       ...         ...   
377105  428e2c18-5721d8f3-35a05001-36f3d080-9053b83c  57132437    19999733   
377106  58c403aa-35ff8bd9-73e39f54-8dc9cc5d-e0ec3fa9  57132437    19999733   
377107  58766883-376a15ce-3b323a28-6af950a0-16b793bd  55368167    19999987   
377108  7ba273af-3d290f8d-e28d0ab4-484b7a86-7fc12b08  58621812    19999987   
377109  1a1fe7e3-cbac5d93-b339aeda-86bb86b5-4f31e82e  58971208    19999987   

        split  
0       train  
1

### Integrate with BigQuery
This section will allow integration with Google BigQuery to bring data into your Colab instance.

### Authenticate and authorize Colab Notebook
This step imports the libraries needed for authentication and authorization, then authenticates your Colab notebook so it can connect to Google BigQuery.

In [24]:
# Import the Colab authentication helper
from google.colab import auth

# Authenticate
auth.authenticate_user()

# Import the BigQuery client library
from google.cloud import bigquery

# Define project you're interested in querying from BigQuery
project = 'mimic-iv-390722'

# Create a client
client = bigquery.Client(project=project)

# Write a sample query to check connection
query = """
SELECT *
FROM `physionet-data.mimiciii_clinical.admissions`
LIMIT 1
"""

# Run the query
query_job = client.query(query)

# Get the results
results = query_job.result()

# Process and extract the data
for row in results:
  print(row) # process the row data

Row((3757, 3115, 134067, datetime.datetime(2139, 2, 13, 3, 11), datetime.datetime(2139, 2, 20, 7, 33), None, 'EMERGENCY', 'EMERGENCY ROOM ADMIT', 'SNF', 'Medicare', None, None, None, 'WHITE', datetime.datetime(2139, 2, 13, 0, 2), datetime.datetime(2139, 2, 13, 3, 22), 'STAB WOUND', 0, 1), {'ROW_ID': 0, 'SUBJECT_ID': 1, 'HADM_ID': 2, 'ADMITTIME': 3, 'DISCHTIME': 4, 'DEATHTIME': 5, 'ADMISSION_TYPE': 6, 'ADMISSION_LOCATION': 7, 'DISCHARGE_LOCATION': 8, 'INSURANCE': 9, 'LANGUAGE': 10, 'RELIGION': 11, 'MARITAL_STATUS': 12, 'ETHNICITY': 13, 'EDREGTIME': 14, 'EDOUTTIME': 15, 'DIAGNOSIS': 16, 'HOSPITAL_EXPIRE_FLAG': 17, 'HAS_CHARTEVENTS_DATA': 18})


## Connect to Google Cloud Storage (GCS) to access the chest x-rays
We access the images directly from GCS rather than uploading them. This section connects to the GCS bucket that holds the images.

This section assumes that you have already authenticated with Google Cloud in the prior step using `auth.authenticate_user()`.
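
Judging from the single blob name used in the cell below, object names in the bucket appear to follow the pattern `files/p<first two digits of subject_id>/p<subject_id>/s<study_id>/<dicom_id>.jpg`. A minimal sketch of a helper that builds such a name follows; `build_blob_name` is a hypothetical function, not part of the notebook or of any library, and the pattern is an assumption inferred from that one example.

```python
def build_blob_name(subject_id, study_id, dicom_id, extension="jpg"):
    """Build an object name following the pattern seen in this notebook (hypothetical helper)."""
    subject = str(subject_id)
    # Top-level folder is 'p' plus the first two digits of the subject ID
    return f"files/p{subject[:2]}/p{subject}/s{study_id}/{dicom_id}.{extension}"


blob_name = build_blob_name(
    10000032, 50414267, "02aa804e-bde0afdd-112c0b34-7bc16630-4e384014"
)
print(blob_name)
```

The resulting string can be passed to `bucket.blob(...)` in the same way as the hard-coded name in the cell below.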

In [None]:
# Import relevant libraries
from google.cloud import storage

# Create a client
project_id = 'mimic-iv-390722'
storage_client = storage.Client(project=project_id)

# Get a bucket reference
bucket_name = 'mimic-iv-cxr'
bucket = storage_client.get_bucket(bucket_name)

# List objects in the bucket
blobs = bucket.list_blobs()

# Set paths to files
# files_path =

# Alternatively, you can directly access a specific blob by its name
specific_blob = bucket.blob('files/p10/p10000032/s50414267/02aa804e-bde0afdd-112c0b34-7bc16630-4e384014.jpg')
print(specific_blob)
# Perform operations on the specific blob
# For example, download the file
#specific_blob.download_to_filename('local_file.jpg')
#print("File downloaded successfully.")
#for blob in blobs:
#  print(blob.name)

<Blob: mimic-iv-cxr, files/p10/p10000032/s50414267/02aa804e-bde0afdd-112c0b34-7bc16630-4e384014.jpg, None>


### Download metadata files from mimic_cxr as csv files
This section downloads key metadata tables as CSV files into your working directory for further processing. Specifically, it does the following:
- Query and download the chexpert table from mimic_cxr dataset
- Query and download the record_list table from mimic_cxr dataset
- Join chexpert and record_list on study_id to get the specific path of where each image is located.

The purpose of this section is to allow querying of specific labels of interest from chexpert (e.g. pneumonia) and retrieve all the image paths associated with that label for further investigation and modeling.
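
The join-then-filter idea described above can be sketched on toy data. All values below are made up, and the column names simply mirror those used later in this notebook. This variant joins on both `subject_id` and `study_id`, which is an alternative to joining on `study_id` alone and keeps a single `subject_id` column instead of `subject_id_x` and `subject_id_y`.

```python
import pandas as pd

# Toy stand-ins for the chexpert and record_list tables (all values are made up)
chexpert = pd.DataFrame({
    "subject_id": [100, 100, 200],
    "study_id": [1, 2, 3],
    "Pneumonia": [1.0, None, 0.0],
})
record_list = pd.DataFrame({
    "subject_id": [100, 100, 100, 200],
    "study_id": [1, 1, 2, 3],
    "dicom_id": ["a1", "a2", "b1", "c1"],
    "path": ["files/a1.dcm", "files/a2.dcm", "files/b1.dcm", "files/c1.dcm"],
})

# Joining on both keys keeps a single subject_id column (no _x/_y suffixes)
joined = chexpert.merge(record_list, on=["subject_id", "study_id"], how="inner")

# Keep only images from studies with a positive (1.0) label
pneumonia_paths = joined.loc[joined["Pneumonia"] == 1.0, "path"].tolist()
print(pneumonia_paths)
```

Note that the `summarize_output` helper defined below expects a `subject_id_x` column, so it would need a small change if the join were done this way.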

#### Define helper functions that will assist in the extraction
These are the functions that perform the extraction. They will eventually be moved into a utils module.

In [103]:
def export_table(dataset, table, output_file=None):
    # Set the default output file name if not provided
    if output_file is None:
        output_file = f"{table}.csv"

    # Create the query to fetch the entire table
    query = create_query(dataset, table)

    # Execute the query and get the results
    results = query_dataset(query)

    # Convert the results to a Pandas DataFrame
    results_df = pd.DataFrame(results)

    # Save the DataFrame to a CSV file
    results_df.to_csv(output_file, index=False)

    print(f"Exported {table} table to {output_file}.")

def query_dataset(query):
    # Run the query
    query_job = client.query(query)

    # Get the results
    results = query_job.result()

    # Convert the results to a list of dictionaries
    results_list = [dict(row.items()) for row in results]

    return results_list

def create_query(dataset, table):
    # Build a query that selects every row of a table in the physionet-data project
    project = 'physionet-data'
    query = """
    SELECT *
    FROM `{0}.{1}.{2}`
    """.format(project, dataset, table)
    return query

def extract_labeled_rows(dataframe, column, default_value=1.0):
    # Filter the rows based on the specified column and default value
    filtered_rows = dataframe[(dataframe[column] == default_value)]

    return filtered_rows

def summarize_output(dataframe):
    # Tally the number of total images found
    total_images = dataframe.shape[0]

    # Tally the number of unique patients via subject_id_x
    # (the chexpert subject_id, suffixed by the merge on study_id)
    unique_patients = dataframe['subject_id_x'].nunique()

    # Tally the number of unique studies by study_id
    unique_studies = dataframe['study_id'].nunique()

    # Create a dictionary to store the summary
    summary = {
        'Total Images': total_images,
        'Unique Patients': unique_patients,
        'Unique Studies': unique_studies
    }

    return summary

In [104]:
# Define dataset
dataset = 'mimic_cxr'

# Download the full chexpert table as a csv
print('Downloading chexpert table...')
# export_table(dataset, 'chexpert') # Only need to run once
print('Downloading chexpert table complete')

# Download the full record_list table as a csv
print('Downloading record_list table...')
# export_table(dataset, 'record_list') # Only need to run once
print('Downloading record_list table complete')

# Read both csvs into dataframes
chexpert_csv = pd.read_csv('chexpert.csv')
record_list_csv = pd.read_csv('record_list.csv')

# Print number of rows for each
print("Number of rows in chexpert_csv:", chexpert_csv.shape[0])
print("Number of rows in record_list_csv:", record_list_csv.shape[0])

# Join the two csv tables
joined_df = chexpert_csv.merge(record_list_csv, on='study_id', how='inner')

# Columns of interest (defined here but not yet applied below)
filtered_columns = ['subject_id_x', 'subject_id_y', 'study_id', 'dicom_id', 'path', 'Lung_Lesion', 'Pneumonia']

# Define label of interest
label = 'Pneumonia'

# Extract labeled rows
interest_df = extract_labeled_rows(joined_df, label)

# Call the summarize_output function
summary = summarize_output(interest_df)

# Print the summary
print(summary)

Downloading chexpert table...
Downloading chexpert table complete
Downloading record_list table...
Downloading record_list table complete
Number of rows in chexpert_csv: 227827
Number of rows in record_list_csv: 377110
{'Total Images': 26222, 'Unique Patients': 10355, 'Unique Studies': 16556}


## Setup ResNet CNN
- For the train/validate/test splits, we'll use the predefined assignments that MIMIC provides in mimic-cxr-2.0.0-split.csv
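
One way to apply the predefined splits is to merge the split file onto the labeled images by `dicom_id` and then partition by the `split` column. The sketch below uses toy data with made-up values; `labeled_images` and `split_table` are hypothetical stand-ins for the joined label dataframe and the split CSV loaded earlier.

```python
import pandas as pd

# Toy stand-ins (made-up values): labeled images and the official split assignments
labeled_images = pd.DataFrame({
    "dicom_id": ["a1", "a2", "b1", "c1"],
    "path": ["files/a1.jpg", "files/a2.jpg", "files/b1.jpg", "files/c1.jpg"],
    "Pneumonia": [1.0, 1.0, 0.0, 1.0],
})
split_table = pd.DataFrame({
    "dicom_id": ["a1", "a2", "b1", "c1"],
    "split": ["train", "train", "validate", "test"],
})

# Attach each image's predefined split; validate= raises if a dicom_id is duplicated
with_split = labeled_images.merge(
    split_table, on="dicom_id", how="left", validate="one_to_one"
)

# One dataframe per split, keyed by the split name used in the CSV
partitions = {
    name: frame.reset_index(drop=True)
    for name, frame in with_split.groupby("split")
}
print({name: len(frame) for name, frame in partitions.items()})
```

Each entry in `partitions` can then feed the corresponding data loader once the model code is written.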

In [None]:
#

## Explore the data
Once everything is set up, we can run some queries to get an idea of what the dataset looks like. We'll start with queries that profile the MIMIC-III dataset.

In this section, we'll also define some helper functions that will assist us in querying the data.

This exploration is based primarily on an exemplar notebook from the [MIMIC code repository](https://github.com/MIT-LCP/mimic-code/blob/main/mimic-iv/notebooks/tableone.ipynb), which explores demographics and charts within MIMIC-IV.
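
The helper defined below counts every non-null value in each label column. If the labels follow the usual CheXpert convention (1.0 positive, 0.0 negative, -1.0 uncertain, missing when not mentioned), which is an assumption here, a per-value tally gives a finer picture. The sketch below shows this on a toy dataframe with made-up values; `toy_chexpert` stands in for the chexpert table.

```python
import pandas as pd

# Toy stand-in for the chexpert table (made-up values)
toy_chexpert = pd.DataFrame({
    "subject_id": [100, 100, 200, 300],
    "study_id": [1, 2, 3, 4],
    "Pneumonia": [1.0, None, 0.0, -1.0],
    "Edema": [None, 1.0, 1.0, None],
})

# Every column other than the identifiers is treated as a label
label_columns = [c for c in toy_chexpert.columns if c not in ("subject_id", "study_id")]

# Tally positive / negative / uncertain / not-mentioned values per label
label_summary = pd.DataFrame({
    "positive": (toy_chexpert[label_columns] == 1.0).sum(),
    "negative": (toy_chexpert[label_columns] == 0.0).sum(),
    "uncertain": (toy_chexpert[label_columns] == -1.0).sum(),
    "not_mentioned": toy_chexpert[label_columns].isna().sum(),
})

# Share of studies with a positive label, as a percentage
label_summary["positive_pct"] = 100 * label_summary["positive"] / len(toy_chexpert)
print(label_summary)
```

Working on a dataframe rather than iterating over query rows also means the same summary can be run on the exported `chexpert.csv` without querying BigQuery again.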

In [None]:
# !pip install tableone
# from tableone import TableOne

In [None]:
# Import relevant libraries for exploration
from collections import OrderedDict
from tabulate import tabulate
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
def summarize_data(dataset, table):
    # Run the query to retrieve the data
    query = """
    SELECT *
    FROM `{0}.{1}.{2}`
    """.format('physionet-data', dataset, table)

    query_job = client.query(query)
    results = query_job.result()

    # Initialize counters and dictionaries
    total_patients = 0
    distinct_patients = set()
    disease_counts = {}
    disease_percentages = {}

    # Process and extract the data
    for row in results:
        total_patients += 1
        distinct_patients.add(row.subject_id)

        # Iterate over the diseases/findings columns
        for column_name in row.keys():
            if column_name not in ['subject_id', 'study_id']:
                disease_value = row[column_name]

                # Count every non-null label for each disease/finding
                # (this includes negative and uncertain mentions, not only positives)
                if disease_value is not None:
                    if column_name not in disease_counts:
                        disease_counts[column_name] = 0
                    disease_counts[column_name] += 1

    # Calculate the percentages of each disease/finding
    for column_name, count in disease_counts.items():
        percentage = (count / total_patients) * 100
        disease_percentages[column_name] = percentage

    # Prepare the summary statistics as a list of lists
    summary_data = [
        ["Total count of patients:", total_patients],
        ["Count of distinct patients:", len(distinct_patients)],
        ["Count of each disease/finding (from most frequent to least frequent):"]
    ]
    for column_name, count in sorted(disease_counts.items(), key=lambda x: x[1], reverse=True):
        summary_data.append([column_name, count])
    summary_data.append(["Percentage of each disease/finding to the total count of patients:"])
    for column_name, percentage in sorted(disease_percentages.items(), key=lambda x: x[1], reverse=True):
        summary_data.append([column_name, "{:.2f}%".format(percentage)])

    # Generate the formatted table
    table_output = tabulate(summary_data, headers=["Category", "Count/Percentage"], tablefmt="github")

    # Print the formatted table
    print(table_output)

# Define the dataset and table of interest
dataset = 'mimic_cxr'
table = 'chexpert'

# Create query
# query = create_query(dataset, table)
# query_dataset(query)

# Summarize findings
summarize_data(dataset, table)

| Category                                                              | Count/Percentage   |
|-----------------------------------------------------------------------|--------------------|
| Total count of patients:                                              | 227827             |
| Count of distinct patients:                                           | 65379              |
| Count of each disease/finding (from most frequent to least frequent): |                    |
| Pleural_Effusion                                                      | 87272              |
| No_Finding                                                            | 75455              |
| Support_Devices                                                       | 70281              |
| Cardiomegaly                                                          | 66799              |
| Edema                                                                 | 65833              |
| Pneumonia                                       