## Step 0: Mounting Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

%cd /content/drive/MyDrive/multimodal-xray-agent

!ls

Mounted at /content/drive
/content/drive/MyDrive/multimodal-xray-agent
app	      data	  LICENSE  notebooks	   README.md	     scripts
chexpert.zip  deployment  models   PROJECT_LOG.md  requirements.txt  src


In [7]:
import os

## Step 1: Define Dataset Paths

In [None]:
IMG_DIR = "./data/images_sample"

In [None]:
!mkdir -p $IMG_DIR/chexpert $IMG_DIR/chest14

## Step 2: Downloading and Unzipping the CheXpert Dataset

This dataset was obtained from Kaggle. You can download it here: https://www.kaggle.com/datasets/ashery/chexpert

**Details from the website:**

"This dataset is a smaller, downsampled version of the original dataset, which can be found here. It includes 224,316 chest radiographs from 65,240 patients, featuring both frontal and lateral views. The dataset is designed to aid in the automated interpretation of chest x-rays and includes uncertainty labels and evaluation sets annotated by radiologists."

This dataset will serve as the primary image retrieval corpus for our **multimodal RAG system.**

Later, we will preprocess these images into a format that our model can consume.

In [None]:
ZIP_PATH = "./data/images_sample/archive.zip"

In [None]:
DEST_DIR = "./data/images_sample/chexpert_raw"

In [None]:
!mkdir -p $DEST_DIR

In [None]:
!unzip -q $ZIP_PATH -d $DEST_DIR

In [None]:
chexpert_train_dir = "/content/drive/MyDrive/multimodal-xray-agent/data/images_sample/chexpert_raw/train"

In [None]:
# Walk the train directory recursively and count .jpg files (case-insensitive).
jpg_count = 0
for root, _, files in os.walk(chexpert_train_dir):
    jpg_count += sum(f.lower().endswith(".jpg") for f in files)

print(f"Total JPG images found in CheXpert train set: {jpg_count}")

Total JPG images found in CheXpert train set: 223415
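
The same walk-and-count pattern is repeated for each dataset in this notebook. As a minimal sketch, it could be wrapped in a small helper; the name `count_files_with_ext` is hypothetical and not part of the original project code.

```python
import os


def count_files_with_ext(root_dir, ext):
    """Recursively count files under root_dir whose names end with ext.

    The comparison is case-insensitive, so ".jpg" also matches ".JPG".
    """
    ext = ext.lower()
    total = 0
    for _, _, files in os.walk(root_dir):
        total += sum(name.lower().endswith(ext) for name in files)
    return total
```

With this helper, the cell above would reduce to `count_files_with_ext(chexpert_train_dir, ".jpg")`.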


## Step 3: Downloading and Unzipping the Chest X-ray 14 Dataset

You can download the dataset from Kaggle: https://www.kaggle.com/datasets/khanfashee/nih-chest-x-ray-14-224x224-resized

**Description from the website:**

"This NIH Chest X-ray Dataset is comprised of 112,120 X-ray images with disease labels from 30,805 unique patients. To create these labels, the authors used Natural Language Processing to text-mine disease classifications from the associated radiological reports. The labels are expected to be >90% accurate and suitable for weakly-supervised learning."

This dataset adds diversity to our **multimodal RAG system** by supplementing the CheXpert dataset.

In [None]:
ZIP_PATH_1 = "./data/images_sample/chest14.zip"

In [None]:
DEST_DIR_1 = "./data/images_sample/chest14_raw"

In [None]:
!mkdir -p $DEST_DIR_1

In [None]:
!unzip -q $ZIP_PATH_1 -d $DEST_DIR_1

In [None]:
chest14_dir = "/content/drive/MyDrive/multimodal-xray-agent/data/images_sample/chest14_raw/images-224/images-224"

In [None]:
# Walk the ChestX-ray14 directory recursively and count .png files (case-insensitive).
img_count = 0
for root, _, files in os.walk(chest14_dir):
    img_count += sum(f.lower().endswith(".png") for f in files)

print(f"Total images found in ChestX-ray14: {img_count}")

Total images found in ChestX-ray14: 112120


## Step 4: Flattening the CheXpert Dataset

The raw dataset downloaded from Kaggle has a complex, nested folder structure that is awkward to navigate. To work around this, we "flattened" the dataset by copying all the images into a single folder. This was done on a local machine, and the resulting zip file was then uploaded to Drive. Here we copy that zip file to the Colab VM's local disk, unzip it there, and copy the extracted folder back to Drive.
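
The flattening script itself is not included in this notebook. The following is only a sketch of how such a step could look, assuming the raw layout is `patientXXXXX/studyN/viewM_frontal.jpg` (inferred from the flattened file names listed at the end of this step) and that path components are joined with underscores. The function name `flatten_images` is hypothetical.

```python
import os
import shutil


def flatten_images(src_root, dst_dir, ext=".jpg"):
    """Copy every image under src_root into the single folder dst_dir.

    Each file is renamed by joining its path relative to src_root with
    underscores, e.g. patient00001/study1/view1_frontal.jpg becomes
    patient00001_study1_view1_frontal.jpg.
    dst_dir should live outside src_root so the walk does not revisit copies.
    """
    os.makedirs(dst_dir, exist_ok=True)
    copied = 0
    for root, _, files in os.walk(src_root):
        for name in files:
            if not name.lower().endswith(ext):
                continue
            src_path = os.path.join(root, name)
            rel_path = os.path.relpath(src_path, src_root)
            flat_name = "_".join(rel_path.split(os.sep))
            dst_path = os.path.join(dst_dir, flat_name)
            # Refuse to overwrite silently if two files map to the same name.
            if os.path.exists(dst_path):
                raise FileExistsError(f"Name collision: {flat_name}")
            shutil.copy2(src_path, dst_path)
            copied += 1
    return copied
```

Raising on a name collision is a deliberate choice in this sketch: silently overwriting a file would make the flattened count smaller than the source count with no indication of why.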


In [3]:
ZIP_PATH_2 = "./data/images_sample/chexpert_flat.zip"

In [4]:
DEST_DIR_2 = "./data/images_sample/chexpert"

In [2]:
!cp $ZIP_PATH_2 /content/

In [3]:
!unzip -q /content/chexpert_flat.zip -d /content/chexpert_flat

In [None]:
!cp -r /content/chexpert_flat ./data/images_sample/

In [5]:
chexpert_dir = "/content/drive/MyDrive/multimodal-xray-agent/data/images_sample/chexpert_flat/chexpert_flat"

In [9]:
# Walk the flattened CheXpert directory and count .jpg files (case-insensitive).
img_count = 0
for root, _, files in os.walk(chexpert_dir):
    img_count += sum(f.lower().endswith(".jpg") for f in files)

print(f"Total images found in chexpert_flat: {img_count}")

Total images found in chexpert_flat: 223416
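
This count (223416) is one higher than the train-set count reported in Step 2 (223415), and the notebook does not say why. One way to investigate is sketched below. It assumes the flattened names were produced by joining each file's relative path with underscores, which is inferred from the file names and not confirmed by the source; the helper name `flat_name_diff` is hypothetical.

```python
import os


def flat_name_diff(nested_root, flat_dir, ext=".jpg"):
    """Compare a nested image tree against a flattened copy of it.

    Returns (missing, extra): names expected in flat_dir but absent,
    and names present in flat_dir that the nested tree does not explain.
    """
    expected = set()
    for root, _, files in os.walk(nested_root):
        for name in files:
            if name.lower().endswith(ext):
                rel_path = os.path.relpath(os.path.join(root, name), nested_root)
                expected.add("_".join(rel_path.split(os.sep)))
    actual = {n for n in os.listdir(flat_dir) if n.lower().endswith(ext)}
    return sorted(expected - actual), sorted(actual - expected)
```

Calling `flat_name_diff(chexpert_train_dir, chexpert_dir)` would list any unexplained files, provided the naming assumption holds for this dataset.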


In [10]:
!ls -lh /content/chexpert_flat | head

total 15M
drwxr-xr-x 2 root root  15M May 27 10:03 chexpert_flat
drwxr-xr-x 3 root root 4.0K May 28 03:45 __MACOSX
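
The `__MACOSX` folder in this listing is metadata that macOS adds when it creates a zip archive; it is not part of the dataset. As a hedged sketch, image paths could be collected while skipping that folder and any `._` sidecar files. The function name `list_real_images` is hypothetical, and whether this archive actually contains `._` files is not shown in the output above.

```python
import os


def list_real_images(root_dir, ext=".jpg"):
    """Return sorted image paths under root_dir, skipping macOS zip metadata."""
    paths = []
    for root, dirs, files in os.walk(root_dir):
        # Pruning dirs in place stops os.walk from descending into __MACOSX.
        dirs[:] = [d for d in dirs if d != "__MACOSX"]
        for name in files:
            # "._name" files are AppleDouble sidecar files, not real images.
            if name.startswith("._"):
                continue
            if name.lower().endswith(ext):
                paths.append(os.path.join(root, name))
    return sorted(paths)
```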


In [11]:
!find /content/chexpert_flat -type f | head -n 10

/content/chexpert_flat/chexpert_flat/patient24375_study8_view2_lateral.jpg
/content/chexpert_flat/chexpert_flat/patient09758_study5_view1_frontal.jpg
/content/chexpert_flat/chexpert_flat/patient59311_study1_view1_frontal.jpg
/content/chexpert_flat/chexpert_flat/patient39457_study1_view1_frontal.jpg
/content/chexpert_flat/chexpert_flat/patient14590_study4_view1_frontal.jpg
/content/chexpert_flat/chexpert_flat/patient56675_study1_view1_frontal.jpg
/content/chexpert_flat/chexpert_flat/patient15877_study6_view1_frontal.jpg
/content/chexpert_flat/chexpert_flat/patient12479_study1_view2_lateral.jpg
/content/chexpert_flat/chexpert_flat/patient31413_study10_view1_frontal.jpg
/content/chexpert_flat/chexpert_flat/patient07800_study1_view3_lateral.jpg
