## Step 0: Mounting Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

%cd /content/drive/MyDrive/multimodal-xray-agent

!ls

Mounted at /content/drive
/content/drive/MyDrive/multimodal-xray-agent
app	      data	  LICENSE  notebooks	   README.md	     scripts
chexpert.zip  deployment  models   PROJECT_LOG.md  requirements.txt  src


## Step 1: Define Dataset Paths

In [2]:
IMG_DIR = "./data/images_sample"

In [None]:
!mkdir -p $IMG_DIR/chexpert $IMG_DIR/chest14
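The shell command above can also be written in plain Python, which is useful if the notebook is run outside Colab or on a system without `mkdir -p`. The sketch below is one possible equivalent. The helper name `ensure_dirs` is hypothetical and not part of the project.

```python
from pathlib import Path


def ensure_dirs(base_dir, names):
    """Create base_dir/<name> for each name and return the resulting paths.

    `ensure_dirs` is an illustrative helper, not a function from the project.
    """
    paths = [Path(base_dir) / name for name in names]
    for path in paths:
        # parents=True and exist_ok=True mirror the behaviour of `mkdir -p`.
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

With the `IMG_DIR` defined above, a call such as `ensure_dirs(IMG_DIR, ["chexpert", "chest14"])` should create the same two folders as the shell cell.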

## Step 2: Downloading and Unzipping the CheXpert Dataset

This dataset was obtained from Kaggle. You can download it here: https://www.kaggle.com/datasets/ashery/chexpert

**Details from the website:**

"This dataset is a smaller, downsampled version of the original dataset, which can be found here. It includes 224,316 chest radiographs from 65,240 patients, featuring both frontal and lateral views. The dataset is designed to aid in the automated interpretation of chest x-rays and includes uncertainty labels and evaluation sets annotated by radiologists."

This dataset will serve as the primary image retrieval corpus for our **multimodal RAG system.**

Later, we will preprocess the images in this dataset into a format that our model can consume.

In [12]:
ZIP_PATH = "./data/images_sample/archive.zip"

In [9]:
DEST_DIR = "./data/images_sample/chexpert_raw"

In [10]:
!mkdir -p $DEST_DIR

In [11]:
!unzip -q $ZIP_PATH -d $DEST_DIR
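Re-running the `unzip` cell on an archive this large can be slow on Google Drive, and `unzip` may stop to ask about overwriting existing files. A minimal sketch of one way to make the step safe to re-run, using Python's standard `zipfile` module, is shown below. The helper name `extract_once` and the "skip if the destination is non-empty" rule are assumptions for illustration, not the notebook's actual method.

```python
import zipfile
from pathlib import Path


def extract_once(zip_path, dest_dir):
    """Extract zip_path into dest_dir unless dest_dir already has content.

    Returns True if extraction ran, False if it was skipped.
    `extract_once` is an illustrative helper, not a function from the project.
    """
    dest = Path(dest_dir)
    # Assumption: a non-empty destination means a previous run already extracted.
    if dest.is_dir() and any(dest.iterdir()):
        return False
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return True
```

Calling `extract_once(ZIP_PATH, DEST_DIR)` would then extract on the first run and return `False` on later runs. Note that this check cannot detect a partially extracted archive, so an interrupted run would still need the destination folder cleared by hand.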

In [21]:
import os

In [22]:
chexpert_train_dir = "/content/drive/MyDrive/multimodal-xray-agent/data/images_sample/chexpert_raw/train"

In [23]:
# Walk the train directory recursively and count files ending in .jpg (case-insensitive).
jpg_count = 0
for root, _, files in os.walk(chexpert_train_dir):
    jpg_count += sum(f.lower().endswith(".jpg") for f in files)

print(f"Total JPG images found in CheXpert train set: {jpg_count}")

Total JPG images found in CheXpert train set: 223415


## Step 3: Downloading and Unzipping the Chest X-ray 14 Dataset

You can download the dataset from Kaggle: https://www.kaggle.com/datasets/khanfashee/nih-chest-x-ray-14-224x224-resized

**Details from the website:**

"This NIH Chest X-ray Dataset is comprised of 112,120 X-ray images with disease labels from 30,805 unique patients. To create these labels, the authors used Natural Language Processing to text-mine disease classifications from the associated radiological reports. The labels are expected to be >90% accurate and suitable for weakly-supervised learning."

This dataset adds diversity to our **multimodal RAG system** by supplementing the CheXpert dataset.

In [16]:
ZIP_PATH_1 = "./data/images_sample/chest14.zip"

In [17]:
DEST_DIR_1 = "./data/images_sample/chest14_raw"

In [18]:
!mkdir -p $DEST_DIR_1

In [19]:
!unzip -q $ZIP_PATH_1 -d $DEST_DIR_1

In [24]:
chest14_dir = "/content/drive/MyDrive/multimodal-xray-agent/data/images_sample/chest14_raw/images-224/images-224"

In [25]:
# Walk the image directory recursively and count files ending in .png (case-insensitive).
img_count = 0
for root, _, files in os.walk(chest14_dir):
    img_count += sum(f.lower().endswith(".png") for f in files)

print(f"Total images found in ChestX-ray14: {img_count}")

Total images found in ChestX-ray14: 112120
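The two counting loops above differ only in the directory and the file extension, so they could be folded into one reusable helper. The sketch below tallies every extension in a single pass, which also makes stray non-image files visible. The name `count_by_extension` is hypothetical and not part of the project.

```python
import os
from collections import Counter


def count_by_extension(root_dir):
    """Return a Counter mapping lowercase file extension to file count.

    `count_by_extension` is an illustrative helper, not a function from the project.
    """
    counts = Counter()
    for _, _, files in os.walk(root_dir):
        for name in files:
            # splitext returns ("name", ".ext"); files without an extension map to "".
            ext = os.path.splitext(name)[1].lower()
            counts[ext] += 1
    return counts
```

For example, `count_by_extension(chest14_dir)[".png"]` should reproduce the ChestX-ray14 count, and `count_by_extension(chexpert_train_dir)[".jpg"]` the CheXpert one. The CheXpert figure covers only the `train` folder, which may be why it is lower than the 224,316 images quoted in the dataset description.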
