<a href="https://colab.research.google.com/github/rashwinr/MONAI_tutorials/blob/main/MONAI_datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MONAI: Datasets

1. **Medical datasets are key to developing AI solutions for healthcare:** They provide the essential data for training and evaluating algorithms that can detect diseases, assist in diagnosis, and personalize treatment plans.
2. **Accessing and utilizing these datasets can be challenging:** Issues include large file sizes, complex formats, and the need for anonymization to protect patient privacy.
3. **MONAI simplifies this process:** This open-source framework provides tools designed specifically for medical imaging data. It streamlines data loading, preprocessing, and analysis, so researchers can work with these datasets more easily.




In [None]:
!pip show monai

In [None]:
!pip install monai[all]

In [None]:
import monai
from monai.config import print_config
print_config()

## Create a dummy image

**Syntax**:
```python
monai.data.synthetic.create_test_image_2d(height, width, num_objs=12, rad_max=30, rad_min=5, noise_max=0.0, num_seg_classes=5, channel_dim=None, random_state=None)
```

**Parameters**
________
**height** – height of the image.

**width** – width of the image.

**num_objs** – number of circles to generate. Defaults to 12.

**rad_max** – maximum circle radius. Defaults to 30.

**rad_min** – minimum circle radius. Defaults to 5.

**noise_max** – if greater than 0 then noise will be added to the image taken from the uniform distribution on range [0,noise_max). Defaults to 0.

**num_seg_classes** – number of classes for segmentations. Defaults to 5.

**channel_dim** – if None, create an image without channel dimension, otherwise create an image with channel dimension as first dim or last dim. Defaults to None.

**random_state** – the random generator to use. Defaults to np.random.



In [None]:
from monai.data import create_test_image_2d

image, seg = create_test_image_2d(height=128, width=128,num_objs=5,rad_max=10,rad_min=2,num_seg_classes=10)

print(f"Image shape: {image.shape}")
print(f"Segmentation shape: {seg.shape}")

print(f"Image min: {image.min()}, max: {image.max()}")
print(f"Segmentation min: {seg.min()}, max: {seg.max()}")

### Visualization

- Matplotlib ships with many built-in colormaps
- Choose a colormap that gives an intuitive color scheme for the quantity you are plotting
- More details on ``matplotlib.colormaps`` are available here: https://matplotlib.org/stable/users/explain/colors/colormaps.html


<img src="https://matplotlib.org/stable/_images/sphx_glr_colormaps_014.png">

In [None]:
from matplotlib import colormaps
# list(colormaps)

In [None]:
import matplotlib.pyplot as plt

plt.figure("visualize", (12, 6))
plt.subplot(1, 2, 1)
plt.title("image")
plt.imshow(image, cmap="gray")
plt.subplot(1, 2, 2)
plt.title("segmentation")
plt.imshow(seg,cmap="gnuplot")
plt.show()

## Recap

- MONAI dataset
- MONAI transforms
- Working with medical images

### MONAI dataset

* Dataset: Combines data and an optional transform into a single object
  * Syntax: ``Dataset(data, transform=None)``

    Here ``transform`` is a callable that is applied to each item when it is accessed by index; with ``transform=None`` the item is returned unchanged
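
As a rough mental model, a dataset of this kind stores the items and applies the transform each time an item is indexed. The following is a plain-Python sketch of that idea, not MONAI's actual implementation; `TinyDataset` and `double_image` are hypothetical names used only for illustration.

```python
from typing import Any, Callable, Optional, Sequence


class TinyDataset:
    """Hypothetical stand-in that mimics the idea of Dataset(data, transform)."""

    def __init__(self, data: Sequence[Any], transform: Optional[Callable] = None):
        self.data = data
        self.transform = transform

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, index: int) -> Any:
        item = self.data[index]
        # The transform runs lazily, only when an item is requested.
        return self.transform(item) if self.transform is not None else item


def double_image(item: dict) -> dict:
    # Illustrative transform: return a new dict with every pixel value doubled.
    return {**item, "image": [2 * v for v in item["image"]]}


records = [{"image": [1, 2, 3], "seg": [0, 1, 0]}]
plain = TinyDataset(records)
transformed = TinyDataset(records, transform=double_image)

print(len(plain), plain[0]["image"])  # 1 [1, 2, 3]
print(transformed[0]["image"])        # [2, 4, 6]
```

Because the transform is applied on access, the stored records stay unchanged, and the same list can back several datasets with different transforms.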

In [None]:
from monai.data import Dataset

data = [
    {"image": image, "seg": seg}
]

# Define a dataset using the data list
dataset = Dataset(data=data)


print(f"Dataset length: {len(dataset)}")

# Access a data item by index
item = dataset[0]
print(f"Keys in item: {item.keys()}")

print(f"Image shape: {item['image'].shape}")
print(f"Segmentation shape: {item['seg'].shape}")


### Attributes of monai dataset


In [None]:
# List the attributes and methods available on the dataset object

# print(dir(dataset))

### MONAI transforms

In [None]:

import monai
from monai.transforms import Compose, EnsureChannelFirstd, ScaleIntensityd, ToTensord

# Define a transform to convert image and segmentation into tensors,
# ensure channel first and scale intensity
transform = Compose([
    EnsureChannelFirstd(keys=["image", "seg"],channel_dim="no_channel"),
    ScaleIntensityd(keys=["image", "seg"]),
    ToTensord(keys=["image", "seg"])
])

# Create a monai dataset with the transform
dataset = monai.data.Dataset(data=data, transform=transform)


# Access a data item by index
item = dataset[0]
print(f"Keys in item: {item.keys()}")

print(f"Image shape: {item['image'].shape}")
print(f"Segmentation shape: {item['seg'].shape}")

print(f"Image min: {item['image'].min()}, max: {item['image'].max()}")
print(f"Segmentation min: {item['seg'].min()}, max: {item['seg'].max()}")
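
To make the pipeline above less opaque, here is a plain-Python sketch of two ideas it relies on: chaining callables in order, and min-max scaling intensities to [0, 1]. The helpers `compose`, `scale_intensity`, and `apply_to_keys` are hypothetical illustrations, not MONAI functions, and they work on flat lists rather than arrays.

```python
from typing import Callable, Sequence


def compose(steps: Sequence[Callable]) -> Callable:
    """Return a function that applies each step in order."""
    def run(item):
        for step in steps:
            item = step(item)
        return item
    return run


def scale_intensity(values: list) -> list:
    """Min-max scale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def apply_to_keys(fn: Callable, keys: Sequence[str]) -> Callable:
    """Wrap fn so it transforms only the selected dictionary entries."""
    def run(item: dict) -> dict:
        return {k: fn(v) if k in keys else v for k, v in item.items()}
    return run


pipeline = compose([apply_to_keys(scale_intensity, keys=["image", "seg"])])
sample = {"image": [10.0, 20.0, 30.0], "seg": [0, 5, 10], "name": "demo"}
result = pipeline(sample)
print(result["image"])  # [0.0, 0.5, 1.0]
print(result["seg"])    # [0.0, 0.5, 1.0]
```

Note that scaling the segmentation turns its integer class ids into fractions, as the second printed line shows. That is harmless for a quick demonstration, but label maps are usually left unscaled when they are used for training.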

## MONAI datasets

Datasets covered here, starting with those in the ``monai.apps`` module:
- MedNIST
- Medical Segmentation Decathlon
- TCIA
- Others: MedMNIST
- Others: PhysioNet

In [None]:
import os
dir_path = os.getcwd()
print(dir_path)

### MEDNIST Dataset

The MedNIST dataset was gathered from several sets from [TCIA](https://wiki.cancerimagingarchive.net/display/Public/Data+Usage+Policies+and+Restrictions),
[the RSNA Bone Age Challenge](http://rsnachallenges.cloudapp.net/competitions/4),
and [the NIH Chest X-ray dataset](https://cloud.google.com/healthcare/docs/resources/public-datasets/nih-chest).

The dataset is kindly made available by [Dr. Bradley J. Erickson M.D., Ph.D.](https://www.mayo.edu/research/labs/radiology-informatics/overview) (Department of Radiology, Mayo Clinic)
under the Creative Commons [CC BY-SA 4.0 license](https://creativecommons.org/licenses/by-sa/4.0/). If you use the MedNIST dataset, please acknowledge the source.

Syntax: ``MedNISTDataset(root_dir, section, transform=(), download=False, seed=0, val_frac=0.1, test_frac=0.1, cache_num=9223372036854775807, cache_rate=1.0, num_workers=1, progress=True, copy_cache=True, as_contiguous=True, runtime_cache=False)``

**Parameters**:
- **root_dir** – target directory to download and load MedNIST dataset.
- **section** – expected data section, can be: training, validation or test.
- **download** – whether to download and extract MedNIST from the resource link; default is False. If the expected file already exists, the download is skipped even when this is set to True. You can also manually copy the MedNIST.tar.gz file or the MedNIST folder to the root directory.
- **seed** – random seed used to split the data into training, validation, and test sets; default is 0.
- **val_frac** – fraction of the whole dataset used for validation; default is 0.1.
- **test_frac** – fraction of the whole dataset used for testing; default is 0.1.
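
Because each section is created by a separate call, the fractions passed to each call need to agree. The sketch below illustrates why, under the assumption of a shuffle-then-slice scheme (test items first, then validation, then training); MONAI's internals may differ, and `section_range` is a hypothetical helper.

```python
def section_range(n_items: int, section: str, val_frac: float, test_frac: float) -> range:
    """Positions (in an already shuffled order) assigned to one section.

    Assumed layout: test items first, then validation, then training.
    """
    test_len = int(n_items * test_frac)
    val_len = int(n_items * val_frac)
    if section == "test":
        return range(0, test_len)
    if section == "validation":
        return range(test_len, test_len + val_len)
    if section == "training":
        return range(test_len + val_len, n_items)
    raise ValueError(f"unknown section: {section}")


n = 1000
# Consistent fractions: the three sections tile the whole dataset.
consistent = [section_range(n, s, val_frac=0.1, test_frac=0.5)
              for s in ("test", "validation", "training")]
print([len(r) for r in consistent])  # [500, 100, 400]

# A different test_frac for the training section leaves a gap of unused items.
train_mismatched = section_range(n, "training", val_frac=0.1, test_frac=0.55)
used = len(consistent[0]) + len(consistent[1]) + len(train_mismatched)
print(n - used)  # 50
```

With matching values the sections cover every item exactly once; with a larger `test_frac` for training only, some items end up in no section at all.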

In [None]:
from monai.apps import MedNISTDataset

# Use the same seed and fractions for every section so the splits stay consistent
train_data = MedNISTDataset(root_dir=dir_path, section="training", download=True, seed=24, val_frac=0.1, test_frac=0.5)
val_data = MedNISTDataset(root_dir=dir_path, section="validation", download=False, seed=24, val_frac=0.1, test_frac=0.5)
test_data = MedNISTDataset(root_dir=dir_path, section="test", download=False, seed=24, val_frac=0.1, test_frac=0.5)


In [None]:
print(f"Length of training dataset: {len(train_data)}")
print(f"Length of validation dataset: {len(val_data)}")
print(f"Length of test dataset: {len(test_data)}")
print(f"Type of train data: {type(train_data)}")
# print(dir(train_data))
# Inspect only the first few validation items to keep the output short
for i in range(min(5, len(val_data))):
    sample = val_data[i]
    image = sample['image']
    label = sample['label']
    print(f"Image shape: {image.shape}, Label: {label}")

### Attributes of MONAI MEDNIST dataset variable

In [None]:
print(dir(train_data))

#### Data exploration

First of all, check the dataset files and show some statistics.  
There are 6 folders in the dataset: Hand, AbdomenCT, CXR, ChestCT, BreastMRI, HeadCT, which should be used as the labels to train our classification model.

In [None]:
import os

mednist_folder = os.path.join(dir_path, 'MedNIST')

if os.path.exists(mednist_folder):
  for subfolder in os.listdir(mednist_folder):
    subfolder_path = os.path.join(mednist_folder, subfolder)
    if os.path.isdir(subfolder_path):
      print(f"Subfolder: {subfolder}")
else:
  print("MedNIST folder not found.")


In [None]:
mednist_folder = os.path.join(dir_path, 'MedNIST')

if os.path.exists(mednist_folder):
  for subfolder in os.listdir(mednist_folder):
    subfolder_path = os.path.join(mednist_folder, subfolder)
    if os.path.isdir(subfolder_path):
      print(f"Subfolder: {subfolder}")
      file_count = 0
      file_types = {}
      for filename in os.listdir(subfolder_path):
        file_count += 1
        file_type = filename.split('.')[-1]
        if file_type in file_types:
          file_types[file_type] += 1
        else:
          file_types[file_type] = 1
      print(f"  Number of files: {file_count}")
      print(f"  File types and counts: {file_types}")
else:
  print("MedNIST folder not found.")
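
The per-folder counting above can also be written more compactly with `pathlib` and `collections.Counter`. This is only a sketch: `count_extensions` is a hypothetical helper, the demo runs on a throwaway directory with placeholder files, and unlike the cell above it lowercases extensions before counting.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_extensions(folder: Path) -> Counter:
    """Count files in folder by lowercase extension (without the dot)."""
    return Counter(
        p.suffix.lower().lstrip(".") for p in folder.iterdir() if p.is_file()
    )


# Demo on a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    for name in ("a.jpeg", "b.jpeg", "c.JPEG", "notes.txt"):
        (demo / name).touch()
    counts = count_extensions(demo)

print(counts["jpeg"], counts["txt"])  # 3 1
```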


In [None]:
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image # Import the Image class from PIL

chestct_folder = os.path.join(mednist_folder, 'ChestCT')
if os.path.exists(chestct_folder):
  image_files = [os.path.join(chestct_folder, f) for f in os.listdir(chestct_folder) if f.endswith('.jpeg')]
  num_images_to_plot = 10

  plt.figure(figsize=(15, 5))
  for i in range(num_images_to_plot):
      image_path = image_files[i]
      image = plt.imread(image_path)
      plt.subplot(1, num_images_to_plot, i + 1)
      plt.imshow(image, cmap='gray')
      plt.axis('off')
      width, height = Image.open(image_path).size
      print(f"  Image: {image_path}, Size: {width} x {height}")

  plt.show()

else:
  print("ChestCT folder not found.")

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image # Import the Image class from PIL

abdomenct_folder = os.path.join(mednist_folder, 'AbdomenCT')
if os.path.exists(abdomenct_folder):
  image_files = [os.path.join(abdomenct_folder, f) for f in os.listdir(abdomenct_folder) if f.endswith('.jpeg')]
  num_images_to_plot = 10

  plt.figure(figsize=(15, 5))
  for i in range(num_images_to_plot):
      image_path = image_files[i]
      image = plt.imread(image_path)
      plt.subplot(1, num_images_to_plot, i + 1)
      plt.imshow(image, cmap='gray')
      plt.axis('off')
      width, height = Image.open(image_path).size
      print(f"  Image: {image_path}, Size: {width} x {height}")

  plt.show()

else:
  print("AbdomenCT folder not found.")

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image # Import the Image class from PIL

cxr_folder = os.path.join(mednist_folder, 'CXR')
if os.path.exists(cxr_folder):
  image_files = [os.path.join(cxr_folder, f) for f in os.listdir(cxr_folder) if f.endswith('.jpeg')]
  num_images_to_plot = 10

  plt.figure(figsize=(15, 5))
  for i in range(num_images_to_plot):
      image_path = image_files[i]
      image = plt.imread(image_path)
      width, height = Image.open(image_path).size
      print(f"  Image: {image_path}, Size: {width} x {height}")
      plt.subplot(1, num_images_to_plot, i + 1)
      plt.imshow(image, cmap='gray')
      plt.axis('off')

  plt.show()

else:
  print("CXR folder not found.")


### Decathlon datasets

- ``DecathlonDataset`` automatically downloads data from the Medical Segmentation Decathlon challenge (http://medicaldecathlon.com/) and generates items for training, validation, or test.

- It also loads the dataset properties from the dataset's JSON config file; they can be retrieved with ``get_properties()``.

- Syntax:
```python
DecathlonDataset(root_dir, task, section, download=False, seed=0, val_frac=0.2, progress=True)
```
Parameters:
- **root_dir** – local directory for caching and loading the MSD datasets.

- **task** – which task to download and use; one of:
    - ``Task01_BrainTumour``
    - ``Task02_Heart``
    - ``Task03_Liver``
    - ``Task04_Hippocampus``
    - ``Task05_Prostate``
    - ``Task06_Lung``
    - ``Task07_Pancreas``
    - ``Task08_HepaticVessel``
    - ``Task09_Spleen``
    - ``Task10_Colon``

- **section** – expected data section: training, validation, or test.

- **download** – whether to download and extract the Decathlon data from the resource link; default is False. If the expected file already exists, the download is skipped even when this is set to True. You can also manually copy the tar file or the dataset folder to the root directory.

- **val_frac** – fraction of the whole dataset used for validation; default is 0.2.

- **seed** – random seed used to shuffle the datalist before splitting it into training and validation; default is 0.
  - **Note**: Set the same seed for the training and validation sections.

- **progress** – whether to display a progress bar when downloading dataset and computing the transform cache content.
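
The note about using the same seed can be illustrated with a small stand-alone sketch. It assumes a seeded shuffle followed by cutting off a validation part, which is the general idea described above rather than MONAI's exact code; `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_items: int, val_frac: float, seed: int):
    """Shuffle item indices with a seeded generator, then cut off a validation part."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_val = int(n_items * val_frac)
    return indices[n_val:], indices[:n_val]  # (training, validation)


train_a, val_a = split_indices(100, val_frac=0.2, seed=0)
_, val_b = split_indices(100, val_frac=0.2, seed=1)

# Same seed for both sections: no item appears in both.
print(len(set(train_a) & set(val_a)))  # 0
# Different seeds: validation items can also appear in the training set.
print(len(set(train_a) & set(val_b)))
```

When the training and validation datasets are built with different seeds, each call shuffles differently, so the validation set is no longer guaranteed to be held out from training.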



In [None]:
from monai.apps import DecathlonDataset

# Specify the task number you want to access (e.g., Task04_Hippocampus)
task_num = "Task04_Hippocampus"

# Create a DecathlonDataset instance for the specified task
train_decathlondataset = DecathlonDataset(root_dir=dir_path, task=task_num, section="training", download=True,val_frac=0.1)
validation_decathlondataset = DecathlonDataset(root_dir=dir_path, task=task_num, section="validation", download=False,val_frac=0.1)

In [None]:
print(f"The length of the training dataset is {len(train_decathlondataset)} and the type of variable is {type(train_decathlondataset)}")
print(f"The length of the validation dataset is {len(validation_decathlondataset)} and the type of variable is {type(validation_decathlondataset)}")
print(dir(validation_decathlondataset))
# Access the 'image' and 'label' dictionaries
for i in range(len(validation_decathlondataset)):
    data = validation_decathlondataset[i]
    image = data["image"]
    label = data["label"]
    print(f"Image: {image.shape}, Label: {label.shape}")

### TCIA Dataset

- ``TciaDataset`` automatically downloads data from a public collection in The Cancer Imaging Archive (TCIA) and generates items for training, validation, or test. [https://www.cancerimagingarchive.net/](https://www.cancerimagingarchive.net/)
- Syntax:
```python
monai.apps.TciaDataset(root_dir, collection, section, transform=(), download=False)
```

- **Massive Public Database**: TCIA provides a huge collection of de-identified medical images (like CT scans, MRIs, and histopathology slides) across a wide range of cancer types. This allows researchers to access diverse data for analysis, development of image-based diagnostic tools, and discovery of new disease insights.

- **Open and Free**: Most TCIA collections are freely available to the public, while some require access approval. This open access promotes collaboration, accelerates research, and encourages the development of innovative cancer imaging applications.

- **Standardized Format**: TCIA uses the DICOM (Digital Imaging and Communications in Medicine) standard for storing and distributing images. This ensures compatibility and makes it easier for researchers to use the data with various image processing and analysis tools.

In [None]:
!pip install pydicom
import pydicom
from monai.apps import TciaDataset

# Specify the collection you want to access (e.g., "Lung Phantom")
collection = "Lung Phantom"

# Create a TciaDataset instance for the specified collection
tcia_dataset = TciaDataset(root_dir=dir_path, collection=collection, section="training", download=True)



In [None]:
print(f"The length of the TCIA dataset is {len(tcia_dataset)} and the type of variable is {type(tcia_dataset)}")

print(dir(tcia_dataset))

## Other datasets

### MEDMNIST
- A large-scale MNIST-like collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D.
- All images are pre-processed into 28x28 (2D) or 28x28x28 (3D) with the corresponding classification labels, so that no background knowledge is required for users.
- Covering the primary data modalities in biomedical imaging, MedMNIST is designed for classification on lightweight 2D and 3D images with various dataset sizes (from 100 to 100,000) and diverse tasks (binary/multi-class, ordinal regression, and multi-label).


<img src="https://raw.githubusercontent.com/MedMNIST/MedMNIST/main/assets/medmnistv2.jpg" width="1738px" height="942px">

In [None]:
!pip install medmnist
from medmnist import INFO, Evaluator


In [None]:
from medmnist import INFO, Evaluator
import medmnist

# Print available datasets and their information
for dataset_name in INFO.keys():
  print(f"Dataset: {dataset_name}")
  print(f"    Description: {INFO[dataset_name]['description']}")
  # Check if the 'input_shape' key exists before accessing it
  if 'input_shape' in INFO[dataset_name]:
    print(f"    Data shape: {INFO[dataset_name]['input_shape']}")
  else:
    print(f"    Data shape: Not available in INFO")  # Handle the case where 'shape' is missing
  print(f"    Labels: {INFO[dataset_name]['label']}")
  print(f"    Download URL: {INFO[dataset_name]['url']}")
  print("-" * 20)


### Downloading data

In [None]:
from medmnist import PathMNIST

# Create an instance of the PathMNIST dataset
dataset = PathMNIST(root=dir_path,split='train', download=True)

# Get a list of attributes using dir()
attributes = dir(dataset)

# Print the attributes
for attr in attributes:
    print(attr)

### Unzip

In [None]:
!unzip /content/pathmnist.npz -d /content/PathMNIST
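
The `unzip` command works here because an `.npz` file is an ordinary zip archive whose members are `.npy` files. The sketch below lists the members of such an archive with the standard library, without extracting anything. The archive is a stand-in built in memory with placeholder bytes, and its member names are illustrative. NumPy's `np.load` can also open an `.npz` file directly, with keys matching the member names minus the `.npy` suffix.

```python
import io
import zipfile

# Build a tiny stand-in archive in memory; the member contents are placeholder
# bytes, not real arrays. A real .npz would hold valid .npy files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    for member in ("train_images.npy", "val_images.npy", "val_labels.npy"):
        archive.writestr(member, b"placeholder")

# List the members without extracting anything to disk.
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    members = archive.namelist()

print(members)  # ['train_images.npy', 'val_images.npy', 'val_labels.npy']
```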

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Load the validation images from the .npy file
val_images = np.load('/content/PathMNIST/val_images.npy')
print(val_images.shape)

In [None]:
val_labels = np.load('/content/PathMNIST/val_labels.npy')
print(val_labels.shape)

### Visualization

Labels: {'0': 'adipose', '1': 'background', '2': 'debris', '3': 'lymphocytes', '4': 'mucus', '5': 'smooth muscle', '6': 'normal colon mucosa', '7': 'cancer-associated stroma', '8': 'colorectal adenocarcinoma epithelium'}
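
For plot titles it can be handy to turn a numeric label into its class name. A minimal sketch using the mapping listed above follows; `label_name` is a hypothetical helper, and it accepts an int, a string, or a one-element list or tuple.

```python
# Mapping copied from the label list above.
PATHMNIST_LABELS = {
    "0": "adipose",
    "1": "background",
    "2": "debris",
    "3": "lymphocytes",
    "4": "mucus",
    "5": "smooth muscle",
    "6": "normal colon mucosa",
    "7": "cancer-associated stroma",
    "8": "colorectal adenocarcinoma epithelium",
}


def label_name(label) -> str:
    """Return the class name for a label given as an int, a str, or a one-element sequence."""
    if isinstance(label, (list, tuple)):
        label = label[0]
    return PATHMNIST_LABELS[str(int(label))]


print(label_name(0))    # adipose
print(label_name([8]))  # colorectal adenocarcinoma epithelium
```

For a one-element NumPy label array, converting it first with `.item()` gives a plain integer that can be passed to this helper.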
    

In [None]:

# Create a 3x3 grid for visualization
fig, axes = plt.subplots(3, 3, figsize=(10, 10))

# Iterate through the first 9 images in the validation dataset
for i in range(9):
  row = i // 3
  col = i % 3
  axes[row, col].imshow(val_images[i], cmap='gray')
  axes[row, col].set_title(f"Image {i+1}\nSize: {val_images[i].shape}\nLabel: {val_labels[i]}")
  axes[row, col].axis('off')

plt.tight_layout()
plt.show()

Folder to view the medmnist mapping: [medmnist_fileid](https://drive.google.com/drive/folders/1A_99qH_c-J0p_SatwSiaP_i1CvLUOzVo)

## Tasks


### Task 1

- Plot one image from each class of the already downloaded MedNIST dataset

In [None]:
import matplotlib.pyplot as plt
import os

mednist_folder = os.path.join(os.getcwd(), 'MedNIST')

if os.path.exists(mednist_folder):
  class_images = {}
  for subfolder in os.listdir(mednist_folder):
    subfolder_path = os.path.join(mednist_folder, subfolder)
    if os.path.isdir(subfolder_path):
      image_files = [os.path.join(subfolder_path, f) for f in os.listdir(subfolder_path) if f.endswith('.jpeg')]
      if image_files:
        class_images[subfolder] = image_files[0]  # Take the first image from each class

  if class_images:
    num_classes = len(class_images)
    rows = (num_classes + 3) // 4  # Adjust rows based on number of classes
    cols = min(num_classes, 4)
    plt.figure(figsize=(15, rows * 3.5))

    for i, (class_name, image_path) in enumerate(class_images.items()):
      plt.subplot(rows, cols, i + 1)
      image = plt.imread(image_path)
      plt.imshow(image, cmap='gray')
      plt.title(class_name)
      plt.axis('off')

    plt.tight_layout()
    plt.show()
else:
  print("MedNIST folder not found.")
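
The row and column arithmetic used for the subplot grid above is a ceiling division. A small sketch of the same calculation as a reusable function follows; `grid_shape` is a hypothetical helper, not part of Matplotlib.

```python
import math


def grid_shape(n_items: int, max_cols: int = 4):
    """Return (rows, cols) for a grid that fits n_items with at most max_cols columns."""
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    cols = min(n_items, max_cols)
    rows = math.ceil(n_items / cols)
    return rows, cols


print(grid_shape(6))  # (2, 4)
print(grid_shape(3))  # (1, 3)
print(grid_shape(9))  # (3, 4)
```

For the six MedNIST classes this gives a 2 x 4 grid, with the last two cells left empty.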

### Task 2

- Download the Decathlon dataset for the Spleen task with a 60:40 train:validation split.
- Load the samples into a dataset variable and visualize the mid slice of each 3D volume along with its label
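
A 60:40 split corresponds to a validation fraction of 0.4. The sketch below converts a ratio into a fraction and estimates the resulting dataset sizes. It assumes that the validation count is truncated to an integer and that there are 41 labelled volumes; both are assumptions to check against `len()` on the real datasets, and the helper names are hypothetical.

```python
def ratio_to_val_frac(train_part: float, val_part: float) -> float:
    """Convert a train:val ratio such as 60:40 into a validation fraction."""
    return val_part / (train_part + val_part)


def split_sizes(n_items: int, val_frac: float):
    """Return (n_train, n_val), assuming the validation count is truncated."""
    n_val = int(n_items * val_frac)
    return n_items - n_val, n_val


val_frac = ratio_to_val_frac(60, 40)
print(val_frac)  # 0.4

# 41 is an assumed number of labelled volumes, used only for illustration.
n_train, n_val = split_sizes(41, val_frac)
print(n_train, n_val)  # 25 16
```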

In [None]:
task_num = "Task09_Spleen"

# Create a DecathlonDataset instance for the specified task with a 60:40 train:val split
train_decathlondataset = DecathlonDataset(root_dir=dir_path, task=task_num, section="training", download=True, val_frac=0.4, seed=123)
validation_decathlondataset = DecathlonDataset(root_dir=dir_path, task=task_num, section="validation", download=False, val_frac=0.4, seed=123)

print(f"The length of the training dataset is {len(train_decathlondataset)}")
print(f"The length of the validation dataset is {len(validation_decathlondataset)}")

# Load samples into a dataset variable to visualize the mid slice of the 3D volume along with their labels
for i in range(min(len(validation_decathlondataset), 3)):  # Visualize the first 3 validation samples
    data = validation_decathlondataset[i]
    image = data["image"]
    label = data["label"]
    print(f"Image shape: {image.shape}, Label shape: {label.shape}")

    # Take the middle slice along the last axis (assumed here to be the axial direction)
    mid_slice_image = image[:, :, image.shape[-1] // 2]
    mid_slice_label = label[:, :, label.shape[-1] // 2]

    plt.figure(figsize=(10, 5))
    plt.subplot(1, 2, 1)
    plt.imshow(mid_slice_image, cmap='gray')
    plt.title('Mid-Slice Image')
    plt.axis('off')

    plt.subplot(1, 2, 2)
    plt.imshow(mid_slice_label, cmap='gray')
    plt.title('Mid-Slice Label')
    plt.axis('off')

    plt.show()