# Data Preparation Tutorial
Contacts: eyuboglu@stanford.edu, gangus@stanford.edu

This tutorial shows how to prepare data for our multi-task, weak supervision framework. We use [HDF5](https://www.hdfgroup.org/solutions/hdf5/), a file format optimized for high-dimensional heterogeneous datasets like ours. HDF5 lets us store all of the data for each exam (scan, report, and metadata) in one place. Since the HDF5 interface may be unfamiliar, below we walk through how to prepare an HDF5 dataset with volumetric imaging data for use in our framework.

In this notebook we:  

1. Go through the motions of preparing an HDF5 dataset for use with our framework. We prepare the dataset with dummy data, but show how you can replace a few functions with custom ones for loading your own data. 

2. Show how we can use this HDF5 dataset with the PyTorch `Dataset` classes we've implemented such as `pet_ct.learn.datasets.MTClassifierDataset`. (TODO)

## Setup
Import various packages. Make sure you're in an environment with the `pet_ct` package installed.

In [7]:
# import requirements
%load_ext autoreload
%autoreload 2

import os
import json

import numpy as np 
import pandas as pd
import h5py


In [3]:
# TODO: change to package directory
os.chdir("/Users/sabrieyuboglu/Documents/sabri/research/projects/fdg-pet-ct/pet-ct")

experiment_dir = "tutorials/01_data"

## 1) Creating a Toy HDF5 Dataset

The data-loading interface we provide in `pet_ct.learn.datasets` expects data stored in a particular format within an [HDF5](https://www.hdfgroup.org/solutions/hdf5/) file. Below we walk through generating an HDF5 file for use in our framework. We fill the dataset with random data using the `get_toy_exam` function below; to use your own data, implement the `get_exam` function below so that it fetches a volumetric imaging exam from your dataset.

We use the [`h5py` package](https://www.h5py.org/), a Pythonic interface to the HDF5 file format. For details on any of the h5py functions and classes used below, see the [documentation](http://docs.h5py.org/en/stable/).
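If you have not used h5py before, the following minimal sketch may help before reading the cells below. It shows the two HDF5 concepts this tutorial relies on: groups, which behave like directories, and datasets, which behave like arrays stored on disk. The file name `demo.hdf5`, the group name `demo_exam`, and the array shape are made up for illustration and are not part of the framework.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file, used only for this illustration.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.hdf5")

with h5py.File(demo_path, "w") as f:
    # Groups behave like directories; missing intermediate groups are created.
    exam = f.create_group("exams/demo_exam")
    # Datasets behave like NumPy arrays stored on disk.
    exam.create_dataset("ct", data=np.zeros((4, 8, 8), dtype=np.float32))

with h5py.File(demo_path, "r") as f:
    ct = f["exams/demo_exam/ct"]
    demo_shape = ct.shape  # shape metadata is available without loading the array
    first_slice = ct[0]  # slicing reads only the requested part into memory
    demo_keys = list(f["exams"].keys())
```

Because slicing reads only the requested region, a single exam can be loaded without pulling the rest of the file into memory.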

In [30]:
def get_toy_exam(): 
    """ Generates a simulated PET/CT exam with random
    data.
    """       
    num_slices = np.random.randint(low=20, high=30)
    ct_images = np.random.normal(size=(num_slices, 512, 512))
    pet_images = np.random.normal(size=(num_slices, 128, 128))
    report = "IMPRESSION: This is a toy report."
    return ct_images, pet_images, report

## TODO: Implement a get_exam function to work with your data
def get_exam(exam_path: str):
    """ Write your own function for fetching an imaging exam in your
    dataset. It returns the same three values as `get_toy_exam`.
    args:
        exam_path (str): the path to the exam data in your filesystem
    returns:
        ct_images (np.ndarray): a numpy array of shape (num_slices, 512, 512)
            where num_slices can be variable. 
        pet_images (np.ndarray): a numpy array of shape (num_slices, 128, 128)
            where num_slices can be variable. 
        report (str): the text of the report for the exam
    """
    ct_images = None  # (num_slices, 512, 512)
    pet_images = None  # (num_slices, 128, 128)
    report = None  # str
    return ct_images, pet_images, report
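As one possible starting point, here is a sketch of a loader for your own data. It assumes a hypothetical on-disk layout in which each exam directory holds `ct.npy`, `pet.npy`, and `report.txt`; these file names and the function name `get_exam_from_npy` are illustrative assumptions, not something the framework prescribes. It also returns the report text, since the dataset-building loop in this tutorial stores one report per exam.

```python
import os

import numpy as np


def get_exam_from_npy(exam_path: str):
    """Load one exam from a directory, assuming a hypothetical layout of
    ct.npy, pet.npy and report.txt inside `exam_path`.
    """
    ct_images = np.load(os.path.join(exam_path, "ct.npy"))
    pet_images = np.load(os.path.join(exam_path, "pet.npy"))
    with open(os.path.join(exam_path, "report.txt"), encoding="utf-8") as f:
        report = f.read()

    # Check the shapes documented in this tutorial: (num_slices, 512, 512)
    # for CT and (num_slices, 128, 128) for PET.
    if ct_images.ndim != 3 or ct_images.shape[1:] != (512, 512):
        raise ValueError(f"unexpected CT shape {ct_images.shape}")
    if pet_images.ndim != 3 or pet_images.shape[1:] != (128, 128):
        raise ValueError(f"unexpected PET shape {pet_images.shape}")
    return ct_images, pet_images, report
```

Validating shapes at load time surfaces a malformed exam immediately rather than later, during training.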

`tutorials/01_data/toy_dataset/exams.csv`: We've built a CSV of toy exams indexed by "exam_id", a unique identifier for each exam, and with a column "exam_path" specifying where in the filesystem the data is stored. 

TODO: Build a CSV for your data with matching format.
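A CSV in this format can be produced with pandas along the following lines. The exam IDs, paths, and patient IDs below are placeholders, and the output location is a temporary directory chosen only for this sketch; substitute your own records and path.

```python
import os
import tempfile

import pandas as pd

# Placeholder records; replace with the exams in your own dataset.
records = [
    {"exam_id": "xid0000001",
     "exam_path": "data/my_dataset/exams/xid0000001",
     "patient_id": "pid0000001"},
    {"exam_id": "xid0000002",
     "exam_path": "data/my_dataset/exams/xid0000002",
     "patient_id": "pid0000002"},
]

my_exams_df = pd.DataFrame.from_records(records).set_index("exam_id")

# "exam_id" is described above as a unique identifier, so fail on duplicates.
if not my_exams_df.index.is_unique:
    raise ValueError("exam_id values must be unique")

# Hypothetical output location for this sketch.
csv_path = os.path.join(tempfile.mkdtemp(), "exams.csv")
my_exams_df.to_csv(csv_path)

# Reading it back with index_col=0 recovers "exam_id" as the index.
reloaded_df = pd.read_csv(csv_path, index_col=0)
```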

In [39]:
from IPython.display import display

exams_df = pd.read_csv("tutorials/01_data/toy_dataset/exams.csv", index_col=0)
with pd.option_context('display.max_rows', 10):
    display(exams_df)

| exam_id | exam_path | patient_id |
|---|---|---|
| xid6412379 | data/toy_dataset/exams/xid6412379 | pid8389665 |
| xid1706528 | data/toy_dataset/exams/xid1706528 | pid5241154 |
| xid3418793 | data/toy_dataset/exams/xid3418793 | pid7287107 |
| xid9559466 | data/toy_dataset/exams/xid9559466 | pid2357355 |
| xid8868720 | data/toy_dataset/exams/xid8868720 | pid7656390 |
| ... | ... | ... |
| xid2711892 | data/toy_dataset/exams/xid2711892 | pid2306946 |
| xid9200258 | data/toy_dataset/exams/xid9200258 | pid7468753 |
| xid9725040 | data/toy_dataset/exams/xid9725040 | pid2414282 |
| xid7079800 | data/toy_dataset/exams/xid7079800 | pid9483594 |


In [37]:
# build the dataset
dataset_path = "data/toy_dataset/data.hdf5"

# the context manager closes the file even if an error occurs part-way through
with h5py.File(dataset_path, 'a') as file:
    # require_group, unlike create_group, does not fail if "exams" already exists
    exams = file.require_group("exams")

    for exam_id, exam in exams_df.iterrows():
        ct_images, pet_images, report = get_toy_exam()

        exam_group = exams.create_group(exam_id)
        # note: an HDF5 "dataset" is a single array stored in the file, not a
        # dataset in the machine learning sense, hence the perhaps confusing
        # function name `create_dataset`. Each exam gets its own group
        # holding three HDF5 datasets.
        exam_group.create_dataset("ct", data=ct_images)
        exam_group.create_dataset("pet", data=pet_images)
        # pass the Python string directly: np.string_ was removed in NumPy 2.0
        exam_group.create_dataset("report",
                                  data=report,
                                  dtype=h5py.special_dtype(vlen=str))
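After building the file, it can be useful to read it back and confirm the layout. The sketch below defines a hypothetical helper, `summarize_exams`, that assumes the `exams/<exam_id>/{ct, pet, report}` layout used above. To keep it self-contained, it is demonstrated on a miniature stand-in file with small arrays rather than on `data/toy_dataset/data.hdf5`.

```python
import os
import tempfile

import h5py
import numpy as np


def summarize_exams(dataset_path: str) -> dict:
    """Return {exam_id: (ct_shape, pet_shape, report)} for every exam."""
    summary = {}
    with h5py.File(dataset_path, "r") as f:
        for exam_id, exam in f["exams"].items():
            raw = exam["report"][()]
            # Depending on the h5py version, strings come back as bytes or str.
            report = raw.decode("utf-8") if isinstance(raw, bytes) else str(raw)
            summary[exam_id] = (exam["ct"].shape, exam["pet"].shape, report)
    return summary


# Miniature stand-in file with the same layout (hypothetical ID, tiny arrays).
check_path = os.path.join(tempfile.mkdtemp(), "check.hdf5")
with h5py.File(check_path, "w") as f:
    exam = f.create_group("exams/xid0000001")
    exam.create_dataset("ct", data=np.zeros((2, 16, 16)))
    exam.create_dataset("pet", data=np.zeros((2, 4, 4)))
    exam.create_dataset("report",
                        data="IMPRESSION: This is a toy report.",
                        dtype=h5py.special_dtype(vlen=str))

summary = summarize_exams(check_path)
```

Pointing `summarize_exams` at your own file gives a quick check that every exam has the expected array shapes and a readable report.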