# Preparing the dataset

This notebook can be used to reproduce the full dataset used for training.

First, download *and* extract the following archives into a folder ```<path-to/IXI>```:
- [T1 images](http://biomedic.doc.ic.ac.uk/brain-development/downloads/IXI/IXI-T1.tar)
- [T2 images](http://biomedic.doc.ic.ac.uk/brain-development/downloads/IXI/IXI-T2.tar)
- [PD images](http://biomedic.doc.ic.ac.uk/brain-development/downloads/IXI/IXI-PD.tar)
- [MRA images](http://biomedic.doc.ic.ac.uk/brain-development/downloads/IXI/IXI-MRA.tar)

The following cell sets up the paths for the data:

In [None]:
ixi_raw = '<path-to/IXI>'               # set to the folder containing the extracted IXI data
ds_raw = '<path-to/Dataset600_IXI>'     # set to the Dataset600_IXI folder checked out with this repository
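
Before running the copy step, it can help to confirm that both placeholders were replaced with real folders. The following is a minimal sketch, not part of the original notebook: the helper name `find_path_problems` is hypothetical, and it assumes that the checked-out dataset folder already contains `labelsTr` and `labelsTs`, since the copy cell looks for existing label files there.

```python
import os


def find_path_problems(ixi_root, ds_root):
    """Return a list of human-readable problems with the configured paths.

    Hypothetical helper: assumes the dataset folder ships with the
    'labelsTr' and 'labelsTs' subfolders.
    """
    problems = []
    if not os.path.isdir(ixi_root):
        problems.append(f'IXI folder not found: {ixi_root}')
    for sub in ('labelsTr', 'labelsTs'):
        label_dir = os.path.join(ds_root, sub)
        if not os.path.isdir(label_dir):
            problems.append(f'label folder not found: {label_dir}')
    return problems
```

Calling `find_path_problems(ixi_raw, ds_raw)` should return an empty list once both paths are set correctly.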

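Execute the following cell once to copy the extracted files into the dataset folder. Each image that has a matching label is reoriented to RAI and written with a ```_0000``` suffix; images from HH and Guys go to imagesTr, images from IOP go to imagesTs: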

In [None]:
import SimpleITK as sitk
import os
from glob import glob

cnt = 0
# Each archive is expected to be extracted into its own subfolder of ixi_raw.
fps = glob(os.path.join(ixi_raw, '*', 'IXI*-*-*-*.nii.gz'))
print(f"Found {len(fps)} files in the IXI source folder")

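for fp in fps:
    fn = os.path.basename(fp)
    # e.g. 'IXI002-Guys-0828-MRA.nii.gz' -> bn='IXI002-Guys-0828-MRA', ext='nii.gz'
    bn, ext = fn.split('.', maxsplit=1)
    # name='IXI002', inst='Guys' (acquisition site), nr='0828', mod='MRA' (modality)
    name, inst, nr, mod = bn.split('-')
    img_fn = f'{bn}_0000.{ext}'  # image filename with the _0000 suffix
    lbl_fn = f'{bn}.{ext}'       # label filename shipped with the dataset folder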
    img_fp = None
    if inst in ('HH', 'Guys'):
        img_fp = os.path.join(ds_raw, 'imagesTr', img_fn)
        lbl_fp = os.path.join(ds_raw, 'labelsTr', lbl_fn)
    elif inst == 'IOP':
        img_fp = os.path.join(ds_raw, 'imagesTs', img_fn)
        lbl_fp = os.path.join(ds_raw, 'labelsTs', lbl_fn)
        
    if img_fp is not None and os.path.exists(lbl_fp):
        cnt += 1
        if not os.path.exists(img_fp):
            print(f'copying: {fn} to {img_fp}')
            img = sitk.ReadImage(fp)
            img = sitk.DICOMOrient(img, 'RAI')  # reorient to RAI
            sitk.WriteImage(img, img_fp, True)  # True enables compression
        
assert cnt == 2218, f"Expected 2218 images with labels, found {cnt}"
print("Dataset is complete!")

            

Note that not all subjects (IDs) available in the IXI dataset are used.
In the following, an ID is the first numeric part of the filename, e.g., 002 for ```IXI002-Guys-0828-MRA.nii.gz```.
IDs have been excluded where either:
* the MRA or T1 image was not included (13 automatically excluded IDs):
    * Training: 014, 116, 182, 213, 225, 255, 302, 578, 589, 651
    * Test: 233, 309, 580
* the ground-truth generation algorithm yielded incorrect results (14 manually screened and excluded IDs):
    * Training: 113, 169, 229, 244, 299, 415, 440, 457, 458, 468, 490, 585
    * Test: 371, 478
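
The exclusion lists above can be expressed as sets for quick lookups. This is an illustrative sketch only: the names `MISSING_MRA_OR_T1`, `MANUALLY_EXCLUDED`, `parse_ixi_id`, and `is_excluded` are hypothetical and do not appear in the notebook, and the filename passed to `is_excluded` in the usage note is made up for demonstration.

```python
import os

# IDs excluded because the MRA or T1 image was not included (13 IDs).
MISSING_MRA_OR_T1 = {
    14, 116, 182, 213, 225, 255, 302, 578, 589, 651,  # training
    233, 309, 580,                                    # test
}

# IDs excluded after manual screening of the generated ground truth (14 IDs).
MANUALLY_EXCLUDED = {
    113, 169, 229, 244, 299, 415, 440, 457, 458, 468, 490, 585,  # training
    371, 478,                                                    # test
}


def parse_ixi_id(filename):
    """Extract the numeric ID from a name like 'IXI002-Guys-0828-MRA.nii.gz'."""
    base = os.path.basename(filename).split('.', maxsplit=1)[0]
    name = base.split('-')[0]  # 'IXI002'
    return int(name[3:])       # drop the 'IXI' prefix


def is_excluded(filename):
    """Return True if the file belongs to one of the excluded IDs."""
    return parse_ixi_id(filename) in (MISSING_MRA_OR_T1 | MANUALLY_EXCLUDED)
```

For example, `parse_ixi_id('IXI002-Guys-0828-MRA.nii.gz')` yields `2`, and `is_excluded` returns `True` for any filename whose ID is in one of the two sets.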

The final dataset should contain 2218 images and their respective labels: 1938 in imagesTr/labelsTr and 280 in imagesTs/labelsTs.
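
The expected counts can be checked per folder after the copy step. The sketch below is an assumption-laden illustration rather than part of the original notebook: the names `EXPECTED_COUNTS`, `count_dataset_files`, and `find_count_mismatches` are hypothetical, and it assumes every image and label is stored as a `.nii.gz` file with exactly one label per image.

```python
import os
from glob import glob

# Expected number of .nii.gz files per subfolder (1938 + 280 = 2218).
EXPECTED_COUNTS = {
    'imagesTr': 1938,
    'labelsTr': 1938,
    'imagesTs': 280,
    'labelsTs': 280,
}


def count_dataset_files(ds_root):
    """Count the .nii.gz files in each expected subfolder of ds_root."""
    return {
        sub: len(glob(os.path.join(ds_root, sub, '*.nii.gz')))
        for sub in EXPECTED_COUNTS
    }


def find_count_mismatches(ds_root):
    """Return {subfolder: (found, expected)} for folders with a wrong count."""
    counts = count_dataset_files(ds_root)
    return {
        sub: (found, EXPECTED_COUNTS[sub])
        for sub, found in counts.items()
        if found != EXPECTED_COUNTS[sub]
    }
```

Running `find_count_mismatches(ds_raw)` should return an empty dictionary for a complete dataset; any entry points to the folder whose file count differs from the numbers above.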