# Data Preparation (Manual)

This notebook assumes you have prepared the data according to the manual method shown in this YouTube video: https://www.youtube.com/watch?v=M3ZWfamWrBM

Prior steps involved:
1. Create a 'dicom_file' folder (inside 'data') to store all intermediate DICOM data
2. Create 'images' and 'labels' folders in 'dicom_file' to store the inputs (data) and outputs (labels)
3. For each patient, use 3D Slicer to convert their image and segmentation data into images and labels
4. Create a 'dicom_groups' folder to store all subsampled intermediate data
5. Create 'images' and 'labels' folders in 'dicom_groups' to store the inputs (data) and outputs (labels)
6. Create a 'nifti_files' folder to store the NIFTI outputs
7. Create 'images' and 'labels' folders in 'nifti_files' to store the inputs (data) and outputs (labels)
8. Create a 'task_data' folder to store the final data
9. Create 'TrainVolumes', 'TrainSegmentation', 'TestVolumes' and 'TestSegmentation' folders in 'task_data' to separate the data for each use case

For a detailed explanation and the final folder layout, see readme.txt in the 'data' folder.
If you are integrating 'nifti_files' data that has already been processed by other group members, you only need the folders from step 6 onwards. Their processed files can be copied or moved directly into 'nifti_files'.
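
The folder layout from the steps above could also be created programmatically instead of by hand. The following is a minimal sketch, not part of the original workflow: `LAYOUT` and `create_layout` are hypothetical names, the group folder is spelled 'dicom_groups' to match the paths in the configuration cell, and the demonstration runs in a temporary directory so no real data is touched.

```python
import os
import tempfile

# Folder layout taken from steps 1-9 above. The names LAYOUT and
# create_layout are illustrative and not part of the original notebook.
LAYOUT = {
    "dicom_file": ["images", "labels"],
    "dicom_groups": ["images", "labels"],
    "nifti_files": ["images", "labels"],
    "task_data": ["TrainVolumes", "TrainSegmentation",
                  "TestVolumes", "TestSegmentation"],
}


def create_layout(root):
    """Create every folder in LAYOUT under root; existing folders are left as they are."""
    created = []
    for parent, children in LAYOUT.items():
        for child in children:
            path = os.path.join(root, parent, child)
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created


# Demonstrate on a throwaway directory so nothing real is modified.
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = create_layout(os.path.join(tmp, "data"))
    demo_count = len(demo_paths)
    demo_all_exist = all(os.path.isdir(p) for p in demo_paths)
    print(demo_count, "folders created")
```

To build the real layout, one would call `create_layout("data")` from the notebook's working directory.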

In [1]:
# define folders containing dicom and nifti intermediates

in_images_dir = "data/dicom_file/images"
out_images_dir = "data/dicom_groups/images"
out_nifti_img_dir = "data/nifti_files/images/"

in_labels_dir = "data/dicom_file/labels"
out_labels_dir = "data/dicom_groups/labels"
out_nifti_lbl_dir = "data/nifti_files/labels/"

out_nifti_img_full_dir = "data/nifti_files/images_full/"
out_nifti_lbl_full_dir = "data/nifti_files/labels_full/"

# define folder to store testing and training folders

train_images_dir = "data/task_data/TrainVolumes/"
train_labels_dir = "data/task_data/TrainSegmentation/"
test_images_dir = "data/task_data/TestVolumes/"
test_labels_dir = "data/task_data/TestSegmentation/"

train_images_full_dir = "data/task_data/TrainVolumes_full/"
train_labels_full_dir = "data/task_data/TrainLabels_full/"
test_images_full_dir = "data/task_data/TestVolumes_full/"
test_labels_full_dir = "data/task_data/TestLabels_full/"

# define number of slices
num_slices = 64

# define proportion of test and training data (0-1)
train_proportion = 0.8

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
# import required packages

import os
from glob import glob
import shutil
import logging
import numpy as np

from preporcess import create_groups, dcm2nifti

### Step 1: Split DICOM files into similarly sized groups
Before splitting the DICOM data, print the list of target directories for confirmation. This step assumes that you have already converted the NIFTI data into DICOM series as per the YouTube video instructions (see top) and placed them in the 'dicom_file' folder. CT images are stored in 'images', while segmentation masks are stored in 'labels'. In each of these folders, every patient's DICOM series is stored in one folder with a unique name.
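
Beyond reading the printed list by eye, it can help to check that every patient folder under 'images' has a counterpart under 'labels'. This is a small sketch under that assumption; `unmatched_patients` is a hypothetical helper, and the patient names in the demonstration are made up.

```python
import os
import tempfile


def unmatched_patients(images_dir, labels_dir):
    """Return (only_in_images, only_in_labels) as sorted lists of folder names."""
    image_names = {e.name for e in os.scandir(images_dir) if e.is_dir()}
    label_names = {e.name for e in os.scandir(labels_dir) if e.is_dir()}
    return sorted(image_names - label_names), sorted(label_names - image_names)


# Self-contained demonstration with made-up patient names.
with tempfile.TemporaryDirectory() as tmp:
    for sub, names in (("images", ["p1", "p2"]), ("labels", ["p2", "p3"])):
        for name in names:
            os.makedirs(os.path.join(tmp, sub, name))
    demo_result = unmatched_patients(os.path.join(tmp, "images"),
                                     os.path.join(tmp, "labels"))
    print(demo_result)
```

In the notebook, the call would be `unmatched_patients(in_images_dir, in_labels_dir)`; two empty lists mean the folders line up.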

In [5]:
# print image data
for patient in sorted(glob(in_images_dir + "/*")):
    print(patient)

# print label data
for patient in sorted(glob(in_labels_dir + "/*")):
    print(patient)

data/dicom_file/images/liver_105
data/dicom_file/images/liver_106
data/dicom_file/images/liver_107
data/dicom_file/images/liver_108
data/dicom_file/images/liver_109
data/dicom_file/images/liver_110
data/dicom_file/images/liver_111
data/dicom_file/images/liver_112
data/dicom_file/images/liver_113
data/dicom_file/images/liver_114
data/dicom_file/images/liver_115
data/dicom_file/images/liver_116
data/dicom_file/images/liver_117
data/dicom_file/images/liver_118
data/dicom_file/images/liver_119
data/dicom_file/images/liver_120
data/dicom_file/images/liver_121
data/dicom_file/images/liver_122
data/dicom_file/images/liver_123
data/dicom_file/images/liver_124
data/dicom_file/images/liver_125
data/dicom_file/images/liver_126
data/dicom_file/images/liver_127
data/dicom_file/images/liver_128
data/dicom_file/images/liver_129
data/dicom_file/images/liver_130
data/dicom_file/labels/liver_105
data/dicom_file/labels/liver_106
data/dicom_file/labels/liver_107
data/dicom_file/labels/liver_108
data/dicom

After verifying the target folders above, run the splitting tool provided by the original author. This code assumes you have created a 'dicom_groups' folder containing 'images' and 'labels' folders. Neither 'images' nor 'labels' needs to contain any other folders.
WARNING: the original code moves the data (rather than copying it) to save space.
Data is moved to a new set of folders in 'dicom_groups'. The folder layout of 'dicom_groups' is similar to that of 'dicom_file'; however, instead of holding a single patient's data, each folder now holds one 3D segment extracted from a patient. Where there is not enough data to form a new segment, the remaining data is left in the original folder.
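
To see what to expect from the split, the arithmetic can be sketched as follows. This assumes the tool forms as many complete groups of `num_slices` files as possible and leaves the remainder behind, which is how the description above reads; `plan_groups` is a hypothetical helper and not the author's `create_groups`.

```python
def plan_groups(n_slices, num_slices):
    """Return (full_groups, leftover_slices) for one patient folder."""
    if num_slices <= 0:
        raise ValueError("num_slices must be positive")
    # divmod gives the number of complete groups and the remainder in one call.
    return divmod(n_slices, num_slices)


print(plan_groups(200, 64))  # 3 complete groups, 8 slices left behind
print(plan_groups(50, 64))   # no complete group, all 50 slices stay put
```

Under this assumption, a patient with fewer than `num_slices` files contributes no group at all.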

In [6]:
# split images
create_groups(in_images_dir, out_images_dir, num_slices)
# split labels
create_groups(in_labels_dir, out_labels_dir, num_slices)

### Optional: Move split DICOM files back into original folders

If we need to change how the splitting is done, or would like to undo the step above, the code below moves the files back from 'dicom_groups' to 'dicom_file' and deletes the folders holding the previously split data. This avoids having to regenerate the data when redoing the splitting process.

In [4]:
# move back images
for patient in glob(in_images_dir + "/*"):
    head, tail = os.path.split(patient)
    for sub_patient in glob(out_images_dir + "/" + tail + "*"):
        if len(os.listdir(sub_patient)) != 0:
            for file in glob(sub_patient + "/*"):
                shutil.move(file, patient)
        shutil.rmtree(sub_patient)
        
# move back labels
for patient in glob(in_labels_dir + "/*"):
    head, tail = os.path.split(patient)
    for sub_patient in glob(out_labels_dir + "/" + tail + "*"):
        if len(os.listdir(sub_patient)) != 0:
            for file in glob(sub_patient + "/*"):
                shutil.move(file, patient)
        shutil.rmtree(sub_patient)
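
One thing to watch in the cell above: the pattern `tail + "*"` matches by prefix, so a patient whose name is a prefix of another patient's name would also pick up the other patient's groups. The sketch below shows a stricter match, assuming group folders are named `<patient>_<index>`; that naming is an assumption about `create_groups`, and `belongs_to` is a hypothetical helper.

```python
import re


def belongs_to(patient_name, group_name):
    """True if group_name is patient_name followed by '_' and a number."""
    return re.fullmatch(re.escape(patient_name) + r"_\d+", group_name) is not None


# Made-up folder names: two groups of 'liver_1' and one group of 'liver_10'.
groups = ["liver_1_0", "liver_1_1", "liver_10_0"]
prefix_matches = [g for g in groups if g.startswith("liver_1")]
strict_matches = [g for g in groups if belongs_to("liver_1", g)]
print(prefix_matches)  # the prefix match also catches 'liver_10_0'
print(strict_matches)  # the strict match keeps only the 'liver_1' groups
```

With the patient names listed earlier (all three digits long), the prefix match is not ambiguous, so this only matters for other naming schemes.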

### Step 2: Convert data back into nifti file format

Now that the data has been split into segments, we need to convert it back to the NIFTI file format. Prior to this step, it is assumed that a 'nifti_files' folder containing 'images' and 'labels' folders exists. Neither of these folders needs to contain any subfolders. The code below converts the DICOM series data in 'dicom_groups' to NIFTI data in 'nifti_files'.

Note: when incorporating processed data from other group members, you can also download data from the group data drive and directly place them in 'nifti_files'.

In [None]:
# convert images
dcm2nifti(out_images_dir, out_nifti_img_dir)
# convert labels
dcm2nifti(out_labels_dir, out_nifti_lbl_dir)
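
After the conversion, one might want to confirm that every group folder produced a NIFTI file. The following sketch assumes each output file is named after its group folder with a `.nii.gz` suffix; that naming is an assumption about `dcm2nifti`, and `missing_nifti` is a hypothetical helper.

```python
import os
import tempfile


def missing_nifti(groups_dir, nifti_dir, suffix=".nii.gz"):
    """List group folders that have no '<folder name><suffix>' file in nifti_dir."""
    groups = sorted(e.name for e in os.scandir(groups_dir) if e.is_dir())
    return [g for g in groups
            if not os.path.isfile(os.path.join(nifti_dir, g + suffix))]


# Demonstration with made-up names: two groups, only one of them converted.
with tempfile.TemporaryDirectory() as tmp:
    groups_dir = os.path.join(tmp, "groups")
    nifti_dir = os.path.join(tmp, "nifti")
    os.makedirs(os.path.join(groups_dir, "p1_0"))
    os.makedirs(os.path.join(groups_dir, "p1_1"))
    os.makedirs(nifti_dir)
    open(os.path.join(nifti_dir, "p1_0.nii.gz"), "w").close()
    demo_missing = missing_nifti(groups_dir, nifti_dir)
    print(demo_missing)
```

In the notebook, `missing_nifti(out_images_dir, out_nifti_img_dir)` and the equivalent call for the labels should both return empty lists if the naming assumption holds.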

### Step 3: Move files into training and testing folders

Split the data from 'nifti_files' into training and testing data in 'task_data'. This folder is also where the network reads from during training. The network also pulls the test data to generate accuracy metrics that quantify how well it performs. Note that you can set the seed below to make the split of testing and training data deterministic.
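
The effect of the seed can be illustrated with a small, dependency-free sketch. It uses Python's standard `random` module in place of the `np.random` calls in the cell below, so the selected indices will differ from the notebook's, but the counts follow the same rounding rule; `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_items, train_proportion, seed=None):
    """Return (train_indices, test_indices) as sorted lists."""
    if not 0.0 < train_proportion < 1.0:
        raise ValueError("train_proportion must be strictly between 0 and 1")
    n_train = int(round(n_items * train_proportion))
    rng = random.Random(seed)  # seed=None gives a different split on each run
    train = sorted(rng.sample(range(n_items), n_train))
    train_set = set(train)
    test = [i for i in range(n_items) if i not in train_set]
    return train, test


# The same seed yields the same split every time.
train_a, test_a = split_indices(10, 0.8, seed=123)
train_b, test_b = split_indices(10, 0.8, seed=123)
print(len(train_a), len(test_a), train_a == train_b)
```

With 854 files and a proportion of 0.8, the same rounding gives 683 training and 171 test items, which matches the counts printed by the cell below.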

In [12]:
# function to check that image and label file name matches
def assert_data_labels_match(images, labels):
    assert(len(images) == len(labels))
    for img_name, lbl_name in zip(images, labels):
        assert(os.path.basename(img_name) == os.path.basename(lbl_name))


# load files to be moved
images = sorted(glob(out_nifti_img_dir + "/*.nii.gz"))
labels = sorted(glob(out_nifti_lbl_dir + "/*.nii.gz"))

# run checks first
assert(train_proportion > 0.0 and train_proportion < 1.0) # correct proportion
assert_data_labels_match(images, labels) # image and label name matches

# randomly pull N data for training depending on proportion
N = int(round(len(images) * train_proportion))
print('Num of training data:',N)
print('Num of test data:', len(images) - N)
train_ind = np.full((len(images)), False, dtype=bool)
np.random.seed(seed=123)
train_ind[np.random.choice(len(images), N, replace=False)] = True

for ind, (image, label) in enumerate(zip(images, labels)):
    if train_ind[ind]:
        shutil.move(image, train_images_dir)
        shutil.move(label, train_labels_dir)
    else:
        shutil.move(image, test_images_dir)
        shutil.move(label, test_labels_dir)

Num of training data: 683
Num of test data: 171


### Optional: Move files back to Nifti folder to be redeployed

If you need to re-randomize or re-split the testing and training data, the script below moves the NIFTI files back from 'task_data' into 'nifti_files'.

In [5]:
# fetch all files
train_images = glob(train_images_dir + "/*.nii.gz")
train_labels = glob(train_labels_dir + "/*.nii.gz")
test_images = glob(test_images_dir + "/*.nii.gz")
test_labels = glob(test_labels_dir + "/*.nii.gz")

# move train images
for train_image in train_images:
    shutil.move(train_image, out_nifti_img_dir)
print('Moved',len(train_images),'train images')
# move train labels
for train_label in train_labels:
    shutil.move(train_label, out_nifti_lbl_dir)
print('Moved',len(train_labels),'train labels')
# move test images
for test_image in test_images:
    shutil.move(test_image, out_nifti_img_dir)
print('Moved',len(test_images),'test images')
# move test labels
for test_label in test_labels:
    shutil.move(test_label, out_nifti_lbl_dir)
print('Moved',len(test_labels),'test labels')

Moved 683 train images
Moved 683 train labels
Moved 171 test images
Moved 171 test labels


### Additional steps: Full images and labels without slicing

Same as Step 3, but moves the full data instead of the sliced data into 'task_data'.

In [8]:
# load files to be moved
images_full = sorted(glob(out_nifti_img_full_dir + "/*.nii.gz"))
labels_full = sorted(glob(out_nifti_lbl_full_dir + "/*.nii.gz"))

# run checks first
assert len(images_full) > 0
assert len(images_full) == len(labels_full)
# assert_data_labels_match(images_full, labels_full) # image and label name matches

# randomly pull N data for training depending on proportion
N = int(round(len(images_full) * train_proportion))
print('Num of training data:',N)
print('Num of test data:', len(images_full) - N)
train_ind = np.full((len(images_full)), False, dtype=bool)
np.random.seed(seed=123)
train_ind[np.random.choice(len(images_full), N, replace=False)] = True

for ind, (image, label) in enumerate(zip(images_full, labels_full)):
    if train_ind[ind]:
        shutil.move(image, train_images_full_dir)
        shutil.move(label, train_labels_full_dir)
    else:
        shutil.move(image, test_images_full_dir)
        shutil.move(label, test_labels_full_dir)

Num of training data: 105
Num of test data: 26


Same as the optional step above, but moves the full data instead of the sliced data back into 'nifti_files'.

In [7]:
# fetch all files
train_full_images = glob(train_images_full_dir + "/*.nii.gz")
train_full_labels = glob(train_labels_full_dir + "/*.nii.gz")
test_full_images = glob(test_images_full_dir + "/*.nii.gz")
test_full_labels = glob(test_labels_full_dir + "/*.nii.gz")

# move train images
for train_image in train_full_images:
    shutil.move(train_image, out_nifti_img_full_dir)
print('Moved',len(train_full_images),'train full images')
# move train labels
for train_label in train_full_labels:
    shutil.move(train_label, out_nifti_lbl_full_dir)
print('Moved',len(train_full_labels),'train full labels')
# move test images
for test_image in test_full_images:
    shutil.move(test_image, out_nifti_img_full_dir)
print('Moved',len(test_full_images),'test full images')
# move test labels
for test_label in test_full_labels:
    shutil.move(test_label, out_nifti_lbl_full_dir)
print('Moved',len(test_full_labels),'test full labels')

Moved 105 train full images
Moved 105 train full labels
Moved 26 test full images
Moved 26 test full labels


## Data Preparation Complete!

Congratulations, you have completed data preparation. You can move on to train.py to train your model.