All the Jupyter Notebooks are available at https://github.com/neuro-ml/dpipe_tutorial

# Tutorials on Deep Pipe

These tutorials introduce **Deep Pipe**, a library for medical image analysis that covers preprocessing, data augmentation, performance validation and final prediction.

## Tutorial 1: Dataset preparation

This tutorial covers dataset splitters and batch iterators.

### Imports:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# import deep pipe library
# how to install: https://github.com/neuro-ml/deep_pipe/blob/master/README.md
import dpipe

Import dataset:

In [3]:
from dpipe.dataset.brats import Brats2017

*The examples of the input arguments can be found at https://github.com/neuro-ml/deep_pipe/blob/master/config_examples/assets/data_source/.*

**Note:** We use an IITP machine on which the BraTS dataset is stored. The corresponding config is:
https://github.com/neuro-ml/deep_pipe/blob/master/config_examples/assets/data_source/iitp/brats.config

In [23]:
dataset = Brats2017(data_path="/nmnt/t01-ssd/brats2017/train", metadata_rpath="metadata.csv")

### I. Splitters

In [7]:
import dpipe.split.cv_111 as cv

### i) Standard train / val / test split: `cv_111` function

*(From docstrings)* Function **cv_111** splits the dataset's ids into triplets (train, validation, test). The test ids are determined as in the standard K-fold cross-validation setting: for each fold a different portion of 1/K ids is kept for testing. The remaining (K-1)/K ids are split into train and validation sets according to `val_size`.

**Returns:** Sequence of triplets

In [24]:
splits = cv.cv_111(dataset, val_size=0.2, n_splits=4)
print('The number of splits: %d' % len(splits))

train, val, test = splits[0]

print('\nThe length of dataset: %d ' % len(dataset.ids))
print('The length of train set: %d' % len(train))
print('The length of val set: %d' % len(val))
print('The length of test set: %d' % len(test))

The number of splits: 4

The length of dataset: 285 
The length of train set: 170
The length of val set: 43
The length of test set: 72
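
The sizes printed above follow directly from the K-fold description. Below is a minimal pure-Python sketch of that logic. It is not Deep Pipe's implementation: the function name `sketch_cv_111`, the `seed` parameter, the placeholder ids, and the choice to round the validation size up are all assumptions made for illustration.

```python
import math
import random


def sketch_cv_111(ids, val_size, n_splits, seed=0):
    """Illustrative stand-in for a train/val/test K-fold splitter (hypothetical)."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)

    # The first (n % K) folds receive one extra id, as in standard K-fold.
    base, extra = divmod(len(ids), n_splits)
    splits, start = [], 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test = ids[start:stop]
        rest = ids[:start] + ids[stop:]
        # Assumption: the validation size is rounded up.
        n_val = math.ceil(val_size * len(rest))
        splits.append((rest[n_val:], rest[:n_val], test))
        start = stop
    return splits


# Placeholder ids, matching only the dataset length shown above.
ids = ["id_%d" % i for i in range(285)]
splits = sketch_cv_111(ids, val_size=0.2, n_splits=4)
train, val, test = splits[0]
print(len(splits), len(train), len(val), len(test))
```

With 285 ids and K = 4, the first test fold holds 72 ids; of the remaining 213, 20% rounded up gives 43 validation ids and 170 training ids.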


### ii) Group train / val / test split: `group_cv_111` function

*(From docstrings)* Function **group_cv_111** splits the dataset's ids into triplets (train, validation, test) keeping all the objects from a group in the same set (either train, validation or test). The test ids are determined as in the standard K-fold cross-validation setting: for each fold a different portion of 1/K ids is kept for testing. The remaining (K-1)/K ids are split into train and validation sets according to `val_size`. The splitter guarantees that no objects belonging to the same group will end up in different sets.
    
**Returns:** Sequence of triplets

**Note:** The dataset must have a `group` property.
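
One way to obtain this guarantee is to run the K-fold procedure over the unique group labels and only then expand each group back into its ids. The sketch below illustrates that idea on toy data. It is a hypothetical illustration, not Deep Pipe's code: `sketch_group_cv_111`, `group_of`, `seed`, and the toy ids are all assumed names.

```python
import math
import random
from collections import defaultdict


def sketch_group_cv_111(ids, group_of, val_size, n_splits, seed=0):
    """Illustrative group-aware splitter: fold over groups, then expand to ids."""
    members = defaultdict(list)
    for i in ids:
        members[group_of[i]].append(i)

    groups = sorted(members)
    random.Random(seed).shuffle(groups)

    def expand(group_list):
        # Replace each group label with all ids belonging to it.
        return [i for g in group_list for i in members[g]]

    base, extra = divmod(len(groups), n_splits)
    splits, start = [], 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test_groups = groups[start:stop]
        rest_groups = groups[:start] + groups[stop:]
        # Assumption: val_size is applied to the number of groups, rounded up.
        n_val = math.ceil(val_size * len(rest_groups))
        splits.append((expand(rest_groups[n_val:]),
                       expand(rest_groups[:n_val]),
                       expand(test_groups)))
        start = stop
    return splits


# Toy data: 12 hypothetical patients with 2 scans each; the patient is the group.
ids = ["scan_%d_%d" % (p, s) for p in range(12) for s in range(2)]
group_of = {i: i.rsplit("_", 1)[0] for i in ids}

splits = sketch_group_cv_111(ids, group_of, val_size=0.2, n_splits=4)
train, val, test = splits[0]
print(len(train), len(val), len(test))
```

Because folding happens at the group level, the number of distinct groups, not the number of ids, limits how many splits are possible.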

In [34]:
import dpipe.dataset.wrappers as wp

In [40]:
new_dataset = wp.add_groups_from_df(dataset, 'cancer_type')

In [43]:
splits = cv.group_cv_111(new_dataset, val_size=0.2, n_splits=4)
print('The number of splits: %d' % len(splits))

train, val, test = splits[0]

print('\nThe length of dataset: %d ' % len(dataset.ids))
print('The length of train set: %d' % len(train))
print('The length of val set: %d' % len(val))
print('The length of test set: %d' % len(test))

ValueError: Cannot have number of splits n_splits=4 greater than the number of samples: 2.
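
The error message reports only 2 items to fold over, which suggests that the splitter folds over the distinct group labels and that `cancer_type` takes just two values in this metadata. Under that reading, `n_splits` cannot exceed 2 with this grouping, so the options are to lower `n_splits` or to group by a finer-grained column. A quick pre-check along these lines can be sketched as follows. The helper `check_group_split` and the toy labels are hypothetical and only stand in for a two-valued column.

```python
from collections import Counter


def check_group_split(group_labels, n_splits):
    """Return (number of distinct groups, whether n_splits folds are possible)."""
    counts = Counter(group_labels)
    return len(counts), n_splits <= len(counts)


# Toy labels standing in for a column with two distinct values.
labels = ["HGG"] * 5 + ["LGG"] * 3

n_groups, feasible = check_group_split(labels, n_splits=4)
print(n_groups, feasible)

n_groups_2, feasible_2 = check_group_split(labels, n_splits=2)
print(n_groups_2, feasible_2)
```

Running such a check before calling a group splitter makes the constraint explicit instead of surfacing it as an exception from inside the fold generator.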