<a href="https://colab.research.google.com/github/vzinche/training_ML_for_image_analysis_EBI/blob/master/training_script.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The first thing we have to do is to (unexpectedly!) take a look at the data.

We are going to work with the Kaggle 2018 Data Science Bowl data.
To start, go to the [data webpage](https://www.kaggle.com/c/data-science-bowl-2018), read the description and the evaluation sections (what do we have to do?), and then check the 'data' tab to see the data structure.

As you can see, only the 'stage1_train' part has ground truth available. Thus, we will use it for training and evaluation. Additionally, we could test our model on 'stage1_test' or 'stage2_test', but then we could judge the model performance only visually: we don't have the ground truth to get any numbers.

To make things easier, let's start by downloading just the 'stage1_train' and 'stage1_test' data, which I stored on my Google Drive:


In [0]:
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1O66UElt2ZfhLXUKKX_nTxmIXh6fMA2rT' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1O66UElt2ZfhLXUKKX_nTxmIXh6fMA2rT" -O kaggle_data.zip && rm -rf /tmp/cookies.txt

Remember that you can execute any bash command from the Notebook if you precede the command name with '!'.

And please check whether the downloaded archive is around 80M (the value after the progress bar [      <=>           ]). Sometimes Google fails :)

Those of you who like bash can play around with unzipping the data into nice folders. The rest of you can just run the following:

In [0]:
#@title

!unzip -qq kaggle_data.zip && rm kaggle_data.zip
!mkdir nuclei_train_data && unzip -qq stage1_train.zip -d nuclei_train_data/ && rm stage1_train.zip
!mkdir nuclei_test_data && unzip -qq stage1_test.zip -d nuclei_test_data/ && rm stage1_test.zip

Don't forget to periodically check what's happening by listing the contents of your directory with '!ls'

In [0]:
!ls -ltrh

Now take a moment to think about the data you have. 

What will be the training data, the evaluation data and the testing data? How do you split it? 

Hint: you have ground truth only for 'stage1_train'. You might want to split it into the training and evaluation data. Normally, we would want our test data to have ground truth as well, since we would report the accuracy on the test data. In this case it's not necessary, since (if we had participated in the actual Data Science Bowl) we would have gotten the accuracy value after submitting our solutions.

Now think about the stage at which you would split the data. Do you want to split it into separate folders in bash or later in Python?
(You can just run the next cell to split it with bash)

In [0]:
#@title

!mkdir nuclei_eval_data && n=0 && for file in nuclei_train_data/*; do test $n -eq 0 && mv "$file" nuclei_eval_data/; n=$((n+1)); n=$((n%4)); done
!ls nuclei_train_data | wc -l && ls nuclei_eval_data | wc -l
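
If you would rather do the split in Python, here is a minimal sketch that mirrors the bash loop above (every 4th sample folder goes to the evaluation set). The function name `split_every_nth` is made up for this example; only the standard library is used.

```python
import os
import shutil


def split_every_nth(train_dir, eval_dir, n=4):
    """Move every n-th sample folder from train_dir to eval_dir; return how many were moved."""
    os.makedirs(eval_dir, exist_ok=True)
    moved = 0
    # sort the names so that the split is reproducible between runs
    for i, name in enumerate(sorted(os.listdir(train_dir))):
        if i % n == 0:
            shutil.move(os.path.join(train_dir, name), os.path.join(eval_dir, name))
            moved += 1
    return moved
```

Assuming the folder names used in this notebook, you would call it as `split_every_nth("nuclei_train_data", "nuclei_eval_data")`. Run either this or the bash cell, not both, otherwise the training set gets split twice.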

Now let's import the libraries we might need! I've added the ones that are strictly necessary (in my opinion :), but you might need to add some more.

In [0]:
%matplotlib inline
%load_ext tensorboard
import os
import math
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
import torch
from torch.utils.data import Dataset, DataLoader
from torch.utils.tensorboard import SummaryWriter
import torch.nn as nn
from torch.nn import functional as F
from torchvision import transforms, utils

What I would normally start with in any machine learning pipeline is writing a dataloader. Luckily, most of the functionality is already provided by PyTorch; what you need to do is write a class that will actually supply the dataloader with training samples: a Dataset.

Please take a moment to read about it [here](https://pytorch.org/docs/stable/data.html?highlight=dataset#torch.utils.data.Dataset) and [here](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class).



The main idea: any Dataset class should have two methods: \_\_len\_\_ that returns the dataset length (the number of elements) and \_\_getitem\_\_ that, given an index, returns the input (image) and the target (ground truth).
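
To make the two methods concrete, here is a toy sketch that has nothing to do with the nuclei data: it generates noise images with one bright square and the matching mask. The class name `ToySquaresDataset` and all its parameters are invented for illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToySquaresDataset(Dataset):
    """Hypothetical toy dataset: noise images with a bright square as the 'object'."""

    def __init__(self, n_samples=8, size=32):
        self.n_samples = n_samples
        self.size = size

    def __len__(self):
        # the DataLoader uses this to know which indices it may ask for
        return self.n_samples

    def __getitem__(self, idx):
        # seeding the generator with idx makes each sample reproducible
        gen = torch.Generator().manual_seed(idx)
        image = 0.1 * torch.rand(1, self.size, self.size, generator=gen)
        mask = torch.zeros(1, self.size, self.size)
        top, left = torch.randint(0, self.size // 2, (2,), generator=gen).tolist()
        mask[:, top:top + self.size // 2, left:left + self.size // 2] = 1
        image = image + mask           # the square is brighter than the background
        return image, mask


toy_data = ToySquaresDataset()
toy_loader = DataLoader(toy_data, batch_size=4)
images, masks = next(iter(toy_loader))   # the loader stacks 4 samples into one batch
print(len(toy_data), images.shape, masks.shape)
```

Note that the dataset returns single samples shaped (channels, height, width); adding the batch dimension is the DataLoader's job.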

Now try writing your own dataset. Think what you should include there (what do you want to do with your data?)

In [0]:
#your code here

Otherwise, try making this already written prototype work (you should take care of all the TODO's)

In [0]:
#@title
# any PyTorch dataset class should inherit from the base class torch.utils.data.Dataset
class NucleiDataset(Dataset):
    """ A PyTorch dataset to load cell images and nuclei masks """
    def __init__(self, YOUR_ARGS, transform=None):     # which arguments would you need to build a dataset?
        self.transform = transform    # we might want to apply some transformations to your data
        self.YOUR_ARG = YOUR_ARG      # TODO save the variables that you pass as arguments 
        self.ANOTHER_ARG? = ANOTHER_ARG?
        self.samples = # TODO we need to get a list of all the training samples
        self.inp_transforms = transforms.Compose([transforms.Grayscale(),     # we'll have a mix of coloured and greyscale images, let's train a network on something consistent
                                                  transforms.ToTensor(),      # all the data would have to be transformed to Torch tensors before training
                                                  transforms.Normalize([0.5], [0.5])    # we would normally want to normalize our data
                                                  ])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        img_name = # TODO How would you get an image name given its id?
        image = Image.open(img_name)    # we'll be using Pillow library for reading files
        image = image.convert("RGB")    # make sure every image has exactly three colour channels (drops alpha, if any)
        image = self.inp_transforms(image)
        masks_dir = # TODO What is the folder with the corresponding masks?
        if not os.path.isdir(masks_dir):    # for the testing case, when we don't have ground truth
            return image
        masks_list = os.listdir(masks_dir)
        # if you haven't noticed already, each mask is stored in a separate image
        # we have to iterate through all of them and sum them up
        # TODO: your code here
        if self.transform is not None:
            image, mask = self.transform([image, mask])
        return image, mask
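
If you get stuck on the mask-merging TODO, the idea can be sketched with NumPy on synthetic arrays. This is only one possible approach: it assumes each mask is a 2D array where any value above 0 means foreground, and it uses a logical OR instead of a plain sum so that the result stays binary. The helper name `merge_masks` is made up.

```python
import numpy as np


def merge_masks(masks):
    """Combine per-nucleus binary masks (H x W arrays) into one binary mask."""
    merged = np.zeros_like(masks[0], dtype=bool)
    for m in masks:
        merged |= m > 0            # logical OR stays binary even if masks overlap
    return merged.astype(np.float32)


# two synthetic 4x4 "nuclei" standing in for the mask images read from disk
a = np.zeros((4, 4), dtype=np.uint8)
a[0:2, 0:2] = 255
b = np.zeros((4, 4), dtype=np.uint8)
b[2:4, 2:4] = 255
combined = merge_masks([a, b])
print(combined)
```

In the dataset you would build the list with something like `np.array(Image.open(path))` for every file in the masks folder and then convert the merged result to a tensor with a channel dimension.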



Once you are ready with your dataset class, let's load it!

In [0]:
TRAIN_DATA_PATH = # insert the path here
train_data = NucleiDataset(TRAIN_DATA_PATH)

And let's take this simple function to show the dataset:

In [0]:
def show_dataset(dataset):
    idx = np.random.randint(0, len(dataset))    # take a random sample
    img, mask = dataset[idx]                    # get the image and the nuclei masks
    f, axarr = plt.subplots(1, 2)               # make two plots on one figure
    axarr[0].imshow(img[0])                     # show the image
    axarr[1].imshow(mask[0])                    # show the masks
    _ = [ax.axis('off') for ax in axarr]        # remove the axes
    plt.show()

In [0]:
show_dataset(train_data)

As you can probably see, if you clicked enough times, some of the images are really huge! What happens if we load them into memory and run the model on them? We might run out of memory. That's why, when training networks on images or volumes, one has to be really careful about the sizes. In practice, you would want to control their size. An additional reason for restricting the size: if we want to train in batches (faster and more stable training), we need all the images in a batch to be the same size. That is why we prefer to either resize or crop them.
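
A quick way to see the batching problem, sketched with NumPy arrays standing in for images (the two sizes are just example values): stacking arrays of different shapes fails, while stacking equally sized crops works.

```python
import numpy as np

small = np.zeros((1, 256, 256), dtype=np.float32)   # (channels, height, width)
large = np.zeros((1, 520, 696), dtype=np.float32)

try:
    batch = np.stack([small, large])                 # shapes differ, so this raises
except ValueError as err:
    print("cannot build a batch:", err)

crop = large[:, :256, :256]                          # a fixed crop, just for the demo
batch = np.stack([small, crop])
print(batch.shape)   # (2, 1, 256, 256)
```

The default PyTorch DataLoader stacks samples in a similar way, which is why every sample it receives has to have the same shape.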

Here is a function (well, actually a class) that applies a 'random crop' transformation. Notice that we apply it to images and masks simultaneously to make sure they correspond, despite the randomness.

In case anybody is wondering why we bother to write a whole class for it instead of simply cropping the images directly in the dataset: we want to keep the code modular. We want to write one dataset object, and then we can try all the possible transforms with this one dataset. Similarly, we want to write one RandomCrop transform object, and then we can reuse it for any other image datasets we might have in the future.

In [0]:
class RandomCrop(object):
    """Crop randomly the input image and the output mask"""
    def __init__(self, crop_size):
        assert isinstance(crop_size, (int, tuple, list))
        if isinstance(crop_size, int):
            self.output_size = (crop_size, crop_size)
        else:
            assert len(crop_size) == 2
            self.output_size = tuple(crop_size)    # (height, width)

    def __call__(self, sample):
        assert len(sample) == 2
        image, mask = sample
        h, w = image.shape[1:]    # tensors are (channels, height, width)
        new_h, new_w = self.output_size
        # randint excludes the upper bound, hence the +1; images smaller than the crop are left as is
        top = np.random.randint(0, max(h - new_h, 0) + 1)
        left = np.random.randint(0, max(w - new_w, 0) + 1)
        image = image[:, top: top + new_h, left: left + new_w]
        mask = mask[:, top: top + new_h, left: left + new_w]
        return image, mask

PS: PyTorch already has quite a bunch of data transforms, so if you need one, check [here](https://pytorch.org/docs/stable/torchvision/transforms.html). The biggest problem with them is that they are clearly separated into transforms applied to PIL images (_remember, we initially load the images as PIL.Image?_) and to Torch tensors (_remember, we converted the images into tensors by calling transforms.ToTensor()?_). This can be incredibly annoying if for some reason you need to transform your images to tensors before applying any other transforms, or you don't want to use the PIL library at all.

Now let's get a new dataset with cropping and check it

In [0]:
train_data = NucleiDataset(TRAIN_DATA_PATH, RandomCrop(256))

In [0]:
show_dataset(train_data)

And the same for the evaluation data:

In [0]:
EVAL_DATA_PATH = # insert your path here
eval_data = NucleiDataset(EVAL_DATA_PATH, RandomCrop(256))

In [0]:
show_dataset(eval_data)

Now comes a harder part :D 


We need to define the architecture of the model to use. I suggest a [U-Net](https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/), which has proven to consistently outperform other architectures in segmenting biological and medical images.

The architecture figure on that page precisely describes all the building blocks you need to create the model. All of them can be found in the list of PyTorch layers (modules) [here](https://pytorch.org/docs/stable/nn.html#convolution-layers).

So if you feel like it, you can try to construct the model yourself. If not, you are in luck. From my personal experience, whatever new thing you want to try out in the Deep Learning field, if it has been published, you will most likely find the code somewhere on the internet (GitHub!). So feel free to google 'U-Net pytorch' and see what you find. You still have to read through the code you've found to make sure it makes sense and doesn't use any weird libraries that we don't want to install here.

__Additional note__: even if you're able to write the model/layer/loss/whatever yourself, it makes more sense to first look for it on the internet, because it is pretty likely that you can simply find a better implementation (more efficient, numerically stable, etc). But don't forget to cite!

In [0]:
class UNet(nn.Module):
    def __init__(self, ARGS):     # you start with initializing the model with needed parameters
        #TODO: your code here

    def forward(self, x):         # this is the function that will actually take your input (images) and process them to generate the output (segmentation)
        #TODO: your code here
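
If you want a starting point, below is a small U-Net-style sketch, not the exact network from the paper: it uses padded convolutions (so the output has the same size as the input), only three resolution levels and few feature maps, and a sigmoid at the end so that thresholding the output at 0.5 makes sense. The names `SmallUNet` and `double_conv` and all default values are choices made for this example.

```python
import torch
import torch.nn as nn


def double_conv(in_ch, out_ch):
    """Two 3x3 convolutions, each followed by ReLU (padding keeps the size)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )


class SmallUNet(nn.Module):
    def __init__(self, in_channels=1, out_channels=1, features=(16, 32, 64)):
        super().__init__()
        self.pool = nn.MaxPool2d(2)
        # encoder: one double convolution per resolution level
        self.encoders = nn.ModuleList()
        ch = in_channels
        for f in features:
            self.encoders.append(double_conv(ch, f))
            ch = f
        self.bottleneck = double_conv(ch, ch * 2)
        # decoder: upsample, concatenate the skip connection, convolve
        self.upsamplers = nn.ModuleList()
        self.decoders = nn.ModuleList()
        ch = ch * 2
        for f in reversed(features):
            self.upsamplers.append(nn.ConvTranspose2d(ch, f, kernel_size=2, stride=2))
            self.decoders.append(double_conv(2 * f, f))
            ch = f
        self.head = nn.Conv2d(ch, out_channels, kernel_size=1)

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)            # remember features for the skip connection
            x = self.pool(x)           # halve the resolution
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.upsamplers, self.decoders, reversed(skips)):
            x = up(x)                  # double the resolution
            x = dec(torch.cat([skip, x], dim=1))
        return torch.sigmoid(self.head(x))   # per-pixel foreground probability


model = SmallUNet()
out = model(torch.zeros(2, 1, 64, 64))
print(out.shape)   # torch.Size([2, 1, 64, 64])
```

With three pooling steps the input height and width have to be divisible by 8, which holds for the 256x256 crops used in this notebook.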

The next step would be writing a loss function: a metric that will tell us how close we are to the desired output. This metric should be differentiable, since this is the value to be backpropagated. There are [multiple losses](https://lars76.github.io/neural-networks/object-detection/losses-for-segmentation/) we could use.

Take a moment to think which one is better to use. If you are not sure, don't forget that you can always _google_!
Before you start implementing the loss yourself, take a look at losses already implemented in [PyTorch](https://pytorch.org/docs/stable/nn.html#loss-functions). You can also look for implementations on GitHub. 

TODO: find out what a Focal loss is. Try to explain the rationale behind using it. Try implementing it yourself or find some satisfactory implementation on GitHub.

In [0]:
#TODO : your focal (or some other) loss here
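
For reference, one common form of the binary focal loss is FL = -(1 - p_t)^gamma * log(p_t), where p_t is the probability the model assigns to the correct class of a pixel. The sketch below implements this form without the optional class-weighting factor and assumes the model outputs probabilities (after a sigmoid); the function name and the default `gamma=2.0` are choices for this example. With `gamma=0` it reduces to the usual binary cross-entropy, which makes for an easy sanity check.

```python
import torch
import torch.nn.functional as F


def binary_focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Focal loss for binary segmentation; `probs` are sigmoid outputs in [0, 1]."""
    probs = probs.clamp(eps, 1 - eps)                 # avoid log(0)
    # p_t is the probability the model assigns to the correct class of each pixel
    p_t = torch.where(targets > 0.5, probs, 1 - probs)
    # (1 - p_t)^gamma shrinks the contribution of pixels that are already well classified
    return (-(1 - p_t) ** gamma * torch.log(p_t)).mean()


probs = torch.tensor([0.9, 0.6, 0.2, 0.1])
targets = torch.tensor([1.0, 1.0, 0.0, 0.0])
# with gamma=0 the two numbers should agree
print(binary_focal_loss(probs, targets, gamma=0.0), F.binary_cross_entropy(probs, targets))
# with gamma>0 easy pixels are down-weighted, so the value is smaller
print(binary_focal_loss(probs, targets, gamma=2.0))
```

The rationale: in images with few foreground pixels, the many easy background pixels would otherwise dominate the loss.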

Let's also define a helper function that will calculate [Intersection Over Union](https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/) for us. Why do we need it? Because the metric that will be used to evaluate our submissions is based on it (check the 'evaluation' tab on the dataset website). Why don't we use it as a loss directly? It's not differentiable and has 'not the best' surface properties.

In [0]:
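def get_iou(prediction, mask):
    prediction, mask = prediction.bool(), mask.bool()    # bitwise & and | need boolean tensors
    intersection = (prediction & mask).sum().float()
    union = (prediction | mask).sum().float()
    return intersection / union.clamp(min=1)    # avoids 0/0 when both masks are empty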
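
A tiny worked example of the IoU computation, done with NumPy on hand-made 2x3 masks so that the numbers can be checked by eye:

```python
import numpy as np

prediction = np.array([[1, 1, 0],
                       [0, 1, 0]], dtype=bool)
mask = np.array([[1, 0, 0],
                 [0, 1, 1]], dtype=bool)

intersection = (prediction & mask).sum()   # pixels marked in both: 2
union = (prediction | mask).sum()          # pixels marked in at least one: 4
iou = intersection / union
print(iou)   # 0.5
```

A perfect prediction gives an IoU of 1, and a prediction that does not overlap the mask at all gives 0.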

Now let's write a function that will take care of all the training steps, namely, for every training sample (or training batch), take a forward step, compute the loss, backpropagate, update. 

In [0]:
def train(model, optimizer, train_loader, test_loader, num_epochs=10):
    for epoch in range(num_epochs):

        # print a nice progress bar
        print('Epoch {}/{}'.format(epoch+1, num_epochs))
        print('-' * 10)

        model.train()   # set the model to the training state

        #TODO: initialize any values you want to track (loss, accuracy, iou?)

        for images, masks in train_loader:
            optimizer.zero_grad()    # erase all the gradients from the previous steps
            outputs = model(images)    # predict
            predictions = (outputs > 0.5)    # binarize the predictions to get a mask

            # you might want to be able to visually judge how the training is going
            # use the show_dataset function as a template to write a function
            # that will display input images, ground truth masks and the model predictions
            # use it to show the progress, let's say, every 10th iteration
            # TODO: write show_images function and apply it every 10th iteration 

            # TODO: apply the loss (you need the output and the ground truth)
            loss = 

            # TODO: calculate the pixel-wise accuracy (just take the mean of predictions==target)
            accuracy = 

            # TODO: calculate the IoU (use the function you wrote before)
            iou = 

            loss.backward()    # compute the gradients for every parameter
            optimizer.step()    # update the weights!

            # TODO: update the values you are tracking

        # TODO: for every epoch print the values that we are tracking
        # Remember that we are interested in numbers per sample, not per whole dataset
        # (e.g., accuracy=53.6 will not tell you anything meaningful)

        # every epoch we want to check how the model performs on previously unseen data
        # for this we need to run the model on our evaluation data and check the values
        # For evaluation you want the model in its evaluation state - 'model.eval()'
        # Moreover, you don't want to calculate gradients or backpropagate now
        # So, in order to save time, run all the iterations inside 'with torch.no_grad():'
        # TODO: write the evaluation
    
    return model
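
The evaluation part could look roughly like the sketch below. It is a standalone helper rather than part of the train function, and it only tracks the loss; the name `evaluate`, its arguments, and the assumption that the loss function returns the mean over the batch are all choices made for this example.

```python
import torch
import torch.nn as nn


def evaluate(model, loader, loss_fn, device="cpu"):
    """Return the mean loss per sample over `loader` (hypothetical helper)."""
    model.eval()                       # switch layers like dropout/batch norm to inference mode
    total_loss, n_samples = 0.0, 0
    with torch.no_grad():              # no gradients needed: saves time and memory
        for images, masks in loader:
            images, masks = images.to(device), masks.to(device)
            outputs = model(images)
            batch_size = images.shape[0]
            # loss_fn is assumed to return the batch mean, so weight it by the batch size
            total_loss += loss_fn(outputs, masks).item() * batch_size
            n_samples += batch_size
    return total_loss / max(n_samples, 1)


# toy check: a one-layer "model" and a list of two batches standing in for a DataLoader
toy_model = nn.Sequential(nn.Conv2d(1, 1, kernel_size=1), nn.Sigmoid())
toy_batches = [(torch.rand(2, 1, 8, 8), torch.zeros(2, 1, 8, 8)),
               (torch.rand(3, 1, 8, 8), torch.zeros(3, 1, 8, 8))]
eval_loss = evaluate(toy_model, toy_batches, nn.BCELoss())
print(eval_loss)
```

Dividing by the number of samples (not the number of batches) keeps the value comparable even when the last batch is smaller. Remember to call `model.train()` again before the next training epoch, as the train function above already does.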

Now that we have the main train function, let's prepare all we need for training

In [0]:
# get the actual PyTorch loaders!
train_dataloader = DataLoader(train_data, batch_size=5, shuffle=True) # you can adjust the batch size to fit your memory
eval_dataloader = DataLoader(eval_data, batch_size=5)
# create a model
model = UNet(PARAMETERS) # TODO: which parameters do you need to initialize your UNet?
# and an optimizer - it will take care of updating our parameters properly
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

Take a moment to [read](https://machinelearningmastery.com/understand-the-dynamics-of-learning-rate-on-deep-learning-neural-networks/) about the importance of setting a proper learning rate.

Now we're good to go. Let's see how the training goes!

In [0]:
model = train(model, optimizer, train_dataloader, eval_dataloader, num_epochs=10)

The first thing you might notice is that the training is moving incredibly slowly. That's because we have some quite heavy computations to run. Something to consider once you decide to use Deep Learning for image processing: you might want to train on a GPU!

Luckily, Google Colab Notebooks have GPUs available. For this you need to go to Edit -> Notebook Settings -> Hardware accelerator and choose GPU. Note: this will restart the whole notebook and clean your home directory. You will have to rerun all the cells, including the ones where you download and unzip the data.

Now we'll have to tell PyTorch to train on GPUs. Firstly, let's confirm it's available.

In [0]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")   # check if we have GPUs
print(device)   # this should print 'cuda:0'

What we have to do now is to bring all the data we need for training to the device we use for training. This is done by calling the method to() on your data, for instance:

`model = model.to(device)`

What I suggest is to add device as an argument to your train function, and then transfer everything to this device. Then you can just set this device to GPU or CPU when calling the train function.

Hint: you would want to send all the inputs and targets, as well as the model, to the GPU. For some postprocessing (e.g. visualising images) you would need to send the data back to the CPU by calling `data.cpu()`
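
A minimal sketch of the round trip, using a single convolution as a stand-in for the real model and random tensors as a stand-in for a batch (both are placeholders for this example). The same code runs on a CPU-only machine, it just picks "cpu" as the device.

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

toy_model = nn.Conv2d(1, 1, kernel_size=3, padding=1).to(device)   # parameters now live on the device
images = torch.rand(2, 1, 16, 16).to(device)                       # the inputs must live there too

with torch.no_grad():
    outputs = toy_model(images)        # computed on the device

# matplotlib and NumPy need the data back in CPU memory
outputs_np = outputs.cpu().numpy()
print(device, outputs_np.shape)
```

Inside the training loop you would do the same `.to(device)` call on every batch of images and masks.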


TODO: rerun the same training on GPU. Enjoy the speed improvement! 

The next step (I naively assume we still have time) is to implement a nicer way to track the progress of our training than simply printing all the metrics every epoch. For this we would want to use an amazing tool called [tensorboard](https://www.tensorflow.org/tensorboard). It is developed by the TensorFlow team, but can be integrated with PyTorch as well.

There is an amazing tutorial on what can be visualised with TensorBoard in PyTorch [here](https://pytorch.org/tutorials/intermediate/tensorboard_tutorial.html). For now we are just interested in tracking our metrics (scalars). For this we just need to create [a summary writer](https://pytorch.org/tutorials/intermediate/tensorboard_tutorial.html#tensorboard-setup) and [add our scalars there](https://pytorch.org/tutorials/intermediate/tensorboard_tutorial.html#tracking-model-training-with-tensorboard).

Adding scalars is as simple as calling this function:

`writer.add_scalar(scalar_name, scalar_value, step_number)` 

where the step number is (epoch number * number of iterations per epoch + iteration count), so that it keeps increasing from one epoch to the next
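
The step arithmetic can be sketched in plain Python. This assumes that the size is counted in iterations (batches) per epoch, i.e. `len(train_loader)`, and that epochs and iterations are counted from zero; the helper name `global_step` is made up.

```python
def global_step(epoch, iteration, steps_per_epoch):
    """Index of the current iteration counted from the very start of training."""
    return epoch * steps_per_epoch + iteration


steps_per_epoch = 3    # e.g. len(train_loader)
steps = [global_step(epoch, iteration, steps_per_epoch)
         for epoch in range(2)
         for iteration in range(steps_per_epoch)]
print(steps)   # [0, 1, 2, 3, 4, 5]
```

Inside the training loop the call would then look like `writer.add_scalar('loss', loss.item(), global_step(epoch, i, len(train_loader)))`, where `i` is the iteration index within the epoch and the tag `'loss'` is just an example name.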

In [0]:
# TODO: modify your train function to add your metrics (loss, accuracy and iou) to the writer every epoch 
# HINT: you might want to add the writer as an argument to your train function

In [0]:
# TODO: now initialize your summary writer (mind the directory)

In [0]:
%tensorboard --logdir YOUR_DIR_HERE    # launch tensorboard from inside the notebook

Now run the training and enjoy tracking the metrics!

For people who have time and want to play around: 

*   try to visualize the graph (model) that we are using with TensorBoard
*   try to write your own data transform (e.g., RandomRotate)
*   try to visualize the images-masks-predictions in TensorBoard every nth epoch
*   test alternative loss functions
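
As a possible starting point for the custom transform, here is a sketch of a simplified rotation that only turns by multiples of 90 degrees, so no interpolation is needed. It follows the same convention as the RandomCrop class (a pair of (channels, height, width) tensors goes in, a pair comes out); the class name `RandomRotate90` is made up for this example.

```python
import numpy as np
import torch


class RandomRotate90(object):
    """Rotate the image and the mask by the same random multiple of 90 degrees."""

    def __call__(self, sample):
        image, mask = sample
        k = np.random.randint(0, 4)                 # 0, 1, 2 or 3 quarter turns
        # rotate in the (height, width) plane; dimension 0 holds the channels
        image = torch.rot90(image, k, dims=(1, 2))
        mask = torch.rot90(mask, k, dims=(1, 2))
        return image, mask


image = torch.arange(16.0).reshape(1, 4, 4)
mask = image.clone()
rotated_image, rotated_mask = RandomRotate90()([image, mask])
print(torch.equal(rotated_image, rotated_mask))   # True: both got the same rotation
```

A quarter turn swaps height and width, so this is best applied after a square crop. Chaining it with the crop via `transforms.Compose([RandomCrop(256), RandomRotate90()])` should work, since Compose just calls the transforms one after another, but treat that as something to verify rather than a guarantee.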

