# Histopathologic Cancer Detection

## Authors

| Name             | ID number | Email                                  |
| :--------------- | :-------- | :------------------------------------- |
| Tomasz Gil       | 127295    | tomasz.gil@student.put.poznan.pl       |
| Łukasz Kobyłecki | 127292    | lukasz.kobylecki@student.put.poznan.pl |

## Description

According to the [Histopathologic Cancer Detection](https://www.kaggle.com/c/histopathologic-cancer-detection/) competition description page on Kaggle:
> In this competition, you must create an algorithm to identify metastatic cancer in small image patches taken from larger digital pathology scans. The data for this competition is a slightly modified version of the PatchCamelyon (PCam) benchmark dataset (the original PCam dataset contains duplicate images due to its probabilistic sampling, however, the version presented on Kaggle does not contain duplicates).

## Approach

We will create a classifier using a convolutional neural network (the DenseNet architecture) and train it on the provided data. During training we will use the following methods and mechanisms:
* data augmentation
* transfer learning
* fine-tuning with discriminative learning rates

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

## Importing packages

We will use `fastai`, an open-source deep learning library for Python. It provides powerful abstractions over machine learning pipeline elements for various types of problems and data sources. It lets us iterate quickly on a problem, comes with functions that help choose parameters for a neural network, and ships the most popular architectures with predefined parameters. The library acts as a wrapper around PyTorch models, so you can still define your own layers for more customized solutions.

As the domain of this problem is computer vision classification, we will rely heavily on the contents of the `fastai.vision` subpackage.

Documentation: https://docs.fast.ai/  
Source code: https://github.com/fastai/fastai

In [None]:
from fastai import *
from fastai.vision import *
from torchvision.models import * 
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os

## Data

The input data is very straightforward: we are given histopathologic images, each of which is associated with a boolean value. The value indicates whether the image contains cancer cells or not. True values are mapped to 1, false values to 0. We will treat those values as the classes for our classification.

With that in mind, we take a look at how many images are associated with each class.

In [None]:
path = Path("../input/histopathologic-cancer-detection/")
train_labels = pd.read_csv(path/"train_labels.csv")
train_labels.head()

In [None]:
classes = pd.unique(train_labels["label"])
for label in classes:
    # Print the label value itself; indexing `classes` by it would only work by coincidence.
    count = (train_labels["label"] == label).sum()
    print("{} items in class {}".format(count, label))

### Data augmentation

We will use this technique to reduce data requirements. It modifies model inputs, essentially creating new images during training, in order to effectively increase the size of the data set. There are many types of data augmentation, such as rotation, flipping, padding and perspective warping. It is easy to apply and very effective for problems in the domain of computer vision.
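
To make the idea concrete, here is a minimal sketch of flip and rotation augmentation on a toy array. It uses plain NumPy rather than fastai, and the function name `dihedral_variants` and the 2x2 patch are made up for illustration only.

```python
import numpy as np

def dihedral_variants(patch):
    """Return the 8 flip/rotation variants of a square image patch."""
    variants = []
    for k in range(4):                       # 0, 90, 180 and 270 degree rotations
        rotated = np.rot90(patch, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))  # mirrored copy of each rotation
    return variants

# Toy 2x2 "image" with four distinct pixel values (hypothetical data).
patch = np.array([[1, 2],
                  [3, 4]])
variants = dihedral_variants(patch)
unique = {v.tobytes() for v in variants}     # distinct pixel arrangements
print(len(variants), len(unique))
```

Every variant contains exactly the same pixels in a different arrangement, which is why such transforms are safe when orientation carries no meaning for the label.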

Below we create an object with the parameters for augmentation. Since image orientation should not have any influence on the predicted class for this problem, we allow images to be flipped along both axes. We also specify how much the images can be rotated, zoomed and warped. Finally, we specify parameters for changing the brightness of the generated images.

In [None]:
transforms = get_transforms(
                do_flip=True, 
                flip_vert=True, 
                max_rotate=10.0, 
                max_zoom=1.1, 
                max_lighting=0.05,
                max_warp=0.2,
                p_affine=0.75,
                p_lighting=0.75
            )

To make sure that our results are reproducible, we set a constant seed for the random number generator.

In [None]:
np.random.seed(22)

### Loading input data

Data for deep learning models is stored in an object called `DataBunch`. It contains:
* training set - data the model learns from
* validation set - data that the model does not look at when training, used to calculate and print metrics
* (optionally) test set - data without labels
* data loader - used by the learner to load data into memory
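
The training/validation split can be pictured with the standard-library sketch below. The helper `split_train_valid` and the image ids are hypothetical, and the 20% hold-out is an assumption that mirrors what we understand to be the default `valid_pct` of `ImageDataBunch.from_csv`; fastai performs its own split internally.

```python
import random

def split_train_valid(items, valid_pct=0.2, seed=22):
    """Randomly hold out a fraction of the items for validation."""
    rng = random.Random(seed)            # a fixed seed makes the split repeatable
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_pct)
    return shuffled[n_valid:], shuffled[:n_valid]

image_ids = [f"img_{i:03d}" for i in range(10)]   # hypothetical ids
train_ids, valid_ids = split_train_valid(image_ids)
print(len(train_ids), len(valid_ids))
```

Calling the function again with the same seed returns the same split, which is the property we rely on when fixing the seed before building the `DataBunch`.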

As the `DataBunch` object will be used by the deep learning model, we need to pass two more important parameters. The first is the image size: images used by our model will be squares of the given size. The second is the batch size: how many images are loaded and processed by the GPU at once. These two parameters should be changed with respect to one another (larger images usually call for smaller batches to fit in GPU memory) and, as a result, they influence how fast our model learns. Finally, we pass the parameters for the transformations used for data augmentation in the `transforms` object.

Initially we will start with `32x32` pixel images. After creating the object with our data, we normalize it. Note that the cell below normalizes with statistics computed from a batch of our own images (`data.batch_stats()`). The usual alternative for a pretrained model is `imagenet_stats`: the statistics of [ImageNet](http://www.image-net.org/), an image database containing hundreds to thousands of images per category that is widely used by researchers, educators and students.
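
Whichever statistics are used, normalization itself is a per-channel shift and scale. The sketch below applies it to a single pixel with the commonly quoted ImageNet means and standard deviations; the helper `normalize_pixel` and the sample pixel are illustrative assumptions, not fastai code.

```python
def normalize_pixel(rgb, mean, std):
    """Normalize one RGB pixel channel by channel: (value - mean) / std."""
    return [(v - m) / s for v, m, s in zip(rgb, mean, std)]

# Per-channel ImageNet statistics for pixel values scaled to [0, 1].
imagenet_mean = [0.485, 0.456, 0.406]
imagenet_std = [0.229, 0.224, 0.225]

pixel = [0.485, 0.680, 0.181]            # hypothetical pixel
normalized = normalize_pixel(pixel, imagenet_mean, imagenet_std)
print([round(v, 3) for v in normalized])
```

A channel equal to the mean maps to zero, and a channel one standard deviation above or below the mean maps to roughly plus or minus one.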

In [None]:
IMG_SIZE = 32
BATCH_SIZE = 256

In [None]:
data = ImageDataBunch.from_csv(path, folder = 'train', csv_labels = "train_labels.csv",
                               test = 'test', suffix=".tif", size = IMG_SIZE, bs = BATCH_SIZE,
                               ds_tfms = transforms)
data.path = pathlib.Path('.')
stats = data.batch_stats()        
data.normalize(stats)

We can verify that the data has been loaded correctly. First we check the `data.classes` attribute, which stores the classes that will be used as output variables for the classification model; `data.c` contains the number of classes. Finally, we can see what the loaded images look like by showing a small sample of them.

In [None]:
print(data.classes)
data.c

In [None]:
data.show_batch(rows=5, figsize=(12,9))

## Creating a model

We create our classifier using an extension of the convolutional neural network called [DenseNet](https://arxiv.org/abs/1608.06993) (which stands for Densely Connected Convolutional Network). This architecture builds on the observation that networks can be substantially deeper, more accurate and more efficient to train if they contain shorter connections between layers close to the input and layers close to the output. Thanks to that, DenseNet offers several advantages over a standard CNN (Convolutional Neural Network), among them strengthened feature propagation and a reduced number of parameters.

We will use the `cnn_learner` function from `fastai`, which creates a learner object containing the model with the chosen architecture and the metrics that will be calculated after each epoch of training. Additionally, our model will be pretrained on the `ImageNet` dataset. This makes the early layers come with already learned parameters, so our model initially has a pretty good understanding of basic shapes, curves, edges and other generic elements that make up every image, regardless of the domain. We use that as a base, which gives us a head start: our model only needs to learn how to use those basic shapes to recognize the problem we are solving. This technique is called transfer learning.

For the training part we will use accuracy as a metric.

In [None]:
arch = models.densenet121
learn = cnn_learner(data, arch, pretrained = True, metrics = [accuracy])

## Training the model

Training takes place in iterations called epochs. During each epoch our model gets a chance to look at every input image exactly once and updates its parameters based on what it sees. There are two important parameters for training: the number of epochs and the learning rate. The first defines how many times we show each image to our model; the second defines how fast we move through the solution space, that is, how rapidly the model updates its parameters.

### Learning rates and number of epochs

Both parameters, in conjunction, influence how quickly our model learns. One caveat is that they have to be chosen experimentally, as there is no fully automatic way to find a good learning rate or number of epochs. We need to make sure that both are high enough to train the model in a reasonable time, and low enough to avoid overfitting or model divergence.
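
The effect of the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on `f(w) = w**2`; it is unrelated to our network and only illustrates slow convergence, fast convergence and divergence.

```python
def gradient_descent(lr, steps=20, start=5.0):
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = start
    for _ in range(steps):
        grad = 2 * w          # derivative of w**2
        w -= lr * grad
    return w

for lr in (0.01, 0.4, 1.1):
    print(f"lr={lr}: final |w| = {abs(gradient_descent(lr)):.4f}")
```

With the smallest rate the parameter has barely moved after 20 steps, the middle rate lands very close to the minimum at zero, and the largest rate overshoots further on every step.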

`fastai` provides convenient tools for determining those parameters. For the learning rate we will use `lr_find`, which plots the model loss with respect to the learning rate. As a rule of thumb, we should choose a rate at which the loss decreases the fastest with minimal variation. This ensures that parameter updates converge and are relatively fast. For the number of epochs, we need to pay attention to the `train_loss` and `valid_loss` values reported during training (the loss function calculated on the training and validation sets). We should strive to keep those values close to one another, both steadily decreasing, with `valid_loss` slightly higher than `train_loss`.
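
The rule of thumb for reading the learning rate plot can be sketched as follows. The exponential schedule and its default bounds are our assumption about how `lr_find` sweeps the rates, the losses are invented, and both helper functions are hypothetical.

```python
def lr_range_schedule(start_lr=1e-7, end_lr=10.0, num_it=100):
    """Learning rates growing exponentially from start_lr to end_lr."""
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (num_it - 1)) for i in range(num_it)]

def steepest_descent_lr(lrs, losses):
    """Pick the learning rate where the loss drops fastest between neighbours."""
    drops = [losses[i + 1] - losses[i] for i in range(len(losses) - 1)]
    steepest = min(range(len(drops)), key=drops.__getitem__)
    return lrs[steepest]

lrs = lr_range_schedule(num_it=8)
# Hypothetical recorded losses: flat, then falling, then diverging.
losses = [0.70, 0.69, 0.68, 0.60, 0.45, 0.40, 0.90, 3.00]
print(steepest_descent_lr(lrs, losses))
```

In practice the plot is noisy, so we still inspect it by eye instead of trusting a single steepest point.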



In [None]:
learn.lr_find()
learn.recorder.plot()

The standard way of training the model is the `fit` function, which follows the training method described before. We will use a variation of this method called `fit_one_cycle`. It has the same interface (number of epochs and learning rate), though it treats the learning rate slightly differently.

It increases the learning rate during the first part of training, which helps the model move in the right direction and explore the solution space. Towards the end it applies a technique called learning rate annealing: the learning rate decreases, which prevents the model from diverging.
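
A rough sketch of such a schedule is shown below. The cosine shape, the 30% warm-up share, the starting rate of `max_lr / 25` and the very small final rate are assumptions about fastai's defaults, not values taken from this notebook, and the function names are ours.

```python
import math

def cosine_anneal(start, end, pct):
    """Move from start to end along half a cosine wave as pct goes 0 -> 1."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

def one_cycle_lrs(max_lr, total_steps, pct_start=0.3, div_factor=25.0):
    """Warm up to max_lr, then anneal towards zero over the rest of training."""
    warmup_steps = int(total_steps * pct_start)
    low_lr = max_lr / div_factor
    lrs = []
    for step in range(total_steps):
        if step < warmup_steps:
            lrs.append(cosine_anneal(low_lr, max_lr, step / warmup_steps))
        else:
            pct = (step - warmup_steps) / (total_steps - warmup_steps)
            lrs.append(cosine_anneal(max_lr, low_lr / 1e4, pct))
    return lrs

schedule = one_cycle_lrs(max_lr=1e-2, total_steps=10)
print([f"{lr:.5f}" for lr in schedule])
```

The printed rates rise to the maximum after the warm-up steps and then fall steadily, which is the shape described above.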

In [None]:
learn.fit_one_cycle(6, 1e-2)

### Fine tuning

As we said before, our model comes already pretrained. This means that for the initial epochs we do not update the very first layers, only the last ones. In this configuration we can afford to move faster (update parameters more aggressively), since our model has a solid base.

To further improve the model we will use the `unfreeze` function, which allows all model parameters to be updated during training. With that in mind, we need to be more careful with the pace of training and decrease the learning rates. We will use discriminative learning rates: instead of specifying a single learning rate, we give a range between two values. The first value is applied to the initial layers, the last to the final layers, and the layers in between get rates spread between the two.
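
How a range turns into per-group rates can be sketched like this. We assume here that the rates are spaced in equal multiplicative steps across the layer groups, which is our understanding of fastai's behaviour; the helper name and the choice of three groups are illustrative.

```python
def layer_group_lrs(start, stop, n_groups):
    """Spread learning rates from start to stop in equal multiplicative steps."""
    if n_groups == 1:
        return [stop]
    step = (stop / start) ** (1 / (n_groups - 1))
    return [start * step ** i for i in range(n_groups)]

# Three layer groups is an assumption made for illustration.
lrs = layer_group_lrs(1e-5, 1e-3, n_groups=3)
print([f"{lr:.0e}" for lr in lrs])
```

Under these assumptions the earliest group trains a hundred times more slowly than the last one, so the generic pretrained features change the least.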

In [None]:
learn.unfreeze()
learn.lr_find()
learn.recorder.plot()

In [None]:
learn.fit_one_cycle(2, max_lr = slice(1e-5,1e-3))

## Transfer learning revisited

We have already used transfer learning when we trained our model on top of an existing model pretrained on the ImageNet dataset. Now we have a model that is relatively good at recognizing cancer in `32x32` histopathologic images. To avoid overfitting, we can reuse the data set but scale the images to `64x64`. To our model this will look like a completely new dataset. This is the part where we use transfer learning again: we take the model previously trained on `32x32` images and train it on a new dataset of `64x64` images.

We will follow the same training scheme as previously:
1. Create a dataset.
2. Choose a learning rate.
3. Train the frozen model.
4. Unfreeze the model.
5. Decrease the learning rate, using a range rather than a single value.
6. Fine-tune the unfrozen model.

In [None]:
IMG_SIZE = 64
BATCH_SIZE = 128

In [None]:
data = ImageDataBunch.from_csv(path, folder = 'train', csv_labels = "train_labels.csv",
                               test = 'test', suffix=".tif", size = IMG_SIZE, bs = BATCH_SIZE,
                               ds_tfms = transforms)
data.path = pathlib.Path('.')
stats = data.batch_stats()
data.normalize(stats)
# Hand the resized data to the existing learner; otherwise it keeps training on the 32x32 images.
learn.data = data

In [None]:
learn.freeze()
learn.lr_find()
learn.recorder.plot()

In [None]:
learn.fit_one_cycle(6, 1e-2)

In [None]:
learn.unfreeze()
learn.lr_find()
learn.recorder.plot()

In [None]:
learn.fit_one_cycle(2, max_lr = slice(1e-4,1e-3))

## Evaluation

According to the [evaluation overview](https://www.kaggle.com/c/histopathologic-cancer-detection/overview/evaluation) for the Histopathologic Cancer Detection competition on Kaggle:
> Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.

Until this point we have been using accuracy as a metric, which gave good insight into how our model was doing and whether it was improving. To see how our model would score in the Kaggle competition, we use the `roc_auc_score` metric from `sklearn`.

Receiver operating characteristic (ROC): https://en.wikipedia.org/wiki/Receiver_operating_characteristic
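
One way to read the AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force on made-up data; `pairwise_auc` is an illustrative helper, not the `sklearn` implementation.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5      # ties count as half a correct pair
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted probabilities
print(pairwise_auc(labels, scores))   # prints 0.75
```

Three of the four positive/negative pairs are ordered correctly, hence 0.75. Only the ordering of the scores matters, not their absolute values.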

In [None]:
from sklearn.metrics import roc_auc_score

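def auc_score(y_pred, y_true, tens=True):
    # Sigmoid is monotonic, so it does not change the ranking that AUC is computed from.
    score = roc_auc_score(y_true, torch.sigmoid(y_pred)[:,1])
    # Optionally wrap the result in a tensor, as fastai metrics expect.
    return tensor(score) if tens else score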

An interesting evaluation tool, especially for classification, is the confusion matrix. It shows how many items the model assigned to each class with regard to the class they actually belong to. This is especially important in this case: the most serious error is when a histopathologic image contains cancer cells while our model predicts that it does not.
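
For two classes the matrix is small enough to build by hand. The sketch below uses invented labels and an illustrative helper, and reads off the false negatives, the error type singled out above.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Rows are actual classes, columns are predicted classes (0 then 1)."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

actual =    [1, 1, 1, 0, 0, 0, 0, 1]   # hypothetical labels
predicted = [1, 0, 1, 0, 0, 1, 0, 1]   # hypothetical model output
matrix = confusion_matrix_2x2(actual, predicted)
false_negatives = matrix[1][0]          # cancer present, predicted absent
recall = matrix[1][1] / (matrix[1][0] + matrix[1][1])
print(matrix, false_negatives, recall)
```

Recall, the share of actual cancer cases that the model catches, is the number to watch when false negatives are the costly mistake.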

In [None]:
interpretation = ClassificationInterpretation.from_learner(learn)
interpretation.plot_confusion_matrix()

Finally, we print both the accuracy and the AUC score of the trained model.

In [None]:
predictions,y = learn.TTA()
# Convert the metric tensors to plain floats so they print cleanly.
acc = accuracy(predictions, y).item()
print('Final accuracy of the model: {:.2f} %.'.format(acc * 100))
prediction_score = auc_score(predictions,y).item()
print('Final AUC of the model: {:.4f}.'.format(prediction_score))

## Creating a submission file

The last step is generating the Kaggle submission file, based on a sample provided by the platform. For this purpose we use the test dataset and generate predictions with our model.

In [None]:
submissions = pd.read_csv(path/'sample_submission.csv')
id_list = list(submissions.id)
# Predict on the test set with test-time augmentation; column 1 holds the score for class 1.
predictions,y = learn.TTA(ds_type=DatasetType.Test)
prediction_list = list(predictions[:,1])
# Map each test file path to its prediction, then reorder to match the sample submission.
prediction_dict = dict((key, value.item()) for (key, value) in zip(learn.data.test_ds.items, prediction_list))
prediction_ordered = [prediction_dict[path/('test/' + image_id + '.tif')] for image_id in id_list]
submissions = pd.DataFrame({'id':id_list,'label':prediction_ordered})
submissions.to_csv("submission_result.csv",index = False)