# Histopathologic Cancer Detection

#### Ömer Faruk Yaşar - 21527577
#### Furkan Kaya - 21527161

We build a model that predicts whether metastatic cancer is present in small image patches (96x96px) taken from high-resolution pathology scans. Our data set is a modified version of the PatchCamelyon (PCam) dataset from which duplicates have been removed. Using a Convolutional Neural Network, we aim to achieve a successful binary classification. To measure our success, we use the area under the ROC curve.

## Table of Contents

[Problem](#problem)   
[Data Understanding](#data_understanding)   
[Data Preparation](#data_preparation)   
[Modeling](#modeling)   
[Evaluation](#evaluation)   
[References](#references)

## Problem <a class="anchor" id="problem"></a>

For pathologists, diagnosis is a time-consuming and laborious process. They must examine **high resolution digital pathological scans** (around 100,000x100,000px) in great detail, and small metastases are easy to overlook in such a large image. Because of such an oversight, a patient who has cancer may be diagnosed as cancer-free. Although we will not solve this problem completely, we want to detect small metastases (a **binary classification problem**) with machine learning techniques on 96x96px patches cut from these high-resolution scans, and so reduce the workload of pathologists. Pathologists will then know which regions of the large scans to focus on and will spend less time.

In short, we have **96x96px** images taken from large pathological scans, and we want to find out whether **deep learning techniques** can classify, with high accuracy, whether these images contain cancerous tissue.

## Data Understanding<a class="anchor" id="data_understanding"></a>

The dataset we use is a subset of the original [PCam dataset](https://github.com/basveeling/pcam), which is itself derived from the [Camelyon16 Challenge dataset](https://camelyon16.grand-challenge.org/Data/). Camelyon16 contains 400 H&E-stained whole-slide images of sentinel lymph node sections that were acquired and digitized at 2 different centers using a 40x objective. The PCam datasets, including this one, use 10x undersampling to increase the field of view, which gives a resulting pixel resolution of 2.43 microns.
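
As a rough sanity check on these numbers, the arithmetic can be sketched as follows. The 0.243 micron scan resolution is an assumption taken from the PCam README, not a figure stated above; the 96px patch size and the 10x undersampling come from the text.

```python
# Assumed scan resolution at the 40x objective (from the PCam README, not from this notebook)
SCAN_RESOLUTION_UM = 0.243
UNDERSAMPLING = 10      # 10x undersampling, as described above
PATCH_SIZE_PX = 96      # patch side length in pixels

# Each patch pixel covers 10 scan pixels along each axis
patch_resolution_um = SCAN_RESOLUTION_UM * UNDERSAMPLING
# Physical side length of one 96px patch
patch_width_um = PATCH_SIZE_PX * patch_resolution_um

print(f"{patch_resolution_um:.2f} microns per pixel")
print(f"{patch_width_um:.2f} microns per patch side")
```

Under these assumptions, a single patch covers a square of tissue roughly a quarter of a millimeter wide.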

It consists of **277,493** color images (96x96px) extracted from **histopathologic scans of lymph node sections**. Each image is annotated with a binary label indicating the presence of metastatic tissue. The training set contains **220,025** images and the test set contains **57,468** images.

In [None]:
import os

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib.patches as patches
import seaborn as sns
sns.set(style="darkgrid")
from PIL import Image

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim import lr_scheduler
from torch.optim.lr_scheduler import StepLR, ReduceLROnPlateau, CosineAnnealingLR
from torch.utils.data import TensorDataset, DataLoader, Dataset
from torch.utils.data.sampler import SubsetRandomSampler

import torchvision
import torchvision.transforms as transforms

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

In [None]:
train_images_folder_path = "../input/histopathologic-cancer-detection/train"
test_images_folder_path = "../input/histopathologic-cancer-detection/test"
labels_file_path = '../input/histopathologic-cancer-detection/train_labels.csv'

In [None]:
labels = pd.read_csv(labels_file_path)
labels.head()

Here the **id**s match the filenames of the images. **Label 0** means negative, that is, an image without cancerous tissue. **Label 1** means positive, that is, an image with cancerous tissue.

In [None]:
fig, ax = plt.subplots(figsize=(5,5))
ax.pie([labels.label.value_counts()[0], labels.label.value_counts()[1]], labels=['Negative', 'Positive'], autopct='%1.1f%%');

We have a training data set with a distribution of approximately **60:40** (60% negative, 40% positive).
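
One consequence of a 60:40 split is that a model which always predicts "negative" would already reach about 60% accuracy, which is part of why a ranking metric such as AUC is more informative here. The sketch below illustrates this on a hypothetical toy label list (not the real data); `class_shares` is an illustrative helper.

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of samples that belongs to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: counts[cls] / total for cls in sorted(counts)}

# Hypothetical toy labels with the same 60:40 ratio as the training set
toy_labels = [0] * 6 + [1] * 4

shares = class_shares(toy_labels)
# Accuracy of a classifier that always predicts the most frequent class
majority_baseline = max(shares.values())

print(shares)
print(majority_baseline)
```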

In [None]:
sns.countplot(x='label', data=labels);

> *A positive label indicates that the center 32x32px region of a patch contains at least one pixel of tumor tissue. Tumor tissue in the outer region of the patch does not influence the label. This outer region is provided to enable the design of fully-convolutional models that do not use any zero-padding, to ensure consistent behavior when applied to a whole-slide image.* (Source: https://github.com/basveeling/pcam)
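
The labeling rule quoted above can be sketched in a few lines. This is only an illustration: the Kaggle data provides the final labels, not pixel-level tumor masks, so the mask and the helpers `patch_label` and `empty_mask` are hypothetical.

```python
def patch_label(tumor_mask, patch_size=96, center_size=32):
    """Return 1 if any tumor pixel lies inside the central center_size x center_size region."""
    start = (patch_size - center_size) // 2   # 32 for a 96px patch
    stop = start + center_size                # 64 for a 96px patch
    center_rows = tumor_mask[start:stop]
    return int(any(any(row[start:stop]) for row in center_rows))

def empty_mask(size=96):
    """Create a size x size mask with no tumor pixels (hypothetical helper)."""
    return [[0] * size for _ in range(size)]

# Tumor pixel only in the outer ring: it does not influence the label
outer_only = empty_mask()
outer_only[5][5] = 1

# Tumor pixel inside the central region: the patch is positive
center_hit = empty_mask()
center_hit[40][50] = 1

print(patch_label(outer_only), patch_label(center_hit))
```

The central region computed here (rows and columns 32 to 63) is the same square that the plotting cell below outlines with `patches.Rectangle((32,32), 32, 32, ...)`.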

In [None]:
fig, ax = plt.subplots(2,5, figsize=(20,8))
fig.suptitle('Histopathologic scans of lymph node sections')

# Negative Samples
for i, image_id in enumerate(labels[labels.label == 0]['id'][:5]):
    path = os.path.join(train_images_folder_path, image_id)
    ax[0,i].imshow(Image.open(path + '.tif'))
    box = patches.Rectangle((32,32), 32, 32, linewidth=5, edgecolor='g', facecolor='none')
    ax[0,i].add_patch(box)
ax0 = ax[0,0].set_ylabel('Negative samples')

# Positive Samples
for i, image_id in enumerate(labels[labels.label == 1]['id'][:5]):
    path = os.path.join(train_images_folder_path, image_id)
    ax[1,i].imshow(Image.open(path + '.tif'))
    box = patches.Rectangle((32,32), 32, 32, linewidth=5, edgecolor='r', facecolor='none')
    ax[1,i].add_patch(box)
ax1 = ax[1,0].set_ylabel('Positive samples')

As these sample images show (the first five of each class), it is practically impossible for us to estimate the label by looking at a picture; this is a very challenging task even for professionals in this domain. Since the decision depends on many features taken together, such as cell density and cell color distribution, a deep learning method that learns from all of these features should make the task much easier.

## Data Preparation<a class="anchor" id="data_preparation"></a>

In machine learning, overfitting and poor generalization are major problems. Addressing them usually requires a large and diverse dataset, but such a dataset is often not available because collecting and labeling data is a challenging and expensive process. To increase the diversity of our dataset, we therefore apply **data augmentation** to the training images, both to reduce overfitting and to obtain a better model. Each training image is reflect-padded by 64px on every side (96x96px becomes 224x224px), flipped horizontally and vertically with probability 0.5 each, rotated by a random angle of up to 15 degrees, and then normalized. These operations do not increase the size of the dataset: they are applied on the fly during training, so the model sees a different version of the same image in different epochs, which increases the diversity of the data. For data augmentation we use PyTorch's data loader and transformation classes, which are handy and easy to use. First we define which transforms to perform, and then we apply them while creating the training and test sets; the test images are only padded and normalized. We also split our training data **70% to 30%** into training and validation subsets.
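
To illustrate why on-the-fly augmentation adds diversity without adding images, here is a minimal sketch in plain Python. It is not the torchvision implementation: `random_flip` is a hypothetical helper, and the tiny 2x3 nested list stands in for an image.

```python
import random

def random_flip(image, rng, p=0.5):
    """Flip a 2-D image (list of rows) horizontally and/or vertically, each with probability p."""
    if rng.random() < p:
        image = [row[::-1] for row in image]   # horizontal flip: reverse every row
    if rng.random() < p:
        image = image[::-1]                    # vertical flip: reverse the row order
    return image

rng = random.Random(0)      # seeded so the sketch is repeatable
image = [[1, 2, 3],
         [4, 5, 6]]

# The same stored image is transformed again every time it is loaded,
# so each "epoch" may see a different version while the dataset size stays the same.
versions = [random_flip(image, rng) for _ in range(4)]
for epoch, version in enumerate(versions, start=1):
    print(epoch, version)
```

The stored image itself is never modified; only the copy handed to the model changes from one epoch to the next.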

In [None]:
train_indices, validation_indices = train_test_split(labels.label, stratify=labels.label, test_size=0.3)

In [None]:
data_transformations_train = transforms.Compose([
    transforms.Pad(64, padding_mode='reflect'),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

data_transformations_test = transforms.Compose([
    transforms.Pad(64, padding_mode='reflect'),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

In [None]:
fig, ax = plt.subplots(1,6, figsize=(24,4))

image_path = os.path.join(train_images_folder_path, 'c18f2d887b7ae4f6742ee445113fa1aef383ed77.tif')
image_original = Image.open(image_path)

center_crop = transforms.CenterCrop(64)

composition = transforms.Compose([
    transforms.Pad(64, padding_mode='reflect'),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(15)
])

ax[0].imshow(image_original)
ax[1].imshow(center_crop(image_original))
ax[2].imshow(transforms.functional.hflip(image_original))
ax[3].imshow(transforms.functional.vflip(image_original))
ax[4].imshow(transforms.functional.rotate(image_original, 15))
ax[5].imshow(composition(image_original))

ax0 = ax[0].set_xlabel('Original image')
ax1 = ax[1].set_xlabel('Center crop (64x64px)')
ax2 = ax[2].set_xlabel('Horizontal Flip')
ax3 = ax[3].set_xlabel('Vertical Flip')
ax4 = ax[4].set_xlabel('Rotation')
ax5 = ax[5].set_xlabel('Composition of these transformations')

Since some of the transformations in the composition are random, the result may vary on each run.

In [None]:
class PCamDataset(Dataset):
    """Loads PCam patches from a folder; labels come from labels_dict for the 'train' split."""

    def __init__(self, data_folder, data_type, transform, labels_dict=None):
        self.data_folder = data_folder
        self.data_type = data_type
        self.image_files_list = os.listdir(data_folder)
        self.transform = transform
        # Avoid a mutable default argument; the test set has no labels
        self.labels_dict = labels_dict if labels_dict is not None else {}
        if self.data_type == 'train':
            self.labels = [self.labels_dict[image_file_name.split('.')[0]] for image_file_name in self.image_files_list]
        else:
            # Placeholder labels for the unlabeled test set
            self.labels = [0 for _ in range(len(self.image_files_list))]

    def __len__(self):
        return len(self.image_files_list)

    def __getitem__(self, index):
        image_file_name = os.path.join(self.data_folder, self.image_files_list[index])
        image = Image.open(image_file_name)
        image = self.transform(image)
        # self.labels is aligned with image_files_list, so it can be indexed directly
        label = self.labels[index]
        return image, label

In [None]:
image_labels_dict = { image:label for image, label in zip(labels.id, labels.label) }

dataset = PCamDataset(data_folder=train_images_folder_path, data_type='train', transform=data_transformations_train, labels_dict=image_labels_dict)
test_set = PCamDataset(data_folder=test_images_folder_path, data_type='test', transform=data_transformations_test)

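# The samplers draw positions 0..N-1 of `dataset`, which follows os.listdir order rather than
# CSV row order, so the two subsets are disjoint but only approximately stratified.
train_sampler = SubsetRandomSampler(list(train_indices.index))
valid_sampler = SubsetRandomSampler(list(validation_indices.index))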

batch_size = 64

train_loader = DataLoader(dataset, batch_size=batch_size, sampler=train_sampler)
valid_loader = DataLoader(dataset, batch_size=batch_size, sampler=valid_sampler)
test_loader = DataLoader(test_set, batch_size=batch_size)

## Modeling<a class="anchor" id="modeling"></a>

As mentioned earlier in this notebook, we approach this binary classification problem with deep learning, specifically with Convolutional Neural Networks (CNNs), which are widely used for visual classification and detection problems. There are many popular CNN architectures with pretrained models, such as VGG and ResNet, and we adapt one of them to our problem. This avoids training a network from scratch, which saves time and typically improves accuracy. This process is called transfer learning in the literature and is widely used in deep learning solutions. There are two versions of transfer learning: you can fine-tune the pretrained weights on your own data, or freeze them and train only the newly added layers. We try both approaches in our experiments.

We chose ResNet (Residual Networks) for several reasons. ResNets are widely used for visual problems in machine learning: as the number of layers in a network grows, optimization and training become harder because of vanishing/exploding gradients. ResNet addresses this problem with residual connections between layers, which help preserve the features learned in earlier layers. We decided to use ResNet with weights pretrained on the ImageNet dataset, but ResNet comes in several depths. We trained ResNet18, ResNet50 and ResNet101 and evaluated them on the validation set. They performed almost the same on our problem, so we continue with ResNet18: it is the shallowest of the three and therefore the easiest to train.

After that, we try to pick the best hyperparameters for our network. A hyperparameter is a parameter that is set before training and is not updated during it, so it cannot be learned from the data. Batch size, learning rate, loss function and number of epochs are the hyperparameters in our project, and we try to pick good values by running experiments with different settings for each of them.
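
One simple way to organize such experiments is to enumerate every combination of candidate values (a grid search). The sketch below assumes this approach; the candidate values and the helper `iter_configs` are hypothetical and are not necessarily the settings tried in this project.

```python
from itertools import product

# Hypothetical candidate values for three of the hyperparameters named above
search_space = {
    "batch_size": [32, 64, 128],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "n_epochs": [4, 8],
}

def iter_configs(space):
    """Yield one dict per combination of candidate values."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(search_space))
print(len(configs), "configurations, for example", configs[0])
```

Each configuration would then be trained and compared on the validation set. The number of runs grows multiplicatively with every added hyperparameter, so in practice the grid is usually kept small.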

In [None]:
model = torchvision.models.resnet18(pretrained=True)
# Freeze the pretrained backbone so that its weights are not updated during training
for param in model.parameters():
    param.requires_grad = False

We replace the last fully connected layer with a linear layer that has 2 outputs so that the model can do binary classification.
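
With the backbone frozen, only this new layer is trained. A quick back-of-the-envelope count shows how small it is; the sketch assumes that `model.fc.in_features` is 512, as it is for torchvision's ResNet18, and `linear_param_count` is an illustrative helper.

```python
def linear_param_count(in_features, out_features, bias=True):
    """Number of weights (plus optional biases) in a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

RESNET18_FC_IN = 512   # assumed value of model.fc.in_features for ResNet18

old_head = linear_param_count(RESNET18_FC_IN, 1000)  # original ImageNet head with 1000 classes
new_head = linear_param_count(RESNET18_FC_IN, 2)     # replacement head with 2 outputs

print(old_head, new_head)
```

So the frozen setup updates only about a thousand parameters per step, which is why it trains quickly.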

In [None]:
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 2)

In [None]:
model.cuda()
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(model.fc.parameters(), lr=0.001, momentum=0.9)
exp_lr_scheduler = lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

In [None]:
n_epochs = 4
patience = 10
p = 0
stop = False
valid_loss_min = np.inf  # np.Inf was removed in NumPy 2.0

train_loss_epoch = []
val_loss_epoch = []
val_auc_epoch = []


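for epoch in range(1, n_epochs+1):

    # Switch back to training mode; model.eval() from the previous epoch would otherwise persist
    model.train()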

    train_loss = []
    train_auc = []

    for batch_i, (data, target) in enumerate(train_loader):

        data, target = data.cuda(), target.cuda()

        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output[:,1], target.float())
        train_loss.append(loss.item())
        
        a = target.data.cpu().numpy()
        b = output[:,-1].detach().cpu().numpy()
        train_auc.append(roc_auc_score(a, b))

        loss.backward()
        optimizer.step()
    
    exp_lr_scheduler.step()
    
    train_loss_epoch.append(np.mean(train_loss))
    
    model.eval()
    
    val_loss = []
    val_auc = []
    
    # Gradients are not needed for validation, so skip building the autograd graph
    with torch.no_grad():
        for batch_i, (data, target) in enumerate(valid_loader):
            data, target = data.cuda(), target.cuda()
            output = model(data)

            loss = criterion(output[:,1], target.float())

            val_loss.append(loss.item())
            a = target.cpu().numpy()
            b = output[:,-1].cpu().numpy()
            val_auc.append(roc_auc_score(a, b))

    val_loss_epoch.append(np.mean(val_loss))
    val_auc_epoch.append(np.mean(val_auc))
    
    print(f'Epoch {epoch}, train loss: {np.mean(train_loss):.4f}, valid loss: {np.mean(val_loss):.4f}, train auc: {np.mean(train_auc):.4f}, valid auc: {np.mean(val_auc):.4f}')
    
    valid_loss = np.mean(val_loss)
    if valid_loss <= valid_loss_min:
        print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(
        valid_loss_min,
        valid_loss))
        torch.save(model.state_dict(), 'model.pt')
        valid_loss_min = valid_loss
        p = 0

    # check if validation loss didn't improve
    if valid_loss > valid_loss_min:
        p += 1
        print(f'{p} epochs of increasing val loss')
        if p > patience:
            print('Stopping training')
            stop = True
            break        
    if stop:
        break

## Evaluation<a class="anchor" id="evaluation"></a>

Our evaluation metric is the **Area Under the ROC Curve (AUC) score**, one of the most widely used metrics for evaluating a binary classifier. It measures how well the model separates the classes: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The higher the AUC, the better the model is at distinguishing between healthy tissue and cancerous tissue.
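
To make the metric concrete, here is a minimal pure-Python sketch that computes AUC by comparing every positive example with every negative one. It is an illustration only, not the `roc_auc_score` implementation used in this notebook; the function `roc_auc` and the toy scores are hypothetical.

```python
import math

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
logits = [-1.2, 0.3, 0.1, 2.5]                      # raw model outputs (toy values)
probs = [1 / (1 + math.exp(-z)) for z in logits]   # the same outputs after a sigmoid

print(roc_auc(labels, logits))
print(roc_auc(labels, probs))
```

Both calls give the same value because AUC depends only on the ranking of the scores, and the sigmoid preserves that ranking. This is why raw logits can be scored directly. The sketch also shows that AUC is undefined for a batch that contains a single class, which is worth keeping in mind when AUC is averaged over batches.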

In [None]:
model = torchvision.models.resnet18(pretrained=True)

num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 2)

model.cuda()
model.load_state_dict(torch.load('model.pt'))
model.eval();

In [None]:
plt.plot(train_loss_epoch, label='Train Loss')
plt.plot(val_loss_epoch, label='Validation Loss')
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend();

In [None]:
plt.plot(val_auc_epoch, label='Validation AUC')
plt.xlabel("Epochs")
plt.ylabel("Area Under the Curve")
plt.legend();

In [None]:
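preds = []
# Inference only: disable gradient tracking to save memory
with torch.no_grad():
    for batch_i, (data, target) in enumerate(test_loader):
        data = data.cuda()
        output = model(data)
        # Raw logit of output 1; AUC depends only on the ranking, so no sigmoid is applied
        pr = output[:,1].cpu().numpy()
        preds.extend(pr)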

In [None]:
test_preds = pd.DataFrame({'imgs': test_set.image_files_list, 'preds': preds})
test_preds['imgs'] = test_preds['imgs'].apply(lambda x: x.split('.')[0])

In [None]:
sub = pd.read_csv('../input/histopathologic-cancer-detection/sample_submission.csv')
sub = pd.merge(sub, test_preds, left_on='id', right_on='imgs')
sub = sub[['id', 'preds']]
sub.columns = ['id', 'label']
sub.head()

In [None]:
sub.to_csv('submission.csv', index=False)

## References<a class="anchor" id="references"></a>

[1] B. S. Veeling, J. Linmans, J. Winkens, T. Cohen, M. Welling. "Rotation Equivariant CNNs for Digital Pathology". arXiv:1806.03962

[2] Ehteshami Bejnordi et al. "Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer". JAMA: The Journal of the American Medical Association, 318(22), 2199–2210. doi:10.1001/jama.2017.14585

[3] He, Kaiming, Xiangyu Zhang, Shaoqing Ren and Jian Sun. “Deep Residual Learning for Image Recognition.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016): 770-778.

[4] Encyclopedia of Science Education, 2015. Transfer of Learning. pp.1079-1079.

https://www.kaggle.com/qitvision/a-complete-ml-pipeline-fast-ai

https://www.kaggle.com/artgor/simple-eda-and-model-in-pytorch

https://www.kaggle.com/ashishpatel26/hc-detection-using-pytorch-resnet-101

https://www.kaggle.com/abhinand05/histopathologic-cancer-detection-using-cnns

**Disclaimer!** <font color='grey'>This notebook was prepared by Ömer Faruk Yaşar and Furkan Kaya as a term project for the *BBM469 - Data Intensive Applications Laboratory* class. The notebook is available for educational purposes only. There is no guarantee of the correctness of the content provided, as it is student work.

If you think there is any copyright violation, please let us [know](https://forms.gle/BNNRB2kR8ZHVEREq8). 
</font>