## Table of Contents

1. [Preparation & Understanding the data structure](#prep)
2. [Exploratory Data Analysis](#eda)
3. [Data Preprocessing](#data)
4. [Defining the Model](#model)
5. [Training the Model](#train)
6. [Making & Visualising Predictions](#pred)

# Preparation & Understanding the data structure <a class="anchor" id="prep"></a>

### Importing packages

In [None]:
import numpy as np
import pandas as pd
import os
from os import listdir
import cv2

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
from PIL import Image
from glob import glob
from skimage.io import imread

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

import torch 
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import TensorDataset, DataLoader, Dataset
import torch.optim as optim

import time
import copy
from tqdm.notebook import tqdm

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

print('Imports complete')

### Configurations

In [None]:
# Model Parameters
num_epochs = 20
batch_size = 128
num_classes = 2
learning_rate = 0.001

# Device configuration
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
device

### Loading and understanding the data structure

In [None]:
base_dir = '../input/histopathologic-cancer-detection/'
print(os.listdir(base_dir))

In [None]:
labels = pd.read_csv(base_dir + "train_labels.csv")
labels.head()

In [None]:
labels.shape

In [None]:
labels.info()

This file maps the id of each training image to its label: 1 for cancer, 0 for no cancer.

In [None]:
train_path = base_dir + "train/"
test_path = base_dir + "test/"
train_files = listdir(train_path)
test_files = listdir(test_path)

In [None]:
train_files[:5]

In [None]:
test_files[:5]

In [None]:
# Number of images in train and test
print("Train size: ", len(train_files))
print("Test size: ", len(test_files))

In [None]:
print((len(train_files)/(len(train_files)+len(test_files)))*100, (len(test_files)/(len(train_files)+len(test_files)))*100)

The directories train and test contain the actual images with 79.3% and 20.7% of the total images respectively.

In [None]:
sub = pd.read_csv(base_dir + "sample_submission.csv")
sub.head()

In [None]:
sub.shape

In [None]:
sub.info()

This file contains the ids of test images and all the labels are set to 0. We need to modify the labels in this file according to our predictions.

# Exploratory Data Analysis <a class="anchor" id="eda"></a>

### Visualizing the number of patches with cancer vs without cancer.

In [None]:
plt.pie(labels.label.value_counts(), labels=['No Cancer', 'Cancer'], colors=['#90EE91', '#F47174'], autopct='%1.1f')
plt.show()

### Visualizing healthy and cancer patches

In [None]:
positive_images = np.random.choice(labels[labels.label==1].id, size=50, replace=False)
negative_images = np.random.choice(labels[labels.label==0].id, size=50, replace=False)

**Cancer patches**

In [None]:
fig, ax = plt.subplots(5, 10, figsize=(20,10))

for n in range(5):
    for m in range(10):
        img_id = positive_images[m + n*10]
        image = Image.open(train_path + img_id + ".tif")
        ax[n,m].imshow(image)
        ax[n,m].grid(False)
        ax[n,m].tick_params(labelbottom=False, labelleft=False)

**Healthy patches**

In [None]:
fig, ax = plt.subplots(5, 10, figsize=(20,10))

for n in range(5):
    for m in range(10):
        img_id = negative_images[m + n*10]
        image = Image.open(train_path + img_id + ".tif")
        ax[n,m].imshow(image)
        ax[n,m].grid(False)
        ax[n,m].tick_params(labelbottom=False, labelleft=False)

**Analysis**

Looking at the cancerous and healthy patches side by side, it is hard for an untrained eye to identify metastatic cancer. One possible observation is that the healthy patches have higher contrast than the cancerous ones, but this does not hold for all the images. It would be interesting to see what criteria pathologists use to identify metastatic cancer!

# Data Preprocessing <a class="anchor" id="data"></a>

### Splitting the data into train and validation sets

In [None]:
train, val = train_test_split(labels, stratify=labels.label, test_size=0.1)
print(len(train), len(val))

I have split the train data into train and validation sets in the ratio 9:1.
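To make the effect of `stratify` concrete, here is a minimal pure-Python sketch of a stratified 9:1 split on a toy label list. The helper name `stratified_split` and the toy 60/40 class balance are assumptions for illustration; this is not the scikit-learn implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.1, seed=0):
    """Hypothetical helper: split indices so each class keeps its share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train_idx, val_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_val = round(len(idxs) * test_size)  # validation share of this class
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return train_idx, val_idx

toy_labels = [0] * 600 + [1] * 400  # toy data, not the real class balance
train_idx, val_idx = stratified_split(toy_labels, test_size=0.1)

def positive_share(indices):
    """Fraction of positive labels among the given indices."""
    return sum(toy_labels[i] for i in indices) / len(indices)

print(len(train_idx), len(val_idx))
print(positive_share(train_idx), positive_share(val_idx))
```

Both subsets end up with the same share of positive labels, which is what stratification is meant to guarantee.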

**Plotting the positive and negative ratio in train and val sets**

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10,4))

sns.countplot(x=train.label, palette="Blues", ax=ax[0])
ax[0].set_title("Train dataset")
# sort_index() keeps the annotation order aligned with the bar order (0, 1)
for i, rows in enumerate(train['label'].value_counts().sort_index().values):
    ax[0].annotate(int(rows), xy=(i, rows), ha='center')
sns.countplot(x=val.label, palette="Greens", ax=ax[1])
ax[1].set_title("Validation dataset")
for i, rows in enumerate(val['label'].value_counts().sort_index().values):
    ax[1].annotate(int(rows), xy=(i, rows), ha='center')

### Custom Dataset

I have created a dataset that loads an image patch, converts it to RGB, applies the transform (including any augmentation) if one is given, and returns the image and its label.

In [None]:
class CancerDataset(Dataset):
    
    def __init__(self, df_data, data_dir = './', transform=None):
        super().__init__()
        self.df = df_data.values
        self.data_dir = data_dir
        self.transform = transform
        
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, index):
        img_name,label = self.df[index]
        img_path = os.path.join(self.data_dir, img_name + '.tif')
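        image = cv2.imread(img_path)
        # OpenCV loads images in BGR order; convert to RGB as described above
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)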
        if self.transform is not None:
            image = self.transform(image)
        return image, label
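OpenCV's `imread` returns the colour channels in BGR order, so getting RGB takes an explicit reordering (for example with `cv2.cvtColor(image, cv2.COLOR_BGR2RGB)`). As a minimal sketch of what that reordering does, here it is with plain NumPy on one synthetic pixel; the pixel values are made up for illustration.

```python
import numpy as np

# One synthetic pixel stored the way OpenCV would return it: (blue, green, red).
bgr_pixel = np.array([[[255, 0, 10]]], dtype=np.uint8)

# Reversing the last axis swaps the blue and red channels, giving RGB order.
rgb_pixel = bgr_pixel[..., ::-1]

print(bgr_pixel[0, 0].tolist())
print(rgb_pixel[0, 0].tolist())
```

This matters for pretrained backbones, which are usually trained on RGB inputs.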

### Data Augmentation

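To increase the variety of the training data, I have applied random transformations such as flipping and rotation to the train dataset. These are applied on the fly each time an image is loaded, so the number of stored images does not change. All three datasets are then converted to tensors and normalized.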

In [None]:
transform_train = transforms.Compose([transforms.ToPILImage(),
                                  transforms.RandomHorizontalFlip(), 
                                  transforms.RandomVerticalFlip(),
                                  transforms.RandomRotation(20), 
                                  transforms.ToTensor(),
                                  transforms.Normalize(mean=[0.5, 0.5, 0.5],std=[0.5, 0.5, 0.5])])

transform_val = transforms.Compose([transforms.ToPILImage(),
                                  transforms.ToTensor(),
                                  transforms.Normalize(mean=[0.5, 0.5, 0.5],std=[0.5, 0.5, 0.5])])

transform_test = transforms.Compose([transforms.ToPILImage(), 
                                  transforms.ToTensor(),
                                  transforms.Normalize(mean=[0.5, 0.5, 0.5],std=[0.5, 0.5, 0.5])])
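As a quick worked example of the normalization step: `ToTensor` scales 8-bit pixel values to the range [0, 1], and normalizing with mean 0.5 and std 0.5 then maps that range to [-1, 1]. The sketch below reproduces the arithmetic with NumPy on three stand-in values rather than real image data.

```python
import numpy as np

# Three stand-in values covering the [0, 1] range produced by ToTensor.
scaled = np.array([0.0, 0.5, 1.0])

mean, std = 0.5, 0.5
normalized = (scaled - mean) / std  # same per-channel formula as Normalize

print(normalized.tolist())
```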

In [None]:
train_dataset = CancerDataset(df_data=train, data_dir=train_path, transform=transform_train)
val_dataset = CancerDataset(df_data=val, data_dir=train_path, transform=transform_val)
test_dataset = CancerDataset(df_data=sub, data_dir=test_path, transform=transform_test)

### Creating pytorch dataloader

* The training data is reshuffled at every epoch, so the batches differ each time and the model doesn't see the samples in a fixed sequence.
* The last batch is dropped for the train and validation loaders, as it might contain fewer images than the batch size.

In [None]:
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, drop_last=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

In [None]:
print(len(train_dataloader), len(val_dataloader), len(test_dataloader))
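The batch counts printed above follow from simple arithmetic: with `drop_last=True` a loader yields the floor of samples divided by batch size, otherwise the ceiling. A small sketch, assuming a hypothetical dataset of 1000 samples (not the real dataset size) and a helper name `num_batches` chosen for illustration:

```python
import math

def num_batches(n_samples, batch_size, drop_last):
    """Number of batches a loader yields for a dataset of n_samples items."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)

n_samples = 1000   # hypothetical dataset size
batch_size = 128

kept = num_batches(n_samples, batch_size, drop_last=True)
full = num_batches(n_samples, batch_size, drop_last=False)
skipped = n_samples - kept * batch_size  # samples left out when dropping

print(kept, full, skipped)
```

Note that dropping the last batch of the validation loader means a few validation images are never evaluated.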

# Defining the Model <a class="anchor" id="model"></a>

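I am using a pretrained EfficientNet-B0 as the model. Its convolutional weights are frozen, and its final fully connected layer is replaced with a stack of dense layers that outputs the 2 classes.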

In [None]:
!pip install efficientnet_pytorch
from efficientnet_pytorch import EfficientNet
model = EfficientNet.from_pretrained('efficientnet-b1')

In [None]:
# Unfreeze model weights
'''for param in model.parameters():
    param.requires_grad = True
'''

In [None]:
#for param in model.parameters():
    #param.requires_grad=False

#originally, it was:
#(classifier): Linear(in_features=1792, out_features=1000, bias=True)


#we are updating it as a 2-class classifier:
'''model.classifier = nn.Sequential(
    nn.Linear(in_features=1280, out_features=512), #1792 is the original in_features
    nn.ReLU(), #ReLU to be the activation function
    nn.Dropout(p=0.5),
    nn.Linear(in_features=512, out_features=128),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(in_features=128, out_features=num_classes), 
)

model'''

In [None]:
#model = model.to('cuda')

In [None]:
from torchvision import models



In [None]:
#importing the pretrained EfficientNet model
use_cuda = torch.cuda.is_available()
model = EfficientNet.from_pretrained('efficientnet-b0')

# Freeze weights
for param in model.parameters():
    param.requires_grad = False
in_features = model._fc.in_features


# Defining Dense top layers after the convolutional layers
model._fc = nn.Sequential(
    nn.BatchNorm1d(num_features=in_features),    
    nn.Linear(in_features, 256),
    nn.ReLU(),
    nn.BatchNorm1d(256),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.BatchNorm1d(num_features=128),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.BatchNorm1d(num_features=64),
    nn.Dropout(0.5),
    nn.Linear(64, 2),
    )
if use_cuda:
    model = model.cuda()
    
model

The cell above prints the architecture of the model that will be trained.
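Because the backbone is frozen before the new head is attached, only the head's parameters remain trainable (new layers default to `requires_grad=True`). The sketch below shows how this can be checked by counting parameters. It uses a tiny stand-in model with made-up layer sizes instead of EfficientNet, so the numbers are illustrative only.

```python
import torch.nn as nn

# Stand-in for a pretrained backbone plus a new head (toy sizes, assumed).
backbone = nn.Linear(8, 4)
head = nn.Linear(4, 2)
toy_model = nn.Sequential(backbone, nn.ReLU(), head)

# Freeze the backbone only; the head keeps requires_grad=True.
for param in backbone.parameters():
    param.requires_grad = False

total = sum(p.numel() for p in toy_model.parameters())
trainable = sum(p.numel() for p in toy_model.parameters() if p.requires_grad)

print(total, trainable)
```

The same two sums can be run on `model` to confirm how many parameters the optimizer will actually update.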

### Loss and Optimizer

This task is a binary classification problem with two classes: 1 for cancer-positive images and 0 for cancer-negative images. For the loss function, I have used cross-entropy loss, which takes the raw model outputs (logits) and the integer class labels.
I have used Adam as the optimizer.

In [None]:
# selecting loss function
criterion = nn.CrossEntropyLoss()

# using the Adam optimizer
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
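To show what the cross-entropy loss computes, here is a minimal NumPy sketch: a log-softmax over the logits followed by the mean negative log-likelihood of the target class. The function name `cross_entropy` and the two example rows are assumptions for illustration, not values from the model.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of the target class under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

logits = np.array([[2.0, 0.0],    # leans strongly towards class 0
                   [0.0, 0.0]])   # undecided between the classes
targets = np.array([0, 1])

loss = cross_entropy(logits, targets)
print(loss)
```

The first row contributes a small loss because the correct class has the larger logit, while the undecided second row contributes log 2.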

# Training the Model <a class="anchor" id="train"></a>

This section builds the training loop for the model. It prints the loss and AUC for training and validation after each epoch.
As the evaluation metric, I have calculated the area under the ROC curve (AUC) between the predicted probability and the observed target, averaged over the batches of each epoch.
The losses and AUC values are also saved in lists for further evaluation of the model.

In [None]:
train_losses = []
val_losses = []
train_auc = []
val_auc = []
train_auc_epoch = []
val_auc_epoch = []
best_acc = 0.0
min_loss = np.inf

since = time.time()

for e in range(num_epochs):
    
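    train_loss = 0.0
    val_loss = 0.0
    # Reset the per-batch AUC lists so each epoch's mean covers only that epoch
    train_auc = []
    val_auc = []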
    
    # Train the model
    model.train()
    for i, (images, labels) in enumerate(tqdm(train_dataloader, total=int(len(train_dataloader)))):
        images = images.to(device)
        labels = labels.to(device)
        
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Loss and AUC
        train_loss += loss.item()
        y_actual = labels.data.cpu().numpy()
        # Score each image by the softmax probability of the positive class
        y_pred = torch.softmax(outputs, dim=1)[:, 1].detach().cpu().numpy()
        train_auc.append(roc_auc_score(y_actual, y_pred))
    
    # Evaluate the model (no gradients are needed for validation)
    model.eval()
    with torch.no_grad():
        for i, (images, labels) in enumerate(tqdm(val_dataloader, total=int(len(val_dataloader)))):
            images = images.to(device)
            labels = labels.to(device)

            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)

            # Loss and AUC
            val_loss += loss.item()
            y_actual = labels.data.cpu().numpy()
            # Score each image by the softmax probability of the positive class
            y_pred = torch.softmax(outputs, dim=1)[:, 1].cpu().numpy()
            val_auc.append(roc_auc_score(y_actual, y_pred))
    
    # Average losses and accuracies
    train_loss = train_loss/len(train_dataloader)
    val_loss = val_loss/len(val_dataloader)
    train_losses.append(train_loss)
    val_losses.append(val_loss)
    training_auc = np.mean(train_auc)
    validation_auc = np.mean(val_auc)
    train_auc_epoch.append(training_auc)
    val_auc_epoch.append(validation_auc)
    
    # Updating best validation accuracy
    if best_acc < validation_auc:
        best_acc = validation_auc
        
    # Saving best model
    if min_loss >= val_loss:
        torch.save(model, 'best_model.pb')
        torch.save(model.state_dict(), 'best_model.pt')
        min_loss = val_loss
    
    print('EPOCH {}/{}'.format(e+1, num_epochs))
    print('-' * 10)
    print("Train loss: {:.6f}, Train AUC: {:.4f}".format(train_loss, training_auc))
    print("Validation loss: {:.6f}, Validation AUC: {:.4f}\n".format(val_loss, validation_auc))

time_elapsed = time.time() - since
print('Training completed in {:.0f}m {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
print('Best validation AUC: {:.4f}'.format(best_acc))
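The AUC used above as the evaluation metric can be read as the probability that a randomly chosen positive image receives a higher score than a randomly chosen negative one. Here is a small pure-Python sketch of that definition on four made-up scores. The helper name `pairwise_auc` is an assumption, and the quadratic loop is for illustration only, not a replacement for `roc_auc_score`.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # made-up predicted probabilities

auc = pairwise_auc(y_true, scores)
print(auc)
```

Three of the four positive/negative pairs are ordered correctly here, so the AUC is 0.75. Because only the ordering matters, AUC is well defined only when a batch contains both classes.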

### Plotting training history

**Loss Convergence**

In [None]:
import gc
gc.collect()
plt.figure(figsize=(20,5))
plt.plot(train_losses, '-o', label="train")
plt.plot(val_losses, '-o', label="val")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Loss change over epoch")
plt.legend()

**AUC trend**

In [None]:
plt.figure(figsize=(20,5))
plt.plot(train_auc_epoch, '-o', label="train")
plt.plot(val_auc_epoch, '-o', label="val")
plt.xlabel("Epoch")
plt.ylabel("AUC")
plt.title("AUC over epoch")
plt.legend()

### Loading the best model

In [None]:
model.load_state_dict(torch.load('best_model.pt'))

# Making & Visualising Predictions <a class="anchor" id="pred"></a>

### Predictions on test dataset

I have used my best model to make predictions on the test dataset.

In [None]:
model.eval()

predictions = []

# Gradients are not needed at inference time
with torch.no_grad():
    for i, (images, _) in enumerate(tqdm(test_dataloader, total=int(len(test_dataloader)))):
        images = images.to(device)

        outputs = model(images)
        # Softmax turns the two logits into probabilities; column 1 is the cancer class
        pred = torch.softmax(outputs, dim=1)[:, 1].cpu().numpy()

        predictions.extend(pred)
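The model's raw outputs are logits, and a 0.5 threshold is only meaningful on probabilities. For a two-class output, the softmax probability of the positive class equals a sigmoid of the difference between the two logits. The sketch below works this out on three made-up logit pairs; the helper names `positive_probability` and `to_label` are assumptions for illustration.

```python
import math

def positive_probability(logit_neg, logit_pos):
    """Softmax probability of the positive class for a two-class output."""
    return 1.0 / (1.0 + math.exp(logit_neg - logit_pos))

def to_label(prob, threshold=0.5):
    """Map a probability to the display label used for the test images."""
    return "Cancer" if prob >= threshold else "Healthy"

examples = [(2.0, -1.0), (0.3, 0.3), (-0.5, 1.5)]  # made-up (negative, positive) logits
results = []
for logit_neg, logit_pos in examples:
    prob = positive_probability(logit_neg, logit_pos)
    results.append((round(prob, 3), to_label(prob)))

print(results)
```

Equal logits give a probability of exactly 0.5, which the `>=` comparison assigns to 'Cancer'.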

### Modifying the submission file

Now I am using the predictions made by the model to create a submission file.

In [None]:
sub['label'] = predictions
sub.to_csv('submission.csv', index=False)
sub.info()

### Visualising predictions

Here I display some randomly chosen test images along with their predicted result. For a predicted probability less than 0.5, images are labelled 'Healthy'; otherwise they are labelled 'Cancer'.

In [None]:
test_images = np.random.choice(sub.id, size=50, replace=False)     

fig, ax = plt.subplots(5, 10, figsize=(20,10))

for n in range(5):
    for m in range(10):
        img_id = test_images[m + n*10]
        image = Image.open(test_path + img_id + ".tif")
        pred = sub.loc[sub['id'] == img_id, 'label'].values[0]
        label = "Cancer" if(pred >= 0.5) else "Healthy"  
        ax[n,m].imshow(image)
        ax[n,m].grid(False)
        ax[n,m].tick_params(labelbottom=False, labelleft=False)
        ax[n,m].set_title("Label: " + label)