## Introduction to Problem and Notebook

Over the last 50 years, several scoring systems have been proposed that allow pathologists to grade tumours based on their appearance. These systems suffer from inter-pathologist variability, and some of them require quantitative analysis that is too time-consuming for routine clinical use. Automation of this process through machine learning and computer vision is therefore potentially a high-impact topic of research.

From [Kaggle](https://www.kaggle.com/c/histopathologic-cancer-detection/overview): "In this competition, you must create an algorithm to identify metastatic cancer in small image patches taken from larger digital pathology scans. The data for this competition is a slightly modified version of the PatchCamelyon (PCam) benchmark dataset."

The main influences on this notebook are the [Joni Juvonen notebook](https://www.kaggle.com/qitvision/a-complete-ml-pipeline-fast-ai) using Fast AI and the [Antonio de Perio blog post](https://towardsdatascience.com/histopathological-cancer-detection-with-deep-neural-networks-3399be879671) on Towards Data Science.

Several other excellent resources were also useful for direction. I shall cite these as we go and also summarise at the end of the notebook.

At a high level this notebook is organised as follows:
* Background on how we approach the challenge and information relevant to the problem
* Exploratory data analysis
* Modelling the classification problem
* Model performance

## Notes on Approaching the Challenge

We began by gathering information. This was a useful step for confronting some of the author's initial biases, namely that a CNN coded from scratch would certainly be the only solution. In addition, researching the problem provides context, clarity, and insights into the data.

The notebook is built on the following general approach:
* Information gathering
* Split the problem into smaller parts (see the original plan at the end of the notebook)
* Execute
* Refine plan as more information and obstacles become apparent

The review paper [Deep neural network models for computational histopathology: A survey](https://arxiv.org/pdf/1912.12378.pdf) and the article [Artificial Intelligence in Pathology](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6344799/) were extremely helpful in understanding that there is a consensus on the machine learning approach to classifying PCam. In particular, they highlight patch-level CNNs and transfer learning as the preferred approach, which we also follow here:
* "Methods based on CNNs have consistently outperformed other handcrafted methods in a variety of deep learning"
* "The most popular and widely adopted technique in digital pathology is the use of transfer learning approach"

We discuss CNNs, transfer-learning, and the PCam data set in more detail later in the notebook. 

In analysing model performance we pay close attention to false negatives, i.e., cases where metastatic cancer exists but is not detected. This is an important statistic. According to [Saba 2020](https://reader.elsevier.com/reader/sd/pii/S1876034120305633?token=104840BB146B11F0239FAFE45FC879C93B6D4A7CDDED333EADDB2F5FA1BDE8BA672FD2C0C22DB5753011ED959877D131), GLOBOCAN statistics show that 18.1m new cancer cases emerged in 2018, with 9.6m cancer deaths. If we take this death rate (approx. 53%) and apply the false-negative rate of roughly 3% from the trained model in [the qitvision Fast AI notebook](https://www.kaggle.com/qitvision/a-complete-ml-pipeline-fast-ai), the missed cases could correspond to roughly 288k deaths among undetected, and therefore untreated, patients. Unfortunately, [Shapley values](https://towardsdatascience.com/making-sense-of-shapley-values-dc67a8e4c5e8) are not straightforward to apply to deep-learning models on images, but Fast AI offers visual attention maps to address interpretability.
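
The back-of-the-envelope estimate can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes the 3% false-negative rate applies uniformly to every new case and that the overall death rate carries over to the missed cases, neither of which is established by the sources.

```python
# Figures quoted in the text (GLOBOCAN 2018 via Saba 2020)
new_cases = 18.1e6
deaths = 9.6e6
# Assumption: the ~3% false-negative rate applies uniformly to all cases
false_negative_rate = 0.03

death_rate = deaths / new_cases                 # roughly 0.53
missed_cases = new_cases * false_negative_rate  # cases the model would not flag
missed_deaths = missed_cases * death_rate       # same as deaths * false_negative_rate

print(f"death rate: {death_rate:.1%}")
print(f"missed cases: {missed_cases:,.0f}")
print(f"deaths among missed cases: {missed_deaths:,.0f}")
```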

## Useful Terminology

The following definitions and terms helped provide clarity and make swifter modelling choices:
* Metastasis is the spread of cancer from the point of onset to different parts of the body. Metastases most commonly develop when cancer cells break away from the main tumor and enter the bloodstream or lymphatic system. These systems carry fluids around the body
* TIF files contain an image saved in the Tagged Image File Format (TIFF), a high-quality graphics format
* Whole slide images (WSIs) are high quality digital scans of glass slides used in pathology
* Patch-based learning splits the WSI into smaller (non-overlapping) patches which are used as training records for a machine-learning model. After the patch-level CNN is trained, another ML model is often developed for the whole image level decision
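
To make the idea of patch-based learning concrete, here is a minimal sketch of splitting an image into non-overlapping patches with NumPy. The function name `split_into_patches` and the random array standing in for a WSI are hypothetical; real WSIs are far larger and are normally read tile by tile with a dedicated library.

```python
import numpy as np

def split_into_patches(image, patch_size):
    """Split an (H, W, C) image into non-overlapping square patches.

    Rows/columns that do not fill a whole patch are dropped.
    Returns an array of shape (n_patches, patch_size, patch_size, C).
    """
    h, w, c = image.shape
    rows, cols = h // patch_size, w // patch_size
    cropped = image[:rows * patch_size, :cols * patch_size]
    tiles = cropped.reshape(rows, patch_size, cols, patch_size, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return tiles.reshape(rows * cols, patch_size, patch_size, c)

# Hypothetical stand-in for a (tiny) whole slide image
wsi = np.random.randint(0, 256, size=(300, 200, 3), dtype=np.uint8)
tiles = split_into_patches(wsi, patch_size=96)
print(tiles.shape)
```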

## Introduction to Data

PCam (PatchCamelyon) is a binary classification image dataset containing labeled low-resolution images of lymph node sections extracted from digital histopathological scans in tif format. A positive label indicates that the center 32x32px region of a patch contains at least one pixel of tumor tissue.
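
The labelling rule can be sketched as follows, assuming a pixel-level tumour mask is available for each patch. The mask and the helper `patch_label` are hypothetical illustrations; the competition data only ships the final labels.

```python
import numpy as np

def patch_label(tumor_mask, center=32):
    """Return 1 if the central `center` x `center` region contains any tumour pixel."""
    h, w = tumor_mask.shape
    top, left = (h - center) // 2, (w - center) // 2
    return int(tumor_mask[top:top + center, left:left + center].any())

mask = np.zeros((96, 96), dtype=bool)  # hypothetical pixel-level tumour mask
mask[5, 5] = True                      # tumour only in the outer border
label_border_only = patch_label(mask)
mask[40, 50] = True                    # one tumour pixel inside the centre region
label_with_centre = patch_label(mask)
print(label_border_only, label_with_centre)
```

Tumour tissue outside the central region does not influence the label, which is why the plots later in the notebook highlight that region.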

PCam is derived from the Camelyon16 Challenge, which contains 400 H&E stained WSIs of sentinel lymph node sections. The process of patch production is described in its [creator's post](http://basveeling.nl/posts/pcam/) and the associated git repository. PCam was created specifically so that machine learning researchers can approach the problem of cancer detection as easily as they would MNIST or CIFAR-10.

## Initial Data Load

We load the data directly from the Kaggle input directory. Only the competition's training data is used, as it is labelled and can therefore be used both to train our model and to assess its performance.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import os
import gc
import cv2
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import random
from fastai import *
from fastai.vision import *
from torchvision.models import resnet50
from PIL import Image
from sklearn.utils import shuffle
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from skimage.restoration import estimate_sigma

In [None]:
data = pd.read_csv('/kaggle/input/train_labels.csv')
train_path = '/kaggle/input/train'
# quick look at the label stats
print("Value counts of data labels:")
print(data['label'].value_counts())
print("\nSample data labels:")
print(data.head())

We have 131k negative and 89k positive labels (indexed by image ID), i.e. a class split of roughly 60%-40%. This does not seem imbalanced enough to warrant methods such as weighting our positive class, subsampling, or oversampling.
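
For reference, the class shares can be computed from the rounded counts quoted above. The inverse-frequency weights are shown only to indicate how mild the imbalance is; they are an illustration and are not used anywhere in this notebook.

```python
import numpy as np

# Rounded label counts quoted in the text
counts = np.array([131_000, 89_000])  # [negative, positive]
shares = counts / counts.sum()
# "Balanced" inverse-frequency weights, shown for reference only
weights = counts.sum() / (len(counts) * counts)
print(shares.round(3), weights.round(3))
```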

As it helps to visualise the data we will load 5 random samples of the positive and negative classes. We highlight the central 32x32px region where we must search for evidence of metastatic cancer.

In [None]:
# As per qitvision reverse the cv2 colours from bgr to rgb - we will then see the same images as directly from PCam;
# plus I wouldn't want this to affect a pretrained model's ability

def readImage(path):
    # OpenCV reads the image in bgr format by default
    bgr_img = cv2.imread(path)
    # We flip it to rgb for visualization purposes
    b,g,r = cv2.split(bgr_img)
    rgb_img = cv2.merge([r,g,b])
    return rgb_img

In [None]:
# random sampling
shuffled_data = shuffle(data, random_state = 27)

In [None]:
fig, ax = plt.subplots(2,5, figsize=(20,8))
fig.suptitle('Histopathologic scans of lymph node sections',fontsize=20)
# Negatives
for i, idx in enumerate(shuffled_data[shuffled_data['label'] == 0]['id'][:5]):
    path = os.path.join(train_path, idx)
    ax[0,i].imshow(readImage(path + '.tif'))
    # Create a Rectangle patch
    box = patches.Rectangle((32,32),32,32,linewidth=4,edgecolor='b',facecolor='none', linestyle=':', capstyle='round')
    ax[0,i].add_patch(box)
ax[0,0].set_ylabel('Negative samples', size='large')
# Positives
for i, idx in enumerate(shuffled_data[shuffled_data['label'] == 1]['id'][:5]):
    path = os.path.join(train_path, idx)
    ax[1,i].imshow(readImage(path + '.tif'))
    # Create a Rectangle patch
    box = patches.Rectangle((32,32),32,32,linewidth=4,edgecolor='r',facecolor='none', linestyle=':', capstyle='round')
    ax[1,i].add_patch(box)
ax[1,0].set_ylabel('Tumor tissue samples', size='large')

## Exploratory Data Analysis

We will now begin to examine the data quality and the high-level properties of the data at the class level.

The following code:
* Records all images which are dark (largest pixel value < 10)
* Records all images which are light (lowest pixel value > 245)
* Records a robust wavelet-based estimator of the (Gaussian) noise standard deviation
* Checks if any images are loaded with the wrong data type (anything other than np.uint8)

Due to time constraints we only load 10k images. Testing has shown that the conclusions/insights scale to more data.

In [None]:
# As we count the statistics, we can check if there are any completely black or white images
dark_th = 10      # If no pixel reaches this threshold, image is considered too dark
bright_th = 245   # If no pixel is under this threshold, image is considered too bright
too_dark_idx = []
too_bright_idx = []
bad_dtypes = []
noise_array_pos = np.array([])
noise_array_neg = np.array([])
means_pos = np.array([])
means_neg = np.array([])
stds_pos = np.array([])
stds_neg = np.array([])

N = 10000 # max length of positive and negative samples
positive_samples = []
negative_samples = []

iterable = shuffled_data["id"].head(10000)
#for i, idx in tqdm(enumerate(iterable), 'computing statistics...(220025 it total)'):
for i, idx in enumerate(iterable): 
    # What is the label
    label = shuffled_data.loc[shuffled_data["id"] == idx, "label"].values[0]
    
    # Read image
    path = os.path.join(train_path, idx)
    img = readImage(path + '.tif')
    imagearray = img.reshape(-1,3)
    
    # Check for anomalous data types (compare by value, not identity)
    if img.dtype != np.dtype('uint8'):
        bad_dtypes.append(idx)

    # Is this too dark? Checked first so that such images are skipped below
    if imagearray.max() < dark_th:
        too_dark_idx.append(idx)
        continue # do not include in statistics

    # Is this too bright?
    if imagearray.min() > bright_th:
        too_bright_idx.append(idx)
        continue # do not include in statistics

    # Build separate positive and negative samples for examination later
    if label == 1 and len(positive_samples) < N:
        positive_samples.append(img)
        # Record "noise" of image
        noise_array_pos = np.append(noise_array_pos, estimate_sigma(img, multichannel=True, average_sigmas=True))
        # Record mean and std of each image
        means_pos = np.append(means_pos, np.mean(imagearray))
        stds_pos = np.append(stds_pos, np.std(imagearray))
    if label == 0 and len(negative_samples) < N:
        negative_samples.append(img)
        # Record "noise" of image
        noise_array_neg = np.append(noise_array_neg, estimate_sigma(img, multichannel=True, average_sigmas=True))
        # Record mean and std of each image
        means_neg = np.append(means_neg, np.mean(imagearray))
        stds_neg = np.append(stds_neg, np.std(imagearray))

In [None]:
positive_samples = np.array(positive_samples)
negative_samples = np.array(negative_samples)
gc.collect()

In [None]:
print("# of samples which are too dark: " + str(len(too_dark_idx)))
print("# of samples which are too light: " + str(len(too_bright_idx)))
print("# of samples which have the wrong dtype: " + str(len(bad_dtypes)))

We can see that every sample has the expected uint8 dtype. In addition, only a small minority of samples are either very light or very dark. As per [qitvision](https://www.kaggle.com/qitvision/a-complete-ml-pipeline-fast-ai) we do not expect these anomalies to affect results.

The following code plots the distribution of RGB values for the positive and negative classes.

In [None]:
nr_of_bins = 256 #each possible pixel value will get a bin in the following histograms
fig,axs = plt.subplots(4,2,sharey=True,figsize=(8,8),dpi=150)

#RGB channels
axs[0,0].hist(positive_samples[:,:,:,0].flatten(),bins=nr_of_bins,density=True)
axs[0,1].hist(negative_samples[:,:,:,0].flatten(),bins=nr_of_bins,density=True)
axs[1,0].hist(positive_samples[:,:,:,1].flatten(),bins=nr_of_bins,density=True)
axs[1,1].hist(negative_samples[:,:,:,1].flatten(),bins=nr_of_bins,density=True)
axs[2,0].hist(positive_samples[:,:,:,2].flatten(),bins=nr_of_bins,density=True)
axs[2,1].hist(negative_samples[:,:,:,2].flatten(),bins=nr_of_bins,density=True)

#All channels
axs[3,0].hist(positive_samples.flatten(),bins=nr_of_bins,density=True)
axs[3,1].hist(negative_samples.flatten(),bins=nr_of_bins,density=True)

#Set image labels
axs[0,0].set_title("Positive samples (N =" + str(positive_samples.shape[0]) + ")");
axs[0,1].set_title("Negative samples (N =" + str(negative_samples.shape[0]) + ")");
axs[0,1].set_ylabel("Red",rotation='horizontal',labelpad=35,fontsize=12)
axs[1,1].set_ylabel("Green",rotation='horizontal',labelpad=35,fontsize=12)
axs[2,1].set_ylabel("Blue",rotation='horizontal',labelpad=35,fontsize=12)
axs[3,1].set_ylabel("RGB",rotation='horizontal',labelpad=35,fontsize=12)
for i in range(4):
    axs[i,0].set_ylabel("Relative frequency")
axs[3,0].set_xlabel("Pixel value")
axs[3,1].set_xlabel("Pixel value")
fig.tight_layout()

nr_of_bins = 64 #we use a bit fewer bins to get a smoother image
fig,axs = plt.subplots(1,2,sharey=True, sharex = True, figsize=(8,2),dpi=150)
axs[0].hist(np.mean(positive_samples,axis=(1,2,3)),bins=nr_of_bins,density=True);
axs[1].hist(np.mean(negative_samples,axis=(1,2,3)),bins=nr_of_bins,density=True);
axs[0].set_title("Mean brightness, positive samples");
axs[1].set_title("Mean brightness, negative samples");
axs[0].set_xlabel("Image mean brightness")
axs[1].set_xlabel("Image mean brightness")
axs[0].set_ylabel("Relative frequency")
axs[1].set_ylabel("Relative frequency");

In agreement with [gomezp](https://www.kaggle.com/gomezp/complete-beginner-s-guide-eda-keras-lb-0-93) (although here we convert OpenCV's BGR channel order to RGB) we can see that:
* Negative samples are bimodal in terms of brightness (one bright peak and one dark peak)
* Negative samples are skewed towards darker values across the RGB channels in general
* Positive samples have a lighter G channel than R and B

Next we plot the "average" image of each class, where the positive average should appear less green and brighter in general.

In [None]:
# Plot average images
fig,axs = plt.subplots(1,2,sharey=True, sharex = True, figsize=(8,2),dpi=150)
# np.mean returns float64; cast back to uint8 so the result displays as a valid RGB image
axs[0].imshow(np.mean(positive_samples, axis=0).astype(np.uint8))
axs[1].imshow(np.mean(negative_samples, axis=0).astype(np.uint8))
axs[0].set_title("Average image, positive samples");
axs[1].set_title("Average image, negative samples");

We also examine the correlation between different features. Initially we have 96x96x3 = 27,648 features per image, which is too many to process quickly. We therefore settle for the correlations between pixels in the central 32x32 region of each patch, using the mean of each pixel's RGB values.

In [None]:
fig, axs = plt.subplots(1,2,sharey=True, sharex = True, figsize=(8,2),dpi=150)
axs[0].set_title("Correlation of pixels, positives");
axs[1].set_title("Correlation of pixels, negatives");
temp = np.mean(positive_samples[:,32:64,32:64,:], axis = 3).flatten().reshape(positive_samples.shape[0], 32*32)
temp = pd.DataFrame(data = temp, columns = [str(i) for i in range(temp.shape[1])])
sns.heatmap(temp.corr(), ax = axs[0])
temp = np.mean(negative_samples[:,32:64,32:64,:], axis = 3).flatten().reshape(negative_samples.shape[0], 32*32)
temp = pd.DataFrame(data = temp, columns = [str(i) for i in range(temp.shape[1])])
sns.heatmap(temp.corr(), ax = axs[1])
fig.show()

Here positive and negative samples look very similar to the eye.

Next, we show the scatter plots of means vs standard deviations of pixel values (across RGB).

In [None]:
fig, axs = plt.subplots(1,2,sharey=True, sharex = True, figsize=(8,2),dpi=150)
axs[0].set_title("Mean vs Std, positives");
axs[1].set_title("Mean vs Std, negatives");
axs[0].scatter(means_pos, stds_pos)
axs[1].scatter(means_neg, stds_neg)

There is a lot of overlap in the distributions. Positives seem to come from a smaller sample, which is natural given that we have far fewer positive observations. Negative samples seem to have an order of magnitude more outliers. One interesting non-overlapping section is the low/mid-mean, mid-std region that only positive samples seem to occupy.

In [None]:
# Free memory
temp = None
axs = None
gc.collect()

### A Note on Noise

We plot the distribution of the Gaussian noise estimate computed via skimage.restoration.estimate_sigma. We see a bimodal distribution for noise with the main peak close to zero. The estimate is a noise standard deviation, so higher values mean more noise.
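
To build intuition for what a wavelet-based noise estimate measures, here is a simplified sketch. It uses the finest-scale diagonal Haar detail coefficients and the median absolute deviation; this is an assumption made for illustration, and `haar_noise_sigma` is a hypothetical helper, not the exact procedure inside skimage.restoration.estimate_sigma.

```python
import numpy as np

def haar_noise_sigma(gray):
    """Estimate the Gaussian noise std from the finest diagonal Haar detail coefficients."""
    g = np.asarray(gray, dtype=np.float64)
    g = g[:g.shape[0] // 2 * 2, :g.shape[1] // 2 * 2]  # crop to even dimensions
    a, b = g[0::2, 0::2], g[0::2, 1::2]
    c, d = g[1::2, 0::2], g[1::2, 1::2]
    hh = (a - b - c + d) / 2.0  # orthonormal diagonal detail, removes smooth content
    # For Gaussian noise, median(|x|) = 0.6745 * sigma
    return np.median(np.abs(hh)) / 0.6745

rng = np.random.default_rng(0)
clean = np.full((256, 256), 128.0)  # synthetic flat image
estimates = {}
for true_sigma in (1.0, 5.0, 20.0):
    noisy = clean + rng.normal(0.0, true_sigma, clean.shape)
    estimates[true_sigma] = haar_noise_sigma(noisy)
    print(true_sigma, round(estimates[true_sigma], 2))
```

On synthetic data like this the estimate tracks the true noise level, so a larger value indicates a noisier image.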

In [None]:
nr_of_bins = 100 # number of histogram bins
fig,axs = plt.subplots(1,2,sharey=True, sharex = True, figsize=(8,2),dpi=150)
axs[0].hist(noise_array_pos.flatten(),bins=nr_of_bins,density=True);
axs[1].hist(noise_array_neg.flatten(),bins=nr_of_bins,density=True);
axs[0].set_title("Noise estimation of positives");
axs[1].set_title("Noise estimation of negatives");
for i in [0, 1]:
    axs[i].set_xlabel("Noise estimation")
    axs[i].set_ylabel("Relative frequency")

The negative class in particular shows a bimodal noise distribution with the main peak close to zero. Images in the higher-noise mode may warrant a noise-reduction method such as [wavelet denoising](https://scikit-image.org/docs/dev/auto_examples/filters/plot_denoise_wavelet.html#sphx-glr-auto-examples-filters-plot-denoise-wavelet-py).

We can also study a principal component analysis (PCA) of the images. With the first 2000 principal components we found that around 90% of the total variance in pixel observations (RGB across 96x96 pixels) is explained for both positive and negative samples; the code below fits only the first 700. A higher variance seems to be explained by fewer principal components for negative samples in general, which ties in with the idea that negative samples are more "noisy".

In [None]:
pca = PCA(700)
fig, axs = plt.subplots(1,2,sharey=True, sharex = True, figsize=(8,2),dpi=150)
axs[0].set_title("PCA variance, positive samples");
axs[0].set_xlabel("number of components")
axs[0].set_ylabel("cumulative explained variance")
axs[1].set_title("PCA variance, negative samples");
axs[1].set_xlabel("number of components")
axs[1].set_ylabel("cumulative explained variance")
pca.fit(positive_samples.flatten().reshape(positive_samples.shape[0], 96*96*3))
axs[0].plot(np.cumsum(pca.explained_variance_ratio_))
pca.fit(negative_samples.flatten().reshape(negative_samples.shape[0], 96*96*3))
axs[1].plot(np.cumsum(pca.explained_variance_ratio_))
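
As a complement to the cumulative-variance plots, the number of components needed to reach a given share of variance can be read off directly. The sketch below uses a plain SVD on synthetic data rather than the notebook's images; `components_for_variance` and the toy data are hypothetical.

```python
import numpy as np

def components_for_variance(X, target=0.90):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `target`."""
    Xc = X - X.mean(axis=0)                  # PCA works on centred data
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(ratio, target) + 1)

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 50 features driven by 5 latent factors plus small noise
latent = rng.normal(size=(200, 5))
X = latent @ rng.normal(size=(5, 50)) + 0.01 * rng.normal(size=(200, 50))
n_components = components_for_variance(X, 0.90)
print(n_components)
```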

In [None]:
# Free memory
pca = None
positive_samples, negative_samples = None, None
gc.collect()

## Data Processing

For image-data processing we use fastai's ImageDataBunch (built on PyTorch) directly, as in [Antonio de Perio's notebook](https://github.com/humanunsupervised/humanunsupervised.com/blob/master/pcam/pcam-cancer-detection.ipynb), in preference to alternatives such as ImageDataGenerator in Keras.

Scaling, renormalisation, centering, and even data augmentation and batching are handled for us by ImageDataBunch, with the augmentations supplied by the function get_transforms. Normalisation is done using imagenet_stats.
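
Normalising with ImageNet statistics amounts to a per-channel standardisation. The following NumPy sketch shows the idea on a channel-last array. fastai itself works on channel-first tensors, so this illustrates the arithmetic and is not the library's implementation; `normalize_imagenet` is a hypothetical helper.

```python
import numpy as np

# ImageNet channel statistics (RGB, for images scaled to [0, 1])
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize_imagenet(img_uint8):
    """Scale an (H, W, 3) uint8 RGB image to [0, 1], then standardise each channel."""
    x = img_uint8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD

white_patch = np.full((96, 96, 3), 255, dtype=np.uint8)  # hypothetical all-white patch
normalized = normalize_imagenet(white_patch)
print(normalized[0, 0])
```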

In [None]:
tfms = get_transforms(do_flip=True)

bs=64 # also the default batch size
data = ImageDataBunch.from_csv(
    '/kaggle/input/', 
    ds_tfms=tfms, 
    size=224, 
    suffix=".tif",
    folder="train", 
    test="test",
    csv_labels="train_labels.csv", 
    bs=bs)

data.normalize(imagenet_stats)

## Model Choice

Results from previous sections indicate that (i) there are some distributional differences between positive and negative classes, (ii) many features are required to explain differences, (iii) noise (and therefore overfitting) may be an issue. 

We will work with resnet50 due to the successful implementation [here](https://github.com/humanunsupervised/humanunsupervised.com/blob/master/pcam/pcam-cancer-detection.ipynb) and the glowing reviews [here](https://arxiv.org/pdf/1912.12378.pdf) and [here](https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035).

Resnet50 is a 50-layer deep convolutional neural network (CNN). It can handle large amounts of data and features. In addition, the data augmentation provided through ImageDataBunch, together with regularisation, helps us to combat overfitting.

### Quick Notes on CNNs

CNNs generally start with several layers of data processing in order to reduce noise and increase the visibility of important features, which are then easier for neural networks to learn. The important components of a CNN can be summarised as follows:
* Convolutional Layer; This layer applies filters to an image to extract low- and high-level features
* Stride; The number of pixels by which the filter shifts across the image at each step
* Padding; Pixels added around the border of the input to manage the size difference between the input and output of a convolution
* Activation Layer; Introduces non-linearity through an activation function
* Pooling Layer; Reduces the spatial size of the representation and hence the number of parameters within the neural network. Generally creates smaller outputs that still contain enough information
* Flattening Layer; Reshapes the data to 1D for the Fully-Connected Layer
* Fully-Connected Layer; The dense NN that learns from the extracted features and makes predictions
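
The components listed above can be traced through a toy forward pass. This is a deliberately naive single-channel NumPy sketch, written for illustration only; the helper names are hypothetical and real frameworks use heavily optimised multi-channel implementations.

```python
import numpy as np

def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution (or pooling) layer."""
    return (n + 2 * padding - kernel) // stride + 1

def conv2d(x, k, stride=1, padding=0):
    """Naive single-channel 2D convolution (cross-correlation, as in most DL libraries)."""
    xp = np.pad(x, padding)  # zero padding on every side
    kh, kw = k.shape
    oh = (xp.shape[0] - kh) // stride + 1
    ow = (xp.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = xp[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(window * k)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

x = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 "image"
k = np.array([[1., 0., -1.]] * 3)             # vertical-edge filter
feat = conv2d(x, k, stride=1, padding=1)      # padding=1 keeps the 6x6 size
act = np.maximum(feat, 0)                     # ReLU activation
pooled = max_pool(act, 2)                     # 2x2 pooling halves each dimension
flat = pooled.ravel()                         # 1D input for a fully-connected layer
print(feat.shape, pooled.shape, flat.shape)
```

The helper `conv_output_size` gives the same sizes without running the convolution, e.g. a 96-pixel input with a 3x3 kernel, stride 2 and padding 1 yields 48 pixels.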

### Training
We will train via transfer learning. We will specifically use the [one cycle policy](https://arxiv.org/abs/1803.09820). As per [qitvision](https://www.kaggle.com/qitvision/a-complete-ml-pipeline-fast-ai), the fastai library has implemented a training function for the one cycle policy that we can use with only a few lines of code.

The method relies on finding the optimal `learning rate` and `weight decay` (L2 penalty) values. The optimal learning rate (lr) lies just before the minimum of the loss curve and before the loss starts to diverge. It is important that the loss is still descending at the learning rate we select.

In the following code we use the qitvision optimal learning rates and weight decay. This is because of the time cost of an optimal search and the assumption that the data set and the modelling framework are equivalent.

For the reader's information we plot the evolution of the learning rate and the evolution of the loss function. The results show that the lr increases and then decreases. In addition the loss functions of the training and validation sets seem to decrease smoothly and do not diverge.
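
The rise-then-fall shape of the learning rate can be sketched without any deep-learning library. The defaults below (`pct_start=0.3`, `div_factor=25`, cosine annealing, and a very small final rate) are assumptions intended to mirror fastai v1's `fit_one_cycle`; check them against the installed version before relying on them.

```python
import numpy as np

def cos_anneal(start, end, pct):
    """Cosine interpolation from `start` (pct=0) to `end` (pct=1)."""
    return end + (start - end) / 2.0 * (np.cos(np.pi * pct) + 1.0)

def one_cycle_lrs(n_iter, max_lr, pct_start=0.3, div_factor=25.0, final_div=None):
    """Learning rate per iteration: warm up to max_lr, then anneal down.

    The default values are assumptions, not taken from the notebook.
    """
    final_div = div_factor * 1e4 if final_div is None else final_div
    n_up = int(n_iter * pct_start)
    n_down = n_iter - n_up
    up = [cos_anneal(max_lr / div_factor, max_lr, i / n_up) for i in range(n_up)]
    down = [cos_anneal(max_lr, max_lr / final_div, i / max(n_down - 1, 1))
            for i in range(n_down)]
    return np.array(up + down)

lrs = one_cycle_lrs(n_iter=1000, max_lr=2e-2)
print(lrs[0], lrs.max(), lrs[-1])
```

The schedule starts at a fraction of `max_lr`, peaks after the warm-up phase, and ends far below its starting value.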

In [None]:
def getLearner():
    return create_cnn(data, models.resnet50, pretrained=True, path='.', metrics=error_rate, ps=0.5, callback_fns=ShowGraph)

learner = getLearner()

In [None]:
max_lr = 2e-2
wd = 1e-4

# 1cycle policy
learner.fit_one_cycle(cyc_len=5, max_lr=max_lr, wd=wd)

In [None]:
preds, y, loss = learner.get_preds(with_loss=True)
# get accuracy
acc = accuracy(preds, y)
print('The accuracy is {0} %.'.format(acc*100))
# plot learning rate of the one cycle
learner.recorder.plot_lr()

In [None]:
learner.save('./pcam_resnet50_frozen')

We see that the final accuracy is impressive. However, this uses resnet50 weights pretrained on another data set, with the pretrained layers still frozen, and we may achieve better performance by retraining them.

## Finetuning

Finetuning involves unfreezing the pretrained model, lowering the learning rate, and continuing training. This has proven to improve results in computer vision. 

The following code executes this. We can also see the loss function evolving over training. There is a divergence of training and validation performance indicating some overfitting to the training set.
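
The fine-tuning call passes a slice of learning rates instead of a single value. As a rough sketch, and assuming that fastai spreads the slice geometrically across its layer groups (earliest layers get the smallest rate), the per-group rates would look as follows. The number of layer groups (three) is also an assumption.

```python
import numpy as np

def spread_lrs(lr_start, lr_stop, n_groups):
    """Geometrically spaced learning rates from lr_start (earliest layers) to lr_stop (head)."""
    return np.geomspace(lr_start, lr_stop, n_groups)

# Assumption: three layer groups, with the new head as the last group
group_lrs = spread_lrs(4e-5, 4e-4, 3)
print(group_lrs)
```

The intuition is that early layers hold generic features and should change slowly, while later layers are adapted more aggressively.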

In [None]:
# load the baseline model
learner.load('./pcam_resnet50_frozen')

# unfreeze
learner.unfreeze()

# Fit new model with lower learning rates
learner.fit_one_cycle(cyc_len=5, max_lr=slice(4e-5,4e-4))
learner.recorder.plot_losses()

In [None]:
learner.save('./pcam_resnet50_finetuned')

## Model Performance

The following code gives us the final performance of the model in terms of accuracy, the confusion matrix, the area under curve, and the precision/recall curve. In addition, we study examples of false negatives using fastai's ClassificationInterpretation. Here areas of the images important to classification are highlighted as a heatmap.
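
Since false negatives are our main concern, it helps to see how the headline metrics follow from the four cells of the confusion matrix. The labels below are made-up toy values and `binary_metrics` is a hypothetical helper, not part of fastai.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and false-negative rate for binary labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_negative_rate": fn / (tp + fn),
    }

# Toy labels for illustration only
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_metrics)
```

Note that the false-negative rate is simply one minus the recall, so improving recall directly reduces missed cancers.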

In [None]:
preds,y, loss = learner.get_preds(with_loss=True)
# get accuracy
acc = accuracy(preds, y)
print('The accuracy is {0:.2f} %.'.format(float(acc) * 100))

In [None]:
# Confusion matrix
interp = ClassificationInterpretation.from_learner(learner)
interp.plot_confusion_matrix(title='Confusion matrix')

In [None]:
from sklearn.metrics import roc_curve, auc
# Positive-class scores; the ROC curve depends only on the ranking of the scores,
# so no exp() transform is needed
probs = np.array(preds[:, 1])
# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y, probs, pos_label=1)

# Compute ROC area
roc_auc = auc(fpr, tpr)
print('ROC area is {0}'.format(roc_auc))

plt.figure()
plt.plot(fpr, tpr, color='darkorange', label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.xlim([-0.01, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")

In [None]:
losses,idxs = interp.top_losses()
interp.plot_top_losses(16, figsize=(16,16))

In [None]:
def pr_curve(preds, y):
    """
    Function to create precision recall curve
    
    Inputs
    ------
    preds - {torch.Tensor} [prob of negative class, prob of positive class] for each observation
    y - {torch.Tensor or np.array} actual class value
    
    Outputs
    -------
    recall - {np.array} recall for different probability thresholds
    precision - {np.array} precision for different probability thresholds
    """
    positive_prob = np.asarray(preds)[:, 1]
    actual = np.asarray(y)
    
    precision = []
    recall = []
    # Thresholds 0.0, 0.1, ..., 0.9
    for threshold in range(0, 100, 10):
        threshold /= 100.
        predicted = positive_prob > threshold
        # Vectorised counts of the confusion-matrix cells we need
        tp = np.sum(predicted & (actual == 1))
        fp = np.sum(predicted & (actual == 0))
        fn = np.sum(~predicted & (actual == 1))
        
        precision.append(tp / float(tp + fp))
        recall.append(tp / float(tp + fn))
    
    return np.array(recall), np.array(precision)

In [None]:
recall, precision = pr_curve(preds, y)

fig,axs = plt.subplots(1,1,sharey=True,figsize=(3,3),dpi=150)
axs.set_title("P/R Curve for Positive Class");
axs.set_xlabel("Recall")
axs.set_ylabel("Precision")

# Precision against recall
axs.plot(recall, precision)

## Interpretation
* There may be overfitting after fine tuning
* Fine tuning does improve results on the validation set
* The AUC and precision/recall results look positive but there is room for improvement
* False negatives with features highlighted show us that whitespace (potentially the brightness) may be leading to misclassification

We should most likely submit the transfer-learned model before fine tuning.

## Potential Next Steps
* We still do not know the model performance on unseen (test) data
* Noise reduction and more regularisation may help with overfitting
* More methodical tuning over epochs and hyper-parameters would likely improve performance
* More features such as chemistry and phenotypic information could be used to help with the classification
* Due to noise we may be susceptible to adversarial examples

## Unforeseen Obstacles

Some challenges that I had to overcome in creating this notebook:
* Enabling GPU for kaggle
* Enabling an environment with fastai v1.0
* Being unable to save data in the `/kaggle/input/` directory

For reference, here is the original plan for the notebook:
* Load data
* Plot samples and highlight 32x32 centre
* Look for nulls and duplicates
* Examine data balance --> over/under sampling is not needed
* Examine RGB distribution
* Plot mean vs sd for RGB --> look for outliers and anomalies
* Correlation of pixels
* PCA and number of important features --> may be needed for feature selection
* PCA for positives vs negatives
* Estimate noise --> may need adversarial examples if noise is higher (1/255 precision of images)
* Normalise data
* Create crops and augmented images
* Choose model - pretrained NN - ResNet50
* Image clustering on ResNet50 features
* Shuffle data + split into training/testing
* Transfer learning --> use callback to save the best weights
* Regularisation
* Fine tuning
* Examine loss function for overfitting
* Examine precision/recall matrix
* Investigate P/R curve
* Examine permutation importance of false negatives --> how are these different from true negatives
* Summarise references

## References

### Blogs
1. https://towardsdatascience.com/histopathological-cancer-detection-with-deep-neural-networks-3399be879671
    - Jupyter notebook uses fast.ai to augment PCam and transfer-learn/fine-tune a Resnet50 model. Further regularisation techniques are used.
2. https://towardsdatascience.com/metastasis-detection-using-cnns-transfer-learning-and-data-augmentation-684761347b59
    - Transfer learns with NasNet on augmented PCam. Predicts on augmented images and averages output.
3. https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035
    - The core idea of ResNet is introducing a so-called “identity shortcut connection” that skips one or more layers.
    - Introduces several papers that propose interesting variants of ResNet or provide insightful interpretation of it.
    - It is quite clear why removing a couple of layers in a ResNet architecture doesn’t compromise its performance too much: the architecture has many independent effective paths and the majority of them remain intact after we remove a couple of layers.
4. http://basveeling.nl/posts/pcam/
    - Datasets such as Camelyon require developing intricate dataloaders that take care of balancing different types of tissue and can efficiently work with the terabytes of data available.
    - Packs the clinically-relevant task of metastasis detection into a straight-forward binary image classification task, akin to CIFAR-10 and MNIST.
5. https://medium.com/bbm406f19/week-1-histopathological-cancer-detection-30649d8dd847
    - 4 week project to classify PCam images using Vgg16. References Antonio de Perio.
6. https://towardsdatascience.com/feature-importance-with-neural-network-346eb6205743
    - Permutation importance is calculated after a model has been fitted
    - If I randomly shuffle a single feature in the data, leaving the target and all others in place, how would that affect the final prediction performances?
7. https://geertlitjens.nl/post/getting-started-with-camelyon/
    - Test set accuracy is 0.8563 with Vgg16

### Kaggle
1. https://www.kaggle.com/qitvision/a-complete-ml-pipeline-fast-ai
2. https://www.kaggle.com/gomezp/complete-beginner-s-guide-eda-keras-lb-0-93
    - Good comparison of positive and negative samples at EDA stage
    - Builds and trains CNN ROC fast model
3. https://www.kaggle.com/artgor/simple-eda-and-model-in-pytorch
    - Defines their own 3-level CNN
    - Then goes to ResNet
4. https://www.kaggle.com/fmarazzi/baseline-keras-cnn-roc-fast-10min-0-925-lb
    - Loads data in batches to avoid memory error
    - Augments images with ImageDataGenerator
    - Defines own 3-level CNN
    - Does normalisation
    - Predicts on testing set directly (not on augmented images + averaging)
5. https://www.kaggle.com/cc786537662/initial-eda-image-processing
    - Interesting EDA for image processing

### Papers
1. https://reader.elsevier.com/reader/sd/pii/S1876034120305633?token=ABA60CEBDF6831129168852D8363149C13690BB0BDBAFC23E032ED6FE278A1836BA034EEA21BE7726614DB65F2AC5EF1
    - Systematic review of current techniques in diagnosis and cure of several cancers affecting human body badly. Not including PCam
    - According to GLOBOCAN statistics, about 18.1 million new cancer cases emerged in 2018 that caused 9.6 million cancer deaths
    - Contains a brief description and comparison of current reported techniques, results, datasets for cancer
2. https://arxiv.org/pdf/1912.12378.pdf
    - Survey of over 130 papers.
3. https://arxiv.org/pdf/1912.03609v3.pdf
    - Investigate to what extent alternative variants of Artificial Neural Networks (ANNs) are susceptible to adversarial attacks
    - On PCam
    - Stochastic ANNs are more robust but this is more prevalent in MNIST
4. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6344799/
    - The history of artificial intelligence in the medical domain, recent advances in artificial intelligence applied to pathology, and future prospects
    - Normalization is one of the techniques used to reduce digital imaging
5. https://arxiv.org/abs/1806.03962
    - Propose a new model for digital pathology segmentation, based on the observation that histopathology images are inherently symmetric under rotation and reflection
6. https://arxiv.org/pdf/1905.03151.pdf
    - This paper advocates against permute-and-predict methods for interpreting black box functions
    - Forces the original model to extrapolate to regions where there is little to no data
    - Recommends explicitly removing features, conditional permutations, or model distillation methods
7. https://www.nature.com/articles/nmeth.4397.pdf
    - Image-based cell profiling is a high-throughput strategy for the quantification of phenotypic differences among a variety of cell populations
    - Interesting EDA

### Other references
1. https://datascience.stackexchange.com/questions/29223/exploratory-data-analysis-with-image-datset
    - Standard practices for EDA with image data
2. https://keras.io/guides/transfer_learning/
    - Transfer learning and fine tuning in Keras. Explanation of trainable = False.
3. https://stackoverflow.com/questions/2440504/noise-estimation-noise-measurement-in-image
    - Scikit Image has a noise estimate sigma function which works with colour images