# Introduction

### Histopathologic Cancer Detection

In this competition, we had to build and train a Convolutional Neural Network model that accurately identifies metastatic cancer in small image patches taken from larger digital pathology scans.

The data for this competition is a slightly modified version of the PatchCamelyon (PCam) benchmark dataset. In this dataset, we are provided with a large number of small pathology images to classify. Files are named with an image id. The train_labels.csv file provides the ground truth for the images in the train folder. We are predicting the labels for the images in the test folder. A positive label indicates that the center 32x32px region of a patch contains at least one pixel of tumor tissue. Tumor tissue in the outer region of the patch does not influence the label. This outer region is provided to enable fully-convolutional models that do not use zero-padding, to ensure consistent behavior when applied to a whole-slide image.
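
To make the labeling rule concrete, the following is a minimal sketch of how the central 32x32 region of a 96x96 patch could be extracted with NumPy. The helper name `center_crop` and the synthetic all-zero patch are assumptions for illustration; they are not part of the competition data or of this notebook's pipeline.

```python
import numpy as np

def center_crop(patch, size=32):
    """Return the central size x size region of an H x W x C image array."""
    h, w = patch.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return patch[top:top + size, left:left + size]

# Synthetic 96x96 RGB patch standing in for one .tif image (illustrative only)
patch = np.zeros((96, 96, 3), dtype=np.uint8)
patch[32, 32] = 255  # mark the top-left pixel of the central region

center = center_crop(patch)
print(center.shape)  # (32, 32, 3)
```

Only tumor pixels inside this central window affect the label; the surrounding 32-pixel border provides context for the convolutions.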

# Import namespaces

In [None]:
# Import NumPy for array operations, pandas for dataframes, and matplotlib/seaborn for plotting.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns

# SciKitLearn for Confusion Matrix and other statistics
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score

# TensorFlow and Keras to build the CNN model architecture
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# pickle for loading the saved training history; os and itertools as general utilities
import pickle
import os
import itertools

# Parameters
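`SAMPLE_SIZE` is passed to `train_test_split` as `test_size`, so it is the fraction of each dataset that is set aside. With a value of 0.8, only the remaining 20% of the validation and test datasets is used, which lets the notebook run faster.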

In [None]:
SAMPLE_SIZE = 0.8
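
As a quick check of what this parameter does, here is a small sketch on a toy dataframe. The toy data (100 rows with a 60/40 label split) is made up for illustration; the call mirrors how `train_test_split` is used later in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

SAMPLE_SIZE = 0.8  # passed as test_size, i.e. the share that is set aside

# Toy frame with a 60/40 label split (illustrative data only)
toy = pd.DataFrame({
    "id": [str(i) for i in range(100)],
    "label": ["0"] * 60 + ["1"] * 40,
})

# The first returned frame is the part we keep, the second is discarded
kept, ignored = train_test_split(
    toy, test_size=SAMPLE_SIZE, random_state=1, stratify=toy.label
)

print(len(kept), len(ignored))                   # 20 80
print(kept.label.value_counts(normalize=True))   # class ratio is preserved
```

Because of `stratify`, the retained 20% keeps the same class ratio as the full frame.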

In [None]:
BATCH_SIZE = 64
IMG_SIZE = 96
RANDOM_SEED = 1

In [None]:
DATASET = "../input/histopathologic-cancer-detection/"
train_path = DATASET+"train"
test_path = DATASET+"test"

# Validation dataset
The validation dataset is drawn from train_labels.csv, so it has the true label column that the test dataset lacks. We will use it to evaluate the final model.
We will compute the accuracy, confusion matrix, classification report, etc.

In [None]:
valid = pd.read_csv(DATASET+'train_labels.csv', dtype=str)
valid['path'] = valid.id+'.tif'
valid.head()

In [None]:
print('Validation Dataset Size:', valid.shape)

Reduce the validation dataset to run the notebook faster.

In [None]:
# Reduce the Validation Dataset size to run the predictions faster
valid, ignore = train_test_split(valid, test_size=SAMPLE_SIZE, random_state=RANDOM_SEED, stratify=valid.label)
print('Validation Dataset Size:', valid.shape)

### Distribution
In the validation image dataset, close to 60% are benign and 40% are malignant.

In [None]:
categories = ["Benign", "Malignant"]
plt.figure(figsize=(10,10));
plt.pie(valid.label.value_counts(), labels=categories, startangle=55, 
        autopct='%1.2f%%', colors=sns.color_palette('flare')[0:2], shadow=True);
plt.axis('off')
plt.show();

# Test Dataset
We will use the test dataset to find out the probability distribution of the predictions.

In [None]:
test = pd.read_csv(DATASET+'sample_submission.csv', dtype=str)
test['path'] = test.id+'.tif'
test.head()

In [None]:
print('Test Dataset Size:', test.shape)

Reduce the test dataset to run the notebook faster.

In [None]:
# Reduce the test Dataset size to run the predictions faster
test, ignore = train_test_split(test, test_size=SAMPLE_SIZE, random_state=RANDOM_SEED, stratify=test.label)
print('Test Dataset Size:', test.shape)

# Final Model (VGG16)
We have built many models over the past six weeks. We selected VGG16 as the final model for submission because it achieved the highest private score (0.9376) together with a public score of 0.9449. Although the PyTorch model achieved a higher public score (0.9642), it does not match VGG16 on the private score (0.9263). The weighted-average ensemble also has a better public score (0.9537), but the time it takes to train several models and then ensemble them is large compared to the score gain.

| Model Architecture  	| Image Size  | Epochs  	| Private Score  	| Public Score   | Notebook  | 
|---	|---	|---	|---	|---	|---	|
| Simple Model  | 96x96  	| 30  	| [0.8288](https://www.kaggle.com/leopoldtchomgwi/lt-cancer-detection-v01-submission-revised)  	| 0.8669  	|[[LT] Cancer Detection Simple Model](https://www.kaggle.com/leopoldtchomgwi/lt-cancerdetection-v01-with-balanced-target-dist)   	|
| VGG16  	| 96x96  | 60  	| [0.9376](https://www.kaggle.com/lokanathpatro/lp-cancer-detection-models-submission?scriptVersionId=81777457)  	| 0.9449  	| [[LP] Cancer Detection VGG16 Model](https://www.kaggle.com/lokanathpatro/lp-cancer-detection-vgg16-model)  |
| ResNet50  	| 96x96  | 40  	| [0.9029](https://www.kaggle.com/lokanathpatro/lp-cancer-detection-models-submission?scriptVersionId=81778571)  	| 0.8798  	| [[LP] Cancer Detection ResNet50 Model](https://www.kaggle.com/lokanathpatro/lp-cancer-detection-resnet50-model)  |
| Ensemble VGG16, ResNet50, Simple Model <br/> with Logistic Regression | 96x96  | 40   | [0.8370](https://www.kaggle.com/lokanathpatro/lp-cancer-detection-ensemble-model-with-lr?scriptVersionId=81787013) | 0.8569  	|  [[LP] Cancer Detection Ensemble Model with LR](https://www.kaggle.com/lokanathpatro/lp-cancer-detection-ensemble-model-with-lr)	|
| Ensemble VGG16, ResNet50, Simple Model <br/> with Weighted Average | 96x96  | 40  |  [0.9298](https://www.kaggle.com/lokanathpatro/lp-cancer-detection-models-submission?scriptVersionId=81797079) 	|  0.9537 	|  [[LP] Cancer Detection Models Submission](https://www.kaggle.com/lokanathpatro/lp-cancer-detection-models-submission?scriptVersionId=81797079)  	|
| PyTorch  	| 96x96  | 40  |  [0.9263](https://www.kaggle.com/lokanathpatro/lp-cancer-detection-cnn-model-with-pytorch?scriptVersionId=81689603) 	| 0.9642  	| [[LP] Cancer Detection CNN Model with PyTorch](https://www.kaggle.com/lokanathpatro/lp-cancer-detection-cnn-model-with-pytorch)  |
| Cropping2D Layer Model  	| 96x96  | 25  |  [0.8368](https://www.kaggle.com/gauravsamudra/gs-cancersubmission-v1?scriptVersionId=79504395) 	| 0.8958  	| [[GS] Cancer Detection with Cropping2D Layer](https://www.kaggle.com/gauravsamudra/gs-cancerdetection-v01?scriptVersionId=79421784)  |
| Model With Pre-Cropped Images  	| 32x32  | 25  |  [0.7224](https://www.kaggle.com/leopoldtchomgwi/lt-cancer-detection-cropped-images-submission?scriptVersionId=79658223) 	| 0.7813  	| [[LT] Cancer Detection with Pre-Cropped Images](https://www.kaggle.com/leopoldtchomgwi/lt-cancer-detection-cropped-im)  |
| Model With Augmented Images  	| 96x96  | 25  |  [0.8585](https://www.kaggle.com/jennaward6/team-4-cancer-detection-submit-jw-edit?scriptVersionId=80319158) 	| 0.9124  	| [[JW] Cancer Detection with Augmented Images](https://www.kaggle.com/jennaward6/jw-cancerdetection-v01-with-visualizations?scriptVersionId=80269876)  |
| Simple Model2  	| 96x96  | 20  |  [0.9231](https://www.kaggle.com/leopoldtchomgwi/cancerdetection-submission-final2?scriptVersionId=81039203) 	| 0.9397  	| [[LT] Cancer Detection_Simple Model](https://colab.research.google.com/drive/1_PvSVhBlYdwXEAqbb7a8RABTXgcU14ME?usp=sharing#scrollTo=yYkbG_C6KJ1Y)  |
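
For readers unfamiliar with the weighted-average ensemble listed in the table, the sketch below shows the general idea. The probabilities and the weights are hypothetical values chosen for illustration; the weights actually used in the linked ensemble notebook are not stated here.

```python
import numpy as np

# Hypothetical malignant-class probabilities from three models for four patches
vgg16_probs    = np.array([0.90, 0.20, 0.60, 0.10])
resnet50_probs = np.array([0.80, 0.40, 0.40, 0.20])
simple_probs   = np.array([0.70, 0.30, 0.50, 0.30])

# Hypothetical weights; they sum to 1 so the result stays a valid probability
weights = np.array([0.5, 0.3, 0.2])

stacked = np.vstack([vgg16_probs, resnet50_probs, simple_probs])  # shape (3, 4)
ensemble_probs = weights @ stacked                                # shape (4,)
ensemble_pred = (ensemble_probs >= 0.5).astype(int)

print(ensemble_probs)
print(ensemble_pred)
```

Each model must be trained and run separately before its predictions can be averaged, which is the training cost referred to above.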

In [None]:
cnn = keras.models.load_model('../input/lp-cancer-detection-vgg16-model-after-40epochs/LP_HCD_VGG16_Model.h5')
keras.utils.plot_model(cnn,show_shapes=True,show_dtype=True,show_layer_names=True,dpi=60)

# Training history

In [None]:
with open("../input/lp-cancer-detection-vgg16-model-after-40epochs/LP_HCD_VGG16_Model_History.pkl", "rb") as pickle_file:
    history = pickle.load(pickle_file)

In [None]:
epoch_range = range(1, len(history['loss'])+1)

plt.figure(figsize=[20,5])
sns.set_style("darkgrid")
plt.subplot(1,3,1)
sns.lineplot(x=epoch_range,y=history['loss'], label='Training')
sns.lineplot(x=epoch_range,y=history['val_loss'], label='Validation')
plt.xlabel('Epoch'); plt.ylabel('Loss'); plt.title('Loss')
plt.legend()
sns.set_style("darkgrid")
plt.subplot(1,3,2)
sns.lineplot(x=epoch_range,y=history['accuracy'], label='Training')
sns.lineplot(x=epoch_range,y=history['val_accuracy'], label='Validation')
plt.xlabel('Epoch'); plt.ylabel('Accuracy'); plt.title('Accuracy')
plt.legend()
sns.set_style("darkgrid")
plt.subplot(1,3,3)
sns.lineplot(x=epoch_range,y=history['auc'], label='Training')
sns.lineplot(x=epoch_range,y=history['val_auc'], label='Validation')
plt.xlabel('Epoch'); plt.ylabel('AUC'); plt.title('AUC')
plt.legend()
plt.tight_layout()
plt.show()

# Data Generator

In [None]:
datagen = ImageDataGenerator(rescale=1/255)

valid_loader = datagen.flow_from_dataframe(
    dataframe = valid,
    directory = train_path,
    x_col = 'path',
    batch_size = BATCH_SIZE,
    shuffle = False,
    class_mode = None,
    target_size = (IMG_SIZE,IMG_SIZE)
)

test_loader = datagen.flow_from_dataframe(
    dataframe = test,
    directory = test_path,
    x_col = 'path',
    batch_size = BATCH_SIZE,
    shuffle = False,
    class_mode = None,
    target_size = (IMG_SIZE,IMG_SIZE)
)

# Test predictions

In [None]:
test_probs = cnn.predict(test_loader)
print(test_probs.shape)

### Probability Distribution plot

In [None]:
sns.set_style("darkgrid")
sns.displot(x=test_probs[:,1], height=7, stat='percent', kde=True)
plt.xlabel('Probability')
plt.title("Probability Distribution Plot (Test Dataset)")
plt.show()

# Validation predictions

In [None]:
valid_probs = cnn.predict(valid_loader)
print(valid_probs.shape)

In [None]:
valid['label'] = pd.to_numeric(valid['label'])
valid['prediction'] = np.argmax(valid_probs, axis=1)
valid['probability'] = valid_probs[:,1]

In [None]:
valid.head()

### Probability Distribution Plot

In [None]:
sns.set_style("darkgrid")
sns.displot(data=valid, x='probability', height=7, stat='percent', kde=True)
plt.xlabel('Probability')
plt.title("Probability Distribution Plot (Validation Dataset)")
plt.show()

# Test vs Validation Probability Distribution Plot 

In [None]:
sns.set(rc = {'figure.figsize':(15,8)})
sns.set_style("darkgrid")
sns.histplot(data=valid, x='probability', color="skyblue", label="Validation", kde=True)
sns.histplot(x=test_probs[:,1], color="teal", label="Test", kde=True)
plt.legend()
plt.xlabel('Probability')
plt.title("Probability Distribution Plot (Test vs Validation)")
plt.show()

# Accuracy

Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Formally, accuracy has the following definition. 

$
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}
$

In [None]:
accuracy = accuracy_score(valid.label, valid.prediction)
print('For our model, the accuracy is %f' % accuracy)
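
As a small worked example of the definition, the sketch below computes accuracy by hand on eight made-up labels and compares the result with `accuracy_score`. The arrays are illustrative and unrelated to the validation data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up ground truth and predictions (0 = Benign, 1 = Malignant)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 0])

# Number of correct predictions divided by the total number of predictions
manual = (y_true == y_pred).sum() / len(y_true)

print(manual)                          # 0.75 (6 of 8 correct)
print(accuracy_score(y_true, y_pred))  # same value from scikit-learn
```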

# Confusion Matrix With and Without Bias

A confusion matrix tells us how many examples from each class in our validation set the model predicted correctly and how many it confused with the other class. For an imbalanced dataset like the one we're dealing with, this is a better measure of our model's performance than overall accuracy.

In [None]:
def getLabels(cm):    
    group_names = ["True Benign","False Malignant","False Benign","True Malignant"]
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten()/np.sum(cm)]
    labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
    return np.asarray(labels).reshape(2,2)

In [None]:
plt.figure(figsize = (21,7))
plt.subplot(1,3,1)
cm=confusion_matrix(valid.label, valid.prediction)
np.set_printoptions(precision=2)
sns.heatmap(cm/np.sum(cm), fmt='', cbar=False, annot=getLabels(cm), annot_kws={"fontsize":20},
            xticklabels=categories, yticklabels=categories, cmap='flare')
plt.title("Confusion Matrix without bias")

plt.subplot(1,3,2)
cm=confusion_matrix(valid.label, [1 if prob>=0.4 else 0 for prob in valid.probability])
np.set_printoptions(precision=2)
sns.heatmap(cm/np.sum(cm), fmt='', cbar=False, annot=getLabels(cm), annot_kws={"fontsize":20},
            xticklabels=categories, yticklabels=categories, cmap='flare')
plt.title("Confusion Matrix with 10% positive bias")

plt.subplot(1,3,3)
cm=confusion_matrix(valid.label, [1 if prob>=0.3 else 0 for prob in valid.probability])
np.set_printoptions(precision=2)
sns.heatmap(cm/np.sum(cm), fmt='', cbar=False, annot=getLabels(cm), annot_kws={"fontsize":20},
            xticklabels=categories, yticklabels=categories, cmap='flare')
plt.title("Confusion Matrix with 20% positive bias")
plt.tight_layout()
plt.show()


# Special Note
In the first confusion matrix, accuracy is about 94%, but the False Benign rate is around 3.45%, roughly twice the False Malignant rate. When a trusted model predicts Benign, the case does not go for further review, so the False Benign rate is a very important measure of trust after accuracy. By that measure, our final model is NOT up to the mark.

For our submissions, anything with more than 50% probability is predicted as Malignant and anything with less than 50% is predicted as Benign.

The second confusion matrix shows that when we add a 10 percentage point positive bias to our probabilities (that is, lower the threshold from 0.5 to 0.4) and then derive the predictions, the accuracy stays about the same but the False Benign percentage drops significantly. If we stretch further and add 20 percentage points (a threshold of 0.3), accuracy goes down along with the False Benign percentage.
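
The effect of the bias can be reproduced on a toy example. The sketch below is illustrative only: the labels and probabilities are made up, and `rates_at` is a hypothetical helper that mirrors the thresholding used for the confusion matrices above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and malignant-class probabilities, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
probs = np.array([0.05, 0.10, 0.31, 0.35, 0.45, 0.60, 0.32, 0.42, 0.70, 0.90])

def rates_at(threshold):
    """Return (accuracy, false-benign share, false-malignant share) at a cutoff."""
    y_pred = (probs >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    total = tn + fp + fn + tp
    return (tn + tp) / total, fn / total, fp / total

for t in (0.5, 0.4, 0.3):
    acc, fb, fm = rates_at(t)
    print(f"threshold={t:.1f}  accuracy={acc:.2f}  "
          f"false benign={fb:.2f}  false malignant={fm:.2f}")
```

On this toy data, moving the threshold from 0.5 to 0.4 halves the false-benign share while accuracy stays at 0.70; at 0.3 the false-benign share reaches zero but accuracy falls to 0.60, because more benign patches are flagged as malignant.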

# Classification Report

The classification report lets us look at precision and recall.

Precision is defined as follows

$
Precision\ =\ \frac{TP}{TP+FP}
$

Precision helps us answer the question _"What proportion of positive identifications was actually correct?"_

Recall is defined as follows

$
Recall\ =\ \frac{TP}{TP+FN}
$

Recall helps us answer the question _"What proportion of actual positives was identified correctly?"_


In [None]:
print(classification_report(valid.label, valid.prediction, target_names=categories))
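
To connect the two formulas to the confusion matrix, here is a minimal worked example on ten made-up labels, treating Malignant (1) as the positive class. The data is illustrative only.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up ground truth and predictions (0 = Benign, 1 = Malignant)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

# For binary labels [0, 1], ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

precision = tp / (tp + fp)  # 3 / (3 + 1) = 0.75
recall = tp / (tp + fn)     # 3 / (3 + 2) = 0.60

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```

A low recall for the Malignant class corresponds to a high False Benign count, which is the error discussed in the special note above.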

# Benign Sample Images (True & False)

In [None]:
# Patches predicted Benign: correct when the true label is 0, false when it is 1
temp = valid[valid.prediction==0]
trueBenign = temp[temp.label==0]['path'][:4].tolist()
falseBenign = temp[temp.label==1]['path'][:4].tolist()

In [None]:
plt.figure(figsize=(16,10))

for i in range(4):
    plt.subplot(2,4,i+1)
    plt.imshow(mpimg.imread(train_path+"/"+trueBenign[i]))
    plt.text(0, -5, f'True Benign', color='k')
    plt.axis('off')
    
    plt.subplot(2,4,i+5)
    plt.imshow(mpimg.imread(train_path+"/"+falseBenign[i]))
    plt.text(0, -5, f'False Benign', color='k')
    plt.axis('off')

plt.tight_layout()
plt.show()

# Malignant Sample Images (True & False) 

In [None]:
# Patches predicted Malignant: correct when the true label is 1, false when it is 0
temp = valid[valid.prediction==1]
trueMalignant = temp[temp.label==1]['path'][:4].tolist()
falseMalignant = temp[temp.label==0]['path'][:4].tolist()

In [None]:
plt.figure(figsize=(16,10))

for i in range(4):
    plt.subplot(2,4,i+1)
    plt.imshow(mpimg.imread(train_path+"/"+trueMalignant[i]))
    plt.text(0, -5, f'True Malignant', color='k')
    plt.axis('off')
    
    plt.subplot(2,4,i+5)
    plt.imshow(mpimg.imread(train_path+"/"+falseMalignant[i]))
    plt.text(0, -5, f'False Malignant', color='k')
    plt.axis('off')

plt.tight_layout()
plt.show()