<H2>Histopathologic Cancer Detection</H2>
<br>Identify metastatic tissue in histopathologic scans of lymph node sections 
using image processing and Convolutional Neural Networks</br>

<H3>Introduction</H3>
<br>This kernel evaluates three popular image augmentation techniques and explores how they affect the classification accuracy of Convolutional Neural Networks</br>
<br></br>

<H5>Data:</H5>
<br>PatchCamelyon (PCam) benchmark dataset</br>
<br>278,000 scans of lymph node sections with labels (cancer/non-cancer)</br>
<br></br>

<H5>Goal:</H5>
<br>Implement different image augmentation algorithms for processing the training images, in order to optimize the CNN model performance</br>
<br></br>

<H5>Content:</H5>
<br>The primary content of this kernel consists of:</br>
<ol>
    <li>Image Cleaning/Preprocessing</li>
    <li>Image Augmentations Crux</li>
    <li>Convolutional Neural Networks (CNN)</li>
    <li>Model Evaluation</li>
    <li>Conclusion</li>
</ol> 

<H3>Image Cleaning/Preprocessing</H3>
<br>One of the most important reasons for preprocessing the input images is to help reduce the training error</br>
<br>In this section, the training and testing images are re-balanced to an equal number in each class (label), and outliers (defective images) are removed</br>

In [None]:
#Preparing the required packages
from tensorflow import set_random_seed
set_random_seed(101)
import numpy as np
import pandas as pd
import os
import cv2
import shutil
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import random
from sklearn.utils import shuffle
from glob import glob
from sklearn.model_selection import train_test_split
from keras.preprocessing.image import ImageDataGenerator
from keras.layers import Convolution1D, concatenate, SpatialDropout1D, GlobalMaxPool1D, GlobalAvgPool1D, Embedding, \
    Conv2D, SeparableConv1D, Add, BatchNormalization, Activation, GlobalAveragePooling2D, LeakyReLU, Flatten
from keras.layers import Dense, Input, Dropout, MaxPooling2D, Concatenate, GlobalMaxPooling2D, GlobalAveragePooling2D, \
    Lambda, Multiply, LSTM, Bidirectional, PReLU, MaxPooling1D
from keras.layers.pooling import _GlobalPooling1D
from keras.losses import mae, sparse_categorical_crossentropy, binary_crossentropy
from keras.models import Model
from keras.applications.nasnet import NASNetMobile, NASNetLarge, preprocess_input
from keras.optimizers import Adam, RMSprop
from keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
from numpy.random import seed
seed(101)  #Seed NumPy too, so that pandas sampling and shuffling are reproducible
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
import itertools

In [None]:
#Setting the base directory (original training folder)
base_tile_dir = 'histopathologic-cancer-detection/train/'

#Merge the image paths with the labels from the csv, using the file name (without extension) as id
df = pd.DataFrame({'path': glob(os.path.join(base_tile_dir,'*.tif'))})
df['id'] = df.path.map(lambda x: os.path.splitext(os.path.basename(x))[0])
labels = pd.read_csv("histopathologic-cancer-detection/train_labels.csv")
df_whole = df.merge(labels, on = "id")

<H5>Removing outliers:</H5>
<br>The training set contains 6 completely white images and 1 completely black image, which would undermine the training process. Therefore, they are removed</br>

In [None]:
#Remove outliers 
#All white
whiteList = ['f6f1d771d14f7129a6c3ac2c220d90992c30c10b',
             '9071b424ec2e84deeb59b54d2450a6d0172cf701', 
             'c448cd6574108cf14514ad5bc27c0b2c97fc1a83', 
             '54df3640d17119486e5c5f98019d2a92736feabc', 
             '5f30d325d895d873d3e72a82ffc0101c45cba4a8', 
             '5a268c0241b8510465cb002c4452d63fec71028a']

#All black
blackList = ['9369c7278ec8bcc6c880d99194de09fc2bd4efbe']

#Remove outliers from training set
for whiteId in whiteList:
    df_whole = df_whole[df_whole['id'] != whiteId]

for blackId in blackList:
    df_whole = df_whole[df_whole['id'] != blackId]
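
<br>The outlier IDs above are hard-coded. As a hedged sketch of how such images could be found automatically, the hypothetical helper <b>is_blank</b> below flags an image whose pixels are almost all near-white or near-black. The function name and the thresholds (10, 245, 0.99) are assumptions for illustration, not values taken from this kernel, and the test images are synthetic</br>

```python
import numpy as np

def is_blank(img, low=10, high=245, fraction=0.99):
    """Return 'black', 'white' or None for a uint8 image array.

    low, high and fraction are illustrative thresholds (assumptions).
    """
    pixels = np.asarray(img)
    # Share of pixels darker than `low` / brighter than `high`
    dark_share = np.mean(pixels < low)
    bright_share = np.mean(pixels > high)
    if dark_share >= fraction:
        return 'black'
    if bright_share >= fraction:
        return 'white'
    return None

# Synthetic 96x96 RGB arrays standing in for real scans
white_img = np.full((96, 96, 3), 255, dtype=np.uint8)
black_img = np.zeros((96, 96, 3), dtype=np.uint8)
noisy_img = np.random.RandomState(101).randint(0, 256, (96, 96, 3)).astype(np.uint8)

print(is_blank(white_img), is_blank(black_img), is_blank(noisy_img))  # white black None
```

<br>In practice the helper would be applied to each image read with cv2.imread, and the flagged IDs reviewed by eye before removal</br>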

<H5>Re-balancing training/testing labels:</H5>
<br>The original data has a 60 / 40 split of negative to positive samples. To balance the labels, 10000 samples of each class (0 and 1) are selected randomly from the training set</br>

In [None]:
### Random Sampling of 10000 for both 0 and 1 cases
SAMPLE_SIZE = 10000

#Class 0
df_0 = df_whole[df_whole['label'] == 0].sample(SAMPLE_SIZE, random_state = 101)
#Class 1
df_1 = df_whole[df_whole['label'] == 1].sample(SAMPLE_SIZE, random_state = 101)

# Concat the dataframes
df_data = pd.concat([df_0, df_1], axis=0).reset_index(drop=True)
# Shuffle (fixed random_state so that the order is reproducible)
df_data = df_data.sample(frac=1, random_state=101).reset_index(drop=True)
# View the numbers in each class
# df_data['label'].value_counts()

<H3>Image Augmentations Crux</H3>
<br>This section contains code for three different augmentation techniques/filters:</br>
<ol>
  <li>Linear Blur and Sharpen</li>
  <li>Gaussian Blur</li>
  <li>Random Whitening/Contrast</li>
</ol> 
<br>Image augmentation is primarily used to avoid overfitting and to generate new images for training</br>
<br>In this kernel, however, we use it to emphasize the features (in the center 86x86px) and reduce the noise (in the surrounding 5px), to see whether these modifications yield better accuracy scores for the CNN model</br>
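
<br>Both the blur and the sharpening are 3x3 linear filters: each output pixel is a weighted sum of its nine neighbours. The sketch below is a vectorized NumPy illustration of that idea, not the code used for the runs in this kernel. The helper names <b>weighted_kernel</b> and <b>filter3x3</b> are hypothetical, the input is random data, and the result is rounded and clipped before the cast to uint8</br>

```python
import numpy as np

def weighted_kernel(b):
    """3x3 kernel [[1,b,1],[b,b*b,b],[1,b,1]], normalised so the weights sum to 1."""
    row = np.array([1.0, b, 1.0])
    kernel = np.outer(row, row)
    return kernel / kernel.sum()

def filter3x3(channel, kernel):
    """Apply a 3x3 kernel to a 2-D uint8 channel; the output loses a 1px border."""
    src = channel.astype(np.float64)
    h, w = src.shape
    out = np.zeros((h - 2, w - 2))
    # Sum the nine shifted copies of the image, each scaled by its kernel weight
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * src[di:di + h - 2, dj:dj + w - 2]
    # Round and clip before casting so out-of-range values do not wrap around
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

blur_kernel = weighted_kernel(2)        # 1/16 * [[1,2,1],[2,4,2],[1,2,1]]
sharpen_kernel = -np.ones((3, 3)) / 8   # neighbours -1/8 ...
sharpen_kernel[1, 1] = 16 / 8           # ... centre 16/8, so the weights sum to 1

channel = np.random.RandomState(101).randint(0, 256, (96, 96)).astype(np.uint8)
blurred = filter3x3(channel, blur_kernel)
sharpened = filter3x3(channel, sharpen_kernel)
print(blurred.shape, sharpened.shape)   # (94, 94) (94, 94)
```

<br>Because both kernels sum to 1, a flat region keeps its brightness; only local contrast is smoothed (blur) or amplified (sharpen)</br>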

In [None]:
### Crux functions

#Blur the surrounding 5px and sharpen the inner 86px
def blur_sharp(color):
    #New image matrix placeholder
    matrix = np.zeros((96,96), dtype=np.uint8)
    #Signed integer copy so that sums and differences cannot overflow uint8
    R_new = color.astype(int)
    
    #3x3 Gaussian blur kernel (weights 1/16, 1/8, 1/4) evaluated at pixel (i,j)
    def blur(i, j):
        return np.uint8(R_new[i-1,j-1]/16 + R_new[i-1,j]/8 + R_new[i-1,j+1]/16 + 
                        R_new[i,j-1]/8 + R_new[i,j]/4 + R_new[i,j+1]/8 +
                        R_new[i+1,j-1]/16 + R_new[i+1,j]/8 + R_new[i+1,j+1]/16)
    
    #Gaussian blur on the border strips (rows/columns 1-4 and 91-94)
    border = list(range(1,5)) + list(range(91,95))
    for i in border:
        for j in range(1,95):
            matrix[i,j] = blur(i, j)
    for i in range(1,95):
        for j in border:
            matrix[i,j] = blur(i, j)
    
    #High-pass sharpening of the inner region (rows/columns 4-90)
    for i in range(4,91):
        for j in range(4,91):
            value = (R_new[i,j]*16 - R_new[i-1,j-1] - R_new[i-1,j] - R_new[i-1,j+1] - 
                     R_new[i,j-1] - R_new[i,j+1] -
                     R_new[i+1,j-1] - R_new[i+1,j] - R_new[i+1,j+1]) / 8
            #Clip to 0-255 so that out-of-range values do not wrap around in uint8
            matrix[i,j] = np.uint8(np.clip(value, 0, 255))
    
    #plt.figure(figsize=(10,10))
    final = matrix[1:95,1:95]
    #plt.imshow(final) 
    return final


#Weighted Averaging filter (Gaussian Filter)
def weighted_filter(color, b):
    matrix = np.zeros((96,96), dtype=np.uint8)
    #Signed integer copy so that the weighted sum cannot overflow uint8
    R_new = color.astype(int)
    #The kernel weights (four 1s, four b's, one b*b) sum to (2+b)^2; dividing by that keeps the brightness unchanged
    norm = (2+b)*(2+b)
    for i in range(1,95):
        for j in range(1,95):
            matrix[i,j] = np.uint8((R_new[i-1,j-1]*1 + R_new[i-1,j]*b + R_new[i-1,j+1]*1 + 
                           R_new[i,j-1]*b + R_new[i,j]*b*b + R_new[i,j+1]*b +
                           R_new[i+1,j-1]*1 + R_new[i+1,j]*b + R_new[i+1,j+1]*1) / norm)
    #plt.figure(figsize=(10,10))
    final = matrix[1:95,1:95]
    #plt.imshow(final) 
    return final
    

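#Brightness/Contrast adjustment
def bright_contrast(input_img): 
    b,g,r = cv2.split(input_img)
    #Crop to 94*94 to match the output size of the other filters
    b = b[1:95,1:95]
    g = g[1:95,1:95]
    r = r[1:95,1:95]
    #Merge in RGB order (cv2.imread returns BGR); work in float to avoid uint8 overflow
    rgb_img = cv2.merge([r,g,b]).astype(np.float32)
    RANDOM_BRIGHTNESS = 64  # range (0-100), 0=no change
    RANDOM_CONTRAST = 7   # range (0-100), 0=no change
    
    # Random brightness: br is a fraction of the full range, so scale it to 0-255 pixel units
    br = random.randint(-RANDOM_BRIGHTNESS, RANDOM_BRIGHTNESS) / 100.
    rgb_img = rgb_img + br * 255
        
    # Random contrast
    cr = 1.0 + random.randint(-RANDOM_CONTRAST, RANDOM_CONTRAST) / 100.
    rgb_img = rgb_img * cr
    
    # Clip to 0-255 before converting back, so that values do not wrap around
    rgb_img = np.uint8(np.clip(rgb_img, 0, 255))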
    
    #plt.figure(figsize=(10,10))
    #plt.imshow(rgb_img)
    return rgb_img

#Apply one of the three filters above to the input image
def apply(img, function):
    #Channels in the order stored in img (BGR when read with cv2.imread)
    R_initial = img[:,:,0]
    G_initial = img[:,:,1]
    B_initial = img[:,:,2]
    
    #Apply function
    if (function == 'weighted_filter'):
        R_final = weighted_filter(R_initial, 2)
        G_final = weighted_filter(G_initial, 2)
        B_final = weighted_filter(B_initial, 2)
    elif (function == 'blur_sharp'):
        R_final = blur_sharp(R_initial)
        G_final = blur_sharp(G_initial)
        B_final = blur_sharp(B_initial)
    elif (function == 'bright_contrast'):
        img_final = bright_contrast(img)
        return img_final
    else:
        raise ValueError('Unknown filter: ' + str(function))
    
    #Stack the filtered channels back in their original order
    img_final = np.dstack((R_final, G_final, B_final))
    #plt.figure(figsize=(10,10))
    #plt.imshow(img_final)
    return img_final

In [None]:
#Examples:
img = cv2.imread('histopathologic-cancer-detection/train/d42e09bc5560bb88ef86b34f58e0657381455fa2.tif')
case1 = apply(img, 'weighted_filter')
case2 = apply(img, 'blur_sharp')
case3 = apply(img, 'bright_contrast')

<H5>Results:</H5>

<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/linear_filter.png" alt="Linear Filter">

<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/gaussian_blur.png" alt="Gaussian Blur">

<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/w_c.png" alt="Random Whitening/Contrast">

<H3>Convolutional Neural Networks (CNN)</H3>
<br>This section contains the code for the CNN training pipeline, which consists of:</br>
<ol>
  <li>Split Train and Test Sets</li>
  <li>Copy Images to New Directory</li>
  <li>Build CNN Model</li>
  <li>Save Model Locally</li>
</ol> 
<br>Note: The parameters for building the CNN model are the same for all three runs, which differ only in their training images:</br>
<ol>
  <li>Original Image</li>
  <li>Augmented Image 1 (Linear Filter)</li>
  <li>Augmented Image 2 (Gaussian Blur + White/Contrast)</li>
</ol> 
<br>To run the code for the three different datasets, you will need to change the names used for making the directory and saving the model, and run the code that applies the filters to the training images (explained below)</br>
<br>In my code (not shown), I set the original dataset directory to <b>base_dir</b>, and used <b>augement_1</b> for the linear filter and <b>augement_2</b> for the other</br>

In [None]:
# Train and Test
# stratify=y creates a balanced validation set.
y = df_data['label'] #response variable

#Split by 9(df_train)/1(df_val)
df_train, df_val = train_test_split(df_data, test_size=0.10, random_state=101, stratify=y)

print(df_train.shape)
print(df_val.shape)


### Create directory separated from the entire training set
'''
Structure:
    
- base_dir  <------------------------ The name should be changed when running different training sets
    - train_dir
        - no_tumor_tissue
        - has_tumor_tissue
    - val_dir
        - no_tumor_tissue
        - has_tumor_tissue
'''

base_dir = 'histopathologic-cancer-detection/base_dir' # <-------------change the name for run2 and run3
os.mkdir(base_dir)

# train_dir
train_dir = os.path.join(base_dir, 'train_dir')
os.mkdir(train_dir)

# val_dir
val_dir = os.path.join(base_dir, 'val_dir')
os.mkdir(val_dir)

# create new folders inside train_dir
no_tumor_tissue = os.path.join(train_dir, 'a_no_tumor_tissue')
os.mkdir(no_tumor_tissue)
has_tumor_tissue = os.path.join(train_dir, 'b_has_tumor_tissue')
os.mkdir(has_tumor_tissue)


# create new folders inside val_dir
no_tumor_tissue = os.path.join(val_dir, 'a_no_tumor_tissue')
os.mkdir(no_tumor_tissue)
has_tumor_tissue = os.path.join(val_dir, 'b_has_tumor_tissue')
os.mkdir(has_tumor_tissue)

# Set the ID of each image to be the index of table
df_data.set_index('id', inplace=True)

In [None]:
### Transfer train/test images to created folder
# Get a list of train and val images
train_list = list(df_train['id'])
val_list = list(df_val['id'])


# Transfer the training images
for image in train_list:
    
    # the id in the csv file does not have the .tif extension therefore we add it here
    fname = image + '.tif'
    # get the label for a certain image
    target = df_data.loc[image,'label']
    
    # these must match the folder names
    if target == 0:
        label = 'a_no_tumor_tissue'
    if target == 1:
        label = 'b_has_tumor_tissue'
    
    # source path to image
    src = os.path.join('histopathologic-cancer-detection/train/', fname)
    # destination path to image
    dst = os.path.join(train_dir, label, fname)
    # copy the image from the source to the destination
    shutil.copyfile(src, dst)


# Transfer the validation images
for image in val_list:
    
    # the id in the csv file does not have the .tif extension therefore we add it here
    fname = image + '.tif'
    # get the label for a certain image
    target = df_data.loc[image,'label']
    
    # these must match the folder names
    if target == 0:
        label = 'a_no_tumor_tissue'
    if target == 1:
        label = 'b_has_tumor_tissue'
    

    # source path to image
    src = os.path.join('histopathologic-cancer-detection/train/', fname)
    # destination path to image
    dst = os.path.join(val_dir, label, fname)
    # copy the image from the source to the destination
    shutil.copyfile(src, dst)

<br>The code below provides a function that filters the copied images in the training/validation folders and writes them back to the same directory</br>
<br>Run it only for Model 2 or Model 3, each time after completing the previous steps</br>

In [None]:
### For Model 2 only

#Apply the linear filter (blur_sharp) to every image in the training/validation folders
aug_base_dir = 'histopathologic-cancer-detection/augement_1/'
aug_train_dir_1 = aug_base_dir + '/train_dir/a_no_tumor_tissue'
aug_train_dir_2 = aug_base_dir + '/train_dir/b_has_tumor_tissue'
aug_val_dir_1 = aug_base_dir + '/val_dir/a_no_tumor_tissue'
aug_val_dir_2 = aug_base_dir + '/val_dir/b_has_tumor_tissue'

def augment_dir(folder):
    for filename in os.listdir(folder):
        img = cv2.imread(os.path.join(folder,filename))
        if img is not None:
            new_img = apply(img, 'blur_sharp')
            cv2.imwrite(os.path.join(folder,filename), new_img)

#Augment and write back all images for given directory   
augment_dir(aug_train_dir_1)
augment_dir(aug_train_dir_2)
augment_dir(aug_val_dir_1)
augment_dir(aug_val_dir_2)

In [None]:
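# For Model 3 only

#Directories of the second augmented copy (augement_2)
aug_base_dir = 'histopathologic-cancer-detection/augement_2/'
aug_train_dir_1 = aug_base_dir + '/train_dir/a_no_tumor_tissue'
aug_train_dir_2 = aug_base_dir + '/train_dir/b_has_tumor_tissue'
aug_val_dir_1 = aug_base_dir + '/val_dir/a_no_tumor_tissue'
aug_val_dir_2 = aug_base_dir + '/val_dir/b_has_tumor_tissue'

#Apply the Gaussian filter to half of the images and the bright-contrast filter to the other half
def augment_dir(folder):
    i = 1
    for filename in os.listdir(folder):
        img = cv2.imread(os.path.join(folder,filename))
        if img is not None:
            if i % 2 == 0: 
                new_img = apply(img, 'weighted_filter')
            else:
                #bright_contrast returns RGB, while cv2.imwrite expects BGR
                new_img = cv2.cvtColor(apply(img, 'bright_contrast'), cv2.COLOR_RGB2BGR)
            cv2.imwrite(os.path.join(folder,filename), new_img)
            i = i + 1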

#Augment and write back all images for given directory   
augment_dir(aug_train_dir_1)
augment_dir(aug_train_dir_2)
augment_dir(aug_val_dir_1)
augment_dir(aug_val_dir_2)

<H4>Build CNN Model</H4>
<br>The layers and parameters chosen here are inspired by the <a href="https://www.kaggle.com/vbookshelf/cnn-how-to-use-160-000-images-without-crashing">kernel by Marsh</a>, which is well suited to this particular dataset</br>

In [None]:
### Base Model (with original images)
train_path = 'histopathologic-cancer-detection/base_dir/train_dir' # <----------------Change the dir for run2 and run3
valid_path = 'histopathologic-cancer-detection/base_dir/val_dir'   # <----------------Change the dir for run2 and run3
test_path = 'histopathologic-cancer-detection/test'                

num_train_samples = len(df_train)
num_val_samples = len(df_val)

# Define the batch size and steps
train_batch_size = 10
val_batch_size = 10
train_steps = int(np.ceil(num_train_samples / train_batch_size))  # step counts must be integers
val_steps = int(np.ceil(num_val_samples / val_batch_size))

### Generators
datagen = ImageDataGenerator(rescale=1.0/255)

IMAGE_SIZE = 96                                                  # <----------------Change size to be 94 for run2 and run3
IMAGE_CHANNELS = 3

train_gen = datagen.flow_from_directory(train_path,
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        batch_size=train_batch_size,
                                        class_mode='categorical')

val_gen = datagen.flow_from_directory(valid_path,
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        batch_size=val_batch_size,
                                        class_mode='categorical')

test_gen = datagen.flow_from_directory(valid_path,
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        batch_size=1,
                                        class_mode='categorical',
                                        shuffle=False)

### Building Convolutional Neural Networks (CNN)
#Parameters
kernel_size = (3,3)
pool_size= (2,2)
first_filters = 32
second_filters = 64
third_filters = 128

dropout_conv = 0.3
dropout_dense = 0.3

#Build Model
model = Sequential()
model.add(Conv2D(first_filters, kernel_size, activation = 'relu', input_shape = (IMAGE_SIZE, IMAGE_SIZE, IMAGE_CHANNELS)))  # <----Follows IMAGE_SIZE (94 for run2 and run3)
model.add(Conv2D(first_filters, kernel_size, activation = 'relu'))
model.add(Conv2D(first_filters, kernel_size, activation = 'relu'))
model.add(MaxPooling2D(pool_size = pool_size)) 
model.add(Dropout(dropout_conv))

model.add(Conv2D(second_filters, kernel_size, activation ='relu'))
model.add(Conv2D(second_filters, kernel_size, activation ='relu'))
model.add(Conv2D(second_filters, kernel_size, activation ='relu'))
model.add(MaxPooling2D(pool_size = pool_size))
model.add(Dropout(dropout_conv))

model.add(Conv2D(third_filters, kernel_size, activation ='relu'))
model.add(Conv2D(third_filters, kernel_size, activation ='relu'))
model.add(Conv2D(third_filters, kernel_size, activation ='relu'))
model.add(MaxPooling2D(pool_size = pool_size))
model.add(Dropout(dropout_conv))

model.add(Flatten())
model.add(Dense(256, activation = "relu"))
model.add(Dropout(dropout_dense))
model.add(Dense(2, activation = "softmax"))

model.summary()
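
<br>To make the layer sizes easier to follow, the sketch below traces how the feature map shrinks through the three convolution blocks. It assumes the Keras defaults used above (unpadded 'valid' convolutions with stride 1, and pooling with a stride equal to the pool size); the helper names are hypothetical</br>

```python
def conv_output_size(size, kernel=3):
    """Output width of an unpadded ('valid') convolution with stride 1."""
    return size - kernel + 1

def pool_output_size(size, pool=2):
    """Output width of max pooling whose stride equals the pool size."""
    return size // pool

def flattened_features(image_size, filters=(32, 64, 128), convs_per_block=3):
    """Return (final feature map width, number of values seen by Flatten)."""
    size = image_size
    for _ in filters:
        for _ in range(convs_per_block):
            size = conv_output_size(size)
        size = pool_output_size(size)
    # Flatten sees size x size x (filters of the last block) values
    return size, size * size * filters[-1]

for image_size in (96, 94):
    print(image_size, flattened_features(image_size))
# 96 (6, 4608)
# 94 (6, 4608)
```

<br>Both input sizes end in a 6x6x128 feature map, so the dense layers have the same number of weights in all three runs</br>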

<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/parameters.png" alt="Parameters">

In [None]:
### Train the model
model.compile(Adam(lr=0.0001), loss='binary_crossentropy', metrics=['accuracy'])

filepath = "histopathologic-cancer-detection/model_1"  # <----------------Change dir for saving model for run2 and run3
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, 
                             save_best_only=True, mode='max')

reduce_lr = ReduceLROnPlateau(monitor='val_acc', factor=0.5, patience=2, 
                                   verbose=1, mode='max', min_lr=0.00001)
                              
                              
callbacks_list = [checkpoint, reduce_lr]

#Train for 20 epochs and keep the per-epoch history log
history = model.fit_generator(train_gen, steps_per_epoch=train_steps, 
                    validation_data=val_gen,
                    validation_steps=val_steps,
                    epochs=20, verbose=1,
                   callbacks=callbacks_list)
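
<br>ReduceLROnPlateau lowers the learning rate to max(lr * factor, min_lr) each time the monitored metric stops improving. As a small worked example, the hypothetical helper below lists the learning rates that would result if a reduction were triggered repeatedly with the settings above; how often that actually happens depends on the run</br>

```python
def lr_after_reductions(initial_lr=1e-4, factor=0.5, min_lr=1e-5, reductions=5):
    """Learning rates after 0, 1, 2, ... successive plateau reductions."""
    lrs = [initial_lr]
    for _ in range(reductions):
        # Each reduction multiplies by `factor` but never goes below `min_lr`
        lrs.append(max(lrs[-1] * factor, min_lr))
    return lrs

for n, lr in enumerate(lr_after_reductions()):
    print(n, lr)
```

<br>With a factor of 0.5, the floor of 0.00001 is reached at the fourth reduction, after which the learning rate no longer changes</br>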

<H4>History Log Outputs for Model 1, 2 and 3</H4>

<br>Model 1 (Original Image)</br>
<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/epoch_2.png" alt="epoch_2">
<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/epoch_3.png" alt="epoch_3">
<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/epoch_19.png" alt="epoch_19">

<br>Model 2 (Linear-Filtered Image)</br>
<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/epoch_1(2).png" alt="epoch_1(2)">
<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/epoch_2(2).png" alt="epoch_2(2)">
<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/epoch_20(2).png" alt="epoch_20(2)">

<br>Model 3 (Gaussian/White/Contrast Image)</br>
<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/epoch_1(3).png" alt="epoch_1(3)">
<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image//epoch_2(3).png" alt="epoch_2(3)">
<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/epoch_20(3).png" alt="epoch_20(3)"> 

<H3>Model Evaluation</H3>
<br>This section contains code for evaluating the three trained CNN models on the test dataset, which consists of:</br>
<ol>
  <li>Accuracy and Loss</li>
  <li>Prediction Accuracy</li>
  <li>AUC Score</li>
  <li>Confusion Matrix</li>
</ol> 
<br>Note: The path of the saved model needs to be changed to compare the results of the three different models</br>

<H4>Accuracy and loss</H4>

In [None]:
#Get accuracy and loss numerical values
model.metrics_names
model.load_weights('histopathologic-cancer-detection/model_1') #<-----------------Change the name for loading model 2 or 3

val_loss, val_acc = \
model.evaluate_generator(test_gen, 
                        steps=len(df_val))

print('val_loss:', val_loss)
print('val_acc:', val_acc)

<br>Results for Model 1 (Original)</br>
<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/loss_acc_1.png" alt="loss_acc_1">

<br>Results for Model 2 (Linear-Filtered Image)</br>
<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/loss_acc_2.png" alt="loss_acc_2">

<br>Results for Model 3 (Gaussian/White/Contrast Image)</br>
<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/loss_acc_3.png" alt="loss_acc_3"> 

In [None]:
# Plot accuracy and loss
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.figure(figsize=(15,10))
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss of original dataset')
plt.legend()


plt.figure(figsize=(15,10))
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy of original dataset')
plt.legend()

<br>Results for Model 1 (Original)</br>
<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/loss_1.png" alt="loss_1" width="600px" height="450px">
<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/accuracy_1.png" alt="accuracy_1" width="600px" height="450px">

<br>Results for Model 2 (Linear-Filtered Image)</br>
<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/loss_3.png" alt="loss_2" width="600px" height="450px">
<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/accuracy_3.png" alt="accuracy_2" width="600px" height="450px">

<br>Results for Model 3 (Gaussian/White/Contrast Image)</br>
<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/loss_2.png" alt="loss_3" width="600px" height="450px">
<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/accuracy_2.png" alt="accuracy_3" width="600px" height="450px">

<H4>Prediction Accuracy</H4>

In [None]:
#Prediction
predictions = model.predict_generator(test_gen, steps=len(df_val), verbose=1)
df_preds = pd.DataFrame(predictions, columns=['no_tumor_tissue', 'has_tumor_tissue'])

# Get the true labels
y_true = test_gen.classes

# Get the predicted labels as probabilities
y_pred = df_preds['has_tumor_tissue']

<H4>AUC Score</H4>

In [None]:
#AUC
roc_auc_score(y_true, y_pred)
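
<br>The AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way on four toy scores; <b>auc_by_pairs</b> is a hypothetical helper for illustration and the numbers are not results from this kernel</br>

```python
import numpy as np

def auc_by_pairs(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]
print(auc_by_pairs(y_true_demo, y_score_demo))  # 0.75
```

<br>Three of the four positive/negative pairs are ordered correctly, hence 0.75. The pairwise comparison uses memory proportional to positives times negatives, so it is meant for small examples only</br>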

<br>Loss, Accuracy and AUC Score of Three Different Datasets</br>
<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/result_table.png" alt="table">

<H4>Confusion Matrix</H4>
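<br>The cell below calls <b>plot_confusion_matrix</b>, which is not defined anywhere in this kernel. A minimal sketch of such a helper, assuming a matplotlib heat map with the counts written into the cells, could look like this; the layout choices are assumptions, and the matrix in the usage line is a toy example</br>

```python
import itertools
import numpy as np
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, classes, title='Confusion Matrix', cmap=plt.cm.Blues):
    """Draw an integer confusion matrix (rows = true labels, columns = predictions)."""
    cm = np.asarray(cm)
    fig, ax = plt.subplots(figsize=(5, 4))
    image = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    fig.colorbar(image, ax=ax)
    ax.set_title(title)
    ax.set_xticks(np.arange(len(classes)))
    ax.set_yticks(np.arange(len(classes)))
    ax.set_xticklabels(classes, rotation=45)
    ax.set_yticklabels(classes)
    # Write each count into its cell, in white on dark cells for readability
    threshold = cm.max() / 2.0
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        ax.text(j, i, format(cm[i, j], 'd'), ha='center',
                color='white' if cm[i, j] > threshold else 'black')
    ax.set_ylabel('True label')
    ax.set_xlabel('Predicted label')
    fig.tight_layout()
    return ax

# Toy 2x2 matrix, only to show the call
demo_ax = plot_confusion_matrix(np.array([[3, 1], [2, 4]]),
                                ['no_tumor_tissue', 'has_tumor_tissue'])
```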

In [None]:
#Confusion Matrix
test_labels = test_gen.classes
cm = confusion_matrix(test_labels, predictions.argmax(axis=1))
cm_plot_labels = ['no_tumor_tissue', 'has_tumor_tissue']
plot_confusion_matrix(cm, cm_plot_labels, title='Confusion Matrix')

<br>Results for Model 1 (Original)</br>
<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/cm_11.png" alt="cm_1" width="400px" height="350px">

<br>Results for Model 2 (Linear-Filtered Image)</br>
<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/cm_22.png" alt="cm_2" width="400px" height="350px">

<br>Results for Model 3 (Gaussian/White/Contrast Image)</br>
<img src="https://raw.githubusercontent.com/BoyangW/histopathologic-cancer-detection/master/image/cm_33.png" alt="cm_3" width="400px" height="350px">

<H3>Conclusion</H3>
<br>Among the three preprocessed datasets, the data augmented with the linear filter tends to give the best test accuracy for the same CNN model</br>
<br>These results support our hypothesis that augmentation not only helps reduce the chance of overfitting, but also improves the training/testing error to some extent</br>

<H4>Insights</H4>
<ol>
  <li>Choosing the right augmentation methods/filters depends on the types of images and the questions we have</li>
  <li>There is no universal choice of filtering and training parameters that solves every problem</li>
  <li>Domain knowledge, combined with trial and error, is important for optimizing the results</li>
  <li>A larger dataset could be used to generate more accurate results from this kernel</li>
</ol> 