# **INTRODUCTION**

This notebook was created as part of the Kaggle Histopathologic Cancer Detection competition, which challenged participants to identify metastatic tissue in histopathologic scans of lymph node sections. To view the notebook, please follow: https://www.kaggle.com/monicab13/ai-project2

# **PROBLEM STATEMENT**

The goal of this project is to create a model that can analyze scans of lymph nodes, predict whether or not each image contains metastatic tissue (i.e. cancer), and be trained relatively quickly.

In [None]:
import pandas as pd
import numpy as np
from glob import glob 
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import Dense, Dropout, Flatten, Activation, BatchNormalization
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.optimizers import Adam
from skimage.io import imread
import os
import cv2
from sklearn.utils import shuffle
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import itertools
import shutil
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook,trange
from sklearn.metrics import roc_curve, auc, roc_auc_score
from IPython.display import HTML
%matplotlib inline

SAMPLE_SIZE = 10000
IMAGE_SIZE = 96
IMAGE_CHANNELS = 3

# **DATA**

The dataset contains 220,025 images for training that are labeled with either a 0 or a 1. A 0 indicates negative, i.e. cancer-free, and a 1 indicates positive, i.e. that the image contains cancer. A positive label indicates that the center 32x32 pixel region of the image contains at least one pixel of tumor tissue. Tumor tissue in the outer region of the patch does not influence the label. The dataset contains about 60% negative and 40% positive scans.

The dataset also contains 57,468 test images which will be used to evaluate our submission. The test samples are unlabeled so they will not be used in building and evaluating our model.

# **LOADING DATA AND EXPLORATORY DATA ANALYSIS**

Here we load the filenames of the training images and their corresponding labels. A sample of what the resulting dataframe looks like is shown below.

In [None]:
#referenced https://www.kaggle.com/gomezp/complete-beginner-s-guide-eda-keras-lb-0-93
path = "../input/histopathologic-cancer-detection/"
train_path = path + 'train/'
test_path = path + 'test/'

df = pd.DataFrame({'path': glob(os.path.join(train_path,'*.tif'))}) 
df['id'] = df.path.map(lambda x: x.split('/')[4].split(".")[0]) 
labels = pd.read_csv(path+"train_labels.csv")
df = df.merge(labels, on = "id")
df.head()

The training dataset consists of 130,908 negatives and 89,117 positives, which is approximately 59.5% and 40.5%, respectively, as shown in the pie chart below.

In [None]:
df['label'].value_counts()

In [None]:
labels = 'Negative', 'Positive' 
sizes = df['label'].value_counts()
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%',
        startangle=90)
ax1.axis('equal')  
plt.show()
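As a quick arithmetic check on the class balance, the two shares can be recomputed from the counts quoted above (130,908 negatives and 89,117 positives). This is only a sanity check on the numbers in the text, not part of the original notebook.

```python
# Class counts as quoted in the text above.
n_negative = 130_908
n_positive = 89_117
n_total = n_negative + n_positive

# Share of each class as a percentage of all labeled training images.
pct_negative = 100 * n_negative / n_total
pct_positive = 100 * n_positive / n_total

print(f"total: {n_total}")
print(f"negative: {pct_negative:.1f}%  positive: {pct_positive:.1f}%")
```

The total matches the 220,025 training images mentioned in the data description, and the shares come out to roughly 59.5% and 40.5%.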

Since the dataset is very large and training a model using the whole dataset would take a long time, I decided to trim down the number of training samples. After some trial and error, I settled on a total of 20,000 samples: 10,000 positive and 10,000 negative. Sample sizes greater than 20,000 did not result in improved model accuracy or area under the ROC curve. This sample size should be large enough to represent the full training dataset while still allowing the model to be trained relatively quickly. I made the number of positive and negative samples the same so that accuracy wouldn't be skewed by simply fitting the model towards the more frequent outcome.

In [None]:
df_neg = df[df['label'] == 0].sample(SAMPLE_SIZE, random_state = 113)
df_pos = df[df['label'] == 1].sample(SAMPLE_SIZE, random_state = 113)
data = pd.concat([df_neg, df_pos], axis=0).reset_index(drop=True)
data = shuffle(data)

data['label'].value_counts()

Once I had the dataset I was going to work with, I read in the first 100 images and then randomly selected a few for viewing to get an idea of what the images looked like. The sample images can be seen below.

In [None]:
#referenced https://www.kaggle.com/gomezp/complete-beginner-s-guide-eda-keras-lb-0-93

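def load_images(df, N):
    #allocate an array that holds the first N images
    X_img = np.zeros([N,IMAGE_SIZE,IMAGE_SIZE,IMAGE_CHANNELS],dtype=np.uint8) 
    #convert the first N labels of the dataframe passed in to a numpy array too
    y_img = df['label'].to_numpy()[:N]
    #read images one by one, tqdm notebook displays a progress bar
    #(assumes df has a default 0..len-1 index, see reset_index in the next cell)
    for i, row in tqdm_notebook(df.iterrows(), total = N):
        if i == N:
            break
        #cv2 reads images as BGR, so convert to RGB for correct colors in matplotlib
        X_img[i] = cv2.cvtColor(cv2.imread(row['path']), cv2.COLOR_BGR2RGB)
    return X_img, y_img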

In [None]:
#referenced https://www.kaggle.com/gomezp/complete-beginner-s-guide-eda-keras-lb-0-93

data.reset_index(inplace = True, drop = True)
X_img, y_img = load_images(data, 100)
fig = plt.figure(figsize=(10, 4), dpi=150)
np.random.seed(113) #we can use the seed to get a different set of random images
for plotNr,idx in enumerate(np.random.randint(0, 100,8)):
   ax = fig.add_subplot(2, 8//2, plotNr+1, xticks=[], yticks=[]) #add subplots
   plt.imshow(X_img[idx]) #plot image
   ax.set_title('Label: ' + str(y_img[idx]))

Since I didn't have the labels for the test data, I split the 20,000 training samples into a training set and a validation set. I decided to use 90% of the images to train the model and to reserve 10% of the images to validate it. The samples were split between the training and validation sets randomly and stratified on y. Stratifying on y keeps the number of positives and negatives even in each set, so once again the model would not be skewed in the direction of the most frequent outcome.

In [None]:
y = data['label']

df_train, df_val = train_test_split(data, test_size=0.10, random_state=113, stratify=y)

print(df_train.shape)
print(df_val.shape)

In [None]:
df_train['label'].value_counts()


In [None]:
df_val['label'].value_counts()

The next step was creating directories to store the images. I created a training and a validation directory, each with one subdirectory per label, and then sorted the images into the appropriate directories. This seemed easier than keeping lists of what was in each sample population and repeatedly reading everything from the initial train directory.

In [None]:
#reference: https://www.kaggle.com/fmarazzi/baseline-keras-cnn-roc-fast-10min-0-925-lb

train_path = 'base_dir/train'
valid_path = 'base_dir/valid'
test_path = '../input/histopathologic-cancer-detection/test'
for fold in [train_path, valid_path]:
    for subf in ["0", "1"]:
        os.makedirs(os.path.join(fold, subf), exist_ok=True) #exist_ok lets the cell be re-run safely

In [None]:
data.set_index('id', inplace=True)
data.head()

In [None]:
for image in df_train['id'].values:
    
    fname = image + '.tif'
    label = str(data.loc[image,'label'])
    src = os.path.join('../input/histopathologic-cancer-detection/train', fname)
    dst = os.path.join(train_path, label, fname)
    shutil.copyfile(src, dst) 

for image in df_val['id'].values:
    fname = image + '.tif'
    label = str(data.loc[image,'label'])
    src = os.path.join('../input/histopathologic-cancer-detection/train', fname)
    dst = os.path.join(valid_path, label, fname)
    shutil.copyfile(src, dst)
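After copying, it can be worth confirming that each label folder holds the expected number of files. The helper below is a minimal sketch of such a check; `count_files_per_class` is a hypothetical name that is not part of the original notebook, and the demonstration runs on a throwaway temporary directory with empty placeholder files rather than on the real images.

```python
import os
import tempfile

def count_files_per_class(root):
    """Return {class_name: number_of_files} for each subdirectory of root."""
    counts = {}
    for name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, name)
        if os.path.isdir(class_dir):
            counts[name] = sum(
                os.path.isfile(os.path.join(class_dir, f))
                for f in os.listdir(class_dir)
            )
    return counts

# Demonstration on a temporary directory laid out like base_dir/train:
# one subdirectory per label, filled with empty placeholder files.
with tempfile.TemporaryDirectory() as demo_root:
    for label, n_files in (("0", 3), ("1", 2)):
        os.makedirs(os.path.join(demo_root, label))
        for i in range(n_files):
            open(os.path.join(demo_root, label, f"img_{i}.tif"), "w").close()
    demo_counts = count_files_per_class(demo_root)

print(demo_counts)
```

In the notebook, calling `count_files_per_class(train_path)` would be expected to report 9,000 files in each of `0` and `1`, given the stratified 90% split of 20,000 balanced samples.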

The images were preprocessed using the Keras ImageDataGenerator class. Each image was standardized on its own: every pixel value x was replaced by the number of standard deviations it lies from that image's mean pixel value. This makes pixels that vary significantly from the average stand out, which may be useful in identifying images that contain cancer. Random horizontal and vertical flips were also applied.

In [None]:
#reference: https://www.kaggle.com/fmarazzi/baseline-keras-cnn-roc-fast-10min-0-925-lb

num_train_samples = len(df_train)
num_val_samples = len(df_val)
train_batch_size = 32
val_batch_size = 32

train_steps = int(np.ceil(num_train_samples / train_batch_size)) #step counts should be integers
val_steps = int(np.ceil(num_val_samples / val_batch_size))

datagen = ImageDataGenerator(preprocessing_function=lambda x:(x - x.mean()) / x.std() if x.std() > 0 else x,
                            horizontal_flip=True,
                            vertical_flip=True)

train_data = datagen.flow_from_directory(train_path,
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        batch_size=train_batch_size,
                                        class_mode='binary')

val_data = datagen.flow_from_directory(valid_path,
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        batch_size=val_batch_size,
                                        class_mode='binary')


test_data = datagen.flow_from_directory(valid_path,
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        batch_size=1,
                                        class_mode='binary',
                                        shuffle=False)

# BUILD THE MODEL

A convolutional neural network (CNN) built with the Keras library was selected because CNNs are commonly used for image processing and analysis problems. The Sequential API was used because it allows the model to be built layer by layer. The Adam optimizer was selected based on a mix of experimentation and the information in the following article: https://medium.com/octavian-ai/which-optimizer-and-learning-rate-should-i-use-for-deep-learning-5acb418f9b2. Of the optimizers compared there, Adam seems to learn the fastest, and it is more stable than the others in terms of accuracy across learning rates. I selected a starting learning rate of 0.01 after testing 0.005, 0.008, 0.01, and 0.015 and seeing which gave the lowest loss and highest accuracy while still being relatively time efficient. `binary_crossentropy` was used as the loss function because the problem is binary: an image either contains cancer or it does not. The remaining settings were selected following guidance from: https://towardsdatascience.com/building-a-convolutional-neural-network-cnn-in-keras-329fbbadc5f5.

In [None]:
#reference: https://www.kaggle.com/fmarazzi/baseline-keras-cnn-roc-fast-10min-0-925-lb

kernel_size = (3,3) 
pool_size= (2,2)
first_filters = 32
second_filters = 64
third_filters = 128

dropout_conv = 0.3
dropout_dense = 0.5

model = Sequential()
model.add(Conv2D(first_filters, kernel_size, activation = 'relu', input_shape = (IMAGE_SIZE, IMAGE_SIZE, 3)))
model.add(Conv2D(first_filters, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu")) 
model.add(Dropout(dropout_conv))

model.add(Conv2D(second_filters, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Conv2D(second_filters, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Dropout(dropout_conv))

model.add(Conv2D(third_filters, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Conv2D(third_filters, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Dropout(dropout_conv))

model.add(Flatten())
model.add(Dense(256, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Dropout(dropout_dense))
model.add(Dense(1, activation = "sigmoid"))

# Compile the model
model.compile(Adam(0.01), loss = "binary_crossentropy", metrics=["accuracy"])
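To get a feel for the size of the network defined above, the feature-map dimensions can be traced by hand. The sketch below assumes the Keras defaults for `Conv2D` (`padding='valid'`, stride 1) and follows the model exactly as written, with six 3x3 convolutions and no pooling layers; `valid_conv_output` is a hypothetical helper introduced only for this illustration.

```python
def valid_conv_output(size, kernel=3, stride=1):
    """Spatial size after one convolution with 'valid' padding."""
    return (size - kernel) // stride + 1

size = 96  # IMAGE_SIZE
filters_per_conv = [32, 32, 64, 64, 128, 128]  # the six Conv2D layers, in order

for filters in filters_per_conv:
    size = valid_conv_output(size)
    print(f"Conv2D({filters}) -> {size}x{size}x{filters}")

# Flatten() turns the last feature map into one long vector.
flattened = size * size * filters_per_conv[-1]
# Dense(256, use_bias=False) has one weight per (input, unit) pair and no biases.
dense_weights = flattened * 256

print(f"flattened features: {flattened:,}")
print(f"weights in Dense(256): {dense_weights:,}")
```

Each 3x3 convolution trims two pixels from the width and height, so the 96x96 input ends up as an 84x84x128 feature map. Almost all of the model's weights therefore sit in the first dense layer; a pooling layer after each convolutional block would shrink this considerably.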

# TRAIN THE MODEL

In [None]:
%%time
#reference: https://www.kaggle.com/fmarazzi/baseline-keras-cnn-roc-fast-10min-0-925-lb

reduce = ReduceLROnPlateau(monitor='val_loss', patience=1, verbose=1, factor=0.1)
#model.fit accepts generators directly; fit_generator is deprecated in TensorFlow 2
model.fit(train_data, steps_per_epoch=train_steps, 
          validation_data=val_data,
          validation_steps=val_steps,
          epochs=10, #did not see improvement in accuracy beyond 10 epochs
          callbacks=[reduce])
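The `ReduceLROnPlateau` callback above multiplies the learning rate by `factor` whenever the validation loss fails to improve for `patience` epochs. The following is a simplified pure-Python sketch of that rule, not the Keras implementation itself: it ignores the `cooldown` and `min_lr` options, and the validation losses are made-up numbers chosen only to trigger a few reductions.

```python
def simulate_reduce_on_plateau(val_losses, initial_lr, factor=0.1,
                               patience=1, min_delta=1e-4):
    """Return the learning rate in effect after each epoch's validation loss."""
    lr = initial_lr
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best - min_delta:
            # Validation loss improved: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            # No improvement: count the epoch and reduce once patience runs out.
            wait += 1
            if wait >= patience:
                lr *= factor
                wait = 0
        history.append(lr)
    return history

# Hypothetical validation losses for six epochs.
demo_losses = [0.60, 0.50, 0.52, 0.51, 0.45, 0.46]
lr_history = simulate_reduce_on_plateau(demo_losses, initial_lr=0.01)

for epoch, (loss, lr) in enumerate(zip(demo_losses, lr_history), start=1):
    print(f"epoch {epoch}: val_loss={loss:.2f} -> lr={lr:.0e}")
```

With `patience=1` and `factor=0.1`, a single epoch without improvement is enough to cut the learning rate tenfold, so a starting rate of 0.01 can drop by several orders of magnitude within the 10 training epochs.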

# ANALYSIS

I generated predictions for the validation samples and compared those predictions with the actual outcomes. The ROC curve below shows the true positive rate versus the false positive rate. Generally the goal with classification problems is to maximize the area under the ROC curve (AUC), getting the value as close to 1 as possible. My model results in an AUC of over 0.9, which is sufficiently high for the goal of this project. Thus I will stop tuning the parameters of the model and use this version for submission.

In [None]:
#reference: https://www.kaggle.com/fmarazzi/baseline-keras-cnn-roc-fast-10min-0-925-lb

y_pred = model.predict(test_data, steps=len(df_val), verbose=1) #predict accepts generators; predict_generator is deprecated
fpr, tpr, thresholds_keras = roc_curve(test_data.classes, y_pred)
#store the result under a new name so the imported auc function is not overwritten
auc_value = auc(fpr, tpr)
auc_value
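One way to read the AUC value is as the probability that a randomly chosen positive image receives a higher score than a randomly chosen negative one. The sketch below computes it that way on a tiny made-up example; `pairwise_auc` is a hypothetical helper for illustration, and the labels and scores are not taken from the model.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted scores for four images.
demo_labels = [0, 0, 1, 1]
demo_scores = [0.1, 0.4, 0.35, 0.8]
demo_auc = pairwise_auc(demo_labels, demo_scores)
print(demo_auc)
```

Here three of the four positive/negative pairs are ordered correctly, giving 0.75. This pairwise count is quadratic in the number of samples, so it is only meant to build intuition; `roc_curve` and `auc` from scikit-learn remain the practical choice for the 2,000 validation images.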


In [None]:
plt.figure(1)
plt.plot([0, 1], [0, 1], label='Chance') #labels give the legend something to show
plt.plot(fpr, tpr, label='CNN')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='best')
plt.show()

# PREPARE SUBMISSION

Here we use the actual competition test images and make predictions to submit. You can see a sample of what the submission will look like below.

In [None]:
#referenced https://www.kaggle.com/gomezp/complete-beginner-s-guide-eda-keras-lb-0-93

base_test_dir = '../input/histopathologic-cancer-detection/test'
test_samples = glob(os.path.join(base_test_dir,'*.tif'))
batch_size = 5000
idx = len(test_samples)
batches = [] #collects the id/label pairs of every batch
for i in range(0, idx, batch_size):
    print("Indexes: %i - %i"%(i, i+batch_size))
    submission_df = pd.DataFrame({'path': test_samples[i:i+batch_size]})
    submission_df['id'] = submission_df.path.map(lambda x: x.split('/')[4].split(".")[0])
    submission_df['image'] = submission_df['path'].map(imread)
    test = np.stack(submission_df["image"].values)
    test = (test - test.mean()) / test.std() #standardize using the statistics of the whole batch
    predictions = model.predict(test)
    submission_df['label'] = predictions.ravel() #flatten (n, 1) predictions to one value per image
    batches.append(submission_df[["id", "label"]])
submission = pd.concat(batches, ignore_index=True) #combine all batches into one dataframe
submission.head()
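Note that the training generator standardizes every image with its own mean and standard deviation, while the loop above uses the statistics of a whole batch of test images. If matching the two is desired, a per-image version for an array of images could look like the sketch below. `standardize_per_image` is a hypothetical helper, not part of the original notebook, and the demonstration uses random pixel values in place of real scans.

```python
import numpy as np

def standardize_per_image(batch):
    """Standardize each image in an (N, H, W, C) array with its own mean and std."""
    batch = batch.astype(np.float32)
    mean = batch.mean(axis=(1, 2, 3), keepdims=True)
    std = batch.std(axis=(1, 2, 3), keepdims=True)
    # Avoid dividing by zero; constant images are returned unchanged,
    # mirroring the `if x.std() > 0 else x` branch of the training lambda.
    safe_std = np.where(std > 0, std, 1.0)
    return np.where(std > 0, (batch - mean) / safe_std, batch)

# Random stand-in images with the same shape as the competition patches.
rng = np.random.default_rng(0)
demo_batch = rng.integers(0, 256, size=(4, 96, 96, 3), dtype=np.uint8)
demo_batch[3] = 128  # a constant image, to exercise the std == 0 branch

standardized = standardize_per_image(demo_batch)
print(standardized[:3].mean(axis=(1, 2, 3)))  # per-image means, close to 0
print(standardized[:3].std(axis=(1, 2, 3)))   # per-image stds, close to 1
```

Whether this changes the leaderboard score was not tested here; the point is only that the same transformation can be applied at training and prediction time.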


In [None]:
#reference: https://www.kaggle.com/fmarazzi/baseline-keras-cnn-roc-fast-10min-0-925-lb

shutil.rmtree(train_path) #delete directory path
shutil.rmtree(valid_path) #delete directory path
submission.to_csv("submission.csv", index = False, header = True) #make csv file

In [None]:
pd.read_csv("submission.csv") #check contents of csv file

# CONCLUSION

A CNN model can be used to make predictions about cancer from image scans with reasonable accuracy and a relatively high AUC. Although a decent-sized population of sample data is needed to train a model, I learned that the benefit of increasing the number of samples starts to taper off around 20,000 samples in this case. Additionally, training for more than 10 epochs did not seem to improve model performance significantly. For both of these factors, the benefit of a faster model seemed to outweigh the minimal gains from increasing their values. The model I created can be trained in approximately 10 minutes using GPU acceleration while being relatively reliable. Further advances in neural networks, or simply gaining a more in-depth understanding of them myself, could result in improved accuracy and AUC numbers. This shows the power of what machine learning can and could do in medical settings.