# Histopathologic Cancer Detection
### William Egesdal

## Problem Description

The dataset consists of labeled histopathology images. The task is binary classification: predict whether each image is positive or negative for cancer. Two directories of images are provided (train and test), along with a CSV file of labels for the training set. The training and test sets contain 220,025 and 57,458 images respectively. Each image is 96x96 pixels with three RGB channels, which gives a flattened dimension of 96 x 96 x 3 = 27,648. The files are in .tif format. The problem is well suited to classification with a Convolutional Neural Network (CNN).
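As a quick check of the flattened dimension quoted above, the arithmetic can be written out directly. This is only a sketch; the function name `flattened_size` is a hypothetical helper and not part of the notebook.

```python
# Image geometry stated in the problem description.
HEIGHT, WIDTH, CHANNELS = 96, 96, 3

def flattened_size(height: int, width: int, channels: int) -> int:
    """Number of scalar values in one image after flattening."""
    return height * width * channels

n_features = flattened_size(HEIGHT, WIDTH, CHANNELS)
print(n_features)  # 27648
```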

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt
import tensorflow as tf

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [2]:
train_labels = pd.read_csv("/kaggle/input/histopathologic-cancer-detection/train_labels.csv")

In [3]:
train_labels.hist()

In [4]:
# Count the labels only; counting whole rows would treat every (id, label) pair as unique
train_labels['label'].value_counts()

Approximately 60% of the training images are negative and 40% are positive.

In [5]:
img = plt.imread("/kaggle/input/histopathologic-cancer-detection/train/6ab1cdb88dce07be766df5a7b2c7af8edd982ab4.tif")
plt.imshow(img)

Sanity checking an image suggests that the color channels could contain relevant information.

In [6]:
# Build a table of training image filenames and join it to the labels by image id
train_directory = "/kaggle/input/histopathologic-cancer-detection/train"
train_file_list = os.listdir(train_directory)
train_file_df = pd.DataFrame(train_file_list, columns=['filepath'])
# The image id is the filename without its ".tif" extension
train_file_df['id'] = train_file_df['filepath'].apply(lambda x: os.path.splitext(x)[0])
combined_df = pd.concat([train_file_df.set_index('id'), train_labels.set_index('id')], axis=1)
combined_df.head()

In [7]:
import shutil

working_train_directory = "/kaggle/working/train"

# Create one subdirectory per class label ("0" and "1")
for class_name in ("0", "1"):
    os.makedirs(os.path.join(working_train_directory, class_name), exist_ok=True)

# Copy each image into the subdirectory that matches its label
for _, row in combined_df.iterrows():
    shutil.copyfile(
        os.path.join(train_directory, row['filepath']),
        os.path.join(working_train_directory, str(row['label']), row['filepath'])
    )

To format the training set for the preprocessor, the images are copied into the working directory and separated by class into two folders, "0" and "1", based on the labels in the provided train_labels.csv file. The preprocessor infers each image's label from the name of the folder it sits in, which automates the process of providing labels to the model for fitting.
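The folder layout described above can be illustrated on a toy scale. The sketch below uses a temporary directory and three empty placeholder files; the helper name `copy_into_class_dirs`, the filenames, and the labels are all made up for illustration and are not taken from the dataset.

```python
import os
import shutil
import tempfile

def copy_into_class_dirs(src_dir, dst_dir, labels):
    """Copy each file named in `labels` from src_dir into dst_dir/<label>/."""
    for filename, label in labels.items():
        class_dir = os.path.join(dst_dir, str(label))
        os.makedirs(class_dir, exist_ok=True)
        shutil.copyfile(os.path.join(src_dir, filename),
                        os.path.join(class_dir, filename))

# Hypothetical stand-in data: three empty placeholder files with made-up labels.
example_labels = {"a.tif": 0, "b.tif": 1, "c.tif": 0}

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src")
    dst = os.path.join(tmp, "train")
    os.makedirs(src)
    for name in example_labels:
        open(os.path.join(src, name), "wb").close()
    copy_into_class_dirs(src, dst, example_labels)
    # Record the resulting layout: one folder per class, files inside.
    layout = {d: sorted(os.listdir(os.path.join(dst, d)))
              for d in sorted(os.listdir(dst))}

print(layout)  # {'0': ['a.tif', 'c.tif'], '1': ['b.tif']}
```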

In [8]:
train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    shear_range = 0.2,
    zoom_range = 0.2,
    horizontal_flip = True
)

The preprocessor randomly shears and zooms the images and randomly flips them horizontally. Randomly perturbing training images in this way is called augmentation. It exposes the model to more variation than the raw dataset contains, which helps the model generalize.
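To make one of these augmentations concrete, here is a minimal plain-Python sketch of a random horizontal flip. It is not how the Keras preprocessor is implemented; the function `random_horizontal_flip`, its `p` parameter, and the tiny nested-list "image" are illustrative assumptions.

```python
import random

def random_horizontal_flip(image, rng, p=0.5):
    """Return the image mirrored left-to-right with probability p, else unchanged."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return image

# A tiny 2x3 single-channel "image" as nested lists (illustrative stand-in).
image = [[1, 2, 3],
         [4, 5, 6]]

flipped = random_horizontal_flip(image, random.Random(0), p=1.0)    # always flips
unchanged = random_horizontal_flip(image, random.Random(0), p=0.0)  # never flips
print(flipped)  # [[3, 2, 1], [6, 5, 4]]
```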

In [30]:
train_generator = train_datagen.flow_from_directory(
    working_train_directory,
    target_size=(96, 96),
    batch_size = 128,
    class_mode = 'binary'
)

In [31]:
train_generator.class_indices

In [32]:
# get dimensions of image
img_dim = (img.size / 3) ** 0.5
print(img_dim)
input_layer_size = img.size
print(input_layer_size)

## Model Architecture

In [33]:
model = tf.keras.Sequential()

model.add(tf.keras.layers.Conv2D(32, kernel_size=(3,3), activation="relu",input_shape=(96,96,3)))
model.add(tf.keras.layers.Conv2D(64, (3, 3), activation="relu"))
model.add(tf.keras.layers.MaxPooling2D(2, 2))
model.add(tf.keras.layers.Dropout(0.25))

model.add(tf.keras.layers.Conv2D(64, (3, 3), activation="relu"))
model.add(tf.keras.layers.MaxPooling2D(2, 2))
model.add(tf.keras.layers.Dropout(0.25))

model.add(tf.keras.layers.Conv2D(128, (3, 3), activation="relu"))
model.add(tf.keras.layers.MaxPooling2D(2, 2))
model.add(tf.keras.layers.Dropout(0.4))

model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dropout(0.3))

model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.0003,
    decay_steps=10000,
    decay_rate=0.9)

opt = tf.keras.optimizers.Adam(lr_schedule)

model.compile(optimizer=opt,loss='binary_crossentropy',metrics=[tf.keras.metrics.BinaryAccuracy(),
                       tf.keras.metrics.FalseNegatives()])

model.summary()

The CNN stacks convolutional blocks, each made up of Conv2D layers with ReLU activation followed by MaxPooling and Dropout layers; the first block has two Conv2D layers and the later blocks have one. After the last block, the feature maps are flattened and passed to a fully connected (Dense) layer with ReLU activation, followed by the output layer. The output layer has a single sigmoid unit because the problem has two classes (0 and 1), so one output is enough to represent the probability of the positive class. I studied multiple examples of CNNs for binary image classification, and this approach was well supported by those examples.

As a rule of thumb, each successive block roughly halves the spatial dimensions of the feature maps (here through 2x2 max pooling). The dropout layers randomly zero a fraction of activations during training, which regularizes the network and reduces overfitting. Also worth noting is the decaying learning rate on the Adam optimizer: the learning rate decays exponentially, shrinking by a factor of 0.9 over every 10,000 optimizer steps. The purpose of the decay is to learn quickly at the outset of training and then take smaller steps to refine the weights toward the end of the training period.
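The way the feature maps shrink through the network can be traced by hand. The sketch below assumes the Keras defaults used in the model cell (unpadded 3x3 convolutions with stride 1, and 2x2 pooling with stride 2); the helper names `conv_valid` and `pool` are hypothetical.

```python
def conv_valid(size, kernel=3):
    """Output size of an unpadded, stride-1 convolution along one axis."""
    return size - kernel + 1

def pool(size, window=2):
    """Output size of non-overlapping pooling along one axis (floor division)."""
    return size // window

size = 96                  # input height and width
size = conv_valid(size)    # Conv2D(32)  -> 94
size = conv_valid(size)    # Conv2D(64)  -> 92
size = pool(size)          # MaxPooling  -> 46
size = conv_valid(size)    # Conv2D(64)  -> 44
size = pool(size)          # MaxPooling  -> 22
size = conv_valid(size)    # Conv2D(128) -> 20
size = pool(size)          # MaxPooling  -> 10

flattened = size * size * 128        # Flatten output length
dense_params = flattened * 64 + 64   # weights plus biases of Dense(64)
print(size, flattened, dense_params)  # 10 12800 819264
```

These figures should match the output shapes and the Dense(64) parameter count reported by model.summary().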

In [34]:
# List the devices TensorFlow can see (a GPU should appear here if one is attached)
print(tf.config.list_physical_devices())

In [35]:
history = model.fit(train_generator, steps_per_epoch=100, epochs=10)

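The hyperparameters steps_per_epoch and epochs together determine how much data the model sees and how long training takes. If they are too low, the model trains on a limited portion of the dataset: with a batch size of 128, 100 steps per epoch over 10 epochs draws 128,000 images, which is less than one full pass over the 220,025 training images. GPU memory use, by contrast, is governed by the batch size and the size of the model rather than by these two values, so out-of-memory errors are better addressed by lowering the batch size. The step counts also interact with the learning rate schedule, because the decay is defined in optimizer steps.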
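The interaction between the step counts and the decay schedule can be checked with a little arithmetic. The sketch below reuses the values from the notebook's cells and assumes the continuous (non-staircase) form of exponential decay, `initial_learning_rate * decay_rate ** (step / decay_steps)`; the function name `exponential_decay` is a hypothetical helper.

```python
# Values taken from the notebook's cells.
batch_size = 128
steps_per_epoch = 100
epochs = 10
n_train_images = 220025

initial_learning_rate = 0.0003
decay_steps = 10000
decay_rate = 0.9

def exponential_decay(step):
    """Learning rate after `step` optimizer steps (continuous, non-staircase form)."""
    return initial_learning_rate * decay_rate ** (step / decay_steps)

total_steps = steps_per_epoch * epochs      # optimizer steps over the whole run
images_drawn = total_steps * batch_size     # images fed to the model in total
fraction_of_one_pass = images_drawn / n_train_images
final_lr = exponential_decay(total_steps)

print(total_steps, images_drawn)
print(round(fraction_of_one_pass, 3))
print(final_lr)
```

Under these assumptions the run covers only a tenth of one decay period, so the final learning rate stays within about 1% of its initial value.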

Learning rate is another hugely important hyperparameter, which I tuned mostly through trial and error. With too high a learning rate, the model converges to roughly 60% accuracy and does not improve beyond it. Since roughly 60% of the training data is labeled 0, this is consistent with a classifier that labels every sample as negative. With too low a learning rate, the model struggles to fit the data. Batch size mattered as well: when it was too large, the false negative count spiralled out of control.
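The 60% plateau described above matches what a majority-class baseline would score. The sketch below uses made-up labels with an exact 60/40 split rather than the real training labels, and `majority_baseline_accuracy` is a hypothetical helper.

```python
def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common label."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return max(counts.values()) / len(labels)

# Hypothetical labels with the approximate 60/40 split described above.
labels = [0] * 60 + [1] * 40

baseline = majority_baseline_accuracy(labels)
# Predicting 0 everywhere misses every positive sample.
false_negatives = sum(1 for y in labels if y == 1)
print(baseline, false_negatives)  # 0.6 40
```

A model stuck at this accuracy with a false negative count equal to the number of positives is likely predicting the negative class for everything.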

In [None]:
# model.save("/kaggle/working/113021.hd5")

## Results

In [36]:
# root, dirs, files
test_directory = "/kaggle/input/histopathologic-cancer-detection/test"

In [37]:
import shutil

working_test_directory = "/kaggle/working/test"

# flow_from_directory expects images inside a subdirectory, so use a single "predict" folder
os.makedirs(os.path.join(working_test_directory, "predict"), exist_ok=True)

for root, dirs, filenames in os.walk(test_directory):
    for filename in filenames:
        # Join with root so files are found even if os.walk descends into a subdirectory
        shutil.copyfile(os.path.join(root, filename), os.path.join(working_test_directory, 'predict', filename))

The test files are copied into the working directory in much the same way as the training set. The difference is that there are no labels, so all test images go into a single subdirectory ("predict") inside the root test directory, because the preprocessor expects images to sit one level below the directory it is given.

In [116]:
test_datagen = tf.keras.preprocessing.image.ImageDataGenerator()

test_generator = test_datagen.flow_from_directory(
    directory=working_test_directory,
    target_size=(96, 96),
    color_mode="rgb",
    batch_size=1,
    class_mode=None,
    shuffle=False
)

# Predict the probability of the positive class for each test image
predict = model.predict(test_generator)

In [117]:
predict[:10]

In [112]:
# These are the sigmoid outputs (probabilities between 0 and 1), kept unthresholded
y_classes = predict

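Setting shuffle to False keeps the predictions in the same order as test_generator.filenames, which is what allows each prediction to be matched to its image id when constructing the submission.csv file. A batch size of 1 divides the test set evenly, so no image is left over in a partial batch.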
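The pairing of filenames with predictions can be sketched without the model. The example below uses two made-up filenames and probabilities in place of test_generator.filenames and the model output, and the helper `image_id` is hypothetical. It writes to an in-memory buffer rather than to submission.csv.

```python
import csv
import io
import os

def image_id(relative_path):
    """Strip the directory and extension, e.g. 'predict/abc.tif' -> 'abc'."""
    return os.path.splitext(os.path.basename(relative_path))[0]

# Hypothetical stand-ins for test_generator.filenames and the model's outputs.
filenames = ["predict/aaa.tif", "predict/bbb.tif"]
probabilities = [0.91, 0.07]

# Pair by position; this is only valid because the file order was not shuffled.
rows = [(image_id(f), p) for f, p in zip(filenames, probabilities)]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "label"])
writer.writerows(rows)
print(buffer.getvalue())
```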

In [115]:
y_ids = test_generator.filenames
y_ids[:10]

In [120]:
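# Drop the "predict/" directory prefix and the ".tif" extension to recover the image id
formatted_y_ids = [os.path.splitext(os.path.basename(fid))[0] for fid in y_ids]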

In [121]:
formatted_y_ids[:10]

In [98]:
len(y_classes)

In [99]:
len(y_ids)

In [100]:
history.history.keys()

In [101]:
plt.plot(history.history['loss'], label='Binary Classification Loss')
plt.plot(history.history['binary_accuracy'], label='Binary Accuracy')
plt.title('Training History')
plt.ylabel('Value')
plt.xlabel('Epoch')
plt.legend(loc="upper left")
plt.show()

In [102]:
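# The metric key carries a run-dependent suffix (e.g. 'false_negatives_4'), so look it up by prefix
fn_key = next(k for k in history.history if k.startswith('false_negatives'))
plt.plot(history.history[fn_key], label='False Negatives')
plt.title('Training History (False Negatives)')
plt.ylabel('False Negatives')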
plt.xlabel('Epoch')
plt.legend(loc="upper left")
plt.show()

In [123]:
predict_df = pd.DataFrame({'id':formatted_y_ids, 'label': y_classes.flatten()})

In [124]:
# sanity check the prediction dataframe before export
predict_df.head()

In [125]:
predict_df.to_csv('/kaggle/working/submission.csv', index=False)

The final accuracy of the model on the training set was approximately 80%.
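For reference, the two metrics tracked during training can be sketched in plain Python. This assumes the Keras default threshold of 0.5 for both BinaryAccuracy and FalseNegatives; the labels and probabilities below are made up for illustration and are not results from this model.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum(1 for y, p in zip(y_true, y_prob) if (p > threshold) == bool(y))
    return hits / len(y_true)

def false_negatives(y_true, y_prob, threshold=0.5):
    """Count of positive samples that were predicted negative."""
    return sum(1 for y, p in zip(y_true, y_prob) if y == 1 and not p > threshold)

# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 0, 1, 0, 1]
y_prob = [0.9, 0.2, 0.4, 0.6, 0.8]

print(binary_accuracy(y_true, y_prob))  # 0.6
print(false_negatives(y_true, y_prob))  # 1
```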