<a href="https://www.kaggle.com/code/dazhengzhu/histopathologic-cancer-detection?scriptVersionId=193995947" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Histopathologic Cancer Detection

## Introduction


Early and accurate cancer detection is critical for improving patient outcomes. Traditional methods relying on human expertise are time-consuming and prone to errors. The increasing volume of histopathological images further exacerbates this challenge. 

This project aims to develop a deep learning-based model capable of accurately classifying histopathological images as cancerous or non-cancerous. By leveraging the Histopathologic Cancer Detection dataset, we will explore the potential of deep learning techniques to assist doctors in their diagnostic process. Successful implementation of this model could significantly enhance cancer diagnosis efficiency and contribute to advancements in medical image analysis.

### Setup

Import Tensorflow and other necessary libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline

import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np
import PIL
import tensorflow as tf

from tensorflow import keras

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os

print(tf.__version__)
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### Data

PatchCamelyon (PCam) packs the clinically-relevant task of metastasis detection into a straight-forward binary image classification task, akin to CIFAR-10 and MNIST. 

The data for this project is a slightly modified version of the PCam benchmark dataset (the original PCam dataset contains duplicate images due to its probabilistic sampling, however, the version presented on Kaggle does not contain duplicates).

## Exploratory Data Analysis

In [None]:
import pathlib

train_path = '/kaggle/input/histopathologic-cancer-detection/train'
test_path = '/kaggle/input/histopathologic-cancer-detection/test'
train_dir = pathlib.Path(train_path).with_suffix('')
test_dir = pathlib.Path(test_path).with_suffix('')

train_imgs = list(train_dir.glob('*.tif'))
test_imgs = list(test_dir.glob('*.tif'))

print(len(train_imgs))
print(len(test_imgs))

The dataset uses total 220025 training images, and 57458 images for testing, its too large, we'll select a subset of images for training and validation later.

#### Preview Images

Now take a look at a few pictures to get a better sense of what the dataset look like.

In [None]:
# Get training and testing files
train_files = os.listdir(train_dir)
test_files = os.listdir(test_dir)

print('Training image files: ')
print(train_files[:10])
print('Testing image files: ')
print(test_files[:10])

In [None]:
# Create a 4x4 plot
nrows = 4
ncols = 4

# Index for iterating over images
pic_index = 0

# Set up matplotlib fig, and size it to fit 4x4 pics
fig = plt.gcf()
fig.set_size_inches(ncols * 4, nrows * 4)

pic_index += 8
next_train_pix = [os.path.join(train_dir, file)
                for file in train_files[pic_index-8:pic_index]]
next_test_pix = [os.path.join(test_dir, file)
                for file in test_files[pic_index-8:pic_index]]

for i, img_path in enumerate(next_train_pix+next_test_pix):
  # Set up subplot; subplot indices start at 1
  sp = plt.subplot(nrows, ncols, i + 1)
  sp.axis('Off') # Don't show axes (or gridlines)

  img = mpimg.imread(img_path)
  plt.imshow(img)

plt.show()

### Data Preprocessing

There are total 220025 images in the train folder, it's a large dataset, we'll select a subset of them for testing and validation.

The image file type is `.tif`, which is not an supported file type with the `keras.utils.image_dataset_from_directory`, fortunately, the `ImageDataGenerator` could help, to use `ImageDataGenerator`, we need to use the label as the name create a new folder named with the label, and then copy the image files to its corresponding label folder.

### Select Train & Validation Samples

To maintain data balance, we'll randomly sample 20000 images from the dataset. Of these, 10000 will have a label of 1, while the remaining 10000 will have a label of 0.

In [None]:
# load labels dataframe
df = pd.read_csv('../input/histopathologic-cancer-detection/train_labels.csv')

# Filter samples with the label
pos_df = df[df['label'] == 1]
neg_df = df[df['label'] == 0]

train_pos_df = pos_df.sample(n=10000, random_state=42)
train_neg_df = neg_df.sample(n=10000, random_state=42)
train_df = pd.concat([train_pos_df, train_neg_df])
# Shuffle the dataframe
train_df = train_df.sample(frac=1, random_state=42).reset_index(drop=True)

# Add file type
train_df.id = train_df.id + '.tif'
train_df.label = train_df.label.astype(str)

train_df.head()

#### Setup Data Generators

Let's setup training and validation data generators. The generators will yield batches of 32 images of size 96x96 and their labels.

We'll use the `keras.preprocessing.image.ImageDataGenerator` class to create generators and using the rescale parameter to normalizing the pixel values to be in the [0,1] range (from original [0, 255] range).

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Define batch size and image size
BATCH_SIZE = 32
IMG_SIZE = (96, 96)

TRAINING_SUBSET = "training"
VALIDATION_SUBSET = "validation"

train_datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)
val_datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)

# Create Image Generator in batches of 32
def create_generator(datagen: ImageDataGenerator, subset: str):
    return datagen.flow_from_dataframe(
        directory = train_dir,
        dataframe = train_df,
        x_col = 'id',
        y_col = 'label',
        subset=subset,
        seed=123,
        target_size=IMG_SIZE,
        batch_size=BATCH_SIZE,
        class_names=['0', '1'],
        class_mode='binary')

train_generator = create_generator(train_datagen, TRAINING_SUBSET)
val_generator = create_generator(val_datagen, VALIDATION_SUBSET)

## Model

### Build a Baseline Model

The images that will go into our convnet are 96x96 color images.

In [None]:
from tensorflow.keras import layers
from tensorflow.keras import Model

# Our input feature map is 96x96x3: 96x96 stands for image height x width pixels, 
# and 3 for the three color channels: R, G, and B
img_input = layers.Input(shape=img_shape)

# First convolution extracts 16 filters that are 3x3
# Convolution is followed by max-pooling layer with a 2x2 window
x = layers.Conv2D(16, 3, activation='relu')(img_input)
x = layers.MaxPooling2D(2)(x)

# Second convolution extracts 32 filters that are 3x3
# Convolution is followed by max-pooling layer with a 2x2 window
x = layers.Conv2D(32, 3, activation='relu')(x)
x = layers.MaxPooling2D(2)(x)

# Third convolution extracts 64 filters that are 3x3
# Convolution is followed by max-pooling layer with a 2x2 window
x = layers.Convolution2D(64, 3, activation='relu')(x)
x = layers.MaxPooling2D(2)(x)

On top of it are two fully-connected layers. Because we are facing binary classification problem, we will end our network with a sigmoid activation function, so that the output of our network will be a single scalar 0 or 1, indicates the probability the image label is 0 or 1.

In [None]:
# Flatten feature map to a 1-dim tensor
x = layers.Flatten()(x)

# Create a fully connected layer with ReLU activation and 512 hidden units
before_output = layers.Dense(512, activation='relu')(x)

# Create output layer with a single node and sigmoid activation
output = layers.Dense(1, activation='sigmoid')(before_output)

# Create model:
# input = input feature map
# output = input feature map + stacked convolution/maxpooling layers + fully
# connected layer + sigmoid output layer
model = Model(img_input, output)

Let's summarize the model:

In [None]:
model.summary()

The Output shape column shows how the size of our feature map evolves in each successive layer. We can observe that the convolution layers reduce the size of the feature maps by a bit due to padding, and each pooling layer halves the feature map.

Next, let's configure the sepecifications for model training. Since we're facing a binary classification problem, we'll select the `BinaryCrossentropy` loss and use the `RMSprop` optimizer, also monitoring the classification accuracy during the training.

In [None]:
def compile_model():
    model.compile(loss=keras.losses.BinaryCrossentropy(),
             optimizer=keras.optimizers.RMSprop(learning_rate=0.001),
             metrics=['accuracy'])
compile_model()

### Training

Let's train our model for 20 epochs, using 50 batches of training data and 12 batches of validation data per epoch. The `history` object used to capture information during the training.

In [None]:
epochs=20
def fit_model():
    return model.fit(
        train_generator,
        steps_per_epoch=80, # total training samples / batch size
        epochs=epochs,
        validation_data=validation_generator,
        validation_steps=24, # total validation samples / batch size
        verbose=2
    )
    
history = fit_model()

### Visualizing Intermediate Representations

To get a feel for what kind of features our convnet has learned, one fun thing to do is to visualize how an input gets transformed as it goes through the convnet.

Let's pick a random training or testing image from the training set, and then generate a figure where each row is the output of a layer, and each image in the row is a specific filter in that output feature map. Rerun this cell to generate intermediate representations for a variety of training images.

In [None]:
# Let's define a new Model that will take an image as input, and will output
# intermediate representations for all layers in the previous model after
# the first.
successive_outputs = [layer.output for layer in model.layers[1:]]
visualization_model = Model(img_input, successive_outputs)

# Let's prepare a random input image from the training set.
img_files = [os.path.join(train_dir, f) for f in train_imgs]
img_path = random.choice(img_files)

img = keras.utils.load_img(img_path, target_size=img_size)  # this is a PIL image
x = keras.utils.img_to_array(img)  # Numpy array with shape (96, 96, 3)
x = x.reshape((1,) + x.shape)  # Numpy array with shape (1, 96, 96, 3)

# Rescale by 1/255
x /= 255

# Let's run our image through our network, thus obtaining all
# intermediate representations for this image.
successive_feature_maps = visualization_model.predict(x)

# These are the names of the layers, so can have them as part of our plot
layer_names = [layer.name for layer in model.layers[1:]]

# Now let's display our representations
for layer_name, feature_map in zip(layer_names, successive_feature_maps):
  if len(feature_map.shape) == 4:
    # Just do this for the conv / maxpool layers, not the fully-connected layers
    n_features = feature_map.shape[-1]  # number of features in feature map
    # The feature map has shape (1, size, size, n_features)
    size = feature_map.shape[1]
    # We will tile our images in this matrix
    display_grid = np.zeros((size, size * n_features))
    for i in range(n_features):
      # Postprocess the feature to make it visually palatable
      x = feature_map[0, :, :, i]
      x -= x.mean()
      x /= x.std()
      x *= 64
      x += 128
      x = np.clip(x, 0, 255).astype('uint8')
      # We'll tile each filter into this big horizontal grid
      display_grid[:, i * size : (i + 1) * size] = x
    # Display the grid
    scale = 20. / n_features
    plt.figure(figsize=(scale * n_features, scale))
    plt.title(layer_name)
    plt.grid(False)
    plt.imshow(display_grid, aspect='auto', cmap='viridis')

## Results And Analysis

### Evaluate Accuracy and Loss for the Baseline Model

Let's plot the training and validation accuracy and loss during training:

In [None]:
# Retrieve a list of accuracy results on training and validation data
# sets for each training epoch

def preview_accuracy():
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']

    loss = history.history['loss']
    val_loss = history.history['val_loss']

    plt.figure(figsize=(8, 8))
    plt.subplot(2, 1, 1)
    plt.plot(acc, label='Training Accuracy')
    plt.plot(val_acc, label='Validation Accuracy')
    plt.legend(loc='lower right')
    plt.ylabel('Accuracy')
    plt.ylim([min(plt.ylim()),1])
    plt.title('Training and Validation Accuracy')

    plt.subplot(2, 1, 2)
    plt.plot(loss, label='Training Loss')
    plt.plot(val_loss, label='Validation Loss')
    plt.legend(loc='upper right')
    plt.ylabel('Cross Entropy')
    plt.ylim([0,1.0])
    plt.title('Training and Validation Loss')
    plt.xlabel('epoch')

preview_accuracy()

Below are the observation from the figure:

- Performance: Both training and validation accuracies are generally above 70%, which indicates decent performance, however, there's room for improvement.
- Convergence: The accuracies seems to stabilize somewhat towards the end, but there's still significant fluctuation.
- Overfitting: There's a noticeable gap between training and validation accuracy, with training accuracy often higher. This suggests some degree of overfitting.
- Consistency: The validation accuracy shows more variability than the training accuracy, which is common but ideally should be reduced.

### Addressing overfitting in Baseline Model

We could employ several strategies to address overfitting:
    
**Data Augmentation**

In order to make the most of our few training examples, we will "augment" them via a number of random transformations, so that at training time, **our model will never see the exact same picture twice**. 

Let's create a new training data generator and apply several random transformations:

In [None]:
# Create training data generator with some transformations
train_datagen = ImageDataGenerator(
    rescale=1./255, 
    validation_split=0.2,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest'
)

train_generator = create_generator(train_datagen, TRAINING_SUBSET)

**Dropout**

Another popular strategy is adding **dropout** layer to reduce overfitting:

In [None]:
# Add a dropout rate of 0.5
x = layers.Dropout(0.5)(before_output)

# Create output layer with a single node and sigmoid activation
output = layers.Dense(1, activation='sigmoid')(x)
# Create Model
model = Model(img_input, output)

After employ **data augmentation** and **dropout**, let's train our model and preview the results again:

In [None]:
# Compile model -> fit model -> preview model training accuracy
compile_model()
history = fit_model()
preview_accuracy()

We can see below improvements:
- Both training and validation accuracies are close to 78%.
- Both training and validation accuracies shows consistency than before.
- The gap in accuracy between training and validation becomes smaller, this also indicates overfitting reduced.

Ok, this is the baseline model we built empirically, its accuracy looks good but not great, let’s get a taste of what the state-of-the-art (SOTA) Convolutional Neural Network (ConvNet) are like.

### Try EfficientNetV2 Model

EfficientNet is a family of convolutional neural network (CNN) architectures designed to achieve state-of-the-art accuracy while being computationally efficient. Unlike previous methods that arbitrarily scaled network depth, width, or resolution, EfficientNet introduces a compound scaling method that uniformly scales all three dimensions.

EfficientNetV2 is an improved version of the EfficientNet architecture, designed to be even faster to train and more parameter-efficient while maintaining or even surpassing the accuracy of its predecessor.



#### Create EfficientNetV2S model

Let's instantiate a EfficientNetV2S model pre-loaded with weights trained on `ImageNet` first. By specifying the `include_top=False` argument, we load a network that doesn't include the classification layers at the top, which is ideal for feature extraction.

In [None]:
from tensorflow.keras.applications import EfficientNetV2S
from tensorflow.keras.applications.efficientnet_v2 import preprocess_input

base_model = EfficientNetV2S(
    weights='imagenet', 
    include_top=False,
    input_shape=img_shape)

Then we create data augmentation layers to reduce overfitting, and use efficientnet_v2 `preprocess_input` method to rescale the images pixel values from [0, 255] to [-1, 1].

In [None]:
data_augmentation = tf.keras.Sequential([
  layers.RandomFlip('horizontal'),
  layers.RandomRotation(0.2),
])

x = preprocess_input(img_input)
x = data_augmentation(x)

Build `base_model` and feature extractor layers using the Keras Functional API.

In [None]:
# Use training=False as our model contains a BatchNormalization layer.
x = base_model(x, training=False)
# Using Pooling layer to convert the features to a single 1280-element vector per image
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.2)(x)
# Apply Dense layer to convert these features into a single prediction per image.
predictions = layers.Dense(1, activation='sigmoid')(x)

final_model = Model(inputs=img_input, outputs=predictions)

Let's take a look at the efficientnet model architecture:

In [None]:
final_model.summary()

In [None]:
learning_rate = 0.0001
final_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.5, name='accuracy')])

history = final_model.fit(train_generator,
                    steps_per_epoch=80, # total training samples / batch size
                    epochs=20,
                    validation_data=validation_generator,
                    validation_steps=24, # total validation samples / batch size)
                    verbose=2
)

preview_accuracy()

Much better! The EfficientNetV2S model give significant higher training & validation accuracies and smaller loss values. We will use the EfficientNetV2S to make prediction.

### Submission

Let's use the `final_model` to make prediction:

In [None]:
test_path = '/kaggle/input/histopathologic-cancer-detection/test'

df_test = pd.read_csv('/kaggle/input/histopathologic-cancer-detection/sample_submission.csv')
df_test['filename'] = df_test.id + '.tif'

test_datagen = ImageDataGenerator(rescale=1.0/255)
test_generator = test_datagen.flow_from_dataframe(
    dataframe = df_test,
    directory = test_path,
    x_col = 'filename',
    batch_size = batch_size,
    shuffle = False,
    class_mode = None,
    target_size = img_size,
)

In [None]:
predictions = final_model.predict(test_generator, verbose=1)

Convert the predictions to the labels:

In [None]:
label_pred = np.where(predictions > 0.5, 1, 0)
label_pred

Apply predictions to submission dataframe:

In [None]:
submission = pd.read_csv('../input/histopathologic-cancer-detection/sample_submission.csv')
submission.label = label_pred
submission.head()

Write dataframe to submission csv file:

In [None]:
submission.to_csv('submission.csv', header = True, index = False)

## Conclusion

This project aimed to explore the CNN models for Histopathologic Cancer Detection. The EfficientNetV2S model achieved 88% average accuracy on the validation dataset, surpassing the baseline model by 18%. 

For the Baseline model, we experienced reduce overfitting techniques such as data augmentation and add dropout layer, these techniques show good results in reducing overfitting. For the EfficientNetV2S model, we used transfer learning from a pre-trained network, add a new classifier on top of the pretrained model and retrain the model, the result shows significant higher accuracy, while the model demonstrated promising results, further improvements could be achieved by exploring fine-tuning.

## Reference

- [Keras Functional API](https://www.tensorflow.org/guide/keras/functional)
- [Tensorflow: Classification](https://www.tensorflow.org/tutorials/images/classification)
- [Tensorflow: Data Augmentation](https://www.tensorflow.org/tutorials/images/data_augmentation)
- [Tensorflow: Transfer learning and fine-tuning](https://www.tensorflow.org/tutorials/images/transfer_learning)
- [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) (ICML 2019)
- [Meta Pseudo Labels](https://paperswithcode.com/paper/meta-pseudo-labels) (CVPR 2021)