<a href="https://colab.research.google.com/github/wdeback/dl-assignment/blob/master/DL_assignment_CoV_keras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cancer classification

We would like to train a convolutional neural network to detect cancer in histopathological images. A dataset of microscopy images from lymph node sections is available, together with labels indicating the presence of metastatic tissue in each image. 

In this notebook, we have already made a first working version. However, there is a lot of room for improvement. **Can you help and do a code review for us**?

![PCAM](https://raw.githubusercontent.com/basveeling/pcam/master/pcam.jpg)

### Data

##### Images
Our dataset consists of **~4000 hematoxylin and eosin (H&E) stained images** of lymph node sections, extracted from a set of [whole slide images](https://www.mbfbioscience.com/whole-slide-imaging). The original whole slide images were acquired and digitized at 2 different pathology labs using a 40x objective (with a resolution of 0.243 microns per pixel) but are undersampled at 10x to increase the field of view. The images are in RGB format, measure 96x96 px and are of type ``uint8``. All images are stored in TIFF format and have a unique identifier in their filename. The image data is stored in the folder ``cov_dataset/images``.
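
As a rough sanity check on the physical scale of a patch, the numbers above can be combined as follows. This is only a sketch: it assumes that going from 40x to 10x corresponds to a 4x coarser pixel size, which the description implies but does not state explicitly.

```python
# Values taken from the description above.
BASE_RESOLUTION_UM = 0.243    # microns per pixel at 40x
PATCH_SIZE_PX = 96            # patch height and width in pixels

# Assumption: undersampling from 40x to 10x makes each pixel 4x larger.
DOWNSAMPLE_FACTOR = 40 / 10

effective_resolution_um = BASE_RESOLUTION_UM * DOWNSAMPLE_FACTOR
patch_width_um = PATCH_SIZE_PX * effective_resolution_um

print(f"Effective resolution: {effective_resolution_um:.3f} um/px")
print(f"Patch width         : {patch_width_um:.1f} um")
```

Under this assumption, each 96x96 px patch covers a field of view of a little under 100 microns per side.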

##### Labels
The **image labels indicate the presence of metastatic tissue** in each image and can be found in ``cov_dataset/labels.csv``, associated with the image IDs. A positive label, i.e. ``1``, indicates that the center 32x32 px region of an image contains at least one pixel of tumor tissue. Conversely, the label value ``0`` indicates its absence in the center region. Tissue in the outer region of the patch does not influence the label.
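
To make the labeling rule concrete, the sketch below extracts the center 32x32 px region that determines the label. The helper name `center_region` is hypothetical and not part of this notebook; the patch used here is synthetic, not taken from the dataset.

```python
import numpy as np

def center_region(image, size=32):
    '''Return the central size x size crop of an (H, W, C) image (hypothetical helper).'''
    h, w = image.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return image[top:top + size, left:left + size]

# synthetic 96x96 RGB patch standing in for a real image
patch = np.zeros((96, 96, 3), dtype=np.uint8)
center = center_region(patch)
print(center.shape)
```

For a 96x96 px image this selects rows and columns 32 to 63, so only tissue inside that window is relevant for the label.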

### Task

Generally, our task is to develop an algorithm for the automated detection of metastases in hematoxylin and eosin (H&E) stained images of lymph node sections. More specifically, here, we aim to develop and **train a convolutional neural network (CNN) to classify images** in the provided data set into "benign" or "cancer" classes.

Below, we have already implemented a fully functional working version of a system to train a CNN for this task. However, as you will see, this implementation may be a little too simplistic, naive or even wrong here and there.

Now, it is your task to **help us improve this deep learning system** by doing a code review. Think of improvements that can be made to any part of the deep learning process, from data exploration and model construction to network training, evaluation and inference. (The first part including downloading data, imports and utility functions should be okay and can be skipped.)

After executing and reviewing the notebook, you should try and **answer the following questions**:

1. **What do you notice when reviewing this code?** Can you list 3 or 4 erroneous, missing or suboptimal steps in each of the following phases: (1) data exploration/handling, (2) model construction/training and (3) evaluation/inference?

2. **What are your suggestions to improve it?** What would you do to fix these issues and/or improve the system?

3. **How would you implement it?** What deep learning framework, software tools, methods and techniques would you use?


Feel free to play around with this notebook and use all of colab's options and GPU processing to explore data, fix bugs, train models and run experiments.
 
However, *it is **not expected** that you actually implement your solution*. The idea is to take this notebook and your suggestions as a starting point for discussion to demonstrate your experience with methods and techniques in machine learning, deep learning, computer vision and/or medical imaging.

If something is unclear or doesn't work, please do not hesitate to contact us. 

**Good luck and have fun!**

----

- Use the 'Copy to drive' button to copy this notebook to your Google Drive.

- If you're new to colab, check [this video](https://www.youtube.com/watch?v=inN8seMm7UI&list=PLQY2H8rRoyvwLbzbnKJ59NkZvQAW9wLbx&index=3) and [this tutorial](https://colab.research.google.com/github/tensorflow/examples/blob/master/courses/udacity_intro_to_tensorflow_for_deep_learning/l01c01_introduction_to_colab_and_python.ipynb).

----


## Download dataset

- Note that the dataset is stored in a non-persistent way and will be deleted after ending your colab session, which means you will need to download it again.

In [None]:
dataset = "cov_dataset"

In [None]:
import os 
os.chdir('/content/')

# download dataset from GitHub (-O sets the output file; without it wget treats the path as a second URL)
!if [[ ! -f /content/{dataset}.tar.gz ]]; then echo "=== Downloading {dataset} ==="; wget -O /content/{dataset}.tar.gz https://github.com/wdeback/dl-assignment/raw/master/{dataset}.tar.gz; fi
# uncompress dataset
!if [[ ! -d /content/{dataset} ]]; then echo "=== Extracting {dataset} ==="; tar -xzf /content/{dataset}.tar.gz; fi

# cd to dataset folder
data_folder = f'/content/{dataset}'
os.chdir(data_folder)
!pwd

## Import modules

In [None]:
import os, glob
import numpy as np
import pandas as pd
from skimage.io import imread
import matplotlib.pyplot as plt

# tensorflow / keras
import tensorflow as tf
from tensorflow.keras import applications, models, layers

## Utility functions

- These functions should be okay and can be skipped in the code review.
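
As a quick illustration of the one-hot conversion used by `convert_to_onehot` below, the following sketch applies the same `np.eye` indexing to a handful of made-up labels and converts them back with `np.argmax`. The label values are invented for the example.

```python
import numpy as np

# made-up binary labels (0 = benign, 1 = cancer)
labels = np.array([0, 1, 1, 0])

# row i of the identity matrix is the one-hot vector for class i
onehot = np.eye(2)[labels]
print(onehot)

# argmax over the last axis recovers the original class indices
recovered = np.argmax(onehot, axis=-1)
print(recovered)
```

The same `argmax` step is what `plot_roc_curve` relies on to turn one-hot targets back into class indices.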

In [None]:
def read_metadata():
  '''Read metadata file'''
  df = pd.read_csv('labels.csv')
  return df

def read_data(df):
  '''Read TIFF files and their labels'''
  fns = [os.path.join(folder, f'{id}.tif') for folder, id in zip(df['folder'], df['id'])]
  labels = df['label'].values
  images = np.array([imread(fn) for fn in fns])
  return images, labels

def convert_to_onehot(y, n_classes=2):
  '''Convert from index to one-hot notation'''
  return np.eye(n_classes)[y]

def plot_image(x, y, ax=None, show_axis=True):
  '''Plot image and its label'''
  # remember whether we create the figure here, so we know whether to show it
  standalone = ax is None
  if standalone:
    fig, ax = plt.subplots(1,1)
  ax.imshow(x)
  ax.set_title(f'Label = {y}')
  if not show_axis:
    ax.axis('off')
  if standalone:
    plt.show()

def plot_images(xs, ys, show_axis=True):
  '''Plot collection of images and their labels in a square'''
  assert len(xs) == len(ys)
  n_rows = n_cols = int(np.ceil(np.sqrt(len(xs))))
  fig, ax = plt.subplots(n_rows, n_cols, figsize=(n_cols*3, n_rows*3))
  ax = ax.flatten()
  for i, (x, y) in enumerate(zip(xs, ys)):
    plot_image(x, y, ax[i], show_axis=show_axis)
  for a in ax[i+1:]:
    a.set_visible(False)
  plt.show()

def print_details(x):
  '''Print basic info about array'''
  if not isinstance(x, np.ndarray):
    raise TypeError(f'Expected np.ndarray, got {type(x).__name__}.')
  
  print(f'Shape: {x.shape}')
  print(f'Type : {x.dtype}')
  print(f'Min  : {x.min()}')
  print(f'Max  : {x.max()}')
  print(f'Mean : {x.mean()}')

def plot_history(history):
  '''Plot keras history'''
  fig, ax = plt.subplots(1, 2, figsize=(12,4))

  ax[0].set_title('Loss')
  ax[0].plot(history.history['loss'])

  ax[1].set_title('Accuracy')
  ax[1].plot(history.history['accuracy'])

  for a in ax:
    a.set_xlabel('Epochs')
    a.grid(True)

  plt.show()


def plot_roc_curve(model, x, y):
  '''Compute FPR, TPR, AUC and plot ROC curve'''
  # plot_roc_curve is not imported: it is unused here and was removed in scikit-learn 1.2
  from sklearn.metrics import roc_curve, auc
  y_pred_probs = model.predict(x)
  fpr, tpr, thresholds = roc_curve(np.argmax(y,axis=-1), y_pred_probs[:,1])
  auc = auc(fpr, tpr)

  fig, ax = plt.subplots(1,1)
  _ = ax.plot(fpr,tpr,label=f'auc={auc:.3f}')
  ax.legend(loc='best')
  ax.set_xlabel('False positive rate')
  ax.set_ylabel('True positive rate')
  ax.set_title('ROC curve')

  # inset
  axins = ax.inset_axes([0.4, 0.2, 0.5, 0.55])
  _ = axins.plot(fpr, tpr)
  size = 0.2
  x1, x2, y1, y2 = -0.05, size, 1-size, 1.05
  axins.set_xlim(x1, x2)
  axins.set_ylim(y1, y2)
  axins.set_xticklabels('')
  axins.set_yticklabels('')
  ax.indicate_inset_zoom(axins)

  plt.show()
  return auc


def plot_prediction(x_sample, y_probs):
  '''Plot image sample alongside the predicted probabilities'''

  # class with highest probability
  y_pred = np.argmax(y_probs)
  # probability of winning class 
  confidence = y_probs[y_pred]

  # plot prediction
  fig, ax = plt.subplots(1,2, figsize=(6,3))
  fig.tight_layout()
  _= fig.suptitle(f"Prediction: Class \"{['benign', 'cancer'][y_pred]}\" with {confidence:.2%} confidence", y=1.1, fontsize=18)
  ax[0].imshow(x_sample)
  ax[0].axis('off')

  ax[1].bar(range(2), y_probs)
  ax[1].set_xticks([0,1])
  ax[1].set_xticklabels(['Benign', 'Cancer'])
  ax[1].set_ylabel('Class probabilities')
  ax[1].set_ylim([0,1])
  plt.show()

## Read metadata

In [None]:
# read dataset into pandas dataframe
df = read_metadata()
df.head()

## Sample of image data and labels

In [None]:
# read and display sample of images/labels
df_sample = df.sample(n=25)
x, y = read_data(df_sample)
plot_images(x, y, show_axis=False)

## Read and explore image data and labels

In [None]:
# read all image data and labels
x, y = read_data(df)

print('Details of x:')
print('-------------')
print_details(x)

print()
print('Details of y:')
print('-------------')
print_details(y)

# convert labels to one-hot notation
y = convert_to_onehot(y)

## Model and training


In [None]:
# load pretrained model
base_model = applications.VGG16(weights='imagenet', include_top=False, input_shape=x.shape[1:], pooling='max')

# add new classifier layers
model = models.Sequential([ base_model, layers.Dense(2, activation='softmax') ])

# compile model
model.compile(loss='categorical_crossentropy', 
              optimizer=tf.optimizers.SGD(learning_rate=1e-4),
              metrics=['accuracy'])

# train model
history = model.fit(x=x, y=y, batch_size=32, epochs=25, verbose=1)

# plot history
plot_history(history)

## Evaluate model performance

In [None]:
# evaluate model
_, acc = model.evaluate(x, y, verbose=0)

# ROC curve and AUC
auc = plot_roc_curve(model, x, y)

print(f'Model accuracy = {acc:.2%}')
print(f'Model AUC      = {auc:.3f}')

## Predict on sample image

In [None]:
# get random sample image
sample = np.random.randint(0, len(x))
x_sample = x[sample]

# predict class probabilities
y_probs = np.squeeze(model.predict(x_sample[np.newaxis, ...]))
# class with highest probability
y_pred = np.argmax(y_probs)
# probability of winning class 
confidence = y_probs[y_pred]
#print(f"Prediction: Class \"{['benign', 'cancer'][y_pred]}\" with {confidence:.2%} confidence")

plot_prediction(x_sample, y_probs)