# Introduction


**The Task:** Design a system to grade Whole Slide Images of Prostate Tissue Biopsies.


**The motivation:** Cancer grade assessment can show significant inter-observer variability, and the time of expert pathologists is a valuable, limited resource. Machine Learning has already shown promise in this domain. This system was designed as an entry to a $25,000 Kaggle competition hosted by the Karolinska Institute Medical University (Sweden) and the Radboud University Medical Center (Netherlands).


**The Data:** 10,616 Whole Slide Images (between 5,000 and 40,000 pixels in both width and height, lots of empty space) of prostate tissue (format: Multi-level tiff file), with accompanying "Masks" which contain pixelwise labels for each image.  


**The System:** 
1. Access the intermediate layer of the Whole Slide Image multi-level tiff file, split it into 224x224 px tiles and select the 30 tiles with the most tissue.

2. Each of the selected tiles is fed to three distinct binary classifiers. Each classifier is a ResNet50 model pretrained on ImageNet and then trained on our data to detect one of the three possible Gleason patterns of prostate cancer.

3. The outputs of these models are collected for each of the 30 tiles. These 90 values are fed, according to data provider, to one of two small neural networks which were trained on our data to output a final ISUP grade of 0, 1, 2, 3, 4 or 5.


**The Result:** Quadratic Weighted Kappa score of 0.53. (1 is perfect, 0 is no better than chance)


**Designing the System:**


The rest of this notebook will describe at a relatively high level each step of designing and training this system. Links to the code will be provided, and questions in the comments are welcome. Techniques which proved fruitful for other entrants will be discussed as well as the principal avenues for improvement.

# Exploratory Data Analysis


A detailed introduction to the challenge and exploratory data analysis, including Python code, can be found [here.](http://www.kaggle.com/dararc/introduction-eda/) For the sake of brevity, within this notebook I present a small snapshot of the analysis I conducted.

We see below for reference the first five rows of the CSV file provided alongside our images and masks. Our CSV file contains the image id with which we can access the multi-level tiff file, the data provider, the ISUP grade, and the underlying Gleason patterns.

It is worth mentioning here how the Gleason score and the ISUP grade are calculated. The Gleason score records which patterns are present: 0+0 reflects no cancer, while a Gleason score of 4+3 indicates that Gleason pattern 4 is the most prevalent cancer pattern and that Gleason pattern 3 is also present. The ISUP grade follows directly from the Gleason score; it is essentially a tidier scale for the same information. An ISUP grade of 0 indicates no cancer, while ISUP 5 is the most severe grading.
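
To make the relationship concrete, here is a minimal sketch of the Gleason-to-ISUP lookup, along with the kind of consistency check used to spot mislabelled rows. The helper names are my own for illustration, and treating the string `negative` as equivalent to `0+0` is an assumption about how the CSV encodes benign slides.

```python
# Gleason score -> ISUP grade lookup.
# Assumption: 'negative' is treated the same as '0+0' (no cancer).
GLEASON_TO_ISUP = {
    "0+0": 0, "negative": 0,
    "3+3": 1,
    "3+4": 2,
    "4+3": 3,
    "4+4": 4, "3+5": 4, "5+3": 4,
    "4+5": 5, "5+4": 5, "5+5": 5,
}


def gleason_to_isup(gleason_score):
    """Return the ISUP grade implied by a Gleason score string such as '4+3'."""
    return GLEASON_TO_ISUP[gleason_score.strip().lower()]


def is_consistent(gleason_score, isup_grade):
    """True when a row's ISUP grade matches its Gleason score."""
    return gleason_to_isup(gleason_score) == isup_grade


print(gleason_to_isup("4+3"))    # 3
print(is_consistent("3+4", 2))   # True
```

Note that the order matters: 3+4 and 4+3 contain the same patterns but map to different ISUP grades because the first number is the dominant pattern.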

In [None]:
import numpy as np 
import pandas as pd 
import os
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
#set directory
MAIN_DIR = '../input/prostate-cancer-grade-assessment'
# load data
train = pd.read_csv(os.path.join(MAIN_DIR, 'train.csv'))
# useful function for plotting counts
def plot_count(df, feature, title='', size=2):
    f, ax = plt.subplots(1,1, figsize=(4*size,3*size))
    total = float(len(df))
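    # pass the column as x and draw on our own axes (recent seaborn versions expect keyword arguments)
    sns.countplot(x=df[feature], order=df[feature].value_counts().index, palette='deep', ax=ax)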
    plt.title(title)
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x()+p.get_width()/2.,
                height + 9,
                '{:1.2f}%'.format(100*height/total),
                ha="center") 
    plt.show()
# useful function for plotting relative distributions 
def plot_relative_distribution(df, feature, hue, title='', size=2):
    f, ax = plt.subplots(1,1, figsize=(4*size,3*size))
    total = float(len(df))
    sns.countplot(x=feature, hue=hue, data=df, palette='deep')
    plt.title(title)
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(100*height/total),
                ha="center") 
    plt.show()

In [None]:
print(train.head())

We see below that our data is split nearly 50/50 across our two data providers, but that the examples provided by Radboud University Medical Center tend to have more severe ISUP grades than those provided by the Karolinska Institute.

In [None]:
plot_count(df=train, feature='data_provider', title = 'Data provider - count and percentage share')

In [None]:
plot_relative_distribution(df=train, feature='isup_grade', hue='data_provider', title = 'relative distribution of ISUP grade across data_provider', size=2)

Two other points of note: 
* During the challenge it was revealed by the organisers that the Radboud data contained a significant level of noise. This was a deciding factor in my decision to split up the examples by data provider in the last step of grading.

* Our EDA checked for mislabelled examples, i.e. ISUP grades which did not match the underlying Gleason patterns. One mislabelled example was identified and removed.

# Data
It took a little time to get used to working with multi-level tiff files. Multi-level means the file contains the same image at three different resolutions, corresponding to downsampling factors of 1, 4 and 16. We see here a sample from the same image at each of the three resolutions, starting with the lowest before zooming in twice (4x in each case) on a particular section.
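
Because each level is a fixed downsampling of the full-resolution image, a pixel position found at one level can be converted to any other level by scaling. The helper below is a small sketch of that arithmetic; `scale_coords` and `DOWNSAMPLES` are illustrative names of my own, not part of any library.

```python
# Downsampling factor of each pyramid level relative to full resolution
# (level 0 = full resolution, level 2 = lowest resolution).
DOWNSAMPLES = (1, 4, 16)


def scale_coords(x, y, from_level, to_level, downsamples=DOWNSAMPLES):
    """Map a pixel position at one pyramid level to the same spot at another level."""
    factor = downsamples[from_level] / downsamples[to_level]
    return int(x * factor), int(y * factor)


# A point picked on the intermediate layer, located on the full-resolution layer
print(scale_coords(1450, 1950, from_level=1, to_level=0))  # (5800, 7800)
```

This is the same multiplication by 4 used when moving the 512x512 patch from level 1 to level 0.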

In [None]:
import skimage.io
import PIL
path = os.path.join(MAIN_DIR, 'train_images')
biopsy = skimage.io.MultiImage(os.path.join(path, train.image_id.tolist()[0]+'.tiff'))
display(PIL.Image.fromarray(biopsy[-1]))

In [None]:
x = 1450
y = 1950
level = 1
width = 512
height = 512

patch = biopsy[level][y:y+height, x:x+width]


plt.figure()
plt.imshow(patch)
plt.show()

In [None]:
x = 1450*4
y = 1950*4
level = 0
width = 512
height = 512

patch = biopsy[level][y:y+height, x:x+width]


plt.figure()
plt.imshow(patch)
plt.show()

# Data Labels

Each image in our training data comes with a corresponding 'mask' tiff file of the same size, which provides pixelwise labels in the red channel; the other channels are set to zero.


The Radboud labels were semi-automatically generated by several deep learning algorithms, contain noise and can be considered weakly supervised, while the Karolinska labels were semi-automatically generated based on a pathologist's annotations. Each data provider labels the data slightly differently so with the help of a custom colour map we can display some images alongside their labels.

Radboudumc: Prostate glands are individually labelled. Valid values are:
* 0: background (non tissue) or unknown
* 1: stroma (connective tissue, non-epithelium tissue)
* 2: healthy (benign) epithelium
* 3: cancerous epithelium (Gleason 3)
* 4: cancerous epithelium (Gleason 4)
* 5: cancerous epithelium (Gleason 5)

Karolinska: Regions are labelled. Valid values are:
* 0: background (non tissue) or unknown
* 1: benign tissue (stroma and epithelium combined)
* 2: cancerous tissue (stroma and epithelium combined)

We will apply Karolinska's method of not distinguishing between stroma and epithelium, so both benign Radboud classes are displayed as healthy tissue.

Key: 

    Karolinska: Black = Background, Grey = Healthy Tissue, Purple = Cancerous
     
    Radboud: Black = Background, Grey = Healthy Tissue, Yellow = Gleason 3, Orange = Gleason 4, Red = Gleason 5
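
For display, the colour map alone is enough to merge the two benign Radboud classes. If the merged labels are needed as data, one option is a lookup table applied to the mask channel. This is a sketch under my own naming (`RADBOUD_TO_COMBINED`, `combine_benign`) and uses a toy array rather than a real mask.

```python
import numpy as np

# Lookup table indexed by the Radboud label value:
# stroma (1) and benign epithelium (2) both become 1, the Gleason labels are kept.
RADBOUD_TO_COMBINED = np.array([0, 1, 1, 3, 4, 5], dtype=np.uint8)


def combine_benign(mask_channel):
    """Merge stroma and benign epithelium in a Radboud label array."""
    return RADBOUD_TO_COMBINED[mask_channel]


toy_mask = np.array([[0, 1, 2],
                     [3, 4, 5]], dtype=np.uint8)
print(combine_benign(toy_mask))
```

Indexing the table with the whole array remaps every pixel in one vectorised step, which matters for masks that are tens of thousands of pixels wide.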

In [None]:
train_img_index = pd.read_csv(os.path.join(MAIN_DIR, 'train.csv')).set_index('image_id')
import matplotlib
import matplotlib.pyplot as plt
mask_dir = '../input/prostate-cancer-grade-assessment/train_label_masks/'

# take an image id and return the lowest-resolution array of the image or mask as required
def id2array(id, type):
    if type == 'mask':
        mask_path = os.path.join(mask_dir, id + '_mask.tiff')
        if os.path.isfile(mask_path):
            array = skimage.io.MultiImage(mask_path)[-1]
        else:
            # some slides have no mask; report it and return a placeholder
            print(f'no mask found for {id}')
            array = 0
    else:
        array = skimage.io.MultiImage(os.path.join(path, id + '.tiff'))[-1]
    return array

# we set up two colour maps as described above
cmap_rad = matplotlib.colors.ListedColormap(['black', 'gray', 'gray', 'yellow', 'orange', 'red'])
cmap_kar = matplotlib.colors.ListedColormap(['black', 'gray', 'purple'])


# this function will take 5 image ids and display an image and the related mask
def plot5(ids):
    img_arrays = [id2array(item, 'image') for item in ids]
    mask_arrays = [id2array(item, 'mask') for item in ids]
    fig, axs = plt.subplots(5, 2, figsize=(15,25))
    for i in range(0,5):
        image_id = ids[i]
        data_provider = train_img_index.loc[image_id, 'data_provider']
        gleason_score = train_img_index.loc[image_id, 'gleason_score']
        axs[i, 0].imshow(img_arrays[i])
        mask_array = mask_arrays[i]
        if data_provider == 'karolinska':
            axs[i, 1].imshow(mask_array[:,:,0], cmap=cmap_kar, interpolation='nearest', vmin=0, vmax=2)
        else:
            axs[i, 1].imshow(mask_array[:,:,0], cmap=cmap_rad, interpolation='nearest', vmin=0, vmax=5)
        for j in range(0,2):
            axs[i,j].set_title(f"ID: {image_id}\nSource: {data_provider} Gleason: {gleason_score}")
    plt.show()

In [None]:
plot5(train.image_id.tolist()[100:105])

# Data Pre-Processing

The training data above needed to be pre-processed into a form that could train a model. Iafoss' great tiling [technique](http://www.kaggle.com/iafoss/panda-16x128x128-tiles), which he made public early on in the challenge, was utilised by a lot of contenders. I experimented with some resizing methods and edge detection using convolutions, but having seen the success of the tiling technique I decided to utilise it with an adjustment to obtain twice as much data.

I dropped any suspicious image ids from our training data, i.e. those with mislabelling, pen marks or missing masks.
I then accessed the intermediate layer of each image and padded it so that both dimensions were multiples of 224, and selected the 24 tiles of 224x224 px with the most tissue. I repeated the process with cropping instead of padding to obtain twice as much data. I applied the same process to the masks to obtain a 224x224 label image for each tile. The tiles were saved as png files, and my code for the process (padding version; the cropping version differs in only two lines) can be found [here](http://www.kaggle.com/dararc/panda-step-1-tiling)
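
The padding and cropping variants can be sketched in one function. The padding branch follows the reshape-and-sort approach of the tiling technique; the cropping branch, which trims the remainder evenly from both sides, is my assumption about how such a variant could look, and `make_tiles` is an illustrative name. Tissue is darker than the white background, so the tiles with the lowest pixel sum are the ones with the most tissue.

```python
import numpy as np


def make_tiles(img, sz=224, n=24, mode="pad"):
    """Split an RGB image into sz x sz tiles and keep the n tiles with the most tissue."""
    h, w = img.shape[:2]
    if mode == "pad":
        # pad with white so that both dimensions become multiples of sz
        pad_h, pad_w = (sz - h % sz) % sz, (sz - w % sz) % sz
        img = np.pad(img,
                     [[pad_h // 2, pad_h - pad_h // 2],
                      [pad_w // 2, pad_w - pad_w // 2],
                      [0, 0]],
                     constant_values=255)
    else:
        # assumption: trim the remainder evenly from both sides
        crop_h, crop_w = h % sz, w % sz
        img = img[crop_h // 2: h - (crop_h - crop_h // 2),
                  crop_w // 2: w - (crop_w - crop_w // 2)]
    rows, cols = img.shape[0] // sz, img.shape[1] // sz
    tiles = img.reshape(rows, sz, cols, sz, 3).transpose(0, 2, 1, 3, 4).reshape(-1, sz, sz, 3)
    if len(tiles) < n:
        # too few tiles: fill up with blank white tiles
        tiles = np.pad(tiles, [[0, n - len(tiles)], [0, 0], [0, 0], [0, 0]], constant_values=255)
    # lowest pixel sum first = darkest = most tissue
    order = np.argsort(tiles.reshape(len(tiles), -1).sum(-1))[:n]
    return tiles[order]


# toy slide: white background with one dark square standing in for tissue
toy = np.full((500, 700, 3), 255, dtype=np.uint8)
toy[100:200, 100:200] = 0
padded_tiles = make_tiles(toy, n=4, mode="pad")
cropped_tiles = make_tiles(toy, n=4, mode="crop")
print(padded_tiles.shape, cropped_tiles.shape)
```

Because the two modes shift the tile grid relative to the tissue, they produce different tiles from the same slide, which is what makes the second pass useful as extra training data.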



# Data Labelling

(Up until now I had got by with an introduction from Andrew Ng's Machine Learning course, plenty of python tutorials - particularly several DataCamp courses, learning from other entrants' notebooks and many hours of trial and error. I completed Introduction to TensorFlow for Artificial Intelligence, Machine Learning, and Deep Learning, a four week course on the Coursera platform. This gave me a grounding in flowing image data to Neural Networks using the Keras API.)

As mentioned in the introduction, I trained three separate pre-trained ResNet50 models to detect the three patterns of prostate cancer: Gleason 3, Gleason 4, and Gleason 5. The Radboud data provided us with labels as to whether Gleason 3, Gleason 4 or Gleason 5 was present in a particular region, so for each of our training tiles I could read the corresponding mask file and determine whether each of the patterns was present in the tile. Detailed code similar to the final code used can be found [here](http://www.kaggle.com/dararc/eda-on-tile-labels)

The Karolinska data only labelled tissue as cancerous or benign, so I used only Karolinska slides which contained a single pattern. That is to say, only tiles coming from images which were purely benign, only Gleason 3, only Gleason 4 or only Gleason 5 were used as training data for the ResNet50 models.
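
The two labelling rules can be sketched as follows. This is not the linked notebook's code: the function names and the `min_pixels` threshold are hypothetical, and reading a single-pattern Karolinska slide as one whose Gleason score has two equal halves (or is negative) is my interpretation of the rule above.

```python
import numpy as np

# Radboud mask values for the three cancer patterns
RADBOUD_PATTERN_LABELS = {"gleason_3": 3, "gleason_4": 4, "gleason_5": 5}


def tile_labels(mask_tile, min_pixels=1):
    """Return {pattern: 0 or 1} for one Radboud mask tile (labels in channel 0)."""
    labels = mask_tile[..., 0] if mask_tile.ndim == 3 else mask_tile
    return {name: int(np.count_nonzero(labels == value) >= min_pixels)
            for name, value in RADBOUD_PATTERN_LABELS.items()}


def karolinska_single_pattern(gleason_score):
    """Return the only pattern on a slide (0, 3, 4 or 5), or None if patterns are mixed."""
    if gleason_score == "negative":
        return 0
    first, second = (int(part) for part in gleason_score.split("+"))
    return first if first == second else None


toy_tile = np.zeros((224, 224, 3), dtype=np.uint8)
toy_tile[:10, :10, 0] = 3     # a small Gleason 3 region
toy_tile[50:60, 50:60, 0] = 5  # a small Gleason 5 region
print(tile_labels(toy_tile))
print(karolinska_single_pattern("4+4"), karolinska_single_pattern("3+4"))
```

Raising `min_pixels` would be one way to stop a handful of stray pixels from marking a whole tile as containing a pattern.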

# Training the Models

Each tile was copied to three folders according to whether each of the three possible patterns was present or not. This was done using three separate notebooks, as I ran up against the memory limits when running it all in one notebook. For instance, a tile containing Gleason 3 and Gleason 5 but not Gleason 4 was copied to the three folders: Gleason 3 present, Gleason 4 absent, Gleason 5 present.
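
A sketch of that copying step is shown below. The folder layout (`<pattern>/present` and `<pattern>/absent`) and the name `copy_tile` are assumptions for illustration; a layout with one sub-folder per class is the kind that Keras' `flow_from_directory` can read, but the actual folder names used in my notebooks may differ.

```python
import shutil
import tempfile
from pathlib import Path


def copy_tile(tile_path, labels, out_root):
    """Copy one tile into a present/absent folder for each pattern."""
    destinations = []
    for pattern, present in labels.items():
        folder = Path(out_root) / pattern / ("present" if present else "absent")
        folder.mkdir(parents=True, exist_ok=True)
        destinations.append(Path(shutil.copy(tile_path, folder)))
    return destinations


# demonstration in a temporary directory with an empty placeholder file
with tempfile.TemporaryDirectory() as tmp:
    out_root = Path(tmp) / "out"
    tile = Path(tmp) / "tile_0.png"
    tile.write_bytes(b"")  # stands in for a real tile image
    copies = copy_tile(tile, {"gleason_3": 1, "gleason_4": 0, "gleason_5": 1}, out_root)
    folders = [c.parent.relative_to(out_root).as_posix() for c in copies]
    print(folders)
```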


This way I used all the data to train each model to ascertain whether its pattern was present or absent in a given tile. Validation accuracy got to roughly 0.7 for these models; the training of one of them can be found [here](http://www.kaggle.com/dararc/gl3-panda-training/).


After about 25 epochs the model tended towards overfitting. There is definitely room for improvement in this step of the system and I will be experimenting by accessing different tiles, for instance 16 tiles from the lowest layer as well as using augmentation of the training data. This will take a lot of GPU hours so it will probably be late August before these improvements are live.


At this stage we had three models which could each ascertain with about 70% accuracy whether its pattern was present in a tile. Since a cancerous region could easily be present in, say, the tile with the 25th most tissue (which still contains some tissue), I decided to run these models on the top 30 tiles when predicting. The models give three values per tile, so these 90 values still had to be collated into a single ISUP grade. I hard-coded some procedures, which got a QWK score of 0.19, before realising that several thousand examples of 90 values each predicting a single value was a perfect task for machine learning.

I made two small neural networks, one for each data provider (I am currently unsure whether a neural network was the best choice for this task and will be researching this over the coming weeks). Each network takes 90 values and outputs a single ISUP grade. I trained them on all the training data (minus suspicious slides). I could make use of all the Karolinska data in this phase, as I was now predicting with our ResNet50 models on the top 30 tiles per image; the mask files were irrelevant for this step, as the only label I needed was the final ISUP grade of the whole slide.
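
The shape of the input to these networks is worth spelling out. The sketch below flattens a 30x3 array of tile predictions into the 90-value feature row; `to_features` is an illustrative name and the random numbers stand in for real model outputs.

```python
import numpy as np

N_TILES, N_PATTERNS = 30, 3


def to_features(tile_preds):
    """Flatten a (30, 3) array of per-tile pattern scores into one (1, 90) row."""
    tile_preds = np.asarray(tile_preds, dtype=np.float32)
    if tile_preds.shape != (N_TILES, N_PATTERNS):
        raise ValueError(f"expected shape {(N_TILES, N_PATTERNS)}, got {tile_preds.shape}")
    return tile_preds.reshape(1, N_TILES * N_PATTERNS)


rng = np.random.default_rng(0)
toy_preds = rng.random((N_TILES, N_PATTERNS))  # placeholder scores, not real predictions
features = to_features(toy_preds)
print(features.shape)  # (1, 90)
```

The flattening is row-major, so the three scores for tile `i` sit at positions `3*i`, `3*i + 1` and `3*i + 2`. Since the tiles are ordered by tissue content, each position in the row has a consistent meaning from slide to slide.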


I tried to use the QWK as the loss function for this model but did not get this custom loss function finished in time (Another task for the coming weeks). I had to settle for binary cross entropy which doesn't distinguish how far away mis-classifications are from the true value, whereas the final grading for this competition does.
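
To illustrate why the metric cares about distance, here is a plain-Python sketch of the quadratic weighted kappa as a metric (not as a differentiable loss). It follows the standard definition, in which a disagreement between grades `i` and `j` is weighted by `(i - j)^2`; the function name and the toy labels are my own.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=6):
    """Cohen's kappa with quadratic disagreement weights."""
    n = len(y_true)
    # confusion matrix: observed[t][p] counts slides with true grade t predicted as p
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    numerator = denominator = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            numerator += weight * observed[i][j]
            # agreement expected by chance, given the two grade histograms
            denominator += weight * hist_true[i] * hist_pred[j] / n
    return 1.0 - numerator / denominator


truth = [0, 1, 2, 3, 4, 5]
near_miss = [1, 1, 2, 3, 4, 5]  # one slide off by one grade
far_miss = [5, 1, 2, 3, 4, 5]   # one slide off by five grades
print(quadratic_weighted_kappa(truth, truth))
print(quadratic_weighted_kappa(truth, near_miss))
print(quadratic_weighted_kappa(truth, far_miss))
```

Both wrong predictions contain exactly one misclassified slide, so a plain classification loss treats them alike, yet the far miss is penalised much more heavily by the kappa score.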

# Conclusion


The system scored a QWK of 0.53 on the unseen test data. A score of 1 would be a perfect performance, grading every example with the same grade given by expert pathologists, while a score of 0 is a performance no better than chance. The winning entry scored 0.94. Some really interesting ideas utilised by the top solutions included running predictions from several models and choosing the most common predicted value, and accessing data from the highest resolution for uncertain regions. I will certainly be gleaning as much as I can from the other entrants' work to improve my own system before taking the knowledge gained with me to other challenges.
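
The voting idea is simple to sketch. This is only an illustration of choosing the most common predicted value, not a reconstruction of any top solution; breaking ties in favour of the lower grade is an arbitrary choice of mine.

```python
from collections import Counter


def majority_vote(grades):
    """Return the most common grade; ties go to the lowest grade (arbitrary choice)."""
    counts = Counter(grades)
    top = max(counts.values())
    return min(grade for grade, count in counts.items() if count == top)


# predictions for one slide from four hypothetical models
print(majority_vote([2, 3, 3, 4]))  # 3
```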

This was my first ML competition having started upskilling into the field in March. I thoroughly enjoyed and appreciated the code and ideas that I learnt from and look forward to the next challenge.

In [None]:
from tensorflow.keras.models import load_model
model3 = load_model('../input/gl3-panda-training/gl3_model')
model3.load_weights('../input/gl3-panda-training/best.hdf5')

model4 = load_model('../input/gl4-panda-training/gl4_model')
model4.load_weights('../input/gl4-panda-training/best.hdf5')

model5 = load_model('../input/gl5-panda-training-hup/gl5_model')
model5.load_weights('../input/gl5-panda-training-hup/best.hdf5')

models = [model3, model4, model5]

DATA = '../input/prostate-cancer-grade-assessment/test_images'
TEST = '../input/prostate-cancer-grade-assessment/test.csv'
TRAIN = '../input/prostate-cancer-grade-assessment/train.csv'
SAMPLE = '../input/prostate-cancer-grade-assessment/sample_submission.csv'

testdf = pd.read_csv(TEST)
ids = testdf.image_id.tolist()
dp = testdf.data_provider.tolist()

N = 30
sz = 224

sub_df = pd.read_csv(SAMPLE)
import skimage.io
def id2tiles(id):
    results = []
    if os.path.exists(DATA):
        img = skimage.io.MultiImage(os.path.join(DATA,id+'.tiff'))[-2]
        shape = img.shape
        pad0,pad1 = (sz - shape[0]%sz)%sz, (sz - shape[1]%sz)%sz
        img = np.pad(img,[[pad0//2,pad0-pad0//2],[pad1//2,pad1-pad1//2],[0,0]],
                    constant_values=255)
        img = img.reshape(img.shape[0]//sz,sz,img.shape[1]//sz,sz,3)
        img = img.transpose(0,2,1,3,4).reshape(-1,sz,sz,3)
        if len(img) < N:
            img = np.pad(img,[[0,N-len(img)],[0,0],[0,0],[0,0]],constant_values=255)
        idxs = np.argsort(img.reshape(img.shape[0],-1).sum(-1))[:N]
        img = img[idxs]
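        for i in range(len(img)):
            # cast to float first: writing the normalised values back into
            # the uint8 tile would silently truncate them
            rel_img = img[i].astype(np.float32)
            rel_img[:,:,0] = ((rel_img[:,:,0]/255) - 0.8094)/ 0.4055
            rel_img[:,:,1] = ((rel_img[:,:,1]/255) - 0.6067)/ 0.5094
            rel_img[:,:,2] = ((rel_img[:,:,2]/255) - 0.7383)/ 0.4158
            results.append(rel_img)
    return results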
    
model_pred_k = load_model('../input/training-for-model-outputs-to-grade/pred_model_k')
model_pred_k.load_weights('../input/training-for-model-outputs-to-grade/model_kbest.hdf5')

model_pred_r = load_model('../input/fork-of-training-for-model-outputs-to-grade/pred_model_r')
model_pred_r.load_weights('../input/fork-of-training-for-model-outputs-to-grade/model_rbest.hdf5')

# one list of N normalised tiles per test slide
list_of_list_of_tiles = [id2tiles(item) for item in ids]

isups = []
preds_list = []
def tiles2pred(tiles):
    if os.path.exists(DATA):
        new_images = [np.reshape(item, [1,224,224,3]) for item in tiles]
        preds = np.zeros((N,3))
        for i in range(0,N):
            for j in range(0,3):
                preds[i,j] = models[j].predict(new_images[i])
        preds_list.append(preds)
        
for item in list_of_list_of_tiles:
    tiles2pred(item)
    
model_used = []
# pick the grading network that matches each slide's data provider
for item, provider in zip(preds_list, dp):
    features = item.reshape(1,90)
    if provider == 'karolinska':
        pred_array = model_pred_k.predict(features)
        model_used.append('k')
    else:
        pred_array = model_pred_r.predict(features)
        model_used.append('r')
    # the predicted ISUP grade is the class with the highest score
    isups.append(pred_array.argmax())

# build the submission only when the test images are available;
# otherwise fall back to the sample submission loaded above
if os.path.exists(DATA):
    sub_df = pd.DataFrame({'image_id': ids, 'isup_grade': isups})

sub_df.to_csv("submission.csv", index=False)