# <center>Deeper Exploratory Data Analysis of PANDA</center>


## About the notebook

In this notebook, we dive deeper into the dataset description and build on it to perform an extensive EDA. To get started, we first explain some key concepts related to **Prostate Cancer** and its detection that are worth knowing.


## Domain Knowledge 

In this section, we cover some background on Prostate Cancer in order to gain a deeper understanding of the problem. To do this, we successively address five key questions: what is Prostate Cancer? How is it tested and detected? What is the Gleason score and how does it fit into all this? What is the ISUP grade and how is it related to the Gleason score? How have Gleason scores been generated?

So, let us start looking into these questions.

### Question 1: What is Prostate Cancer?

Prostate Cancer is a cancer that occurs in the prostate. The prostate is a small walnut-shaped gland in men that produces the seminal fluid that nourishes and transports sperm, as depicted in the figure below.

<center>
<table class="image" style="table-layout:fixed; width:50%; min-width:100px; max-width:200;">
    <tr>
        <td>
            <img src="https://www.mayoclinic.org/-/media/kcms/gbs/patient-consumer/images/2013/11/15/17/38/ds00043_-my01633_im01561_prostca1thu_jpg.jpg" width="350px" height="350px">
        </td>
    </tr>
    <caption align="bottom">
        <center>
            Prostate Cancer is one of the most common types of cancer in men. Usually, it grows slowly and is initially confined to the prostate gland, where it may not cause serious harm. However, while some types of Prostate Cancer grow slowly and may need minimal or even no treatment, other types are aggressive and can spread quickly.
        </center>
    </caption>
</table>
</center>




### Question 2: How is Prostate Cancer tested and detected?

Prostate Cancer is tested through some prostate screening tests that might include:
* **Digital rectal exam (DRE)**: During a DRE, your doctor inserts a gloved, lubricated finger into your rectum to examine your prostate, which is adjacent to the rectum, as depicted in the figure below. More precisely, the doctor feels the back wall of the prostate gland for enlargement, tenderness, lumps or hard spots. If your doctor finds any abnormalities in the texture, shape or size of the gland, you may need further tests.
<center>
<table class="image" style="table-layout:fixed; width:100%; min-width:80px; max-width:200;">
    <tr>
        <img src="https://www.mayoclinic.org/-/media/kcms/gbs/patient-consumer/images/2013/11/15/17/37/ds00043_-hq01273_im01241_hdg7_drethu_jpg.jpg" width="350px" height="350px">
    </tr>
</table>
</center>
* **Prostate-specific antigen (PSA) test**: Here, a blood sample is drawn from a vein in your arm and analyzed for PSA, a substance that's naturally produced by your prostate gland. It's normal for a small amount of PSA to be in your bloodstream. However, if a higher than normal level is found, it may indicate prostate infection, inflammation, enlargement or cancer.

If a DRE or PSA test detects an abnormality, your doctor may recommend further tests to determine whether you have prostate cancer. Such tests can include:

* **Ultrasound**: If other tests raise concerns, your doctor may use transrectal ultrasound to further evaluate your prostate. A small probe, about the size and shape of a cigar, is inserted into your rectum. The probe uses sound waves to create a picture of your prostate gland.
* **Collecting a sample of prostate tissue** : If initial test results suggest prostate cancer, your doctor may recommend a procedure to collect a sample of cells from your prostate (prostate biopsy). Prostate biopsy is often done using a thin needle that's inserted into the prostate to collect tissue. The tissue sample is analyzed in a lab to determine whether cancer cells are present.

### Question 3: What is Gleason score and how does it fit into all this?

When a biopsy confirms the presence of cancer, the next step is to determine the level of aggressiveness (grade) of the cancer cells. A laboratory pathologist examines a sample of your cancer to determine how much cancer cells differ from the healthy cells. A higher grade indicates a more aggressive cancer that is more likely to spread quickly.

The most common scale used to evaluate the grade of prostate cancer cells is called the Gleason score. Gleason scoring combines two numbers and can range from 2 to 10, though the lower part of the range isn't used as often. It describes how much the cancer from a biopsy looks like healthy tissue (lower score, less aggressive) or abnormal tissue (higher score, very aggressive).

### Question 4: What is ISUP grade and how is it related to Gleason score?

According to current guidelines by the International Society of Urological Pathology (ISUP), the Gleason scores are summarized into an ISUP grade on a scale from 1 to 5 according to the following rule:
* Gleason score 6 = ISUP grade 1
* Gleason score 7 (3 + 4) = ISUP grade 2
* Gleason score 7 (4 + 3) = ISUP grade 3
* Gleason score 8 = ISUP grade 4
* Gleason score 9-10 = ISUP grade 5

If there is no cancer in the sample, we use the label ISUP grade 0 in this competition. 
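
The rule above can be sketched as a small helper. This is only an illustration: the function name `gleason_to_isup` is hypothetical, and it assumes scores are written as strings such as `"3+4"`, with benign slides written as `"0+0"` or `"negative"` (the two spellings that appear in the `gleason_score` column).

```python
def gleason_to_isup(gleason_score: str) -> int:
    """Map a Gleason score string such as "3+4" to an ISUP grade (0-5)."""
    # Benign slides carry no Gleason patterns and get ISUP grade 0.
    if gleason_score in ("0+0", "negative"):
        return 0
    primary, secondary = (int(p) for p in gleason_score.split("+"))
    total = primary + secondary
    if total <= 6:
        return 1
    if total == 7:
        # 3+4 and 4+3 share the same sum but not the same grade.
        return 2 if primary == 3 else 3
    if total == 8:
        return 4
    return 5  # Gleason score 9-10


for score in ["negative", "3+3", "3+4", "4+3", "4+4", "4+5"]:
    print(score, "->", gleason_to_isup(score))
```

Note that the order of the two numbers only matters for a total of 7, which is why the helper looks at the primary pattern in that case.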

### Question 5: How have Gleason scores been generated in the dataset?

Each whole-slide Image (WSI) in this challenge contains one, or in some cases two, thin tissue sections cut from a single biopsy sample. Prior to scanning, the tissue is stained with haematoxylin & eosin (H&E). This is a standard way of staining the originally transparent tissue to produce some contrast. The samples are made up of glandular tissue and connective tissue. The glands are hollow structures, which can be seen as white “holes” or branched cavities in the WSI. The appearance of the glands forms the basis of the Gleason grading system. The glandular structure characteristic of healthy prostate tissue is progressively lost with increasing grade. The grading system recognizes three categories: 3, 4, and 5. The patterns are described in detail below and exemplified in the figure below:

* **[A] Benign prostate glands with folded epithelium:** The cytoplasm is pale and the nuclei small and regular. The glands are grouped together.
* **[B] Prostatic adenocarcinoma:** Gleason Pattern 3 has no loss of glandular differentiation. Small glands infiltrate between benign glands. The cytoplasm is often dark and the nuclei enlarged with dark chromatin and some prominent nucleoli. Each epithelial unit is separate and has a lumen.
* **[C] Prostatic adenocarcinoma:** Gleason Pattern 4 has partial loss of glandular differentiation. There is an attempt to form lumina but the tumor fails to form complete, well-developed glands. This microphotograph shows irregular cribriform cancer, i.e. epithelial sheets with multiple lumina. There are also some poorly formed small glands and some fused glands. All of these are included in Gleason Pattern 4.
* **[D] Prostatic adenocarcinoma:** Gleason Pattern 5 has an almost complete loss of glandular differentiation. Dispersed single cancer cells are seen in the stroma. Gleason Pattern 5 may also contain solid sheets or strands of cancer cells. All microphotographs show hematoxylin and eosin stains at 20x lens magnification.

<center>
<table class="image" style="table-layout:fixed; width:80%; min-width:200px; max-width:400;">
    <tr>
        <img src="https://storage.googleapis.com/kaggle-media/competitions/PANDA/GleasonPattern_4squares%20copy500.png" width="300px">
    </tr>
    <caption align="bottom">
        <center>
            Examples of patterns used for grading. [A] Healthy glands. [B]-[D] Gleason pattern 3 to 5, respectively.
        </center>
    </caption>
</table>
</center>


# Now Let's Get Started with the PANDA Dataset

This notebook shows a few methods to load and preprocess WSIs from the PANDA challenge dataset. Based on the challenge description, the dataset consists of about 11 000 samples, each of which is a whole-slide image (WSI) of a prostate biopsy from one of two sources: [Radboud University Medical Center](https://www.radboudumc.nl/en/research) and the [Karolinska Institute](https://ki.se/en/meb).


## Loading of the Required Dependencies

The Python 3 environment associated with this notebook comes with many helpful analytics libraries installed. It is defined by the `kaggle/python` Docker image: [https://github.com/kaggle/docker-python](https://github.com/kaggle/docker-python). In the cell below, we import the packages/modules/libraries that I think are required for this project.

In [None]:
### Directory dependencies
import os
from glob import glob

### Preprocessing dependencies: Level 1
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import random

### Plotting dependencies
import matplotlib as mpl
import matplotlib.pyplot as plt
import plotly.graph_objs as go # Plotly for the interactive viewer (see last section)
import seaborn as sns
from IPython.display import display, HTML # `Image` is imported from PIL below, which would shadow IPython's `Image`

### Preprocessing dependencies: Level 2
import openslide as OS
import PIL
from PIL import Image, ImageOps
import imageio
import cv2
import skimage.io as IO

### Figure Configuration
sns.set_style("whitegrid")
plt.rc("figure", titlesize=20)   # fontsize of the figure title
plt.rc("axes", titlesize=17)     # fontsize of the axes title
plt.rc("axes", labelsize=15)     # fontsize of the x and y labels
plt.rc("xtick", labelsize=12)    # fontsize of the tick labels
plt.rc("ytick", labelsize=12)    # fontsize of the tick labels
plt.rc("legend", fontsize=13)    # legend fontsize

## Loading of the Data

### Directories' Exploration

**Note:** We can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All". You can also write temporary files to `/kaggle/temp/`, but they won't be saved outside of the current session.

In [None]:
pwd

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

print("This is an overview of the different directories and their contents\n")
max_per_dir = 10  # Show at most this many files from each directory
for dirname, _, filenames in os.walk("/kaggle/"):
    for filename in filenames[:max_per_dir]:
        print(os.path.join(dirname, filename))

### Load the Data and Check the Content

In [None]:
### Base directory of the data
data_dir = "../input/prostate-cancer-grade-assessment"

### Directories of the training images and label masks
train_dir = os.path.sep.join([data_dir, "train_images"])
train_mask_dir = os.path.sep.join([data_dir, "train_label_masks"])

In [None]:
### Load the data
train_ids = pd.read_csv(f"{data_dir}/train.csv")
test_ids = pd.read_csv(f"{data_dir}/test.csv")
submission = pd.read_csv(f"{data_dir}/sample_submission.csv")

display(train_ids.tail(3).style.background_gradient(cmap="Blues"))

In [None]:
train_ids.set_index("image_id", inplace=True)
display(train_ids.tail(10).style.background_gradient(cmap="Blues"))

#### Checking for the Training Samples

In [None]:
###
files = glob(f"{train_dir}/*")
print(f"Size of the samples in the train_images directory: {len(files)}")
print(f"Size of training samples whose IDs are in train.csv: {train_ids.shape[0]}")
files = glob(f"{train_mask_dir}/*")
print(f"Size of the samples masks in the train_label_masks directory: {len(files)}")
###
print(f"\nData providers IDs: {train_ids.data_provider.unique()}, Size: {len(train_ids.data_provider.unique())}")
print(f"ISUP Grades (target) used: {train_ids.isup_grade.unique()}, Size: {len(train_ids.isup_grade.unique())}")
print(f"Gleason Score used: {train_ids.gleason_score.unique()}, Size: {len(train_ids.gleason_score.unique())}")
###
#display(train_ids.tail(10).style.background_gradient(cmap="Blues"))

#### Inference

We can notice from the outputs of the cell above that
* The training samples directory, `train_images/`, contains exactly 10616 files, which matches the number of samples in `train.csv`.
* Not all the training images have masks: the number of training samples is **10616**, whereas the number of masks is only **10516**. This means there are exactly 100 training samples that do not have a mask.
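
To see which images lack a mask, and not just how many, a set difference on the file names is enough. The sketch below uses a hypothetical helper, `ids_without_mask`, and made-up file names; it relies only on the `<image_id>_mask.tiff` naming of the mask files.

```python
import os


def ids_without_mask(image_files, mask_files):
    """Return the sorted image IDs that have no matching "<image_id>_mask.tiff" file."""
    image_ids = {os.path.splitext(os.path.basename(f))[0] for f in image_files}
    mask_ids = {os.path.basename(f).replace("_mask.tiff", "") for f in mask_files}
    return sorted(image_ids - mask_ids)


# Toy file names (made up); in the notebook these would come from
# glob(f"{train_dir}/*") and glob(f"{train_mask_dir}/*").
images = ["train_images/aaa.tiff", "train_images/bbb.tiff", "train_images/ccc.tiff"]
masks = ["train_label_masks/aaa_mask.tiff", "train_label_masks/ccc_mask.tiff"]
print(ids_without_mask(images, masks))  # prints ['bbb']
```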

#### About the Output Variables

The whole process of testing for and detecting Prostate Cancer yields two output variables:

* **isup_grade**: the target variable, representing the severity of the cancer on a `0-5` scale.
* **gleason_score**: an alternate cancer severity rating system with more levels than the ISUP scale, as you can see in the figure below, which depicts how the Gleason score and ISUP grade systems compare. Recall that this is provided for the training data only.

<center>
<table class="image" style="table-layout:fixed; width:80%; min-width:200px; max-width:400;">
    <tr>
        <img src="https://storage.googleapis.com/kaggle-media/competitions/PANDA/Screen%20Shot%202020-04-08%20at%202.03.53%20PM.png" width="900px">
    </tr>
    <caption align="bottom">
        <center>
            An illustration of the Gleason grading process for an example biopsy containing prostate cancer. The most common (blue outline, Gleason pattern 3) and second most common (red outline, Gleason pattern 4) cancer growth patterns present in the biopsy dictate the Gleason score (3+4 for this biopsy), which in turn is converted into an ISUP grade (2 for this biopsy) following guidelines of the International Society of Urological Pathology. Biopsies not containing cancer are represented by an ISUP grade of 0 in this challenge.
        </center>
    </caption>
</table>
</center>


#### Checking Missing Information

In [None]:
train_ids.isnull().sum()

#### Checking of `test.csv` File Content

In [None]:
display(test_ids.head())
###
print(f"Size of test samples whose IDs are in test.csv: {test_ids.shape[0]}")

# Performing of Some Basic EDA

## Checking of the Size of Training Samples and `Training label masks`



In [None]:
df = pd.DataFrame({"data_type": ["train_images", "train_label_masks"],
                   "count": [len(glob(f"{train_dir}/*")), len(glob(f"{train_mask_dir}/*"))]}).set_index("data_type")

display(df.style.background_gradient(cmap="Blues"))
####
fig, ax = plt.subplots(figsize=[6,8])

val_ax = sns.barplot(x=df.index, y=df["count"], palette="deep", ax=ax)
for i, v in enumerate(df["count"]):  # Iterate over scalar counts rather than 1-element rows
    ax.text(val_ax.get_xticks()[i], v, str(int(v)),
            ha="center", fontsize=13)
ax.set_ylabel("count")
ax.set_title("Training sample Size\n")
ax.set_xlabel("\ndata type")
fig.tight_layout()

### Inference

As we already highlighted in the previous section, the outputs of the cell above show that:
* Not all the training images have masks: the number of training samples is **10616**, whereas the number of masks is only **10516**.

## Checking of the Distribution of `Data Provider`, `ISUP Grade` and `Gleason Score`

### Auxiliary Function

In [None]:
def distribution_plot(df, feature, ax, title=""):
    total = float(len(df))
    order = df[feature].value_counts().index # Categories sorted by decreasing count
    sns.countplot(x=df[feature], order=order, ax=ax)
    
    if feature=="gleason_score":
        # Label the ticks with the plotted categories, in the same order as the bars
        ax.set_xticklabels(order, rotation=45) 
    ax.set_title(title)
    if feature!="data_provider":
        ax.set_ylabel("")
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x()+p.get_width()/2.,
                height+3,
                "{:1.1f}%".format(100*height/total),
                ha="center",
                fontsize=12) 

### Summary of Counts

In [None]:
df = train_ids.reset_index()
temp1 = df.groupby("data_provider").count()["image_id"].reset_index().sort_values(by="image_id",ascending=False).rename(columns={"image_id": "count"})
temp2 = df.groupby("isup_grade").count()["image_id"].reset_index().sort_values(by="image_id", ascending=False).rename(columns={"image_id": "count"})
temp3 = df.groupby("gleason_score").count()["image_id"].reset_index().sort_values(by="image_id",ascending=False).rename(columns={"image_id": "count"})
display(temp1.style.background_gradient(cmap="Blues"))
display(temp2.style.background_gradient(cmap="Blues"))
display(temp3.style.background_gradient(cmap="Blues"))

### Visualization

In [None]:
df = train_ids.reset_index()
fig = plt.figure(figsize=(22, 8))
ax = [fig.add_subplot(1, 3, 1),
      fig.add_subplot(1, 3, 2),
      fig.add_subplot(1, 3, 3)]
###
title0 = "Data provider"
title1 = "ISUP grade"
title2 = "Gleason score"
###
distribution_plot(df=df, feature="data_provider", ax=ax[0], title=title0)
distribution_plot(df=df, feature="isup_grade", ax=ax[1], title=title1)
distribution_plot(df=df, feature="gleason_score", ax=ax[2], title=title2)
###
fig.suptitle("Distribution plot (count and %)", y=1.1)
fig.tight_layout()
plt.show()

### Inferences on the Distributions

#### About the Distribution of `ISUP Grade`

* The majority of data samples in the train set have an ISUP grade of 0 or 1 (more than 50% in total).
* The rest of the data samples have ISUP grades from 2 to 5, each accounting for roughly 11-12% of the data.
* The dataset is not balanced in terms of isup_grade.


#### About the Distribution of `Gleason Score`

* From the graph above, it is clear that the gleason_score distribution is not uniform.
* A few gleason_score values like (3+3) and (0+0) are frequent, while others like (3+5) and (5+3) are very rare in this dataset.
* The dataset is not balanced in terms of gleason_score.


## Checking for the Relative Distributions (RDs)
1. `ISUP Grade` and `Data Provider`
1. `Gleason Score` and `Data Provider`


In [None]:
df = train_ids.reset_index()
fig, ax = plt.subplots(1, 2, figsize=[20,8], sharey=True)

val1 = df.groupby(["isup_grade", "data_provider"]).count()["image_id"].unstack().plot(kind="bar", ax=ax[0])
val2 = df.groupby(["gleason_score", "data_provider"]).count()["image_id"].unstack().plot(kind="bar", ax=ax[1])

total = df.shape[0]
for k, p in enumerate(ax[0].patches):
    height = p.get_height()
    ax[0].text(p.get_x()+p.get_width()/2.,
            height + 3,
            "{:1.1f}%".format(100*height/total),
            ha="center",
            fontsize=12, rotation=0)
        
for k, p in enumerate(ax[1].patches):
    height = p.get_height()
    ax[1].text(p.get_x()+p.get_width()/2.,
               height + 3,
               "{:1.1f}%".format(100*height/total),
               ha="center",
               fontsize=12, rotation=70)
    
for label in ax[0].get_xticklabels():
    label.set_rotation(0)
for label in ax[1].get_xticklabels():
    label.set_rotation(0)


ax[0].set_ylabel("count")
ax[0].set_title("ISUP Grade/Data Provider\n")
ax[1].set_title("Gleason Score/Data Provider\n")
fig.suptitle("Relative Distribution plot (count and %)", y=1.1)

fig.tight_layout();

### Inferences on the Relative Distributions

#### About `ISUP Grade` and `Data Provider`

* In isup_grade categories 0 and 1, most of the data is provided by `karolinska`.
* In isup_grade categories 3, 4 and 5, most of the data is provided by `radboud`.

#### About `Gleason Score` and `Data Provider`

* In gleason_score category (0+0), all the data is provided by `karolinska`.
* In gleason_score category (negative), all the data is provided by `radboud`.
* Also in gleason_score category (3+3), `karolinska` is the major data provider.
* On the other hand, `radboud` is the major data provider for (4+4), (4+3), (4+5), (5+4), (5+5), (5+3), (3+5).


# Some Advanced EDA

In this section, we load a few image samples from the training image directory and learn some of their properties. The PANDA dataset consists of whole-slide images (WSIs) stored in the **Tagged Image File Format** (`.tiff`). Recall that `.tiff` is a common format for exchanging raster graphics (bitmap) images between application programs, especially those used for scanner images, and that `.tiff` files can be in any of several classes, including gray scale, color palette, or RGB full color. The WSIs can be loaded and processed with both the `skimage` and `openslide` libraries in Python. Each library has useful functions of its own for processing WSIs; for example, one of the key benefits of `openslide` is that we can load arbitrary regions of the slide without having to load the whole image into memory. See the [OpenSlide Python Documentation](https://openslide.org/api/python/) to read more about OpenSlide Python.
 

## Loading and Quickly Displaying a few WSIs

Let's randomly select a few WSIs from the training samples and play with them to get some insights.

In [None]:
### Randomly sample 9 training samples from our dataframe train_ids
np.random.seed(13)
WSI9 = train_ids.sample(n=9)
display(WSI9.style.background_gradient(cmap="Blues"))

### Checking of some Generalizable Properties on the Data

In [None]:
###  Let's open the first two of the 9 images selected above
print(f"Data provider: {WSI9.data_provider.tolist()[0]}\n")
file_path1 = os.path.sep.join([train_dir, WSI9.index[0]+".tiff"]) # Full file path
example_slide1 = OS.OpenSlide(file_path1) # Opening without reading the image into memory
print(f"Level dimensions: {example_slide1.level_dimensions}\n")

for prop in example_slide1.properties.keys():
    print("{}: -> {}".format(prop, example_slide1.properties[prop]))

In [None]:
print(f"Data provider: {WSI9.data_provider.tolist()[1]}\n")
file_path2 = os.path.sep.join([train_dir, WSI9.index[1]+".tiff"]) # Full file directory
example_slide2 = OS.OpenSlide(file_path2) # Opening without reading the image into memory
print(f"Level dimensions: {example_slide2.level_dimensions}\n")

for prop in example_slide2.properties.keys():
    print("{}: -> {}".format(prop, example_slide2.properties[prop]))


### Some Few Insights

The outputs of the two cells above show the properties of two of the 9 randomly selected training WSIs. What we learn about the data is the following:

   - The image dimensions are quite large, typically ranging between about 4 000 and 40 000 pixels in both x and y.
   - Each WSI has 3 levels that can be loaded, corresponding to a downsampling of about 1, 4 and 16, as shown in the output cells above. Intermediate levels can be created by simply downsampling a higher-resolution level.
   - The dimensions of each level differ based on the dimensions of the original image.
   - Biopsies can be in different rotations. This rotation has no clinical value, and is only dependent on how the biopsy was collected in the lab.
   - There are noticeable color differences between the biopsies. This is very common within pathology and is caused by different laboratory procedures.
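
To make the level structure concrete, the sketch below derives the downsampling factor and a rough uncompressed size for each level from a `level_dimensions` tuple. The helper name `level_summary`, the example dimensions and the 3 bytes per pixel (plain RGB) are assumptions chosen for illustration, not values taken from a specific slide.

```python
def level_summary(level_dimensions, bytes_per_pixel=3):
    """Return (downsample factor, uncompressed size in MB) for every level."""
    base_width, _ = level_dimensions[0]
    summary = []
    for width, height in level_dimensions:
        downsample = base_width / width                    # relative to level 0
        size_mb = width * height * bytes_per_pixel / 1e6   # raw RGB, no compression
        summary.append((round(downsample), size_mb))
    return summary


# Hypothetical dimensions in the (width, height) format of `level_dimensions`.
dims = ((20000, 32000), (5000, 8000), (1250, 2000))
for level, (factor, size_mb) in enumerate(level_summary(dims)):
    print(f"level {level}: downsample ~{factor}x, ~{size_mb:.1f} MB uncompressed")
```

With dimensions of this order, level 0 alone takes close to 2 GB once decoded, which is why the cells in this notebook read the last (smallest) level when only an overview is needed.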
   

 Besides these WSI properties, I associated a $512\times 512$ zoomed part of the image around $(x,y)=(1024,1024)$. As you can observe in the figure below, on some images a $512\times 512$ zoom starting from $(x,y)=(1024,1024)$ doesn't capture the biopsy.
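
Reading such a zoomed patch could look like the sketch below. `read_patch` is a hypothetical helper; it relies on two documented OpenSlide behaviours, namely that `read_region(location, level, size)` takes its location in level-0 coordinates and that `level_downsamples` holds the factor of each level. `FakeSlide` is only a stand-in so that the sketch runs without a `.tiff` file.

```python
def read_patch(slide, x, y, level=0, size=512):
    """Read a size x size patch whose top-left corner is (x, y) in `level` coordinates.

    `slide` is expected to behave like an `openslide.OpenSlide` object.
    """
    # read_region() expects the location in level-0 coordinates, so scale it up.
    factor = slide.level_downsamples[level]
    location = (int(round(x * factor)), int(round(y * factor)))
    return slide.read_region(location, level, (size, size))


class FakeSlide:
    """Stand-in for openslide.OpenSlide, used here for demonstration only."""
    level_downsamples = (1.0, 4.0, 16.0)

    def read_region(self, location, level, size):
        # A real slide returns a PIL image; the stand-in just echoes its arguments.
        return {"location": location, "level": level, "size": size}


demo = read_patch(FakeSlide(), x=256, y=256, level=1)
print(demo)
```

With a real slide, `read_patch(example_slide1, 1024, 1024)` would return the $512\times 512$ region at level 0 as a PIL image.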

In [None]:
### Visualization based on OpenSlide and Matplotlib packages
def plot_WSIs(slides): 
    fig, ax = plt.subplots(nrows=3, ncols=3, figsize=(18,18))
    for i, slide in enumerate(slides):
        file_path = os.path.sep.join([train_dir, slide+".tiff"]) # Full file directory
        image = OS.OpenSlide(file_path) # Opening without reading the image into memory
        
        # Creation of the patch to plot
        patch = image.read_region(location=(0,0),
                                  level=image.level_count-1,  # Get the last level/slide
                                  size=image.level_dimensions[-1]) # Get the dimension corresponding of the last level
        
        # Plot the patch
        ax[i//3, i%3].imshow(patch)
        image.close()
        ax[i//3, i%3].axis("on")
        
        image_id = slide
        data_provider = train_ids.loc[slide, "data_provider"]
        isup_grade = train_ids.loc[slide, 'isup_grade']
        gleason_score = train_ids.loc[slide, 'gleason_score']
        ax[i//3, i%3].set_title(f"\nID: ~{image_id[:7]}, Source: {data_provider}\nISUP: {isup_grade}, Gleason: {gleason_score}")

    fig.tight_layout()
    fig.suptitle("")
    plt.show()


In [None]:
plot_WSIs(WSI9.index)

### Let's Resize and Visualize the 9 Randomly Selected Biopsy Images


`skimage.io.MultiImage()` together with `cv2.resize()` can be used instead of `openslide.OpenSlide()` and `cv2.resize()`, but the former currently fails on the latest version of the `Kaggle Docker image` because of the `imagecodecs` and `tifffile` packages.

In [None]:
### Visualization based on OpenSlide, OpenCV and Matplotlib packages
def plot_resized_biopsy(slides): 
    fig, ax = plt.subplots(nrows=3, ncols=3, figsize=(18,18))
    for i, slide in enumerate(slides):
        file_path = os.path.sep.join([train_dir, slide+".tiff"]) # Full file path
        image = OS.OpenSlide(file_path) # Opening without reading the image into memory
        
        # Creation of the patch to plot
        patch = image.read_region(location=(0,0),
                                  level=image.level_count-1,  # Get the last (lowest-resolution) level
                                  size=image.level_dimensions[-1]) # Get the dimensions of the last level
        image.close() # Release the file handle before `image` is reused below
        
        # Resize the image patch
        image = cv2.resize(np.asarray(patch), (512, 512))
        
        # Plot the resized image patch
        ax[i//3, i%3].imshow(image)  
        ax[i//3, i%3].axis("on")
        
        image_id = slide
        data_provider = train_ids.loc[slide, "data_provider"]
        isup_grade = train_ids.loc[slide, 'isup_grade']
        gleason_score = train_ids.loc[slide, 'gleason_score']
        ax[i//3, i%3].set_title(f"\nID: ~{image_id[:7]}, Source: {data_provider}\nISUP: {isup_grade}, Gleason: {gleason_score}")

    fig.tight_layout()
    plt.show()

In [None]:
plot_resized_biopsy(WSI9.index)


### Exploration and Analysis of the Training Label Masks

#### Overview
Apart from the slide-level label (present in the csv file), almost all slides in the training set have an associated mask with additional label information. These masks matter because they directly indicate which parts of the tissue are healthy and which are cancerous. However, the information in the masks differs between the two centers:
- **Radboudumc**: the prostate glands are individually labelled and the valid values are:
  - 0: background (non tissue) or unknown
  - 1: stroma (connective tissue, non-epithelium tissue)
  - 2: healthy (benign) epithelium
  - 3: cancerous epithelium (Gleason 3)
  - 4: cancerous epithelium (Gleason 4)
  - 5: cancerous epithelium (Gleason 5)
- **Karolinska**: the regions are labelled and the valid values are:
  - 0: background (non tissue) or unknown
  - 1: benign tissue (stroma and epithelium combined)
  - 2: cancerous tissue (stroma and epithelium combined)

The label masks provided by the Radboudumc center were semi-automatically generated by several deep learning algorithms, contain noise, and can be considered weakly-supervised labels. The label masks provided by the Karolinska center were semi-automatically generated based on annotations by a pathologist.

The label masks are stored in an RGB format so that they can be easily opened by image readers. The label information is stored in the red (R) channel; the other channels are set to zero and can be ignored. As with the slides themselves, the label masks can be opened using OpenSlide.
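
As a minimal sketch of how the red channel can be decoded, the snippet below turns a mask array into the fraction of pixels per label. The dictionaries simply restate the label lists above; the helper name `label_fractions` and the tiny synthetic mask are made up for illustration.

```python
import numpy as np

RADBOUD_LABELS = {0: "background or unknown", 1: "stroma", 2: "benign epithelium",
                  3: "Gleason 3", 4: "Gleason 4", 5: "Gleason 5"}
KAROLINSKA_LABELS = {0: "background or unknown", 1: "benign tissue", 2: "cancerous tissue"}


def label_fractions(mask_rgb, label_names):
    """Return {label name: fraction of pixels} from the red channel of an RGB(A) mask."""
    labels = np.asarray(mask_rgb)[:, :, 0]  # label values live in the red channel
    values, counts = np.unique(labels, return_counts=True)
    return {label_names[int(v)]: float(c) / labels.size for v, c in zip(values, counts)}


# Synthetic 4x4 Radboud-style mask (made up for illustration).
toy = np.zeros((4, 4, 3), dtype=np.uint8)
toy[:2, :, 0] = 1   # top half: stroma
toy[2, :, 0] = 3    # one row of Gleason 3
print(label_fractions(toy, RADBOUD_LABELS))
```

The same function works on `np.asarray(mask_data)` from `read_region()`, since the extra alpha channel is simply ignored.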

#### Loading of label masks and Visualization

In [None]:
def plot_biopsy_masks(slides): 
    fig, ax = plt.subplots(3,3, figsize=(18,18))
    for i, slide in enumerate(slides):
        
        file_path = os.path.sep.join([train_mask_dir, slide+"_mask.tiff"]) # Full file directory
        biopsy_mask = OS.OpenSlide(file_path) # Opening without reading the image into memory
        
        # Creation of the patch to plot
        mask_data = biopsy_mask.read_region(location=(0,0),
                                            level=biopsy_mask.level_count - 1,  # Get the last level/slide
                                            size=biopsy_mask.level_dimensions[-1]) # Get the dimension corresponding of the last level
    
        # Plot
        cmap = mpl.colors.ListedColormap(["black", "gray", "green", "yellow", "orange", "red"])
        ax[i//3, i%3].imshow(np.asarray(mask_data)[:,:,0], cmap=cmap, interpolation="nearest", vmin=0, vmax=5) 
        biopsy_mask.close()       
        ax[i//3, i%3].axis("on")
        
        image_id = slide
        data_provider = train_ids.loc[slide, "data_provider"]
        isup_grade = train_ids.loc[slide, "isup_grade"]
        gleason_score = train_ids.loc[slide, "gleason_score"]
        ax[i//3, i%3].set_title(f"\nID: {image_id[:7]} Source: {data_provider}\nISUP: {isup_grade} Gleason: {gleason_score}")
    
    fig.tight_layout()    
    plt.show()

In [None]:
plot_biopsy_masks(WSI9.index)

#### Prepare the Training Sample Images with their Corresponding Masks.

Recall that we showed earlier that there are exactly 100 training images whose label masks are not available in the mask directory. So, we are going to pair each training sample with its corresponding mask.

In [None]:
train_df = train_ids.reset_index() # In train_ids, I set index to "image_id"

masks = os.listdir(train_mask_dir)
masks_df = pd.DataFrame(data={"mask_id": masks})
masks_df["image_id"] = masks_df.mask_id.apply(lambda x: x.split("_")[0]) # Recall mask_id=image_id+"_mask.tiff"

train_df = pd.merge(train_df, masks_df, on="image_id", how="outer")
print(f"We have {train_df.shape[0]} training sample images and {masks_df.shape[0]} masks. So, there will be exactly {len(train_df[~train_df.mask_id.isna()])} images in the final training samples.")
display(train_df.head(10).style.background_gradient(cmap="Blues"))

#### Plotting of some Images and Masks According to `isup_grade` Category


**Auxiliary functions for this task:**

In [None]:
def load_and_resize_biopsy(img_id):
    
    file_path = os.path.sep.join([train_dir, img_id+".tiff"]) # Full file path
    biopsy_img = OS.OpenSlide(file_path) # Opening without reading the image into memory

    # Creation of the patch to plot
    patch = biopsy_img.read_region(location=(0,0),
                                   level=biopsy_img.level_count-1,  # Get the last (lowest-resolution) level
                                   size=biopsy_img.level_dimensions[-1]) # Get the dimensions of the last level
    biopsy_img.close() # Release the file handle once the patch is in memory

    # Resize the image patch
    image = cv2.resize(np.asarray(patch), (512, 512))
    
    return image

def load_and_resize_biopsy_mask(img_id):
    
    file_path = os.path.sep.join([train_mask_dir, img_id+"_mask.tiff"]) # Full file path
    biopsy_mask = OS.OpenSlide(file_path) # Opening without reading the image into memory

    # Creation of the patch to plot
    patch = biopsy_mask.read_region(location=(0,0),
                                    level=biopsy_mask.level_count-1,  # Get the last (lowest-resolution) level
                                    size=biopsy_mask.level_dimensions[-1]) # Get the dimensions of the last level
    biopsy_mask.close() # Release the file handle once the patch is in memory

    # Resize the mask patch; nearest-neighbour interpolation keeps the label values intact
    # (the default bilinear interpolation would blend neighbouring labels into invalid ones)
    mask = cv2.resize(np.asarray(patch), (512, 512), interpolation=cv2.INTER_NEAREST)[:,:,0]
    
    return mask

#### Visualization

In [None]:
### Visualization
data_providers = train_df.data_provider.unique().tolist()
# cmap = mpl.colors.ListedColormap(["black", "gray", "green", "yellow", "orange", "red"])
cmap_rad = mpl.colors.ListedColormap(["white", "lightgrey", "green", "orange", "red", "darkred"])
cmap_kar = mpl.colors.ListedColormap(["white", "green", "red"])
for grade in range(train_ids.isup_grade.nunique()):
    labels = []  # Row labels, reset for each figure
    fig, ax = plt.subplots(nrows=4, ncols=4, figsize=(22, 22))

    for i, row in enumerate(ax):
        idx = i//2
        temp_idx = (train_df.isup_grade == grade) & (train_df.data_provider == data_providers[idx])
        temp = train_df[temp_idx].image_id.tail(4).reset_index(drop=True)
        if i%2 < 1:
            labels.append(f"{data_providers[idx]}\n(image)")
            for j, col in enumerate(row):
                col.imshow(load_and_resize_biopsy(temp[j]))
                col.set_title(f"\nID: {temp[j][:13]} $\\cdots$")  # Escaped backslash avoids an invalid "\c" escape
                
        else:
            labels.append(f"{data_providers[idx]}\n(mask)")
            for j, col in enumerate(row):
                if data_providers[idx] == "radboud":
                    col.imshow(load_and_resize_biopsy_mask(temp[j]), 
                               cmap = cmap_rad, 
                               norm = mpl.colors.Normalize(vmin=0, vmax=5, clip=True))
                else:
                    col.imshow(load_and_resize_biopsy_mask(temp[j]),
                               cmap = cmap_kar,
                               norm = mpl.colors.Normalize(vmin=0, vmax=2, clip=True))
                    
                gleason_score = train_ids.loc[temp[j], "gleason_score"]
                col.set_title(f"\nID: {temp[j][:13]} $\\cdots$")  # Escaped backslash avoids an invalid "\c" escape
        
    for row, r in zip(ax[:,0], labels):
        row.set_ylabel(r, rotation=0, labelpad=30, fontsize=20)  # `size` and `fontsize` are aliases, so pass only one
    
    fig.tight_layout()
    fig.suptitle(f"ISUP Grade {grade}", y=1.01, fontsize=23)
    plt.show()

### Overlaying of Training Label Masks on their Corresponding Whole Slide Images

Because each mask has the same dimensions as its corresponding slide, we can overlay the mask on the tissue to see directly which areas are cancerous. This overlay can help you identify the different growth patterns. To do this, we load both the mask and the biopsy with `OS.OpenSlide()` and merge them using `PIL.Image.composite()`.
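To make the merging step concrete, here is a minimal sketch on synthetic data. The 4x4 "slide", the label array, and the colour choice are made up for illustration and are not PANDA data; only `Image.composite` and the alpha arithmetic mirror what the overlay relies on.

```python
import numpy as np
from PIL import Image

# Synthetic stand-ins for a slide patch and its label mask (not real PANDA data).
slide = Image.new("RGB", (4, 4), color=(200, 150, 200))
labels = np.zeros((4, 4), dtype=np.uint8)
labels[1:3, 1:3] = 2  # Pretend the centre is labelled as cancerous tissue

# Colour the mask: label 0 stays black, label 2 becomes red.
mask_colors = np.zeros((4, 4, 3), dtype=np.uint8)
mask_colors[labels == 2] = (255, 0, 0)
mask_rgb = Image.fromarray(mask_colors)

# Alpha mask: 255 keeps the slide pixel, lower values let the mask colour show through.
alpha = 0.8
alpha_int = int(round(255 * alpha))
alpha_values = np.where(labels == 0, 255, 255 - alpha_int).astype(np.uint8)
alpha_mask = Image.fromarray(alpha_values)

# Where alpha_mask is 255 the result is the slide; elsewhere it is mostly mask colour.
overlay = Image.composite(slide, mask_rgb, alpha_mask)
```

Unlabelled pixels keep the original slide colour, while labelled pixels become a blend dominated by the mask colour, so the tissue texture remains faintly visible underneath.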

#### Auxiliary functions for this task

After creating the mask patch to visualize, we apply the `split()` method to it. This method splits an image into its individual bands (channels) and returns them as a tuple of single-band images. For example, splitting an "RGB" image creates three new images, each containing a copy of one of the original bands (red, green, blue).
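As a quick illustration of this behaviour, the sketch below splits a tiny synthetic image; the pixel values are arbitrary. It also shows the "RGBA" case, since OpenSlide's `read_region` returns RGBA images.

```python
from PIL import Image

# A tiny synthetic RGB image where every pixel is (10, 20, 30).
rgb = Image.new("RGB", (2, 2), color=(10, 20, 30))

# split() returns one single-band ("L" mode) image per channel.
red, green, blue = rgb.split()

# An RGBA image has a fourth band holding the alpha channel.
rgba_bands = rgb.convert("RGBA").split()
```

Taking index `[0]` of the returned tuple therefore selects the first (red) band as a single-channel image.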

In [None]:
def mask_on_slide_overlayer(images, center="radboud", fig_title="", alpha=0.8, max_size=(800, 800)):
    """Show a mask overlayed on a slide."""
    fig, ax = plt.subplots(nrows=3,ncols=3, figsize=(18,18))
    
    
    for i, image_id in enumerate(images):
        # Open a slide (WSI)
        wsi_file_path = os.path.sep.join([train_dir, image_id+".tiff"]) # Full file path
        biopsy_img = OS.OpenSlide(wsi_file_path) # Open the slide without reading the whole image into memory
        # Open the corresponding mask
        mask_file_path = os.path.sep.join([train_mask_dir, image_id+"_mask.tiff"])
        biopsy_mask = OS.OpenSlide(mask_file_path)

        # Create the patches to visualize
        patch_img = biopsy_img.read_region(location=(0,0),
                                           level=biopsy_img.level_count-1,  # Use the last (lowest-resolution) level
                                           size=biopsy_img.level_dimensions[-1]) # Dimensions of the last level
        
        patch_mask = biopsy_mask.read_region(location=(0,0),
                                             level=biopsy_mask.level_count-1,
                                             size=biopsy_mask.level_dimensions[-1])


        # Keep only the first channel of the mask patch, which holds the label values
        patch_mask = patch_mask.split()[0]
        
        # Create alpha mask: pixels below the label threshold keep the slide fully visible
        alpha_int = int(round(255*alpha))
        if center == "radboud":
            alpha_content = np.less(patch_mask, 2).astype('uint8') * alpha_int + (255 - alpha_int)
        elif center == "karolinska":
            alpha_content = np.less(patch_mask, 1).astype('uint8') * alpha_int + (255 - alpha_int)

        alpha_content = Image.fromarray(alpha_content)
        preview_palette = np.zeros(shape=768, dtype=int)

        if center == "radboud":
            # Mapping: {0: background, 1: stroma, 2: benign epithelium, 3: Gleason 3, 4: Gleason 4, 5: Gleason 5}
            preview_palette[0:18] = (np.array([0, 0, 0, 0.5, 0.5, 0.5, 0, 1, 0, 1, 1, 0.7, 1, 0.5, 0, 1, 0, 0]) * 255).astype(int)
        elif center == "karolinska":
            # Mapping: {0: background, 1: benign, 2: cancer}
            preview_palette[0:9] = (np.array([0, 0, 0, 0, 1, 0, 1, 0, 0]) * 255).astype(int)

        patch_mask.putpalette(data=preview_palette.tolist())
        mask_rgb = patch_mask.convert(mode="RGB")
        
        # Overlay the mask on its corresponding slide
        overlayed_image = Image.composite(image1=patch_img, image2=mask_rgb, mask=alpha_content)
        overlayed_image.thumbnail(size=max_size, resample=0)

        # Plot the overlayed image
        ax[i//3, i%3].imshow(overlayed_image) 
        biopsy_img.close()
        biopsy_mask.close()       
        ax[i//3, i%3].axis("on")
        
        data_provider = train_ids.loc[image_id, "data_provider"]
        isup_grade = train_ids.loc[image_id, "isup_grade"]
        gleason_score = train_ids.loc[image_id, "gleason_score"]
        ax[i//3, i%3].set_title(f"\nID: {image_id[:7]} $\\cdots$, Source: {data_provider}\nISUP: {isup_grade} Gleason: {gleason_score}")
    
    fig.suptitle(fig_title, y=1.01, fontsize=23)
    fig.tight_layout()
    plt.show()

#### Visualization

In [None]:
fig_title = "Some Examples of Overlayed Images"
mask_on_slide_overlayer(images=WSI9.index, fig_title=fig_title)

#### A Few Insights

As you can see in the output of the cell above, which overlays masks on slides, some slides carry pen markings (dark green smudges). As explained in the competition overview, these markings are not part of the tissue; they were made by the pathologists who originally checked the case. The competition organizers note that *slightly different procedures were in place for the images used in the test set than the training set. Some of the training set images have stray pen marks on them, but the test set slides are free of pen marks.*

### Let's Explore a Few of these Images with Pen Markings


In [None]:
WSI9.index[0]

In [None]:
pen_marked_images = ["ca0798453868081bc8aeeabb01847d4e",
                     "ff10f937c3d52eff6ad4dd733f2bc3ac",
                     "e9a4f528b33479412ee019e155e1a197",
                     "fd6fe1a3985b17d067f2cb4d5bc1e6e1",
                     "f39bf22d9a2f313425ee201932bac91a",
                     "fb01a0a69517bb47d7f4699b6217f69d",
                     "ebb6a080d72e09f6481721ef9f88c472",
                     "feee2e895355a921f2b75b54debad328",
                     "ebb6d5ca45942536f78beb451ee43cc4"]


fig_title = "Some Examples of Pen Marked Images"
mask_on_slide_overlayer(images=pen_marked_images, fig_title=fig_title)

## Distributions of the Dimensions (`width`, `height`) of Training Samples

### Extraction of Dimensions (`width`, `height`) and Merging with the Train Data

In [None]:
trn_df = train_df.copy()
dims, spacings = [], []

for img_id in trn_df.reset_index().image_id:
    # Open a slide/wsi
    wsi_file_path = os.path.sep.join([train_dir, img_id+".tiff"]) # Full file directory
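    biopsy_img = OS.OpenSlide(wsi_file_path) # Open the slide without reading the whole image into memory
    
    # Pixel spacing in microns per pixel (assumes tiff.XResolution is given in pixels per centimeter)
    spacing = 1 / (float(biopsy_img.properties["tiff.XResolution"]) / 10000)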
    dims.append(biopsy_img.dimensions)
    spacings.append(spacing)
    biopsy_img.close()
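The spacing formula used in the loop above can be checked with a small worked example. This is a sketch under the assumption that `tiff.XResolution` is expressed in pixels per centimeter; the helper name `spacing_from_x_resolution` and the input value are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def spacing_from_x_resolution(x_resolution):
    """Convert a resolution in pixels per centimeter to microns per pixel.

    Assumption: the TIFF resolution unit is the centimeter (1 cm = 10,000 microns).
    """
    pixels_per_micron = float(x_resolution) / 10000
    return 1 / pixels_per_micron

# OpenSlide exposes properties as strings, so the helper accepts one.
example_spacing = spacing_from_x_resolution("20000")  # 20,000 px/cm -> 0.5 microns per pixel
```

A higher `XResolution` means more pixels per centimeter, and therefore a smaller physical distance covered by each pixel.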

In [None]:
trn_df["spacing"] = spacings
trn_df["width"]  = [i[0] for i in dims]
trn_df["height"] = [i[1] for i in dims]

display(trn_df.head(10).style.background_gradient(cmap="Blues"))

### Visualization of Some Insightful Distributions


#### Auxiliary Function

In [None]:
def plot_distribution_grouped(feature, feature_group, ax):
    for feat in trn_df[feature_group].unique():
        df = trn_df.loc[trn_df[feature_group] == feat]
        sns.kdeplot(df[feature], label=feat, ax=ax, fill=True)  # `fill` replaces the deprecated `shade` (seaborn >= 0.11)
    ax.set_title(f"Images {feature}\ngrouped by {feature_group}\n")
    ax.legend()

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(15,6), dpi=200, sharey=True)
###
# `fill` replaces the deprecated `shade` argument (seaborn >= 0.11)
sns.kdeplot(trn_df["width"], ax=ax[0], fill=True, label="width")
sns.kdeplot(trn_df["height"], ax=ax[0], fill=True, label="height")
ax[0].set_xlabel("dimension")
ax[0].set_title("Images Width and Height\n")
ax[0].legend()
###
plot_distribution_grouped(feature="width", feature_group="data_provider", ax=ax[1])
plot_distribution_grouped(feature="height", feature_group="data_provider", ax=ax[2])

fig.suptitle("Distribution Plots", y=1.1)
fig.tight_layout()
plt.show()

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12,6), dpi=200, sharey=True)
###
plot_distribution_grouped(feature="width", feature_group="isup_grade", ax=ax[0])
plot_distribution_grouped(feature="height", feature_group="isup_grade", ax=ax[1])

fig.suptitle("Distribution by ISUP Grade", y=1.1)
fig.tight_layout()
plt.show()

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12,6), dpi=200, sharey=True)
###
plot_distribution_grouped(feature="width", feature_group="gleason_score", ax=ax[0])
plot_distribution_grouped(feature="height", feature_group="gleason_score", ax=ax[1])

fig.suptitle("Distribution by Gleason Score", y=1.1)
fig.tight_layout()
plt.show()

#### Inferences on the Distributions

We can see that, overall as well as within each `ISUP Grade` and `Gleason Score` level, width and height have very similar distributions.
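One way to complement this visual impression with numbers is to compare the median and interquartile range of `width` and `height` per group. The sketch below is only an illustration: the helper `summarize_dimensions` is hypothetical, and the toy DataFrame stands in for `trn_df`.

```python
import pandas as pd

def iqr(series):
    """Interquartile range: spread of the middle 50% of the values."""
    return series.quantile(0.75) - series.quantile(0.25)

def summarize_dimensions(df, group_col):
    """Return the median and IQR of width and height for each group."""
    return df.groupby(group_col)[["width", "height"]].agg(["median", iqr])

# Toy data standing in for trn_df (values are made up).
toy_df = pd.DataFrame({
    "isup_grade": [0, 0, 1, 1],
    "width": [100, 200, 300, 500],
    "height": [10, 30, 20, 40],
})
summary = summarize_dimensions(toy_df, "isup_grade")
```

On the real data, a call such as `summarize_dimensions(trn_df, "isup_grade")` would give one row per grade, which makes it easier to judge how close the distributions actually are.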

## Inspirational References


The following sources and kernels were of great help in the creation of this notebook:

1. The main domain knowledge was taken from the following sources:
   - https://www.kaggle.com/c/prostate-cancer-grade-assessment/overview
   - https://www.kaggle.com/c/prostate-cancer-grade-assessment/data
   - https://zenodo.org/record/3715938#.XxR-85MzZQL
   - https://emedicine.medscape.com/article/1612022-overview
1. https://www.kaggle.com/gpreda/panda-challenge-starting-eda
1. https://www.kaggle.com/iamleonie/panda-eda-visualizations-suspicious-data
   