<h1 align="center"><a href="https://github.com/sborquez/her2bdl"> Her2BDL</a> - Her2 Bayesian Deep Learning</h1>

<br>
<img src="images/utfsm.png" width="50%"/>

<h2 align="center">Exploratory Data Analysis</h2>

<center>
<i> Notebook created by Sebastián Bórquez G. - <a href="mailto:sebastian.borquez@sansano.usm.cl">sebastian.borquez@sansano.usm.cl</a> - UTFSM - August 2020.</i>
</center>


# Setup Notebook


## (Option A) Colab Setup

Connect your `Google Drive` and install dependencies.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%cd "/content/drive/<Path to Project>"
!ls

## (Option B) Local Setup

Change to the project root directory.

In [None]:
%cd ..

## Import Modules

In [None]:
# Her2BDL package
from her2bdl import *

# Third-party modules
import numpy as np
import pandas as pd
import cv2
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
from IPython.display import display, HTML
%matplotlib inline

# Exploratory Data Analysis

Description and exploration of the datasets, class distribution, and examples.


## Warwick Her2 Scoring contest

### Description

Whole-slide images (WSIs) are generally high-resolution gigapixel images obtained by scanning conventional glass slides. They are normally stored in pyramid structures containing several levels, each with a different resolution. Visualizing a region of interest (ROI) from these images requires specially designed libraries or tools. OpenSlide is one of the commonly used libraries and provides a simple interface for reading WSIs.
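
To make the pyramid structure concrete, here is a minimal sketch of how level dimensions shrink from one level to the next. It assumes every level is downsampled by the same integer factor, which is a simplification (real slides may use other ratios), and `pyramid_dimensions` is a hypothetical helper, not part of OpenSlide or Her2BDL.

```python
from typing import List, Tuple


def pyramid_dimensions(
    base_size: Tuple[int, int], levels: int, factor: int = 2
) -> List[Tuple[int, int]]:
    """Return (width, height) per pyramid level; level 0 is full resolution.

    Assumes a constant integer downsample `factor` between levels.
    """
    width, height = base_size
    sizes = []
    for level in range(levels):
        scale = factor ** level
        # Ceiling division so partial pixels at the border are kept.
        sizes.append((-(-width // scale), -(-height // scale)))
    return sizes


# Hypothetical 100,000 x 80,000 pixel slide with 4 levels.
for level, size in enumerate(pyramid_dimensions((100_000, 80_000), levels=4)):
    print(level, size)
```

Each extra level divides both sides by the factor, so the pixel count drops quadratically. This is why low-resolution levels are cheap to load for previews.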

### Source

- https://warwick.ac.uk/fac/sci/dcs/research/tia/her2contest/download/

- Qaiser, Talha, et al. "HER2 challenge contest: a detailed assessment of automated HER2 scoring algorithms in whole slide images of breast cancer tissues." Histopathology 72.2 (2018): 227-238.
https://onlinelibrary.wiley.com/doi/epdf/10.1111/his.13333

- T. Qaiser, N.M. Rajpoot, "Learning Where to See: A Novel Attention Model for Automated Immunohistochemical Scoring", in IEEE Transactions on Medical Imaging, 2019. DOI: 10.1109/TMI.2019.2907049
https://ieeexplore.ieee.org/document/8672928


### Warwick Training Dataset

The training dataset consists of 52 WSIs, with cases equally distributed across the 4 possible HER2 scores (0, 1+, 2+, 3+).

The ground truth for the WSIs is provided in a spreadsheet containing, for each case, the case number, the HER2 score, and the percentage of cells with complete membrane staining irrespective of intensity.

In [None]:
source = 'E:/datasets/medical/warwick/train'

train_dataset = get_dataset(source, include_ground_truth=True)
describe_dataset(train_dataset)
train_dataset.sample(5)
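
As a rough illustration of the ground-truth spreadsheet described above, the sketch below counts cases per HER2 score from CSV text. The rows are made up, and apart from `CaseNo` (which this notebook uses) the column names are assumptions; the real spreadsheet may be laid out differently.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only; not taken from the contest data.
# "CaseNo" appears in this notebook; the other column names are assumptions.
GROUND_TRUTH_CSV = """CaseNo,HER2Score,PercentCompleteMembraneStaining
1,0,0
2,1,5
3,2,40
4,3,90
5,3,85
"""


def count_scores(csv_text: str, score_column: str = "HER2Score") -> Counter:
    """Count how many cases fall into each HER2 score."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(int(row[score_column]) for row in reader)


counts = count_scores(GROUND_TRUTH_CSV)
for score in sorted(counts):
    print(f"score {score}: {counts[score]} case(s)")
```

A per-score count like this is a quick way to confirm the "equally distributed" claim before any splitting is done.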

### Warwick Testing Dataset

The testing dataset contains 28 whole-slide images (WSIs).

In [None]:
source = 'E:/datasets/medical/warwick/test'

test_dataset = get_dataset(source, include_ground_truth=False)
describe_dataset(test_dataset, include_targets=False)
test_dataset.sample(1)

## Training/Test Splits

The training dataset is divided into an 80/20 train/test split.

In [None]:
train = aggregate_dataset(load_dataset("./train/datasets/train.csv"))
test = aggregate_dataset(load_dataset("./train/datasets/test.csv"))
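
The CSV files loaded above were produced elsewhere, so how the 80/20 split was made is not shown here. As one plausible approach, the sketch below performs a stratified split by class. `stratified_split` is a hypothetical helper and is not claimed to match the Her2BDL implementation.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[Hashable], test_ratio: float = 0.2, seed: int = 42
) -> Tuple[List[int], List[int]]:
    """Split indices so each class keeps roughly `test_ratio` of its cases in test."""
    rng = random.Random(seed)
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Round per class, so the overall ratio is only approximate.
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# 52 cases, 13 per HER2 score (0, 1+, 2+, 3+), mirroring the training set size.
labels = [score for score in range(4) for _ in range(13)]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With 13 cases per class, rounding per class gives 3 test cases each, so the result is 40/12 rather than an exact 80/20. Small datasets rarely divide evenly.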

### Class Balance

In [None]:
display_class_distribution(train, target=TARGET, target_labels=TARGET_LABELS, dataset_name="Training")
display_class_distribution(test, target=TARGET, target_labels=TARGET_LABELS, dataset_name="Testing")

## Cross-Validation Splits

The training set is split into K folds for hyperparameter optimization.

In [None]:
cv_splits = prepare_cv_splits(train, 5, seed=42)
for i, (tr, ts) in enumerate(cv_splits, start=1):
    display_class_distribution(tr, target=TARGET, target_labels=TARGET_LABELS, dataset_name=f"Training - Hold out {i}")
    display_class_distribution(ts, target=TARGET, target_labels=TARGET_LABELS, dataset_name=f"Validation - Hold out {i}")
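
For intuition about what a K-fold split does, here is a minimal stratified K-fold sketch in plain Python. It is an assumption about the general technique, not the code behind `prepare_cv_splits`, and `stratified_kfold` is a hypothetical name.

```python
import random
from collections import defaultdict
from typing import Hashable, Iterator, List, Sequence, Tuple


def stratified_kfold(
    labels: Sequence[Hashable], k: int = 5, seed: int = 42
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training_indices, validation_indices) for each of the k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a similar class mix.
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for held_out in range(k):
        validation = sorted(folds[held_out])
        training = sorted(
            i for f, fold in enumerate(folds) if f != held_out for i in fold
        )
        yield training, validation


labels = [score for score in range(4) for _ in range(10)]  # 40 made-up cases
for fold, (tr, va) in enumerate(stratified_kfold(labels, k=5), start=1):
    print(f"fold {fold}: {len(tr)} training, {len(va)} validation")
```

Every case is used for validation exactly once, and each fold keeps the class proportions of the full set. This keeps per-fold metrics comparable.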


## Training/Validation for Best Parameters

The model with the best cross-validation performance is retrained on a new split of the training set. This 85/15 training/validation split (`validation_ratio=0.15`) is used for early stopping and for detecting overfitting.

In [None]:
# Split the training set loaded above (assumed to be the intended input).
train_2, val_2 = split_dataset(train, validation_ratio=0.15, seed=42)
display_class_distribution(train_2, target=TARGET, target_labels=TARGET_LABELS, dataset_name="Training - Best Parameters")
display_class_distribution(val_2, target=TARGET, target_labels=TARGET_LABELS, dataset_name="Validation - Best Parameters");
# Uncomment to save the new splits.
#save_dataset(train_2, output_folder="./train/datasets/best_parameters", dataset_name="training")
#save_dataset(val_2, output_folder="./train/datasets/best_parameters", dataset_name="validation")
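
To show how a validation split can drive early stopping, the sketch below applies a simple patience rule to a list of validation losses. The losses are made up, and `early_stopping_epoch` is a hypothetical helper; the training code may rely on a framework callback instead.

```python
from typing import Sequence


def early_stopping_epoch(
    val_losses: Sequence[float], patience: int = 3, min_delta: float = 0.0
) -> int:
    """Return the 0-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs. If that never
    happens, the last epoch is returned.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


# Made-up validation losses: improvement stalls after epoch 3.
losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.57, 0.60]
print(early_stopping_epoch(losses, patience=3))  # prints 6
```

A validation loss that rises while the training loss keeps falling is the usual sign of overfitting. The patience window avoids stopping on a single noisy epoch.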

## WSI: Sample images

A whole-slide image is a digital representation of a microscopic slide, typically at a very high level of magnification such as 20x or 40x. As a result of this high magnification, whole slide images are typically very large in size. The maximum file size for a single whole-slide image in our training data set is 3.4 GB, with an average over 1 GB. [[source](https://developer.ibm.com/articles/an-automatic-method-to-identify-tissues-from-big-whole-slide-images-pt1/)]

We can use the [OpenSlide](https://openslide.org/api/python) project to read a variety of whole-slide image formats. These are pyramidal, tiled formats, in which the massive slide is composed of a large number of constituent tiles.

In [None]:
samples = train_dataset.sample(5)
wsi_images = []
for _, sample in samples.iterrows():
    wsi_images.append(open_wsi(sample["source"], sample["CaseNo"], sample["image_her2"]))
# Describe the last sampled slide; it is reused in the thresholding cells.
her2_wsi = wsi_images[-1]
describe_wsi(her2_wsi)

In [None]:
for wsi in wsi_images:
    # The last pyramid level is the lowest resolution, so the thumbnail stays small.
    size = wsi.level_dimensions[-1]
    display(size)
    display(wsi.get_thumbnail(size))

### Size distribution

In [None]:
display_wsi_sizes_distribution(train_dataset, "Train Set")

## Otsu

In [None]:
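# Assumes the slide has at least 6 pyramid levels; level 5 is a low-resolution view.
size = her2_wsi.level_dimensions[5]
# get_thumbnail returns an RGB PIL image; convert it to OpenCV's BGR order.
image = cv2.cvtColor(np.array(her2_wsi.get_thumbnail(size)), cv2.COLOR_RGB2BGR)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Otsu's thresholding: the threshold argument (0) is ignored and chosen automatically.
ret2, th2 = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)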

In [None]:
plt.figure(figsize=(12, 6))
plt.imshow(image[:,:,::-1], interpolation=None)
plt.grid(False)

In [None]:
plt.figure(figsize=(12, 6))
plt.imshow(gray, interpolation=None, cmap="gray")
plt.grid(False)

In [None]:
plt.figure(figsize=(12,6))
plt.imshow(th2, interpolation=None, cmap="binary")
plt.grid(False)
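
The Otsu step in this section picks its threshold automatically by maximizing the variance between the two resulting pixel classes. Below is a minimal, dependency-free sketch of that idea over a flat list of 8-bit values. It is for intuition only and is not OpenCV's implementation; `otsu_threshold` is a hypothetical helper.

```python
from typing import Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the threshold t in [0, 255] that maximizes between-class variance.

    Pixels <= t form one class and pixels > t form the other.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_low = 0  # number of pixels with value <= t
    sum_low = 0     # sum of those pixel values
    for t in range(256):
        weight_low += histogram[t]
        sum_low += t * histogram[t]
        weight_high = total - weight_low
        if weight_low == 0 or weight_high == 0:
            continue  # one class is empty, so there is nothing to compare
        mean_low = sum_low / weight_low
        mean_high = (total_sum - sum_low) / weight_high
        # Between-class variance, up to a constant factor of 1 / total**2.
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_threshold, best_variance = t, variance
    return best_threshold


# Synthetic bimodal data: dark "tissue" values and bright "background" values.
pixels = [60, 70, 80] * 100 + [220, 230, 240] * 300
t = otsu_threshold(pixels)
print(t)
```

For well-separated modes like these, the chosen threshold falls between the dark and bright groups. On real slides the histogram is noisier, so the resulting mask is usually inspected visually, as in the plots above.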

## Otsu with HSV


In [None]:
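size = her2_wsi.level_dimensions[5]
image = cv2.cvtColor(np.array(her2_wsi.get_thumbnail(size)), cv2.COLOR_RGB2HSV)

# Otsu's thresholding on a single HSV channel.
# Assumption: saturation is the intended channel; hue or value can be thresholded the same way.
_, saturation, _ = cv2.split(image)
ret2, th2 = cv2.threshold(saturation, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)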

In [None]:
plt.figure(figsize=(12,6))
plt.imshow(th2, interpolation=None, cmap="binary")
plt.grid(False)
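
One reason to move to HSV before thresholding is that saturation tends to separate near-white slide background from colored, stained tissue. The sketch below illustrates this with two hand-picked colors and the standard-library `colorsys` module. The colors are illustrative assumptions, not values measured from the slides.

```python
import colorsys
from typing import Tuple


def saturation(rgb: Tuple[int, int, int]) -> float:
    """Return the HSV saturation (0.0 to 1.0) of an 8-bit RGB pixel."""
    r, g, b = (channel / 255.0 for channel in rgb)
    _, s, _ = colorsys.rgb_to_hsv(r, g, b)
    return s


# Illustrative colors, not measured from the slides.
background = (245, 245, 245)  # near-white glass
stained = (120, 70, 40)       # brownish stained tissue

print(f"background saturation: {saturation(background):.2f}")
print(f"stained saturation:    {saturation(stained):.2f}")
```

Gray and white pixels have zero saturation regardless of brightness, so a threshold on this channel is less sensitive to illumination than one on grayscale intensity.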

## Dataset Generator

Generators feed the models, producing labeled image patches during training.

In [None]:
train = aggregate_dataset(load_dataset("./train/datasets/train.csv"))
test = aggregate_dataset(load_dataset("./train/datasets/test.csv"))

### Grid Generator

Extract patches from a grid over the WSI.

In [None]:
train_generator = GridPatchGenerator(train, 1, 3, (224, 224))
X_batch, y_batch = train_generator[0]
for xi, yi in zip(X_batch, y_batch):
    plot_sample(xi, yi.argmax())
    plt.show()

In [None]:
test_generator = GridPatchGenerator(test, 2, 2, (224, 224))
X_batch, y_batch = test_generator[0]
for xi, yi in zip(X_batch, y_batch):
    plot_sample(xi, yi.argmax())
    plt.show()
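
As a rough picture of grid-based extraction, the sketch below lists the top-left corners of non-overlapping 224 x 224 patches that fit fully inside an image. It is an assumption about the general idea; `GridPatchGenerator` may handle overlap, borders, or background filtering differently, and `grid_patch_origins` is a hypothetical helper.

```python
from typing import List, Tuple


def grid_patch_origins(
    image_size: Tuple[int, int], patch_size: Tuple[int, int] = (224, 224)
) -> List[Tuple[int, int]]:
    """Return the top-left (x, y) of every full, non-overlapping patch."""
    width, height = image_size
    patch_w, patch_h = patch_size
    return [
        (x, y)
        for y in range(0, height - patch_h + 1, patch_h)
        for x in range(0, width - patch_w + 1, patch_w)
    ]


# Hypothetical 1000 x 500 pixel image: 4 columns by 2 rows of patches.
origins = grid_patch_origins((1000, 500))
print(len(origins), origins[:3])
```

A grid visits every region exactly once, which suits evaluation, at the cost of including patches that may be mostly background.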

### MCPatchGenerator

Randomly sample patches from the WSI.

In [None]:
train_generator = MCPatchGenerator(train, 2, 3, (224, 224))
X_batch, y_batch = train_generator[0]
for xi, yi in zip(X_batch, y_batch):
    plot_sample(xi, yi.argmax())
    plt.show()

In [None]:
test_generator = MCPatchGenerator(test, 2, 3, (224, 224))
X_batch, y_batch = test_generator[3]
for xi, yi in zip(X_batch, y_batch):
    plot_sample(xi, yi.argmax())
    plt.show()
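
For comparison with the grid approach, here is a minimal sketch of drawing random patch positions uniformly inside an image. It assumes uniform sampling with a fixed seed for reproducibility; `MCPatchGenerator` may use a different sampling scheme, and `random_patch_origins` is a hypothetical helper.

```python
import random
from typing import List, Optional, Tuple


def random_patch_origins(
    image_size: Tuple[int, int],
    patch_size: Tuple[int, int] = (224, 224),
    n_patches: int = 3,
    seed: Optional[int] = None,
) -> List[Tuple[int, int]]:
    """Return top-left (x, y) corners of patches sampled uniformly at random."""
    width, height = image_size
    patch_w, patch_h = patch_size
    if width < patch_w or height < patch_h:
        raise ValueError("image is smaller than the patch")
    rng = random.Random(seed)
    # randint is inclusive, so the patch always fits inside the image.
    return [
        (rng.randint(0, width - patch_w), rng.randint(0, height - patch_h))
        for _ in range(n_patches)
    ]


# Hypothetical 1000 x 500 pixel image, 3 patches, fixed seed.
sampled = random_patch_origins((1000, 500), n_patches=3, seed=42)
print(sampled)
```

Random sampling can yield a different set of patches on every pass over the same slide, which acts as a form of augmentation during training.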