<h1 align="center"><a href="https://github.com/sborquez/her2bdl"> Her2BDL</a> - Her2 Bayesian Deep Learning</h1>

<br>
<img src="images/header.png" width="35%"/>

<h2 align="center">Exploratory Data Analysis</h2>

<center>
<i> Notebook created by Sebastián Bórquez G. - <a href="mailto:sebastian.borquez@sansano.usm.cl">sebastian.borquez@sansano.usm.cl</a> - UTFSM - August 2020.</i>
</center>


# Setup Notebook


## (Option A) Colab Setup

Connect to your `Google Drive` and install dependencies.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%cd "/content/drive/<Path to Project>"
!ls

## (Option B) Local Setup

Change to the project root directory.

In [1]:
%cd ..

d:\sebas\Google Drive\Projects\her2bdl


## Import Modules

In [2]:
# Her2BDL package
from her2bdl import *

# Ad hoc modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
# IPython.display is the public module; importing display from IPython.core.display is deprecated
from IPython.display import display, HTML

# Exploratory Data Analysis

Description and exploration of the datasets, their class distribution, and example images.


## Warwick Her2 Scoring contest

### Description

Whole-slide images (WSIs) are generally high-resolution gigapixel images obtained by scanning conventional glass slides. They are normally stored in pyramid structures containing several levels, each with a different resolution. Visualizing a region of interest (ROI) from these images requires specially designed libraries or tools. OpenSlide is one of the commonly used libraries and provides a simple interface to read WSIs.
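The pyramid layout described above can be illustrated with a minimal sketch. It assumes each level halves the resolution of the previous one and uses a made-up slide size; both are illustrative assumptions, since real slides report their own per-level dimensions and downsample factors. The helper names `level_dimensions` and `to_level_coords` are hypothetical and not part of `her2bdl` or OpenSlide.

```python
def level_dimensions(width, height, level_count, downsample_per_level=2):
    """Return (width, height) for each pyramid level; level 0 is full resolution.

    Assumes a constant downsample factor between consecutive levels,
    which is an illustrative simplification.
    """
    dims = []
    for level in range(level_count):
        factor = downsample_per_level ** level
        dims.append((width // factor, height // factor))
    return dims


def to_level_coords(x0, y0, level, downsample_per_level=2):
    """Map level-0 pixel coordinates to the coordinate frame of a given level."""
    factor = downsample_per_level ** level
    return x0 // factor, y0 // factor


# Hypothetical 100,000 x 80,000 pixel slide with 4 pyramid levels.
dims = level_dimensions(100_000, 80_000, level_count=4)
for level, (w, h) in enumerate(dims):
    print(f"level {level}: {w} x {h}")

# A point at (40_000, 20_000) in level 0 lands at (5_000, 2_500) in level 3.
print(to_level_coords(40_000, 20_000, level=3))
```

With OpenSlide itself, the equivalent information comes from the slide object's `level_dimensions` and `level_downsamples` properties, and `read_region(location, level, size)` expects `location` in level-0 coordinates while `size` is given in pixels of the requested level.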

### Source

- https://warwick.ac.uk/fac/sci/dcs/research/tia/her2contest/download/

- Qaiser, Talha, et al. "HER2 challenge contest: a detailed assessment of automated HER2 scoring algorithms in whole slide images of breast cancer tissues." Histopathology 72.2 (2018): 227-238.
https://onlinelibrary.wiley.com/doi/epdf/10.1111/his.13333

- T. Qaiser, N.M. Rajpoot, "Learning Where to See: A Novel Attention Model for Automated Immunohistochemical Scoring", in IEEE Transactions on Medical Imaging, 2019. DOI: 10.1109/TMI.2019.2907049
https://ieeexplore.ieee.org/document/8672928


### The Training Dataset

The training dataset consists of 52 WSIs, evenly distributed across the four possible HER2 scores (0/1+/2+/3+).

The ground truth for the WSIs is provided in a spreadsheet containing, in order, the case number, the HER2 score, and the percentage of cells with complete membrane staining irrespective of intensity.

In [3]:
source = 'E:/datasets/medical/warwick/train'

train_dataset = get_dataset(source, include_ground_truth=True)
describe_dataset(train_dataset)
train_dataset.sample(5)

Dataset Info:
  size: 52
  columns: Index(['CaseNo', 'HeR2 SCORE', 'source', 'image_her2', 'image_he'], dtype='object')
  by class:
    Score 3: 13
    Score 2: 13
    Score 1: 13
    Score 0: 13


|    | CaseNo | HeR2 SCORE | source | image_her2 | image_he |
|---:|---:|---:|:---|:---|:---|
| 2 | 6 | 2 | E:/datasets/medical/warwick/train | 06_Her2.ndpi | 06_HE.ndpi |
| 14 | 25 | 2 | E:/datasets/medical/warwick/train | 25_Her2.ndpi | 25_HE.ndpi |
| 12 | 22 | 3 | E:/datasets/medical/warwick/train | 22_Her2.ndpi | 22_HE.ndpi |
| 46 | 82 | 3 | E:/datasets/medical/warwick/train | 82_Her2.ndpi | 82_HE.ndpi |
| 39 | 66 | 0 | E:/datasets/medical/warwick/train | 66_Her2.ndpi | 66_HE.ndpi |
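The per-class counts reported by `describe_dataset` can also be checked by hand. The sketch below uses only the standard library and tallies the five sampled rows shown above; it is an illustration, not the `her2bdl` implementation. The column names `CaseNo` and `HeR2 SCORE` come from the notebook output, while the helper name `count_by_score` is hypothetical.

```python
import csv
import io
from collections import Counter

# Rows copied from the sample shown above; a real check would read
# the full ground-truth spreadsheet instead of this inline excerpt.
GROUND_TRUTH_CSV = """CaseNo,HeR2 SCORE
6,2
25,2
22,3
82,3
66,0
"""


def count_by_score(csv_text):
    """Count how many cases fall under each HER2 score."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(int(row["HeR2 SCORE"]) for row in reader)


counts = count_by_score(GROUND_TRUTH_CSV)
for score in sorted(counts, reverse=True):
    print(f"Score {score}: {counts[score]}")
```

Applied to the full training spreadsheet, the same tally should reproduce the 13 cases per score reported in the dataset summary.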


### The Testing Dataset

The testing dataset contains 28 whole-slide images (WSIs); it is loaded here without ground-truth scores.

In [7]:
source = 'E:/datasets/medical/warwick/test'

test_dataset = get_dataset(source, include_ground_truth=False)
describe_dataset(test_dataset, include_targets=False)
test_dataset.sample(1)

Dataset Info:
  size: 2


|    | source | CaseNo | image_her2 | image_he | HeR2 SCORE |
|---:|:---|---:|:---|:---|:---|
| 1 | E:/datasets/medical/warwick/test | 3 | 03_Her2.ndpi | 03_HE.ndpi | |


## Dataset Preparation

The training set is split into `train/validation/test` subsets with nominal ratios of 80%, 10%, and 10%, respectively.
The split must preserve the class balance. With 13 cases per class, the resulting subsets contain 36, 8, and 8 cases (9, 2, and 2 per class).

In [10]:
source = 'E:/datasets/medical/warwick/train'
# 1. Get Dataset from source
dataset = get_dataset(source, include_ground_truth=True)
    
# 2. Split train/validation/test
train, val, test = split_dataset(dataset, validation_ratio=0.1, test_ratio=0.1, seed=42)

# 3. Save
print('train split')
describe_dataset(train)
save_dataset(train, "./train/datasets", "train")

print()
print('validation split')
describe_dataset(val)
save_dataset(val, "./train/datasets", "validation")

print()
print('test split')
describe_dataset(test)
save_dataset(test, "./train/datasets", "test");

train split
Dataset Info:
  size: 36
  columns: Index(['CaseNo', 'HeR2 SCORE', 'source', 'image_her2', 'image_he'], dtype='object')
  by class:
    Score 3: 9
    Score 2: 9
    Score 1: 9
    Score 0: 9

validation split
Dataset Info:
  size: 8
  columns: Index(['CaseNo', 'HeR2 SCORE', 'source', 'image_her2', 'image_he'], dtype='object')
  by class:
    Score 3: 2
    Score 2: 2
    Score 1: 2
    Score 0: 2

test split
Dataset Info:
  size: 8
  columns: Index(['CaseNo', 'HeR2 SCORE', 'source', 'image_her2', 'image_he'], dtype='object')
  by class:
    Score 3: 2
    Score 2: 2
    Score 1: 2
    Score 0: 2


'./train/datasets\\test.csv'
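The split sizes above (36/8/8) differ from an exact 80/10/10 division of 52 cases. One way to arrive at them is to split each class separately and round the validation and test counts up, so that every class contributes at least one case to each subset. The sketch below shows that idea using only the standard library. It is an assumption about how such a split could work, not the actual `split_dataset` implementation, and the helper name `stratified_split` is hypothetical.

```python
import math
import random
from collections import defaultdict


def stratified_split(labels, validation_ratio=0.1, test_ratio=0.1, seed=42):
    """Return (train, val, test) index lists, splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Round up per class; round() first to absorb floating-point noise
        # (for example, 30 * 0.1 evaluates to 3.0000000000000004).
        n_val = math.ceil(round(len(indices) * validation_ratio, 9))
        n_test = math.ceil(round(len(indices) * test_ratio, 9))
        val.extend(indices[:n_val])
        test.extend(indices[n_val:n_val + n_test])
        train.extend(indices[n_val + n_test:])
    return train, val, test


# Four scores (0-3) with 13 cases each, matching the training set summary.
labels = [score for score in range(4) for _ in range(13)]
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 36 8 8
```

Rounding up gives 2 of 13 cases per class to validation and 2 to test, leaving 9 per class for training, which matches the per-class counts in the output above.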

## Visualizations

In [11]:
train = load_dataset("./train/datasets/train.csv")
val = load_dataset("./train/datasets/validation.csv")
test = load_dataset("./train/datasets/test.csv")

### Balanced Classes

In [None]:
display_class_distribution(train, dataset_name="Training")
display_class_distribution(val, dataset_name="Validation")
display_class_distribution(test, dataset_name="Testing")

## WSI: Example Images