# HuBMAP deepflash2 Judge Prize Submission

> Outline of the deepflash2<sup>*</sup> approach

***
<sup>*</sup>deepflash2 is our Kaggle team name and our [open source Python library](https://matjesg.github.io/deepflash2/) that aims to combine state-of-the-art deep learning with a barrier-free environment for life science researchers.

# Highlights

> Super fast training: 30 minutes training to reach a public LB score of 0.922.
- **Efficient Sampling**
    - no "pre-tiling" needed: the training data is converted into .zarr files for efficient loading.
    - flexible tile dimensions (e.g., 1024, 512, 256) and downscaling (2x, 3x, 4x) at runtime
    - training focuses on the relevant regions (e.g., tiles that contain glomeruli and cortex)
    - no cropping artifacts from rotation and similar transforms during data augmentation
- **Standardized Workflows**
    - Using best practices for training schedules, architectures, augmentations and inference
    - leveraging the `pytorch` ecosystem: `deepflash2`, `fastai`, `Albumentations`, `segmentation-models.pytorch`
    - *production ready*: our approach provides an outstanding prediction quality without stacking many different architectures and encoders.
- **Uncertainty Estimation**
    - provides Bayesian and energy-based measures for uncertainty
    - enables human-in-the-loop refinement of difficult specimens
- **Generalizability**
    - We applied our modelling technique to a new type of tissue and achieved a cross-validation Dice score over 0.8 with only 3 annotated whole slide images.
- **Public Kernel Sharing and Open Source Libraries**
    - All our solutions are open source and run on Kaggle Servers. We provided public notebooks as an example to use our sampling techniques earlier in the challenge.

| Public Kernels      |                         |                        |                        |
| --------------------| ------------------------|------------------------|------------------------|
| [![Custom Badge](https://www.kaggle.com/static/images/medals/competitions/silverl@1x.png)](https://www.kaggle.com/matjes/hubmap-zarr)              | [![Custom Badge](https://www.kaggle.com/static/images/medals/competitions/silverl@1x.png)](https://www.kaggle.com/matjes/hubmap-labels-pdf-0-5-0-25-0-01)              | [![Custom Badge](https://www.kaggle.com/static/images/medals/competitions/goldl@1x.png)](https://www.kaggle.com/matjes/hubmap-efficient-sampling-deepflash2-train)                    |[![Custom Badge](https://www.kaggle.com/static/images/medals/competitions/goldl@1x.png)](https://www.kaggle.com/matjes/hubmap-efficient-sampling-deepflash2-sub) 
| [File Conversion](https://www.kaggle.com/matjes/hubmap-zarr)    | [Efficient Sampling](https://www.kaggle.com/matjes/hubmap-labels-pdf-0-5-0-25-0-01)    | [Training](https://www.kaggle.com/matjes/hubmap-efficient-sampling-deepflash2-train) |[Inference](https://www.kaggle.com/matjes/hubmap-efficient-sampling-deepflash2-sub) |
| 30 upvotes, 116 forks| 63 upvotes, 126 forks| 97 upvotes, 356 forks |113 upvotes, 561 forks |


**Team**

- Matthias Griebel, PhD Candidate & Research Associate, University of Würzburg 
- Philipp Sodmann, Physician & Research Assistant, University Hospital Würzburg
- Thomas Lux, Physician & Clinical Scientist, University Hospital Würzburg

***

# Table of Contents

* [1. Methodology](#1)
  * [1.1. Efficient Sampling](#sampling)
  * [1.2. Training](#train)
  * [1.3. Validation](#val)
  * [1.4. Confidence estimation](#ce)
  * [1.5. Inference](#inf)
* [2. Generalizability](#gen)
* [3. Limitations and conclusion](#lim)

# 1. Methodology<a id="1"></a>

Our competition notebooks for the Judge Prize cover the following workflow:

<figure>
  <img src="https://i.imgur.com/2mxXcJh.png" alt="workflow"/>
    <figcaption>Figure 1: Proposed standard workflow<br /></figcaption>
</figure>

The corresponding kernels are 
- [File Conversion](https://www.kaggle.com/matjes/hubmap-zarr)
- [Sampling Preparation](https://www.kaggle.com/matjes/hubmap-efficient-sampling-ii-deepflash2)
- [Ensemble Training](https://www.kaggle.com/matjes/hubmap-deepflash2-train)
- [Validation](https://www.kaggle.com/matjes/hubmap-deepflash2-validation)
- [Inference (5-Fold-CV Ensemble)](https://www.kaggle.com/matjes/hubmap-deepflash2-sumbission)
- [Inference (Ensemble at different scales)](https://www.kaggle.com/matjes/hubmap-deepflash2-scaled-ensemble-sumbission)


## 1.1. Efficient Sampling<a id="sampling"></a>

[View our Kernel](https://www.kaggle.com/matjes/hubmap-efficient-sampling-ii-deepflash2)

A common approach to deal with the very large (roughly 500 MB to 5 GB) TIFF files in the dataset is to decompose the images into smaller patches/tiles, for instance by using a sliding window approach. However, whole slide images contain only a few relevant regions, while much of the image is either blank or contains tissue without the target class. 

Instead of preprocessing the images by saving them into fixed tiles, we combine two sampling approaches:

> 1. Sampling tiles via center points in the proximity of every glomerulus. This ensures that each glomerulus is seen at least once during every training epoch.
> 2. Sampling random tiles based on region probabilities (e.g., medulla, cortex, other)

We use the provided anatomical information to train with more examples of the cortex than the medulla, because glomeruli have a higher abundance in this region. We also sampled a few examples not contained in the anatomic regions to ensure that our model can interpret these as well. The figure below depicts the sampling process of one image during one training epoch. 

<figure>
  <img src="https://i.imgur.com/nsEXX1M.png" alt="Exemplary sampling of one training epoch"/>
    <figcaption>Figure 2: Exemplary sampling of one training epoch on image <i>"0486052bb"</i> <br />
    Masks of the annotated glomeruli (left), anatomical regions (middle), sampled tiles during one epoch of training (right).
    </figcaption>
</figure>
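The two sampling strategies described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the deepflash2 implementation: the function name, the `jitter` parameter, the integer region encoding, and the per-region weights are all assumptions made for the example.

```python
import numpy as np

def sample_tile_centers(glom_centers, region_mask, region_probs, n_random, jitter, rng):
    """Return (row, col) tile centers for one epoch (illustrative sketch).

    glom_centers: (N, 2) integer array of glomerulus centroids.
    region_mask:  2D integer array with one region id per pixel (hypothetical encoding).
    region_probs: dict mapping region id -> per-pixel sampling weight (assumed values).
    """
    h, w = region_mask.shape

    # 1) One center close to every glomerulus, so each one is seen once per epoch.
    offsets = rng.integers(-jitter, jitter + 1, size=glom_centers.shape)
    near = np.clip(glom_centers + offsets, 0, [h - 1, w - 1])

    # 2) Random centers drawn from a pixel-wise probability map built from the regions.
    pdf = np.zeros(region_mask.shape, dtype=np.float64)
    for region_id, weight in region_probs.items():
        pdf[region_mask == region_id] = weight
    pdf = pdf.ravel() / pdf.sum()
    flat_idx = rng.choice(pdf.size, size=n_random, p=pdf)
    random_centers = np.column_stack(np.unravel_index(flat_idx, region_mask.shape))

    return np.vstack([near, random_centers])
```

Because only center points are stored, the tile size and the downscaling factor can still be chosen when the tile is actually read.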

This kind of sampling is not only intuitive but also has a practical statistical benefit. Considering the exemplary batch of 16 images and the corresponding pixel distribution in the figures below, we can see that the normalized pixel values approximately follow a normal distribution. This property is beneficial when using pre-trained models and generally leads to faster convergence during the training of artificial neural networks. Moreover, the sampling is model-agnostic and can be used with any model on several other tasks (e.g., classification, object detection).

<figure>
  <img src="https://i.imgur.com/CWpZq8Q.png" alt="One batch of 512x512 patches downscaled by factor 3" width="800"/>
    <figcaption>Figure 3: One batch of 16 512x512 patches downscaled by factor 3, note that 14 out of 16 examples contain the foreground class "glomerulus".</figcaption>
</figure>
<br />
<figure>
  <img src="https://i.imgur.com/uANDUPG.png" alt="Pixel distribution after normalization" width="400"/>
    <figcaption>Figure 4: Pixel distribution after normalization of the batch above</figcaption>
</figure>
<br />
A beneficial side effect of our sampling method is that functional tissue units which were overlooked in the annotation will rarely be used in training. 

## 1.2. Training<a id="train"></a>

[View our Kernel](https://www.kaggle.com/matjes/hubmap-deepflash2-train)

Our training methodology is based on best practices for training schedules (fastai), architectures (segmentation-models.pytorch), image augmentations (Albumentations) and inference (tile shift and Gaussian weighting, see [nnU-Net](https://www.nature.com/articles/s41592-020-01008-z)). To ensure the objectivity, reliability, and validity of our model, we tried to incorporate the principles outlined by [Segebarth, Griebel, et al. (eLife 2020)](https://elifesciences.org/articles/59780) by using model ensembles.

**Negligible differences between various encoders and models**

During this challenge, we trained and tested different encoders (ResNet, EfficientNet-B0 to B4) and architectures (U-Net, U-Net++, DeepLabV3) and found no significant difference in performance. Therefore, we decided to use a reasonably small encoder (efficientnet-b2) as well as a simple default U-Net. We ran the same experiment for different optimizers (SGD, AdamW, Ranger, MADGRAD) and found that Ranger performed most consistently.
In our experiments, the Dice-CrossEntropy loss worked well. We also tried more exotic training objectives such as deep supervision but did not find any consistent benefit.  
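For readers unfamiliar with the combined loss, the following NumPy sketch shows one common formulation: the unweighted sum of the pixel-wise cross entropy and one minus the soft Dice score. The equal weighting, the smoothing constant `eps`, and the function names are assumptions for illustration; the actual training uses the implementation in the training notebook.

```python
import numpy as np

def softmax(logits, axis=0):
    # Subtract the maximum for numerical stability.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dice_ce_loss(logits, target, eps=1e-6):
    """logits: (K, H, W) float array, target: (H, W) integer class ids."""
    k = logits.shape[0]
    probs = softmax(logits, axis=0)
    onehot = np.eye(k)[target].transpose(2, 0, 1)  # (K, H, W)

    # Cross entropy: mean negative log-probability of the true class.
    ce = -np.mean(np.log(np.sum(probs * onehot, axis=0) + eps))

    # Soft Dice per class, averaged over classes.
    inter = np.sum(probs * onehot, axis=(1, 2))
    denom = np.sum(probs, axis=(1, 2)) + np.sum(onehot, axis=(1, 2))
    dice = np.mean((2 * inter + eps) / (denom + eps))

    return ce + (1.0 - dice)
```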

**Resolution makes a difference**

We trained our model at different magnifications. While a higher resolution might be beneficial to identify the glomerular border (Bowman's capsule) correctly, a reduced resolution (equal to less magnification) provides more context for the annotated area. We compared resolution reductions of 2x, 3x, 4x, 6x, and 8x, of which the 3x reduction resulted in the best Dice score. For the same reason, we trained our model with tile sizes of 256 and 512. A smaller tile size allows a larger batch size and therefore better estimates for batch norm. Nevertheless, the larger tile size resulted in better scores. We logged all our training runs using [wandb](https://wandb.ai). The results can be made publicly available on demand.
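To make the trade-off between resolution and context concrete: a 512x512 tile at a downscaling factor of 3 covers a 1536x1536 pixel window of the original image. The sketch below reads such a window around a center point and reduces it by block averaging. The function name and the choice of block averaging with zero padding are assumptions for this example; the library may use a different interpolation.

```python
import numpy as np

def read_downscaled_tile(image, center, tile_size=512, scale=3):
    """Cut a (tile_size * scale)-sized window around `center` and average-pool it by `scale`."""
    full = tile_size * scale
    r0 = center[0] - full // 2
    c0 = center[1] - full // 2

    # Zero-padded window, so tiles near the image border keep their size.
    window = np.zeros((full, full) + image.shape[2:], dtype=np.float32)
    rs, cs = max(r0, 0), max(c0, 0)
    re = max(rs, min(r0 + full, image.shape[0]))
    ce = max(cs, min(c0 + full, image.shape[1]))
    window[rs - r0:re - r0, cs - c0:ce - c0] = image[rs:re, cs:ce]

    # Block averaging: every scale x scale block becomes one output pixel.
    blocks = window.reshape(tile_size, scale, tile_size, scale, *image.shape[2:])
    return blocks.mean(axis=(1, 3))
```

Only basic slicing is used on `image`, so the same idea should carry over to array-like containers that support NumPy-style indexing.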

**Final training setup**

- Architecture: U-Net
- Encoder: efficientnet-b2
- Pretraining: imagenet
- Loss: Dice-CrossEntropy
- Optimizer: [ranger](https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer)
- Learning rate: 1e-3
- Batch Size: 16
- Tile Size: 512x512
- Resolution Downscaling Factor: 3
- Training iterations: 2500-3000 (best model selected on validation set)
- Ensembling 1: 5 Models (5-fold cross validation)
- Ensembling 2: 3 Models at scales 2, 3, and 4 trained on all data
- Augmentations: see training Notebook
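The setup above could be captured in a small configuration object such as the one below. This is a hedged sketch, not the configuration used in our notebooks: the class and field names are invented for illustration, `n_classes = 2` (background and glomerulus) is an assumption, and `build_model` simply shows how such a configuration would map onto the `Unet` class of `segmentation-models.pytorch`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    # Values taken from the list above; field names are illustrative.
    encoder_name: str = "efficientnet-b2"
    encoder_weights: str = "imagenet"
    lr: float = 1e-3
    batch_size: int = 16
    tile_size: int = 512
    scale: int = 3
    max_iterations: int = 3000
    n_folds: int = 5
    n_classes: int = 2  # assumption: background + glomerulus

def build_model(cfg):
    # Imported lazily so the configuration can be inspected without PyTorch installed.
    import segmentation_models_pytorch as smp
    return smp.Unet(
        encoder_name=cfg.encoder_name,
        encoder_weights=cfg.encoder_weights,
        in_channels=3,
        classes=cfg.n_classes,
    )
```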

The figure below summarizes a typical training run of one model with [one-cycle-training](https://arxiv.org/pdf/1803.09820.pdf).

<figure>
  <img src="https://i.imgur.com/kP9SCyn.png" alt="Training summary"/>
    <figcaption>Figure 5: Training summary <br /></figcaption>
</figure>

Furthermore, we tested the recently published [nnU-Net](https://www.nature.com/articles/s41592-020-01008-z) but could not observe improved results. Because of the high computational cost (about 1 week training time for 5 folds), we did not run further experiments and discarded this approach.

## 1.3. Validation<a id="val"></a>
[View our Kernel](https://www.kaggle.com/matjes/hubmap-deepflash2-validation)

To evaluate the performance of our models, we trained and tested them in a five-fold cross-validation. Each fold is trained on 12 and validated on 3 whole slide images.
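The 12/3 split per fold follows directly from dividing 15 images into five folds. A minimal sketch in plain Python is shown below; the round-robin assignment of images to folds and the placeholder image ids are assumptions, since the actual assignment is defined in the training notebook.

```python
def k_fold_splits(items, n_folds=5):
    """Yield (train, validation) lists; every item is validated exactly once."""
    folds = [items[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        val = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, val

# Placeholder ids standing in for the 15 training images.
image_ids = [f"img_{i:02d}" for i in range(15)]
for train_ids, val_ids in k_fold_splits(image_ids):
    print(len(train_ids), len(val_ids))
```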

**Metrics**

**Pixel level metrics during training (Figure 5)**:
We logged the Dice score (F1), the precision, and the recall as well as the soft Dice score and the loss for each epoch. The scores were consistent over all folds except for one image, in which our model found significantly more glomeruli than annotated in a poorly stained area.

**Instance level metrics**: 
To estimate the quality of our model on the instance level (glomerulus level) instead of the pixel level, we additionally calculated measures to account for the detection similarity. Similar to [Segebarth, Griebel, et al. (eLife 2020)](https://elifesciences.org/articles/59780), we define the intersection over union (IoU) as: 

$$
M_{\text{IoU}}(a,b) := \frac{|a\cap b|}{|a \cup b|}
$$

A pair of objects (segmented glomeruli) with an IoU above a threshold $t$ is considered correctly detected (true positive, $TP$). 
Objects that match with an IoU at or below $t$ or have no match at all are considered false negatives ($FN$) for the source mask and false positives ($FP$) for the target mask. This allows us to calculate the precision $M_{\text{Precision}}$, the recall $M_{\text{Recall}}$, and the F1 score $M_{\text{F1 score}}$ as the harmonic mean of $M_{\text{Precision}}$ and $M_{\text{Recall}}$:

$$
M_{\text{Precision}}(t):=\frac{TP(t)}{TP(t)+FP(t)}
$$

$$
M_{\text{Recall}}(t):=\frac{TP(t)}{TP(t)+FN(t)}
$$

$$
M_{\text{F1 score}}(t) := 2 \cdot \frac{M_{\text{Precision}}(t) \cdot M_{\text{Recall}}(t)}{M_{\text{Precision}}(t) + M_{\text{Recall}}(t)}
$$

with $t\in[0,1]$ as a fixed IoU threshold. If not indicated differently, we used $t=0.5$ in our calculations.
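The instance-level metrics defined above can be computed from two instance-labeled masks as in the following sketch. It assumes that background is 0 and every object has a unique positive id, and it uses a simple greedy matching, which is unambiguous for $t \geq 0.5$ because an object can then match at most one counterpart. The function name is illustrative and this is not the code from the validation notebook.

```python
import numpy as np

def instance_scores(source, target, t=0.5):
    """Precision, recall and F1 for instance masks (0 = background)."""
    src_ids = [i for i in np.unique(source) if i != 0]
    tgt_ids = [i for i in np.unique(target) if i != 0]
    matched = set()
    tp = 0
    for s in src_ids:
        s_mask = source == s
        # Only target objects that overlap the source object can match it.
        for c in np.unique(target[s_mask]):
            if c == 0 or c in matched:
                continue
            c_mask = target == c
            iou = np.logical_and(s_mask, c_mask).sum() / np.logical_or(s_mask, c_mask).sum()
            if iou > t:
                tp += 1
                matched.add(c)
                break
    fn = len(src_ids) - tp  # unmatched objects in the source mask
    fp = len(tgt_ids) - tp  # unmatched objects in the target mask
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```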

**Results**

The cross-validation results are depicted in Figure 6. The high recall indicates that the models find almost all annotated glomeruli. The precision is slightly lower, which indicates that the models are sometimes misled by vessels or other tissue that looks similar to a glomerulus. Overall, we consider these results convincing.

<figure>
  <img src="https://i.imgur.com/aT9ypRo.png" alt="Cross validation results"/>
    <figcaption>Figure 6: Cross Validation Results</figcaption>
</figure>

## 1.4. Confidence estimation<a id="ce"></a>
[View our Kernel](https://www.kaggle.com/matjes/hubmap-deepflash2-validation)

To robustly estimate the confidence of our prediction, we developed an energy-based approach for image segmentation that builds on the work of [Liu et al. (2020)](https://proceedings.neurips.cc/paper/2020/file/f5496252609c43eb8a3d147ab9b9c006-Paper.pdf) for image classification. When applying a softmax prediction, neural networks often overestimate their confidence when predicting out-of-distribution data. Using the energy score can help to reduce the false positive rate (FPR) in such cases. The authors were able to reduce the FPR on CIFAR-10 by 18.03%. The formula for the energy is derived from the Helmholtz free energy and calculated as:

$$
E(x;f) = -T \cdot \log \sum_{i=1}^{K} e^{f_i(x)/T}
$$

In this equation, $K$ is the number of classes, $x$ is the input, $f$ is the function (here the neural network), and $f_i(x)$ is the logit for class $i$.  
As suggested by the authors, we chose the temperature parameter $T=1$. To allow a more intuitive interpretation of the energy (i.e., having mostly positive numbers), we calculated the *negative* energy score $-E$ in our experiments. Thus, the term energy score refers to the *negative* energy throughout this work.
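For segmentation, the energy is evaluated per pixel on the class logits. The sketch below computes the negative energy $-E = T \cdot \log \sum_i e^{f_i(x)/T}$ with a numerically stable log-sum-exp; the function name and the `(K, H, W)` logit layout are assumptions for this example.

```python
import numpy as np

def negative_energy(logits, T=1.0, axis=0):
    """Per-pixel negative energy -E from logits of shape (K, H, W)."""
    z = logits / T
    m = z.max(axis=axis, keepdims=True)
    # log-sum-exp with the maximum factored out for numerical stability
    lse = m + np.log(np.exp(z - m).sum(axis=axis, keepdims=True))
    return T * np.squeeze(lse, axis=axis)
```

A pixel with one clearly dominant logit receives a high negative energy, whereas a pixel whose logits are all small receives a low one, even if the softmax output looks confident.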

<figure>
  <img src="https://i.imgur.com/ozVyAQT.png" alt="True positive example with high energy"/>
    <figcaption>Figure 7: True positive example with high energy</figcaption>
</figure>

<figure>
  <img src="https://i.imgur.com/u861uOf.png" alt="False positive example with low energy"/>
    <figcaption>Figure 8: False positive example: using a thresholded softmax at 0.5, the network predicts a glomerulus. However, the energy is comparatively low indicating uncertainty in the prediction.</figcaption>
</figure>

<figure>
  <img src="https://i.imgur.com/n22iqp5.png" alt="Mean softmax activation values"/>
    <figcaption>Figure 9: Mean softmax activation values</figcaption>
</figure>
<br />

Figure 9 summarizes the softmax activations over the entire cross-validation on the training data. Most of the glomeruli exhibit high mean softmax activations. Figure 10 displays the mean energy score for the same data. 
In both Figures 9 and 10, the respective score (softmax or energy) correlates positively with the IoU metric, but the energy score is much more selective and allows differentiation of out-of-distribution data (positive classes which are unlike the examples the network has seen during training).

<figure>
  <img src="https://i.imgur.com/WSWhuBn.png" alt="Mean energy values"/>
    <figcaption>Figure 10: Mean energy values</figcaption>
</figure>

**Pseudolabels and human-in-the-loop annotation refinement**

To extend the amount of training data, we predicted labels on the publicly available test data as well as on the unused whole slide images published on the [HuBMAP data portal](https://portal.hubmapconsortium.org/search?q=pas&mapped_data_types[0]=PAS%20Stained%20Microscopy&entity_type[0]=Dataset) (the data is [discussed in the challenge forums](https://www.kaggle.com/c/hubmap-kidney-segmentation/discussion/233336)).
The resulting predictions were manually refined by one physician in uncertain regions using QuPath and a Wacom drawing tablet.
The uncertainty was estimated by computing the energy score on the logits. Positive instances with a low energy score were specifically reviewed.
Glomeruli were excluded if more than half of their area was destroyed by artifacts.
We observed a steady improvement of our score with the number of glomeruli we found.
Therefore, we lowered the threshold when reviewing the pseudo annotations to include all uncertain examples.

<figure>
  <img src="https://i.imgur.com/Zle7BDU.png" alt="Proposed human-in-the-loop annotation refinement"/>
    <figcaption>Figure 11: Proposed human-in-the-loop annotation refinement</figcaption>
</figure>
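Selecting the instances for manual review can be sketched as follows: compute the mean energy score of every predicted instance and flag those below a threshold. The function name and the threshold value are hypothetical; in practice the threshold is chosen by inspecting the score distribution.

```python
import numpy as np

def flag_uncertain_instances(neg_energy, instances, threshold):
    """Return {instance id: mean energy score} for instances below `threshold`.

    neg_energy: 2D array with the per-pixel energy score (negative energy).
    instances:  2D integer array, 0 = background, one positive id per predicted object.
    """
    flagged = {}
    for i in np.unique(instances):
        if i == 0:
            continue
        mean_score = float(neg_energy[instances == i].mean())
        if mean_score < threshold:
            flagged[int(i)] = mean_score
    return flagged
```

The flagged ids can then be handed to the annotator, who reviews only these regions instead of the whole slide.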

**Finding reference glomeruli**

Our approach also provides insights that are useful to generate reference glomeruli for inclusion into a Human Reference Atlas. 
To identify typical glomeruli in an image, we utilize the energy score. A high energy score helps us to locate typical and artifact-free glomeruli on a whole slide image.

<figure>
  <img src="https://i.imgur.com/eQPVrUj.png" alt="Proposed reference glomeruli with a high energy score"/>
    <figcaption>Figure 12: Proposed reference glomeruli with a high energy score, one selected for each image provided in the training data.</figcaption>
</figure>

## 1.5. Inference<a id="inf"></a>
[View our Kernel](https://www.kaggle.com/matjes/hubmap-deepflash2-sumbission)

For inference, we combined several best practices:
- Overlapping tiles (shift factor 0.8)
- Gaussian weighting ([`nnunet`](https://www.nature.com/articles/s41592-020-01008-z))
- Pre-filtering of empty tiles (thanks to @iafoss, see this [kernel](https://www.kaggle.com/iafoss/hubmap-pytorch-fast-ai-starter-sub)!)
- Test-time augmentation (horizontal and vertical flip)

These "tricks" removed almost all prediction artifacts such as half cut-off glomeruli or noise.
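The interplay of overlapping tiles, Gaussian weighting, and flip-based test-time augmentation can be sketched as below. This is an illustration under stated assumptions rather than our inference code: the shift factor is interpreted as the stride relative to the tile size, the Gaussian width (`sigma_scale = 1/8`) is an assumed value, and `predict_fn` is a hypothetical callable that maps a 2D tile to a 2D score map of the same shape.

```python
import numpy as np

def gaussian_weights(tile, sigma_scale=0.125):
    # 2D Gaussian that peaks at the tile center and decays towards the borders.
    ax = np.arange(tile) - (tile - 1) / 2
    g = np.exp(-0.5 * (ax / (tile * sigma_scale)) ** 2)
    w = np.outer(g, g)
    return w / w.max()

def tile_starts(length, tile, shift):
    # Start positions with stride = shift * tile; the last tile is aligned to the border.
    stride = max(1, int(tile * shift))
    last = max(length - tile, 0)
    starts = list(range(0, last + 1, stride))
    if starts[-1] != last:
        starts.append(last)
    return starts

def flip_tta(predict_fn):
    # Average the predictions for the original, horizontally and vertically flipped tile.
    def wrapped(patch):
        preds = [
            predict_fn(patch),
            predict_fn(patch[:, ::-1])[:, ::-1],
            predict_fn(patch[::-1, :])[::-1, :],
        ]
        return np.mean(preds, axis=0)
    return wrapped

def predict_image(image, predict_fn, tile=512, shift=0.8):
    h, w = image.shape[:2]
    acc = np.zeros((h, w))
    norm = np.zeros((h, w))
    weights = gaussian_weights(tile)
    for r in tile_starts(h, tile, shift):
        for c in tile_starts(w, tile, shift):
            pred = predict_fn(image[r:r + tile, c:c + tile])
            wgt = weights[:pred.shape[0], :pred.shape[1]]
            acc[r:r + tile, c:c + tile] += pred * wgt
            norm[r:r + tile, c:c + tile] += wgt
    # Weighted average: tile centers dominate, tile borders contribute little.
    return acc / norm
```

Because predictions near tile borders receive small weights, objects that are cut off in one tile are dominated by the neighboring tile in which they appear closer to the center.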

**Post-processing delivers ambiguous results**

We evaluated different post-processing steps in our cross-validation experiment against a baseline that thresholds the softmax output at 0.5.
Using a conditional random field had a negative impact on the Dice score. We found no benefit in removing positive areas that were significantly smaller than the typical glomerulus size. This is most likely due to the negligible number of pixels affected compared with the total number of positive pixels.
We observed promising results when the softmax score was locally thresholded with Otsu's method. However, this did not improve the average Dice score when applied to all data. Thus, we did not use any post-processing in the final submission. 


<figure>
  <img src="https://i.imgur.com/sZySuV9.png" alt="Example of post-processing with Otsu's method"/>
    <figcaption>Figure 13: Example of post-processing with Otsu's method<br /></figcaption>
</figure>

However, we used post-processing to get a better approximation of our uncertainty scores, based on the visual results during cross-validation. We decided to post-process areas with fewer than 10k pixels (at scale 2) with Otsu's method and to "fill holes" for all other regions. 
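Otsu's method picks the threshold that maximizes the between-class variance of the score histogram. A minimal NumPy sketch for softmax scores in $[0, 1]$ is given below; the function names and the number of bins are assumptions, and established implementations exist in common image processing libraries.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Threshold in [0, 1] that maximizes the between-class variance."""
    hist, edges = np.histogram(values, bins=bins, range=(0.0, 1.0))
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(hist)                  # number of values up to and including each bin
    w1 = w0[-1] - w0                      # number of values above each bin
    s0 = np.cumsum(hist * centers)
    mu0 = s0 / np.maximum(w0, 1)          # mean of the lower class
    mu1 = (s0[-1] - s0) / np.maximum(w1, 1)  # mean of the upper class
    between = w0 * w1 * (mu0 - mu1) ** 2
    return centers[np.argmax(between)]

def refine_region(softmax_crop):
    # Local thresholding of the softmax scores inside one predicted region.
    return softmax_crop > otsu_threshold(softmax_crop)
```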

# 2. Generalizability: Experiments on pancreas specimen<a id="gen"></a>

**Data**

To test for generalizability, we annotated pancreatic islets (islets of Langerhans) in three whole slide images \[[1](https://pathology.cancerimagingarchive.net/pathdata/cptac_camicroscope/osdCamicroscope.php?tissueId=C3L-03371-25), [2](https://pathology.cancerimagingarchive.net/pathdata/cptac_camicroscope/osdCamicroscope.php?tissueId=C3L-01158-25), [3](https://pathology.cancerimagingarchive.net/pathdata/cptac_camicroscope/osdCamicroscope.php?tissueId=C3L-03350-24)\] in the same way as the challenge data. Like the glomeruli in the challenge data, pancreatic islets are a functional tissue unit, in this case of the pancreas. They form the endocrine system of the pancreas and produce insulin; the insulin-producing cells are destroyed in type 1 diabetes. A major difference from the challenge data is that the slides are stained with hematoxylin and eosin (H&E) rather than PAS.

<figure>
  <img src="https://i.imgur.com/Vmj6GRY.png" alt="Exemplary data from the pancreas dataset"/>
    <figcaption>Figure 14: Exemplary data from the pancreas dataset</figcaption>
</figure>


<figure>
  <img src="https://i.imgur.com/mfHEA0k.png" alt="Reference annotation for pancreatic islets" width="400"/>
    <figcaption>Figure 15: Reference annotation for pancreatic islets</figcaption>
</figure>


The data was downloaded from [cancerimagingarchive](https://www.cancerimagingarchive.net/) and only cancer-free tissue was included.

**Our kernels**

- [Pancreas image conversion notebook](https://www.kaggle.com/matjes/cptac-pda-to-zarr)  
- [Pancreas sampling notebook](https://www.kaggle.com/matjes/cptac-pda-pancreas-efficient-sampling-deepflash2)  
- [Pancreas training notebook](https://www.kaggle.com/matjes/cptac-pda-train-deepflash2)  
- [Pancreas validation notebook](https://www.kaggle.com/matjes/cptac-pda-deepflash2-validation/output)

**Training**

We trained a three-fold cross validation with the same model architecture and training routine as for the kidney data. The only things to adjust for are the mean and standard deviation of the image data.  
In about 10 epochs, each fold achieved a good Dice score during training (0.805, 0.808, 0.922), as well as good recall (0.848, 0.810, 0.916) and precision (0.827, 0.818, 0.949).  
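Adjusting the normalization for a new stain amounts to recomputing the per-channel mean and standard deviation on the new data. The sketch below accumulates these statistics over a set of RGB tiles and applies them; the function names and the scaling of `uint8` values to $[0, 1]$ are assumptions for this example.

```python
import numpy as np

def channel_stats(tiles):
    """Per-channel mean and std over an iterable of (H, W, C) uint8 tiles, scaled to [0, 1]."""
    n = 0
    s = 0.0
    ss = 0.0
    for t in tiles:
        x = t.reshape(-1, t.shape[-1]).astype(np.float64) / 255.0
        n += x.shape[0]
        s = s + x.sum(axis=0)
        ss = ss + (x ** 2).sum(axis=0)
    mean = s / n
    std = np.sqrt(np.maximum(ss / n - mean ** 2, 0.0))
    return mean, std

def normalize(tile, mean, std):
    # Scale to [0, 1], then standardize each channel.
    return (tile.astype(np.float32) / 255.0 - mean) / std
```

Accumulating sums instead of stacking all tiles keeps the memory footprint small, which matters for whole slide images.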

<figure>
  <img src="https://i.imgur.com/4o9GyhN.png" alt="Pancreas training summary"/>
    <figcaption>Figure 16: Pancreas training summary <br /></figcaption>
</figure>

<figure>
  <img src="https://i.imgur.com/Sc4Rais.png" alt="Instance based cross validation metrics on the pancreas data set"/>
    <figcaption>Figure 17: Instance based cross validation metrics on the pancreas data set <br /></figcaption>
</figure>
<br />
<br />
  
**Uncertainty Estimation**

We used the same code for validation and uncertainty estimation for the pancreas data. The figures below depict exemplary predictions.
The softmax and energy scores exhibit the same desired properties.
<figure>
  <img src="https://i.imgur.com/1UqI91j.png" alt="True positive example with high energy"/>
    <figcaption>Figure 18: True positive example with high energy</figcaption>
</figure>
<br />

<figure>
  <img src="https://i.imgur.com/BG74Yw7.png" alt="Example with low energy"/>
    <figcaption>Figure 19: Example with low energy</figcaption>
</figure>
<br />
<br />
<figure>
  <img src="https://i.imgur.com/KqityVf.png" alt="Mean energy values"/>
    <figcaption>Figure 20: Mean energy values</figcaption>
</figure>

# 3. Limitations and conclusion<a id="lim"></a>

**Limitations**

Since our models are not trained on the entire data, it is possible that rare artifacts like small air bubbles are not seen during training and get misclassified as glomeruli by the model. These regions have a lower energy score than normal glomeruli and can easily be found with human-in-the-loop quality control.
When training the next model, these regions can be upsampled as well.  
  
Furthermore, we refrained from stacking different architectures or differently trained models with, e.g., deviating loss functions. 
Even though this usually results in marginally better test performance, such models are hyper-optimized for one use case and are unlikely to provide any benefit when trained on a new kind of tissue such as colon crypts.

**Conclusion**

Our approach is well suited to segmenting large volumes of whole slide images:
- model-agnostic sampling 
- model-agnostic uncertainty scores 
- fast and reliable training on standard GPUs

We have shown that our training pipeline can achieve competitive performance without the need to stack different models and can be easily transferred to new data without the need for many annotated examples.  

The human-in-the-loop annotation process helps to create more training data with little effort. It can be improved even further with custom plugins in QuPath.