In [None]:
from matplotlib import pyplot as plt

def plot_image(img_path, title, figsize=(20, 25), title_pos=-0.2, fontsize=15):
    """
    Helper to visualize images loaded from an external dataset.

    A negative `title_pos` places the title below the image, like a caption.
    """
    img = plt.imread(img_path)
    plt.figure(figsize=figsize)
    plt.axis('off')
    plt.title(title, y=title_pos, fontsize=fontsize)
    plt.imshow(img)

# Introduction

The Human Biomolecular Atlas Program ([HuBMAP](https://hubmapconsortium.org/about/)) is aiming to create an open global atlas of the human
body at the cellular level, to accelerate understanding of the relationships between cell and tissue
organization, their functions, and human health. HuBMAP will create the next generation of molecular
analysis technologies and computational tools, enabling the generation of foundational 3D tissue maps
and construction of an atlas of the function and relationships among cells in the human body. One
component of this overarching goal is to identify medically relevant functional tissue units (FTUs) within
whole slide microscopy images of human tissues. Once these FTUs are detected, information on size,
shape, variability in number, and location within the tissue samples can be used to help in building a
spatially accurate and semantically explicit model of the human body, [as pointed out](https://www.kaggle.com/leahscherschel/dataset-details) by organizers of
this challenge.

One example of an FTU, which is in the focus of this competition, is the glomerulus found in the outer
layer of kidney tissue. Glomeruli perform filtration of waste products out of blood and are represented
by three-dimensional blocks of cells centered around a capillary, such that each cell in this block is
within diffusion distance from any other cell in the same block. The objective of this challenge is the
segmentation of regions with glomeruli in human kidney tissue images across different tissue
preparation pipelines. Below we provide an overview of the main data challenges arising in this task
as well as a brief solution description and a list of external data used to build our models. Next, we
discuss the main results and the limitations of the proposed approach, including confidence estimation.
Finally, we apply the developed method to the segmentation of cell nuclei in Line-field Confocal
Optical Coherence Tomography (LC-OCT) images, and provide the main conclusions of our work.


## About our team: deeplive.exe

- Theo Viel (@theoviel) is one of the three French Kaggle Competitions Grandmasters and
currently ranks 17th. His Kaggle achievements gave him a lot of experience in Computer Vision
and NLP. He currently works as a Computer Vision Researcher at DAMAE Medical, where he
uses his modeling skills to improve the understanding of healthy and pathological skin.
- Maxim Shugaev (@Iafoss) is a Computer Vision Researcher at Intelligent Automation, Inc. In
2019 he received his PhD degree in Applied Physics from the University of Virginia, USA.
Beyond that, Maxim has extensive experience from more than 15 computer vision related
projects, ranging from cancer grade assessment based on biopsies and pneumothorax
segmentation to deepfake detection, satellite image segmentation, and few-shot learning.
- Sebastien Fischman (@Optimo) is a 30-year-old French Computer Vision Researcher at
DAMAE Medical. Before working with medical images, he worked on Machine Learning
projects on very different topics, ranging from Auto Machine Learning and stock market
predictions using sentiment analysis of Tweets to ad retargeting and user segmentation.

## Thanks to the organisers

We would like to start by thanking the organizers and the Kaggle team for hosting this competition. We
believe that leveraging Artificial Intelligence in medical imaging will play a key role in improving the
overall understanding of the human body.


# Part 1: Data challenges

In this section we highlight the key challenges that the competition data poses for building a practically
usable model.

## Data quality variation

Because of the nature of the data, it is very likely that images have artefacts. Indeed, PAS kidney images are quite hard to acquire and a lot of variability can be observed. Such quality issues
can be found in both the train and test datasets, so the models should be robust to
unexpected image changes. To achieve this, we trained our models with extensive image
augmentation, including MixUp [[1]](https://arxiv.org/abs/1710.09412), which helps distinguish glomeruli from the background in
regions affected by image artefacts. As an illustration, we plot model predictions on parts of the *aa05346ff*
and *afa5e8098* images (from the test and train datasets, respectively), which include regions affected by
luminosity, contrast, and blur issues. In all cases the model keeps giving meaningful predictions, missing only one
glomerulus in *aa05346ff*. Moreover, the model is able to predict masks for glomeruli in the dark
region of *afa5e8098*, even though the ground truth masks were intentionally excluded during annotation
because of insufficient image quality (see Fig 1).


In [None]:
title = """Fig 1: Example of quality issues in aa05346ff (test) and afa5e8098 (train) images. Blue and red
colors correspond to model predictions of healthy and unhealthy glomeruli. Green color in the right figure
depicts glomeruli not included in the annotation because of insufficient image quality but still predicted
by the model. The zoomed-in image depicts one of these glomeruli."""
plot_image(
    "../input/presentation-images/image_quality_issues.png",
    title,
    (20, 25),
    -0.2, 
    fontsize=18
)

## Non-Sclerotic, Sclerotic, and Fibrous Crescent Glomeruli

One of the challenges of glomeruli detection is the significant variation in tissue characteristics from one
patient to another, which is exacerbated by the various tissue processing methods used. A further layer
of complexity is added by the fact that not all glomeruli are healthy. The variety of glomeruli
defects, including Fibrous Crescent (FC), Epithelial Crescent, Glomerulosclerosis, Necrosis, etc., the
varying time elapsed since glomeruli degradation, and the similarity of degraded glomeruli to
other nephron components make detection of unhealthy glomeruli extremely challenging.

In this competition, participants are not asked to detect unhealthy glomeruli. Leah Scherschel, one of
the competition hosts, [pointed out](https://www.kaggle.com/c/hubmap-kidney-segmentation/discussion/228993) that globally sclerotic glomeruli were excluded from the annotation.

However, the level of injury is on a continuous scale, and a number of annotated glomeruli
have defects. Fig 2 below shows several examples of unhealthy glomeruli from
[data.mendeley.com](https://data.mendeley.com/datasets/k7nvtgn2x6/3), some of which are similar to annotated instances in the provided train data.
It is not clear what fraction of fibrous tissue a glomerulus must contain before it stops being considered healthy.
Another striking example is the annotation of FC glomeruli in the *d488c759a* image from the test set and
the absence of such annotation in the other provided images. The uncertainty about whether to include a particular
glomerulus in the mask predicted by the model, rather than the detection of glomeruli (healthy + unhealthy) itself,
is the major factor degrading the model performance.


In [None]:
title = """Fig 2: Several examples of unhealthy glomeruli taken from data.mendeley.com."""
plot_image(
    "../input/presentation-images/unhealthy_gloms.png",
    title,
    (7, 7),
    -0.1
)

To mitigate the impact of unhealthy glomeruli detection on the model performance, we built a **2-class model performing detection of both healthy and unhealthy glomeruli**. Given the large variety of
glomeruli defects and insufficient expertise in the field, we decided to coarse-grain all the unhealthy glomeruli into a
single class. We hand-annotated the provided data to create masks for unhealthy
glomeruli. In addition, we hand-labeled the external data from [data.mendeley.com](https://data.mendeley.com/datasets/k7nvtgn2x6/3) the same way. As a result, during training the model is less affected by the noise resulting from
including or excluding a particular instance in the originally annotated healthy glomeruli class. The model
exploits the information that an unhealthy glomerulus is also a glomerulus. Moreover, the ability of
the model to detect unhealthy glomeruli may be helpful for practical use cases. Meanwhile, the drastic
difference between the cross validation (CV) scores of the model, 0.941 for healthy and 0.63 for
unhealthy glomeruli, indicates that detecting glomeruli with defects is
significantly more complicated than the task considered in this challenge.


## Missing annotation and AI guided labeling

One of the difficulties with hand annotation of data is that a human, despite being able to
correctly handle difficult cases, may sometimes not pay enough attention and miss several instances.
For example, Fig 3 below shows several glomeruli missed during annotation but recognized
by the model. Such missed masks negatively impact training and, unfortunately, model
evaluation. More importantly, such mistakes can affect medical diagnosis and, therefore, human lives.

In [None]:
title = """Fig 3: An example of missing annotation in 8242609fa image. Green color corresponds to correctly
predicted glomeruli, while red color indicates glomeruli predicted by the model but not included
into the ground truth masks. Zoomed in regions with missed glomeruli are shown on the left."""
plot_image(
    "../input/presentation-images/missing_annotations.png",
    title,
    (15, 15),
    -0.1
)

One possible solution is to have a deep learning model work together with a human on
sample analysis and annotation. The human can then review the cases where their annotation and
the AI annotation disagree and correct possible errors. Specifically, we used this strategy to correct several
mistakes in the provided annotation of the train data, as illustrated in Fig 3.

# Part 2: Solution Overview

For simplicity, we do not include the training code in the notebook. The code will be cleaned,
documented, and made publicly available on GitHub: https://github.com/Optimox/HubMap

## Augmentations & Data


### Image sizes


We worked with several image resolutions and tile sizes (a tile is a cropped image from the original image used for training and inference for the model).

Initial experiments were performed with tiles of size 256x256 extracted from images downsized by a factor of 4 (resolution/4), which enabled fast training and rapid prototyping.

We then switched to 512x512 tiles at resolution/4 to ensure that sufficient context around each glomerulus is included in the input.

In addition, our final models were trained with 512x512 tiles extracted at resolution/2 and 768x768 tiles extracted at resolution/3.

In all cases the tiles were dynamically selected during training from downsized images preloaded into RAM.

### Sampling strategy

To speed up convergence, we used sampling strategies that automatically select interesting tiles,
either based on the tissue annotation or on the presence of visible glomeruli. Each of these methods outperformed
random sampling. To make sure that the areas not covered by the above
sampling are also learnt, we only enforce the chosen strategy 90% of the time.
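
The sketch below illustrates one way such a mixed sampling rule could look. It is not taken from the training code: the function name `sample_tile_origin`, the list of glomerulus centres, and the clamping at image borders are all assumptions made for illustration; only the 90% guided / 10% random split comes from the text.

```python
import random


def sample_tile_origin(image_size, tile_size, interesting_centers,
                       p_guided=0.9, rng=random):
    """Return the (x, y) top-left corner of the next tile to crop.

    With probability `p_guided` the tile is centred on a randomly chosen
    point of interest (hypothetically, a glomerulus centre); otherwise the
    position is drawn uniformly over the whole image.
    """
    width, height = image_size
    max_x, max_y = width - tile_size, height - tile_size
    if interesting_centers and rng.random() < p_guided:
        cx, cy = rng.choice(interesting_centers)
        x, y = cx - tile_size // 2, cy - tile_size // 2
    else:
        x, y = rng.randint(0, max_x), rng.randint(0, max_y)
    # Clamp so that the tile always lies fully inside the image.
    return min(max(x, 0), max_x), min(max(y, 0), max_y)


rng = random.Random(0)
centers = [(100, 120), (900, 400)]  # hypothetical glomerulus centres
origins = [sample_tile_origin((1024, 768), 256, centers, rng=rng)
           for _ in range(1000)]
```

Keeping a small share of purely random tiles means background regions far from any glomerulus are still seen during training.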

### Augmentation strategy

To address the aforementioned data quality issues, we use aggressive data augmentation, which
includes:
- Brightness and Contrast changes
- RGB Shifting
- Hue, Saturation and Value shifting
- Color Jittering
- Artificial blurring: Motion blur, Gaussian blur, and Defocus blur
- CutMix and MixUp in some experiments, applied with 50% probability
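
As an illustration of the last item, here is a minimal MixUp sketch for a batch of tiles and masks. It follows the general recipe of the MixUp paper [[1]](https://arxiv.org/abs/1710.09412); mixing the masks with the same coefficient as the images and the value `alpha=0.4` are assumptions, not details confirmed by the text. Only the 50% application probability is taken from the list above.

```python
import numpy as np


def mixup(images, masks, alpha=0.4, p=0.5, rng=None):
    """Blend each sample of a batch with another randomly chosen sample.

    images: float array of shape (batch, height, width, channels)
    masks:  float array of shape (batch, height, width)
    Returns the mixed images, the mixed (soft) masks and the coefficient.
    """
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() >= p:
        return images, masks, 1.0  # augmentation skipped for this batch
    lam = rng.beta(alpha, alpha)  # mixing coefficient in [0, 1]
    perm = rng.permutation(len(images))  # partner sample for each item
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_masks = lam * masks + (1.0 - lam) * masks[perm]
    return mixed_images, mixed_masks, lam


rng = np.random.default_rng(0)
batch_images = rng.random((4, 8, 8, 3))
batch_masks = (rng.random((4, 8, 8)) > 0.5).astype(float)
mixed_images, mixed_masks, lam = mixup(batch_images, batch_masks, rng=rng)
```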

In addition, we leverage invariance in the data: we randomly flip, rotate, shift, and scale the tiles.
However, since rotating, scaling, and shifting an image introduces side effects at edges, we perform
augmentation on 1.5x larger tiles and then take the center crop. Several examples of
augmented images are illustrated below (Fig 4).
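
The "augment on a larger tile, then crop the centre" idea can be sketched as follows. The helper names and the `augment` callable are hypothetical, and a wrap-around horizontal shift (`np.roll`) merely stands in for a geometric transform that corrupts the tile borders.

```python
import numpy as np


def center_crop(image, size):
    """Return the central size x size window of an image."""
    height, width = image.shape[:2]
    top, left = (height - size) // 2, (width - size) // 2
    return image[top:top + size, left:left + size]


def crop_augment_crop(image, top, left, tile_size, augment, scale=1.5):
    """Augment an enlarged tile and keep only its clean centre.

    Assumes the enlarged tile fits entirely inside `image`.
    """
    big = int(round(tile_size * scale))
    margin = (big - tile_size) // 2
    large_tile = image[top - margin:top - margin + big,
                       left - margin:left - margin + big]
    return center_crop(augment(large_tile), tile_size)


image = np.arange(64 * 64).reshape(64, 64)


def shift_right(tile):
    # Stand-in augmentation: shift by 3 pixels, wrapping pixels around the border.
    return np.roll(tile, 3, axis=1)


tile = crop_augment_crop(image, top=20, left=20, tile_size=16, augment=shift_right)
```

In this toy example the wrapped columns stay inside the discarded margin, so the returned tile contains only genuine image content.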

In [None]:
title = """Fig 4: Examples of augmented images used during training (with corresponding annotations in red)."""
plot_image(
    "../input/presentation-images/image_augmentations.png",
    title,
    (20, 15),
    -0.05
)

## External data

For building our model, in addition to the provided train data, we used kidney histopathological images
from several additional sources.

- DATASET_A from [data.mendeley.com](https://data.mendeley.com/datasets/k7nvtgn2x6/3). It consists of 31 whole slide images (WSI) that provide significant variability and ensure better generalization. The WSI sizes range from 21651x10498 pixels to 49799x32359 pixels. We manually labeled this dataset with 2-class masks, healthy (the class detected in this challenge) and unhealthy glomeruli, using QuPath.

- Kidney glomeruli-ROIs dataset from [zenodo](https://zenodo.org/record/4299694). The original annotation consists of two classes, similar to the ones we used for DATASET_A. The annotations were missing a number of glomeruli (mostly at image edges), so we manually added them. Some of our models do not actually use this data.

- 2 publicly available kidney PAS stained microscopy images from [the HuBMAP portal](https://portal.hubmapconsortium.org) not included in the train/test data and 5 images from the public test set of this challenge. These images were annotated with pseudo labels generated by our model. In addition, we manually labeled unhealthy glomeruli in the images from the test set. Thus, the model could become familiar with artefacts present in the test set.


## Model architecture

For glomeruli segmentation we chose a U-Net like network architecture [[2]](https://arxiv.org/abs/1505.04597). This architecture consists of
an encoder, which creates a representation of extracted features at different levels, and a decoder, which
combines the features and generates a prediction as a segmentation mask. The skip connections
between the encoder and the decoder allow effective use of features from the intermediate convolutional
layers of the encoder, without the need for the information to pass through the entire model.

In our pipeline we use the widely used EfficientNet encoders. They slightly outperformed networks from the
ResNet family (e.g. ResNeXt50) in our experiments. We considered several options to improve the decoder:
- Feature Pyramid Networks (FPN) skip connections [[3]](https://arxiv.org/abs/1612.03144)
- Bottleneck Transformer central block [[4]](https://arxiv.org/abs/2101.11605) to expand the receptive field of the model

However, the problem considered in this competition is more about deciding
whether or not to include a particular glomerulus in the produced mask than about creating a highly
accurate segmentation. The choice of the decoder architecture is therefore not critical, since the encoder takes
care of most of the decision task. In our experiments, we saw that even the simplest decoders
achieve good performance, although they are slower to converge.

## Training setup
For the training setup, we adopt practices acquired from previous competitions that we know work
well:
- The cross-entropy loss is used as a loss function: it is simple but efficient. The Lovász loss [[5]](https://arxiv.org/abs/1705.08790) did not provide better performance in the considered task.
- Our model predicts two glomeruli classes (healthy & unhealthy); we use a 0.2 weight for the unhealthy class.
- Models are trained for at least 10000 iterations using the largest batch size that fits on a 2080Ti with half precision. Smaller models are trained for longer, especially if MixUp/CutMix augmentations are used.
- The learning rate is linearly increased up to 0.001 during the first 500 iterations, and then decreased linearly to 0.
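
The learning rate schedule from the last item can be written as a small function. This is a sketch under stated assumptions: the warm-up is taken to start from 0, and `total_steps=10_000` simply reuses the minimum iteration count mentioned above; the function name is hypothetical.

```python
def learning_rate(step, total_steps=10_000, warmup_steps=500, peak_lr=1e-3):
    """Linear warm-up to `peak_lr`, followed by a linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Learning rate at a few points of the schedule.
schedule = {step: learning_rate(step) for step in (0, 250, 500, 5250, 10_000)}
```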

Overall, we are able to reach a leaderboard score of 0.940 in 3 hours of training using a relatively
cheap setup. Surprisingly, we noticed that image resolution does not play a noticeable role.

# Part 3: Results

The validation scheme we used is a 5-fold split per image. The idea behind it is that we want
the models to generalize well to unseen images. We primarily worked on optimizing our validation
dice score, assuming that this metric would correlate best with the private leaderboard. Because one of
the images in the public dataset, *d488c759a*, has a number of Fibrous Crescent glomeruli annotated,
we quickly realized that this image made the leaderboard unreliable and not representative. We observed a first
leaderboard jump (on April 13) when our models started predicting fibrous crescent glomeruli.
However, this jump was due to label ambiguity in the external data (the [publicly available labels](https://www.kaggle.com/c/hubmap-kidney-segmentation/discussion/208972) for DATASET_A include both healthy and unhealthy glomeruli as a single class), so we decided to focus
on optimizing the performance on the other test images, excluding *d488c759a* (starting from April 30), as shown in Fig 5.
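
A per-image 5-fold split, as described at the start of this section, keeps all tiles of one image in the same fold so that validation always happens on unseen images. The sketch below shows one possible way to build such folds; the helper names, the image identifiers, and the shuffled round-robin assignment are assumptions made for illustration.

```python
import random


def split_images_into_folds(image_ids, n_folds=5, seed=0):
    """Assign whole images to folds (shuffled round-robin)."""
    ids = sorted(image_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]


def train_val_ids(folds, val_fold):
    """Return (train image ids, validation image ids) for one fold."""
    val = list(folds[val_fold])
    train = [image_id
             for k, fold in enumerate(folds) if k != val_fold
             for image_id in fold]
    return train, val


image_ids = [f"image_{i:02d}" for i in range(10)]  # hypothetical identifiers
folds = split_images_into_folds(image_ids)
train_ids, val_ids = train_val_ids(folds, val_fold=0)
```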

In [None]:
title = """Fig 5: Dice score evolution on CV and LB over time."""
plot_image(
    "../input/presentation-images/Cv_vs_LB.png",
    title,
    (15, 15),
    -0.05
)

## Performance assessment

In fact, the dice score is not ideal for assessing how well our models work. Segmentation labels
are quite noisy, and we believe it is more important to detect glomeruli than to segment them. To this
end, we put emphasis on glomeruli level metrics. In our predictions, we identify individual glomeruli as
connected components of the mask.

As an example, we report performances on *2f6ecfcdf* (see Fig 6):
- The dice score is 0.9610
- There are 160 glomeruli
- 3 glomeruli are missed
- There are 6 false positives
- Detection score: 160 / (160 + 3 + 6) = 94.7%
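
A glomeruli level evaluation of this kind can be sketched in pure Python. The code below labels connected components in the predicted and ground truth masks and computes the detection score TP / (TP + FN + FP) used in the list above. The 4-connectivity and the matching rule (a prediction and an annotation match as soon as they share one pixel) are assumptions; the text does not specify them.

```python
from collections import deque


def label_components(mask):
    """Label 4-connected components of a binary mask (list of 0/1 rows)."""
    height, width = len(mask), len(mask[0])
    labels = [[0] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if mask[y][x] and not labels[y][x]:
                count += 1
                labels[y][x] = count
                queue = deque([(y, x)])
                while queue:  # breadth-first flood fill
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < height and 0 <= nx < width
                                and mask[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


def detection_counts(pred_mask, true_mask):
    """Return (true positives, false negatives, false positives)."""
    pred_labels, n_pred = label_components(pred_mask)
    true_labels, n_true = label_components(true_mask)
    matched_pred, matched_true = set(), set()
    for y, row in enumerate(pred_labels):
        for x, pred_id in enumerate(row):
            true_id = true_labels[y][x]
            if pred_id and true_id:  # overlapping pixel
                matched_pred.add(pred_id)
                matched_true.add(true_id)
    tp = len(matched_true)
    return tp, n_true - tp, n_pred - len(matched_pred)


def detection_score(tp, fn, fp):
    total = tp + fn + fp
    # With nothing to detect and nothing predicted, the result is perfect.
    return tp / total if total else 1.0


true_mask = [[1, 1, 0, 0, 0, 0],
             [1, 1, 0, 0, 1, 1],
             [0, 0, 0, 0, 1, 1],
             [0, 0, 0, 0, 0, 0]]
pred_mask = [[1, 1, 0, 0, 0, 0],
             [1, 0, 0, 0, 0, 0],
             [0, 0, 0, 0, 0, 0],
             [0, 0, 1, 1, 0, 0]]
counts = detection_counts(pred_mask, true_mask)
score = detection_score(*counts)
```

In this toy example one glomerulus is found, one is missed, and one prediction has no matching annotation.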

In [None]:
title = """Fig 6: Error cases on 2f6ecfcdf. Green = prediction, red = ground truth."""
plot_image(
    "../input/presentation-images/error_cases.png",
    title,
    (7, 7),
    -0.05
)

However, this metric is not robust to annotation noise. Among the few mistakes, it is sometimes hard for
a non-expert to tell whether the model is wrong or the label is missing, as in the *b2dc8411c* example (see Fig 7).
Overall, to improve model performance, we spent a lot of time visualizing predictions since hand-crafted
metrics were not reliable enough.

In [None]:
title = """Fig 7: b2dc8411c has 138 annotations; the model has no false negatives but predicts 5 false positives."""
plot_image(
    "../input/presentation-images/false_positives.png",
    title,
    (18, 5),
    -0.1, 
    fontsize=18
)

## Limitations

One of the challenges of this competition is label noise, which makes it really difficult to evaluate models. As a
result, it is hard to robustly assess their performance, which in our opinion is a big issue for medical
imaging tasks.

Still, from the images above, we can identify a few weak points of our models:
- Glomeruli predictions sometimes do not look round but have complex shapes (see images 2 and 4 in Fig 7)
- Predictions have some artefacts
- The model struggles with glomeruli with a large empty space in them
- Despite our model being good at differentiating sclerotic from non-sclerotic glomeruli, it still predicts some sclerotic ones (as we mentioned earlier, the level of injury is on a continuous scale, and sometimes it may be difficult to distinguish the cases)

We believe post-processing could be the key to removing some of the flaws of the model, but we could not
develop a reliable strategy that consistently improves our validation score.

## Confidence estimation

As mentioned above, glomeruli detection is a more sensible basis for assessing model
performance. However, the model we used does not directly output a confidence score since we use
semantic segmentation instead of instance segmentation. As a proxy for this score, we use the
maximum pixel prediction over each glomerulus (see Fig 8). We noticed that this was a good estimate of how
confident the model was, and that low scores (<0.7) were often caused by hard cases.
We tried removing false positives by discarding low confidence predictions, but it did not work consistently
enough. However, this allowed us to detect annotation mistakes more consistently.
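
The confidence proxy described above can be sketched with NumPy. The code assumes that a label map of connected components is already available (for instance from a routine such as `scipy.ndimage.label`, which is not called here); the function names are hypothetical, and only the max-pixel rule and the 0.7 threshold come from the text.

```python
import numpy as np


def component_confidences(probabilities, labels):
    """Map each component id (> 0) to the maximum probability inside it."""
    return {int(k): float(probabilities[labels == k].max())
            for k in np.unique(labels) if k != 0}


def low_confidence_components(confidences, threshold=0.7):
    """Ids of components whose confidence falls below the threshold."""
    return sorted(k for k, score in confidences.items() if score < threshold)


probabilities = np.array([[0.95, 0.90, 0.10, 0.00],
                          [0.80, 0.20, 0.10, 0.60],
                          [0.00, 0.10, 0.50, 0.55],
                          [0.00, 0.00, 0.00, 0.00]])
labels = np.array([[1, 1, 0, 0],
                   [1, 0, 0, 2],
                   [0, 0, 2, 2],
                   [0, 0, 0, 0]])
confidences = component_confidences(probabilities, labels)
to_review = low_confidence_components(confidences)
```

Components listed in `to_review` are the ones that would be flagged for a visual check.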

Also, since our approach uses two classes, our model can also be used to provide a score for sclerotic
glomeruli. This score can be used to assess the evolution of the disease, helping the practitioner to
differentiate between a healthy and an unhealthy patient.
Overall, a high confidence score indicates that the prediction is almost certainly correct. A practitioner
manually verifying all glomeruli would know that these are not worth spending a lot
of time on. The challenge then lies in the few (<10%) glomeruli with low scores, which can be visually
checked. This framework speeds up the tedious histology reviewing work, which is highly beneficial since expert time is very valuable.

In [None]:
title = """Fig 8: Confidence scores example for b2dc8411c. Green = prediction, red = ground truth."""
plot_image(
    "../input/presentation-images/confidence_esimation.png",
    title,
    (15, 15),
    -0.1,
    fontsize=18
)

# Part 4: Other applications and images

## Histological slides of any kind

The method used by our team is suitable for any type of histological slide. It is straightforward to adapt
it to any semantic segmentation task, provided there is data available. Then, achieving decent
performance is only a matter of tuning the hyperparameters. Furthermore, the pipeline also works
with multiclass problems, as shown by the additional “unhealthy glomeruli” class we built.


## Other imaging domains: Line-field confocal optical coherence tomography (LC-OCT)


With minor changes, our pipeline can be used with completely different images. In fact, only the data
augmentation strategy is specific to histology. To illustrate this, we present results on the task of
segmenting cell nuclei in LC-OCT images. [LC-OCT](https://pubmed.ncbi.nlm.nih.gov/30353716/) is a new non-invasive medical imaging
technology that allows exploring the skin in vivo. The images are black and white, unlike
histology, but they provide almost the same information with the advantage of being non-invasive and allowing for 3D acquisitions (see Fig 9).

In [None]:
title = """Fig 9: Visual comparison of LC-OCT and histology. Taken from Monier et al "In vivo characterization of
healthy human skin with a novel, non-invasive imaging technique: line-field confocal optical
coherence tomography"."""
plot_image(
    "../input/presentation-images/monier_healthy_skin.png",
    title,
    (15, 15),
    -0.1
)

### Segmenting 2D LC-OCTs

We adapted our pipeline to the task of keratinocyte nuclei (KCs) segmentation. Keratinocytes contain a lot
of information about the skin: some diseases and cancers affect keratinocytes. Segmenting them
manually is a long and tedious task, and with automated segmentation we hope to accelerate the
screening of patients and automatically provide the dermatologist with useful insights. In fact, once the
segmentations are generated (see Fig 10 and 11), one can compute quantitative metrics on top of them. For instance,
assessing the number, size, and shape of the keratinocytes could help diagnose diseased skin.

The problem differs a lot from the task of
segmenting glomeruli (more cells, smaller cells) but our pipeline adapts well to it.

In [None]:
title = """Fig 10: 2D LC-OCT image example ('en coupe')."""
plot_image(
    "../input/presentation-images/side_view_no_seg.png",
    title,
    (20, 15),
    -0.1
)

title = """Fig 11: Segmentation of keratinocytes on 2D LC-OCT images ('en coupe')."""
plot_image(
    "../input/presentation-images/side_view_seg.png",
    title,
    (20, 15),
    -0.1
)

This pipeline also works with 'en face' LC-OCT images; we trained a second model for this task (see Fig 12 and 13).

In [None]:
title = """Fig 12: 2D LC-OCT image example ('en face')."""
plot_image(
    "../input/presentation-images/top_view_no_seg.png",
    title,
    (20, 15),
    -0.1
)

title = """Fig 13: Segmentation of keratinocytes on 2D LC-OCT images ('en face')."""
plot_image(
    "../input/presentation-images/top_view_seg.png",
    title,
    (20, 15),
    -0.1
)

### Towards 3D segmentation of LC-OCT images

As mentioned above, LC-OCT allows viewing 3D stacks, which provide a higher level of information than histology. Since keratinocytes are 3D as well, it makes sense to use our predictions at a 3D level. By averaging the predictions along the two axes (en coupe and en face), we are able to retrieve the 3D structure of cells (see Fig 14).
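
The averaging of the two 2D models over a 3D stack can be sketched as follows. This is an illustration under stated assumptions: the volume is taken to be stored as (depth, height, width), 'en face' slices are taken along the first axis and 'en coupe' slices along the second, and the two predictor callables are hypothetical placeholders for the trained models.

```python
import numpy as np


def predict_volume(volume, predict_en_face, predict_en_coupe):
    """Average slice-wise predictions of two 2D models over a 3D stack.

    volume: array of shape (depth, height, width). Each predictor maps a
    2D slice to a probability map of the same shape.
    """
    depth, height, _ = volume.shape
    # Assumed 'en face' view: one (height, width) slice per depth index.
    en_face = np.stack([predict_en_face(volume[z]) for z in range(depth)],
                       axis=0)
    # Assumed 'en coupe' view: one (depth, width) slice per row index.
    en_coupe = np.stack([predict_en_coupe(volume[:, y]) for y in range(height)],
                        axis=1)
    return (en_face + en_coupe) / 2.0


def dummy_en_face(slice_2d):
    return slice_2d  # placeholder: returns the input as its "prediction"


def dummy_en_coupe(slice_2d):
    return np.zeros_like(slice_2d)  # placeholder: predicts background


volume = np.random.default_rng(0).random((4, 5, 6))
prediction = predict_volume(volume, dummy_en_face, dummy_en_coupe)
```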

In [None]:
title = """Fig 14: Segmentation of keratinocytes on 3D LC-OCT image."""
plot_image(
    "../input/presentation-images/3D_KCs_seg.png",
    title,
    (18, 15),
    -0.1
)

# Conclusions

We have presented our approach to the HuBMAP - Hacking the Kidney Competition. This approach is built on understanding the challenges behind the data:
- It takes into consideration the link between healthy glomeruli and unhealthy ones (Sclerotic, Fibrous Crescent, etc.) by predicting them as two different classes.
- It is designed to be robust to noise, as we used aggressive augmentations.
- It incorporates several external datasets, which we manually annotated (using 2 classes of glomeruli) to ensure that our model is able to generalize well to new cases.


Our model architecture is relatively simple, and the pipeline can be easily transferred to other very different segmentation tasks such as keratinocytes segmentation in LC-OCTs. Furthermore, it achieves competitive performances on the leaderboard, even though label noise makes it difficult to tell whether the results will translate well to the private leaderboard. 

We also presented a framework to assess the model performance using glomeruli level predictions. In fact, the image-level dice is too coarse to entirely capture the challenge of the problem. By leveraging true positive, false positive, and false negative rates at the glomeruli level, we are able to understand the weaknesses of the model. Furthermore, our model is able to provide a confidence score estimate, which could help practitioners in their decision making.

We hope that our work will help to understand the relationships between glomeruli and the health of an individual. We also hope that a lot of people will benefit from the effort that the hosts, the Kaggle team, the participants, and we put into the competition.