# [Image Segmentation](https://www.ibm.com/topics/image-segmentation)

Image segmentation is the process of partitioning a digital image into multiple image segments. Classification can tell us whether a cat is present, while segmentation will tell us where we can find it.

<figure>
    <center>
        <img src="attachments/segmentation-categories.png"  style="width:550px;" >
    </center>
</figure>


## Instance Segmentation

Labels individual objects of interest with distinct labels. 

So whereas semantic segmentation would label a picture of two people shaking hands with two labels (“person” and “background”), instance segmentation would have three labels (“person1,” “person2,” and “background”) with a boundary somewhere around the clasped hands.

## [Object Detection](https://www.ibm.com/topics/object-detection)

Locates an item of interest in an image and puts a bounding box around the item. More information in [YOLOv3](https://pjreddie.com/media/files/papers/YOLOv3.pdf).


# Semantic Segmentation

Semantic segmentation is the act of classifying individual pixels in an image using labels just like those we’ve seen for our classification tasks, for example, “bear,” “cat,” “dog,” and so on. If done properly, this will result in distinct chunks or regions that signify things like “all of these pixels are part of a cat.” This takes the form of a label mask or heatmap that identifies areas of interest. In our case, we will have a simple binary label: true values will correspond to nodule candidates, and false values mean uninteresting healthy tissue.

## Plan

1. **Segmentation**. First we will learn how segmentation works with a U-Net model, including what the new model components are and what happens to them as we go through the segmentation process.
2. **Update**. To implement segmentation, we need to change our existing code base in three main places. 
    a. **Update the model**. We will integrate a preexisting U-Net into our segmentation model. The model in this chapter will output an entire image.
    b. **Change the dataset**. We need to change our dataset to not only deliver bits of the CT but also provide masks for the nodules. The classification dataset consisted of 3D crops around nodule candidates, but we’ll need to collect both full CT slices and 2D crops for segmentation training and validation.
    c. **Adapt the training loop**. We need to adapt the training loop so we bring in a new loss to optimize. Because we want to display images of our segmentation results in TensorBoard, we’ll also do things like saving our model weights to disk.
3. **Results**. Finally, we’ll see the fruits of our efforts when we look at the quantitative segmentation results.

# U-Net

In the diagram below, the boxes represent intermediate results and the arrows represent operations between them. The U-shape of the architecture comes from the multiple resolutions at which the network operates. In the top row is the full resolution (512 × 512 for us), the row below has half that, and so on. The data flows from top left to bottom center through a series of convolutions and downscaling.

<figure>
    <center>
        <img src="attachments/unet-architecture.png"  style="width:750px;" >
    </center>
</figure>


## Architecture

Earlier network designs already had this _U-shape_, which people attempted to use to address the **limited receptive field** size of fully convolutional networks. To address this limited field size, they used a design that copied, inverted, and appended the focusing portions of an image-classification network to create a symmetrical model that goes from fine detail to wide receptive field and back to fine detail.

Those earlier network designs had problems converging, however, most likely due to the **loss of spatial information during downsampling**. Once information reaches a large number of very downscaled images, the exact location of object boundaries gets harder to encode and therefore reconstruct. To address this, the U-Net authors added the skip connections.

In _U-Net_, skip connections **short-circuit** inputs along the downsampling path into the corresponding layers in the upsampling path. These layers receive as input both the upsampled results of the wide receptive field layers from lower in the U and the output of the earlier fine detail layers via the “copy and crop” bridge connections.

All of this means those final detail layers are operating with the best of both worlds. They’ve got both information about the larger context surrounding the immediate area and fine detail data from the first set of full-resolution layers.


## Implementation

> Do not test multiple modifications at the same time!

It’s important to note that our output is a single channel, with each pixel of output representing the model’s estimate of the probability that the pixel in question is part of a nodule.

### Base Model

We will use a preexisting [U-Net model](https://github.com/jvanvugt/pytorch-unet) that implements the [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597) paper, and apply different improvements using an _ablation study_ process.

### Ablation Study

> [An ablation study](https://en.wikipedia.org/wiki/Ablation_(artificial_intelligence)) investigates the performance of an AI system by removing certain components to understand the contribution of the component to the overall system.

TBD: How exactly are we implementing ablation study and what is our process?

#### Batch Normalization

We’re going to pass the input through batch normalization. This way, we won’t have to normalize the data ourselves in the dataset; and, more importantly, we will get normalization statistics (read: mean and standard deviation) estimated over individual batches.

This means when a batch is dull for some reason—that is, when there is nothing to see in all the CT crops fed into the network—it will be scaled more strongly. The fact that samples in batches are picked randomly at every epoch will minimize the chances of a dull sample ending up in an all-dull batch, and hence those dull samples getting overemphasized.
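
To make the per-batch statistics concrete, here is a minimal NumPy sketch of the normalization step that `nn.BatchNorm2d` performs in training mode, leaving out its learnable scale and shift and its running averages. The `batch_normalize` helper and the two toy batches are illustrative assumptions, not part of the project code.

```python
import numpy as np

def batch_normalize(batch, eps=1e-5):
    """Normalize each channel using the mean/variance of this batch only."""
    # batch has shape (N, C, H, W); statistics are taken over N, H and W.
    mean = batch.mean(axis=(0, 2, 3), keepdims=True)
    var = batch.var(axis=(0, 2, 3), keepdims=True)
    return (batch - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# Hypothetical data: a high-contrast batch and a "dull" batch that barely varies.
lively = rng.normal(loc=0.0, scale=100.0, size=(4, 1, 8, 8))
dull = rng.normal(loc=0.0, scale=1.0, size=(4, 1, 8, 8))

lively_out = batch_normalize(lively)
dull_out = batch_normalize(dull)

# The effective gain is 1 / std of the batch, so the dull batch is amplified more.
lively_gain = 1.0 / np.sqrt(lively.var() + 1e-5)
dull_gain = 1.0 / np.sqrt(dull.var() + 1e-5)
```

Both outputs end up with roughly zero mean and unit standard deviation, which is exactly why a batch with little contrast gets scaled up more strongly than one with a lot.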

> **Note** that we’re using `nn.BatchNorm2d` here. This is because U-Net is fundamentally a two-dimensional segmentation model. We could adapt the implementation to use 3D convolutions, in order to use information across slices. The memory usage of a straightforward implementation would be considerably greater: that is, we would have to chop up the CT scan. Also, the fact that pixel spacing in the Z direction is much larger than in-plane makes a nodule less likely to be present across many slices. These considerations make a fully 3D approach less attractive for our purposes. Instead, we’ll adapt our 3D data to be segmented a slice at a time, providing adjacent slices for context (for example, detecting that a bright lump is indeed a blood vessel gets much easier alongside neighboring slices).

#### Applying Sigmoid to model output

We are going to pass the output through an `nn.Sigmoid` layer to restrict the output to the range $[0, 1]$.
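
As a small illustration of what the sigmoid does to raw model outputs, the NumPy sketch below maps a few example logits into $[0, 1]$ and thresholds them into a boolean mask. The logit values and the 0.5 threshold are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(x):
    """Elementwise logistic function, the same mapping nn.Sigmoid applies."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical raw outputs (logits) for five pixels.
logits = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
probs = sigmoid(logits)   # per-pixel nodule probabilities in [0, 1]
mask = probs > 0.5        # assumed threshold for a binary prediction
```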

#### Reducing the total depth

We will reduce the total depth and number of filters we allow our model to use. **The capacity** of the model using the standard parameters far **outstrips** our dataset size: the network has many more parameters than our data can meaningfully constrain. This also means we’re unlikely to find a pretrained model that matches our exact needs.

TBD: What does outstrips mean?




# Dataset

The original U-Net implementation did not use padded convolutions, which means while the output segmentation map was smaller than the input, every pixel of that output had a fully populated receptive field. None of the input pixels that fed into the determination of that output pixel were padded, fabricated, or otherwise incomplete. Thus the output of the original U-Net will tile perfectly, so it can be used with images of any size (except at the edges of the input image, where some context will be missing by definition).

## Why do we have to use padded convolutions?

### Losing information near the edges without padding

The U-Net paper used $572 × 572$ image patches, which resulted in $388 × 388$ output maps. The input images are bigger than our $512 × 512$ CT slices, and the output is quite a bit smaller! That would mean any nodules near the edge of the CT scan slice wouldn’t be segmented at all. Although this setup works well when dealing with very large images, it’s not ideal for our use case.

We will address this issue by padding images during convolutions. This will mean we can use input images of any size, and we will get output of the same size. We may lose some fidelity near the edges of the image, since the receptive field of pixels located there will include regions that have been artificially padded, but that’s a compromise we decide to live with.
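
The size arithmetic can be checked with a short sketch. It assumes the layout of the original paper (five resolution levels, two 3 × 3 convolutions per level, 2 × 2 pooling on the way down and 2× upsampling on the way up); the function name and its parameters are hypothetical.

```python
def unet_output_size(size, depth=5, convs_per_block=2, kernel=3, padded=False):
    """Trace the spatial size of one image side through a U-Net-style network."""
    shrink = 0 if padded else (kernel - 1)  # pixels lost per convolution
    # Downsampling path: conv block, then 2x2 pooling (no pooling at the bottom).
    for level in range(depth):
        size -= convs_per_block * shrink
        if level < depth - 1:
            if size % 2:
                raise ValueError(f"odd size {size} cannot be pooled cleanly")
            size //= 2
    # Upsampling path: double the resolution, then another conv block.
    for _ in range(depth - 1):
        size *= 2
        size -= convs_per_block * shrink
    return size

unpadded = unet_output_size(572)              # 388, as in the paper
padded = unet_output_size(512, padded=True)   # 512, same as the input
```

Without padding, every convolution trims the border, which is where the 572 to 388 reduction comes from; with padding, the output keeps the input size.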

### U-Net trade-offs for 3D vs. 2D data

The second issue is that our 3D data doesn’t line up exactly with U-Net’s 2D expected input. Simply taking our $512 × 512 × 128$ image and feeding it into a converted-to-3D U-Net class won’t work, because we’ll exhaust our GPU memory.

Instead of trying to do things in 3D, we’re going to treat each slice as a 2D segmentation problem and cheat our way around the issue of context in the third dimension by providing neighboring slices as separate channels.

**Pros:**
- Can run on a "normal" GPU.

**Cons:**
- We lose the direct spatial relationship between slices when represented as channels, as all channels will be linearly combined by the convolution kernels with no notion of them being one or two slices away, above or below.
- We also lose the wider receptive field in the depth dimension that would come from a true 3D segmentation with downsampling.



## Building the ground truth

We have a mismatch between our human-labeled training data and the actual output we want to get from our model. We have **annotated points**, but we want a **per-voxel mask** that indicates whether any given voxel is part of a nodule.

### Bounding boxes

If we assume that the nodule locations are roughly centered in the mass, we can trace outward from that point in all three dimensions until we hit low-density voxels, indicating that we’ve reached normal lung tissue (which is mostly filled with air).

<figure>
    <center>
        <img src="attachments/bounding-boxes.png"  style="width:750px;" >
    </center>
</figure>

#### Algorithm

Here's the step-by-step algorithm in simple instructions. Repeat the loop below for every dimension:

**1. Grab the center data:**
  - This refers to the data point identified as the center of the nodule.

**2. Initialize loop variables:**
  - Set `index_radius` to 0 (initial search radius).
  - Create a variable `density_above_threshold` and set it to `True` (assuming high density at the center).

**3. Loop until density falls below threshold:**
  - While `density_above_threshold` is `True`:
    - Try:
      - Check the density at `center_index + index_radius` and `center_index - index_radius`.
      - If either density is below the defined threshold, set `density_above_threshold` to `False`; otherwise increment `index_radius` by 1 (expand the search radius).
    - Except (catch `IndexError`):
      - The search went out of bounds, so stop the loop and step `index_radius` back by 1.

**4. Define bounding box:**
  - The resulting box around the center has a one-voxel border of low-density tissue on at least one side.
  - Some nodules touch the lung wall, so the search must stop in both directions as soon as air is reached on either side.
  - Because `center_index + index_radius` and `center_index - index_radius` are checked together, the one-voxel border is only guaranteed on the edge closest to the annotated center.

**5. Set bounding box mask:**
  - Set the corresponding box in a boolean bounding box mask array to `True`. The mask marks the region of interest around each nodule.
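
The steps above can be sketched in NumPy as follows. This is an illustration under assumptions, not the project's actual code: the function names are hypothetical, the `-700` HU cutoff is an assumed air/tissue threshold, and explicit bounds checks replace the `IndexError` handling because NumPy silently wraps negative indices.

```python
import numpy as np

def axis_radius(hu, center, axis, threshold_hu):
    """Grow a radius along one axis until either side drops below threshold."""
    center = tuple(center)
    radius = 0
    while True:
        lo, hi = center[axis] - radius, center[axis] + radius
        if lo < 0 or hi >= hu.shape[axis]:
            radius -= 1  # stepped outside the volume; back off by one
            break
        idx_lo = center[:axis] + (lo,) + center[axis + 1:]
        idx_hi = center[:axis] + (hi,) + center[axis + 1:]
        if hu[idx_lo] < threshold_hu or hu[idx_hi] < threshold_hu:
            break  # low-density voxel reached on at least one side
        radius += 1
    return radius

def bounding_box_mask(hu, centers, threshold_hu=-700):
    """Mark a box around every annotated center in a boolean mask."""
    mask = np.zeros(hu.shape, dtype=bool)
    for center in centers:
        box = []
        for axis in range(hu.ndim):
            r = axis_radius(hu, center, axis, threshold_hu)
            box.append(slice(center[axis] - r, center[axis] + r + 1))
        mask[tuple(box)] = True
    return mask

# Toy volume: air everywhere (-1000 HU) with a 3x3x3 dense cube in the middle.
hu = np.full((9, 9, 9), -1000.0)
hu[3:6, 3:6, 3:6] = 0.0
mask = bounding_box_mask(hu, [(4, 4, 4)])
```

For this symmetric toy nodule, the box covers the 3 × 3 × 3 cube plus a one-voxel border of air, that is, a 5 × 5 × 5 region.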


## Segmentation Dataset

The data that we produce will be two-dimensional CT slices with multiple channels. The extra channels will hold adjacent slices of the CT. Each slice of a CT scan can be thought of as a 2D grayscale image.

How we combine those slices is up to us. For the input to our classification model, we treated those slices as a 3D array of data and used 3D convolutions to process each sample. For our segmentation model, we are going to instead treat each slice as a single channel, and produce a multichannel 2D image. Doing so will mean that we are treating each slice of the CT scan as if it were a color channel of an RGB image.
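
A minimal sketch of building such a multichannel sample is shown below. The helper name, the default number of context slices, and the choice to repeat the first or last slice at the volume boundary are all assumptions made for the illustration.

```python
import numpy as np

def slice_with_context(ct, slice_index, context_slices=3):
    """Return a (2 * context_slices + 1, H, W) array centered on slice_index."""
    channels = []
    for offset in range(-context_slices, context_slices + 1):
        # Clamp at the volume boundary so edge slices are repeated, not skipped.
        idx = min(max(slice_index + offset, 0), ct.shape[0] - 1)
        channels.append(ct[idx])
    return np.stack(channels)

# Toy CT volume with 10 slices of 4x4 pixels (a real slice would be 512x512).
ct = np.arange(10 * 4 * 4, dtype=np.float32).reshape(10, 4, 4)
middle = slice_with_context(ct, slice_index=5, context_slices=2)  # slices 3..7
first = slice_with_context(ct, slice_index=0, context_slices=2)   # 0, 0, 0, 1, 2
```

The target slice always sits in the middle channel, so the model can learn to treat the surrounding channels as context.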



# Augmentation

Document new augmentation running on GPU.

# Segmentation training

We will update three things in the classification training code that affect the outcome:
- We need to instantiate the new model (unsurprisingly).
- We will introduce a new loss: the **Dice loss**.
- We will also look at an optimizer other than the venerable SGD we’ve used so far. We’ll stick with a popular one and use Adam.

But we will also step up our bookkeeping, by
- Logging images for visual inspection of the segmentation to TensorBoard
- Performing more metrics logging in TensorBoard
- Saving our best model based on the validation


## Adam Optimizer

- What is its relationship with loss weight and why can't it be used with SGD?
- Adam vs Stochastic gradient descent with Nesterov momentum (chapter 13.6.2)

## [Dice Loss](https://en.wikipedia.org/wiki/Dice-Sørensen_coefficient)

Also known as the Dice-Sørensen coefficient, it is a common loss metric for segmentation tasks. One advantage of using Dice loss over a per-pixel cross-entropy loss is that Dice handles the case where only a small portion of the overall image is flagged as positive.

<figure>
    <center>
        <img src="attachments/dice-loss.png"  style="width:750px;" >
    </center>
</figure>


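Definition of the Dice coefficient on two sets:
$$
Dice = \frac{2|X \cap Y|}{|X| + |Y|}
$$

Definition of the Dice coefficient on boolean data:
$$
Dice = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}
$$

The Dice loss is then $1 - Dice$, so a perfect overlap gives a loss of 0.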

> **Note:** This is a per-pixel F1 score where the “population” is one image’s pixels. Since the population is entirely contained within one training sample, we can use it for training directly. In the classification case, the F1 score is not calculable over a single minibatch, and, hence, we cannot use it for training directly.
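
The two definitions can be compared on a toy example. The sketch below is a NumPy version of a per-sample soft Dice loss; the `epsilon` smoothing term (which avoids division by zero when both masks are empty) and the function name are assumptions rather than something stated above.

```python
import numpy as np

def dice_loss(prediction, label, epsilon=1.0):
    """Per-sample Dice loss for arrays shaped (N, C, H, W)."""
    axes = (1, 2, 3)
    label_sum = label.sum(axis=axes)                # |Y|
    prediction_sum = prediction.sum(axis=axes)      # |X|
    overlap = (prediction * label).sum(axis=axes)   # |X ∩ Y|
    dice = (2 * overlap + epsilon) / (prediction_sum + label_sum + epsilon)
    return 1 - dice

label = np.zeros((1, 1, 4, 4))
label[0, 0, 1:3, 1:3] = 1.0        # 2x2 ground-truth nodule
prediction = np.zeros((1, 1, 4, 4))
prediction[0, 0, 1:3, 2:4] = 1.0   # same size, shifted one column to the right

loss = dice_loss(prediction, label, epsilon=0.0)

# The same Dice value from true positives, false positives and false negatives.
tp = int(((prediction == 1) & (label == 1)).sum())
fp = int(((prediction == 1) & (label == 0)).sum())
fn = int(((prediction == 0) & (label == 1)).sum())
dice_from_counts = 2 * tp / (2 * tp + fp + fn)
```

Because the prediction may hold probabilities rather than hard 0/1 values, the product `prediction * label` acts as a soft intersection, which keeps the loss differentiable.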