<br>

## Visual Exploration of Semantic Segmentation Metrics

---

#### <b><font color="red">This is a work in progress.<br><br>I plan to finish it by the weekend.<br><br>I'm making it public while I work on it though.<br><br>Please bear with me while I finish it out. Thanks!</font></b>

<br>

#### **IMPORTS**

---

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import cv2
import os

<br>

#### **DEFINE OUR INITIAL GROUND TRUTH MASK AND PREDICTION MASK**

---

In [None]:
DEMO_SHAPE = (10,10)

ground_truth_mask = np.zeros(DEMO_SHAPE, dtype=np.float32)
for i in range(10): ground_truth_mask[i:i+2, i:i+2] = 1.0
for i in range(4,7): ground_truth_mask[i, i:i+1] = 0.0
ground_truth_rgb_mask = np.stack([ground_truth_mask, ground_truth_mask[::-1], np.roll(ground_truth_mask, shift=3)], axis=-1)

random_pred_mask = np.random.randint(low=0, high=2, size=DEMO_SHAPE).astype(np.float32)
random_pred_rgb_mask = np.random.randint(low=0, high=2, size=(*DEMO_SHAPE, 3)).astype(np.float32)

<br>

#### **HELPER FUNCTIONS FOR PLOTTING**

---

We then use these helpers to plot our ground truth mask and randomly generated prediction mask

In [None]:
def grid_imshow(mask):
    ax = plt.gca()
    plt.imshow(mask, cmap="gray" if len(mask.shape)==2 else None, interpolation="none", vmin=0, vmax=1, aspect="equal")

    # Major ticks
    ax.set_xticks(np.arange(0, DEMO_SHAPE[1], 1))
    ax.set_yticks(np.arange(0, DEMO_SHAPE[0], 1))

    # Labels for major ticks
    ax.set_xticklabels(np.arange(1, DEMO_SHAPE[1]+1, 1))
    ax.set_yticklabels(np.arange(1, DEMO_SHAPE[0]+1, 1))

    # Minor ticks
    ax.set_xticks(np.arange(-.5, DEMO_SHAPE[1], 1), minor=True)
    ax.set_yticks(np.arange(-.5, DEMO_SHAPE[0], 1), minor=True)

    # Gridlines based on minor ticks
    ax.grid(which='minor', color='gray', linestyle='-', linewidth=2)
    
def compare_masks(gt_mask, pred_mask, gt_title="Ground Truth Mask", pred_title="Prediction Mask", _figshape=(20,10)):
    plt.figure(figsize=_figshape)

    plt.subplot(1,2,1)    
    grid_imshow(gt_mask)
    plt.title(gt_title, fontweight="bold")
    
    plt.subplot(1,2,2)
    grid_imshow(pred_mask)
    plt.title(pred_title, fontweight="bold")
    
    plt.tight_layout()
    plt.show()
    
compare_masks(ground_truth_mask, random_pred_mask)
compare_masks(ground_truth_rgb_mask, random_pred_rgb_mask, gt_title="Ground Truth RGB Mask", pred_title="Prediction RGB Mask")

<br>

#### **WHAT IS SEMANTIC SEGMENTATION AND WHAT DOES IT HAVE TO DO WITH THESE GRIDS**

---

In simplest terms, **semantic segmentation is simply per-pixel classification.** This per-pixel classification can take forms similar to what you've seen/heard about in regular machine learning classification problems:
* Binary Semantic Segmentation ***(Binary Classification)***
* Multi-Class Semantic Segmentation ***(Multiclass Classification)***
* Multi-Label Semantic Segmentation ***(Multilabel Classification)***
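
To make the three flavours above concrete, here is a small sketch of what the target arrays might look like for a 10x10 image. The sizes and the class count of 3 are illustrative assumptions, not something fixed by this notebook.

```python
import numpy as np

H, W, NUM_CLASSES = 10, 10, 3  # illustrative sizes (assumed)
rng = np.random.default_rng(0)

# Binary: one 0/1 value per pixel
binary_target = rng.integers(0, 2, size=(H, W))

# Multi-class: exactly one class id per pixel (classes are mutually exclusive)
multiclass_target = rng.integers(0, NUM_CLASSES, size=(H, W))

# Multi-label: an independent 0/1 value per pixel for every channel,
# so a single pixel may belong to several classes at once
multilabel_target = rng.integers(0, 2, size=(H, W, NUM_CLASSES))

print(binary_target.shape, multiclass_target.shape, multilabel_target.shape)
```
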

---

Knowing this, we will take the per-pixel definition back to visual first principles and investigate a case of **binary semantic segmentation**, where we try to predict whether each grid tile (read: pixel) is white or black.
* This is a much simpler version of an analogous task, such as predicting which pixels in a given image contain a cat

---

We will follow that investigation up with a second example where we will perform **multilabel** semantic segmentation on our 10x10 grid. We will predict the binary R, G and B values of each pixel in an attempt to match a ground truth RGB image.
* This is a much simpler version of the analogous task found in the UWM GI Tract Image Segmentation Competition


<br>

#### **LET'S LOOK AT ACCURACY! – BINARY SEGMENTATION VERSION**

---

Well... before we dive into the literature. Let's think about things. If we completely disregard the idea of segmentation as a whole and look at what we are really doing – **per-pixel binary classification** – it's not absurd to think that we could simply calculate the **per-pixel binary accuracy**.
* This is kind of like treating each pixel as a single example and then just averaging over the shape of the image

---

<br>

**I will illustrate this in the cell below in an inefficient, but verbose way to demonstrate what is happening**
* Note that because the prediction mask is random you will have some variability in accuracy.
* However, it typically lands somewhere between 45% and 60%.
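
As a compact cross-check of the verbose loop, the same per-pixel accuracy can be computed in one vectorised step. This is only a sketch; the helper name `pixel_accuracy` and the tiny 2x2 masks are made up for illustration.

```python
import numpy as np

def pixel_accuracy(gt_mask, pred_mask):
    """Fraction of pixels where the prediction equals the ground truth."""
    gt_mask = np.asarray(gt_mask)
    pred_mask = np.asarray(pred_mask)
    # Element-wise comparison gives a boolean array; its mean is the accuracy
    return float(np.mean(gt_mask == pred_mask))

# Hypothetical 2x2 example: two of the four pixels agree
gt = np.array([[1, 0], [0, 1]], dtype=np.float32)
pred = np.array([[1, 1], [0, 0]], dtype=np.float32)
print(pixel_accuracy(gt, pred))
```
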

In [None]:
compare_masks(ground_truth_mask, random_pred_mask)

correct_pred_indices = []
incorrect_pred_indices = []
for i in range(DEMO_SHAPE[0]): # loop row by row
    for j in range(DEMO_SHAPE[1]): # walk the row one cell at a time
        gt_pixel_val = ground_truth_mask[i, j]
        pred_pixel_val = random_pred_mask[i, j]
        if gt_pixel_val==pred_pixel_val:
            correct_pred_indices.append((i,j))
        else:
            incorrect_pred_indices.append((i,j))

binary_pixel_classification_accuracy = len(correct_pred_indices)/(DEMO_SHAPE[0]*DEMO_SHAPE[1])

viz_error_mask = np.zeros((*DEMO_SHAPE, 3), dtype=np.float32)
viz_error_mask[np.array(correct_pred_indices)[:, 0], np.array(correct_pred_indices)[:, 1], 1] = 1.0
viz_error_mask[np.array(incorrect_pred_indices)[:, 0], np.array(incorrect_pred_indices)[:, 1], 0] = 1.0

print(f"\n\n\n... BINARY ACCURACY = {100*binary_pixel_classification_accuracy:.2f} ...\n")
compare_masks(ground_truth_mask, viz_error_mask, pred_title="Error Visualization\nGreen=Agreement\nRed=Disagreement")

<br>

#### **WHAT'S WRONG WITH WHAT WE JUST DID?**

---

There's nothing inherently wrong with what we just did... however, let's imagine a different ground truth case. One where only a few pixels are white. Now let's see how we do. 

In [None]:
ground_truth_mask_2 = np.zeros(DEMO_SHAPE, dtype=np.float32)
ground_truth_mask_2[0,0] = 1.0
ground_truth_mask_2[0,DEMO_SHAPE[1]-1] = 1.0
ground_truth_mask_2[DEMO_SHAPE[0]-1,0] = 1.0
ground_truth_mask_2[DEMO_SHAPE[0]-1,DEMO_SHAPE[1]-1] = 1.0

compare_masks(ground_truth_mask_2, random_pred_mask)

correct_pred_indices = []
incorrect_pred_indices = []
for i in range(DEMO_SHAPE[0]): # loop row by row
    for j in range(DEMO_SHAPE[1]): # walk the row one cell at a time
        gt_pixel_val = ground_truth_mask_2[i, j]
        pred_pixel_val = random_pred_mask[i, j]
        if gt_pixel_val==pred_pixel_val:
            correct_pred_indices.append((i,j))
        else:
            incorrect_pred_indices.append((i,j))

binary_pixel_classification_accuracy = len(correct_pred_indices)/(DEMO_SHAPE[0]*DEMO_SHAPE[1])

viz_error_mask = np.zeros((*DEMO_SHAPE, 3), dtype=np.float32)
viz_error_mask[np.array(correct_pred_indices)[:, 0], np.array(correct_pred_indices)[:, 1], 1] = 1.0
viz_error_mask[np.array(incorrect_pred_indices)[:, 0], np.array(incorrect_pred_indices)[:, 1], 0] = 1.0

print(f"\n\n\n... BINARY ACCURACY = {100*binary_pixel_classification_accuracy:.2f} ...\n")
compare_masks(ground_truth_mask_2, viz_error_mask, pred_title="Error Visualization\nGreen=Agreement\nRed=Disagreement")

<br>

#### **OK... WE GOT AROUND THE SAME ACCURACY?**

---

See now I look silly. Now it looks like it didn't matter. However, let's further consider... that instead of predicting a random mask, we simply predict all zeros!
* This type of scenario is analogous to medical imagery where, often, much of the image IS NOT the area of interest.
* This results in a large amount of "background" (i.e. class=0) and a small (sometimes very small) amount of "foreground". 
* This can make models prone to simply outputting all zeros as a prediction (and that risk is real if we frame the problem as plain binary classification without any class weighting).

In [None]:
all_black_pred = np.zeros_like(ground_truth_mask_2)
compare_masks(ground_truth_mask_2, all_black_pred)

correct_pred_indices = []
incorrect_pred_indices = []
for i in range(DEMO_SHAPE[0]): # loop row by row
    for j in range(DEMO_SHAPE[1]): # walk the row one cell at a time
        gt_pixel_val = ground_truth_mask_2[i, j]
        pred_pixel_val = all_black_pred[i, j]
        if gt_pixel_val==pred_pixel_val:
            correct_pred_indices.append((i,j))
        else:
            incorrect_pred_indices.append((i,j))

binary_pixel_classification_accuracy = len(correct_pred_indices)/(DEMO_SHAPE[0]*DEMO_SHAPE[1])

viz_error_mask = np.zeros((*DEMO_SHAPE, 3), dtype=np.float32)
viz_error_mask[np.array(correct_pred_indices)[:, 0], np.array(correct_pred_indices)[:, 1], 1] = 1.0
viz_error_mask[np.array(incorrect_pred_indices)[:, 0], np.array(incorrect_pred_indices)[:, 1], 0] = 1.0

print(f"\n\n\n... BINARY ACCURACY = {100*binary_pixel_classification_accuracy:.2f} ...\n")
compare_masks(ground_truth_mask_2, viz_error_mask, pred_title="Error Visualization\nGreen=Agreement\nRed=Disagreement")

<br>

#### **OHHH. I SEE!**

---

**BINARY ACCURACY DOESN'T TAKE INTO CONSIDERATION LARGE CLASS IMBALANCES!** This is the main reason why something as simple as binary accuracy doesn't make sense in cases where a background class is so dominant, as is usually the case for semantic segmentation.
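
Put as arithmetic: an all-zeros prediction gets every background pixel right and every foreground pixel wrong, so its accuracy is just one minus the foreground fraction. A minimal sketch, assuming a 10x10 mask with four foreground corner pixels like the one used above:

```python
import numpy as np

gt = np.zeros((10, 10), dtype=np.float32)
gt[[0, 0, -1, -1], [0, -1, 0, -1]] = 1.0  # four foreground corner pixels

all_zero_pred = np.zeros_like(gt)

# Accuracy of predicting "background" everywhere
accuracy = float(np.mean(gt == all_zero_pred))

# Fraction of pixels that are foreground in the ground truth
foreground_fraction = np.count_nonzero(gt) / gt.size

print(f"accuracy={accuracy:.2f}  1-foreground_fraction={1 - foreground_fraction:.2f}")
```

The rarer the foreground, the closer this "do nothing" accuracy gets to 100%.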

What now....

<br>

<center><img src="https://i.ibb.co/LhfQzxf/final-6261f0f883d46a00a328c606-592769.png"></center>

<br>



<br>

#### **LET'S LOOK AT PRECISION, RECALL, F1-SCORE, & AUC! BINARY SEGMENTATION VERSION**

---

Before I throw up (pun intended) all of the images and formulas to explain these metrics, we first have to understand these conceptual metrics:
* **True-Positives**
* **False-Positives**
* **True-Negatives**
* **False-Negatives**

To explain these terms visually, let's use our first ground truth mask and our first random prediction and generate examples of each. When we plot these, we will also offer a straightforward definition to help you understand.

---



<br>

#### **TRUE-POSITIVES**

---

We will use the image below in every section to illustrate (graphically) what we are trying to capture.

<center><img src="https://miro.medium.com/max/1336/1*uzJKEMrjHEv9DBAGNke3EQ.png" width="25%"></center>

<br>

**FROM THE ABOVE IMAGE WE CAN GIVE A BASIC DEFINITION**

A **True-Positive**, is **an outcome where the model CORRECTLY predicts the POSITIVE class**
* What the model predicts as the positive class is shown as the circle 
* What the actual ground truth positive classes are is shown as the entire left side.
* The points that are on the left side of the image inside the circle are **True-Positives** as they represent the "overlap" of positive class model predictions that match the ground truth values.

**LET'S SEE WHAT THIS LOOKS LIKE FOR OUR PREDICTION**
* For simplicity we will make **True-Positives** appear <b><font color="lightgreen">Light Green</font></b>
* For visualization purposes, all other cells will be displayed in **black**.
* As we understand more and more of the terminology, our visualization will get more colourful!

In [None]:
compare_masks(ground_truth_mask, random_pred_mask)

true_positive_indices = []
for i in range(DEMO_SHAPE[0]): # loop row by row
    for j in range(DEMO_SHAPE[1]): # walk the row one cell at a time
        gt_pixel_val = ground_truth_mask[i, j]
        pred_pixel_val = random_pred_mask[i, j]
        if (gt_pixel_val==pred_pixel_val) and (gt_pixel_val==1.0):
            true_positive_indices.append((i,j))

viz_mask = np.zeros((*DEMO_SHAPE, 3), dtype=np.float32)
viz_mask[np.array(true_positive_indices)[:, 0], np.array(true_positive_indices)[:, 1], 1] = 1.0

compare_masks(ground_truth_mask, viz_mask, pred_title="True Positive Visualization\nGreen=TP")

<br>

#### **TRUE-NEGATIVES**

---

We will use the image below in every section to illustrate (graphically) what we are trying to capture.

<center><img src="https://miro.medium.com/max/1336/1*uzJKEMrjHEv9DBAGNke3EQ.png" width="25%"></center>

<br>

**FROM THE ABOVE IMAGE WE CAN GIVE A BASIC DEFINITION**

A **True-Negative**, is **an outcome where the model CORRECTLY predicts the NEGATIVE class**
* What the model predicts as the negative class is shown as everything outside the circle 
* What the actual ground truth negative classes are is shown as the entire right side
* The points that are on the right side of the image outside the circle are **True-Negatives** as they represent the "overlap" of negative class model predictions that match the ground truth values.

**LET'S SEE WHAT THIS LOOKS LIKE FOR OUR PREDICTION**
* Remember that our **True-Positives** appear <b><font color="lightgreen">Light Green</font></b>
* For simplicity we will make **True-Negatives** appear <b><font color="darkgreen">Dark Green</font></b>
* For visualization purposes, all other cells will be displayed in **black**.
* As we understand more and more of the terminology, our visualization will get more colourful!

In [None]:
compare_masks(ground_truth_mask, random_pred_mask)

true_positive_indices = []
true_negative_indices = []
for i in range(DEMO_SHAPE[0]): # loop row by row
    for j in range(DEMO_SHAPE[1]): # walk the row one cell at a time
        gt_pixel_val = ground_truth_mask[i, j]
        pred_pixel_val = random_pred_mask[i, j]
        if gt_pixel_val==pred_pixel_val:
            if gt_pixel_val==1.0:
                true_positive_indices.append((i,j))
            else:
                true_negative_indices.append((i,j))

viz_mask = np.zeros((*DEMO_SHAPE, 3), dtype=np.float32)
viz_mask[np.array(true_positive_indices)[:, 0], np.array(true_positive_indices)[:, 1], 1] = 1.0
viz_mask[np.array(true_negative_indices)[:, 0], np.array(true_negative_indices)[:, 1], 1] = 0.333

compare_masks(ground_truth_mask, viz_mask, pred_title="True Positive/Negative Visualization\nLight-Green=TP\nDark-Green=TN")

<br>

#### **FALSE-POSITIVES**

---

We will use the image below in every section to illustrate (graphically) what we are trying to capture.

<center><img src="https://miro.medium.com/max/1336/1*uzJKEMrjHEv9DBAGNke3EQ.png" width="25%"></center>

<br>

**FROM THE ABOVE IMAGE WE CAN GIVE A BASIC DEFINITION**

A **False-Positive**, is **an outcome where the model INCORRECTLY predicts the POSITIVE class**
* What the model predicts as the positive class is shown as everything inside the circle 
* What the actual ground truth positive classes are is shown as the entire left side
* The points that are on the right side of the image inside the circle are **False-Positives** as they are the positive class model predictions that DO NOT match the ground truth values.

**LET'S SEE WHAT THIS LOOKS LIKE FOR OUR PREDICTION**
* Remember that our **True-Positives** appear <b><font color="lightgreen">Light Green</font></b>
* Remember that our **True-Negatives** appear <b><font color="darkgreen">Dark Green</font></b>
* For simplicity we will make **False-Positives** appear <b><font color="red">Light Red</font></b>
* For visualization purposes, all other cells will be displayed in **black**.
* As we understand more and more of the terminology, our visualization will get more colourful!

In [None]:
compare_masks(ground_truth_mask, random_pred_mask)

true_positive_indices = []
true_negative_indices = []
false_positive_indices = []
for i in range(DEMO_SHAPE[0]): # loop row by row
    for j in range(DEMO_SHAPE[1]): # walk the row one cell at a time
        gt_pixel_val = ground_truth_mask[i, j]
        pred_pixel_val = random_pred_mask[i, j]
        if gt_pixel_val==pred_pixel_val:
            if gt_pixel_val==1.0:
                true_positive_indices.append((i,j))
            else:
                true_negative_indices.append((i,j))
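        else:
            # Prediction disagrees with the ground truth: it is a false positive
            # only when the model predicted the positive class
            if pred_pixel_val==1.0:
                false_positive_indices.append((i,j))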

viz_mask = np.zeros((*DEMO_SHAPE, 3), dtype=np.float32)
viz_mask[np.array(true_positive_indices)[:, 0], np.array(true_positive_indices)[:, 1], 1] = 1.0
viz_mask[np.array(true_negative_indices)[:, 0], np.array(true_negative_indices)[:, 1], 1] = 0.333
viz_mask[np.array(false_positive_indices)[:, 0], np.array(false_positive_indices)[:, 1], 0] = 1.0

compare_masks(ground_truth_mask, viz_mask, pred_title="TP, FP, & TN Visualization\nLight-Green=TP\nDark-Green=TN\nLight-Red=FP")

<br>

#### **FALSE-NEGATIVES**

---

We will use the image below in every section to illustrate (graphically) what we are trying to capture.

<center><img src="https://miro.medium.com/max/1336/1*uzJKEMrjHEv9DBAGNke3EQ.png" width="25%"></center>

<br>

**FROM THE ABOVE IMAGE WE CAN GIVE A BASIC DEFINITION**

A **False-Negative** is **an outcome where the model INCORRECTLY predicts the NEGATIVE class**
* What the model predicts as the negative class is shown as everything outside the circle 
* What the actual ground truth negative classes are is shown as the entire right side
* The points that are on the left side of the image outside the circle are **False-Negatives** as they are the negative class model predictions that DO NOT match the ground truth values.

**LET'S SEE WHAT THIS LOOKS LIKE FOR OUR PREDICTION**
* Remember that our **True-Positives** appear <b><font color="lightgreen">Light Green</font></b>
* Remember that our **True-Negatives** appear <b><font color="darkgreen">Dark Green</font></b>
* Remember that our **False-Positives** appear <b><font color="red">Light Red</font></b>
* For simplicity we will make **False-Negatives** appear <b><font color="darkred">Dark Red</font></b>
* We are now at maximum COLORFULNESS and no cells are displayed as black

In [None]:
compare_masks(ground_truth_mask, random_pred_mask)

true_positive_indices = []
true_negative_indices = []
false_positive_indices = []
false_negative_indices = []
for i in range(DEMO_SHAPE[0]): # loop row by row
    for j in range(DEMO_SHAPE[1]): # walk the row one cell at a time
        gt_pixel_val = ground_truth_mask[i, j]
        pred_pixel_val = random_pred_mask[i, j]
        if gt_pixel_val==pred_pixel_val:
            if gt_pixel_val==1.0:
                true_positive_indices.append((i,j))
            else:
                true_negative_indices.append((i,j))
        else:
            if pred_pixel_val==1.0:
                # predicted positive, ground truth negative
                false_positive_indices.append((i,j))
            else:
                # predicted negative, ground truth positive
                false_negative_indices.append((i,j))
viz_mask = np.zeros((*DEMO_SHAPE, 3), dtype=np.float32)
viz_mask[np.array(true_positive_indices)[:, 0], np.array(true_positive_indices)[:, 1], 1] = 1.0
viz_mask[np.array(true_negative_indices)[:, 0], np.array(true_negative_indices)[:, 1], 1] = 0.333
viz_mask[np.array(false_positive_indices)[:, 0], np.array(false_positive_indices)[:, 1], 0] = 1.0
viz_mask[np.array(false_negative_indices)[:, 0], np.array(false_negative_indices)[:, 1], 0] = 0.333

compare_masks(ground_truth_mask, viz_mask, pred_title="TP, FP, TN, & FN Visualization\nLight-Green=TP\nDark-Green=TN\nLight-Red=FP\nDark-Red=FN")

<br>

#### **LET'S REVIEW FOR A MINUTE**

---

We can now see in the above image a colourful representation of not only how well we did (Green v. Red), but we can also see how well we did w.r.t. both the foreground and the background classes. 
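
The four categories can also be counted without any loops by combining boolean masks. This is a sketch rather than the notebook's own method; `confusion_counts` and the tiny example arrays are hypothetical names and data chosen for illustration.

```python
import numpy as np

def confusion_counts(gt_mask, pred_mask):
    """Count TP, FP, TN and FN pixels for a pair of binary masks."""
    gt = np.asarray(gt_mask).astype(bool)
    pred = np.asarray(pred_mask).astype(bool)
    return {
        "tp": int(np.sum(gt & pred)),    # predicted positive, truly positive
        "fp": int(np.sum(~gt & pred)),   # predicted positive, truly negative
        "tn": int(np.sum(~gt & ~pred)),  # predicted negative, truly negative
        "fn": int(np.sum(gt & ~pred)),   # predicted negative, truly positive
    }

# Hypothetical 2x3 example
gt = np.array([[1, 1, 0], [0, 0, 0]])
pred = np.array([[1, 0, 1], [0, 0, 1]])
counts = confusion_counts(gt, pred)
print(counts)
```

The four counts always sum to the number of pixels, which is a handy sanity check.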

<br>

**This gives us some more options!**

<br>

**RECALL** (Also known as *sensitivity* or *true-positive rate*)
* We could identify the accuracy w.r.t. the ground truth foreground! i.e. What percentage of all ground truth positives did we manage to guess correctly?
    * This would be represented as the formula: 
$$
\dfrac{TP}{TP+FN}
$$
    
<br>

**PRECISION**
* We could identify the accuracy w.r.t. the predicted foreground! i.e. What percentage of all of the predicted positives did we manage to guess correctly?
    * This would be represented as the formula:
$$
\dfrac{TP}{TP+FP}
$$
    
<br>

**There are other metrics, but for the most part this should cover the basics of what we need**
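
As a small worked example of the two formulas, here is a sketch that turns raw counts into precision and recall. Returning 0.0 when a denominator is zero is a convention assumed here for illustration, and the counts are made up.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    # Guard against empty denominators (assumed convention: report 0.0)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Hypothetical counts: 1 true positive, 2 false positives, 1 false negative
precision, recall = precision_recall(tp=1, fp=2, fn=1)
print(f"precision={precision:.3f}  recall={recall:.3f}")
```
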

<br>

---

<br>

We could then take things even further by combining some of these metrics to get a better understanding of how well we predicted everything (foreground, background, etc.). This leads us to a real strong candidate for a satisfactory metric... the F1 Score!

<br>

**F1 SCORE**
* One simple approach would be to take the ***harmonic mean*** of precision and recall. 
    * NOTE: We use the ***harmonic mean*** because it penalizes extreme values (all the bg wrong, etc.)
    * This would be represented as the formula: <br><br>
$$
2 \times \dfrac{PRECISION \times RECALL}{PRECISION + RECALL}
$$
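
To see why the harmonic mean is the safer choice, compare it with a plain arithmetic mean on a lopsided precision/recall pair. The numbers below are invented for illustration, and the zero-denominator guard is an assumed convention.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both are zero
    return 2 * precision * recall / (precision + recall)

# Balanced case: both means agree
balanced_f1 = f1_from_precision_recall(0.5, 0.5)

# Lopsided case: perfect precision but almost no recall
lopsided_f1 = f1_from_precision_recall(1.0, 0.01)
lopsided_arithmetic = (1.0 + 0.01) / 2

print(f"balanced F1={balanced_f1:.3f}")
print(f"lopsided F1={lopsided_f1:.3f} vs arithmetic mean={lopsided_arithmetic:.3f}")
```

The arithmetic mean still reports roughly 0.5 for the lopsided case, while the harmonic mean stays close to the weaker of the two values.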
    
---

**BRIEF ASIDE ON DIFFERENT TYPES OF MEANS...**

<center><img src="https://miro.medium.com/max/1086/1*WUYsiOqd1UtBoMf1UcSsMg.gif"></center>

---


<br>

#### **LET'S MAKE SURE WE CAN'T BE TRICKED LIKE BEFORE!**

---

Let's calculate the F1-Score for the previous example and see how it compares.
* Let's first validate that our calculated F1-Score yields the same as an off the shelf tool
    * We will see below that the sklearn.metrics.f1_score works just as well as our tool to calculate the f1-score... as a result we will use that when we simply need to calculate F1-Score

In [None]:
from sklearn.metrics import f1_score

true_positive_indices = []
true_negative_indices = []
false_positive_indices = []
false_negative_indices = []

compare_masks(ground_truth_mask, random_pred_mask)

for i in range(DEMO_SHAPE[0]): # loop row by row
    for j in range(DEMO_SHAPE[1]): # walk the row one cell at a time
        gt_pixel_val = ground_truth_mask[i, j]
        pred_pixel_val = random_pred_mask[i, j]
        if gt_pixel_val==pred_pixel_val:
            if gt_pixel_val==1.0:
                true_positive_indices.append((i,j))
            else:
                true_negative_indices.append((i,j))
        else:
            if pred_pixel_val==1.0:
                # predicted positive, ground truth negative
                false_positive_indices.append((i,j))
            else:
                # predicted negative, ground truth positive
                false_negative_indices.append((i,j))

viz_mask = np.zeros((*DEMO_SHAPE, 3), dtype=np.float32)
viz_mask[np.array(true_positive_indices)[:, 0], np.array(true_positive_indices)[:, 1], 1] = 1.0
viz_mask[np.array(true_negative_indices)[:, 0], np.array(true_negative_indices)[:, 1], 1] = 0.333
viz_mask[np.array(false_positive_indices)[:, 0], np.array(false_positive_indices)[:, 1], 0] = 1.0
viz_mask[np.array(false_negative_indices)[:, 0], np.array(false_negative_indices)[:, 1], 0] = 0.333

precision = len(true_positive_indices)/(len(true_positive_indices)+len(false_positive_indices))
recall    = len(true_positive_indices)/(len(true_positive_indices)+len(false_negative_indices))

f1_score_ours = (2*precision*recall)/(precision+recall)
sk_f1_score   = f1_score(ground_truth_mask.flatten(), random_pred_mask.flatten())

compare_masks(ground_truth_mask, viz_mask, pred_title=f"F1 SCORE VISUALIZATION PLOT\n        OUR F1 SCORE: {f1_score_ours}\nSKLEARN F1 SCORE: {sk_f1_score}")

print("\n\n\n... WE CAN SEE THAT OUR F1 SCORE AND THE SKLEARN F1 SCORE ARE THE SAME ...\n\n")
print("\n... LET'S NOW USE THAT TOOL TO COMPUTE THE F1-SCORES FOR OUR RANDOM PRED, A BLACK PRED, AND A WHITE PRED")
print(f"\t... RANDOM PREDICTION F1-SCORE: {f1_score(ground_truth_mask.flatten(), random_pred_mask.flatten()):.4f}")
print(f"\t... ALL ZEROS (BLACK) F1-SCORE: {f1_score(ground_truth_mask.flatten(), np.zeros(DEMO_SHAPE).flatten()):.4f}")
print(f"\t... ALL ONES  (WHITE) F1-SCORE: {f1_score(ground_truth_mask.flatten(), np.ones(DEMO_SHAPE).flatten()):.4f}")

print("\n\n\n... NOW LET'S CHECK THE SAME FOR OUR MOSTLY BLACK IMAGE ...\n\n")

true_positive_indices = []
true_negative_indices = []
false_positive_indices = []
false_negative_indices = []

compare_masks(ground_truth_mask_2, random_pred_mask)

for i in range(DEMO_SHAPE[0]): # loop row by row
    for j in range(DEMO_SHAPE[1]): # walk the row one cell at a time
        gt_pixel_val = ground_truth_mask_2[i, j]
        pred_pixel_val = random_pred_mask[i, j]
        if gt_pixel_val==pred_pixel_val:
            if gt_pixel_val==1.0:
                true_positive_indices.append((i,j))
            else:
                true_negative_indices.append((i,j))
        else:
            if pred_pixel_val==1.0:
                # predicted positive, ground truth negative
                false_positive_indices.append((i,j))
            else:
                # predicted negative, ground truth positive
                false_negative_indices.append((i,j))

viz_mask = np.zeros((*DEMO_SHAPE, 3), dtype=np.float32)
viz_mask[np.array(true_positive_indices)[:, 0], np.array(true_positive_indices)[:, 1], 1] = 1.0
viz_mask[np.array(true_negative_indices)[:, 0], np.array(true_negative_indices)[:, 1], 1] = 0.333
viz_mask[np.array(false_positive_indices)[:, 0], np.array(false_positive_indices)[:, 1], 0] = 1.0
viz_mask[np.array(false_negative_indices)[:, 0], np.array(false_negative_indices)[:, 1], 0] = 0.333

compare_masks(ground_truth_mask_2, viz_mask, pred_title=f"F1 SCORE VISUALIZATION PLOT\n")

print("\n... LET'S NOW USE THAT TOOL TO COMPUTE THE F1-SCORES FOR OUR RANDOM PRED, A BLACK PRED, AND A WHITE PRED")
print(f"\t... RANDOM PREDICTION F1-SCORE: {f1_score(ground_truth_mask_2.flatten(), random_pred_mask.flatten()):.4f}")
print(f"\t... ALL ZEROS (BLACK) F1-SCORE: {f1_score(ground_truth_mask_2.flatten(), np.zeros(DEMO_SHAPE).flatten()):.4f}")
print(f"\t... ALL ONES  (WHITE) F1-SCORE: {f1_score(ground_truth_mask_2.flatten(), np.ones(DEMO_SHAPE).flatten()):.4f}")


<br>

#### **WHILE THAT'S BETTER! OUR RANDOM SCORE SHOULD BE QUITE SIMILAR TO OUR TRICK SCORES.**

---

This seems good enough right!? For now we will leave AUC alone. If I have time to come back to it I will update this notebook accordingly.

Let's try the same experiment as previously but with a **multi-label** prediction

<br>

#### **HOW DOES THIS ALL WORK FOR MULTILABEL CLASSIFICATION?**

---

Honestly, this is pretty straightforward. We will simply treat the problem the same as we treated the binary classification problem, for each respective channel. Then we will take the average across the channels!


In [None]:
compare_masks(ground_truth_rgb_mask, random_pred_rgb_mask, 
              gt_title="Ground Truth RGB Mask", pred_title="Prediction RGB Mask")

print("\n\n... THIS RGB IMAGE COMPARISON BECOMES THREE INDIVIDUAL COMPARISONS ...\n\n")

for i in range(3):
    compare_masks(ground_truth_rgb_mask[..., i], random_pred_rgb_mask[..., i], 
                  gt_title=f"Ground Truth {'RGB'[i]} Mask", pred_title=f"Prediction {'RGB'[i]} Mask")

<br>

#### **LET'S CALCULATE THE INDIVIDUAL F1-SCORES FOR OUR MULTILABEL TASK**

---

But wait... does this make sense? 
* Imagine a channel where we only have one or two positive examples? 
* What about a channel where positives occur very frequently?

This is why we have to take into consideration whether or not directly averaging across the classes makes sense. We have two options for how to deal with this:

1. **MACRO AVERAGING**
* Calculate metrics for each label, and find their unweighted mean. 
* This does not take label imbalance into account.

2. **MICRO AVERAGING**
* Calculate metrics **globally** by counting the total true positives, false negatives and false positives.
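
A tiny example shows how far the two averages can drift apart when one label is frequent and another is rare. This is a numpy-only sketch; the helper names and the 4-pixel, 2-channel data are hypothetical.

```python
import numpy as np

def f1_from_counts(tp, fp, fn):
    """F1 written directly in terms of counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0  # assumed zero convention

def macro_micro_f1(gt, pred):
    """gt and pred are binary arrays whose last axis indexes the labels/channels."""
    gt = np.asarray(gt).astype(bool)
    pred = np.asarray(pred).astype(bool)
    gt = gt.reshape(-1, gt.shape[-1])
    pred = pred.reshape(-1, pred.shape[-1])
    # Per-channel counts
    tp = np.sum(gt & pred, axis=0)
    fp = np.sum(~gt & pred, axis=0)
    fn = np.sum(gt & ~pred, axis=0)
    # Macro: score each channel, then take the unweighted mean
    macro = float(np.mean([f1_from_counts(t, p, n) for t, p, n in zip(tp, fp, fn)]))
    # Micro: pool the counts over all channels, then score once
    micro = f1_from_counts(tp.sum(), fp.sum(), fn.sum())
    return macro, micro

# Channel 0 is frequent and predicted perfectly; channel 1 is rare and missed
gt = np.array([[1, 1], [1, 0], [1, 0], [1, 0]])
pred = np.array([[1, 0], [1, 0], [1, 0], [1, 0]])
macro, micro = macro_micro_f1(gt, pred)
print(f"macro={macro:.3f}  micro={micro:.3f}")
```

Macro averaging lets the missed rare channel pull the score down to the midpoint, whereas micro averaging is dominated by the frequent channel.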

---

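The **DICE COEFFICIENT** is simply the **MICRO F1-SCORE** computed over every pixel of every channel. Note that `scipy.spatial.distance.dice` returns the Dice *dissimilarity*, which is 1 minus the Dice coefficient, so we subtract it from 1 to recover the coefficient.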

In [None]:
from scipy.spatial.distance import dice

print(f"MACRO SCORE  : {f1_score(ground_truth_rgb_mask.reshape(-1, 3), random_pred_rgb_mask.reshape(-1, 3), average='macro')}")
print(f"MICRO SCORE  : {f1_score(ground_truth_rgb_mask.reshape(-1, 3), random_pred_rgb_mask.reshape(-1, 3), average='micro')}")

print(f"\nDICE COEFFICIENT (1 - DISSIMILARITY) : {1-dice(ground_truth_rgb_mask.flatten(), random_pred_rgb_mask.flatten())}")
print(f"DICE DISSIMILARITY (SCIPY)           : {dice(ground_truth_rgb_mask.flatten(), random_pred_rgb_mask.flatten())}")

<br>

#### **WE FINALLY MADE IT TO THE DICE SCORE! ONE OF THE COMPETITION METRICS**

---

<br>
    
**Simply put, the Dice Coefficient is 2 * the Area of Overlap divided by the total number of foreground pixels in both masks (ground truth plus prediction).**

<br>

**Essentially... the F1-SCORE**
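
The equivalence can be checked numerically. The sketch below computes the Dice coefficient from overlap and mask sizes, and the F1-score from TP/FP/FN counts, on a made-up pair of masks. The function names are hypothetical, and returning 1.0 for two empty masks is an assumed convention.

```python
import numpy as np

def dice_coefficient(gt_mask, pred_mask):
    """2 * |overlap| / (|gt| + |pred|) on binary masks."""
    gt = np.asarray(gt_mask).astype(bool)
    pred = np.asarray(pred_mask).astype(bool)
    total = gt.sum() + pred.sum()
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(2 * np.sum(gt & pred) / total)

def f1_from_masks(gt_mask, pred_mask):
    """F1 written with counts: 2TP / (2TP + FP + FN)."""
    gt = np.asarray(gt_mask).astype(bool)
    pred = np.asarray(pred_mask).astype(bool)
    tp = np.sum(gt & pred)
    fp = np.sum(~gt & pred)
    fn = np.sum(gt & ~pred)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 1.0

gt = np.array([1, 1, 0, 0, 0, 0])
pred = np.array([1, 0, 1, 0, 0, 1])
dice_value = dice_coefficient(gt, pred)
f1_value = f1_from_masks(gt, pred)
print(dice_value, f1_value)
```
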

<br>

<center><img src="https://miro.medium.com/max/858/1*yUd5ckecHjWZf6hGrdlwzA.png"></center>

<br>

---




<br>

#### **LET'S TAKE AN ASIDE AND DISCUSS AN ALTERNATIVE... THE JACCARD INDEX (IOU)**

---

<br>

**Simply put, the Jaccard Index (IoU) is the area of overlap between the predicted segmentation and the ground truth divided by the area of union between the predicted segmentation and the ground truth.**<br><br>

<center><img src="https://miro.medium.com/max/600/0*kraYHnYpoJOhaMzq.png"></center>

<br>

Compared with the Dice coefficient, both parts of the fraction change: the numerator is the area of overlap counted once ($TP$) instead of twice ($2 \times TP$), and the denominator changes from the sum of both mask areas ($2 \times TP + FP + FN$) to the AREA-OF-UNION ($TP + FP + FN$). The difference is that the sum of both mask areas counts the overlap TWICE, whereas AREA-OF-UNION counts the intersection pixels only once.
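
In terms of pixel counts, IoU is $TP / (TP + FP + FN)$, and it is tied to the Dice coefficient by $Dice = 2 \times IoU / (1 + IoU)$. A minimal numpy sketch on invented masks (the function name and the empty-mask convention are assumptions):

```python
import numpy as np

def jaccard_index(gt_mask, pred_mask):
    """|intersection| / |union| on binary masks."""
    gt = np.asarray(gt_mask).astype(bool)
    pred = np.asarray(pred_mask).astype(bool)
    union = np.sum(gt | pred)
    if union == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(np.sum(gt & pred) / union)

gt = np.array([1, 1, 0, 0, 0, 0])
pred = np.array([1, 0, 1, 0, 0, 1])

iou = jaccard_index(gt, pred)       # 1 overlapping pixel, 4 pixels in the union
dice_from_iou = 2 * iou / (1 + iou)  # convert IoU to Dice
print(iou, dice_from_iou)
```

Because the conversion is monotonic, the two metrics always rank a pair of predictions the same way on a single image; they differ in how harshly they score partial overlap.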

<br>

In [None]:
from sklearn.metrics import jaccard_score

print(f"DICE COEFFICIENT   : {1-dice(ground_truth_rgb_mask.flatten(), random_pred_rgb_mask.flatten())}")
print(f"JACCARD SCORE      : {jaccard_score(ground_truth_rgb_mask.flatten(), random_pred_rgb_mask.flatten())}")

<br>

#### **FINALLY LET'S TAKE A LOOK AT HAUSDORFF DISTANCE**

---

<br>


In [None]:
from skimage.metrics import hausdorff_distance
from scipy.ndimage import distance_transform_edt, distance_transform_cdt
!pip install -q monai
import monai

ground_truth_3d_rgb_mask = np.stack([ground_truth_rgb_mask, ground_truth_rgb_mask, ground_truth_rgb_mask, ground_truth_rgb_mask], axis=0)
random_pred_3d_mask = np.stack([random_pred_rgb_mask, random_pred_rgb_mask, random_pred_rgb_mask, random_pred_rgb_mask], axis=0)
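# The surface-distance helper inverts and indexes its inputs as boolean masks
# (see the copy of get_surface_distance in the next cell), so cast to bool rather than uint8
monai.metrics.compute_percent_hausdorff_distance(ground_truth_3d_rgb_mask.astype(bool), random_pred_3d_mask.astype(bool))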

In [None]:
def get_surface_distance(seg_pred, seg_gt, distance_metric="euclidean"):
    """
    This function is used to compute the surface distances from `seg_pred` to `seg_gt`.

    Args:
        seg_pred: the edge of the predictions.
        seg_gt: the edge of the ground truth.
        distance_metric: : [``"euclidean"``, ``"chessboard"``, ``"taxicab"``]
            the metric used to compute surface distance. Defaults to ``"euclidean"``.

            - ``"euclidean"``, uses Exact Euclidean distance transform.
            - ``"chessboard"``, uses `chessboard` metric in chamfer type of transform.
            - ``"taxicab"``, uses `taxicab` metric in chamfer type of transform.

    Note:
        If seg_pred or seg_gt is all 0, may result in nan/inf distance.

    """

    # If the ground truth is empty, every distance is infinite
    if not np.any(seg_gt):
        dis = np.inf * np.ones_like(seg_gt)
    else:
        if not np.any(seg_pred):
            dis = np.inf * np.ones_like(seg_gt)
            return np.asarray(dis[seg_gt])
        if distance_metric == "euclidean":
            dis = distance_transform_edt(~seg_gt)
        elif distance_metric in {"chessboard", "taxicab"}:
            dis = distance_transform_cdt(~seg_gt, metric=distance_metric)
        else:
            raise ValueError(f"distance_metric {distance_metric} is not implemented.")

    return np.asarray(dis[seg_pred])
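
The helper above returns, for every predicted pixel, the distance to the nearest ground truth pixel; the Hausdorff distance is the largest of those distances, taken in both directions. The sketch below is a brute-force numpy version for small grids. It works on full masks rather than extracted edges, which is a simplifying assumption, and the function names and the infinite value for empty masks are choices made here for illustration.

```python
import numpy as np

def directed_hausdorff(from_mask, to_mask):
    """Largest distance from any pixel in from_mask to its nearest pixel in to_mask."""
    from_pts = np.argwhere(np.asarray(from_mask).astype(bool))
    to_pts = np.argwhere(np.asarray(to_mask).astype(bool))
    if len(from_pts) == 0 or len(to_pts) == 0:
        return float("inf")  # assumed convention for empty masks
    # Pairwise Euclidean distances between the two sets of pixel coordinates
    diffs = from_pts[:, None, :] - to_pts[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Nearest neighbour for each source pixel, then the worst case
    return float(dists.min(axis=1).max())

def hausdorff(gt_mask, pred_mask):
    """Symmetric Hausdorff distance: the worse of the two directed distances."""
    return max(directed_hausdorff(pred_mask, gt_mask),
               directed_hausdorff(gt_mask, pred_mask))

gt = np.zeros((5, 5), dtype=bool)
gt[0, 0] = True
pred = np.zeros((5, 5), dtype=bool)
pred[0, 0] = True
pred[3, 4] = True  # a stray predicted pixel far from the ground truth

print(directed_hausdorff(gt, pred), hausdorff(gt, pred))
```

Unlike Dice or IoU, a single stray pixel far from the target dominates this score, which is why it complements the overlap-based metrics.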