# Tutorial 4 - Yolo loss function
#### This is the fourth tutorial in a series of step-by-step walkthroughs of the Yolo algorithm

In this tutorial, we are going to focus on **YOLO2**, which is simpler to understand and easier to explain than its successors. If you are looking for YOLO3/4, I still encourage you to understand YOLO2 first, as the later versions are **very similar** to YOLO2 in terms of detection logic.

In [43]:
# some boilerplate code 
import tensorflow as tf 
import os, glob 
import numpy as np 
from YoloBackbone.yolo2 import getOffset, getIOU

In the last tutorial, we created training labels for Yolo by converting a list of box coordinates into a tensor of shape (conv H, conv W, anchor num, 6). In the first tutorial, we learned that the output of Yolo has shape (conv H, conv W, anchor num * (5 + class num)). For the Pascal VOC dataset, which has 20 classes, the author used 5 anchor boxes and resized all images to 416 x 416, so the output of Yolo has shape **(13, 13, 125)** and the label has shape **(13, 13, 5, 6)**.
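As a quick sanity check, the shape arithmetic can be written out in a few lines. This is only a sketch: the stride of 32 (416 / 13) is an assumption consistent with the numbers in the text, and the variable names are made up for illustration.

```python
# Shape arithmetic for YOLO2 on Pascal VOC (values taken from the text).
img_size = 416      # input images are resized to 416 x 416
stride = 32         # assumed total downsampling factor of the network (416 / 13)
num_anchor = 5      # anchor boxes per grid cell
num_class = 20      # Pascal VOC object classes

conv_h = conv_w = img_size // stride
# each anchor predicts 4 box values + 1 object score + one score per class
output_shape = (conv_h, conv_w, num_anchor * (5 + num_class))
# each anchor slot in the label stores 6 values
label_shape = (conv_h, conv_w, num_anchor, 6)

print(output_shape)  # (13, 13, 125)
print(label_shape)   # (13, 13, 5, 6)
```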

In this tutorial, we are going to implement the yolo loss function described in figure 1
![loss](Misc/YoloLoss.jpg)
<center>figure 1: yolo loss</center>

The above loss function can be decomposed into three components: 
1. the **localization loss**: measures the quality of the predicted box location and size
2. the **classification loss**: measures the quality of the object classification 
3. the **confidence loss**: measures the sensitivity of detection (are all objects detected? How many false positives are there?)

Our loss function needs four inputs, for reasons that I'll explain later: 
1. **features**: tensor, of shape (None, convH, convW, num anchor * (num class + 5) ) = y predicted 
2. **labels**: tensor, of shape (None, convH, convW, num anchor, 6) = y true  
3. **anchors**: tensor, of shape (num anchor, 2), predetermined anchor shapes
4. **hyperparam**: dict, tunable loss hyperparameters

### Step 1. Localization Loss
To measure the quality of the predicted boxes, it is important to measure the following two things:
1. The distance between the **predicted box center** and the **ground truth center**
2. The difference in box **sizes** and **shapes**

Here is a naive, loop-based version in pseudocode:
<pre>
    loss = 0
    for row in range [0,conv H]:
        for col in range [0, conv W]:
            for anchor k in range [0, 5]: # assume 5 anchors 
                if label[row, col,k] has a ground truth box:
                    l1 = calculate the difference in center coordinates 
                    l2 = calculate the difference in box width and height
                    loss += (l1 + l2)
</pre>

Of course, this loop-based algorithm is going to be catastrophically slow, because every grid cell and every anchor is visited one at a time. Some effort is needed to vectorize it with tensor operations.
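The idea behind the vectorization can be sketched with NumPy on toy data: multiplying by a 0/1 object flag and summing once gives the same result as the triple loop. The array names and the random data here are made up for illustration and are not part of the tutorial code.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, K = 13, 13, 5   # grid height, grid width, number of anchors

# Toy data: a 0/1 object flag and a per-slot error (l1 + l2 in the pseudocode).
has_box = (rng.random((H, W, K)) < 0.05).astype(np.float64)
error = rng.random((H, W, K))

# Loop version, mirroring the pseudocode.
loop_loss = 0.0
for row in range(H):
    for col in range(W):
        for k in range(K):
            if has_box[row, col, k] == 1:
                loop_loss += float(error[row, col, k])

# Vectorized version: zero out the empty slots, then sum everything at once.
vec_loss = float(np.sum(error * has_box))

print(np.isclose(loop_loss, vec_loss))
```

The same masking trick is what the flag in the label makes possible in the TensorFlow code.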

In [50]:
from Utilities.io import LoadPascal
import tensorflow.keras as tfk 
DATA_DIR = 'SampleData/Images'
ANNOT_DIR = 'SampleData/Annotations'
ANCHOR_PATH = 'SampleData/vocAnchors.txt'
CLASSNAME_PATH = 'SampleData/vocClasses.txt'
IMG_SHAPE = (416, 416, 3)
loader = LoadPascal(imgDir=DATA_DIR, annotDir=ANNOT_DIR, 
                    anchorPath=ANCHOR_PATH, classNamePath=CLASSNAME_PATH)
data = loader.loadData(imgShape=IMG_SHAPE, batchSize=2, imgOnly=False, shuffle=True)
MODEL_PATH = 'I:/model/YOLO/tutorial/yolo_voc.h5'
model = tfk.models.load_model(MODEL_PATH)



In [51]:
# get a sample input 
for (imgs, labels, imgNames) in data:
    break
print('sample batch shape:{}'.format(imgs.shape))
print('sample label shape:{}'.format(labels.shape))
features = model.predict(imgs)
print('sample output shape:{}'.format(features.shape))
anchors = loader.anchors
print('anchor shape:{}'.format(anchors.shape))
hyperparam={'local': 5.0, 'obj': 5.0, 'nonObj': 0.5, 'iouThresh': 0.5}
localCoef, nonObjCoef, objCoef = hyperparam['local'], hyperparam['nonObj'], hyperparam['obj']

sample batch shape:(2, 416, 416, 3)
sample label shape:(2, 13, 13, 5, 6)
sample output shape:(2, 13, 13, 125)
anchor shape:(5, 2)


In [52]:
# local loss 
numAnchor = anchors.shape[0]
fShape = tf.shape(features) # get feature shape (must be defined before numClass uses it)
numClass = fShape[-1] // numAnchor - 5
features = tf.reshape(features, shape=[fShape[0], fShape[1], fShape[2], numAnchor, numClass + 5])
"""
convert tx, ty to bx, by (see figure 2)
convert tw, th to bw, bh
getOffset is implemented in tutorial 3
"""
offset = tf.cast(getOffset([fShape[1], fShape[2]]), features.dtype)
boxXY = tf.nn.sigmoid(features[..., :2]) + offset
boxWH = tf.math.exp(features[..., 2:4]) * anchors
"""
get ground truth coordinates and sizes. Channel 0 of the label is the object flag, so x, y sit in
channels 1:3 and w, h in channels 3:5. Remember gtXY and gtWH were created in tutorial 3 to match
the scale of boxXY and boxWH
"""
gtXY, gtWH = labels[..., 1:3] + offset, (labels[..., 3:5] * anchors)
"""
calculate the squared error of the box centers and of the square-rooted box sizes.
localLoss has shape (batch, 13, 13, 5, 2) at this point
"""
localLoss = tf.square(gtXY - boxXY) + tf.square(tf.sqrt(gtWH) - tf.sqrt(boxWH))
print('1. boxXY shape: {}, gtXY shape:{}'.format(boxXY.shape, gtXY.shape))
print('2. boxWH shape: {}, gtWH shape:{}'.format(boxWH.shape, gtWH.shape))
print('3. localLoss shape: {}'.format(localLoss.shape))
"""
filter out anchor slots that do not contain ground truth boxes by leveraging the flag in label[i,j,k,0]
"""
mask = labels[..., 0:1]  # signal if an object appears in a given location
print('4. mask shape: {}'.format(mask.shape))
localLoss *= mask  # zero out the slots without a ground truth box
localLoss = tf.math.reduce_sum(localLoss, axis=[1, 2, 3, 4])
print('5. reduced local loss shape:{}'.format(localLoss.shape))

1. boxXY shape: (2, 13, 13, 5, 2), gtXY shape:(2, 13, 13, 5, 2)
2. boxWH shape: (2, 13, 13, 5, 2), gtWH shape:(2, 13, 13, 5, 2)
3. localLoss shape: (2, 13, 13, 5, 2)
4. mask shape: (2, 13, 13, 5, 1)
5. reduced local loss shape:(2,)


![BoxPred](Misc/BoxPred.jpg)  
<center>figure 2</center>
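The conversion from raw outputs to box coordinates can be traced by hand for a single anchor in a single grid cell. The sketch below uses NumPy and made-up numbers; the cell offset and the anchor size are hypothetical values, and expressing the anchor in grid units is an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Raw network outputs for one anchor in one grid cell (made-up numbers).
tx, ty, tw, th = 0.0, 0.0, 0.0, 0.0
cell_xy = np.array([6.0, 4.0])     # offset (cx, cy): top-left corner of the grid cell
anchor_wh = np.array([3.0, 2.0])   # hypothetical anchor size (pw, ph), assumed in grid units

# bx = sigmoid(tx) + cx, by = sigmoid(ty) + cy
box_xy = sigmoid(np.array([tx, ty])) + cell_xy
# bw = pw * exp(tw), bh = ph * exp(th)
box_wh = anchor_wh * np.exp(np.array([tw, th]))

print(box_xy)  # [6.5 4.5]
print(box_wh)  # [3. 2.]
```

Because the sigmoid keeps the center shift between 0 and 1, the predicted center can never leave its own grid cell, and a raw size output of 0 reproduces the anchor exactly.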

### Step 2. Classification Loss
For every grid cell and every anchor channel that contains a ground truth box, we measure the squared difference between the predicted class probabilities and the one-hot encoded ground truth class.
<pre>
for row in range [0, conv H]:
    for col in range [0, conv W]:
        for anchor in range [0, 5]:
            if anchor has a box:
                measure differences
</pre>

In [53]:
classProb = tf.nn.softmax(features[..., 5:])
"""
get the ground truth object class index and one hot encode it.
channel 0 of the label is the object flag, so the class index sits in channel 5
"""
gtClasses = tf.cast(labels[..., 5], tf.int32)
gtClasses = tf.one_hot(gtClasses, depth=numClass)
print('1. classProb shape: {}'.format(classProb.shape))
print('2. ground truth prob shape: {}'.format(gtClasses.shape))
"""
use the same mask trick to filter out locations without boxes
"""

classLoss = tf.square(classProb - gtClasses) * mask
print('3. classification loss shape: {}'.format(classLoss.shape))
classLoss = tf.math.reduce_sum(classLoss, axis=[1, 2, 3, 4])
print('4. classification loss reduced: {}'.format(classLoss.shape))

1. classProb shape: (2, 13, 13, 5, 20)
2. ground truth prob shape: (2, 13, 13, 5, 20)
3. classification loss shape: (2, 13, 13, 5, 20)
4. classification loss reduced: (2,)
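To see what the mask does to the classification loss, here is a small NumPy sketch with two anchor slots and three classes. All numbers and names are made up for illustration; only the first slot holds an object, so only that slot should contribute to the loss.

```python
import numpy as np

def softmax(x):
    # subtract the max for numerical stability
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

num_class = 3
logits = np.array([[2.0, 0.0, 0.0],   # slot 0: contains an object of class 0
                   [0.0, 0.0, 0.0]])  # slot 1: empty
gt_class = np.array([0, 0])           # class index stored in the label
mask = np.array([[1.0], [0.0]])       # object flag per slot

class_prob = softmax(logits)
one_hot = np.eye(num_class)[gt_class]  # one-hot encode the class index
class_loss = float(np.sum(np.square(class_prob - one_hot) * mask))

# The empty slot is zeroed out by the mask, so only slot 0 contributes.
slot0_loss = float(np.sum(np.square(class_prob[0] - one_hot[0])))
print(np.isclose(class_loss, slot0_loss))
```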


### Step 3. Confidence Loss
The confidence loss is slightly more complicated, because it is measured not only in anchor channels that contain ground truth boxes, but also in those that do not. Moreover, the ground truth confidence C is not stored in the label: we need to calculate it as the IOU between each predicted box and its ground truth box.
<pre>
for row in range [0, conv H]:
    for col in range [0, conv W]:
        for anchor in range [0, 5]:
            if anchor has a box:
                calculate Ci
            if anchor has no box:
                Ci = 0
            measure the differences between Ci and Ci_pred
</pre>
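Since the ground truth confidence relies on IOU, it helps to see the calculation for a single pair of boxes. The tutorial imports `getIOU` from `YoloBackbone.yolo2`, and its implementation is not shown here, so the function below is a hypothetical scalar stand-in that assumes boxes are given as (center x, center y, width, height).

```python
def iou_xywh(box_a, box_b):
    """IOU of two boxes given as (center x, center y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # convert centers and sizes to corner coordinates
    a_x1, a_y1, a_x2, a_y2 = ax - aw / 2, ay - ah / 2, ax + aw / 2, ay + ah / 2
    b_x1, b_y1, b_x2, b_y2 = bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2
    # width and height of the overlap, clipped at 0 when the boxes are disjoint
    inter_w = max(0.0, min(a_x2, b_x2) - max(a_x1, b_x1))
    inter_h = max(0.0, min(a_y2, b_y2) - max(a_y1, b_y1))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

print(iou_xywh((1.0, 1.0, 2.0, 2.0), (1.0, 1.0, 2.0, 2.0)))  # 1.0 (identical boxes)
print(iou_xywh((1.0, 1.0, 2.0, 2.0), (2.0, 1.0, 2.0, 2.0)))  # overlap 2, union 6
print(iou_xywh((1.0, 1.0, 2.0, 2.0), (5.0, 5.0, 2.0, 2.0)))  # 0.0 (no overlap)
```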

In [63]:
"""
get predicted object score
"""
objScore = tf.nn.sigmoid(features[..., 4:5])
print('1. objScore shape:{}'.format(objScore.shape))
"""
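for a given grid in the label, there is at most one anchor channel that has a bounding box.
However, that doesn't mean Ci=0 for the other anchor channels in the same grid. After all, their
predicted boxes may overlap with the ground truth as well. The final yoloLoss function handles
this with the iouThresh hyperparameter; this cell uses the simpler target Ci=0.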
"""
iou = getIOU(tf.concat([boxXY, boxWH], axis=-1), tf.concat([gtXY, gtWH], axis=-1))
print('2. IOU shape:{}'.format(iou.shape))

nonObj = (1.0 - mask) * nonObjCoef
nonObjLoss = nonObj * tf.square(0 - objScore)
print('3. nonObjLoss shape:{}'.format(nonObjLoss.shape))
objLoss = mask * tf.square(objScore - tf.expand_dims(iou, axis=-1)) * objCoef
print('4. obj loss shape:{}'.format(objLoss.shape))
objLoss = tf.math.reduce_sum(nonObjLoss + objLoss, axis=[1, 2, 3, 4])
print('5. confidence loss reduced:{}'.format(objLoss.shape))

1. objScore shape:(2, 13, 13, 5, 1)
2. IOU shape:(2, 13, 13, 5)
3. nonObjLoss shape:(2, 13, 13, 5, 1)
4. obj loss shape:(2, 13, 13, 5, 1)
5. confidence loss reduced:(2,)
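The two terms of the confidence loss can be checked by hand on one grid cell with three anchor slots. This NumPy sketch uses made-up scores and IOU values, with the same coefficients as `hyperparam` (`obj` = 5.0, `nonObj` = 0.5); the snake_case names are only for illustration.

```python
import numpy as np

obj_coef, non_obj_coef = 5.0, 0.5        # same values as in hyperparam

# One grid cell with three anchor slots (made-up numbers).
mask = np.array([1.0, 0.0, 0.0])         # only slot 0 holds a ground truth box
obj_score = np.array([0.6, 0.2, 0.0])    # predicted confidence after the sigmoid
iou = np.array([0.8, 0.3, 0.1])          # IOU of predicted box and ground truth box

# slots with an object: push the score towards the IOU
obj_loss = obj_coef * mask * np.square(obj_score - iou)
# slots without an object: push the score towards 0
non_obj_loss = non_obj_coef * (1.0 - mask) * np.square(0.0 - obj_score)

conf_loss = float(np.sum(obj_loss + non_obj_loss))
print(round(conf_loss, 6))  # 0.22 = 5.0 * 0.2**2 + 0.5 * 0.2**2
```

The smaller weight on the non-object term keeps the many empty slots from dominating the few slots that actually contain objects.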


### Bring everything together

In [64]:
def yoloLoss(features, labels, anchors, hyperparam=None):
    # get hyperparameters for the loss function
    # (a None default avoids sharing one mutable dict between calls)
    if hyperparam is None:
        hyperparam = {'local': 5.0, 'obj': 5.0, 'nonObj': 0.5, 'iouThresh': 0.5}
    localCoef, nonObjCoef, objCoef = hyperparam['local'], hyperparam['nonObj'], hyperparam['obj']
    fShape = tf.shape(features)
    # numAnchor = labels.shape[-2] # label has shape (None, convH, convW, num anchor, 6)
    numAnchor = anchors.shape[0]
    numClass = fShape[-1] // numAnchor - 5
    features = tf.reshape(features, shape=[fShape[0], fShape[1], fShape[2], numAnchor, numClass + 5])

    # convert raw features into box coordinates w.r.t. each square in the feature map
    # please refer to https://arxiv.org/pdf/1612.08242.pdf page 3
    offset = tf.cast(getOffset([fShape[1], fShape[2]]), features.dtype)
    boxXY = tf.nn.sigmoid(features[..., :2]) + offset
    boxWH = tf.math.exp(features[..., 2:4]) * anchors
    objScore = tf.nn.sigmoid(features[..., 4:5])
    classProb = tf.nn.softmax(features[..., 5:])
    # yolo loss is defined as
    # W1 * localization loss + W2 * confidence loss + classification loss
    # https://arxiv.org/abs/1506.02640 page 4

    # calculate the classification loss
    mask = labels[..., 0:1]  # signal if an object appears in a given location
    labels = labels[..., 1:]
    gtClasses = tf.cast(labels[..., 4], tf.int32)
    gtClasses = tf.one_hot(gtClasses, depth=numClass)
    classLoss = tf.square(classProb - gtClasses) * mask
    classLoss = tf.math.reduce_sum(classLoss, axis=[1, 2, 3, 4])

    # calculate the localization loss
    # coordinates w.r.t. each square in the feature map
    gtXY, gtWH = labels[..., 0:2] + offset, (labels[..., 2:4] * anchors)
    localLoss = tf.square(gtXY - boxXY) + tf.square(tf.sqrt(gtWH) - tf.sqrt(boxWH))
    localLoss *= mask
    localLoss = localCoef * tf.math.reduce_sum(localLoss, axis=[1, 2, 3, 4])

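    # calculate the confidence loss
    iou = getIOU(tf.concat([boxXY, boxWH], axis=-1), tf.concat([gtXY, gtWH], axis=-1))
    # best IOU among the anchors of each grid cell, shape (None, convH, convW, 1)
    bestIOU = tf.math.reduce_max(iou, axis=-1, keepdims=True)
    # only penalize empty anchor slots in cells whose best IOU is below the threshold
    nonObj = bestIOU < hyperparam['iouThresh']
    nonObj = tf.cast(tf.expand_dims(nonObj, axis=-1), mask.dtype) * (1.0 - mask) * nonObjCoef
    nonObjLoss = nonObj * tf.square(0 - objScore)
    # slots with an object: the target confidence is the IOU with the ground truth box
    objLoss = mask * tf.square(objScore - tf.expand_dims(iou, axis=-1)) * objCoef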

    objLoss = tf.math.reduce_sum(nonObjLoss + objLoss, axis=[1, 2, 3, 4])
    # overall loss
    return tf.math.reduce_mean(objLoss + localLoss + classLoss), [objLoss, localLoss, classLoss]

In [69]:
for (imgs, labels, names) in data:
    predicted = model.predict(imgs)
    # yoloLoss returns the components in the order [confidence, localization, classification]
    loss, [objL, localL, classL] = yoloLoss(predicted, labels, anchors)
    print(loss)

tf.Tensor(9.819646, shape=(), dtype=float32)
tf.Tensor(1.7078857, shape=(), dtype=float32)
tf.Tensor(6.215113, shape=(), dtype=float32)
