# Tutorial on YOLO (V3) Training  

Now let's switch gears to YOLO_V3 training. As a reminder, the network outputs 3 detection maps with strides 32, 16 and 8 respectively (stride denotes the factor by which the image is downsampled). Each cell of a detection map is expected to predict an object through one of its bounding boxes, each containing 5+C attributes. During YOLO training, a cell is only expected to predict an object whose ground truth center falls in the cell's receptive field. Since each cell can predict three bounding boxes, the bounding box with the highest IoU with the ground truth box is responsible for detecting the object.
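As a quick sanity check of the numbers above, the sketch below computes the grid size of each detection map and the total number of predicted boxes. It assumes a 416 x 416 input and the 20 Pascal VOC classes, which are the values used in the later cells of this tutorial.

```python
inp_dim = 416          # network input resolution (assumed, matches the later cells)
strides = (32, 16, 8)  # downsampling factor of each detection map
boxes_per_cell = 3     # bounding boxes predicted by each cell
num_classes = 20       # Pascal VOC

grid_sizes = [inp_dim // s for s in strides]
attrs_per_box = 5 + num_classes  # x, y, w, h, objectness + class scores
total_boxes = sum(g * g * boxes_per_cell for g in grid_sizes)

print(grid_sizes)     # [13, 26, 52]
print(attrs_per_box)  # 25
print(total_boxes)    # 10647
```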

Next we need to understand how to calculate the loss for the bounding box coordinates, the objectness score and the class scores. The YOLO loss function is explained part by part below.

### Bounding Box Coordinates

This term is the loss from predicting the bounding box coordinates. Note that tx, ty, tw, th are the network outputs. The function computes a sum over every bounding box predictor of every grid cell. 1_obj is defined as 1 if an object (based on the ground truth label) is expected in grid cell i and the jth bounding box predictor, and 0 otherwise.
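The masked sum can be sketched in plain Python as follows. This is only an illustration with made-up numbers, assuming a squared-error form for the coordinate term (the form the `Loss` function later in this tutorial uses through `nn.MSELoss`); `coord_loss` is a hypothetical helper name.

```python
def coord_loss(pred, target, obj_mask):
    """Sum of squared errors over the predictors where obj_mask is 1."""
    return sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, obj_mask))

# Three bounding box predictors; only the first and third are responsible for an object.
pred_x = [0.6, 0.1, 0.9]
true_x = [0.5, 0.0, 0.4]
obj_mask = [1, 0, 1]

loss_x = coord_loss(pred_x, true_x, obj_mask)
print(round(loss_x, 4))  # 0.26
```

The second predictor is off by 0.1 as well, but it contributes nothing because its mask entry is 0.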
   
### Object Confidence

This term is the loss associated with the confidence score of each bounding box predictor. x_obj is the network output for the object score. y_obj_i is equal to 1 if an object (based on the ground truth label) is expected in grid cell i and the jth bounding box predictor, and 0 otherwise. The loss function is a binary logistic regression (binary cross-entropy) cost.
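A minimal sketch of the binary logistic regression cost for a single predictor is shown below. It assumes the prediction has already been passed through a sigmoid so that it lies in (0, 1); the function name and the clamping constant `eps` are illustrative choices.

```python
import math

def binary_cross_entropy(p, y, eps=1e-12):
    """Binary logistic regression cost for one prediction p in (0, 1) and target y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# A confident prediction is cheap when it is right and expensive when it is wrong.
print(round(binary_cross_entropy(0.9, 1), 4))  # 0.1054
print(round(binary_cross_entropy(0.9, 0), 4))  # 2.3026
```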

### Classification Loss

This is essentially also a binary logistic regression cost, representing the probability that the detected object belongs to a particular class. It is used instead of a softmax over the classes, which assumes the classes are mutually exclusive; that is not always the case, e.g. "Woman" and "Person". y_class_i is equal to 1 for the responsible bounding box predictor at the ground truth class index, and 0 otherwise. 1_obj is involved here so that classification error is not penalized when no ground truth object is assigned to the bounding box. This is necessary to improve model stability.
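The difference between independent sigmoids and a softmax can be seen in a small sketch. The class names and logits below are made up for illustration; the point is only that a softmax forces overlapping classes to share probability mass, while per-class sigmoids do not.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

class_names = ["Woman", "Person", "Car"]  # illustrative class names
logits = [4.0, 4.0, -3.0]                 # made-up scores for one box

independent = [sigmoid(z) for z in logits]  # each class scored on its own
exclusive = softmax(logits)                 # classes compete for a total of 1

for name, s, e in zip(class_names, independent, exclusive):
    print(f"{name}: sigmoid={s:.3f} softmax={e:.3f}")
```

With the sigmoid both "Woman" and "Person" score about 0.98, whereas the softmax caps each of them at about 0.5.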

<img src="imgs/ipy_pic/eq.png" width="600" style="fix: right;">

The total loss is the sum of all loss terms, each of which can be scaled by a weight factor, a hyper-parameter that the user can tune.

### Training Data

We use the Pascal VOC dataset to demonstrate the loss calculation. Please refer to the [YOLO Website](https://pjreddie.com/darknet/yolo/) for how to download the data and generate the label files for training. 

Each image has a paired label file. Each row of the label file describes one object in the format [class_index], [x_center], [y_center], [width], [height]. Note that the coordinates and dimensions are normalized by the image width and height.
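As an illustration of this format, the sketch below converts one label row into pixel corner coordinates. The row contents and the image size are hypothetical, and the sketch assumes whitespace-separated values; `parse_label_line` is not part of the tutorial code.

```python
def parse_label_line(line, img_w, img_h):
    """Convert one 'class x_center y_center width height' row to pixel corner coordinates."""
    class_index, xc, yc, bw, bh = line.split()
    xc, yc, bw, bh = float(xc), float(yc), float(bw), float(bh)
    x1 = (xc - bw / 2) * img_w
    y1 = (yc - bh / 2) * img_h
    x2 = (xc + bw / 2) * img_w
    y2 = (yc + bh / 2) * img_h
    return int(class_index), (x1, y1, x2, y2)

# Hypothetical label row for a 500 x 375 image.
cls, box = parse_label_line("11 0.34 0.61 0.42 0.51", img_w=500, img_h=375)
print(cls, box)
```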

### Evaluation 

Two of the most popular measures of object detector accuracy during training are [Precision](https://sanchom.wordpress.com/tag/average-precision) and [Recall](https://sanchom.wordpress.com/tag/average-precision). We define a "True Positive" as a bounding box that has IoU > 0.5 with a ground truth box of the same class and an object score > 0.5. We define a "False Positive" as any other bounding box with an object score > 0.5. Please be aware that this evaluation is not the same as the mAP metric used by object detection competitions such as VOC and COCO. The official mAP is still necessary to evaluate your model accuracy for competitions or for benchmarking against other approaches.
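Given the counts defined above, precision, recall and F1 score follow directly. The sketch below uses illustrative counts and a hypothetical helper name; the conventions for empty batches (recall 1 without ground truth, precision 0 without proposals) mirror the `eval` function shown later in this tutorial.

```python
def precision_recall_f1(n_correct, n_proposals, n_gt):
    """n_correct: true positives, n_proposals: TP + FP, n_gt: ground truth objects."""
    precision = n_correct / n_proposals if n_proposals else 0.0
    recall = n_correct / n_gt if n_gt else 1.0
    f1 = 2 * precision * recall / (precision + recall + 1e-16)
    return precision, recall, f1

# Illustrative counts: 8 ground truth objects, 7 proposals, all 7 correct.
precision, recall, f1 = precision_recall_f1(n_correct=7, n_proposals=7, n_gt=8)
print(f"Precision: {precision:.4f} Recall: {recall:.4f} F1: {f1:.4f}")
# Precision: 1.0000 Recall: 0.8750 F1: 0.9333
```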

### Section 1: Data Loading and Processing

PyTorch provides data loading utilities to help users prepare the data easily. torch.utils.data.Dataset can be subclassed to create the dataset class. torch.utils.data.DataLoader handles batching and shuffling the data for training. Please refer to the [PyTorch tutorial](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html) for detailed information if you are not familiar with these utilities.

In [1]:
# Run some setup code for this notebook.
import os
import cv2  # used by ListDataset below to read and resize images
import torch
from Darknet_VOC import Darknet
from utils.Data_Loading import ListDataset
from torch.utils.data import Dataset
import time
import copy
import numpy as np
import math
import torch.nn as nn
from utils.util import *
import warnings
warnings.filterwarnings("ignore")

Let's first set up the training and validation paths. Each "*.txt" file lists image files, one per line, as in the example shown below. Here we concatenate all the 2007 and 2012 train and val files for training (~15k images) and use 2007_test.txt (~5k images) for validation. 

/home/xingyu/Desktop/Project/Pascal_VOC/VOCdevkit/VOC2007/JPEGImages/000012.jpg 

/home/xingyu/Desktop/Project/Pascal_VOC/VOCdevkit/VOC2007/JPEGImages/000017.jpg

In [2]:
train_path = '/home/xingyu/Desktop/Project/Pascal_VOC/train.txt'
val_path = '/home/xingyu/Desktop/Project/Pascal_VOC/2007_test.txt'

The dataset class is created as below. For each image path, the corresponding label file path is obtained simply by replacing "JPEGImages" with "labels" and ".jpg" with ".txt".
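For instance, applying the two replacements to the first example path listed above gives:

```python
img_path = "/home/xingyu/Desktop/Project/Pascal_VOC/VOCdevkit/VOC2007/JPEGImages/000012.jpg"
label_path = img_path.replace("JPEGImages", "labels").replace(".jpg", ".txt")
print(label_path)
# /home/xingyu/Desktop/Project/Pascal_VOC/VOCdevkit/VOC2007/labels/000012.txt
```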

OpenCV is used to load the image. The image is padded, resized and converted to a PyTorch input as already explained in the "Yolo_V3_Inference_Step_by_Step" tutorial.

The labels have to be recalculated since we pad the image. Since different images contain different numbers of objects, a zero-filled label matrix with shape (50 x 5) is created for each image so that the labels can be stacked into a batch; the label data is then filled in. Each sample of our dataset is a dict {'input_img': input_img, 'orig_img': pad_img, 'label': filled_labels, 'path': img_path}. 
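The label recalculation can be illustrated on a single box. The sketch below is a simplified, scalar version of the arithmetic, assuming the image is padded to a square with the smaller half of the padding placed first (top or left); the helper name and the example values are hypothetical.

```python
def adjust_label_for_padding(xc, yc, bw, bh, img_w, img_h):
    """Re-normalize a (x_center, y_center, width, height) label after padding to a square."""
    dim_diff = abs(img_h - img_w)
    pad1 = dim_diff // 2  # padding added before the image content
    if img_h <= img_w:
        pad_left, pad_top = 0, pad1  # landscape: pad top and bottom
    else:
        pad_left, pad_top = pad1, 0  # portrait: pad left and right
    side = max(img_w, img_h)         # side length of the padded square
    new_xc = (xc * img_w + pad_left) / side
    new_yc = (yc * img_h + pad_top) / side
    new_bw = bw * img_w / side
    new_bh = bh * img_h / side
    return new_xc, new_yc, new_bw, new_bh

# A centered box in a 500 x 375 image: 125 rows of padding are split 62 / 63.
adjusted = adjust_label_for_padding(0.5, 0.5, 0.4, 0.4, img_w=500, img_h=375)
print([round(v, 3) for v in adjusted])  # [0.5, 0.499, 0.4, 0.3]
```

The box keeps its pixel size, so its normalized height shrinks from 0.4 to 0.3 once the image height grows from 375 to 500.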

In [3]:
class ListDataset(Dataset):
    def __init__(self, list_path, img_size=416):
        with open(list_path, 'r') as file:
            self.img_files = file.readlines()
        self.label_files = [path.replace('JPEGImages', 'labels').replace('.jpg', '.txt') for path in self.img_files]
        self.img_shape = (img_size, img_size)
        self.max_objects = 50

    def __len__(self):
        return len(self.img_files)

    def __getitem__(self, index):

        #---------
        #  Image
        #---------

        img_path = self.img_files[index % len(self.img_files)].rstrip()
        img = cv2.imread(img_path)
        h, w, _ = img.shape
        dim_diff = np.abs(h - w)
        # Upper (left) and lower (right) padding
        pad1, pad2 = dim_diff // 2, dim_diff - dim_diff // 2
        # Determine padding
        pad = ((pad1, pad2), (0, 0), (0, 0)) if h <= w else ((0, 0), (pad1, pad2), (0, 0))
        # Add padding
        pad_img = np.pad(img, pad, 'constant', constant_values=128)

        padded_h, padded_w, _ = pad_img.shape

        # Resize
        pad_img = cv2.resize(pad_img, self.img_shape)
        # Channels-first
        input_img = pad_img[:, :, ::-1].transpose((2, 0, 1)).copy()
        # As pytorch tensor
        input_img = torch.from_numpy(input_img).float().div(255.0)

        #---------
        #  Label
        #---------

        label_path = self.label_files[index % len(self.img_files)].rstrip()

        labels = None
        if os.path.exists(label_path):
            labels = np.loadtxt(label_path).reshape(-1, 5)
            # Extract coordinates for unpadded + unscaled image
            x1 = w * (labels[:, 1] - labels[:, 3]/2)
            y1 = h * (labels[:, 2] - labels[:, 4]/2)
            x2 = w * (labels[:, 1] + labels[:, 3]/2)
            y2 = h * (labels[:, 2] + labels[:, 4]/2)
            # Adjust for added padding
            x1 += pad[1][0]
            y1 += pad[0][0]
            x2 += pad[1][0]
            y2 += pad[0][0]
            # Calculate ratios from coordinates
            labels[:, 1] = ((x1 + x2) / 2) / padded_w
            labels[:, 2] = ((y1 + y2) / 2) / padded_h
            labels[:, 3] *= w / padded_w
            labels[:, 4] *= h / padded_h
        # Fill matrix
        filled_labels = np.zeros((self.max_objects, 5))
        if labels is not None:
            filled_labels[range(len(labels))[:self.max_objects]] = labels[:self.max_objects]

        filled_labels = torch.from_numpy(filled_labels)

        sample = {'input_img': input_img, 'orig_img': pad_img, 'label': filled_labels, 'path': img_path}

        return sample

Now we can use the DataLoader to load the data for both training and validation.  

In [4]:
batch_size = 8
inp_dim = 416
dataloaders = {'train': torch.utils.data.DataLoader(ListDataset(train_path, img_size=inp_dim), batch_size=batch_size, shuffle=True),
               'val': torch.utils.data.DataLoader(ListDataset(val_path, img_size=inp_dim), batch_size=batch_size, shuffle=True)}

for i_batch, sample_batched in enumerate(dataloaders["train"]):

    input_images_batch, Orig_images_batch, label_batch, path_batch = sample_batched['input_img'], sample_batched['orig_img'], sample_batched['label'], sample_batched['path']

    print(i_batch, input_images_batch.shape, Orig_images_batch.shape, label_batch.shape)

    if i_batch == 3:
        break

0 torch.Size([8, 3, 416, 416]) torch.Size([8, 416, 416, 3]) torch.Size([8, 50, 5])
1 torch.Size([8, 3, 416, 416]) torch.Size([8, 416, 416, 3]) torch.Size([8, 50, 5])
2 torch.Size([8, 3, 416, 416]) torch.Size([8, 416, 416, 3]) torch.Size([8, 50, 5])
3 torch.Size([8, 3, 416, 416]) torch.Size([8, 416, 416, 3]) torch.Size([8, 50, 5])


### Section 2: Loss Calculation 

In order to obtain the loss, we need to deduce which cell and anchor on the network output feature map is expected to predict the object. The ground truth values for the responsible bounding box will be assigned for loss calculation.

The function below converts the ground truth labels produced by the dataloader into a series of matrices that contain the ground truth info for each cell of the corresponding detection map.

1. mask (representing 1_obj) is an indexing matrix with size [batch x num_anchors x grid x grid]. The responsible bounding box predictor is assigned 1, all others 0. 

2. tx, ty, tw and th have the same shape as mask but contain the ground truth coordinates instead of 1. Note that the label data are converted here to the offset of the centroid within its grid cell, and to a log-space transform for width and height. 

3. tconf is a matrix with size [batch x num_anchors x grid x grid]. The value is 1 (the highest probability) for the responsible bounding box predictor, and 0 otherwise.

4. tcls is a matrix with size [batch x num_anchors x grid x grid x num_class]. The value is 1 (the highest probability) for the responsible bounding box predictor at the ground truth class index, and 0 otherwise.
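The target values for a single object can be worked out by hand. The sketch below does this in plain Python for a hypothetical label on the 13 x 13 detection map, using the three largest anchors. It assumes a shape-only IoU (both boxes aligned at the same corner) for anchor matching, which may differ in detail from the `bbox_iou` helper in `utils.util`.

```python
import math

def shape_iou(w1, h1, w2, h2):
    """IoU of two boxes that share the same top-left corner, so only their shapes matter."""
    inter = min(w1, w2) * min(h1, h2)
    return inter / (w1 * h1 + w2 * h2 - inter)

grid_size = 13
stride = 416 / grid_size  # 32 pixels per cell
# Anchors of the stride-32 detection map, rescaled from pixels to grid units.
anchors = [(116 / stride, 90 / stride), (156 / stride, 198 / stride), (373 / stride, 326 / stride)]

# Hypothetical normalized label: class 11, centered in the image, 20% wide and 30% tall.
cls, xc, yc, bw, bh = 11, 0.5, 0.5, 0.2, 0.3

gx, gy, gw, gh = xc * grid_size, yc * grid_size, bw * grid_size, bh * grid_size
gi, gj = int(gx), int(gy)  # responsible cell (column, row)
ious = [shape_iou(gw, gh, aw, ah) for aw, ah in anchors]
best_n = ious.index(max(ious))  # responsible anchor

tx, ty = gx - gi, gy - gj  # offset of the center within its cell
tw = math.log(gw / anchors[best_n][0] + 1e-16)  # log-space width target
th = math.log(gh / anchors[best_n][1] + 1e-16)  # log-space height target

print(gi, gj, best_n)  # 6 6 0
print(round(tx, 3), round(ty, 3), round(tw, 3), round(th, 3))
```

In this example only `mask[b, 0, 6, 6]`, `tconf[b, 0, 6, 6]` and `tcls[b, 0, 6, 6, 11]` would be set to 1 for the image with batch index `b`.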

In [5]:
def build_targets(target, anchors, grid_size, num_anchors = 3, num_classes = 20):

    nB = target.size(0)
    nA = num_anchors
    nC = num_classes
    nG = grid_size
    mask = torch.zeros(nB, nA, nG, nG)
    tx = torch.zeros(nB, nA, nG, nG)
    ty = torch.zeros(nB, nA, nG, nG)
    tw = torch.zeros(nB, nA, nG, nG)
    th = torch.zeros(nB, nA, nG, nG)
    tconf = torch.zeros(nB, nA, nG, nG)
    tcls = torch.zeros(nB, nA, nG, nG, nC)

    for b in range(nB):  # for each image
        for t in range(target.shape[1]):  # for each object
            if target[b, t].sum() == 0:  # if the row is empty
                continue
            # Convert to object label data to feature map
            gx = target[b, t, 1] * nG
            gy = target[b, t, 2] * nG
            gw = target[b, t, 3] * nG
            gh = target[b, t, 4] * nG
            # Get grid box indices
            gi = int(gx)
            gj = int(gy)
            # Get shape of gt box
            gt_box = torch.FloatTensor(np.array([0, 0, gw, gh])).unsqueeze(0)  # 1 x 4
            # Get shape of anchor box
            anchor_shapes = torch.FloatTensor(
                np.concatenate((np.zeros((len(anchors), 2)), np.array(anchors)), 1))
            # Calculate iou between gt and anchor shapes
            anch_ious = bbox_iou(gt_box, anchor_shapes)
            # Find the best matching anchor box
            best_n = np.argmax(anch_ious)
            # Masks
            mask[b, best_n, gj, gi] = 1
            # Coordinates
            tx[b, best_n, gj, gi] = gx - gi
            ty[b, best_n, gj, gi] = gy - gj
            # Width and height
            tw[b, best_n, gj, gi] = math.log(gw / anchors[best_n][0] + 1e-16)
            th[b, best_n, gj, gi] = math.log(gh / anchors[best_n][1] + 1e-16)
            # One-hot encoding of label
            target_label = int(target[b, t, 0])
            tcls[b, best_n, gj, gi, target_label] = 1
            tconf[b, best_n, gj, gi] = 1

    return mask, tx, ty, tw, th, tconf, tcls

The loss function calculates the loss between one of the detection maps and the ground truth label converted using the "build_targets" function. The network outputs, including the center offsets (x, y), the log-space dimensions (w, h), the object score (pred_conf) and the class scores (pred_cls), are reshaped into matrices of the same shape as the ground truth matrices. Each loss term and the overall loss are then calculated based on the equation shown before.  
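To make the relation between the raw outputs and an actual box concrete, the sketch below decodes one predictor by hand. It is a scalar illustration with a hypothetical helper name, and it works in pixels with an unscaled anchor, whereas the `Loss` function keeps boxes in grid units with anchors divided by the stride.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decode_box(t_x, t_y, t_w, t_h, cell_x, cell_y, anchor_w, anchor_h, stride):
    """Map raw outputs of one predictor to a box (center x, center y, width, height) in pixels."""
    b_x = (sigmoid(t_x) + cell_x) * stride  # sigmoid keeps the center inside its cell
    b_y = (sigmoid(t_y) + cell_y) * stride
    b_w = anchor_w * math.exp(t_w)          # width and height scale the anchor in log-space
    b_h = anchor_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h

# All-zero outputs place the box at the cell center with exactly the anchor's size.
print(decode_box(0.0, 0.0, 0.0, 0.0, cell_x=6, cell_y=6, anchor_w=116, anchor_h=90, stride=32))
# (208.0, 208.0, 116.0, 90.0)
```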

In [6]:
def Loss(input, target, anchors, inp_dim, num_anchors = 3, num_classes = 20):

    nA = num_anchors  # number of anchors
    nB = input.size(0)  # number of batches
    nG = input.size(2)  # number of grid size
    nC = num_classes
    stride = inp_dim / nG

    # Tensors for cuda support
    FloatTensor = torch.cuda.FloatTensor if input.is_cuda else torch.FloatTensor
    ByteTensor = torch.cuda.ByteTensor if input.is_cuda else torch.ByteTensor

    prediction = input.view(nB, nA, 5 + nC, nG, nG).permute(0, 1, 3, 4, 2).contiguous()  # reshape the output data

    # Get outputs
    x = torch.sigmoid(prediction[..., 0])  # Center x
    y = torch.sigmoid(prediction[..., 1])  # Center y
    w = prediction[..., 2]  # Width
    h = prediction[..., 3]  # Height
    pred_conf = torch.sigmoid(prediction[..., 4])  # Conf
    pred_cls = torch.sigmoid(prediction[..., 5:])  # Cls pred

    # Calculate offsets for each grid
    grid_x = torch.arange(nG).repeat(nG, 1).view([1, 1, nG, nG]).type(FloatTensor)
    grid_y = torch.arange(nG).repeat(nG, 1).t().view([1, 1, nG, nG]).type(FloatTensor)
    scaled_anchors = FloatTensor([(a_w / stride, a_h / stride) for a_w, a_h in anchors])
    anchor_w = scaled_anchors[:, 0:1].view((1, nA, 1, 1))
    anchor_h = scaled_anchors[:, 1:2].view((1, nA, 1, 1))

    # Add offset and scale with anchors
    pred_boxes = FloatTensor(prediction[..., :4].shape)
    pred_boxes[..., 0] = x.data + grid_x
    pred_boxes[..., 1] = y.data + grid_y
    pred_boxes[..., 2] = torch.exp(w.data) * anchor_w
    pred_boxes[..., 3] = torch.exp(h.data) * anchor_h

    mask, tx, ty, tw, th, tconf, tcls = build_targets(
        target=target.cpu().data,
        anchors=scaled_anchors.cpu().data,
        grid_size=nG,
        num_anchors=nA,
        num_classes=num_classes)

    # Handle target variables
    tx, ty = tx.type(FloatTensor), ty.type(FloatTensor)
    tw, th = tw.type(FloatTensor), th.type(FloatTensor)
    tconf, tcls = tconf.type(FloatTensor), tcls.type(FloatTensor)
    mask = mask.type(ByteTensor)

    mse_loss = nn.MSELoss(reduction='sum')  # Coordinate loss
    bce_loss = nn.BCELoss(reduction='sum')  # Confidence loss
    loss_x = mse_loss(x[mask], tx[mask])
    loss_y = mse_loss(y[mask], ty[mask])
    loss_w = mse_loss(w[mask], tw[mask])
    loss_h = mse_loss(h[mask], th[mask])
    loss_conf = bce_loss(pred_conf, tconf)
    loss_cls = bce_loss(pred_cls[mask], tcls[mask])
    loss = loss_x + loss_y + loss_w + loss_h + loss_conf + loss_cls

    return (loss, loss_x, loss_y, loss_w, loss_h, loss_conf, loss_cls)

Below is a loss calculation test using the pretrained model.

In [7]:
model = Darknet()
model.load_state_dict(torch.load('weights/Dartnet_VOC_Weights'))
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
CUDA = torch.cuda.is_available()
model = model.to(device)

inputs = input_images_batch.to(device)
labels = label_batch.to(device)
Final_pre, *output = model(inputs, CUDA)

anchors = ([(10, 13), (16, 30), (33, 23)], [(30, 61), (62, 45), (59, 119)], [(116, 90), (156, 198), (373, 326)])
loss_item = {"total_loss": 0, "x": 0, "y": 0, "w": 0, "h": 0, "conf": 0, "cls": 0}

for i in range(len(output)):
    losses = Loss(output[i], labels.float(), anchors[i], inp_dim=inp_dim, num_anchors = 3, num_classes = 20)
    for j, name in enumerate(loss_item):  # j avoids shadowing the outer loop index i
        loss_item[name] += losses[j]
loss_item

{'total_loss': tensor(81.8016, device='cuda:0', grad_fn=<AddBackward0>),
 'x': tensor(2.2884, device='cuda:0', grad_fn=<AddBackward0>),
 'y': tensor(2.1074, device='cuda:0', grad_fn=<AddBackward0>),
 'w': tensor(1.6042, device='cuda:0', grad_fn=<AddBackward0>),
 'h': tensor(1.2552, device='cuda:0', grad_fn=<AddBackward0>),
 'conf': tensor(69.0602, device='cuda:0', grad_fn=<AddBackward0>),
 'cls': tensor(5.4863, device='cuda:0', grad_fn=<AddBackward0>)}

### Section 3: Evaluation 

Evaluation is crucial during training for picking the best-performing weights. The "eval" function is built to evaluate the model performance. The function accepts the model predictions after the object confidence and NMS filters (the output of the function "write_results") and the ground truth label data from the dataloader, and returns the batch precision, recall and F1 score.

The number of ground truth objects in the batch can be derived directly from the ground truth label data.

The total number of proposals (true positives + false positives) is the number of bounding boxes with object score > 0.5.

A prediction is counted as a true positive if its IoU against a ground truth bounding box is higher than 0.5 and its class matches the ground truth class. 
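The IoU test at the heart of this matching can be sketched for two boxes in (x1, y1, x2, y2) corner format. This is a plain-Python illustration with a hypothetical name, not the `bbox_iou` helper imported from `utils.util`, which operates on tensors and may differ in detail.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # zero when the boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(round(box_iou((0, 0, 10, 10), (5, 5, 15, 15)), 4))  # 0.1429
print(box_iou((0, 0, 10, 10), (20, 20, 30, 30)))          # 0.0
```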

In [8]:
def eval(output, labels, img_width, img_height):

    nProposals = int((output[:, 5] > 0.5).sum().item())
    nGT = 0
    nCorrect = 0
    for b in range(labels.shape[0]):  # for each image
        prediction = output[output[:,0] == b]  # filter out the predictions of corresponding image
        for t in range(labels.shape[1]):  # for each object
            if labels[b, t].sum() == 0:  # if the row is empty
                continue
            nGT += 1
            gt_label = convert_label(labels[b, t].unsqueeze(0), img_width, img_height)
            gt_box = gt_label[:, 1:5]
            for i in range(prediction.shape[0]):
                pred_box = prediction[i, 1:5].unsqueeze(0)
                iou = bbox_iou(pred_box, gt_box)
                pred_label = prediction[i, -1]
                target_label = gt_label[0, 0]
                if iou > 0.5 and pred_label == target_label:
                    nCorrect += 1
    recall = float(nCorrect / nGT) if nGT else 1
    precision = float(nCorrect / nProposals) if nProposals else 0
    F1_score = 2 * recall * precision / (recall + precision + 1e-16)

    return F1_score, precision, recall

In [9]:
def convert_label(image_anno, img_width, img_height):

    """
    Function: convert image annotation : center x, center y, w, h (normalized) to x1, y1, x2, y2 for corresponding img
    """
    x_center = image_anno[:, 1]
    y_center = image_anno[:, 2]
    width = image_anno[:, 3]
    height = image_anno[:, 4]

    output = torch.zeros_like(image_anno)
    output[:,0] = image_anno[:,0]
    output[:, 1], output[:, 3] = x_center - width / 2, x_center + width / 2
    output[:, 2], output[:, 4] = y_center - height / 2, y_center + height / 2

    output[:, [1, 3]] *= img_width
    output[:, [2, 4]] *= img_height

    return output.type(torch.FloatTensor)

Let's use the pretrained model to test the results 

In [13]:
Final_pre = write_results(Final_pre, confidence=0.5, num_classes=20, nms_conf=0.4)

if not isinstance(Final_pre, int):
    F1_score, precision, recall = eval(Final_pre.cpu(), labels, img_width=inp_dim, img_height=inp_dim)
else:
    F1_score, precision, recall = 0, 0, 0

print('Recall: {:.4f} Precision: {:.4f} F1 Score: {:.4f}'.format(recall, precision, F1_score))                                                                                    

Recall: 0.8750 Precision: 1.0000 F1 Score: 0.9333


### Section 4:  Training 

So far, we have prepared all the necessary tools for training. We will train the network using transfer learning. The pretrained Yolo_V3 weights can be downloaded from the [YOLO Website](https://pjreddie.com/darknet/yolo/). We can use these weights either as an initialization or as a fixed feature extractor. As taught in the Stanford CS231n course, since our dataset is very similar to the original one but not very large, it is appropriate to freeze the weights of the whole network except the last three layers of each detection branch. However, if you have other types of datasets, you can switch your strategy based on the task of interest. 
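One way to decide which parameters stay trainable is to read the layer index out of each parameter name. The sketch below shows only that name-parsing step in plain Python; the parameter names are hypothetical and assume a '<container>.<layer index>.<sub-module>.<tensor>' naming scheme, and the layer indices are the ones used in the training cell of this tutorial.

```python
trainable_layers = {79, 80, 81, 91, 92, 93, 103, 104, 105}

def is_trainable(param_name):
    """Return True if the layer index embedded in the parameter name should stay trainable."""
    layer_index = int(param_name.split(".")[1])
    return layer_index in trainable_layers

# Hypothetical parameter names; the real names depend on how the model is defined.
for name in ["module_list.0.conv_0.weight", "module_list.81.conv_81.bias"]:
    print(name, is_trainable(name))
```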

In this project we have already generated the initialization weights; you can download them from this [link](https://pan.baidu.com/s/1-O-jD0uU3OM6yNaUSLjAhw#list/path=%2F).

We use the Adam optimization algorithm as the default choice. SGD+Momentum with learning rate decay might be worth trying if you are looking to squeeze out more accuracy, but it requires more tuning. 

The F1 score is used to pick the best weights. The results for epoch 0 are presented below. The overall training takes 5 hours for 24 epochs on an Nvidia Quadro P4000.

In [14]:
model = Darknet()
model.load_state_dict(torch.load('weights/Dartnet_VOC_weights_ini'))

#  This section is to freeze all the network except the output three layers
for name, param in model.named_parameters():
    param.requires_grad = False
    if int(name.split('.')[1]) in (79, 80, 81, 91, 92, 93, 103, 104, 105):
        param.requires_grad = True

model = model.to(device)
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()))

since = time.time()

best_model_wts = copy.deepcopy(model.state_dict())
best_F1_score = 0.0

num_epochs = 1
for epoch in range(num_epochs):
    print('Epoch {}/{}'.format(epoch, num_epochs-1))
    print('-' * 10)

    # Each epoch has a training and validation phase
    for phase in ['train', 'val']:
        if phase == 'train':
            model.train() # set model to training mode
        else:
            model.eval() # set model to evaluate mode

        running_loss, running_xy_loss, running_wh_loss, running_conf_loss, running_cls_loss = 0.0, 0.0, 0.0, 0.0, 0.0
        running_recall, running_precision, running_F1_score = 0.0, 0.0, 0.0

        # iterate over data
        for i_batch, sample_batched in enumerate(dataloaders[phase]):
            inputs, labels = sample_batched['input_img'], sample_batched['label']
            inputs = inputs.to(device)
            labels = labels.to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward
            # track history if only in train
            with torch.set_grad_enabled(phase == 'train'):

                Final_pre, *output = model(inputs, CUDA)
                Final_pre = write_results(Final_pre, confidence=0.5, num_classes=20, nms_conf=0.4)

                anchors = (
                [(10, 13), (16, 30), (33, 23)], [(30, 61), (62, 45), (59, 119)], [(116, 90), (156, 198), (373, 326)])

                loss_item = {"total_loss": 0, "x": 0, "y": 0, "w": 0, "h": 0, "conf": 0, "cls": 0}

                for i in range(len(output)):
                    losses = Loss(output[i], labels.float(), anchors[i], inp_dim=inp_dim, num_anchors = 3, num_classes = 20)
                    for j, name in enumerate(loss_item):  # j avoids shadowing the outer loop index i
                        loss_item[name] += losses[j]

                if not isinstance(Final_pre, int):
                    F1_score, precision, recall = eval(Final_pre.cpu(), labels, img_width=inp_dim, img_height=inp_dim)
                else:
                    F1_score, precision, recall = 0, 0, 0

                loss = loss_item['total_loss']
                xy_loss = loss_item['x']+loss_item['y']
                wh_loss = loss_item['w']+loss_item['h']
                conf_loss = loss_item['conf']
                cls_loss = loss_item['cls']

                # backward + optimize only if in training phase
                if phase == 'train':
                    loss.backward()
                    optimizer.step()

                # statistics
                running_loss += loss.item()
                running_xy_loss += xy_loss.item()
                running_wh_loss += wh_loss.item()
                running_conf_loss += conf_loss.item()
                running_cls_loss += cls_loss.item()
                running_recall += recall
                running_precision += precision
                running_F1_score += F1_score

        epoch_loss = running_loss / ((i_batch+1)*batch_size)
        epoch_xy_loss = running_xy_loss / ((i_batch+1)*batch_size)
        epoch_wh_loss = running_wh_loss / ((i_batch + 1) * batch_size)
        epoch_conf_loss = running_conf_loss / ((i_batch + 1) * batch_size)
        epoch_cls_loss = running_cls_loss / ((i_batch + 1) * batch_size)

        epoch_recall = running_recall / (i_batch+1)
        epoch_precision = running_precision / (i_batch+1)
        epoch_F1_score = running_F1_score / (i_batch+1)

        print(
            '{} Loss: {:.4f} Recall: {:.4f} Precision: {:.4f} F1 Score: {:.4f}'.format(phase, epoch_loss, epoch_recall,
                                                                                       epoch_precision, epoch_F1_score))
        print(
            '{} xy: {:.4f} wh: {:.4f} conf: {:.4f} class: {:.4f}'.format(phase, epoch_xy_loss, epoch_wh_loss,
                                                                                                        epoch_conf_loss,
                                                                                                        epoch_cls_loss))

        # deep copy the model
        if phase == 'val' and epoch_F1_score > best_F1_score:
            best_F1_score = epoch_F1_score
            best_model_wts = copy.deepcopy(model.state_dict())

time_elapsed = time.time() - since
print('Training complete in {:.0f}m {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
print('Best F1 score: {:4f}'.format(best_F1_score))

model.load_state_dict(best_model_wts)
torch.save(model.state_dict(), 'Dartnet_VOC_Weights')

Epoch 0/0
----------
train Loss: 62.8359 Recall: 0.4240 Precision: 0.7345 F1 Score: 0.5238
train xy: 0.8823 wh: 1.5822 conf: 54.4661 class: 5.9054
val Loss: 26.4952 Recall: 0.5652 Precision: 0.7699 F1 Score: 0.6418
val xy: 0.7917 wh: 0.9582 conf: 21.5911 class: 3.1542
Training complete in 14m 27s
Best F1 score: 0.641817
