In [1]:
import torch

# The Predictions Vector

Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object. 

In this notebook S = 7: each image is divided into a 7x7 grid, and every grid cell is used for object detection.

Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts.

Each grid cell predicts B bounding boxes, and each bounding box comes with a confidence score:
$$\text{confidence} = \Pr(\text{Object}) \times \text{IOU}_{\text{pred}}^{\text{truth}}$$
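
The IOU term in this formula can be computed from the box corners. The following is a minimal sketch, assuming boxes are given as `(x1, y1, x2, y2)` corner coordinates; the function name `iou` and the sample boxes are illustrative and do not come from the original notebook.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format (assumed layout)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

pred_box = (0.0, 0.0, 2.0, 2.0)
truth_box = (1.0, 1.0, 3.0, 3.0)
# With an object present, Pr(Object) = 1, so the confidence target is the IOU itself
confidence_target = 1.0 * iou(pred_box, truth_box)
```

Here the boxes overlap in a 1x1 square and cover a union of area 7, so the target confidence is 1/7.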

https://hackernoon.com/understanding-yolo-f5a74bbc7967
Each grid cell predicts B bounding boxes as well as C class probabilities. Each bounding box prediction has 5 components: (x, y, w, h, confidence).

The (x, y) coordinates represent the center of the box, relative to the grid cell location (remember that, if the center of the box does not fall inside the grid cell, then this cell is not responsible for it). These coordinates are normalized by the cell size to fall between 0 and 1. The (w, h) box dimensions are also normalized to [0, 1], but relative to the image size. Let’s look at an example:  

<p align="center">
  <img width="460" height="300" src="./images/01.png">
</p>
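
The normalization described above can be sketched in a few lines. This is an illustrative helper, not code from the original notebook: the name `encode_box`, the pixel-coordinate input format, and the sample box are assumptions.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Map a box given in pixels (center cx, cy and size w, h) to its
    responsible grid cell and its normalized (x, y, w, h) targets."""
    cell_w = img_w / S
    cell_h = img_h / S
    # Index of the grid cell that contains the box center
    col = min(int(cx // cell_w), S - 1)
    row = min(int(cy // cell_h), S - 1)
    # Center offset inside that cell, in units of the cell size: [0, 1)
    x = cx / cell_w - col
    y = cy / cell_h - row
    # Width and height relative to the whole image: [0, 1]
    return row, col, (x, y, w / img_w, h / img_h)

row, col, target = encode_box(cx=240, cy=100, w=112, h=224, img_w=448, img_h=448)
```

For a 448x448 image each cell is 64 pixels wide, so a center at (240, 100) falls in row 1, column 3, with offsets (0.75, 0.5625) inside that cell and a size of (0.25, 0.5) relative to the image.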

Note that the confidence reflects the presence or absence of an object of any class.  

Now that we understand the 5 components of the box prediction, remember that each grid cell makes B of those predictions, so in total there are $S\times{}S\times{}B\times{}5$ outputs related to bounding box predictions.

It is also necessary to predict the class probabilities, Pr(Class(i) | Object). This is a conditional probability: it is conditioned on the grid cell containing an object. In practice, it means that if no object is present in the grid cell, the loss function will not penalize it for a wrong class prediction, as we will see later. The network only predicts one set of class probabilities per cell, regardless of the number of boxes B. That makes S x S x C class probabilities in total.

Adding the class predictions to the output vector, we get an S x S x (B * 5 + C) tensor as output.  
<p align="center">
    <img src="./images/output.png" width="600">
</p>
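
To make the shape of the output concrete, the sketch below reshapes a flat prediction vector into the S x S x (B * 5 + C) tensor and splits it into box and class parts. The ordering of the last dimension (all boxes first, then the class probabilities) is an assumption made for illustration; implementations differ on this layout, and the random tensor merely stands in for a real network output.

```python
import torch

S, B, C = 7, 2, 20

# Stand-in for the network output: a batch of 4 flat prediction vectors
flat = torch.randn(4, S * S * (B * 5 + C))
out = flat.view(-1, S, S, B * 5 + C)                 # (batch, S, S, 30)

# Assumed layout of the last dimension: [box 1 | box 2 | class probabilities]
boxes = out[..., : B * 5].reshape(-1, S, S, B, 5)    # x, y, w, h, confidence per box
class_probs = out[..., B * 5 :]                      # one set of C probabilities per cell
```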

# The Network

The network structure looks like a normal CNN, with convolutional and max pooling layers, followed by 2 fully connected layers at the end:

<p align="center">
    <img src="./images/main_network.png">
</p>

# The loss function
Part 1
$$\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}{[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2]}$$

This equation computes the loss related to the predicted bounding box position (x, y). Don’t worry about λ for now, just consider it a given constant. The function computes a sum over each bounding box predictor (j = 0 .. B) of each grid cell (i = 0 .. S^2). $\mathbb{1}_{ij}^{obj}$ is defined as follows:  
- 1, if an object is present in grid cell i and the jth bounding box predictor is “responsible” for that prediction
- 0, otherwise  

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. In other words, only the box with the highest IOU is regressed toward the ground truth.

The other terms in the equation should be easy to understand: (x, y) are the predicted bounding box position and (x̂, ŷ) are the actual position from the training data.
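
As a rough illustration of Part 1, the masked sum can be written with a few tensor operations. This is a sketch under stated assumptions rather than a reference implementation: the function name `xy_loss`, the tensor layout `(S*S, B, 2)`, and the toy values are hypothetical, and `lambda_coord=5.0` is the value given in the YOLO paper.

```python
import torch

def xy_loss(pred_xy, true_xy, obj_mask, lambda_coord=5.0):
    """Position loss. pred_xy and true_xy have shape (S*S, B, 2);
    obj_mask has shape (S*S, B) and is 1 only for the responsible predictor."""
    sq_err = ((pred_xy - true_xy) ** 2).sum(dim=-1)   # (S*S, B): squared x error + squared y error
    return lambda_coord * (obj_mask * sq_err).sum()   # the mask zeroes out all other predictors

S, B = 7, 2
pred_xy = torch.zeros(S * S, B, 2)
true_xy = torch.zeros(S * S, B, 2)
obj_mask = torch.zeros(S * S, B)

# One object: cell 10, predictor 1 is responsible
true_xy[10, 1] = torch.tensor([0.3, 0.4])
obj_mask[10, 1] = 1.0
# A large error on a predictor that is not responsible: ignored by the mask
true_xy[0, 0] = torch.tensor([1.0, 1.0])

loss = xy_loss(pred_xy, true_xy, obj_mask)
```

Only the responsible predictor contributes: its squared error is 0.3² + 0.4² = 0.25, which is then scaled by λ.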

Part 2
$$\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}{[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2]}$$

This is the loss related to the predicted box width and height. The equation looks similar to the first one, except for the square root. What’s up with that? Quoting the paper again:

> Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.

The square root amplifies differences between small values, so the same absolute error costs more on a small box than on a large one.
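
A quick numeric check makes the effect of the square root visible. The widths below are made-up values chosen for illustration: both boxes are off by the same absolute amount (0.05), once for a small box and once for a large one.

```python
import math

def plain_term(w, w_hat):
    """Squared error on the raw width."""
    return (w - w_hat) ** 2

def sqrt_term(w, w_hat):
    """Squared error on the square roots, as in Part 2 of the loss."""
    return (math.sqrt(w) - math.sqrt(w_hat)) ** 2

small_plain = plain_term(0.10, 0.15)
large_plain = plain_term(0.80, 0.85)
small_sqrt = sqrt_term(0.10, 0.15)
large_sqrt = sqrt_term(0.80, 0.85)

print(f"plain: small={small_plain:.6f} large={large_plain:.6f}")
print(f"sqrt:  small={small_sqrt:.6f} large={large_sqrt:.6f}")
```

Without the square root both errors are penalized equally; with it, the error on the small box is penalized several times more than the same error on the large box.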

In [2]:
from torchsummary import summary

In [3]:
class MainModel(torch.nn.Module):
    def __init__(self):
        super(MainModel, self).__init__()

        self.pred = torch.nn.Sequential(
            # Conv 1 7x7x64 s=2
            torch.nn.Conv2d(3, 64, (7, 7), 2, padding=(3, 3)),  # 64
            torch.nn.LeakyReLU(0.1),
            # Max Pool 1
            torch.nn.MaxPool2d((2, 2), 2),  # 64
            # Conv 2
            torch.nn.Conv2d(64, 192, (3, 3), padding=(1,1)),  # 192
            # Max Pool 2
            torch.nn.MaxPool2d((2, 2), 2),  # 192
            # Conv 3
            torch.nn.Conv2d(192, 128, (1, 1)),  # 128
            # Conv 4
            torch.nn.Conv2d(128, 256, (3, 3), padding=(1,1)),  # 256
            # Conv 5
            torch.nn.Conv2d(256, 256, (1, 1)),  # 256
            # Conv 6
            torch.nn.Conv2d(256, 512, (1, 1)),  # 512
            # Max Pool 3
            torch.nn.MaxPool2d((2, 2), 2),  # 512
            # Conv 7
            torch.nn.Conv2d(512, 256, (1, 1)),  # 256
            # Conv 8
            torch.nn.Conv2d(256, 512, (3, 3), padding=(1,1)),  # 512
            # Conv 9
            torch.nn.Conv2d(512, 256, (1, 1)),  # 256
            # Conv 10
            torch.nn.Conv2d(256, 512, (3, 3), padding=(1,1)),  # 512
            # Conv 11
            torch.nn.Conv2d(512, 256, (1, 1)),  # 256
            # Conv 12
            torch.nn.Conv2d(256, 512, (3, 3), padding=(1,1)),  # 512
            # Conv 13
            torch.nn.Conv2d(512, 256, (1, 1)),  # 256
            # Conv 14
            torch.nn.Conv2d(256, 512, (3, 3), padding=(1,1)),  # 512
            # Conv 15
            torch.nn.Conv2d(512, 512, (1, 1)),  # 512
            # Conv 16
            torch.nn.Conv2d(512, 1024, (3, 3), padding=(1,1)),  # 1024
            # Max Pool 4
            torch.nn.MaxPool2d((2, 2), 2),  # 1024
            # Conv 17
            torch.nn.Conv2d(1024, 512, (1, 1)),  # 512
            # Conv 18
            torch.nn.Conv2d(512, 1024, (3, 3), padding=(1,1)),  # 1024
            # Conv 19
            torch.nn.Conv2d(1024, 512, (1, 1)),  # 512
            # Conv 20
            torch.nn.Conv2d(512, 1024, (3, 3), padding=(1,1)),  # 1024
            # Conv 21
            torch.nn.Conv2d(1024, 1024, (3, 3), padding=(1,1)),  # 1024
            # Conv 22
            torch.nn.Conv2d(1024, 1024, (3, 3), 2, padding=(1,1)),  # 1024
            # Conv 23
            torch.nn.Conv2d(1024, 1024, (3, 3), padding=(1,1)),  # 1024
            # Conv 24
            torch.nn.Conv2d(1024, 1024, (3, 3), padding=(1,1)),  # 1024
            # FC1
            torch.nn.Flatten(),
            torch.nn.Linear(7*7*1024, 4096),  # 4096
            # FC2
            torch.nn.Linear(4096, 1470),  # 7x7x30=1470
        )

    def forward(self, x):
        return self.pred(x)


In [4]:
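# Fall back to the CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
my_nn = MainModel().to(device)

In [5]:
summary(my_nn, (3, 448, 448), device=device)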

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 224, 224]           9,472
         MaxPool2d-2         [-1, 64, 112, 112]               0
            Conv2d-3        [-1, 192, 112, 112]         110,784
         MaxPool2d-4          [-1, 192, 56, 56]               0
            Conv2d-5          [-1, 128, 56, 56]          24,704
            Conv2d-6          [-1, 256, 56, 56]         295,168
            Conv2d-7          [-1, 256, 56, 56]          65,792
            Conv2d-8          [-1, 512, 56, 56]         131,584
         MaxPool2d-9          [-1, 512, 28, 28]               0
           Conv2d-10          [-1, 256, 28, 28]         131,328
           Conv2d-11          [-1, 512, 28, 28]       1,180,160
           Conv2d-12          [-1, 256, 28, 28]         131,328
           Conv2d-13          [-1, 512, 28, 28]       1,180,160
           Conv2d-14          [-1, 256,

**Some comments about the architecture:**
- The architecture was designed for the Pascal VOC dataset, where the authors used S=7, B=2 and C=20. This explains why the final feature maps are 7x7, and also the size of the output (7x7x(2*5+20) = 1470). The network takes 448x448 inputs, so each grid cell covers 448/7 = 64 pixels per side. Using this network with a different grid size, input resolution, or number of classes might require tuning the layer dimensions.
- The authors mention that there is a fast version of YOLO, with fewer convolutional layers. The table above, however, displays the full version.
- The sequences of 1x1 reduction layers and 3x3 convolutional layers were inspired by the GoogLeNet (Inception) model. 1x1 convolutions are covered in the next section.
- The final layer uses a linear activation function. All other layers use a leaky rectified linear activation.
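
The 7x7 final feature map follows from the strides in the model above. The sketch below traces only the layers that change the spatial size, using the standard output-size formula for convolution and pooling; the helper name `out_size` is illustrative. The remaining convolutions (1x1, or 3x3 with padding 1 and stride 1) leave the size unchanged.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding) for the layers that downsample
downsampling = [
    ("Conv 1", 7, 2, 3),
    ("Max Pool 1", 2, 2, 0),
    ("Max Pool 2", 2, 2, 0),
    ("Max Pool 3", 2, 2, 0),
    ("Max Pool 4", 2, 2, 0),
    ("Conv 22", 3, 2, 1),
]

size = 448
sizes = []
for name, kernel, stride, padding in downsampling:
    size = out_size(size, kernel, stride, padding)
    sizes.append(size)
    print(f"{name:<10} -> {size}x{size}")

S, B, C = 7, 2, 20
n_outputs = S * S * (B * 5 + C)   # size of the last fully connected layer
```

Six halvings take the 448x448 input down to 7x7, which matches the `7*7*1024` input of the first fully connected layer.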

# 1x1 Convolution
The 1x1 convolution operation involves convolving the inputs with filters of size 1x1, usually with zero padding and a stride of 1.  
Suppose a general-purpose convolutional layer outputs a tensor of shape (B, K, H, W), where
- B : Batch size
- K : The number of convolutional filters (kernels)
- H : Height
- W : Width

Applying a 1x1 convolution with F filters to this tensor gives an output of shape (B, F, H, W), changing the filter dimension from K to F while leaving the height and width unchanged.
## Benefits?
- Small filters keep the number of parameters low.
- Spatial downsampling (through pooling) may cause information loss; reducing the filter dimension with a 1x1 convolution can be an alternative.
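
The change of filter dimension from K to F can be checked directly in PyTorch. The sizes below are arbitrary illustrative values, and the 3x3 layer is included only to compare parameter counts.

```python
import torch

B, K, H, W = 2, 256, 28, 28
F = 64                                   # number of 1x1 filters (illustrative value)
x = torch.randn(B, K, H, W)

# kernel_size=1 with the default stride of 1 and no padding
conv1x1 = torch.nn.Conv2d(K, F, kernel_size=1)
y = conv1x1(x)                           # (B, F, H, W): only the filter dimension changes

# Parameter counts: K*F weights + F biases, versus 9*K*F weights + F biases for 3x3
n_params_1x1 = sum(p.numel() for p in conv1x1.parameters())
conv3x3 = torch.nn.Conv2d(K, F, kernel_size=3, padding=1)
n_params_3x3 = sum(p.numel() for p in conv3x3.parameters())
```

With these sizes the 1x1 layer has 16,448 parameters, roughly nine times fewer than the 147,520 of the 3x3 layer producing the same output shape.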