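I was halfway through writing up YOLOv2 when I mentioned, in a comment on 尹相志's Facebook, that I was still working on YOLOv2. He replied that starting from v3 would be fine, so I decided to start from v3. I believe that even now, with v7 out (2022-08-01), v3 is still very practical.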

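Here are the references first. Understanding an architecture is not easy, so the list naturally grows long, and citing sources properly is a necessary part of learning: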

* [YOLO: Real-Time Object Detection](https://pjreddie.com/darknet/yolo/)
* [pjreddie/darknet](https://github.com/pjreddie/darknet)
* [YOLOv3 paper translation](https://hackmd.io/@shaoeChen/SyjI6W2zB/https%3A%2F%2Fhackmd.io%2F%40shaoeChen%2FryHg904h9)
* [YOLOv3 in-depth analysis](https://blog.csdn.net/leviopku/article/details/82660381)
* [qqwweee/keras-yolo3](https://github.com/qqwweee/keras-yolo3)
* [YunYang1994/tensorflow-yolov3](https://github.com/YunYang1994/tensorflow-yolov3)
* [joymyhome: understanding pad in the YOLOv3 config file](https://blog.csdn.net/joymyhome/article/details/106349084)

For the related data preprocessing, see my other notebook, [Arch_YOLOv2_dataset_preprocess.ipynb](https://github.com/shaoeChen/deeplearning/blob/master/tf2/Arch_YOLOv2_dataset_preprocess.ipynb).

The version running in my docker container is TensorFlow 2.1. The popular line these days is "life is short, I use PyTorch", but I am sticking with tf + keras for now.

In [1]:
import tensorflow as tf
tf.__version__

'2.1.0'

Select the GPU to use

In [2]:
gpus = tf.config.experimental.list_physical_devices(device_type='GPU')
tf.config.experimental.set_visible_devices(devices=gpus[0], device_type='GPU')

In [3]:
gpus

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'),
 PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]

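The full model can be found in [the author's git repository](https://github.com/pjreddie/darknet/blob/master/cfg/darknet53.cfg). The description in the paper is fairly high-level, so from the paper alone you probably get the architecture but not the details. The main point is that it uses a large number of residual blocks, which is why the network is as deep as 53 layers and is named Darknet-53.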

If you are interested in the residual part, you can also refer to my notes:
1. [git_tf2/Arch_ResNet50](https://github.com/shaoeChen/deeplearning/blob/master/tf2/Arch_ResNet50.ipynb)
2. [ResNet paper translation](https://hackmd.io/@shaoeChen/SyjI6W2zB/https%3A%2F%2Fhackmd.io%2F%40shaoeChen%2FSy_e1mCEU)

The network-in-network idea also shows up here; if you are interested, it is worth some further reading.

First, let's build the model. The architecture from the paper is shown here for easy reference:  
<img src="https://hackmd.io/_uploads/SyeUzk_2c.png" width="25%">

Now we can start on the model itself. To explain things in more detail, I will go through it in sections.

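First comes the very first conv layer. According to the architecture diagram above, it has 32 filters with a filter size of 3x3. If you look at the official cfg file, it also pads one pixel outward, so the output of the first conv layer keeps the spatial size 256x256. With 32 filters, the output is 256x256x32.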

In [4]:
def darknet_53(input_shape: tuple):
    """See the figure above for the outline of the model; check the author's git for the details."""
    x_input = tf.keras.layers.Input(input_shape)
        
    x = tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), strides=1, padding='same')(x_input)
    x = tf.keras.layers.BatchNormalization(name='bn1')(x)
    # darknet's leaky activation uses a slope of 0.1; the Keras default is 0.3
    x = tf.keras.layers.LeakyReLU(alpha=0.1)(x)
    
    model = tf.keras.models.Model(inputs=x_input, outputs=x, name='YOLOv3')
    return model

In [5]:
input_shape = (256, 256, 3)
darknet = darknet_53(input_shape)

In [6]:
darknet.summary()

Model: "YOLOv3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 256, 256, 3)]     0         
_________________________________________________________________
conv2d (Conv2D)              (None, 256, 256, 32)      896       
_________________________________________________________________
bn1 (BatchNormalization)     (None, 256, 256, 32)      128       
_________________________________________________________________
leaky_re_lu (LeakyReLU)      (None, 256, 256, 32)      0         
Total params: 1,024
Trainable params: 960
Non-trainable params: 64
_________________________________________________________________
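The parameter counts in the summary above can be checked by hand. This is a minimal pure-Python sketch; the helper names `conv2d_params` and `batchnorm_params` are made up for illustration, and it assumes the Keras defaults (a bias term on the convolution, and four vectors per channel in batch normalization, two of them non-trainable).

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters, use_bias=True):
    """Weights of a standard 2D convolution, plus one bias per filter."""
    weights = kernel_h * kernel_w * in_channels * filters
    return weights + (filters if use_bias else 0)


def batchnorm_params(channels):
    """Return (total, trainable, non_trainable) for batch normalization.

    gamma and beta are trainable; the moving mean and variance are not.
    """
    return 4 * channels, 2 * channels, 2 * channels


conv = conv2d_params(3, 3, 3, 32)  # 3x3 kernel, RGB input, 32 filters
bn_total, bn_trainable, bn_frozen = batchnorm_params(32)

print(conv)                 # 896
print(bn_total)             # 128
print(conv + bn_total)      # 1024 total params
print(conv + bn_trainable)  # 960 trainable params
print(bn_frozen)            # 64 non-trainable params
```

The five printed numbers match the `Param #` column and the totals in the summary.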


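Of course, the second conv layer has the same structure; only the number of filters, the stride and a few other things differ. To keep the model code tidy, we set up a function to handle the standard convolution.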

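If you look at darknet53.cfg in the official repository, you will notice a pad=1 parameter; basically every conv layer in YOLOv3 has pad=1. It took me a while to learn that this 1 does not literally mean "pad 1 pixel in every direction". It works as a flag that sets the padding to `(kernel size - 1) / 2` for odd kernel sizes (as far as I can tell, darknet computes it as the integer division `size / 2`). So with filter size=3 each side is padded by 1 pixel (2 pixels in total per dimension), and with filter size=1 there is no padding at all.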
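A minimal sketch of how the `pad` flag seems to turn into pixels. This is my reading of darknet's cfg parser, not code taken from it: the function name `darknet_padding` is hypothetical, and the rule "if `pad` is non-zero, padding becomes `size // 2`" is an assumption worth verifying against the darknet source.

```python
def darknet_padding(size, pad=0, padding=0):
    """Per-side padding for a [convolutional] section (assumed rule).

    Assumption: a non-zero pad flag overrides the explicit padding value
    with size // 2 (integer division); otherwise padding is used as given.
    """
    if pad:
        return size // 2
    return padding


for size in (1, 3):
    print(size, darknet_padding(size, pad=1))
# 1 0   -> a 1x1 conv is not padded
# 3 1   -> a 3x3 conv is padded by 1 pixel on each side
```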

For the related calculation, see my notes on Andrew Ng's [deep learning course](https://hackmd.io/@shaoeChen/BJDUj508z)

* The dimension after convolution is $\left\lfloor \dfrac{n+2p-f}{s} \right\rfloor + 1$

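Looking at the architecture diagram, there are two cases:
1. The spatial size is unchanged, which means same padding with a 3x3 filter. Plugging into the formula: (256 + 2p - 3) / 1 + 1 = 256, so 2p = 2 and p = 1. We do not need to handle this ourselves; we just set `padding='same'`.
2. Downsampling: (256 + 2p - 3) / 2 + 1 = 128, so 2p = 1 and p = 0.5. If p = 1 means padding all four sides, I think of p = 0.5 as padding only two sides (one per dimension). That is just my way of looking at it; the course mentions simply rounding down, and with p = 1 that also gives floor(255 / 2) + 1 = 128. Setting same padding produces the same output size as well. Perhaps the paper specifies the padding direction somewhere and I missed it.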
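The two cases can be checked numerically. A small sketch of the output-size formula with rounding down, where the padding before and after is passed separately so that the "pad only one side" reading of p = 0.5 can be expressed; `conv_output_size` is a made-up helper name.

```python
def conv_output_size(n, f, s, p_before, p_after):
    """Output length along one axis: floor((n + p_before + p_after - f) / s) + 1."""
    return (n + p_before + p_after - f) // s + 1


n, f = 256, 3
print(conv_output_size(n, f, s=1, p_before=1, p_after=1))  # 256, case 1 (same size)
print(conv_output_size(n, f, s=2, p_before=1, p_after=1))  # 128, p = 1 with rounding down
print(conv_output_size(n, f, s=2, p_before=1, p_after=0))  # 128, pad one side only
print(conv_output_size(n, f, s=2, p_before=0, p_after=0))  # 127, no padding is one short
```

Padding both sides and rounding down, or padding a single side, give the same 128; only the unpadded version falls short.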

In [7]:
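def common_conv(input_x, filters: int, filter_size: tuple, name: str, strides: int = 1):
    """Standard convolution block: Conv2D -> BatchNormalization -> LeakyReLU.
    
    input_x: input tensor
    filters: number of filters
    filter_size: filter size, e.g. (3, 3), (1, 1)
    name: prefix used for the layer names
    strides: stride; set to 2 for downsampling, in which case the input is
        zero-padded on two edges (top and left) and 'valid' padding is used
    
    """ 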
    _padding = 'same'
    
    if strides == 2:
        input_x = tf.keras.layers.ZeroPadding2D(((1, 0), (1, 0)), name=name + '_zero_padding')(input_x)
        _padding = 'valid'
        
    
    x = tf.keras.layers.Conv2D(filters=filters, 
                               kernel_size=filter_size, 
                               strides=strides,
                               name=name + '_conv_1',
                               padding=_padding
                              )(input_x)
    x = tf.keras.layers.BatchNormalization(name=name + '_bn_2')(x)
    # darknet's leaky activation uses a slope of 0.1; the Keras default is 0.3
    x = tf.keras.layers.LeakyReLU(alpha=0.1, name=name + '_leaky_3')(x)
    return x
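A small pure-Python sketch of why padding only the top and left edges in the stride-2 branch lines up with symmetric 1-pixel padding. It assumes that darknet pads 1 pixel on every side and rounds the output size down, and that TensorFlow's `'same'` padding puts an odd leftover pixel at the end (bottom/right). The helper names `touched_indices` and `tf_same_padding` are made up for illustration.

```python
import math


def touched_indices(n, f, s, p_before, p_after):
    """Return (padded length, output length, indices read by some window)."""
    padded = n + p_before + p_after
    out = (padded - f) // s + 1
    touched = set()
    for i in range(out):
        touched.update(range(i * s, i * s + f))
    return padded, out, touched


def tf_same_padding(n, f, s):
    """(before, after) padding used by 'same' padding along one axis (assumed rule)."""
    out = math.ceil(n / s)
    total = max((out - 1) * s + f - n, 0)
    before = total // 2
    return before, total - before


padded, out, touched = touched_indices(256, 3, 2, p_before=1, p_after=1)
unused = sorted(set(range(padded)) - touched)
print(out, unused)                # 128 [257]
print(tf_same_padding(256, 3, 2))  # (0, 1)
print(tf_same_padding(256, 3, 1))  # (1, 1)
```

With symmetric padding the last padded index (257) is never read, so only the leading pad matters, which is what `ZeroPadding2D(((1, 0), (1, 0)))` followed by `'valid'` reproduces. Under the stated assumption, `padding='same'` with stride 2 gives the same output size but pads the opposite edge, so the windows are shifted by one pixel.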

In [8]:
def darknet_53(input_shape: tuple):
    """See the figure above for the outline of the model; check the author's git for the details."""
    x_input = tf.keras.layers.Input(input_shape)
    
    x = common_conv(x_input, filters=32, filter_size=(3, 3), name='layer_1', strides=1)
    x = common_conv(x, filters=32, filter_size=(3, 3), name='layer_2', strides=2)
    
    model = tf.keras.models.Model(inputs=x_input, outputs=x, name='YOLOv3')
    return model

In [9]:
input_shape = (256, 256, 3)
darknet = darknet_53(input_shape)
darknet.summary()

Model: "YOLOv3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 256, 256, 3)]     0         
_________________________________________________________________
layer_1_conv_1 (Conv2D)      (None, 256, 256, 32)      896       
_________________________________________________________________
layer_1_bn_2 (BatchNormaliza (None, 256, 256, 32)      128       
_________________________________________________________________
layer_1_leaky_3 (LeakyReLU)  (None, 256, 256, 32)      0         
_________________________________________________________________
layer_2_zero_padding (ZeroPa (None, 257, 257, 32)      0         
_________________________________________________________________
layer_2_conv_1 (Conv2D)      (None, 128, 128, 32)      9248      
_________________________________________________________________
layer_2_bn_2 (BatchNormaliza (None, 128, 128, 32)      128  

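Judging from the result, our function should be set up correctly. Next comes the residual part: the output of the previous layer, which is the input to this block, is added to the output of this block.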

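Here we first define a function for the residual block. Every architecture block is clearly a repetition: two conv layers, followed by the residual addition. We can also notice that the tensor going into a residual block has the same number of channels as the block's output, and the spatial dimensions match as well. This means we can simply add the input and the output.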
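To make the shape argument concrete, here is a minimal shape-only sketch (no TensorFlow involved). It assumes both convolutions use stride 1 with same padding, so only the channel count changes; `conv_same_shape` and `residual_block_shape` are hypothetical helper names.

```python
def conv_same_shape(shape, filters):
    """Shape after a stride-1, same-padded conv: H and W stay, channels = filters."""
    height, width, _ = shape
    return (height, width, filters)


def residual_block_shape(shape, filters_1, filters_2):
    """Trace a 1x1 conv and then a 3x3 conv, and check that the skip connection fits."""
    x = conv_same_shape(shape, filters_1)  # 1x1 conv squeezes the channels
    x = conv_same_shape(x, filters_2)      # 3x3 conv restores them
    if x != shape:
        raise ValueError(f"cannot add {x} to {shape}")
    return x


print(residual_block_shape((128, 128, 64), 32, 64))  # (128, 128, 64)
```

If the second filter count did not match the incoming channels, for example `residual_block_shape((128, 128, 64), 32, 128)`, the check would raise, which mirrors the element-wise addition failing on mismatched shapes.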

In [10]:
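def residual_block(pre_output, filter_list: tuple, name_list: list):
    """As the architecture diagram in the paper shows, a residual block is built from two conv layers.
    
    pre_output: output of the previous layer
    filter_list: number of filters for the two convolutions
    name_list: layer numbers used to name the two convolutions
    """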
    
    assert len(filter_list) == 2    
    filter_nums_1, filter_nums_2 = filter_list
    layer_name_1 = 'layer_' + str(name_list[0])
    layer_name_2 = 'layer_' + str(name_list[1])
    
    x = common_conv(pre_output, filters=filter_nums_1, filter_size=(1, 1), name=layer_name_1, strides=1)
    x = common_conv(x, filters=filter_nums_2, filter_size=(3, 3), name=layer_name_2, strides=1)
    
    x = tf.keras.layers.Add()([x, pre_output])
    return x
    
    

Now we can actually build the Darknet-53 model

In [11]:
def darknet_53(input_shape: tuple):
    """See the figure above for the outline of the model; check the author's git for the details."""
    x_input = tf.keras.layers.Input(input_shape)
    
    x = common_conv(x_input, filters=32, filter_size=(3, 3), name='layer_1', strides=1)
    # Downsample
    x = common_conv(x, filters=64, filter_size=(3, 3), name='layer_2', strides=2)    
    # 1x
    x = residual_block(x, filter_list=(32, 64), name_list=[3, 4])
    # Downsample
    x = common_conv(x, filters=128, filter_size=(3, 3), name='layer_5', strides=2)
    # 2x
    x = residual_block(x, filter_list=(64, 128), name_list=[6, 7])
    x = residual_block(x, filter_list=(64, 128), name_list=[8, 9])
    # Downsample
    x = common_conv(x, filters=256, filter_size=(3, 3), name='layer_10', strides=2)
    # 8x
    x = residual_block(x, filter_list=(128, 256), name_list=[11, 12])
    x = residual_block(x, filter_list=(128, 256), name_list=[13, 14])
    x = residual_block(x, filter_list=(128, 256), name_list=[15, 16])
    x = residual_block(x, filter_list=(128, 256), name_list=[17, 18])
    x = residual_block(x, filter_list=(128, 256), name_list=[19, 20])
    x = residual_block(x, filter_list=(128, 256), name_list=[21, 22])
    x = residual_block(x, filter_list=(128, 256), name_list=[23, 24])
    x = residual_block(x, filter_list=(128, 256), name_list=[25, 26])
    # Downsample
    x = common_conv(x, filters=512, filter_size=(3, 3), name='layer_27', strides=2)
    # 8x
    x = residual_block(x, filter_list=(256, 512), name_list=[28, 29])
    x = residual_block(x, filter_list=(256, 512), name_list=[30, 31])
    x = residual_block(x, filter_list=(256, 512), name_list=[32, 33])
    x = residual_block(x, filter_list=(256, 512), name_list=[34, 35])
    x = residual_block(x, filter_list=(256, 512), name_list=[36, 37])
    x = residual_block(x, filter_list=(256, 512), name_list=[38, 39])
    x = residual_block(x, filter_list=(256, 512), name_list=[40, 41])
    x = residual_block(x, filter_list=(256, 512), name_list=[42, 43])
    # Downsample
    x = common_conv(x, filters=1024, filter_size=(3, 3), name='layer_44', strides=2)
    # 4x
    x = residual_block(x, filter_list=(512, 1024), name_list=[45, 46])
    x = residual_block(x, filter_list=(512, 1024), name_list=[47, 48])
    x = residual_block(x, filter_list=(512, 1024), name_list=[49, 50])
    x = residual_block(x, filter_list=(512, 1024), name_list=[51, 52])
    # global average pool: average each of the 1024 feature maps down to a single value
    x = tf.keras.layers.GlobalAveragePooling2D(name='avg_pooling')(x)
    # fully connected and softmax (the pooled output is already flat: (None, 1024))
    x = tf.keras.layers.Dense(units=1000, 
                              activation='softmax', 
                              name='softmax', 
                              kernel_initializer=tf.keras.initializers.glorot_uniform(seed=10)
                             )(x)
    model = tf.keras.models.Model(inputs=x_input, outputs=x, name='Darknet-53')
    return model
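The choice of pooling before the fully connected layer matters a lot for its size. A quick arithmetic sketch, assuming a 256x256 input (five stride-2 downsamples leave an 8x8x1024 feature map) and a 1000-unit dense layer with bias; `dense_params` is a made-up helper.

```python
def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return inputs * units + units


side = 256 // 2 ** 5   # five stride-2 downsamples: 256 -> 8
channels = 1024

after_2x2_pool = (side // 2) * (side // 2) * channels  # 4 * 4 * 1024 values once flattened
after_global_pool = channels                           # one value per feature map

print(side)                                   # 8
print(dense_params(after_2x2_pool, 1000))     # 16385000
print(dense_params(after_global_pool, 1000))  # 1025000
```

A 2x2 average pool followed by flattening feeds 16,384 values into the dense layer, whereas a global average pool feeds 1,024, and it also keeps the dense layer independent of the input resolution.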

In [12]:
input_shape = (256, 256, 3)
darknet = darknet_53(input_shape)
darknet.summary()

Model: "Darknet-53"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            [(None, 256, 256, 3) 0                                            
__________________________________________________________________________________________________
layer_1_conv_1 (Conv2D)         (None, 256, 256, 32) 896         input_3[0][0]                    
__________________________________________________________________________________________________
layer_1_bn_2 (BatchNormalizatio (None, 256, 256, 32) 128         layer_1_conv_1[0][0]             
__________________________________________________________________________________________________
layer_1_leaky_3 (LeakyReLU)     (None, 256, 256, 32) 0           layer_1_bn_2[0][0]               
_________________________________________________________________________________________

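In the model definition I deliberately numbered the layers in a blunt, almost tediously verbose way instead of using loops, so that the structure is easy to follow. Right before the final average pooling the count is exactly 52, and adding the last fully connected layer makes 53, which is where the name Darknet-53 comes from.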
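The count can be reproduced with a few lines of plain Python. This sketch only generates the layer numbering, not the Keras model; the `STAGES` table (filters, number of residual blocks) is read off the hand-written model above, and `darknet53_plan` is a made-up name.

```python
# (filters after the downsample, number of residual blocks in the stage)
STAGES = [(64, 1), (128, 2), (256, 8), (512, 8), (1024, 4)]


def darknet53_plan(input_size=256):
    """List (layer number, kind, filters, spatial size) for every conv layer."""
    plan = [(1, 'conv', 32, input_size)]
    layer, size = 1, input_size
    for filters, repeats in STAGES:
        layer += 1
        size //= 2  # each stage starts with a stride-2 conv
        plan.append((layer, 'downsample', filters, size))
        for _ in range(repeats):
            plan.append((layer + 1, 'res_1x1', filters // 2, size))
            plan.append((layer + 2, 'res_3x3', filters, size))
            layer += 2
    return plan


plan = darknet53_plan()
downsample_layers = [number for number, kind, _, _ in plan if kind == 'downsample']

print(len(plan))          # 52 conv layers
print(downsample_layers)  # [2, 5, 10, 27, 44]
print(plan[-1])           # (52, 'res_3x3', 1024, 8)
print(len(plan) + 1)      # 53 once the fully connected layer is added
```

The downsample layer numbers match the `layer_2`, `layer_5`, `layer_10`, `layer_27` and `layer_44` names used in the model.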

This is Darknet-53, not YOLOv3. Take it slowly, learn a little every day, and don't try to learn too much in one day.