This notebook assumes you have already read the [Darknet-53](https://github.com/shaoeChen/deeplearning/blob/master/tf2/Arch_YOLOv3_1_Darknet-53_structure.ipynb) walkthrough and fully understand how that architecture is put together.

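The references come first. Understanding an architecture is not easy, so there are naturally quite a few of them, and citing sources properly is a necessary part of learning: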

* [YOLO: Real-Time Object Detection](https://pjreddie.com/darknet/yolo/)
* [pjreddie/darknet](https://github.com/pjreddie/darknet)
* [YOLOv3 paper translation](https://hackmd.io/@shaoeChen/SyjI6W2zB/https%3A%2F%2Fhackmd.io%2F%40shaoeChen%2FryHg904h9)
* [YOLOv3 in-depth analysis](https://blog.csdn.net/leviopku/article/details/82660381)
* [qqwweee/keras-yolo3](https://github.com/qqwweee/keras-yolo3)
* [YunYang1994/tensorflow-yolov3](https://github.com/YunYang1994/tensorflow-yolov3)
* [joymyhome: understanding pad in the YOLOv3 config file](https://blog.csdn.net/joymyhome/article/details/106349084)

For the related data preprocessing, see the separate notebook [Arch_YOLOv2_dataset_preprocess.ipynb](https://github.com/shaoeChen/deeplearning/blob/master/tf2/Arch_YOLOv2_dataset_preprocess.ipynb).

My Docker environment runs TensorFlow 2.1. The popular line these days is "life is short, I use PyTorch", but for now I am sticking with tf + keras.

In [1]:
import tensorflow as tf
tf.__version__

'2.1.0'

Select which GPU to use.

In [2]:
gpus = tf.config.experimental.list_physical_devices(device_type='GPU')
tf.config.experimental.set_visible_devices(devices=gpus[0], device_type='GPU')

In [3]:
gpus

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'),
 PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]

First, get everything needed to build Darknet out of the way. None of it is explained again here.

In [4]:
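def common_conv(input_x, filters: int, filter_size: tuple, name: str, strides: int = 1,
                is_activate: bool = True, is_bn: bool = True):
    """Standard convolution block.
    
    input_x: input tensor
    filters: number of filters
    filter_size: filter size, e.g. (3, 3), (1, 1)
    name: prefix used for the names of the layers created here
    strides: stride; set to 2 for downsampling, in which case the input is
        zero-padded by one pixel on the top and left before a 'valid' convolution
    is_activate: whether to apply the activation; every activation in this
        architecture is leaky ReLU, so a bool is enough
    is_bn: whether to apply batch normalization
    """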
    _padding = 'same'
    
    if strides == 2:
        input_x = tf.keras.layers.ZeroPadding2D(((1, 0), (1, 0)), name=name + '_zero_padding')(input_x)
        _padding = 'valid'
        
    
    x = tf.keras.layers.Conv2D(filters=filters, 
                               kernel_size=filter_size, 
                               strides=strides,
                               name=name + '_conv_1',
                               padding=_padding
                              )(input_x)
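    # Darknet applies batch normalization before the activation, which is also
    # the order the layer-name suffixes (_conv_1, _bn_2, _leaky_3) imply.
    if is_bn:
        x = tf.keras.layers.BatchNormalization(name=name + '_bn_2')(x)
        
    # Darknet's leaky activation uses a slope of 0.1; the Keras default is 0.3.
    if is_activate:
        x = tf.keras.layers.LeakyReLU(alpha=0.1, name=name + '_leaky_3')(x)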
        
    return x
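
The stride-2 branch above pads one pixel on the top and left and then uses a 'valid' convolution. A quick way to check that this halves the feature map at every downsample is to work out the output size by hand. The helper below is a hypothetical illustration (`conv_output_size` is not part of the notebook) using the standard output-size formula for a 'valid' convolution:

```python
def conv_output_size(n: int, kernel: int, stride: int,
                     pad_before: int = 0, pad_after: int = 0) -> int:
    """Output length along one axis of a 'valid' convolution after explicit zero padding."""
    padded = n + pad_before + pad_after
    return (padded - kernel) // stride + 1


# Five downsampling convolutions: pad 1 pixel before, 3x3 kernel, stride 2.
sizes = [416]
for _ in range(5):
    sizes.append(conv_output_size(sizes[-1], kernel=3, stride=2, pad_before=1))

print(sizes)
```

Starting from a 416x416 input, the last three sizes are the 52, 26 and 13 grids that YOLOv3 predicts on.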

In [10]:
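def residual_block(pre_output, filter_list: tuple, name_list: list):
    """As the architecture table in the paper shows, a residual block is built from two conv layers.
    
    pre_output: output of the previous layer
    filter_list: number of filters for each of the two convolutions
    name_list: layer names (numbers) for the two convolutions
    """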
    assert len(filter_list) == 2    
    
    filter_nums_1, filter_nums_2 = filter_list
    layer_name_1 = 'layer_' + str(name_list[0])
    layer_name_2 = 'layer_' + str(name_list[1])
    
    x = common_conv(pre_output, filters=filter_nums_1, filter_size=(1, 1), name=layer_name_1, strides=1)
    x = common_conv(x, filters=filter_nums_2, filter_size=(3, 3), name=layer_name_2, strides=1)
    
    x = tf.keras.layers.Add()([x, pre_output])
    return x        

In [7]:
def easy_reshape(l, n):
    """Split list l into consecutive chunks of length n."""
    return [l[i: i+n] for i in range(0, len(l), n)]
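
As a small usage sketch, this is how `easy_reshape` turns a run of layer numbers into the pairs that `residual_block` expects as `name_list`. The function is repeated here only so the snippet runs on its own:

```python
def easy_reshape(l, n):
    """Split list l into consecutive chunks of length n."""
    return [l[i: i + n] for i in range(0, len(l), n)]


# Layer numbers 6..9 become two (1x1 conv, 3x3 conv) name pairs.
pairs = easy_reshape(list(range(6, 10)), 2)
print(pairs)  # [[6, 7], [8, 9]]
```

If the list length is not a multiple of `n`, the last chunk is simply shorter.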

[YOLOv3](https://github.com/pjreddie/darknet/blob/master/cfg/yolov3.cfg) is not quite the same thing as Darknet-53: Darknet-53 is one part of its structure. The diagram below shows the YOLOv3 architecture.

<img src="https://hackmd.io/_uploads/rJJwnl5pq.png" width="25%">

I drew the diagram above myself by working through the [cfg file](https://github.com/pjreddie/darknet/blob/master/cfg/yolov3.cfg) and doing the arithmetic. The leftmost part is Darknet-53. Notice that three outputs are taken from Darknet, which is what [Section 2.3 of the paper](https://hackmd.io/@shaoeChen/SyjI6W2zB/https%3A%2F%2Fhackmd.io%2F%40shaoeChen%2FryHg904h9#23-Predictions-Across-Scales) describes. Now we adjust Darknet accordingly.

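Note: I was not sure whether the layer index should start from 0. The `route` entries in the cfg file use 0-based indices (`layers = -1, 61` and `layers = -1, 36`), so a layer numbered 37 when counting from 1 is layer 36 here.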

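The YOLOv3 architecture is very regular and could be wrapped up more compactly, but for the sake of explanation I leave it as is and write everything out the long-winded way.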

In [8]:
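def darknet_53(input_shape: tuple):
    """See the diagram above for an overview of the model; for the details, check the author's git repository.
    
    YOLOv3 takes three outputs from Darknet: layer 36, layer 61, and the final output.
    Note that the final output has no flatten, pooling, or softmax applied to it.
    """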
    x_input = tf.keras.layers.Input(input_shape)
    
    x = common_conv(x_input, filters=32, filter_size=(3, 3), name='layer_1', strides=1)
    # Downsample
    x = common_conv(x, filters=64, filter_size=(3, 3), name='layer_2', strides=2)    
    # 1x    
    x = residual_block(x, filter_list=(32, 64), name_list=[3, 4])
    
    # Downsample
    x = common_conv(x, filters=128, filter_size=(3, 3), name='layer_5', strides=2)
    # 2x
    _name_list = easy_reshape(list(range(6, 10)), 2)
    for i in range(2):
        x = residual_block(x, filter_list=(64, 128), name_list=_name_list[i])   
        
    # Downsample
    x = common_conv(x, filters=256, filter_size=(3, 3), name='layer_10', strides=2)
    # 8x
    _name_list = easy_reshape(list(range(11, 27)), 2)
    for i in range(8):
        x = residual_block(x, filter_list=(128, 256), name_list=_name_list[i])
        
    # 36-layer
    output_36 = x   
    
    # Downsample
    x = common_conv(x, filters=512, filter_size=(3, 3), name='layer_27', strides=2)
    # 8x
    _name_list = easy_reshape(list(range(28, 44)), 2)
    for i in range(8):
        x = residual_block(x, filter_list=(256, 512), name_list=_name_list[i])
        
    # 61-layer
    output_61 = x
    
    # Downsample
    x = common_conv(x, filters=1024, filter_size=(3, 3), name='layer_44', strides=2)
    # 4x
    _name_list = easy_reshape(list(range(45, 53)), 2)
    for i in range(4):
        x = residual_block(x, filter_list=(512, 1024), name_list=_name_list[i])
        
    
    return output_36, output_61, x

In [11]:
input_shape = (416, 416, 3)
output_36, output_61, output_last = darknet_53(input_shape)

In [12]:
output_36.shape, output_61.shape, output_last.shape

(TensorShape([None, 52, 52, 256]),
 TensorShape([None, 26, 26, 512]),
 TensorShape([None, 13, 13, 1024]))
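
To connect the 36 and 61 above with the cfg file, the layer indices can be counted directly. This is a minimal sketch under the assumption that the cfg lists every residual block as three entries (two `convolutional` and one `shortcut`) and numbers layers from 0; `stage_end_indices` is a hypothetical helper, not something from Darknet or this notebook:

```python
def stage_end_indices(repeats=(1, 2, 8, 8, 4)):
    """Return the 0-based cfg index of the last shortcut layer in each stage."""
    index = 0  # layer 0 is the first 3x3 convolution
    ends = []
    for n in repeats:
        index += 1      # the stride-2 downsampling convolution
        index += 3 * n  # each residual block: conv, conv, shortcut
        ends.append(index)
    return ends


ends = stage_end_indices()
# One stem conv, plus one downsampling conv and two convs per residual block.
conv_count = 1 + sum(1 + 2 * n for n in (1, 2, 8, 8, 4))
print(ends, conv_count)
```

Under that counting, the third and fourth stages end at indices 36 and 61, and the backbone contains 52 convolutional layers.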

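Now we set up YOLOv3. Each output has 255 filters because (80 classes + 4 box coordinates + 1 confidence score) * 3 predicted boxes = 255. As the paper explains, predictions are made at different scales, so the outputs have different sizes. That shows up in the NxN part of each output: 13x13, 26x26, and 52x52. It means the input image is divided into an NxN grid, and every grid cell predicts 3 boxes.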
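
The arithmetic can be checked in a few lines. This is only an illustrative sketch; the variable names are made up for this example, and the strides of 32, 16 and 8 are inferred from the 416 input and the 13, 26 and 52 grids shown earlier:

```python
num_classes = 80     # 80 classes
box_coords = 4       # box coordinates
objectness = 1       # confidence score
boxes_per_cell = 3   # predicted boxes per grid cell

filters = (num_classes + box_coords + objectness) * boxes_per_cell

input_size = 416
strides = (32, 16, 8)
grids = [input_size // s for s in strides]
total_boxes = sum(g * g * boxes_per_cell for g in grids)

print(filters, grids, total_boxes)
```

So for a single 416x416 image, the three scales together produce 10647 candidate boxes.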

In [35]:
def yolov3(input_shape: tuple):
    """YOLOv3
    
    input_shape: shape of the input
    
    As the paper describes, the YOLO architecture takes three outputs from Darknet
    and then applies further processing to those feature maps.
    
    """
    darknet_52x52, darknet_26x26, darknet_13x13 = darknet_53(input_shape)
    
    # First, the branch that produces the 13x13x255 feature map
    
    feature_13x13 = common_conv(darknet_13x13, filters=512, filter_size=(1, 1), name='yolo_13x13_1', strides=1)    
    feature_13x13 = common_conv(feature_13x13, filters=1024, filter_size=(3, 3), name='yolo_13x13_2', strides=1)    
    feature_13x13 = common_conv(feature_13x13, filters=512, filter_size=(1, 1), name='yolo_13x13_3', strides=1)    
    feature_13x13 = common_conv(feature_13x13, filters=1024, filter_size=(3, 3), name='yolo_13x13_4', strides=1)    
    feature_13x13 = common_conv(feature_13x13, filters=512, filter_size=(1, 1), name='yolo_13x13_5', strides=1)    
        
    # This is the output after the fifth conv of the 13x13 branch
    yolo_13x13 = feature_13x13 
    
    feature_13x13 = common_conv(feature_13x13, filters=1024, filter_size=(3, 3), name='yolo_13x13_6', strides=1)            
    # This output corresponds to YOLO1, second from the left in the diagram above
    feature_13x13_output = common_conv(feature_13x13, filters=255, filter_size=(1, 1), name='yolo_13x13_7', 
                                       strides=1, is_activate=False, is_bn=False)
    
    # This part handles the 26x26 feature map
    # The feature map coming from the 13x13 YOLO branch is upsampled first, then concatenated with the 26x26 feature map taken from Darknet
    feature_26x26 = common_conv(yolo_13x13, filters=256, filter_size=(1, 1), name='yolo_26x26_input', strides=1)
    feature_26x26 = tf.keras.layers.UpSampling2D(name='yolo_26x26_upsampling')(feature_26x26)
    feature_26x26 = tf.keras.layers.Concatenate()([feature_26x26, darknet_26x26])
    
    feature_26x26 = common_conv(feature_26x26, filters=256, filter_size=(1, 1), name='yolo_26x26_1', strides=1)    
    feature_26x26 = common_conv(feature_26x26, filters=512, filter_size=(3, 3), name='yolo_26x26_2', strides=1)    
    feature_26x26 = common_conv(feature_26x26, filters=256, filter_size=(1, 1), name='yolo_26x26_3', strides=1)    
    feature_26x26 = common_conv(feature_26x26, filters=512, filter_size=(3, 3), name='yolo_26x26_4', strides=1)    
    feature_26x26 = common_conv(feature_26x26, filters=256, filter_size=(1, 1), name='yolo_26x26_5', strides=1)  
    
    # This is the output after the fifth conv of the 26x26 branch
    yolo_26x26 = feature_26x26
    
    feature_26x26 = common_conv(feature_26x26, filters=512, filter_size=(3, 3), name='yolo_26x26_6', strides=1)    
    feature_26x26_output = common_conv(feature_26x26, filters=255, filter_size=(1, 1), name='yolo_26x26_7', 
                                       strides=1, is_activate=False, is_bn=False)
    
    
    # This part handles the 52x52 feature map
    # The feature map coming from the 26x26 YOLO branch is upsampled first, then concatenated with the 52x52 feature map taken from Darknet
    feature_52x52 = common_conv(yolo_26x26, filters=128, filter_size=(1, 1), name='yolo_52x52_input', strides=1)
    feature_52x52 = tf.keras.layers.UpSampling2D(name='yolo_52x52_upsampling')(feature_52x52)
    feature_52x52 = tf.keras.layers.Concatenate()([feature_52x52, darknet_52x52])
    
    feature_52x52 = common_conv(feature_52x52, filters=128, filter_size=(1, 1), name='yolo_52x52_1', strides=1)    
    feature_52x52 = common_conv(feature_52x52, filters=256, filter_size=(3, 3), name='yolo_52x52_2', strides=1)    
    feature_52x52 = common_conv(feature_52x52, filters=128, filter_size=(1, 1), name='yolo_52x52_3', strides=1)    
    feature_52x52 = common_conv(feature_52x52, filters=256, filter_size=(3, 3), name='yolo_52x52_4', strides=1)    
    feature_52x52 = common_conv(feature_52x52, filters=128, filter_size=(1, 1), name='yolo_52x52_5', strides=1)      
    feature_52x52 = common_conv(feature_52x52, filters=256, filter_size=(3, 3), name='yolo_52x52_6', strides=1)    
    feature_52x52_output = common_conv(feature_52x52, filters=255, filter_size=(1, 1), name='yolo_52x52_7', 
                                       strides=1, is_activate=False, is_bn=False)
    
    
    return feature_13x13_output, feature_26x26_output, feature_52x52_output

In [36]:
yolo_13x13, yolo_26x26, yolo_52x52 = yolov3(input_shape)

In [37]:
yolo_13x13.shape, yolo_26x26.shape, yolo_52x52.shape

(TensorShape([None, 13, 13, 255]),
 TensorShape([None, 26, 26, 255]),
 TensorShape([None, 52, 52, 255]))
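
The upsample-and-concatenate steps are the easiest place to lose track of channel counts, so here is a rough shape trace in plain Python. The `upsample` and `concat` helpers are hypothetical stand-ins that only mimic how `UpSampling2D` (with its default factor of 2) and `Concatenate` change a (height, width, channels) shape; they do no real computation:

```python
def upsample(shape, factor=2):
    """Shape after nearest-neighbour upsampling by the given factor."""
    h, w, c = shape
    return (h * factor, w * factor, c)


def concat(a, b):
    """Shape after concatenating two feature maps along the channel axis."""
    assert a[:2] == b[:2], "spatial sizes must match before concatenation"
    return (a[0], a[1], a[2] + b[2])


# 13x13 branch: 1x1 conv to 256 channels, upsample, merge with Darknet's 26x26x512
merged_26 = concat(upsample((13, 13, 256)), (26, 26, 512))
# 26x26 branch: 1x1 conv to 128 channels, upsample, merge with Darknet's 52x52x256
merged_52 = concat(upsample((26, 26, 128)), (52, 52, 256))

print(merged_26, merged_52)
```

The 1x1 convolutions that follow each merge then bring the channel count back down to 256 and 128 respectively.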

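At this point the YOLOv3 backbone is successfully in place. Learn a little every day, and don't try to take in too much at once. Next up is the question of training YOLOv3.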