This is a simple example of implementing YOLO with MxNetR. The idea was developed by Joseph Chet Redmon, and related details can be found on his website. This is a simple example for MxNetR users, so I will use a relatively small dataset to demonstrate how it works.
This repository is still being continuously updated, and the future plan includes the following items:
- To adjust the appropriate hyperparameters for a more accurate YOLO model.
- To add a COCO2017 training example and model.
- A clearer explanation of the YOLO model.
You can use the code "1. Prediction.R" to predict an image. Here we have prepared a well-trained model for your experiments. The 'yolo_v3 (1)-0000.params' and 'yolo_v3 (1)-symbol.json' files can be found in the folder 'model/yolo model (voc2007)'. Here we use 'test_img.jpeg' to test the model. The left image is the raw image, and the right one is the prediction result from the YOLO v3 model.
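Below is a minimal sketch of what such a prediction script might look like in MxNetR, assuming the image has already been resized to 256×256; the exact pre-processing and the decoding of the raw output follow "1. Prediction.R" and the encode/decode functions described later.

```r
library(mxnet)
library(jpeg)

# Load the trained model by prefix and epoch index
# (the params file 'yolo_v3 (1)-0000.params' corresponds to epoch 0)
yolo_model <- mx.model.load('model/yolo model (voc2007)/yolo_v3 (1)', 0)

# Read the test image; we assume it is already 256 x 256
img <- readJPEG('test_img.jpeg')

# readJPEG returns height x width x channel; MxNet expects width x height x channel x batch
input <- array(aperm(img, c(2, 1, 3)), dim = c(256, 256, 3, 1))

# Raw network output; it still has to be decoded into bounding boxes
pred <- predict(yolo_model, X = input)
```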
Let's try to predict other images!
The Pascal VOC challenge provides a very popular dataset for building and evaluating algorithms for image classification, object detection, and segmentation. I will use a mirror website for downloading the VOC2007 dataset. You can use the code "1. download VOC2007.R" to quickly download this dataset (439 MB for training and 431 MB for testing). Note: if you just want to use the pre-trained model, you don't need to download this dataset.
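A minimal sketch of the download step is shown below; the URLs point to the official host and are only illustrative, so replace them with the mirror used in "1. download VOC2007.R".

```r
# Download and unpack the VOC2007 training/validation and test archives
train_url <- 'http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar'
test_url  <- 'http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar'

dir.create('voc2007', showWarnings = FALSE)

download.file(train_url, destfile = 'voc2007/VOCtrainval_06-Nov-2007.tar', mode = 'wb')
download.file(test_url,  destfile = 'voc2007/VOCtest_06-Nov-2007.tar',     mode = 'wb')

# Unpack both archives into the voc2007 folder
untar('voc2007/VOCtrainval_06-Nov-2007.tar', exdir = 'voc2007')
untar('voc2007/VOCtest_06-Nov-2007.tar',     exdir = 'voc2007')
```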
To simplify the problem, we will resize all images to 256×256. You can use the codes "2-1. pre-processing image (train & val).R" and "2-2. pre-processing image (test).R" to do this work. The resized images will be converted to .RData files, each of which takes about 60 MB of storage. You can find them in the folder 'voc2007/data'.
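The sketch below illustrates the resizing idea, assuming the 'imager' package; the folder path and output file name are illustrative, and the repository scripts may use a different image library.

```r
library(imager)

# Collect all jpeg paths (the path is an assumption about the VOC folder layout)
img_files <- list.files('voc2007/VOCdevkit/VOC2007/JPEGImages', full.names = TRUE)

# Resize one image to 256 x 256 and return it as a plain 256 x 256 x 3 array
resize_one <- function (path) {
  im <- resize(load.image(path), size_x = 256, size_y = 256)
  as.array(im)[, , 1, ]          # drop imager's depth dimension
}

img_list <- lapply(img_files, resize_one)

# Store the whole batch in a single .RData file (file name is illustrative)
save(img_list, file = 'voc2007/data/train_img.RData')
```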
After we obtain the bounding box information in the previous stage, we can calculate the anchor boxes by k-means clustering. In YOLO v3, there are 9 anchor boxes, belonging to the feature maps with stride 8 (the 3 smallest), stride 16 (the 3 medium-sized), and stride 32 (the 3 biggest), respectively. You can use the code "3. Define the anchor boxes (for yolo v3).R" to conduct this process, as sketched below. Finally, we will get the file 'anchor_boxs (yolo v3).RData' for further use.
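The sketch below shows the clustering idea with base R's kmeans(); note that the original YOLO papers cluster with an IoU-based distance rather than the Euclidean distance used here, and the 'box_info' columns are illustrative.

```r
set.seed(0)

# Toy bounding-box widths/heights (normalized); in practice these come from
# the VOC annotations collected in the previous step
box_info <- data.frame(width  = runif(500, 0.05, 0.9),
                       height = runif(500, 0.05, 0.9))

# 9 clusters: the cluster centers become the 9 anchor boxes of YOLO v3
km <- kmeans(cbind(box_info$width, box_info$height), centers = 9, nstart = 20)

# Sort anchors by area: the 3 smallest go to the stride-8 feature map,
# the 3 medium ones to stride-16, and the 3 largest to stride-32
anchor_boxes <- km$centers[order(km$centers[, 1] * km$centers[, 2]), ]
anchor_boxes
```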
The first step in using MxNet to train a YOLO model is to build an iterator. You can use the code "1. Encode, Decode & Iterator.R" to conduct this process. It is worth noting that the bounding boxes need to be encoded into a special form for the subsequent training. Moreover, the encoded labels also need to pass through a decoding process to restore the bounding boxes. The encode and decode functions are the core of the YOLO model. If you want to clearly understand the principle of the YOLO model, you can dismantle these functions to learn. The test code for generating images is also included there, so let's try it!
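As a hint at what the encode function does, the following sketch assigns a normalized box center to its responsible grid cell and computes the offsets inside that cell; the real function in "1. Encode, Decode & Iterator.R" additionally encodes width/height relative to the matched anchor and handles all three feature maps.

```r
# Encode a normalized box (x, y = center; w, h = size) for a grid x grid feature map
encode_box <- function (x, y, w, h, grid = 8) {
  col <- ceiling(x * grid)          # which cell is responsible for this box
  row <- ceiling(y * grid)
  tx  <- x * grid - (col - 1)       # center offsets inside that cell, in [0, 1]
  ty  <- y * grid - (row - 1)
  c(row = row, col = col, tx = tx, ty = ty, tw = w, th = h)
}

encode_box(x = 0.55, y = 0.40, w = 0.30, h = 0.50)
# row = 4, col = 5, tx ~ 0.4, ty ~ 0.2
```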
The next step is to define the model architecture. We use a pretrained model (trained on ImageNet for image recognition) and fine-tune it. Here we provide an MxNet implementation of a MobileNet v2-based YOLO network. For details on Google's MobileNets, please read the following papers:
- [v1] MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
- [v2] Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation
The mobilenet-v2 model was contributed by yuantangliang, and it can be downloaded from this repository. The top-1/5 accuracy rates were obtained using a single center crop (crop size: 224x224, image size: 256xN):
Network | Top-1 | Top-5 | sha256sum | Architecture |
---|---|---|---|---|
MobileNet v2 | 71.90 | 90.49 | a3124ce7 (13.5 MB) | netscope |
The model should be saved in the folder 'model/pretrained model' for subsequent use. If you want to train a more accurate model, you can select another pretrained model from the MxNet model zoo. I selected a lightweight model because of GitHub's limitation that files larger than 100 MB cannot be uploaded. The code "2. Model architecture.R" includes the YOLO prediction architecture and loss function; you can try to learn YOLO v3 from this code.
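The sketch below shows the usual MxNetR fine-tuning pattern of loading the pretrained symbol and truncating it at an internal layer; the prefix and the layer name are assumptions and must match the files in 'model/pretrained model' and the names used in "2. Model architecture.R".

```r
library(mxnet)

# Load the pretrained MobileNet v2 (prefix and epoch index are assumptions)
pretrained <- mx.model.load('model/pretrained model/mobilenetv2', 0)

# List all internal outputs of the symbol and keep the backbone up to a chosen layer
internals <- pretrained$symbol$get.internals()
backbone  <- internals$get.output(which(internals$outputs == 'pool6_output'))  # layer name is an assumption

# The YOLO detection head and loss are then attached on top of 'backbone', and
# 'pretrained$arg.params' is used to initialize the shared weights before training
```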
Now we can start to train this model! Because YOLO v2 suggests multi-scale training, the training code is relatively complex. The support functions can be found in "3. Support functions.R", and finally you can use "4. Train a yolo model.R" to train this model. It is worth noting that the total training time for this example is about 35 hours on a single P100 GPU server.
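The sketch below only illustrates the multi-scale idea (drawing a new input size that is a multiple of 32 from time to time and rebuilding the iterator/executor for it); the real loop lives in "4. Train a yolo model.R", and the commented calls are placeholders.

```r
# Candidate input resolutions for multi-scale training (all multiples of 32)
scales      <- seq(224, 416, by = 32)
total_epoch <- 30                      # illustrative value

for (epoch in 1:total_epoch) {
  # Draw a new resolution (the real script may do this every N batches instead)
  img_size <- sample(scales, 1)

  # Rebuild the data iterator and re-bind the executor for the new input shape,
  # then run one epoch of forward/backward/update as in "4. Train a yolo model.R":
  # train_iter <- my_iterator(img_size = img_size, batch_size = 16, ...)
  # exec <- mx.simple.bind(symbol = yolo_loss, data = c(img_size, img_size, 3, 16), ctx = mx.gpu())
  # ...
}
```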
Finally, we get a model, and the MAP50 on the testing set is 24.12%. This MAP is thought to be limited mainly by serious overfitting, so adding more training samples should help. The following image shows selected prediction results from our model:
You can use the code "5. Test the model performance.R" to conduct this process. Because this is a simple example for YOLO v3, our database only includes 4,008 training images and 1,003 validation images, so I consider this result to be quite good.
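MAP50 counts a predicted box as a true positive only when its intersection-over-union (IoU) with a ground-truth box of the same class is at least 0.5; the helper below shows that IoU computation for boxes given as (xmin, ymin, xmax, ymax) in normalized coordinates.

```r
IoU <- function (box_a, box_b) {
  # Width and height of the intersection (0 if the boxes do not overlap)
  ix <- max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
  iy <- max(0, min(box_a[4], box_b[4]) - max(box_a[2], box_b[2]))
  inter  <- ix * iy
  area_a <- (box_a[3] - box_a[1]) * (box_a[4] - box_a[2])
  area_b <- (box_b[3] - box_b[1]) * (box_b[4] - box_b[2])
  inter / (area_a + area_b - inter)
}

IoU(c(0.1, 0.1, 0.5, 0.5), c(0.2, 0.2, 0.6, 0.6))  # ~0.39, below the 0.5 threshold
```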
Note: this toy example is very small, so even training on a CPU can achieve an acceptable speed (~5 samples/sec).
The pikachu dataset is a simple object detection task built by the MxNet support team. It is a synthetic toy dataset created by rendering images from an open-sourced 3D Pikachu model.
For more details, please see:
- https://gluon.mxnet.io/chapter08_computer-vision/object-detection.html.
- http://zh.gluon.ai/chapter_computer-vision/pikachu.html.
I will use this website for downloading the dataset. You can use the code "1. Build jpg data from source.R" to quickly download this dataset (84 MB for training and 10 MB for testing) and further process it into jpeg files. Note: this repository already includes all data of the pikachu dataset, so you can skip this step.
For the follow-up training tasks, we need to process these data into .RData files. You can use the codes "2-1. Processing image (train).R" and "2-2. Processing image (val).R" to do this work. You can find the results in the folder 'pikachu/data'. Note: you can also skip this step.
The first step in using MxNet to train a YOLO model is to build an iterator. You can use the code "1. Encode, Decode & Iterator.R" to conduct this process. It is worth noting that the bounding boxes need to be encoded into a special form for the subsequent training. Moreover, the encoded labels also need to pass through a decoding process to restore the bounding boxes. The encode and decode functions are the core of the YOLO model. Note: YOLO v1 does not use anchor boxes, which means the anchor box is simply equal to the grid cell size.
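As a complement to the encode sketch in the VOC section, the following sketch shows the decode direction: inverting the grid encoding to restore a normalized bounding box (in YOLO v2/v3 the raw network outputs additionally pass through sigmoid/exponential transforms before this step).

```r
# Invert the grid encoding: recover a normalized (xmin, ymin, xmax, ymax) box
decode_box <- function (row, col, tx, ty, tw, th, grid = 8) {
  x <- (col - 1 + tx) / grid     # box center in normalized image coordinates
  y <- (row - 1 + ty) / grid
  c(xmin = x - tw / 2, ymin = y - th / 2,
    xmax = x + tw / 2, ymax = y + th / 2)
}

# Restores the box encoded in the earlier example
decode_box(row = 4, col = 5, tx = 0.4, ty = 0.2, tw = 0.30, th = 0.50)
```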
The next step is to define the model architecture. We use a pretrained model (trained on ImageNet for image recognition) and fine-tune it. Here we provide an MxNet implementation of a MobileNet v2-based YOLO network. For details, please read the descriptions above. The code "2. Model architecture.R" includes the YOLO prediction architecture and loss function; you can try to learn YOLO v1 from this code.
Now we can start to train this model! The support functions can be found in "3. Support functions.R", and finally you can use "4. Train a yolo model.R" to train this model.
Finally, we get a model, and the MAP50 on the validation set is XX.XX%. You can use the code "1. Prediction.R" to predict an image. Here we have trained a model for you to make predictions with. The 'yolo_v1-0000.params' and 'yolo_v1-symbol.json' files can be found in the folder 'model/yolo model (pikachu)'. The predictions of the 50th-epoch model are shown below:
We can see that there are many candidate boxes beside Pikachu. Most of the boxes are actually wrong; however, if you only display the box with the highest probability, the prediction result is very good.
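A minimal sketch of that filtering step is shown below, assuming the decode step returns a data frame with one row per candidate box and a 'prob' column; a fuller solution would apply non-maximum suppression instead of keeping a single box.

```r
# Toy candidate boxes from the decode step (values are illustrative)
pred_boxes <- data.frame(xmin = c(0.35, 0.30, 0.60),
                         ymin = c(0.40, 0.38, 0.10),
                         xmax = c(0.70, 0.72, 0.80),
                         ymax = c(0.85, 0.90, 0.35),
                         prob = c(0.92, 0.41, 0.08))

# Keep only the most confident candidate
best_box <- pred_boxes[which.max(pred_boxes$prob), ]
best_box
```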
You can use the code "5. Test the model performance.R" to calculate MAP50. Finally, the MAP50 on the validation set is 17.70%.