In this project we leverage state-of-the-art deep neural network architectures for image classification, object detection and semantic segmentation to implement a framework that aids autonomous driving by understanding the vehicle's surrounding scene.
A detailed report about the work done can be found in this Overleaf project.
Additionally, a Google Slides presentation can be found at this link.
HDF5 weights of the trained deep neural networks can be found here.
Prior to the experiments for each problem type (classification, detection and segmentation), we have performed an analysis of the datasets to facilitate the interpretation of the results obtained.
See this README for instructions on how to run the experiments and utilities.
In order to choose a well-performing object recognition network for our system, we have tested several CNN architectures: VGG (2014), ResNet (2015) and DenseNet (2016). These networks have been both trained from scratch and fine-tuned from pre-trained weights. The experiments have been carried out on several datasets: the TT100K classification dataset and the BelgiumTS dataset for traffic sign recognition, and the KITTI Vision Benchmark for cars, trucks, cyclists and other typical elements of driving scenes. Finally, we have tuned several parameters of the architectures and the training process in order to obtain better results.
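As an illustration of this fine-tuning setup, here is a minimal sketch assuming Keras and an ImageNet-pretrained VGG16; the number of classes, input size and frozen-layer cutoff are illustrative, and the framework's actual config-driven pipeline differs:

```python
# Hedged sketch: fine-tune an ImageNet-pretrained VGG16 for traffic-sign
# classification. num_classes and the frozen-layer cutoff are assumptions.
from keras.applications import VGG16
from keras.layers import Dense, Flatten
from keras.models import Model
from keras.optimizers import SGD

num_classes = 45  # hypothetical: set to the number of classes in the dataset

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
x = Flatten()(base.output)
x = Dense(512, activation='relu')(x)
predictions = Dense(num_classes, activation='softmax')(x)
model = Model(inputs=base.input, outputs=predictions)

# Freeze the early convolutional blocks; fine-tune the rest with a small LR.
for layer in base.layers[:15]:
    layer.trainable = False
model.compile(optimizer=SGD(lr=1e-4, momentum=0.9),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```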
- `models/denseNet_FCN.py` - adaptation of this implementation of DenseNet to the framework, generalizing the axes of the batch normalization layers, which only worked correctly for Theano.
- `models/resnet.py` - adaptation of the ResNet50 Keras model to the framework, adding L2 regularization for the weights (not included in Keras Applications).
- `models/vgg.py` - changed the implementation to include L2 regularization for the weights (not included in Keras Applications).
- `callbacks/callbacks.py` and `callbacks/callbacks_factory.py` - implemented a new callback, LRDecayScheduler, that allows the user to decay the learning rate by a predefined factor (such that lr <- lr / decay_factor) at specific epochs, or alternatively at every epoch; a minimal sketch is shown after this list.
- `analyze_datasets.py` - analyzes all the datasets in the specified folder by counting the number of images per class per set (train, validation, test), and creates a CSV file with the results and a plot of the (normalized) distribution for all sets.
- `optimization.py` - automatically generates the config files for the optimization of a model using a grid search, and launches the experiments.
- `run_all.sh` - bash script to launch all the experiments in this project, covering object recognition, object detection and semantic segmentation.
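A minimal sketch of the LRDecayScheduler idea, assuming Keras' callback API (the actual implementation in `callbacks/callbacks.py` may differ in details):

```python
# Minimal sketch of an LRDecayScheduler-style Keras callback.
import keras.backend as K
from keras.callbacks import Callback

class LRDecayScheduler(Callback):
    def __init__(self, decay_epochs=None, decay_factor=10.0):
        super(LRDecayScheduler, self).__init__()
        self.decay_epochs = decay_epochs  # list of epochs; None means every epoch
        self.decay_factor = decay_factor

    def on_epoch_begin(self, epoch, logs=None):
        # Apply lr <- lr / decay_factor at the selected epochs.
        if self.decay_epochs is None or epoch in self.decay_epochs:
            lr = float(K.get_value(self.model.optimizer.lr))
            K.set_value(self.model.optimizer.lr, lr / self.decay_factor)
```

It can then be passed to training in the usual way, e.g. `model.fit(x_train, y_train, callbacks=[LRDecayScheduler([20, 40], 10.0)])`.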
- VGG:
- Analyze dataset - we extracted a CSV file, statistical conclusions and plots of the class distributions in the TT100K_TrafficSigns dataset. Plots and comments are in the report.
- Train from scratch using TT100K.
- Compare cropping vs. resizing of the input images.
- Evaluate different pre-processing options in the configuration file: feature-wise mean subtraction and std normalization.
- Transfer learning from the TT100K dataset to the BelgiumTS dataset
- Train VGG from scratch and fine-tune it on the KITTI dataset
- ResNet:
- Implement it and adapt it to the framework
- Train from scratch with TT100K dataset
- Fine-tuning from ImageNet weights with the TT100K dataset
- Fine-tuning from ImageNet weights with the KITTI dataset
- Compare fine-tuning vs. training from scratch
- DenseNet:
- Implement it and adapt it to the framework
- Train from scratch with TT100K dataset
- Boost performance
- Grid search over hyperparameters for ResNet (a sketch of the config generation is shown after this list)
- Refined ResNet fine-tuning from ImageNet weights to boost the performance on the TT100K dataset
- Implemented an LR decay scheduler, which has proved helpful in improving the networks' performance
- Tried data augmentation and different parameters on DenseNet
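As a reference for the grid search mentioned above, a simplified sketch of the config-generation logic behind `optimization.py`; the parameter names, paths and launch command are illustrative, not the script's actual interface:

```python
# Simplified sketch of a grid search: generate one config file per
# hyperparameter combination and launch each experiment.
import itertools
import subprocess

grid = {
    'learning_rate': [1e-3, 1e-4],
    'weight_decay': [1e-4, 1e-5],
    'batch_size': [16, 32],
}

keys = sorted(grid)
for i, values in enumerate(itertools.product(*(grid[k] for k in keys))):
    params = dict(zip(keys, values))
    config_path = 'config/grid_%02d.py' % i  # illustrative path
    with open(config_path, 'w') as f:
        for name, value in sorted(params.items()):
            f.write('%s = %r\n' % (name, value))
    # Launch one experiment per generated configuration (command illustrative).
    subprocess.call(['python', 'train.py', '-c', config_path])
```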
For object detection we have considered two families of single-shot models: the most recent version of You Only Look Once (YOLO), together with its smaller counterpart Tiny-YOLO, and the Single Shot MultiBox Detector (SSD). YOLO and Tiny-YOLO have been trained by fine-tuning pre-trained ImageNet weights, while SSD has been trained from scratch. All models have been trained to detect a variety of traffic signs in the TT100K detection dataset, and to detect pedestrians, cars and trucks in the Udacity dataset.
- `models/ssd300.py`, `ssd_utils.py` and `metrics.py` - adaptation of this implementation of SSD300 to the framework, including the loss and the batch generator utilities required to train it.
- `analyze_datasets.py` - extended the functionality to analyze detection datasets and report distributions over several variables.
- `eval_detection_fscore.py` - extended to evaluate the SSD model; included options to control the detection and NMS thresholds, added an option to store the predictions for the first image in each processed chunk, and generalized the script to ignore specific classes so that they are not taken into account when computing the metrics. A sketch of the underlying IoU-based matching is shown after this list.
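For context, detection f-score evaluation hinges on matching predictions to ground-truth boxes by intersection over union (IoU). Below is a minimal sketch of that matching, not the exact code of `eval_detection_fscore.py`; the threshold value is illustrative:

```python
# Minimal sketch of IoU-based matching for detection precision/recall.
# Boxes are (x_min, y_min, x_max, y_max) tuples.
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def count_matches(predictions, ground_truth, iou_threshold=0.5):
    """Greedy matching; predictions assumed sorted by descending confidence.
    Returns true-positive and false-positive counts."""
    matched, tp, fp = set(), 0, 0
    for pred in predictions:
        best_iou, best_gt = 0.0, None
        for j, gt in enumerate(ground_truth):
            if j not in matched and iou(pred, gt) > best_iou:
                best_iou, best_gt = iou(pred, gt), j
        if best_iou >= iou_threshold:
            matched.add(best_gt)
            tp += 1
        else:
            fp += 1
    return tp, fp
```

False negatives are then the unmatched ground-truth boxes, from which precision, recall and the f-score follow.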
- YOLO:
- Fine-tune from ImageNet weights on TT100K detection dataset
- Fine-tune from ImageNet weights on Udacity dataset
- Evaluate performance on TT100K detection dataset
- Evaluate performance on Udacity dataset
- Tiny-YOLO:
- Fine-tune from ImageNet weights on TT100K detection dataset
- Fine-tune from ImageNet weights on Udacity dataset
- Evaluate performance on TT100K detection dataset
- Evaluate performance on Udacity dataset
- Compare the results and performance of Tiny-YOLO and YOLO
- SSD:
- Implement it and adapt it to the framework
- Train from scratch on TT100K detection dataset
- Train from scratch on Udacity dataset
- Evaluate performance on TT100K detection dataset
- Evaluate performance on Udacity dataset
- Dataset Analysis
- Analyze TT100K detection dataset: distribution of classes, bounding boxes' aspect ratios and bounding boxes' areas per dataset split
- Analyze Udacity dataset: distribution of classes, bounding boxes' aspect ratios and bounding boxes' areas per dataset split
- Assess similarities and differences between splits on Udacity dataset
- Boost performance
- Fine-tune Tiny-YOLO from baseline weights on the TT100K detection dataset
- Fine-tune Tiny-YOLO with preprocessing and data augmentation techniques to overcome the differences between the Udacity dataset splits, thus improving the model's performance on this dataset (a sketch of the per-image normalization is shown below)
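As an illustration of the preprocessing mentioned in the last item, a sketch of per-image standardization, which compensates for brightness and contrast differences between splits without altering the bounding-box geometry (the exact parameters live in the framework's config files):

```python
# Sketch of per-image normalization for detection inputs: geometry-preserving,
# so bounding-box annotations need no adjustment.
import numpy as np

def standardize(image):
    """Per-image mean subtraction and std normalization."""
    image = image.astype('float32')
    return (image - image.mean()) / (image.std() + 1e-7)
```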
For the semantic segmentation task, we have implemented and tested SegNet, DeepLabv2, Multi-Scale Context Aggregation by Dilated Convolutions (DilatedNet) and Tiramisu. We have also compared the results with FCN8.
- `models/segnet.py` - implementation from scratch of both the VGG and the basic versions, following the original paper, the Caffe SegNet code and the Caffe SegNet basic code.
- `models/deeplabV2.py` - adaptation of this implementation of DeepLabv2 to the framework, adding L2 regularization for the weights.
- `models/tiramisu.py` - implementation based on the Theano/Lasagne code from the original paper.
- `models/dilation.py` - adaptation of this implementation.
- `initializations/initializations.py` - added Identity initialization; a sketch is shown after this list.
- `analyze_datasets.py` - extended the implementation to analyze segmentation datasets.
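A sketch of the Identity initialization idea for 2-D convolution kernels, in the spirit of the DilatedNet context module (assumed form; the framework's version lives in `initializations/initializations.py`):

```python
# Sketch of an Identity initialization for 2-D convolution kernels: a kernel
# initialized this way initially passes each channel through unchanged.
import numpy as np

def identity_init(shape):
    """shape = (rows, cols, n_channels_in, n_channels_out)."""
    rows, cols, n_in, n_out = shape
    kernel = np.zeros(shape, dtype='float32')
    for c in range(min(n_in, n_out)):
        # The centre tap copies channel c through at initialization.
        kernel[rows // 2, cols // 2, c, c] = 1.0
    return kernel
```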
- FCN8:
- Read paper
- Train network on CamVid dataset
- Train network on CityScapes dataset
- Evaluate performance on CamVid dataset
- Evaluate performance on CityScapes dataset
- SegNet:
- Read paper
- Implement the network in the framework (VGG and basic versions)
- Train network on CamVid dataset
- Boost performance
- Evaluate performance on CamVid dataset
- DeepLabv2:
- Read paper
- Implement network in the framework
- Train network on CamVid dataset
- Boost performance
- Evaluate performance on CamVid dataset
- DilatedNet:
- Read paper
- Implement network in the framework
- Train network on CamVid dataset
- Boost performance
- Evaluate performance on CamVid dataset
- Tiramisu:
- Read paper
- Implement network in the framework
- Train network on CamVid dataset
- Boost performance
- Evaluate performance on CamVid dataset
- Dataset Analysis
- Analyze the distribution of classes across all data splits for all the available segmentation datasets: CamVid, Cityscapes, KITTI, Pascal VOC 2012, Polyps and Synthia Cityscapes (a sketch of the computation is shown below).
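As an illustration, a sketch of how the per-class pixel distribution of one split could be computed, assuming label images store one class id per pixel; the path pattern and directory layout are illustrative:

```python
# Sketch: per-class pixel distribution of a segmentation split.
import glob
import numpy as np
from PIL import Image

def class_distribution(mask_dir, n_classes):
    counts = np.zeros(n_classes, dtype=np.int64)
    for path in glob.glob(mask_dir + '/*.png'):
        labels = np.asarray(Image.open(path))  # one class id per pixel
        counts += np.bincount(labels.ravel(), minlength=n_classes)[:n_classes]
    return counts / float(counts.sum())  # normalized class distribution
```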
Prior to choosing our final system, we have carried out several experiments using different architectures, parameters and datasets. A summary of the experiments can be found here.
[1] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561, 2015.
[2] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba. End to End Learning for Self-Driving Cars. arXiv:1604.07316 [cs], Apr. 2016.
[3] C. Chen, A. Seff, A. Kornhauser, and J. Xiao. Deepdriving: Learning affordance for direct perception in autonomous driving. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. arXiv:1606.00915 [cs], June 2016.
[5] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[7] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely Connected Convolutional Networks. arXiv:1608.06993 [cs], Aug. 2016.
[8] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio. The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. arXiv preprint arXiv:1611.09326, 2016.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[10] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single Shot MultiBox Detector. arXiv:1512.02325 [cs], 9905:21–37, 2016.
[11] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[12] M. Mathias, R. Timofte, R. Benenson, and L. Van Gool. Traffic sign recognition – how far are we from the solution? International Joint Conference on Neural Networks (IJCNN), 2013.
[13] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.
[14] J. Redmon and A. Farhadi. YOLO9000: better, faster, stronger. CoRR, abs/1612.08242, 2016.
[15] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv:1506.01497 [cs], June 2015.
[16] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575 [cs], Sept. 2014.
[17] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[18] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.
[19] F. Yu and V. Koltun. Multi-Scale Context Aggregation by Dilated Convolutions. arXiv:1511.07122 [cs], Nov. 2015.
[20] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu. Traffic-sign detection and classification in the wild. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.