# convnet-benchmarks

Easy benchmarking of all public open-source implementations of convnets. A summary is provided in the section below.

Work in progress! I am still working through each convolution module in each library; THIS IS NOT AN EXHAUSTIVE LIST!

* After getting an initial baseline with the single module below (and initial benchmark scripts in place), I will benchmark a full AlexNet/MattNet/Overfeat.

Machine: 6-core Intel i7-3930K @ 3.20GHz + NVIDIA Titan Black + Ubuntu 14.04 x86_64

### Spatial Convolution layer (3D input, 3D output, densely connected)

forward + backprop (wrt input and weights)

| Original Library | Class/Function Benchmarked | Total Time (ms) | Total forward time (ms) | Total backward time (ms) | Peak Memory | Formula | Limitations |
|---|---|---|---|---|---|---|---|
| Theano (experimental)*** | conv2d_fft | 1178 | 304 | 874 | | | |
| Caffe | ConvolutionLayer | 1787 | 537 | 1250 | | | |
| cuda-convnet2 * | ConvLayer | 1818 | 416 | 1402 | | | |
| NVidia CuDNN * | cudnn.SpatialConvolution | 1861 | 513 | 1348 | | | |
| Torch-7 | nn.SpatialConvolutionBHWD | 1892 | 581 | 1311 | | | |
| Torch-7 | nn.SpatialConvolutionMM | 1936 | 581 | 1355 | | | |
| Theano (experimental) | CorrMM | 2063 | 630 | 1433 | | | |
| cuda-convnet** | pylearn2.cuda_convnet | 3287 | 727 | 2560 | | | |
| ccv | ccv_convnet_layer | 809+bw | 809 | | | | |
| Theano (legacy) | conv2d | 70774 | 3833 | 66941 | | | |
| cherry-picking**** | best per layer | 985 | 191 | 794 | | | |

* \* indicates that the library was tested with Torch bindings of the specific kernels.
* \*\* indicates that the library was tested with Pylearn2 bindings.
* \*\*\* This is an experimental module which uses FFT to compute convolutions. It uses a lot of memory, according to @benanne.
* \*\*\*\* The last row shows the results obtainable when choosing the best-performing library for each layer.
* L1 - Input: 128x128, Batch-size: 128, Feature maps: 3->96, Kernel Size: 11x11, Stride: 1x1
* L2 - Input: 64x64, Batch-size: 128, Feature maps: 64->128, Kernel Size: 9x9, Stride: 1x1
* L3 - Input: 32x32, Batch-size: 128, Feature maps: 128->128, Kernel Size: 9x9, Stride: 1x1
* L4 - Input: 16x16, Batch-size: 128, Feature maps: 128->128, Kernel Size: 7x7, Stride: 1x1
* L5 - Input: 13x13, Batch-size: 128, Feature maps: 384->384, Kernel Size: 3x3, Stride: 1x1
* The table is ranked according to the total time of the forward + backward calls summed over all five layers (L1 + L2 + L3 + L4 + L5). A minimal timing sketch for these layer configurations follows below.
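The sketch below shows how the five layer configurations above can be timed for the forward and backward (w.r.t. input and weights) passes. It uses Theano's legacy conv2d, one of the modules in the table, purely as an example; it is not one of this repository's benchmark scripts, and the variable names, iteration count, and the simplistic timing (no explicit GPU synchronization) are illustrative assumptions.

```python
import time

import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv2d   # the "Theano (legacy) conv2d" entry above

# Layer configurations L1..L5 from the list above:
# (input size, batch size, input maps, output maps, kernel size, stride)
configs = [
    (128, 128,   3,  96, 11, 1),   # L1
    ( 64, 128,  64, 128,  9, 1),   # L2
    ( 32, 128, 128, 128,  9, 1),   # L3
    ( 16, 128, 128, 128,  7, 1),   # L4
    ( 13, 128, 384, 384,  3, 1),   # L5
]

def bench(size, batch, in_maps, out_maps, kernel, stride, n_iter=10):
    rng = np.random.RandomState(0)
    x = theano.shared(rng.randn(batch, in_maps, size, size).astype('float32'))
    w = theano.shared(rng.randn(out_maps, in_maps, kernel, kernel).astype('float32'))
    out = conv2d(x, w, subsample=(stride, stride))
    grads = T.grad(out.sum(), [x, w])        # backprop w.r.t. input and weights
    # Reduce outputs to scalars so the device-to-host copy stays out of the timing.
    fprop = theano.function([], out.sum())
    bprop = theano.function([], [g.sum() for g in grads])
    for name, fn in (('forward', fprop), ('backward', bprop)):
        fn()                                 # warm-up (compilation, allocation)
        t0 = time.time()
        for _ in range(n_iter):
            fn()
        print('  %-8s %8.1f ms' % (name, (time.time() - t0) / n_iter * 1000))

for i, cfg in enumerate(configs, 1):
    print('L%d' % i)
    bench(*cfg)
```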

##### Breakdown

forward

Columns L1, L2, L3, L4, L5, Total are times in milliseconds

| Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
|---|---|---|---|---|---|---|---|
| Theano (experimental)*** | conv2d_fft | 138 | 73 | 30 | 9 | 39 | 304 |
| Caffe | ConvolutionLayer<Dtype> | 100 | 205 | 158 | 35 | 39 | 537 |
| cuda-convnet2 * | ConvLayer | 63 | 241 | 86 | 9 | 17 | 416 |
| NVidia CuDNN | cudnn.SpatialConvolution | 94 | 274 | 101 | 12 | 32 | 513 |
| Torch-7 | nn.SpatialConvolutionBHWD | 182 | 279 | 94 | 11 | 15 | 581 |
| Torch-7 | nn.SpatialConvolutionMM | 105 | 239 | 168 | 32 | 37 | 581 |
| Theano (experimental) | CorrMM | 100 | 251 | 197 | 38 | 44 | 630 |
| cuda-convnet** | pylearn2.cuda_convnet | 92 | 412 | 159 | 19 | 45 | 727 |
| ccv | ccv_convnet_layer | 121 | 437 | 182 | 23 | 44 | 809 |
| Theano (legacy) | conv2d | 408 | 2310 | 739 | 99 | 277 | 3833 |
| cherry-picking**** | best per layer | 63 | 72 | 30 | 9 | 17 | 191 |

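The cherry-picking row is simply the fastest library for each layer taken independently. A minimal sketch of that computation, with the forward times hard-coded from the table above, is shown below; small differences from the published row (e.g. 72 vs. 73 ms for L2) presumably come from rounding or separate measurement runs.

```python
# Forward times in milliseconds per layer, copied from the forward table above.
forward_ms = {
    'Theano conv2d_fft':         [138,   73,  30,  9,  39],
    'Caffe ConvolutionLayer':    [100,  205, 158, 35,  39],
    'cuda-convnet2 ConvLayer':   [ 63,  241,  86,  9,  17],
    'cudnn.SpatialConvolution':  [ 94,  274, 101, 12,  32],
    'nn.SpatialConvolutionBHWD': [182,  279,  94, 11,  15],
    'nn.SpatialConvolutionMM':   [105,  239, 168, 32,  37],
    'Theano CorrMM':             [100,  251, 197, 38,  44],
    'pylearn2.cuda_convnet':     [ 92,  412, 159, 19,  45],
    'ccv_convnet_layer':         [121,  437, 182, 23,  44],
    'Theano legacy conv2d':      [408, 2310, 739, 99, 277],
}

# "Cherry-picking": take the best-performing library for each layer independently.
best = [min(times[i] for times in forward_ms.values()) for i in range(5)]
print(best, sum(best))   # approximately the cherry-picking row of the forward table
```
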
backward (gradInput + gradWeight)

Columns L1, L2, L3, L4, L5, Total are times in milliseconds

| Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
|---|---|---|---|---|---|---|---|
| Theano (experimental)*** | conv2d_fft | 449 | 218 | 89 | 28 | 90 | 874 |
| Caffe | ConvolutionLayer<Dtype> | 307 | 599 | 242 | 42 | 60 | 1250 |
| cuda-convnet2 * | ConvLayer | 586 | 570 | 190 | 19 | 37 | 1402 |
| NVidia CuDNN | cudnn.SpatialConvolution | 226 | 736 | 297 | 32 | 57 | 1348 |
| Torch-7 | nn.SpatialConvolutionBHWD | 513 | 562 | 187 | 21 | 28 | 1311 |
| Torch-7 | nn.SpatialConvolutionMM | 301 | 673 | 270 | 47 | 64 | 1355 |
| Theano (experimental) | CorrMM | 282 | 733 | 295 | 51 | 72 | 1433 |
| cuda-convnet** | pylearn2.cuda_convnet | 618 | 1305 | 473 | 50 | 114 | 2560 |
| ccv | ccv_convnet_layer | | | | | | |
| Theano (legacy) | conv2d | 53997 | 9752 | 2202 | 299 | 691 | 66941 |
| cherry-picking**** | best per layer | 285 | 337 | 118 | 17 | 37 | 794 |

gradInput

Columns L1, L2, L3, L4, L5, Total are times in milliseconds

| Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
|---|---|---|---|---|---|---|---|
| Theano (experimental)*** | conv2d_fft | 250 | 111 | 54 | 19 | 48 | 482 |
| Caffe | ConvolutionLayer<Dtype> | 86 | 271 | 120 | 20 | 26 | 523 |
| cuda-convnet2 * | ConvLayer | 131 | 230 | 82 | 8 | 16 | 467 |
| Theano (experimental) | CorrMM | 87 | 328 | 142 | 25 | 31 | 613 |
| NVidia CuDNN | cudnn.SpatialConvolution | 111 | 421 | 180 | 17 | 21 | 750 |
| Torch-7 | nn.SpatialConvolutionBHWD | 276 | 277 | 102 | 11 | 14 | 680 |
| Torch-7 | nn.SpatialConvolutionMM | 91 | 302 | 129 | 23 | 27 | 572 |
| cuda-convnet** | pylearn2.cuda_convnet | 155 | 647 | 230 | 23 | 47 | 1102 |
| ccv | ccv_convnet_layer | | | | | | |
| Theano (legacy) | conv2d | 53340 | 2690 | 1044 | 171 | 406 | 57651 |
| cherry-picking**** | best per layer | 86 | 230 | 82 | 8 | 16 | 422 |

gradWeights

Columns L1, L2, L3, L4, L5, Total are times in milliseconds

| Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
|---|---|---|---|---|---|---|---|
| Theano (experimental)*** | conv2d_fft | 199 | 107 | 35 | 9 | 42 | 392 |
| Caffe | ConvolutionLayer<Dtype> | 221 | 328 | 122 | 22 | 34 | 727 |
| cuda-convnet2 * | ConvLayer | 455 | 340 | 108 | 11 | 21 | 935 |
| NVidia CuDNN | cudnn.SpatialConvolution | 115 | 315 | 117 | 15 | 36 | 598 |
| Torch-7 | nn.SpatialConvolutionBHWD | 237 | 285 | 85 | 10 | 14 | 631 |
| Torch-7 | nn.SpatialConvolutionMM | 210 | 371 | 141 | 24 | 37 | 783 |
| Theano (experimental) | CorrMM | 195 | 405 | 153 | 26 | 41 | 820 |
| cuda-convnet** | pylearn2.cuda_convnet | 463 | 658 | 243 | 27 | 67 | 1458 |
| ccv | ccv_convnet_layer | | | | | | |
| Theano (legacy) | conv2d | 657 | 7062 | 1158 | 128 | 285 | 9290 |
| cherry-picking**** | best per layer | 199 | 107 | 36 | 9 | 21 | 372 |

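The breakdown tables relate to the summary table by simple sums: for each library, backward = gradInput + gradWeights, and the summary total = forward + backward. A tiny consistency check using the NVidia CuDNN rows from the tables above:

```python
# cuDNN numbers (ms) copied from the tables above.
forward, grad_input, grad_weights = 513, 750, 598

backward = grad_input + grad_weights   # 1348, matches the backward table
total = forward + backward             # 1861, matches the summary table
assert (backward, total) == (1348, 1861)
```
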
