convnet-benchmarks

Easy benchmarking of all public open-source implementations of convnets. A summary is provided in the section below.

After getting an initial baseline with the single module below (and getting initial benchmark scripts), I will benchmark a full AlexNet/MattNet/Overfeat.

Machine: 6-core Intel i7-3930K @ 3.20GHz + NVIDIA Titan Black + Ubuntu 14.04 x86_64

### Spatial Convolution layer (3D input 3D output)

##### :forward()

Columns L1, L2, L3, L4, L5, Total are times in milliseconds.

| Original Library | Class/Function Benchmarked | Device | L1 | L2 | L3 | L4 | L5 | Total |
|---|---|---|---:|---:|---:|---:|---:|---:|
| Theano (experimental)*** | pylearn2.mlp.ConvElemwise | GPU | 205 | 75 | 28 | 9 | 5 | 322 |
| cuda-convnet2 * | ConvLayer | GPU | 69 | 242 | 87 | 9 | 17 | 424 |
| Caffe | ConvolutionLayer\<Dtype> | GPU | 102 | 203 | 158 | 39 | 52 | 554 |
| Torch-7 | nn.SpatialConvolutionMM | GPU | 105 | 240 | 168 | 41 | 55 | 609 |
| cuda-convnet** | pylearn2.cuda_convnet | GPU | 98 | 404 | 149 | 16 | 38 | 705 |
| ccv | ccv_convnet_layer | GPU | 121 | 437 | 182 | 23 | 44 | 809 |
| Theano (legacy)** | pylearn2.mlp.ConvElemwise | GPU | 418 | 2299 | 672 | 88 | 272 | 3749 |

\* indicates that the library was tested with Torch bindings of the specific kernels.

\*\* indicates that the library was tested with Pylearn2 bindings.

\*\*\* This is an experimental module which uses FFT to calculate convolutions. It uses a lot of memory according to @benanne.

- L1 - Input: 128x128, Batch-size: 128, Feature maps: 3->96, Kernel Size: 11x11, Stride: 1x1
- L2 - Input: 64x64, Batch-size: 128, Feature maps: 64->128, Kernel Size: 9x9, Stride: 1x1
- L3 - Input: 32x32, Batch-size: 128, Feature maps: 128->128, Kernel Size: 9x9, Stride: 1x1
- L4 - Input: 16x16, Batch-size: 128, Feature maps: 128->128, Kernel Size: 7x7, Stride: 1x1
- L5 - Input: 13x13, Batch-size: 128, Feature maps: 384->384, Kernel Size: 3x3, Stride: 1x1

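
As a rough illustration of the five layer configurations above, the sketch below encodes them as plain Python data and derives the output size and the multiply-accumulate count of one forward pass. It assumes no padding (a "valid" convolution), which the list does not state. The names `LayerConfig`, `output_size`, and `forward_macs` are hypothetical and are not part of any of the benchmarked libraries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerConfig:
    """One benchmark layer (hypothetical helper, not a library class)."""
    name: str
    input_size: int   # square input: input_size x input_size
    batch: int
    in_maps: int
    out_maps: int
    kernel: int       # square kernel: kernel x kernel
    stride: int = 1

    def output_size(self) -> int:
        # Assumption: no padding ("valid" convolution).
        return (self.input_size - self.kernel) // self.stride + 1

    def forward_macs(self) -> int:
        # Multiply-accumulates for a direct convolution, one forward pass.
        out = self.output_size()
        return (self.batch * self.out_maps * out * out
                * self.in_maps * self.kernel * self.kernel)


# Values copied from the L1..L5 descriptions.
LAYERS = [
    LayerConfig("L1", 128, 128, 3, 96, 11),
    LayerConfig("L2", 64, 128, 64, 128, 9),
    LayerConfig("L3", 32, 128, 128, 128, 9),
    LayerConfig("L4", 16, 128, 128, 128, 7),
    LayerConfig("L5", 13, 128, 384, 384, 3),
]

for layer in LAYERS:
    print(layer.name, layer.output_size(), layer.forward_macs())
```

The multiply-accumulate count only describes a direct convolution. An FFT-based implementation does different work, so these numbers are a scale reference and do not predict the measured times.
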
The table is ranked by total time (L1 + L2 + L3 + L4 + L5), fastest first.
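
To make the ranking rule concrete, here is a minimal sketch that sorts the libraries by the reported totals and prints each one's slowdown relative to the fastest entry. The totals are copied from the "Total" column of the table. The variable names are illustrative only and do not come from the benchmark scripts.

```python
# Reported totals in milliseconds, copied from the "Total" column.
reported_totals_ms = {
    "Torch-7": 609,
    "Caffe": 554,
    "ccv": 809,
    "cuda-convnet": 705,
    "cuda-convnet2": 424,
    "Theano (legacy)": 3749,
    "Theano (experimental)": 322,
}

# Rank by total time, fastest (smallest total) first.
ranking = sorted(reported_totals_ms.items(), key=lambda item: item[1])
fastest_name, fastest_ms = ranking[0]

for rank, (name, total_ms) in enumerate(ranking, start=1):
    slowdown = total_ms / fastest_ms
    print(f"{rank}. {name}: {total_ms} ms ({slowdown:.2f}x the fastest)")
```
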
