# convnet-benchmarks

Easy benchmarking of all public open-source implementations of convnets. A summary is provided in the section below.

Work in progress! I am still working through each convolution module in each library; THIS IS NOT AN EXHAUSTIVE LIST!

* After getting an initial baseline with the single module below (and initial benchmark scripts in place), I will benchmark a full AlexNet/MattNet/Overfeat.

Machine: 6-core Intel i7-3930K @ 3.20GHz + NVIDIA Titan Black + Ubuntu 14.04 x86_64

### Spatial Convolution layer (3D input, 3D output, densely connected)

forward + backprop (wrt input and weights)

| Original Library | Class/Function Benchmarked | Total Time (ms) | Total forward time (ms) | Total backward time (ms) | Peak Memory | Formula | Limitations |
|---|---|---|---|---|---|---|---|
| Theano (experimental)*** | conv2d_fft | 1178 | 304 | 874 | | | |
| Caffe | ConvolutionLayer | 1787 | 537 | 1250 | | | |
| cuda-convnet2 * | ConvLayer | 1818 | 416 | 1402 | | | |
| NVidia CuDNN * | cudnn.SpatialConvolution | 1861 | 513 | 1348 | | | |
| Torch-7 | nn.SpatialConvolutionBHWD | 1892 | 581 | 1311 | | | |
| Torch-7 | nn.SpatialConvolutionMM | 1936 | 581 | 1355 | | | |
| Theano (experimental) | CorrMM | 2063 | 630 | 1433 | | | |
| cuda-convnet** | pylearn2.cuda_convnet | 3287 | 727 | 2560 | | | |
| ccv | ccv_convnet_layer | 809+bw | 809 | | | | |
| Theano (legacy) | conv2d | 70774 | 3833 | 66941 | | | |
| cherry-picking**** | best per layer | 985 | 191 | 794 | | | |

* \* indicates that the library was tested with Torch bindings of the specific kernels.
* \*\* indicates that the library was tested with Pylearn2 bindings.
* \*\*\* This is an experimental module which uses FFT to compute convolutions. It uses a lot of memory, according to @benanne.
* \*\*\*\* The last row shows the results obtainable when choosing the best-performing library for each layer.
* L1 - Input: 128x128, Batch-size: 128, Feature maps: 3->96, Kernel Size: 11x11, Stride: 1x1
* L2 - Input: 64x64, Batch-size: 128, Feature maps: 64->128, Kernel Size: 9x9, Stride: 1x1
* L3 - Input: 32x32, Batch-size: 128, Feature maps: 128->128, Kernel Size: 9x9, Stride: 1x1
* L4 - Input: 16x16, Batch-size: 128, Feature maps: 128->128, Kernel Size: 7x7, Stride: 1x1
* L5 - Input: 13x13, Batch-size: 128, Feature maps: 384->384, Kernel Size: 3x3, Stride: 1x1
* The table is ranked according to the total time of the forward + backward calls summed over all five layers (L1 + L2 + L3 + L4 + L5). A minimal timing sketch for these layer configurations follows below.
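The sketch below shows how the five layer configurations above can be timed for the forward and backward (w.r.t. input and weights) passes. It uses Theano's legacy conv2d, one of the modules in the table, purely as an example; it is not one of this repository's benchmark scripts, and the variable names, iteration count, and the simplistic timing (no explicit GPU synchronization) are illustrative assumptions.

```python
import time

import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv2d   # the "Theano (legacy) conv2d" entry above

# Layer configurations L1..L5 from the list above:
# (input size, batch size, input maps, output maps, kernel size, stride)
configs = [
    (128, 128,   3,  96, 11, 1),   # L1
    ( 64, 128,  64, 128,  9, 1),   # L2
    ( 32, 128, 128, 128,  9, 1),   # L3
    ( 16, 128, 128, 128,  7, 1),   # L4
    ( 13, 128, 384, 384,  3, 1),   # L5
]

def bench(size, batch, in_maps, out_maps, kernel, stride, n_iter=10):
    rng = np.random.RandomState(0)
    x = theano.shared(rng.randn(batch, in_maps, size, size).astype('float32'))
    w = theano.shared(rng.randn(out_maps, in_maps, kernel, kernel).astype('float32'))
    out = conv2d(x, w, subsample=(stride, stride))
    grads = T.grad(out.sum(), [x, w])        # backprop w.r.t. input and weights
    # Reduce outputs to scalars so the device-to-host copy stays out of the timing.
    fprop = theano.function([], out.sum())
    bprop = theano.function([], [g.sum() for g in grads])
    for name, fn in (('forward', fprop), ('backward', bprop)):
        fn()                                 # warm-up (compilation, allocation)
        t0 = time.time()
        for _ in range(n_iter):
            fn()
        print('  %-8s %8.1f ms' % (name, (time.time() - t0) / n_iter * 1000))

for i, cfg in enumerate(configs, 1):
    print('L%d' % i)
    bench(*cfg)
```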

##### Breakdown

forward

Columns L1, L2, L3, L4, L5, Total are times in milliseconds

| Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
|---|---|---|---|---|---|---|---|
| Theano (experimental)*** | conv2d_fft | 138 | 73 | 30 | 9 | 39 | 304 |
| Caffe | ConvolutionLayer<Dtype> | 100 | 205 | 158 | 35 | 39 | 537 |
| cuda-convnet2 * | ConvLayer | 63 | 241 | 86 | 9 | 17 | 416 |
| NVidia CuDNN | cudnn.SpatialConvolution | 94 | 274 | 101 | 12 | 32 | 513 |
| Torch-7 | nn.SpatialConvolutionBHWD | 182 | 279 | 94 | 11 | 15 | 581 |
| Torch-7 | nn.SpatialConvolutionMM | 105 | 239 | 168 | 32 | 37 | 581 |
| Theano (experimental) | CorrMM | 100 | 251 | 197 | 38 | 44 | 630 |
| cuda-convnet** | pylearn2.cuda_convnet | 92 | 412 | 159 | 19 | 45 | 727 |
| ccv | ccv_convnet_layer | 121 | 437 | 182 | 23 | 44 | 809 |
| Theano (legacy) | conv2d | 408 | 2310 | 739 | 99 | 277 | 3833 |
| cherry-picking**** | best per layer | 63 | 72 | 30 | 9 | 17 | 191 |

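The cherry-picking row is simply the fastest library for each layer taken independently. A minimal sketch of that computation, with the forward times hard-coded from the table above, is shown below; small differences from the published row (e.g. 72 vs. 73 ms for L2) presumably come from rounding or separate measurement runs.

```python
# Forward times in milliseconds per layer, copied from the forward table above.
forward_ms = {
    'Theano conv2d_fft':         [138,   73,  30,  9,  39],
    'Caffe ConvolutionLayer':    [100,  205, 158, 35,  39],
    'cuda-convnet2 ConvLayer':   [ 63,  241,  86,  9,  17],
    'cudnn.SpatialConvolution':  [ 94,  274, 101, 12,  32],
    'nn.SpatialConvolutionBHWD': [182,  279,  94, 11,  15],
    'nn.SpatialConvolutionMM':   [105,  239, 168, 32,  37],
    'Theano CorrMM':             [100,  251, 197, 38,  44],
    'pylearn2.cuda_convnet':     [ 92,  412, 159, 19,  45],
    'ccv_convnet_layer':         [121,  437, 182, 23,  44],
    'Theano legacy conv2d':      [408, 2310, 739, 99, 277],
}

# "Cherry-picking": take the best-performing library for each layer independently.
best = [min(times[i] for times in forward_ms.values()) for i in range(5)]
print(best, sum(best))   # approximately the cherry-picking row of the forward table
```
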
backward (gradInput + gradWeight)

Columns L1, L2, L3, L4, L5, Total are times in milliseconds

| Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
|---|---|---|---|---|---|---|---|
| Theano (experimental)*** | conv2d_fft | 449 | 218 | 89 | 28 | 90 | 874 |
| Caffe | ConvolutionLayer<Dtype> | 307 | 599 | 242 | 42 | 60 | 1250 |
| cuda-convnet2 * | ConvLayer | 586 | 570 | 190 | 19 | 37 | 1402 |
| NVidia CuDNN | cudnn.SpatialConvolution | 226 | 736 | 297 | 32 | 57 | 1348 |
| Torch-7 | nn.SpatialConvolutionBHWD | 513 | 562 | 187 | 21 | 28 | 1311 |
| Torch-7 | nn.SpatialConvolutionMM | 301 | 673 | 270 | 47 | 64 | 1355 |
| Theano (experimental) | CorrMM | 282 | 733 | 295 | 51 | 72 | 1433 |
| cuda-convnet** | pylearn2.cuda_convnet | 618 | 1305 | 473 | 50 | 114 | 2560 |
| ccv | ccv_convnet_layer | | | | | | |
| Theano (legacy) | conv2d | 53997 | 9752 | 2202 | 299 | 691 | 66941 |
| cherry-picking**** | best per layer | 285 | 337 | 118 | 17 | 37 | 794 |

gradInput

Columns L1, L2, L3, L4, L5, Total are times in milliseconds

| Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
|---|---|---|---|---|---|---|---|
| Theano (experimental)*** | conv2d_fft | 250 | 111 | 54 | 19 | 48 | 482 |
| Caffe | ConvolutionLayer<Dtype> | 86 | 271 | 120 | 20 | 26 | 523 |
| cuda-convnet2 * | ConvLayer | 131 | 230 | 82 | 8 | 16 | 467 |
| Theano (experimental) | CorrMM | 87 | 328 | 142 | 25 | 31 | 613 |
| NVidia CuDNN | cudnn.SpatialConvolution | 111 | 421 | 180 | 17 | 21 | 750 |
| Torch-7 | nn.SpatialConvolutionBHWD | 276 | 277 | 102 | 11 | 14 | 680 |
| Torch-7 | nn.SpatialConvolutionMM | 91 | 302 | 129 | 23 | 27 | 572 |
| cuda-convnet** | pylearn2.cuda_convnet | 155 | 647 | 230 | 23 | 47 | 1102 |
| ccv | ccv_convnet_layer | | | | | | |
| Theano (legacy) | conv2d | 53340 | 2690 | 1044 | 171 | 406 | 57651 |
| cherry-picking**** | best per layer | 86 | 230 | 82 | 8 | 16 | 422 |

gradWeights

Columns L1, L2, L3, L4, L5, Total are times in milliseconds

| Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
|---|---|---|---|---|---|---|---|
| Theano (experimental)*** | conv2d_fft | 199 | 107 | 35 | 9 | 42 | 392 |
| Caffe | ConvolutionLayer<Dtype> | 221 | 328 | 122 | 22 | 34 | 727 |
| cuda-convnet2 * | ConvLayer | 455 | 340 | 108 | 11 | 21 | 935 |
| NVidia CuDNN | cudnn.SpatialConvolution | 115 | 315 | 117 | 15 | 36 | 598 |
| Torch-7 | nn.SpatialConvolutionBHWD | 237 | 285 | 85 | 10 | 14 | 631 |
| Torch-7 | nn.SpatialConvolutionMM | 210 | 371 | 141 | 24 | 37 | 783 |
| Theano (experimental) | CorrMM | 195 | 405 | 153 | 26 | 41 | 820 |
| cuda-convnet** | pylearn2.cuda_convnet | 463 | 658 | 243 | 27 | 67 | 1458 |
| ccv | ccv_convnet_layer | | | | | | |
| Theano (legacy) | conv2d | 657 | 7062 | 1158 | 128 | 285 | 9290 |
| cherry-picking**** | best per layer | 199 | 107 | 36 | 9 | 21 | 372 |

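The breakdown tables relate to the summary table by simple sums: for each library, backward = gradInput + gradWeights, and the summary total = forward + backward. A tiny consistency check using the NVidia CuDNN rows from the tables above:

```python
# cuDNN numbers (ms) copied from the tables above.
forward, grad_input, grad_weights = 513, 750, 598

backward = grad_input + grad_weights   # 1348, matches the backward table
total = forward + backward             # 1861, matches the summary table
assert (backward, total) == (1348, 1861)
```
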
