Benchmark TensorFlow #66

Closed
soumith opened this Issue Nov 11, 2015 · 112 comments

Owner

soumith commented Nov 11, 2015

Google's TensorFlow benchmarks are here!

I've run the benchmarks on the Imagenet Winners.
When I saw issues with the numbers, memory, etc., I emailed @Yangqing to confirm that what I'm seeing is expected.

With that disclaimer out of the way, here are some things you should know about TensorFlow (as of the pip version that I installed today):

  • in-place ReLU seems non-existent in practice (see the short illustration after this list).
    • Yangqing says: "right now there are little in-place operations in TensorFlow and we pretty much rely on the scheduler and the memory pool to allocate and deallocate memory"
  • Supports CuDNN R2. No R3 support yet, Yangqing says the next version they are going to support is likely R4.
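To be clear about what "in-place" means here: an in-place op writes its result back into its input buffer instead of allocating a new tensor. A minimal NumPy illustration (not TensorFlow code; shapes are arbitrary):

```python
import numpy as np

x = np.random.randn(128, 64, 56, 56).astype(np.float32)  # one activation tensor, ~98 MB

y = np.maximum(x, 0)         # out-of-place ReLU: allocates a second ~98 MB tensor
np.maximum(x, 0, out=x)      # in-place ReLU: overwrites x, no extra allocation
```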

Coming to the benchmarks:

  • Googlenet with batchsize 128 goes Out of Memory. The largest batch-size I could fit is 16 (tried 16, 32, 64, 128)
  • VGG with batchsize 64 goes Out of Memory (Edit: VGG memory issue was solved by using the BFC allocator updated by GOOG). The largest batch-size I could fit is 32 (tried 32, 64).
  • I've also computed Torch7+CuDNN-R2 baselines for these batch-sizes.

AlexNet (One Weird Trick paper) - Input 128x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| CuDNN-R3 (Torch) | 96 | 32 | 64 |
| Nervana (Neon) | 101 | 32 | 69 |
| CuDNN-R2 (Torch) | 231 | 70 | 161 |
| TensorFlow | 326 | 96 | 230 |

Overfeat [fast] - Input 128x3x231x231

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| CuDNN-R3 (Torch) | 326 | 113 | 213 |
| fbfft (Torch) | 342 | 114 | 227 |
| CuDNN-R2 (Torch) | 810 | 234 | 576 |
| TensorFlow | 1084 | 316 | 768 |

OxfordNet [Model-A] - Input 64x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| Nervana | 590 | 180 | 410 |
| CuDNN-R3 (Torch) | 615 | 196 | 418 |
| CuDNN-R2 (Torch) | 1099 | 342 | 757 |
| TensorFlow | 1840 | 545 | 1295 |

GoogleNet V1 - Input 16x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| CuDNN-R2 (Torch) | 564 | 174 | 390 |
| TensorFlow | 590 | 54 | 536 |

Note that at a batch size of 16, GoogleNet with CuDNN-R2 + Torch likely runs into dispatching overhead, so it's an exotic comparison that isn't practically very interesting or encouraging.

There you go.

I'm assuming that the first release of TensorFlow is still quite unpolished, and that they will improve it over time with various memory and time optimizations baked in.

Owner

soumith commented Nov 11, 2015

The benchmark scripts and raw outputs are located here: https://github.com/soumith/convnet-benchmarks/tree/master/tensorflow
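For anyone who doesn't want to read the scripts: the measurement loop is roughly of the following shape - a few warm-up iterations, then an averaged timed loop, with backward time derivable as the difference between a forward+backward step and a forward-only step. This is a generic sketch with placeholder callables, not the actual script code:

```python
import time

def benchmark(forward, forward_backward, steps=100, warmup=10):
    """Return (total_ms, forward_ms, backward_ms) per iteration.

    `forward` and `forward_backward` are placeholder callables that run one
    iteration each and block until all GPU work has finished."""
    def timeit(fn):
        for _ in range(warmup):        # discard warm-up (autotuning, lazy allocation)
            fn()
        start = time.time()
        for _ in range(steps):
            fn()
        return (time.time() - start) / steps * 1000.0

    fwd_ms = timeit(forward)
    total_ms = timeit(forward_backward)
    return total_ms, fwd_ms, total_ms - fwd_ms

# Usage (illustrative): benchmark(lambda: run_forward(), lambda: run_forward_backward())
```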

scott-gray Nov 11, 2015

The lack of in-place operations is rather surprising. Once you have the full DAG it should be rather easy to apply a liveness algorithm to it to optimize tensor allocations. For an example see this: http://www.diku.dk/hjemmesider/ansatte/torbenm/ICD/Register.pdf (just replace register with tensor).

I'm kind of curious if there's any support for automatically compounding operations together, or for leveraging kernels that have some compounding built in (like the alpha/beta params of gemm). I'm pretty close to maximizing the amount of compounding that's possible in my benchmark networks. And because I write all my own kernels I can further compound things that aren't possible with closed-source libraries like cuDNN. For example, I'm now able to compute the mean along the PQN dimension inside the conv and gemm kernels at no cost. This cuts down the bandwidth required by batch norm in fprop by a third.

Though on the whole I think TensorFlow seems like a great platform to build on. I'd say there's a good chance my kernels will make their way there sooner rather than later. You can find new benchmarks of my latest winograd kernels in the updated paper here: http://arxiv.org/abs/1509.09308

What I'll be working on next is basically going to be taking a lot of what I learned implementing winograd and refreshing all of my conv/pooling/gemm kernels to support very small minibatches at near full utilization. This should have a big impact on the level at which you can scale these networks and the speed at which they converge. Here's a great paper exploring this: http://arxiv.org/abs/1509.04210
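To make the liveness idea concrete, here is a toy sketch (not TensorFlow or Neon code; all names and sizes are illustrative) of reusing tensor buffers once each tensor's last consumer is known, in the spirit of the register-allocation reference above:

```python
# Toy liveness-based buffer reuse over a topologically sorted op list.
# Each op produces one tensor; its buffer returns to a free pool once the
# tensor's last consumer has run.

def plan_buffers(ops):
    """ops: list of (output_name, output_bytes, input_names), topologically sorted.
    Returns a dict mapping tensor name -> shared buffer id."""
    last_use = {}
    for i, (_, _, inputs) in enumerate(ops):
        for t in set(inputs):
            last_use[t] = i                       # overwritten until the final consumer

    free, assignment, buf_size = [], {}, {}
    for i, (out, nbytes, inputs) in enumerate(ops):
        # Reuse any free buffer that is big enough, otherwise allocate a new one.
        fit = next((b for b in free if buf_size[b] >= nbytes), None)
        if fit is not None:
            free.remove(fit)
            assignment[out] = fit
        else:
            assignment[out] = len(buf_size)
            buf_size[assignment[out]] = nbytes
        # Buffers whose tensors die at this op go back into the pool.
        for t in set(inputs):
            if last_use[t] == i and t in assignment:
                free.append(assignment[t])
    return assignment

# Four intermediate tensors end up sharing two buffers.
ops = [("conv1", 4 << 20, ["input"]),
       ("relu1", 4 << 20, ["conv1"]),
       ("conv2", 2 << 20, ["relu1"]),
       ("relu2", 2 << 20, ["conv2"])]
print(plan_buffers(ops))  # {'conv1': 0, 'relu1': 1, 'conv2': 0, 'relu2': 1}
```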

yuzcccc commented Nov 11, 2015

Hi, I strongly recommend adding mxnet (https://github.com/dmlc/mxnet) to the comparison, which in my opinion may be the fastest DL library :)

mavenlin Nov 11, 2015

+1 for benchmarking mxnet, the fastest now.

strongbanker Nov 11, 2015

+1 for benchmarking mxnet

fvisin commented Nov 11, 2015

I would also love to see a comparison with Theano http://deeplearning.net/software/theano/ as it is another widely adopted deep learning library.

nkoumchatzky Nov 11, 2015

Thanks for benchmarking!

aaronwro Nov 11, 2015

+1 would love to see tensorflow benchmarked against mxnet, Theano, Autograd for Torch, and Caffe.

vincentvanhoucke Nov 11, 2015

Thanks @soumith! Yes, our only launch criterion for convnets was 'GoogLeNet within distance from CuDNN[R2]', and we've punted on a lot of performance work, including upgrading CuDNN, until after the initial release. Expect a lot of movement on that front in the coming weeks.

Owner

soumith commented Nov 11, 2015

@aaronwro @fvisin it's already benchmarked against Torch, Theano, Caffe. Look at the readme on the main page ( https://github.com/soumith/convnet-benchmarks/blob/master/README.md ).
I definitely need to pull my socks up and benchmark MXNet and Chainer.

@vincentvanhoucke thanks for your response. I assumed that you'd fix it over the next weeks / months :)

vincentvanhoucke Nov 11, 2015

@scott-gray let us know if you need help with compounding / graph rewriting. The graph representation is meant to make these kinds of operations possible, and the common subexpression elimination that TF currently uses is also meant as a demonstration of that. I suspect we might need to do a bit more to provide good APIs to make it easier to bake in compound kernels.

Owner

soumith commented Nov 11, 2015

There seems to be some misinterpretation by random people on social media that, because I work for Facebook, I'm attacking TensorFlow. That seems super weird, because I love the vision of TensorFlow, and there's no competition (one can write an XXX frontend for a TensorFlow backend).

My benchmarks have always been independently run and completely neutral. I've been running them forever now; it's sad that people misinterpret the slightest of things.
cc: @vincentvanhoucke

Collaborator

clementfarabet commented Nov 11, 2015

I will defend Soumith on this one: he has indeed been running these benchmarks for quite some time, and with complete neutrality.

fvisin commented Nov 11, 2015

@soumith Excellent, thank you!!

vincentvanhoucke Nov 11, 2015

@soumith no good deed goes unpunished ;) Please don't let this deter you from providing this valuable service to the community!

Contributor

Yangqing commented Nov 11, 2015

@soumith , I am sorry that some people interpreted things that way. I've always appreciated your benchmark, which creates a great atmosphere for us to look at bottlenecks and push forward the field as a whole community. We all owe you a big debt of gratitude.

@aaronwro

@soumith thanks!

jdemouth Nov 11, 2015

As always, that's super interesting. Thanks for pushing all of us toward more performance.

tqchen commented Nov 11, 2015

For memory optimizations, what we have found is that in-place optimization does not matter that much if the allocator is smart enough to do a static allocation before running the graph (as opposed to relying on a dynamic allocator). We have detailed what can be done here:

https://mxnet.readthedocs.org/en/latest/developer-guide/note_memory.html

which I assume applies to computation-graph frameworks such as TF, caffe2 and CGT.
@vincentvanhoucke @Yangqing

tqchen commented Nov 11, 2015

The general idea is not only to share memory of the same shape (i.e. in-place), but also across different shapes and sizes.

rajatmonga Nov 11, 2015

@soumith Thanks for running the benchmarks! As @vincentvanhoucke noted in this thread, our goal was to get an early release out so users can start playing with it and provide feedback on what they care about. We are committed to making TensorFlow fast and are actively working on the performance issues you highlight here.

alexbw commented Nov 11, 2015

@soumith You're doing a good deed! Haters gonna hate.

piiswrong Nov 11, 2015

I'm a little confused by the numbers. 1300 samples/sec seems too fast even for AlexNet on a single Titan X. Is this standard training, e.g. io+forward+backward+update, or just forward+backward?

@kyieldmark

Nice work.

antinucleon Nov 11, 2015

@piiswrong I will help @soumith make the benchmark script.

Anyway, we have opened everything up since the beginning. The main purpose is to learn from each other, not to advertise boring numbers.

koraykv commented Nov 11, 2015

I will also add my support to Soumith. He has been running these benchmarks for some time with complete transparency and neutrality.

sermanet Nov 11, 2015

@koraykv +1, thanks Soumith!

Owner

soumith commented Nov 11, 2015

Someone on reddit suggested that I build TensorFlow from source to fix the speed issues. That did not help; it produces the same numbers as the pip version on my AlexNet script:

https://gist.github.com/soumith/11acc2f0dbc5212ea372

@jeffdonahue jeffdonahue referenced this issue in tensorflow/tensorflow Nov 11, 2015

Closed

AlexNet with FC layers: backward is very slow? #113

Owner

soumith commented Nov 11, 2015

FWIW, Yangqing's fix to avoid CPU-GPU transfers improved results across the board by ~20%. (I've updated the tables above). The memory issues are unchanged.

XericZephyr Nov 11, 2015

+1 for mxnet! Thanks.

yeqinglee Nov 11, 2015

+1 for mxnet.

gujunli commented Nov 11, 2015

@soumith I have a naive question: is TensorFlow's result based on C++ code or cuDNN v2? I would guess that if you run on a Titan X, TensorFlow will rely on some CUDA library?

Owner

soumith commented Nov 11, 2015

@gujunli it's based on CuDNN V2.

mattjj commented Nov 11, 2015

@soumith thanks for running and maintaining these benchmarks; they're always thorough and informative!

gujunli commented Nov 11, 2015

@soumith Then I don't understand why TensorFlow with cuDNN v2 ends up being so slow. Can you share some of your understanding? I would guess TF still calls cuDNN v2 for the conv/pool/ReLU/FC layers. From your earlier AlexNet results, cuDNN v2 is 231 = 70 + 161 and Caffe (native ConvolutionLayer) is 324 = 121 + 203, yet TensorFlow is 326 = 96 + 230.

scott-gray Nov 11, 2015

Running the network under nvvp (nvidia visual profiler) should be pretty informative. A well tuned network timeline should just be a solid block of kernel calls with no gaps.

gujunli commented Nov 11, 2015

@scott-gray So you think TF scheduling may not be efficient? I need to read the TF whitepaper to understand how it works. Does anyone understand it?

scott-gray Nov 11, 2015

@gujunli I'm just saying if they're just using stock cuDNNv2 then the only reason it would be slower is if there were gaps in the timeline. Seeing where those gaps occur and any extra host/device memcpy traffic would give you a clearer picture of what's going wrong.
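Besides nvvp, the kernel timeline can also be dumped from inside TensorFlow itself. A hedged sketch, assuming a build that exposes `RunOptions` tracing and the `timeline` utility (this is not what anyone in this thread used; the stand-in graph is illustrative):

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Tiny stand-in graph; substitute the real model's forward+backward op.
x = tf.random_normal([1024, 1024])
y = tf.matmul(x, x)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
with tf.Session() as sess:
    sess.run(y, options=run_options, run_metadata=run_metadata)

# Write a Chrome-trace timeline; gaps between kernels and extra memcpy
# traffic show up clearly when the file is loaded into chrome://tracing.
trace = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(trace.generate_chrome_trace_format())
```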

raingo commented Nov 25, 2015

@vrv Got it. Thanks!

DavidWiesner Nov 26, 2015

@soumith They made some changes to TensorFlow ("TensorFlow: Improve performance of Alexnet"). Can you update the benchmark for AlexNet?

Contributor

ozabluda commented Nov 30, 2015

@soumith

Googlenet with batchsize 128 goes Out of Memory. The largest batch-size I could fit is 16 (tried 16, 32, 64, 128) [...] if my memory is right, it's really tight on space to do a batch size of 128 Googlenet in 12GB, and one needs in-place ops for sure.

For comparison, here are my measurements of approximate peak memory usage with Torch/cuDNNv3 on Titan-X:

AlexNet (128): 3 GB
OverFeat (128): 5 GB
VGG Model-A (128): OOM
GoogLeNet (128): 9 GB

VGG Model-A-11 (64): 8 GB
VGG Model-B-13 (64): 12 GB (I think this may fall back on slower algos due to tight memory)
VGG Model-D-16 (64): 12 GB (I think this may fall back on slower algos due to tight memory)
VGG Model-E-19 (64): 12 GB (I think this may fall back on slower algos due to tight memory)

VGG Model-A-11 (96): 11 GB

alexatknit Dec 10, 2015

@soumith Since its release I've seen pretty dramatic improvements in tensorflow's memory management and performance. I think it may be time to benchmark 0.6.0.

Owner

soumith commented Dec 10, 2015

@alexatknit Will do. I will take some time one of these days to do MXNet, Chainer and TF 0.6. I have been a bit busy lately with wrapping up research.

@zcyang zcyang referenced this issue in tensorflow/tensorflow Dec 12, 2015

Closed

memory issues #492

hgaiser commented Jan 5, 2016

I am looking forward to the updated comparison, have you found time to look into it?

Owner

soumith commented Jan 5, 2016

TensorFlow Trunk as of 1 hour ago (post 0.6 release) numbers:

AlexNet (One Weird Trick paper) - Input 128x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| CuDNN-R3 (Torch) | 96 | 32 | 64 |
| Nervana (Neon) | 101 | 32 | 69 |
| CuDNN-R2 (Torch) | 231 | 70 | 161 |
| TensorFlow 0.5 | 326 | 96 | 230 |
| TensorFlow 0.6+ | 292 | 70 | 222 |

Overfeat [fast] - Input 128x3x231x231

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| CuDNN-R3 (Torch) | 326 | 113 | 213 |
| fbfft (Torch) | 342 | 114 | 227 |
| CuDNN-R2 (Torch) | 810 | 234 | 576 |
| TensorFlow 0.5 | 1084 | 316 | 768 |
| TensorFlow 0.6+ | 856 | 204 | 652 |

OxfordNet [Model-A] - Input 64x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| Nervana | 590 | 180 | 410 |
| CuDNN-R3 (Torch) | 615 | 196 | 418 |
| CuDNN-R2 (Torch) | 1099 | 342 | 757 |
| TensorFlow 0.5 | 1840 | 545 | 1295 |
| TensorFlow 0.6+ | 1656 | 347 | 1309 |

GoogleNet V1 - Input 128x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| CuDNN-R3 (Torch) | 431 | 117 | 313 |
| TensorFlow 0.5 | OOM | OOM | OOM |
| TensorFlow 0.6+ | 1237 | 246 | 991 |

There you go.
The new logs are all checked in.

rajatmonga Jan 12, 2016

@soumith Thanks for running the numbers again. I know you have been asked to do this a number of times lately and it takes you away from your research. These benchmarks have been greatly useful for everyone.

After your run we realized we seem to have regressed in performance since the 0.6.0 release (mostly from our switch over to the public Eigen branch) and over the last few days @zheng-xq and @benoitsteiner along with others have made improvements to get back the performance. When running the benchmarks again at commit d1b8333, we get the following numbers:

| Model | Total (ms) | Forward (ms) | Backward (ms) |
| --- | --- | --- | --- |
| AlexNet | 229 | 69 | 160 |
| Overfeat [fast] | 839 | 203 | 636 |
| OxfordNet | 1216 | 329 | 887 |
| GoogleNet V1 - Input 128x3x224x224 | 815 | 234 | 581 |
  • This is measured on an unsuperclocked Titan-X with the default power-limit 250W.
  • For consistency, between each run, we wait for a few minutes for GPU to cool down to room temperature.

These results are also in line with what we see at 0.6.0 release.

We are also looking into setting up performance benchmarks with the builds so we don't hit such performance regressions.

Again, Thanks for all your updates.

vincenzocaselli Jan 25, 2016

Does anyone have experience with and/or comparisons against DL4J (http://deeplearning4j.org)?

Owner

soumith commented Jan 26, 2016

@rajatmonga just got back from vacay. It's cool that you guys are setting up contbuilds for perf regressions.

However, I don't get the numbers that you seem to be getting on TensorFlow as of yesterday (a27d844e05447e65aa279ae5269a2d75590f46f6). The numbers are slightly better, but not quite the improvement that you are seeing.

Look here for the new numbers: 1f09e1e

rajatmonga Jan 27, 2016

@soumith Thanks for running the benchmarks again. It is possible there are some memory-related regressions that are hurting performance again. What you have right now is good; let's not worry about this.

We are working on getting cuDNN R4 fully supported and will address the remaining performance issues in that context. We may ping this thread once we have a full release with R4, and it will be worthwhile rerunning the benchmarks then - likely for many of the libraries.

Also, let me know if we can help you with this project in any way - it is very useful to the community, but I am sure it takes a lot of your time as well. Thanks for keeping this going!

rajatmonga Feb 4, 2016

Yes, that is on our list of tasks and is quite important to make sure we don't have performance regressions. We haven't been able to get to it yet.

On Thu, Feb 4, 2016 at 9:11 AM, Madder wrote:

> Has anyone thought of running these benchmarks periodically as part of tensorflow's CI for instance?

cgel commented Feb 16, 2016

TF 0.7.0 released!
Looking forward to the updated benchmarks.

@MikalaiDrabovich

👍 +1

ronghanghu Feb 23, 2016

Great results 👍 👍 👍

Looking forward to the results with cuDNN v4

Madder commented Feb 23, 2016

+1


Owner

soumith commented Feb 29, 2016

As requested, TF 0.7 + CuDNN R4 has been benchmarked. CuDNN R4 + Torch has also been benchmarked as a baseline.

Among Nervana's Neon, Torch + CuDNN4 and TensorFlow + CuDNN4 (Caffe + CuDNN is likely in the same ballpark as Torch), TensorFlow (at commit tensorflow/tensorflow@1d4f00d) still lags behind the others by 2x to 3x in performance on AlexNet, VGG and GoogleNet. It is within 1.5x on Overfeat.

Owner

soumith commented Feb 29, 2016

For full details, see the main README.md: https://github.com/soumith/convnet-benchmarks/blob/master/README.md and the raw logs are located here: 2888b23

Owner

soumith commented Feb 29, 2016

I have not changed the benchmark scripts in any way, so if the TF benchmark scripts need any change (such as new allocator settings etc.), I welcome the TF folks to let me know.

rajatmonga Feb 29, 2016

Thanks @soumith, this isn't quite where we had seen our numbers, but we will look at the tests again and ping you if we notice something.

Thanks again for running these benchmarks!

Owner

soumith commented Feb 29, 2016

Thanks Rajat, happy to investigate further. I built TF from source, and configured it with CUDA 7.5 + CuDNN-4, if that helps. The commit is tensorflow/tensorflow@1d4f00d

nryant commented Feb 29, 2016

I've had similar numbers using CUDA 7.0, cuDNN v4, and tensorflow/tensorflow@b889710 on a Titan X. Tried fiddling with device placement and the session config, but it made no material difference in the results. @rajatmonga, out of curiosity are you using cuDNN and nvcc internally, or gpucc?

Owner

soumith commented Mar 2, 2016

@nryant Thanks for the additional data point. I am honestly very nervous whenever I have to deliver any negative news on convnet-benchmarks. FWIW, @spezzer on reddit also confirmed that it was a data-layout thing: https://www.reddit.com/r/MachineLearning/comments/487fmo/convnetbenchmarks_updated_with_numbers_for/d0i7ord
I'm closing this issue now, as we have benchmarked TensorFlow across multiple versions and given it enough time and data. I will of course keep updating it over time as appropriate.
Thanks all.

@soumith soumith closed this Mar 2, 2016

Contributor

vrv commented Mar 2, 2016

@soumith: I think in this case it's a combination of layout and some Eigen improvements that hadn't made its way upstream -- we're looking at both of these actively. Thanks again for your effort -- we'll let you know when it makes sense to update the numbers (and provide our own for comparison).

thinxer commented Mar 6, 2016

A recent commit adds NCHW support for BiasAdd, which results in about 40% speed up.

tensorflow/tensorflow@d6f3ebf
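For anyone trying this out, the idea is to keep activations in NCHW end to end so no layout transposes get inserted between cuDNN calls. A minimal sketch, assuming a TensorFlow build recent enough that both conv2d and bias_add accept data_format='NCHW' (the shapes and variable names below are illustrative, not from the benchmark scripts):

```python
import tensorflow as tf

images = tf.placeholder(tf.float32, [128, 3, 224, 224])          # N, C, H, W
kernel = tf.Variable(tf.random_normal([11, 11, 3, 96]) * 0.01)   # kH, kW, inC, outC
biases = tf.Variable(tf.zeros([96]))

# strides follow the data_format ordering: [batch, channel, height, width]
conv = tf.nn.conv2d(images, kernel, strides=[1, 1, 4, 4],
                    padding='VALID', data_format='NCHW')
out = tf.nn.relu(tf.nn.bias_add(conv, biases, data_format='NCHW'))
```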

Contributor

vrv commented Mar 6, 2016

@thinxer: we'll let @soumith know when to update the numbers, but thanks for noticing :)

Owner

soumith commented Mar 6, 2016

That's really cool, thanks for letting me know. I'm doing a new, complete set of benchmarks for deep learning, not just convnets; I will cover this commit in them.

rajatmonga Mar 6, 2016

Thanks @soumith! No rush though.

We have most of the pieces together to support NCHW and expect to see more gains once we update the models to use that. Will ping you once that is ready as well. This commit helps quite a bit (it was another regression on our part). Of course the layout changes will mostly help convnets and not other kinds of models.

shendiaomo Mar 16, 2016

How about TensorFlow 0.7?

djebm2 commented Mar 18, 2016

Thanks for the benchmark, @soumith. Looking forward to the new, updated TensorFlow.
