
[August 2015] Rejigging the marks... #46

Closed
5 of 6 tasks
soumith opened this issue Aug 2, 2015 · 56 comments

Comments

@soumith
Owner

soumith commented Aug 2, 2015

With cuDNN R3 coming in, improvements to Nervana Neon, faster Facebook kernels, and a new kid on the block called Chainer, I will be doing a minor re-run of the benchmarks to see how things have improved.

Target date: August 15th.

I am still thinking quite a lot about how to take the benchmarks forward: beyond ConvNets, beyond images (into NLP, video, and audio), and beyond single-GPU. If any domain experts have suggestions (especially for audio and NLP), please do write to me.

The only thing that stopped me from multi-GPU benchmarks was the lack of enough frameworks that support it. That seems to have changed, and a decent number of frameworks now support multi-GPU, so I will plan on that.

More fun to come soon.

Checklist:

  • CuDNN R3
    • fp 16
    • fp 32
  • Nervana Neon
    • fp 16
    • fp 32
  • Chainer
  • CL-Torch
  • CL-Caffe (greentea)
  • FB-CuNN
@scott-gray

Any plans to incorporate batchnorm in this run? What about other potential bottlenecks in the network, like data loading and augmentation? Will fp16 testing now become standard, now that it is available in cuDNN (as far as I know)?

I think we're working on some standard rnn/lstm network benchmarks so that might be a place to start for other network types.

@naibaf7

naibaf7 commented Aug 8, 2015

@soumith
How could we include Caffe #2610 (BVLC/caffe#2610) for testing?

@hughperkins
Contributor

@naibaf7

Technically, objectively:

  • fork this repo
  • add a subdirectory, with:
    • script that installs your software, runs it, prints benchmarking results
    • a short README explaining how to run the script
  • submit a pull request
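
For reference, a minimal sketch of what the timing-and-printing part of such a script could look like (Python, with hypothetical names; the install steps and the framework-specific network setup are stubbed out behind `forward`/`backward` callables):

```python
import time

def benchmark(name, forward, backward, n_iter=10):
    """Time forward and forward+backward passes and print averages in ms.

    `forward` and `backward` are placeholders for calls into whatever
    framework is being benchmarked; for GPU frameworks they should block
    until the device has finished (e.g. by synchronizing) so that the
    timings are meaningful.
    """
    forward(); backward()          # warm-up, so one-time setup is not timed
    start = time.time()
    for _ in range(n_iter):
        forward()
    fwd_ms = (time.time() - start) / n_iter * 1000
    start = time.time()
    for _ in range(n_iter):
        forward(); backward()
    total_ms = (time.time() - start) / n_iter * 1000
    print("%-12s forward: %8.2f ms  backward: %8.2f ms  total: %8.2f ms"
          % (name, fwd_ms, total_ms - fwd_ms, total_ms))
```

Each framework's subdirectory would then wrap its own layer definitions behind these callables and call `benchmark()` once per network or layer being measured.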

Subjectively, my opinion:

  • you need a unique name
  • I really like your Greentea name
  • Suggest using either 'Caffe-Greentea', or simply 'Greentea'

@hughperkins
Contributor

For cltorch: I just noticed that the test/test-perf.lua script in the clnn repo is missing the layer 2 single-layer timings, so I've pushed an update to the clnn repo just now:

  • uncommented the single-layer layer 2 test
  • removed colors
  • it now prints the name of each layer

If you reinstall clnn, it should pull in these changes, if needed.

@soumith
Owner Author

soumith commented Aug 23, 2015

@hughperkins thanks, will do.

@naibaf7

naibaf7 commented Aug 23, 2015

@soumith
When can we expect results of the benchmarks? :)

@soumith
Owner Author

soumith commented Aug 23, 2015

@naibaf7 I am concluding them. I have finished benchmarking everything except fb-cunn, cuDNN in FP16 mode, and Chainer. Hopefully by Monday I will write a detailed comment with my findings.

@naibaf7

naibaf7 commented Aug 23, 2015

@soumith
Do you also test on the CPU, or is only the Titan X evaluated?
Caffe can also be run in CPU mode, and Greentea supports CPUs very well in OpenCL mode.

@soumith
Owner Author

soumith commented Aug 23, 2015

@naibaf7 at the moment I am only doing the GPU side of things.
For Caffe, Torch, and GreenTea I suppose I can also run CPU without too much effort on my side. I did not think people were interested in those numbers.

@hughperkins
Contributor

I guess the CPU would mostly be used only during development, whereas actual training will tend to be GPU-based?

@naibaf7

naibaf7 commented Aug 23, 2015

@hughperkins
For the moment, it seems that way. But with upcoming asynchronous solvers and MPI support, as well as well-parallelized backends (the Caffe CPU backend is single-threaded except for BLAS calls, while the GreenTea OpenCL backend uses parallelized kernels and a parallel CPU BLAS), it might become reasonably interesting again to use existing CPU clusters for training.

A second perspective is APU/HSA devices. Using an i7-4790K, for example, the Caffe CPU backend gives a baseline speed of 1x on AlexNet; Greentea on the same CPU is almost 2x, and the integrated graphics alone evaluates at roughly 1.5x to 2x.
When splitting AlexNet across the integrated graphics and the CPU, 4x the speed of the old Caffe CPU backend can be reached on the same device. This already approaches the training speed of mid-range GPUs.

Just something to keep in mind, given that the future exascale devices we will have to work with look much like this:
http://www.hpcwire.com/2015/07/29/amds-exascale-strategy-hinges-on-heterogeneity/

@hughperkins
Contributor

@fabian Re: APU/HSA devices, interesting, will reply in your PR, to keep this thread clean(er).

@hughperkins
Contributor

@soumith by the way, a quick heads-up before you inadvertently step into a minefield :-P There are actually multiple Caffe OpenCL forks, none of which has been officially recognized/endorsed, and I am not aware of any plans to merge any of the forks into mainline Caffe in the near term.

Therefore, I strongly recommend choosing a name for Fabian's fork that does not imply it is the one and only Caffe OpenCL fork. I think that using 'greentea' or 'caffe greentea' meets this requirement, and will plausibly provide you a mine-free life :-)

@soumith
Owner Author

soumith commented Aug 25, 2015

@hughperkins thanks for the heads-up :) Everything is done except Chainer. That should be finished tomorrow as well, along with the write-up.

I will do the CPU candidates later, but there seem to be a few. I have to collect them, read a bit about each, and get to properly benchmarking each of them. CPU benchmarking gets much more complicated in general.

@bhack

bhack commented Aug 25, 2015

/cc @gujunli

@naibaf7

naibaf7 commented Aug 25, 2015

@soumith
Oh, you plan on doing CPU? Cool :)
Remember to use a good BLAS (OpenBLAS compiled from source, or MKL) on your CPU with Greentea and Caffe, and remember to configure it in Makefile.config.
Additionally, on Greentea, the CPU must be found via device_query and then selected with the -gpu=x flag, where x is the ID of the CPU.

Thanks :)

@hughperkins
Contributor

@soumith Note that the 3rd Caffe OpenCL fork is public now :-P https://github.com/gujunli/OpenCL-caffe-upstream-test.

@bhack

bhack commented Aug 25, 2015

@soumith I think that @michaellarabel of phoronix.com has a lot of interesting hardware to run your benchmarks on. I think you could talk with him.

@naibaf7

naibaf7 commented Aug 25, 2015

@bhack @hughperkins @soumith
We compared the OpenCL performance.
On nVidia hardware, AlexNet runs at approximately the same speed with the BVLC/caffe#2610 and BVLC/caffe#2195 PRs of Caffe.

So what you will see in your benchmarks applies equally to both at the moment. For AMD hardware, it would be a bit different right now.

@hughperkins
Contributor

Note: apparently the link to the AMD repo above is just a test link, gone now.

@bhack

bhack commented Aug 26, 2015

@hughperkins Really strange timing, considering its commit history was more than a month old and the license headers in the source files had already been changed to an AMD copyright.

@naibaf7

naibaf7 commented Aug 26, 2015

@bhack @hughperkins
Please, no speculation for now; it just adds to the confusion, I think :)
We're planning to discuss how to go forward with OpenCL soon, as it's obviously not good for the cause to have so many branches.

@bhack
You kind of advertised my branch/PR everywhere.

@bhack

bhack commented Aug 26, 2015

@naibaf7 Yes, AMD people operating under cover on GitHub doesn't really help to clarify the situation. I'm confident that you will bring the discussion back into a public space as soon as you can.

@hughperkins
Contributor

Concur with bhack's view.

@naibaf7

naibaf7 commented Aug 26, 2015

@bhack @hughperkins

Yes, I know, but as a clarification, I think AMD is okay with me sharing this:
One of the branches, from junli gu, was actually an internal AMD research branch. Robert started his OpenCL branch unofficially in his spare time. I started my OpenCL branch shortly after that, as part of my thesis, with AMD sponsorship.

As of now, there is no OpenCL branch that has full official support from AMD. Please note that all branches have their pros and cons; this needs to be resolved, along with a plan on how to proceed (who does what, whether to merge or keep feature branches, device abstraction, speeding up the convolutions...).

#2195 and #2610 have also been concurrent projects until now, which did drive some advancements but also led to heated discussions. This will be resolved now, and we will plan collaboration.

However, it is a bad idea to hold all of these discussions in public first (confusion), especially because only a handful of developers are involved.

@hughperkins
Contributor

One of the branches, from junli gu, was actually an internal AMD research branch. Robert started his OpenCL branch unofficially in his spare time. I started my OpenCL branch shortly after that, as part of my thesis, with AMD sponsorship.

It seems a bit sub-optimal to me for a company which gives the impression of not being overly endowed with cash to use three resources to work independently on the exact same problem, whilst the AMD compiler is still buggy and its optimization plausibly patchy. Even if you argue 'well, two of them were working for free', the opportunity cost is still massive. Theano is still CUDA-only, Chainer too, and so on.

@naibaf7

naibaf7 commented Aug 26, 2015

@hughperkins
That's why I would prefer not to see speculation on here. I just told you what I know to clear up the situation; now I'll discuss with AMD how to proceed, so let's see about that first before jumping to conclusions.

Besides, the branches also have some specific advantages and domain-specific optimizations, so there's a lot to build on when going forward with OpenCL.

@soumith
Owner Author

soumith commented Aug 26, 2015

@hughperkins when a company is large and distributed, parallel efforts might also happen out of interest. Let's not waste a discussion on this speculation. FYI, there's also a C++ AMP implementation of the Torch backends for AMD (funded by AMD) here: https://github.com/NEELMCW/MCW_CPPAMP_TORCH but it is not as clean and nice as your stuff.

@hughperkins
Contributor

@soumith Basically, I don't want to see AMD go the way of Sun, since AMD is pretty much the main competitor to NVIDIA right now, although Intel does have some offerings, mostly integrated GPUs at the moment, AFAIK. Sun, in my opinion, seemed to invest tons of money in open source, which didn't seem to generate any return. I reckon AMD should either focus on AMD-specific stuff, like the compilers and so on, or else I don't see any reason why they can't make proprietary, non-free libraries that generate revenue. Intel does this with MKL, and MKL seems to be doing OK for itself. I know this might seem odd coming from someone who writes a lot of open source, but I'd rather have an AMD that produces non-free stuff and survives than one that produces lots of free stuff and gets eaten :-(

@naibaf7

naibaf7 commented Aug 26, 2015

@soumith
I saw you changed the ViennaCL installation script for Greentea. Are you on a distribution which does not ship ViennaCL via APT/DNF/YUM?

@soumith
Owner Author

soumith commented Aug 26, 2015

@naibaf7 I am on Ubuntu 14.04, but the makefile wasn't picking it up, so I just did things manually; I did not look too much into it.

@naibaf7

naibaf7 commented Aug 26, 2015

OK, that's interesting. Here and on Amazon servers it worked with Ubuntu 14.04... well, anyway, cool that you could make it work for you.
The more advanced method is to switch to clBLAS, but I think being able to use either BLAS library is a pretty cool feature, for easy installation and flexibility.

@bhack

bhack commented Aug 27, 2015

@michaellarabel has an AMD R9 Fury close at hand. @naibaf7 It would be great to benchmark on that.

@gujunli

gujunli commented Aug 28, 2015

I have an R9 Fury, if anyone wants to test performance on it. I am not sure how easy access would be, but if I can access your code, that will be easy. @naibaf7 @hughperkins

@gujunli

gujunli commented Aug 28, 2015

@naibaf7 I liked your earlier comment about working on HSA/APU. We have done some work on Caffe for APUs. I would like to discuss it with you.

@gujunli

gujunli commented Aug 28, 2015

@soumith We have some evaluation results of OpenCL CAFFE (a research lab internal version) on Fury and W9100. I wonder whether it helps for me to share the results with you but now the code for now. We evaluated against a Titan X and a GTX 980. It might be nice to know where we stand on your performance list.

@bhack

bhack commented Aug 28, 2015

@gujunli is "but now the code for now" interpreted as "but not the code for now"?

@bhack

bhack commented Aug 28, 2015

@soumith Is the Intel framework at https://github.com/01org/idlf testable?

@hughperkins
Contributor

re: "can I submit results without sourcecode, using a non-publically available library?" What benefit do you see in doing this?

@bhack

bhack commented Aug 28, 2015

@hughperkins It was my fault for pointing everyone to an internal repository of an AMD Chinese research center that was maintained as a public GitHub repository.

@hughperkins
Contributor

@bhack No, I put Junli on my 'follow' list long ago, so I saw anyway :-)

@bhack

bhack commented Aug 28, 2015

@hughperkins It was not an AMD Chinese center but the US research center in Illinois.

@hughperkins
Contributor

@bhack You know that AMD might have more than one office, right? :-P

@bhack

bhack commented Aug 28, 2015

@hughperkins Yes, my bad. Sunnyvale, California.

@hughperkins
Contributor

@bhack Hmmm, what I thought I was communicating is not exactly what I communicated. Anyway... I wouldn't read too much into geographic locations; it is a global corporation.

@bhack

bhack commented Aug 28, 2015

@hughperkins Yes, but my "location" was inferred only from the statement "I am leading the AMD research's DNN project". So probably she is directing this effort with resources distributed around the globe, and there is really no physical DNN group co-located in the Sunnyvale facilities.

@naibaf7

naibaf7 commented Aug 28, 2015

@bhack @hughperkins
Okay... you are very off topic now, just saying :D
Not sure if @soumith is very happy with that.

To stop the speculation once again:
Currently the DNN people are physically in the same location, so don't worry about coordination; it will be great. AMD is pulling together all the important people to get this done the right way.

I also found AMD more reachable and easier to contact than nVidia and Intel; it is fun to work with them.
I once won an Intel ISEF award, and even then it was crazy difficult to get in contact with anyone other than public relations people at Intel.

@bhack

bhack commented Aug 28, 2015

@naibaf7 Yes, we are surely off topic, it is true. I'm really happy that AMD is starting to have some kind of direction and coordination now, but you can agree with me that the approach has been quite confusing and scattered. Removing repositories shortly after they are posted about, and without a comment, is generally not good marketing for a big company like AMD.
But never mind, it was only a start, with a few small false steps.

@hughperkins
Contributor

Currently the DNN people are physically in the same location, so don't worry about coordination

I'm not sure that's quite true, unless your definition of 'DNN people' is different from my own interpretation of it.

I also found AMD more reachable and easier to contact than nVidia and Intel

I never tried to contact nvidia, or Intel actually... I reckon CUDA projects have a fair amount of contact with nvidia, though. Caffe has sponsorship from nvidia, or at least the provision of one or more nvidia GPUs, right on their front page.

@bhack

bhack commented Aug 28, 2015

@hughperkins Really, it is off topic. The only important thing here is to understand which version of OpenCL Caffe will be benchmarked. I don't know whether it is useful to have benchmark results from private versions; only @soumith can tell us.

@hughperkins
Contributor

@soumith thank you very much for providing these benchmark results. Very useful :-)

@naibaf7

naibaf7 commented Aug 28, 2015

@soumith
Thanks - clearly there is work to do on the GreenTea convolution code.
There is also a bottleneck in backward processing that needs to be solved.
Performance is expected to improve a lot within the next two months (batched GEMM/GEMV).

It is interesting to see how much slower an OpenCL implementation is with minibatches when using code identical to CUDA Caffe - with optimized OpenCL code this will change.

@bhack

bhack commented Aug 28, 2015

Kudos to @scott-gray. He currently has the fastest open-source implementation.

@naibaf7

naibaf7 commented Aug 29, 2015

@bhack
It's probably a really good idea to replicate his kernels in GCN assembly then ;)

Oh, on another note, this run used ViennaCL's BLAS; we can probably get higher performance with clBLAS next time.

@lunochod
Have you seen this? There is still a lot of work to do on the OpenCL (compute kernel) side. It seems that just duplicating CUDA kernels does not really give good performance (as we already know...).

@scott-gray

Looks like cuDNN is catching up. Those VGG numbers are really good for an FFT implementation. Perhaps with a bit more optimization they can overtake the spatial domain on 3x3 filters? I wouldn't be surprised if we see much better fp16 numbers from them soon.

My GoogLeNet numbers may look good, but I still have a lot of optimizations to make for the smaller feature-map values in there. Right now I'm optimized for multiples of 64; I'll get that down to 32 this weekend. My CHWN tensor layout is also really helpful on those inception groupings.

A brand new version of neon is about to be released. You'll be able to run all these networks out of the box (plus lots more). The new syntax is much improved and more Torch- or Keras-like (perhaps even better).

Anyway, here's a changelog of updates since the last version:

No more multiplying by zero to implement padding in fprop and bprop (I now slice both the input and the filter)
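
As a toy illustration of the padding-by-slicing idea (plain NumPy, 1-D and single-channel, not neon's actual GPU kernels): the zero-padded elements contribute nothing to the dot product, so slicing both the input and the filter to their valid overlap gives the same answer as explicit zero padding.

```python
import numpy as np

# Toy 1-D illustration (not neon's kernels): at a padded border position,
# the zero-padded elements contribute nothing to the dot product, so
# slicing both the input and the filter to their valid overlap gives the
# same result as explicit zero padding.
x = np.random.randn(8)      # input row
w = np.random.randn(5)      # filter
pad = 2                     # left padding

# first output element, computed with explicit zero padding
x_padded = np.concatenate([np.zeros(pad), x])
out_padded = np.dot(x_padded[:len(w)], w)

# same element, computed by slicing the input and the filter instead
out_sliced = np.dot(x[:len(w) - pad], w[pad:])

assert np.allclose(out_padded, out_sliced)
```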

Figured out a different way to do integer division for the now dynamically sized slice lookup table.

No more atomic adds in bprop. I've cast bprop as fprop upside down and the kernels are nearly identical. It requires a dimshuffle on the filter but this just takes microseconds and a small amount of additional memory that can be shared with all conv ops. Bprop used to be bandwidth bound on those atomic adds.
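
A toy NumPy sketch of the underlying identity (again 1-D and single-channel, not neon's actual kernels): the data gradient of a valid correlation can be computed either with scatter-adds or as a single fprop-like full convolution of the output gradient with the filter; the multi-channel case additionally needs the filter dimshuffle mentioned above.

```python
import numpy as np

# Toy 1-D, single-channel illustration (not neon's kernels) of why bprop
# can be cast as an fprop-like operation: the data gradient of a valid
# correlation equals a full convolution of the output gradient with the
# same filter, i.e. a gather instead of a scatter with atomic adds.
x = np.random.randn(16)                      # input
w = np.random.randn(5)                       # filter
dy = np.random.randn(len(x) - len(w) + 1)    # gradient w.r.t. the output of
                                             # y = np.correlate(x, w, 'valid')

# scatter-style bprop: each output gradient is added back onto its input window
grad_scatter = np.zeros_like(x)
for i in range(len(dy)):
    grad_scatter[i:i + len(w)] += dy[i] * w

# gather-style bprop: one fprop-like full convolution gives the same result
grad_gather = np.convolve(dy, w, mode='full')

assert np.allclose(grad_scatter, grad_gather)
```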

Tweaked the p,q block ordering to improve L2 cache performance. I'm using a zigzag pattern now for all operations.

Update already had a mode where you could stack all the gemm ops to eliminate atomic adds, but I've streamlined that stacking operation. Update also now fetches 32 rows deep. This comes at the cost of an instruction cache miss inside the main gemm loop, but that is easily covered by the occupancy. The reason for doing this is the same as for using a 32x33 shared memory block to implement a transpose: with N contiguous, the update op has expensive strided access patterns on both the input and the delta.

I also eliminate all shared memory bank conflicts when storing the global loads to shared with some clever shifting.

Added a beta param to bprop to allow delta accumulation for inception groupings.

soumith closed this as completed Aug 29, 2015
@soumith
Owner Author

soumith commented Aug 29, 2015

Sorry, I closed the issue as it got side-tracked by lots of other discussions. Let's discuss more in #56.

Scott, I'd appreciate it if you re-pasted your comment there so we can discuss further.
