[August 2015] Rejigging the marks... #46
Comments
Any plans to incorporate batchnorm in this run? What about other potential bottlenecks in the network, like data loading and augmentation? Will FP16 testing now become standard, with it being available in cuDNN (as far as I know)? I think we're working on some standard RNN/LSTM network benchmarks, so that might be a place to start for other network types.
@soumith
Technically, objectively:
Subjectively, my opinion:
For cltorch: I just noticed the test/test-perf.lua script in the clnn repo is missing the layer 2 single-layer timings, so I've pushed an update to the clnn repo just now:
If you reinstall clnn, it should bring down these changes, if needed.
@hughperkins thanks, will do.
@soumith
@naibaf7 I am concluding them. I finished benchmarking everything except fb-cunn, cuDNN in FP16 mode, and Chainer. Hopefully by Monday I will write a detailed comment with my findings.
@soumith
@naibaf7 At the moment I am only doing the GPU side of things.
I guess CPU would mostly only be used during development, whereas actual training will tend to be GPU-based?
@hughperkins A second perspective will be with APU/HSA devices. Using an i7-4790K, for example, on the Caffe CPU backend gives a speedup of 1x on AlexNet; using it on the Greentea backend is almost 2x. The integrated graphics would also evaluate at 1.5x to 2x. Just something to keep in mind when seeing that the future exascale devices we have to work with look much like this:
@fabian Re: APU/HSA devices, interesting, will reply in your PR, to keep this thread clean(er).
@soumith By the way, a quick heads-up before you inadvertently step into a minefield :-P There are actually multiple Caffe OpenCL forks, none of which have been officially recognized/endorsed, and I am not aware of any plans to merge any of the forks into main Caffe in the near term. Therefore, I strongly recommend choosing a name for Fabian's fork which does not imply that it is the one and only Caffe OpenCL fork. I think that using 'greentea' or 'caffe greentea' meets this requirement, and will plausibly provide you a mine-free life :-)
@hughperkins thanks for the heads up :) . Everything done except Chainer. That should be done tomorrow as well, along with the write-up. Will do CPU candidates later, but there seem to be a few; I have to collect them, read a bit on each, and get to properly benchmarking each of them. CPU benchmarking gets much more complicated in general.
/cc @gujunli
@soumith Thanks :)
@soumith Note that the 3rd Caffe OpenCL fork is public now :-P https://github.com/gujunli/OpenCL-caffe-upstream-test. |
@soumith I think that @michaellarabel of phoronix.com has a lot of interesting hardware to run your benchmark on. I think you could talk with him.
@bhack @hughperkins @soumith So what you will see in your benchmarks is equally applicable at the moment. For AMD hardware, it would be a bit different right now.
Note: apparently the link to the AMD repo above was just a test link; it's gone now.
@hughperkins Really strange timing, considering its commit history was older than one month and the license in the source files had already been changed to an AMD copyright.
@bhack @hughperkins
@naibaf7 Yes, AMD people operating under cover on GitHub doesn't really help to clarify the situation. I'm confident that you can put the discussion back in a public space as soon as you can.
Concur with bhack's view.
Yes, I know, but as a clarification, I think AMD is okay with me sharing this: as of now, there is no OpenCL branch that has full official support from AMD. Please note that all branches have their pros and cons, and this needs to be resolved, as well as a plan for how to proceed (who does what, plans on merging or a feature branch, device abstraction, speeding up the convolutions...). #2195 and #2610 have also been concurrent projects until now, which did power some advancements but also led to heated discussions. This will be resolved now and we will plan collaboration. However, it is a bad idea to lead all discussions in public first (confusion), especially because only a handful of developers are involved.
It seems a bit sub-optimal to me for a company which gives me the impression of not being overly endowed with cash to use three resources to work independently on the exact same problem, whilst the AMD compiler is still buggy and optimization plausibly patchy. Even if you argue 'well, two of them were working for free', the opportunity cost is still massive. Theano is still CUDA-only, Chainer too, and so on.
@hughperkins Besides, the branches also have some specific advantages and domain-specific optimizations, so there's a lot to profit from now when going forward on OpenCL.
@hughperkins When a company is large and distributed, parallel efforts might also happen out of interest; let's not waste a discussion on this speculation. FYI, there's also a C++ AMP implementation of the Torch backends for AMD (funded by AMD) here: https://github.com/NEELMCW/MCW_CPPAMP_TORCH but it's not as clean and nice as your stuff.
@soumith Basically, I don't want to see AMD go the way of Sun, since AMD are pretty much the main competitor to NVIDIA right now, although Intel do have some offerings, but mostly integrated GPUs right now, AFAIK? Sun, in my opinion, seemed to invest tons of money in open source, which didn't seem to generate any return. I reckon AMD should either focus on AMD-specific stuff, like the compilers and so on, or else I don't see any reason why they can't make proprietary, non-free libraries that generate revenue. Intel do this with MKL, and MKL seems to be doing OK for itself. I know this might seem odd coming from someone who writes lots of open source, but I'd rather have an AMD producing non-free stuff and surviving than producing lots of free stuff and getting eaten :-(
@soumith
@naibaf7 I am on Ubuntu 14.04, but the makefile wasn't picking it up, so I just did things manually; I did not look too much into it.
OK, that's interesting. Here and on Amazon servers it did with Ubuntu 14.04... well, anyway, cool that you could make it work for you.
@michaellarabel has an AMD R9 Fury close at hand. @naibaf7 It would be great to benchmark on that.
I have an R9 Fury, if anyone wants to test performance on it. I am not sure
@naibaf7 I like your earlier comment about you working on HSA/APU. We have done some work on Caffe on APUs. I would like to discuss with you.
@soumith We have some evaluation results of OpenCL Caffe (research-lab internal) on Fury and W9100. I wonder whether it helps for me to share the results with you but now the code for now. We evaluated against Titan X and GTX 980. It might be nice to know where we are now on your performance list.
@gujunli is "but now the code for now" interpreted as "but not the code for now"? |
@soumith Is the Intel framework at https://github.com/01org/idlf testable? |
re: "can I submit results without sourcecode, using a non-publically available library?" What benefit do you see in doing this? |
@hughperkins It was my fault to have pointed everyone to an internal repository of the AMD Chinese research center that was maintained on a public GitHub repository.
@bhack No, I put Junli on my 'follow' list long ago, so I saw it anyway :-)
@hughperkins It was not an AMD Chinese center but the USA research center in Illinois.
@bhack You know that AMD might have more than one office, right? :-P
@hughperkins Yes, my bad. Sunnyvale, California.
@bhack Hmmm, what I thought I was communicating is not exactly what I communicated. Anyway... I wouldn't read too much into geographic locations; it is a global corporation.
@hughperkins Yes, but my "location" was referenced only by the declaration "I am leading AMD research's DNN project". So probably she is directing this effort with distributed resources around the globe, and there is really no physical DNN group co-located in the Sunnyvale facilities.
@bhack @hughperkins To stop speculation once again: I also found AMD more reachable and easier to contact than nVidia and Intel; it is fun to work with them.
@naibaf7 Yes, we are surely off topic, it is true. I'm really happy that AMD is starting to have some kind of direction and coordination now, but you can agree with me that the approach was quite confusing and sparse. Removing repositories shortly after posting about them, and without a comment, is generally not good marketing for a big company like AMD.
I'm not sure that's exactly true, unless your definition of 'DNN people' is different from my own interpretation of this.
I never tried to contact NVIDIA. Or Intel, actually... I reckon that CUDA projects have a fair amount of contact with NVIDIA, though. Caffe has sponsorship by NVIDIA, or at least the provision of one or more NVIDIA GPUs by NVIDIA, right on their front page.
@hughperkins Really, it is off topic. The only important thing here is to understand what version of OpenCL Caffe will be benchmarked. I don't know if it is useful to have benchmark results of private versions, but only @soumith can tell us.
@soumith Thank you very much for providing these benchmark results. Very useful :-)
@soumith Interesting to see how much slower an OpenCL implementation is with minibatches when using code identical to CUDA Caffe - with optimized OpenCL code this will change.
Kudos to @scott-gray. He actually has the fastest open-source implementation.
@bhack Oh, on another note, this was ViennaCL-BLAS. Probably higher performance with clBLAS next time. @lunochod
Looks like cuDNN is catching up. Those VGG numbers are really good for an FFT implementation. Perhaps with a bit more optimization they can overtake spatial domain on 3x3 filters? I wouldn't be surprised if we see much better fp16 numbers from them soon. My GoogLeNet numbers may look good, but I still have a lot of optimizations to make for the smaller feature-map values in there. Right now I'm optimized for multiples of 64; I'll get that down to 32 this weekend. My CHWN tensor layout is also really helpful on those inception groupings.
A brand new version of neon is about to be released. You'll be able to run all these networks out of the box (plus lots more). The new syntax is much improved and more torch- or keras-like (perhaps better even). Anyway, here's a changelog of updates since the last version:
- No more multiplying by zero to implement padding in fprop and bprop (I now slice both the input and the filter).
- Figured out a different way to do integer division for the now dynamically sized slice lookup table.
- No more atomic adds in bprop. I've cast bprop as fprop upside down and the kernels are nearly identical. It requires a dimshuffle on the filter, but this just takes microseconds and a small amount of additional memory that can be shared with all conv ops. Bprop used to be bandwidth-bound on those atomic adds.
- Tweaked the p,q block ordering to improve L2 cache performance. I'm using a zigzag pattern now for all operations.
- Update already had a mode where you could stack all the gemm ops to eliminate atomic adds, but I've streamlined that stacking operation.
- Update also now fetches 32 rows deep. This comes at the cost of an instruction cache miss inside the main gemm loop, but is easily covered by the occupancy. The reason for doing this is the same as for using a 32x33 shared memory block to implement transpose: with N contiguous, the update op has expensive strided access patterns on both the input and the delta. I also eliminate all shared memory bank conflicts when storing the global loads to shared, with some clever shifting.
- Added a beta param to bprop to allow delta accumulation for inception groupings.
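For readers unfamiliar with the 32x33 shared-memory trick mentioned above: padding each shared-memory row by one element shifts successive rows across banks, so a warp reading a *column* of the tile hits 32 different banks instead of one. Below is a minimal CUDA sketch of that general technique applied to the classic tiled transpose; the kernel name, tile sizes, and host code are illustrative assumptions, not Scott's actual neon/maxas kernels.

```cuda
// Padded shared-memory transpose: a sketch of the 32x33 trick.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TILE = 32;  // tile width matches the warp size
constexpr int ROWS = 8;   // each block moves a TILE x TILE tile in 8-row steps

__global__ void transpose_padded(float *out, const float *in,
                                 int width, int height) {
    // The +1 pad makes each row 33 floats wide, so column reads below
    // land in 32 distinct banks (no serialization).
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Coalesced global reads: consecutive threads read consecutive columns.
    for (int j = 0; j < TILE; j += ROWS)
        if (x < width && y + j < height)
            tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];

    __syncthreads();

    // Swap block coordinates so the global writes are also coalesced.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;

    // Conflict-free column read out of shared memory, thanks to the pad.
    for (int j = 0; j < TILE; j += ROWS)
        if (x < height && y + j < width)
            out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}

int main() {
    const int W = 1024, H = 1024;
    float *in, *out;
    cudaMalloc(&in,  W * H * sizeof(float));
    cudaMalloc(&out, W * H * sizeof(float));

    dim3 block(TILE, ROWS);
    dim3 grid((W + TILE - 1) / TILE, (H + TILE - 1) / TILE);
    transpose_padded<<<grid, block>>>(out, in, W, H);
    cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The same padding idea is presumably what keeps the update op's strided shared-memory accesses conflict-free; the "clever shifting" described above would be a further refinement of it.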
Sorry, closed the issue as it got side-tracked by lots of other discussions. Let's discuss more in a new issue. Scott, I'd appreciate it if you re-paste your comment there to discuss further.
With cuDNN R3 coming in, improvements to Nervana, a new kid on the block called Chainer, and faster Facebook kernels, I will be doing a minor re-run of the benchmarks to see how things have improved.
Target date: August 15th.
I am still thinking quite a lot about how to take the benchmarks forward: beyond ConvNets, beyond images (into NLP, video, and audio), and beyond single-GPU. If any domain experts have suggestions (especially for audio and NLP), please do write to me.
The only thing that stopped me from multi-GPU benchmarks was the lack of enough frameworks to benchmark. This seems to have changed somewhat, and a decent number of frameworks now support multi-GPU, so I will plan on that.
More fun to come soon.
Checklist: