Feature request: Please provide AVX2/FMA capable builds #7257

Closed
ghost opened this issue Feb 4, 2017 · 37 comments
@ghost ghost commented Feb 4, 2017

I would go out on a limb and guess that the vast majority of TensorFlow users, on Linux at least, use fairly modern CPUs. It would therefore be beneficial for them to have the prebuilt TF binaries support AVX2/FMA. These two ISA extensions, and especially FMA, tend to speed up GEMM-like math pretty significantly.

It'd be great if the TF team provided a prebuilt Linux release *.whl that supports AVX2/FMA, perhaps as an alternative, non-default wheel. Such a wheel should be compatible with Haswell and above. Haswell came out in 2013, so lots of people have it by now.

To be clear, this is not a hugely pressing issue, since a *.whl can easily be rebuilt from source. It'd just make things faster and easier for people with modern CPUs.

What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?

N/A

Environment info

Operating System:
Linux Ubuntu 16.04

Installed version of CUDA and cuDNN:
(please attach the output of ls -l /path/to/cuda/lib/libcud*): NONE

If installed from binary pip package, provide:

  1. A link to the pip package you installed: https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.0.0rc0-cp35-cp35m-linux_x86_64.whl
  2. The output from python -c "import tensorflow; print(tensorflow.__version__)": 1.0.0-rc0

If possible, provide a minimal reproducible example (We usually don't have time to read hundreds of lines of your code)

Code:

import tensorflow as tf
sess = tf.InteractiveSession()

Output:

W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
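
The four warnings above name SSE4.2, AVX, AVX2 and FMA. As a rough sketch (not part of TensorFlow), the following checks which of those a Linux machine advertises. It assumes /proc/cpuinfo is available and uses the kernel's flag spellings (sse4_2, avx, avx2, fma); the function names are made up for illustration.

```python
# Sketch: report which extensions from the warnings the CPU advertises.
# Assumption: Linux x86 with /proc/cpuinfo and a "flags" line.
WANTED = ("sse4_2", "avx", "avx2", "fma")


def parse_cpu_flags(cpuinfo_text):
    """Return the flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def supported_extensions(cpuinfo_text, wanted=WANTED):
    """Map each wanted extension name to True/False."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or file unreadable: everything reports "no"
    for name, present in supported_extensions(text).items():
        print(f"{name}: {'yes' if present else 'no'}")
```

If all four report "yes", a build compiled with those extensions should run on that machine.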

What other attempted solutions have you tried?

Compiled from source.

Logs or other output that would be helpful

(If logs are large, please upload as attachment or provide link).

Contributor

@yaroslavvb yaroslavvb commented Feb 4, 2017

As a performance datapoint, this matrix multiplication benchmark goes from 0.31 Tops/sec to 1.05 Tops/sec when enabling AVX2/FMA on our Intel Xeon v3 @ 2.4 GHz servers: https://github.com/yaroslavvb/stuff/blob/master/matmul_bench.py
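
For reference, the arithmetic behind numbers like these can be sketched as follows. This is not the linked benchmark; it assumes the usual estimate of about 2*n^3 operations for an n x n matrix multiplication, and the function names are illustrative.

```python
# Sketch: throughput arithmetic for an n x n matrix multiplication.
# Assumption: roughly 2*n^3 operations (n^3 multiplications, about n^3 additions).

def matmul_ops(n):
    """Approximate operation count for multiplying two n x n matrices."""
    return 2 * n ** 3


def tera_ops_per_sec(n, seconds):
    """Throughput in Tops/sec for one n x n matmul that took `seconds`."""
    return matmul_ops(n) / seconds / 1e12


baseline, optimized = 0.31, 1.05  # Tops/sec figures quoted in the comment
speedup = optimized / baseline
print(f"speedup: {speedup:.2f}x")
```

With the two quoted figures the ratio comes out at a bit over 3x.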

On the other hand, there may be technical issues with infrastructure that make it hard to set up such a build. @caisq for comment

Contributor

@caisq caisq commented Feb 6, 2017

+@gunan

I believe our tooling and CI machines have the capacity to run bazel build with --copt=-mavx2 and --copt=-mfma. Gunhan, what do you think of expanding the nightly and release matrices to support those build options? My sense is that we are already a little constrained in terms of the machine resources and manpower to fix breakages.
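
A minimal sketch of what such a build invocation could look like, assembled in Python. The --copt flags are the ones named above and --config=opt / --config=cuda appear elsewhere in this thread; the pip package target is an assumption based on the TF 1.x source tree, and the helper name is hypothetical.

```python
# Sketch: assemble a bazel command line for a CPU-optimized build.
# Assumption: the pip package target below matches the TF 1.x source layout.
PIP_TARGET = "//tensorflow/tools/pip_package:build_pip_package"


def bazel_build_command(cpu_flags=("-mavx2", "-mfma"), cuda=False):
    """Return the argv list; nothing is executed here."""
    cmd = ["bazel", "build", "--config=opt"]
    if cuda:
        cmd.append("--config=cuda")
    # Each compiler flag is passed through bazel as its own --copt option.
    cmd += [f"--copt={flag}" for flag in cpu_flags]
    cmd.append(PIP_TARGET)
    return cmd


print(" ".join(bazel_build_command()))
```

The list can be handed to subprocess.run(cmd, check=True) from a configured TensorFlow source checkout.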

Member

@gunan gunan commented Feb 6, 2017

This was discussed before, and the decision was to make the released binaries work for everyone.
While it is very easy for users to upgrade personal computers, many cloud providers, and deployments with thousands of machines, take longer to upgrade. With 0.12, we tried to enable AVX and SSE4.1, but we had to roll it back because SSE4.1 is less common on AMD CPUs than on Intel CPUs.

So, we decided on the following policy for SIMD instruction sets going forward:
1 - Our released binaries will be as portable as possible, working out of the box on most machines.
2 - For people who need the best TF performance, we recommend building from source. We will make sure that things build well and that building from source is as easy as possible. Rather than supporting a new binary package for each of the tens of CPU architectures out in the wild, we decided it would be best to let users build binaries as needed.

Contributor

@yaroslavvb yaroslavvb commented Feb 6, 2017

Since Intel is getting involved, perhaps they would be willing to maintain an Intel-optimized build of TensorFlow? cc @mahmoud-abuzaina in case he has some connections

Author

@ghost ghost commented Feb 6, 2017

Contributor

@yaroslavvb yaroslavvb commented Feb 6, 2017

@dmitry-xnor I guess the issue here is limited resources at Google. Releasing an official wheel with a new configuration means you have to support it and fix issues that arise. I have seen some subtle alignment issues caused by enabling avx2, so troubleshooting such things can take time. And if you don't fix them, people get mad at Google, since the release is "official". Also, this sets a precedent for supporting a "highly optimized" binary + "lowest common denominator" binary, and the standard of "highly optimized binary" can shift over time. I do agree that it would be nice to have an Intel-specific build that's highly optimized.

I'm currently getting around this by launching "build --config=opt --config=cuda" builds weekly and dropping the resulting wheel in a shared folder for other users in our company.

Contributor

@yaroslavvb yaroslavvb commented Feb 6, 2017

@dmitry-xnor now that I think of it, I could probably drop such binaries into a public shared folder as well, since I'm going through the build process anyway. I'm building with --config=opt --config=cuda with CUDA 8.0 on an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz from Cirrascale, which seems like a common configuration. The downside is that I don't have time to set up cloud storage or research uploading, but if someone gave me an easy recipe to follow, I could do that.

Author

@ghost ghost commented Feb 6, 2017

Do you build as yourself interactively or is this an automated build? If you build as yourself, the solution is easily scriptable. Create a GCS bucket for binaries once, then just upload stuff there for every release and make it public like so, using gsutil:

gsutil cp $TENSORFLOW_WHL gs://<bucket name>/
gsutil acl set public-read gs://<bucket name>/$TENSORFLOW_WHL

And publish the resulting download URL somewhere. Note also that you should not rename the WHL or else pip3 will barf.
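
As a sketch of how the two commands above could be scripted, the following builds them from a local wheel path. It uses only the file's base name for the object path (so a wheel stored in a subdirectory still ends up at the bucket root) and refuses names that no longer look like a wheel filename, which per PEP 427 has five or six dash-separated fields. The helper name and bucket name are hypothetical.

```python
import os


def gsutil_commands(wheel_path, bucket):
    """Build the two gsutil invocations: upload, then make the object public."""
    name = os.path.basename(wheel_path)
    stem, ext = os.path.splitext(name)
    # Wheel filename layout: name-version[-build]-python-abi-platform.whl
    if ext != ".whl" or len(stem.split("-")) not in (5, 6):
        raise ValueError(f"not a valid wheel filename: {name}")
    dest = f"gs://{bucket}/{name}"
    return [
        ["gsutil", "cp", wheel_path, f"gs://{bucket}/"],
        ["gsutil", "acl", "set", "public-read", dest],
    ]


# Example with a hypothetical bucket; the commands are only printed here.
for cmd in gsutil_commands(
    "/tmp/tensorflow-1.0.0rc0-cp35-cp35m-linux_x86_64.whl", "my-tf-wheels"
):
    print(" ".join(cmd))
```

To actually run them, pass each list to subprocess.run(cmd, check=True) on a machine where gsutil is installed and authenticated.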

Contributor

@yaroslavvb yaroslavvb commented Feb 6, 2017

thanks, I'll give it a shot on the next build

Author

@ghost ghost commented Feb 6, 2017

Sounds great, thanks! Looking forward to prebuilt WHLs! It just doesn't make sense to use the CPU at 1/3rd the speed. 👍

Contributor

@yaroslavvb yaroslavvb commented Feb 7, 2017

I have just launched a new website for this purpose: TensorFlow Community Wheels. Fully integrated with github.

@gue22 gue22 commented Feb 20, 2017

Hey guys,
I need to dig deeper into this thread, but to expedite things, here are some thoughts in advance:

  1. Am I mistaken? As far as I saw / understood, on my machine there is not even optimization for SSE1? How about cutting off CPUs of a certain age / SSE level for the default distribution?! (Naturally I'd appreciate a community effort for a finer-grained optimization offering!!)

  2. Do you have any insight into how the XLA / JIT / AOT work announced last Wednesday (2017-02-15) might come to the rescue?

TIA
G.

Author

@ghost ghost commented Feb 20, 2017

Contributor

@cancan101 cancan101 commented Mar 1, 2017

Any suggestions for users of the TF docker images? These images have TF pre-installed.

Member

@gunan gunan commented Mar 1, 2017

Contributor

@cancan101 cancan101 commented Mar 1, 2017

I am using the GPU devel docker images, but right now I am just "using them" without rebuilding / reinstalling.

It is worth considering how far back in SSE instruction sets it is reasonable to go when dealing with a machine that needs CUDA compute capability >= 3.0 (for the GPU images or GPU wheels).

Contributor

@yaroslavvb yaroslavvb commented Mar 1, 2017

SSE was causing problems for people running on AMD CPU -- #6809

@ctmakro ctmakro commented Mar 4, 2017

After upgrading to 1.0, I found that the OSX prebuilt version lacked SSE, FMA and AVX support. After searching around for a while, I found no alternative except to build it myself. Well then, I'll build it myself.

Member

@gunan gunan commented Mar 4, 2017

@yaroslavvb created this repository to link to community supported wheel files.
https://github.com/yaroslavvb/tensorflow-community-wheels

We encourage our community to create and maintain specialized builds, but we will keep creating wheel files that are installable on most platforms. Therefore, I will close this issue.

@mvpel mvpel commented Apr 20, 2017

For example, the glibc library is designed to work anywhere, and has mechanisms to detect the availability of advanced instruction sets and use the proper functions to take advantage of them when they're available, and fall back if they're not. It's not necessary to support a separate binary for every possible processor capability.

Intel Performance Primitives: https://software.intel.com/en-us/articles/intel-integrated-performance-primitives-intel-ipp-is-there-any-function-to-detect-processor-type

Linux Function Multi-Versioning (FMV):
https://clearlinux.org/features/function-multiversioning-fmv
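
To illustrate the idea of runtime dispatch described above, here is a toy sketch in Python. It is not how glibc, IPP or GCC's FMV are implemented; real libraries select between differently compiled native routines. The two kernels are placeholders with identical behavior, and all names are made up for illustration.

```python
# Toy sketch of runtime dispatch: pick the most specialized implementation
# whose required CPU features are all present, else fall back to generic.

def dot_generic(a, b):
    """Portable fallback."""
    return sum(x * y for x, y in zip(a, b))


def dot_avx2_fma(a, b):
    """Placeholder standing in for an AVX2/FMA-compiled kernel."""
    return sum(x * y for x, y in zip(a, b))


# Ordered from most to least specialized; each entry lists required CPU flags.
CANDIDATES = [
    ({"avx2", "fma"}, dot_avx2_fma),
    (set(), dot_generic),
]


def select_kernel(cpu_flags, candidates=CANDIDATES):
    """Return the first kernel whose requirements are a subset of cpu_flags."""
    available = set(cpu_flags)
    for required, kernel in candidates:
        if required <= available:
            return kernel
    raise RuntimeError("no usable kernel")


kernel = select_kernel({"sse2", "avx", "avx2", "fma"})
print(kernel.__name__, kernel([1, 2], [3, 4]))
```

The selection happens once at startup, so a single binary can serve both old and new CPUs.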

@mvpel mvpel commented Apr 24, 2017

Here's an LWN article on the FMV capabilities provided for C++ in GCC 4.8 and up, and for C in GCC 6:

https://lwn.net/Articles/691932/

@ctmakro ctmakro commented May 10, 2017

I built these for OS X, with FMA and friends.
https://github.com/ctmakro/tensorflow_custom_python_wheel_build

@apacha apacha commented May 19, 2017

@gunan For user needing best performance ... build from sources ...

We will make sure things build well, and building from sources is as easy as possible

That would be acceptable if there were an easy way of building TensorFlow on Windows. Apparently, there isn't: people try, but the official documentation clearly states that Windows is currently not supported. So it would be very valuable if optimized builds were available, or if you could follow up on @mvpel's suggestion of detecting the CPU and enabling optimizations dynamically. Meanwhile, I will try to follow the instructions from here

@apacha apacha commented May 22, 2017

FYI: Following up on my last comment, I built the GPU version of TensorFlow with CPU optimizations (AVX) enabled and couldn't see much performance improvement on my side, so I will stick to the prebuilt GPU version that can be installed using pip install tensorflow-gpu==1.1.0

@xinyazhang xinyazhang commented May 24, 2017

@apacha From my experiments I found CPU-optimized GPU TF doesn't boost the performance significantly, but it can make the CPU cooler. My processor's temperature often goes up to 80C during training, while the optimized TF usually keeps the temp. below 70C.

@mvpel mvpel commented Jul 5, 2017

@xinyazhang - that can have performance implications, albeit slight, since CPUs will throttle their frequency if they are pushed into the upper limits of their temperature range for too long.

@apacha - There's not much point to vector instructions in GPU-enabled TF runs, since the work that those instructions would do on the CPU is done much more quickly on the GPU, so it is to be expected that AVX brings little performance improvement on a GPU-based run.

The basic idea is that there are far more machines out there with AVX, AVX2, SSE, etc. than there are with GPUs, and they're much cheaper to rent in the cloud (an AWS c4.large with AVX2 is 10 cents per hour, while the smallest GPU instance, p2.xlarge, is 90 cents an hour). Wringing out every last bit of CPU performance for non-GPU runs can therefore be of benefit, provided that a TF job on c4.large doesn't take 9 times longer than on p2.xlarge.
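
The break-even reasoning above can be sketched as follows, using only the two hourly prices quoted in this comment (in cents, to keep the arithmetic exact). The function names are illustrative, and the prices are the 2017 figures from the comment, not current ones.

```python
# Sketch: break-even arithmetic for the hourly prices quoted above (in cents).
CPU_CENTS_PER_HOUR = 10  # c4.large, as quoted
GPU_CENTS_PER_HOUR = 90  # p2.xlarge, as quoted


def job_cost_cents(cents_per_hour, hours):
    return cents_per_hour * hours


def break_even_slowdown(cpu_rate=CPU_CENTS_PER_HOUR, gpu_rate=GPU_CENTS_PER_HOUR):
    """How many times slower the CPU job may be before it costs more."""
    return gpu_rate / cpu_rate


def cpu_is_cheaper(gpu_hours, slowdown):
    """True if a job that is `slowdown` times slower on CPU still costs less."""
    cpu_cost = job_cost_cents(CPU_CENTS_PER_HOUR, gpu_hours * slowdown)
    gpu_cost = job_cost_cents(GPU_CENTS_PER_HOUR, gpu_hours)
    return cpu_cost < gpu_cost


print(break_even_slowdown())   # ratio of the two hourly prices
print(cpu_is_cheaper(2, 5))    # 2 GPU-hours vs 10 CPU-hours
```

Any speedup from AVX2/FMA on the CPU side lowers the effective slowdown and so widens the range of jobs where the CPU instance wins on cost.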

Author

@ghost ghost commented Aug 2, 2017

@mvpel "There's not much point to vector instructions in GPU-enabled TF runs"

CPU is very much a bottleneck with today's faster GPUs on certain models. Typically for e.g. computer vision problems you need to do a bunch of data decoding and augmentation, much of which can't be done on the GPU. This is actually a major problem we had with TF for multi-GPU training. Things were so bad (even with AVX2 and FMA enabled) that we switched to using PyTorch just for data augmentation in our TF pipelines. For what we do, it was an easy 40% throughput gain right off the bat, and code was quite a bit simpler too.

The point is: GPUs are specialized devices, and while they are powerful, they are not really usable for everything. Things are pretty bad even now for high throughput tasks, and I imagine they'll get much worse when TPUs and NVIDIA V100 GPUs become available.

@danqing danqing commented Aug 22, 2017

For anyone looking for optimized builds, we maintain a bunch of these at TinyMind that you can find at https://github.com/mind/wheels.

There are both CPU and GPU builds for all versions post TF1.1. Enjoy :)

Contributor

@bhack bhack commented Oct 13, 2017

@gunan Why can't the TF team officially maintain some alternative builds, as suggested in the previous comment?

Contributor

@yaroslavvb yaroslavvb commented Oct 13, 2017

@bhack it's a business level decision (what's the best use of Google engineer time?). Providing custom hardware builds is possible by people outside of Google, but there are many Tensorflowy things that can only be done by Googlers.

PS: the whole "bake AVX2 into the binary" approach is not that great for the open-source ecosystem -- TensorFlow would be better off with a dynamic dispatch system like the ones used by PyTorch and MKL.

Contributor

@bhack bhack commented Oct 13, 2017

I don't think much effort is required beyond hardware resources, because I think the AVX2 code paths are still tested in the matrix. Once code is tested, and therefore built, publishing it is quite automatic. But never mind, Intel is already maintaining optimized builds, with some delay relative to the official upstream releases.

Contributor

@yaroslavvb yaroslavvb commented Oct 13, 2017

I don't know how that Intel build works -- does it/conda automatically figure out which instruction sets your machine has and get the proper version? Or does it just automatically push a Xeon v4 optimized build?

This wasn't an issue with MKL because MKL has dynamic dispatch, but TF has to have advanced instructions statically baked in.

Contributor

@bhack bhack commented Oct 13, 2017

@danqing danqing commented Oct 13, 2017

At any rate, feel free to use the wheels I posted above @bhack - we rely on these ourselves so we will keep maintaining them. :)

@abeepathak96 abeepathak96 commented Nov 7, 2017

How can I get the TensorFlow installed on my machine compiled with SSE, AVX, and FMA instructions?

@danqing danqing commented Nov 7, 2017

If you use Ubuntu 16.04, check out the link I posted above (https://github.com/mind/wheels) - there you can find the version you want as well as how to install it.

Contributor

@lakshayg lakshayg commented Jan 22, 2018

I have TensorFlow wheels for a few different configurations at https://github.com/lakshayg/tensorflow-build
