Feature request: Please provide AVX2/FMA capable builds #7257

ghost · 2017-02-04T08:31:21Z

I would go out on a limb and guess that the vast majority of Tensorflow users on Linux at least use fairly modern CPUs. It would therefore be beneficial for them to have the prebuilt TF binaries support AVX2/FMA. These two ISA extensions, and especially FMA, tend to speed up GEMM-like math pretty significantly.

It'd be great if TF team provided prebuilt Linux release *.whl that supports AVX2/FMA, perhaps as an alternative, non-default wheel. These should be compatible with Haswell and above. Haswell came out in 2013, lots of people have it by now.

To be clear, this is not a hugely pressing issue, *.whl can be easily rebuilt from source. It'd just make things faster and easier for people with modern CPUs.

What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?

N/A

Environment info

Operating System:
Linux Ubuntu 16.04

Installed version of CUDA and cuDNN:
(please attach the output of ls -l /path/to/cuda/lib/libcud*): NONE

If installed from binary pip package, provide:

A link to the pip package you installed: https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.0.0rc0-cp35-cp35m-linux_x86_64.whl
The output from python -c "import tensorflow; print(tensorflow.__version__)": 1.0.0-rc0

If possible, provide a minimal reproducible example (We usually don't have time to read hundreds of lines of your code)

Code:

import tensorflow as tf
sess = tf.InteractiveSession()

Output:

W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
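
For anyone who wants to confirm what their own CPU reports before rebuilding, here is a minimal sketch. It assumes Linux, where /proc/cpuinfo carries a "flags" line; the helper names (`parse_cpu_flags`, `supported_extensions`) are made up for this example and are not part of TensorFlow.

```python
# Sketch: report which of the extensions named in the warnings the CPU advertises.
# Assumption: Linux, where /proc/cpuinfo has a "flags" line per logical CPU.

# Flag names as spelled in /proc/cpuinfo, mapped to the names used in the warnings.
WANTED = {"sse4_2": "SSE4.2", "avx": "AVX", "avx2": "AVX2", "fma": "FMA"}


def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags on the first 'flags' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def supported_extensions(cpuinfo_text):
    """Map each extension of interest to True or False."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {label: name in flags for name, label in WANTED.items()}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or /proc is unavailable
    for label, ok in supported_extensions(text).items():
        print(f"{label}: {'yes' if ok else 'no'}")
```

If all four come back "yes", the machine can run a build compiled with the corresponding compiler flags.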

What other attempted solutions have you tried?

Compiled from source.

Logs or other output that would be helpful

(If logs are large, please upload as attachment or provide link).


yaroslavvb · 2017-02-04T16:19:17Z

As a performance datapoint, this matrix multiplication benchmark goes from 0.31 Tops/sec to 1.05 Tops/sec when enabling AVX2/FMA on our Intel Xeon v3 @ 2.4 GHz servers: https://github.com/yaroslavvb/stuff/blob/master/matmul_bench.py
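
To put those two numbers side by side, here is a small arithmetic sketch. The operation count of 2·n³ per n×n matrix product is an assumption made for this example; the linked script may count slightly differently. The function names are illustrative only.

```python
# Sketch: turn a matmul timing into Tops/sec and compare the two quoted figures.
# Assumption: an n x n by n x n product is counted as 2 * n**3 operations.


def matmul_ops(n):
    """Approximate operation count (multiplies plus adds) for an n x n product."""
    return 2 * n ** 3


def tera_ops_per_sec(n, seconds):
    """Throughput in Tops/sec for one n x n product that took `seconds`."""
    return matmul_ops(n) / seconds / 1e12


# Figures quoted in the comment above.
baseline = 0.31       # Tops/sec without AVX2/FMA
with_avx2_fma = 1.05  # Tops/sec with AVX2/FMA

speedup = with_avx2_fma / baseline
print(f"speedup: {speedup:.2f}x")
```

The ratio works out to a bit under 3.4x on that particular benchmark and hardware; other workloads may see more or less.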

On the other hand, there may be technical issues with infrastructure that make it hard to set up such a build. @caisq for comment

caisq · 2017-02-06T15:10:32Z

+@gunan

I believe our tooling and CI machines have the capacity to run bazel build with --copt=-mavx2 and --copt=-mfma. Gunhan, what do you think of expanding the nightly and release matrices to support those build options? My sense is that we are already a little constrained in terms of the machine resource and manpower to fix breakages.
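
As a rough illustration of what such a build invocation looks like, here is a sketch that assembles the command line from a list of extensions. The `--copt` values are standard GCC flags; the target label and `--config=opt` are taken from other comments in this thread; the helper name `bazel_build_command` is hypothetical.

```python
# Sketch: assemble a bazel command line with one --copt per requested extension.
# This only builds the argument list; it does not run bazel.

# GCC flag for each extension discussed in this thread.
GCC_FLAGS = {"sse4.2": "-msse4.2", "avx": "-mavx", "avx2": "-mavx2", "fma": "-mfma"}

# Target label as it appears in a build attempt quoted later in the thread.
TARGET = "//tensorflow/tools/pip_package:build_pip_package"


def bazel_build_command(extensions, target=TARGET):
    """Return the argument list for a build enabling the given extensions."""
    args = ["bazel", "build", "--config=opt"]
    for ext in extensions:
        args.append("--copt=" + GCC_FLAGS[ext.lower()])
    args.append(target)
    return args


print(" ".join(bazel_build_command(["avx2", "fma"])))
```

Passing an extension that is not in the table raises `KeyError`, which is deliberate: a misspelled flag should fail early rather than silently produce a portable build.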

gunan · 2017-02-06T18:23:48Z

This was discussed before, and the decision was to make the released binaries work for everyone.
While it is very easy for users to upgrade personal computers, many cloud providers or thousands of machines take longer to upgrade. With 0.12, we tried to enable avx and sse4.1, but we had to roll it back because it is not as common for AMD CPUs to have sse 4.1 as intel CPUs.

So, we decided this to be our policy about SIMD instruction sets going forward:
1 - Our released binaries will be as portable as possible, working out of the box on most of the machines.
2 - For people who need the best TF performance, we recommend building from sources. We will make sure things build well, and building from sources is as easy as possible, but rather than supporting a new binary package for 10s of CPU architectures out in the wild, we decided the best would be to let users build binaries as needed.

yaroslavvb · 2017-02-06T18:28:32Z

Since Intel is getting involved, perhaps they would be willing to maintain an Intel-optimized build of TensorFlow? cc @mahmoud-abuzaina in case he has some connections

ghost · 2017-02-06T18:34:05Z

IIRC only Microsoft uses AMD CPUs in the cloud and those are on their way out. The proposal is not to make this the new default whl. The proposal is to pre-build binaries for Haswell and above as a second wheel set. Just build them with the same settings you build the regular whls but say -mavx2 and -mfma and throw them into the cloud bucket. That's what users will do when building their own from source. Why not save them the aggravation and half an hour of their time? It adds up.

yaroslavvb · 2017-02-06T18:48:36Z

@dmitry-xnor I guess the issue here is limited resources at Google. Releasing an official wheel with a new configuration means you have to support it and fix issues that arise. I have seen some subtle alignment issues caused by enabling avx2, so troubleshooting such things can take time. And if you don't fix them, people get mad at Google, since the release is "official". Also, this sets a precedent for supporting a "highly optimized" binary + "lowest common denominator" binary, and the standard of "highly optimized binary" can shift over time. I do agree that it would be nice to have an Intel-specific build that's highly optimized.

I'm currently getting around this by launching "build --config=opt --config=cuda" builds weekly and dropping the resulting wheel in a shared folder for other users in our company.

yaroslavvb · 2017-02-06T20:44:29Z

@dmitry-xnor now that I think of it, I could probably drop such binaries into a public shared folder as well, as I'm going through my build process. I'm building with --config=opt --config=cuda with CUDA 8.0 on Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz from Cirrascale, which seems like a common configuration. The downside is that I don't have time to set up cloud storage or research uploading, but if someone gave me an easy recipe to follow, I could do that.

ghost · 2017-02-06T21:46:27Z

Do you build as yourself interactively or is this an automated build? If you build as yourself, the solution is easily scriptable. Create a GCS bucket for binaries once, then just upload stuff there for every release and make it public like so, using gsutil:

gsutil cp $TENSORFLOW_WHL gs://<bucket name>/
gsutil acl set public-read gs://<bucket name>/$TENSORFLOW_WHL

And publish the resulting download URL somewhere. Note also that you should not rename the WHL or else pip3 will barf.
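
On the renaming point: pip reads the compatibility tags out of the wheel's file name, so the name has to keep its dash-separated fields. A minimal sketch of that layout, following the wheel naming convention (name, version, optional build tag, Python tag, ABI tag, platform tag); `parse_wheel_filename` is a made-up helper for illustration, not a pip API.

```python
# Sketch: split a wheel file name into the fields pip relies on.
# Layout: {name}-{version}(-{build})?-{python tag}-{abi tag}-{platform tag}.whl


def parse_wheel_filename(filename):
    """Return the fields of a wheel file name, or raise ValueError if malformed."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel: " + filename)
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        name, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError("unexpected number of fields in: " + filename)
    return {
        "name": name,
        "version": version,
        "build": build,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


# File name from the pip package link at the top of this issue.
print(parse_wheel_filename("tensorflow-1.0.0rc0-cp35-cp35m-linux_x86_64.whl"))
```

A name like tensorflow-avx2.whl has too few fields, which is the kind of rename that trips up the installer.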

yaroslavvb · 2017-02-06T22:01:57Z

thanks, I'll give it a shot on the next build

ghost · 2017-02-06T22:56:28Z

Sounds great, thanks! Looking forward to prebuilt WHLs! It just doesn't make sense to use the CPU at 1/3rd the speed. 👍

yaroslavvb · 2017-02-07T05:12:22Z

I have just launched a new website for this purpose: TensorFlow Community Wheels. Fully integrated with github.

gue22 · 2017-02-20T20:57:10Z

Hey guys,
need to dig some deeper into this thread, but to expedite things some thoughts here in advance:

1. Am I mistaken? As far as I saw / understood on my machine there is not even optimization for SSE1? How about cutting off CPUs of a certain age / SSE for the default distribution?! (Naturally I'd appreciate a community effort for a finer-grained optimization offering!!)
2. Do you have any insight how XLA / JIT / AOT announced last Wednesday (2017-02-15) comes to the rescue?

TIA
G.

ghost · 2017-02-20T21:03:01Z

Yaroslav has created a repo with links to unofficial builds. TF proper is understandably concerned about the support burden such a backward-incompatible change would create (i.e. you fire up TF on an old AMD machine and it gives you "invalid instruction"). I think Yaroslav's solution is a good middle ground, with the possible benefit of e.g. Ryzen-specific builds also appearing in the coming months.

cancan101 · 2017-03-01T00:21:11Z

Any suggestion for users of the TF docker images? These images have TF pre-installed.

gunan · 2017-03-01T00:25:45Z

If you are using docker, you should be able to download devel docker images and install from sources that are already on docker images.

cancan101 · 2017-03-01T00:38:12Z

I am using the GPU devel docker images, but right now I am just "using them" without rebuilding / reinstalling.

It is worth considering how far back in SSE instructions is reasonable to handle when dealing with a machine that needs to have CUDA compute >= 3.0 (for the gpu images or gpu wheels).

yaroslavvb · 2017-03-01T01:20:50Z

SSE was causing problems for people running on AMD CPU -- #6809

ctmakro · 2017-03-04T15:42:23Z

After upgrading to 1.0 I found the OSX prebuilt version lacked SSE, FMA and AVX support. After searching around for a while, there's no alternative except to build it myself. Well then, I'll build it myself.

gunan · 2017-03-04T17:42:43Z

@yaroslavvb created this repository to link to community supported wheel files.
https://github.com/yaroslavvb/tensorflow-community-wheels

We encourage our community to create and maintain specialized builds, but we will keep creating wheel files that are installable on most platforms. Therefore, I will close this issue.

mvpel · 2017-04-20T16:59:44Z

For example, the glibc library is designed to work anywhere, and has mechanisms to detect the availability of advanced instruction sets and use the proper functions to take advantage of them when they're available, and fall back if they're not. It's not necessary to support a separate binary for every possible processor capability.

Intel Performance Primitives: https://software.intel.com/en-us/articles/intel-integrated-performance-primitives-intel-ipp-is-there-any-function-to-detect-processor-type

Linux Function Multi-Versioning (FMV):
https://clearlinux.org/features/function-multiversioning-fmv
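
The dispatch idea can be sketched in a few lines. This is only an illustration of the selection logic in Python, not how glibc or GCC implement it: in C and C++ the compiler can emit several variants of one function and pick one at load time (for example via GCC's `target_clones` attribute). The two `dot_*` functions below are stand-ins that compute the same result; all names here are hypothetical.

```python
# Sketch: pick the most specialized implementation the CPU supports,
# falling back to a portable one. Both variants compute the same dot product;
# in a real library the specialized one would be compiled with AVX2/FMA.


def dot_baseline(a, b):
    """Portable variant: needs no special CPU features."""
    return sum(x * y for x, y in zip(a, b))


def dot_avx2_fma(a, b):
    """Stand-in for a variant built with AVX2/FMA (same math, plain Python here)."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc


# Ordered from most to least specialized; the last entry requires nothing.
VARIANTS = [
    (frozenset({"avx2", "fma"}), dot_avx2_fma),
    (frozenset(), dot_baseline),
]


def select_variant(cpu_flags):
    """Return the first variant whose required flags are all present."""
    for required, fn in VARIANTS:
        if required <= set(cpu_flags):
            return fn
    raise RuntimeError("no variant matched")  # unreachable: baseline needs nothing


dot = select_variant({"sse2", "avx", "avx2", "fma"})
print(dot.__name__, dot([1, 2, 3], [4, 5, 6]))
```

The selection happens once, at startup, so one binary can serve both old and new CPUs without paying a per-call cost for the check.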

mvpel · 2017-04-24T17:18:57Z

Here's a LWN article on the FMV capabilities provided for C++ in GCC 4.8 and up, and for C in GCC 6:

https://lwn.net/Articles/691932/

ctmakro · 2017-05-10T18:01:54Z

I built these for OS X, with FMA and friends.
https://github.com/ctmakro/tensorflow_custom_python_wheel_build

apacha · 2017-05-19T11:47:07Z

@gunan "For user needing best performance ... build from sources ... We will make sure things build well, and building from sources is as easy as possible"

That would be acceptable, if there was an easy way of building Tensorflow on Windows. Apparently, there isn't: People try but the official documentation on that clearly states that Windows is currently not supported. So it would be very valuable, if there were optimized builds available or you could follow up on @mvpel's suggestion on detecting cpu and enabling optimizations dynamically. Meanwhile, I will try to follow the instructions from here

apacha · 2017-05-22T08:35:20Z

FYI: Following up on my last comment, I built the GPU version of Tensorflow with CPU optimizations (AVX) enabled and I couldn't see much performance improvement on my side, so I will stick to the prebuilt GPU version that can be installed using pip install tensorflow-gpu==1.1.0

xinyazhang · 2017-05-24T21:47:00Z

@apacha From my experiments I found CPU-optimized GPU TF doesn't boost the performance significantly, but it can make the CPU cooler. My processor's temperature often goes up to 80C during training, while the optimized TF usually keeps the temp. below 70C.

mvpel · 2017-07-05T18:53:29Z

@xinyazhang - that can have performance implications, albeit slight, since CPUs will throttle their frequency if they are pushed into the upper limits of their temperature range for too long.

@apacha - There's not much point to vector instructions in GPU-enabled TF runs, since the work that would be done by those instructions in the CPU is done in the GPU much more quickly, and so the fact that there's little performance improvement with AVX on a GPU-based run is to be expected.

The basic idea is that there are far more machines out there with AVX, AVX2, SSE, etc. than there are with GPUs, and they're much cheaper to rent in the cloud (an AWS c4.large with AVX2 is 10 cents per hour, while the smallest GPU instance p2.xlarge is 90 cents an hour), so wringing out every last bit of CPU performance potential for non-GPU runs can be of benefit provided that a TF job on c4.large doesn't take 9 times longer than on p2.xlarge.
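
The break-even arithmetic behind that "9 times" figure, as a small sketch. The prices are the ones quoted in the comment above (in cents per hour, kept as integers to avoid float rounding); the function names are made up for this example.

```python
# Sketch: at what slowdown does the cheaper CPU instance stop being cheaper?
# Prices are the 2017 figures quoted above, in cents per hour.

CPU_CENTS_PER_HOUR = 10  # c4.large
GPU_CENTS_PER_HOUR = 90  # p2.xlarge


def break_even_slowdown(cpu_cents, gpu_cents):
    """How many times longer the CPU job may run before it costs as much as the GPU job."""
    return gpu_cents / cpu_cents


def cpu_is_cheaper(cpu_hours, gpu_hours,
                   cpu_cents=CPU_CENTS_PER_HOUR, gpu_cents=GPU_CENTS_PER_HOUR):
    """True if the whole job costs strictly less on the CPU instance."""
    return cpu_hours * cpu_cents < gpu_hours * gpu_cents


print(break_even_slowdown(CPU_CENTS_PER_HOUR, GPU_CENTS_PER_HOUR))
print(cpu_is_cheaper(cpu_hours=8, gpu_hours=1))
```

This only compares rental cost; it ignores wall-clock time, which may matter more than price for some jobs.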

ghost · 2017-08-02T01:30:27Z

@mvpel "There's not much point to vector instructions in GPU-enabled TF runs"

CPU is very much a bottleneck with today's faster GPUs on certain models. Typically for e.g. computer vision problems you need to do a bunch of data decoding and augmentation, much of which can't be done on the GPU. This is actually a major problem we had with TF for multi-GPU training. Things were so bad (even with AVX2 and FMA enabled) that we switched to using PyTorch just for data augmentation in our TF pipelines. For what we do, it was an easy 40% throughput gain right off the bat, and code was quite a bit simpler too.

The point is: GPUs are specialized devices, and while they are powerful, they are not really usable for everything. Things are pretty bad even now for high throughput tasks, and I imagine they'll get much worse when TPUs and NVIDIA V100 GPUs become available.

danqing · 2017-08-22T20:35:33Z

For anyone looking for optimized builds, we maintain a bunch of these at TinyMind that you can find at https://github.com/mind/wheels.

There are both CPU and GPU builds for all versions post TF1.1. Enjoy :)

bhack · 2017-10-13T12:04:59Z

@gunan Why can't the TF team officially maintain some alternative builds, as suggested in the previous comment?

yaroslavvb · 2017-10-13T14:09:33Z

@bhack it's a business level decision (what's the best use of Google engineer time?). Providing custom hardware builds is possible by people outside of Google, but there are many Tensorflowy things that can only be done by Googlers.

PS: the whole "bake AVX2 into binary" approach is not that great for the open-source ecosystem -- TensorFlow would be better off with a dynamic dispatch system like what's used by PyTorch, MKL.

bhack · 2017-10-13T14:54:33Z

I don't think much effort is required other than hardware resources, because I think the AVX2 code paths are still tested in the matrix. Once code is tested, and so built, I think publishing it is quite automatic. But never mind, Intel is already maintaining optimized builds, with some delay behind the official upstream releases.

yaroslavvb · 2017-10-13T15:12:51Z

I don't know how that Intel build works -- does it/conda automatically figure out which instruction sets your machine has and get the proper version? Or does it just automatically push a XeonV4-optimized build?

This wasn't an issue with MKL because MKL has dynamic dispatch, but TF has to have advanced instructions statically baked in there

bhack · 2017-10-13T15:39:57Z

The only available info is at https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture

danqing · 2017-10-13T16:34:38Z

At any rate, feel free to use the wheels I posted above @bhack - we rely on these ourselves so we will keep maintaining them. :)

abeepathak96 · 2017-11-07T02:16:55Z

How can I make the TensorFlow installed on my machine use SSE, AVX, and FMA instructions?

danqing · 2017-11-07T02:49:08Z

If you use ubuntu 16.04, check out the link I posted above (https://github.com/mind/wheels) - there you can find the version you want as well as how to install it.

lakshayg · 2018-01-22T11:56:01Z

I have tensorflow wheels for a few different configurations at https://github.com/lakshayg/tensorflow-build

phitoduck · 2022-12-08T07:49:50Z

Hi all, I'd pay $75 for someone to help me write a Dockerfile for building tensorflow wheels.

We could put it in a public GitHub repo so folks could use it as a reference. I keep getting stuck when trying to follow the official tensorflow docs. My last bazel build ... attempt ended with this error:

$ git clone -b "r2.10" --single-branch https://github.com/tensorflow/tensorflow.git

$ USE_BAZEL_VERSION=$(cat .bazelversion) \
    bazel build \
    --copt=-mavx2 --copt=-mfma \
    //tensorflow/tools/pip_package:build_pip_package

ERROR: /tensorflow/tensorflow/compiler/mlir/lite/BUILD:295:11: Compiling tensorflow/compiler/mlir/lite/ir/tfl_ops.cc failed: (Exit 1): gcc failed: error executing command /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 251 arguments skipped)
gcc: fatal error: Killed signal terminated program cc1plus
compilation terminated.
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 9695.639s, Critical Path: 4770.50s
INFO: 4737 processes: 771 internal, 3966 local.
FAILED: Build did NOT complete successfully
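
A note on the log above: "Killed signal terminated program cc1plus" means the compiler process was killed from outside, which commonly (though not certainly here) happens when the machine or container runs out of memory. One usual mitigation is lowering Bazel's `--jobs` value so fewer compilers run at once. The sketch below picks a job count from available memory; the 2 GiB-per-job figure is an assumption for illustration, not a measured TensorFlow requirement, and both function names are hypothetical.

```python
# Sketch: spot the "killed compiler" pattern in a build log and pick a
# conservative value for bazel's --jobs flag.
# Assumption: roughly 2 GiB of RAM per concurrent compile job (a guess).


def looks_like_killed_compiler(log_text):
    """True if the log contains GCC's message for a compiler killed by a signal."""
    return "Killed signal terminated program" in log_text


def suggested_jobs(mem_total_gib, cpu_count, gib_per_job=2.0):
    """Cap parallel jobs by both CPU count and memory; never go below 1."""
    by_memory = int(mem_total_gib // gib_per_job)
    return max(1, min(cpu_count, by_memory))


log_line = "gcc: fatal error: Killed signal terminated program cc1plus"
if looks_like_killed_compiler(log_line):
    # Example numbers only: an 8 GiB container with 16 visible cores.
    print("try: bazel build --jobs=%d ..." % suggested_jobs(8, 16))
```

Inside Docker the memory limit of the container, not the host, is the number that matters.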

I appreciate the other projects that have been linked here, but none of them have the scripts used for actually doing the builds.

I specifically need a build with AVX2 and FMA, and could use another with AVX2, FMA, and AVX512F (for running on AWS Fargate). Python 3.8-3.10.

I miss GitHub notifications often, so reach out to me on LinkedIn if you're interested: https://www.linkedin.com/in/eric-riddoch/

girving added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Feb 6, 2017

yaroslavvb added stat:community support Status - Community Support stat:contribution welcome Status - Contributions welcome and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Feb 6, 2017

yaroslavvb mentioned this issue Feb 18, 2017

Provide a working way to shut up warnings #7652

Closed

This was referenced Feb 24, 2017

"The TensorFlow library wasn't compiled to use SSE instructions, but these are available on your machine and could speed up CPU computations" in "Hello, TensorFlow!" program #7778

Closed

The TensorFlow library wasn't compiled to use ... instructions #7943

Closed

gunan closed this as completed Mar 4, 2017

abhi18av mentioned this issue Mar 10, 2017

Error while install tensorflow via recommended method on Getting Started guide #8096

Closed

cancan101 mentioned this issue Jul 7, 2017

Contribute Pillow-SIMD back to Pillow uploadcare/pillow-simd#8

Open

sjperkins mentioned this issue Jan 31, 2018

Number of rows inconsistent after removing auto-correlations ratt-ru/CubiCal#133

Closed
