Feature request: Please provide AVX2/FMA capable builds #7257
As a performance datapoint, this matrix multiplication benchmark goes 0.31 Tops/sec -> 1.05 Tops/sec when enabling avx2/fma on our Intel Xeon v3 @ 2.4 GHz servers: https://github.com/yaroslavvb/stuff/blob/master/matmul_bench.py On the other hand, there may be technical issues with infrastructure that make it hard to set up such a build. @caisq for comment |
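The linked script measures end-to-end matmul throughput; the same kind of measurement can be sketched with NumPy standing in for TensorFlow so the snippet runs anywhere. This is a hedged illustration, not the linked benchmark: the function name and sizes are made up.

```python
import time
import numpy as np

def matmul_gflops(n=1024, iters=10):
    """Time n x n float32 matmuls and report achieved GFLOP/s."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    np.dot(a, b)  # warm-up so one-time initialization doesn't skew timing
    start = time.perf_counter()
    for _ in range(iters):
        np.dot(a, b)
    elapsed = time.perf_counter() - start
    flops = 2.0 * n ** 3 * iters  # each matmul does ~2*n^3 floating-point ops
    return flops / elapsed / 1e9

print("%.1f GFLOP/s" % matmul_gflops())
```

Running this against a generic BLAS versus an AVX2/FMA-enabled one shows the same kind of gap reported above.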
I believe our tooling and CI machines have the capacity to run bazel build with |
This was discussed before, and the decision was to make the released binaries work for everyone. So, we decided this to be our policy about SIMD instruction sets going forward: |
Since Intel is getting involved, perhaps they would be willing to maintain an Intel-optimized build of TensorFlow? cc @mahmoud-abuzaina in case he has some connections |
IIRC only Microsoft uses AMD CPUs in the cloud and those are on their way
out. The proposal is not to make this the new default whl. The proposal is
to pre-build binaries for Haswell and above as a second wheel set. Just
build them with the same settings you build the regular whls but say -mavx2
and -mfma and throw them into the cloud bucket. That's what users will do
when building their own from source. Why not save them the aggravation and
half an hour of their time? It adds up.
…On Mon, Feb 6, 2017 at 10:25 AM, gunan ***@***.***> wrote:
This was discussed before, and the decision was to make the released
binaries work for everyone.
While it is very easy for users to upgrade personal computers, cloud
providers and fleets of thousands of machines take longer to upgrade. With 0.12, we
tried to enable avx and sse4.1, but we had to roll it back because it is
not as common for AMD CPUs to have SSE 4.1 as it is for Intel CPUs.
So, we decided this to be our policy about SIMD instruction sets going
forward:
1 - Our released binaries will be as portable as possible, working out of
the box on most of the machines.
2 - For people who need the best TF performance, we recommend building
from source. We will make sure things build well, and that building from
source is as easy as possible, but rather than supporting a new binary
package for tens of CPU architectures out in the wild, we decided the best
option would be to let users build binaries as needed.
|
@dmitry-xnor I guess the issue here is limited resources at Google. Releasing an official wheel with a new configuration means you have to support it and fix issues that arise. I have seen some subtle alignment issues caused by enabling avx2, so troubleshooting such things can take time. And if you don't fix them, people get mad at Google, since the release is "official". Also, this sets a precedent for supporting a "highly optimized" binary + "lowest common denominator" binary, and the standard for the "highly optimized" binary can shift over time. I do agree that it would be nice to have an Intel-specific build that's highly optimized. I'm currently getting around this by launching "build --config=opt --config=cuda" builds weekly and dropping the resulting wheel in a shared folder for other users in our company. |
@dmitry-xnor now to think of it, I could probably drop such binaries into a public shared folder as well, as I'm going through my build process. I'm building with |
Do you build as yourself interactively or is this an automated build? If you build as yourself, the solution is easily scriptable. Create a GCS bucket for binaries once, then just upload stuff there for every release and make it public like so, using gsutil:
And publish the resulting download URL somewhere. Note also that you should not rename the WHL or else pip3 will barf. |
thanks, I'll give it a shot on the next build |
Sounds great, thanks! Looking forward to prebuilt WHLs! It just doesn't make sense to use the CPU at 1/3rd the speed. 👍 |
I have just launched a new website for this purpose: TensorFlow Community Wheels. Fully integrated with GitHub. |
Hey guys,
TIA |
Yaroslav has created a repo with links to unofficial builds. TF proper is
understandably concerned about the support burden such a backward-incompatible
change would create (i.e. you fire up TF on an old AMD machine
and it gives you "invalid instruction"). I think Yaroslav's solution is a
good middle ground, with the possible benefit of e.g. Ryzen-specific builds
also appearing in the coming months.
…On Mon, Feb 20, 2017 at 12:58 PM, gue22 ***@***.***> wrote:
Hey guys,
need to dig a bit deeper into this thread, but to expedite things, some
thoughts here in advance:
1.
Am I mistaken? As far as I saw / understood, on my machine there is not
even optimization for SSE1? How about cutting off CPUs of a certain age /
SSE level for the default distribution?! (Naturally I'd appreciate a community
effort for a finer-grained optimization offering!!)
2.
Do you have any insight how XLA / JIT / AOT announced last Wednesday
(2017-02-15) comes to the rescue?
TIA
G.
|
Any suggestion for users of the TF docker images? These images have TF pre-installed. |
If you are using docker, you should be able to download the devel docker images
and install from the sources that are already on those images.
|
I am using the GPU devel docker images, but right now I am just "using them" without rebuilding / reinstalling. It is worth considering how far back in SSE instruction support it is reasonable to go on a machine that needs CUDA compute capability >= 3.0 (for the GPU images or GPU wheels). |
SSE was causing problems for people running on AMD CPU -- #6809 |
After upgrading to 1.0 I found the OS X prebuilt version lacked SSE, FMA and AVX support. After searching around for a while, I found no alternative except to build it myself. Well then, I'll build it myself. |
@yaroslavvb created this repository to link to community-supported wheel files. We encourage our community to create and maintain specialized builds, but we will keep creating wheel files that are installable on most platforms. Therefore, I will close this issue. |
For example, the glibc library is designed to work anywhere, and has mechanisms to detect the availability of advanced instruction sets and use the proper functions to take advantage of them when they're available, and fall back if they're not. It's not necessary to support a separate binary for every possible processor capability. Intel Performance Primitives: https://software.intel.com/en-us/articles/intel-integrated-performance-primitives-intel-ipp-is-there-any-function-to-detect-processor-type Linux Function Multi-Versioning (FMV): |
Here's a LWN article on the FMV capabilities provided for C++ in GCC 4.8 and up, and for C in GCC 6: |
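The runtime-detection idea that glibc and FMV rely on can be approximated from userspace by reading the kernel's reported CPU feature flags. A minimal, Linux-only sketch follows; the helper name is made up, and real dispatch (as in glibc or GCC's FMV) happens in compiled code, not Python:

```python
def linux_cpu_flags(path="/proc/cpuinfo"):
    """Return the set of CPU feature flags reported by the Linux kernel.

    Returns an empty set on non-Linux systems or if the file is unreadable.
    """
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    # Line looks like "flags : fpu vme ... avx2 fma ..."
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()

flags = linux_cpu_flags()
print("avx2:", "avx2" in flags, "fma:", "fma" in flags)
```

A loader could use a check like this to pick between a baseline and an AVX2/FMA build at install or import time, which is essentially what dynamic-dispatch schemes do per function.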
I built these for OS X, with FMA and friends. |
That would be acceptable if there were an easy way of building TensorFlow on Windows. Apparently, there isn't: people try, but the official documentation clearly states that Windows is currently not supported. So it would be very valuable if there were optimized builds available, or if you could follow up on @mvpel's suggestion of detecting the CPU and enabling optimizations dynamically. Meanwhile, I will try to follow the instructions from here |
FYI: Following up on my last comment, I built the GPU version of TensorFlow with CPU optimizations (AVX) enabled and I couldn't see much performance improvement on my side, so I will stick to the pre-built GPU version that can be installed using |
@apacha From my experiments I found that CPU-optimized GPU TF doesn't boost performance significantly, but it can keep the CPU cooler. My processor's temperature often goes up to 80°C during training, while the optimized TF usually keeps it below 70°C. |
@xinyazhang - that can have performance implications, albeit slight, since CPUs will throttle their frequency if they are pushed into the upper limits of their temperature range for too long. @apacha - There's not much point to vector instructions in GPU-enabled TF runs, since the work that would be done by those instructions on the CPU is done much more quickly on the GPU, so the fact that there's little performance improvement with AVX on a GPU-based run is to be expected. The basic idea is that there are far more machines out there with AVX, AVX2, SSE, etc. than there are with GPUs, and they're much cheaper to rent in the cloud (an AWS c4.large with AVX2 is 10 cents per hour, while the smallest GPU instance, p2.xlarge, is 90 cents an hour), so wringing out every last bit of CPU performance for non-GPU runs can be of benefit, provided that a TF job on c4.large doesn't take 9 times longer than on p2.xlarge. |
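The rental-cost argument above is simple break-even arithmetic. A small sketch, with rates taken from the comment and a hypothetical function name:

```python
def cpu_vs_gpu_cost(cpu_rate=0.10, gpu_rate=0.90, slowdown=9.0):
    """Cost of a CPU run relative to the same job on GPU.

    cpu_rate, gpu_rate: hourly rental prices in dollars.
    slowdown: how many times longer the job takes on the CPU instance.
    Returns a ratio; below 1.0 the CPU instance is cheaper per job.
    """
    return (cpu_rate * slowdown) / gpu_rate

# At 10 cents/hr vs 90 cents/hr, break-even is exactly a 9x slowdown;
# a 4x-slower CPU run costs less than half as much as the GPU run.
print(round(cpu_vs_gpu_cost(), 6), round(cpu_vs_gpu_cost(slowdown=4.0), 6))
```

This is why squeezing AVX2/FMA speedups out of the CPU wheel directly shifts the economics of CPU-only training.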
@mvpel "There's not much point to vector instructions in GPU-enabled TF runs" -- the CPU is very much a bottleneck with today's faster GPUs on certain models. Typically, for e.g. computer vision problems you need to do a bunch of data decoding and augmentation, much of which can't be done on the GPU. This is actually a major problem we had with TF for multi-GPU training. Things were so bad (even with AVX2 and FMA enabled) that we switched to using PyTorch just for data augmentation in our TF pipelines. For what we do, it was an easy 40% throughput gain right off the bat, and the code was quite a bit simpler too. The point is: GPUs are specialized devices, and while they are powerful, they are not really usable for everything. Things are pretty bad even now for high-throughput tasks, and I imagine they'll get much worse when TPUs and NVIDIA V100 GPUs become available. |
For anyone looking for optimized builds, we maintain a bunch of these at TinyMind that you can find at https://github.com/mind/wheels. There are both CPU and GPU builds for all versions post TF1.1. Enjoy :) |
@gunan Why can't the TF team officially maintain some alternative builds like those suggested in the previous comment? |
@bhack it's a business-level decision (what's the best use of Google engineer time?). Providing custom hardware builds is possible for people outside of Google, but there are many TensorFlowy things that can only be done by Googlers. PS: the whole "bake AVX2 into the binary" approach is not that great for the open-source ecosystem -- TensorFlow would be better off with a dynamic dispatch system like those used by PyTorch and MKL. |
I don't think much effort is required beyond hardware resources, since I believe AVX2 code paths are already tested in the CI matrix. Once code is tested, and therefore built, publishing it should be nearly automatic. But never mind: Intel is already maintaining optimized builds, with some delay behind the official upstream releases. |
I don't know how that Intel build works -- does it/conda automatically figure out which instruction sets your machine has and fetch the proper version? Or does it just push a Xeon-v4-optimized build? This wasn't an issue with MKL because MKL has dynamic dispatch, but TF has to have advanced instructions statically baked in. |
The only available info is at https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture |
At any rate, feel free to use the wheels I posted above @bhack - we rely on these ourselves so we will keep maintaining them. :) |
How can I make the TensorFlow installed on my machine use SSE, AVX, and FMA instructions? |
If you use Ubuntu 16.04, check out the link I posted above (https://github.com/mind/wheels) - there you can find the version you want as well as instructions on how to install it. |
I have tensorflow wheels for a few different configurations at https://github.com/lakshayg/tensorflow-build |
Hi all, I'd pay $75 for someone to help me write a Dockerfile for building TensorFlow wheels. We could put it in a public GitHub repo so folks could use it as a reference. I keep getting stuck when trying to follow the official TensorFlow docs. My last
I appreciate the other projects that have been linked here, but none of them have the scripts used for actually doing the builds. I specifically need a build with I miss GitHub notifications often, so reach out to me on LinkedIn if you're interested: https://www.linkedin.com/in/eric-riddoch/ |
I would go out on a limb and guess that the vast majority of TensorFlow users, on Linux at least, use fairly modern CPUs. It would therefore be beneficial for them to have the prebuilt TF binaries support AVX2/FMA. These two ISA extensions, and especially FMA, tend to speed up GEMM-like math pretty significantly.
It'd be great if the TF team provided a prebuilt Linux release *.whl that supports AVX2/FMA, perhaps as an alternative, non-default wheel. These should be compatible with Haswell and above. Haswell came out in 2013; lots of people have it by now.
To be clear, this is not a hugely pressing issue, as the *.whl can be easily rebuilt from source. It'd just make things faster and easier for people with modern CPUs.
What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?
N/A
Environment info
Operating System:
Linux Ubuntu 16.04
Installed version of CUDA and cuDNN:
(please attach the output of ls -l /path/to/cuda/lib/libcud*): NONE
If installed from binary pip package, provide:
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.0.0rc0-cp35-cp35m-linux_x86_64.whl
python -c "import tensorflow; print(tensorflow.__version__)": 1.0.0-rc0
If possible, provide a minimal reproducible example (We usually don't have time to read hundreds of lines of your code)
Code:
Output:
What other attempted solutions have you tried?
Compiled from source.
Logs or other output that would be helpful
(If logs are large, please upload as attachment or provide link).