Add the build for runtime dispatch for AVX, AVX2 instruction set #26125
We should also make sure the CI build configs cover the correct dispatch keys.
@ezyang, can you let me know how I can change the CI build configs?
Btw, shall we cover AVX512 VNNI too? From what I understand, these instructions are actually very important for fast quantized kernels without saturation. Not sure whether it should be a separate flag or whether we can fold it into regular AVX512 (in which case we'd skip AVX512 on Skylake).
First, you'll need to check that the CircleCI machines actually support AVX512. Probably the easiest way is to "Rerun with SSH" one of the jobs, ssh in and then poke around to figure out the support. Then, selection of particular kernels is done by way of |
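The "poke around" step could look roughly like the sketch below once you are on the machine over SSH. It assumes a Linux host that exposes `/proc/cpuinfo`; the function name `list_avx_flags` is made up for illustration and is not part of the PyTorch CI scripts.

```shell
#!/usr/bin/env bash
# Sketch only: list the AVX-family feature flags the kernel reports for the
# first CPU. Assumes Linux. The optional argument exists so the function can
# be pointed at a saved copy of /proc/cpuinfo.
list_avx_flags() {
  local cpuinfo="${1:-/proc/cpuinfo}"
  grep -m1 '^flags' "$cpuinfo" | tr ' ' '\n' | grep -E '^avx' | LC_ALL=C sort -u
}
```

Running `list_avx_flags` with no argument prints one flag per line. If no `avx512f` line appears, the machine cannot run AVX512 kernels.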
From offline discussion with @dzhulgakov: We haven't seen a use case for the instructions in AVX512 VNNI yet (they seem better suited to be supported in the FBGEMM kernels directly), so let's hold off on adding support for those.
cmake/Codegen.cmake
Outdated
Is this section needed? Wouldn't it be covered by the code at line 109 above?
But that's only set for the AVX2 case, no?
Is there a reason we want to disable the split? It looks like it is beneficial on some CPU models but not on others. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49089
Check the comments in the original file?
Which comment?
OK, thanks. I interpret that as: we want to disable the split because we mostly run on aligned memory.
@llyfacebook are you aware of AVX512 machines that don't support AVX2?
No, an AVX512 machine should support AVX2 and AVX, and an AVX2 machine supports AVX.
I think I agree that I need to get rid of the CPU_NO_AVX256_SPLIT_FLAGS definition here, since it is already defined and we always build for all CPU_CAPABILITIES.
I guess mostly only the fbgemm lib will use it and handle the dynamic checking.
33201bf to 436fef3
436fef3 to 61a6ebb
8bcdca8 to cc8c37c
.jenkins/pytorch/test.sh
Outdated
Hi @kostmo, I am still not quite sure how CI guarantees that the allocated machine is consistent with the build environment. ATEN_CPU_CAPABILITY is used to override the runtime CPU detection, so what happens if the final machine does not have the AVX512 instructions? Can you comment on this? Thanks.
If the final machine does not have AVX512 instructions, then you are SOL. If that's true, we don't have that much flexibility with the machines that CircleCI provides us, but we can surface to them that we need AVX512; this may be something they can help us with.
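One way to fail early rather than crash on an illegal instruction is to check the CPU flags before forcing a capability. The sketch below is an assumption-laden illustration, not code from this PR: the helper names are hypothetical, it assumes Linux `/proc/cpuinfo`, and it treats `avx512f` as a stand-in for the `avx512` level even though real kernels may need further AVX512 subsets.

```shell
#!/usr/bin/env bash
# Sketch only: succeed if the CPU advertises the feature assumed to be
# required for the requested capability level.
cpu_supports_capability() {
  local capability="$1" cpuinfo="${2:-/proc/cpuinfo}" flag
  case "$capability" in
    default) return 0 ;;
    avx)     flag=avx ;;
    avx2)    flag=avx2 ;;
    avx512)  flag=avx512f ;;  # assumption: avx512f as a proxy for the level
    *)       return 2 ;;      # unknown capability name
  esac
  grep -m1 '^flags' "$cpuinfo" | tr ' ' '\n' | grep -qx "$flag"
}

# Hypothetical guard: refuse to force a level the machine cannot execute.
guard_forced_capability() {
  local cpuinfo="${1:-/proc/cpuinfo}"
  if [[ -n "${ATEN_CPU_CAPABILITY:-}" ]] &&
     ! cpu_supports_capability "$ATEN_CPU_CAPABILITY" "$cpuinfo"; then
    echo "CPU lacks support for ATEN_CPU_CAPABILITY=${ATEN_CPU_CAPABILITY}" >&2
    return 1
  fi
}
```

A test script could call `guard_forced_capability || exit 1` right after exporting ATEN_CPU_CAPABILITY, so a mismatched machine shows up as a clear CI error.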
Thanks. Whom can I contact regarding this requirement? Or do we have an outside PoC?
Um, I don't think adding a YES_ flag is the correct approach here, seeing as that's not the convention so far and having flags of different parity is confusing. Let's make AVX512 the default and only have NO_* flags?
Can you suggest one then? We need one flag to enable AVX512. Actually, the original setting only enables Default and AVX.
Right now we have both a flag to enable AVX512 and a flag to disable AVX512. Can you please explain the current behavior?
The logic is not that straightforward here, and I didn't intend to change it. You need to check the test.sh here.
if [[ "${BUILD_ENVIRONMENT}" == *-NO_AVX-* ]]; then
  export ATEN_CPU_CAPABILITY=default
elif [[ "${BUILD_ENVIRONMENT}" == *-NO_AVX2-* ]]; then
  export ATEN_CPU_CAPABILITY=avx
elif [[ "${BUILD_ENVIRONMENT}" == *-NO_AVX512-* ]]; then
  export ATEN_CPU_CAPABILITY=avx2
elif [[ "${BUILD_ENVIRONMENT}" == *-YES_AVX512-* ]]; then
  export ATEN_CPU_CAPABILITY=avx512
fi
We use the default if "-NO_AVX-" appears in the environment string.
We use AVX if "-NO_AVX2-" appears in the environment string.
This means selecting a given level always requires a flag that names the level above it.
There might be some contract between the name and the machine/OS allocation. Will check it out.
Those strings enforce an upper bound on the capability that's being run. By default we'll run the highest the machine supports. Thus, no YES_XXX variants were needed, since that's the default.
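Under that upper-bound reading, the selection logic needs only the NO_* branches. A minimal sketch follows, assuming the same `BUILD_ENVIRONMENT` patterns as the test.sh excerpt quoted earlier in this thread; the function name `capability_ceiling` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: print the capability ceiling implied by a build-environment
# string, or nothing when the string imposes no ceiling.
capability_ceiling() {
  local build_env="$1"
  if [[ "$build_env" == *-NO_AVX-* ]]; then
    echo default
  elif [[ "$build_env" == *-NO_AVX2-* ]]; then
    echo avx
  elif [[ "$build_env" == *-NO_AVX512-* ]]; then
    echo avx2
  fi
}

# Export an override only when a ceiling was requested. Otherwise leave the
# variable alone so runtime detection picks the highest supported level.
ceiling="$(capability_ceiling "${BUILD_ENVIRONMENT:-}")"
if [[ -n "$ceiling" ]]; then
  export ATEN_CPU_CAPABILITY="$ceiling"
fi
```

With no NO_* marker in the string, nothing is exported, which corresponds to "run the highest the machine supports" and removes the need for a YES_AVX512 branch.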
First, getting rid of YES_XXX is fine with me, and I am inclined to do that too. The critical thing is that we need to be clear about how the machine is allocated according to the string (unless all the medium machines have AVX512 support). Basically, we want coverage of running the binary on AVX512, AVX2, AVX, and Default CPUs.
cmake/Codegen.cmake
Outdated
Also commented on this from your avg_pool diff. Shouldn't -mavx already be added by the code at line 98 above?
cc8c37c to 930467b
930467b to 3ce1ed4
Recapping internal discussion: CircleCI machines don't support AVX512, so we need some other way to test. Our preference is to ask CircleCI to get AVX512 (not yet done), and in the meantime maybe check whether somewhere else supports AVX512 (maybe Azure?).
xsimd uses an emulator to resolve this issue: |
Did we figure out any way forward on this? Is it just blocked indefinitely because of the testing issue?
Yes, this is stuck because of testing.
Summary: We already have some optimized implementations that use AVX2 to improve quantized kernel performance. In this diff, we want to enable the runtime dispatch.
Test Plan: Sandcastle build and test. Also test with a Python binary calling into a vectorized op.
torch.__config__.show()
PyTorch built with:
- GCC 4.2
- clang 8.0.20181009
- Intel(R) Math Kernel Library Version 2017.0.3 Product Build 20170413 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v0.18.1 (Git Hash N/A)
- OpenMP 1
- **CPU capability usage: AVX2**
- Build settings:
Differential Revision: D17337251
fbshipit-source-id: 8a69d204e11a6f6436e34d624f1768894a5d5697
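To repeat this check in a script, the "CPU capability usage" entry can be pulled out of the `torch.__config__.show()` text. The helper below is a hypothetical sketch that assumes the output keeps the "CPU capability usage: <LEVEL>" wording shown in the test plan.

```shell
#!/usr/bin/env bash
# Sketch only: read configuration text on stdin and print the reported CPU
# capability (for example AVX2). Prints nothing if the entry is absent.
reported_cpu_capability() {
  sed -n 's/.*CPU capability usage: \([A-Za-z0-9_]*\).*/\1/p' | head -n 1
}

# Example usage (requires a Python environment with torch installed):
#   python -c 'import torch; print(torch.__config__.show())' | reported_cpu_capability
```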
3ce1ed4 to e424b81
This pull request was exported from Phabricator. Differential Revision: D17337251
This pull request has been merged in 09296c3.
This PR does not work on windows as |
The file |
Thanks. I added this feature and test mainly for our internal usage, since the dynamic dispatch already existed in OSS. Let me add the test in CMakefile.
@lly-zero-one we are building most tests through globs on the windows side internally (waay downstream), but relying on CircleCI for windows land signals for devs. So it is very important we make sure tests that are not in |