Conversation

@lly-zero-one (Contributor)

Summary: We already have some optimized implementations using AVX2 to improve quantized kernel performance. In this diff, we want to enable runtime dispatch.

Test Plan: Sandcastle build and test

Differential Revision: D17337251

@jamesr66a (Collaborator)

We should also make sure the CI build configs cover the correct dispatch keys

@lly-zero-one (Contributor Author) commented Sep 12, 2019

@ezyang , can you let me know how I can change the CI build configs?

@lly-zero-one lly-zero-one requested a review from ezyang September 12, 2019 21:34
@dzhulgakov (Collaborator)

Btw, shall we cover AVX512 VNNI too? From what I understand, those instructions are actually very important for fast quantized kernels without saturation. Not sure whether it should be a separate flag or we can fold it into regular AVX512 (in which case we'd skip AVX512 on Skylake).

@ezyang (Contributor) commented Sep 13, 2019

First, you'll need to check that the CircleCI machines actually support AVX512. Probably the easiest way is to "Rerun with SSH" one of the jobs, ssh in and then poke around to figure out the support.

Then, selection of particular kernels is done by way of ATEN_CPU_CAPABILITY in .jenkins/pytorch/test.sh. You need to edit the CircleCI config to add another job for AVX512. If you want someone to walk you through the CI scripts in person, @kostmo is probably a good person to go find.
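For the "poke around" step, one way to check SIMD support over SSH is to look at the CPU flags (a sketch; the sample flags line below stands in for a real /proc/cpuinfo):

```shell
# Check which SIMD levels a machine's CPU reports. Over SSH you would
# read /proc/cpuinfo; here a sample flags line stands in for it.
flags="fpu vme sse sse2 avx avx2 fma"   # e.g. from: grep -m1 '^flags' /proc/cpuinfo
for level in avx avx2 avx512f; do
  if printf '%s\n' $flags | grep -qx "$level"; then
    echo "$level: supported"
  else
    echo "$level: NOT supported"
  fi
done
```

On a machine with AVX512 support, the flags line would contain avx512f (plus extensions such as avx512dq).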

@jamesr66a (Collaborator)

From offline discussion with @dzhulgakov:

We haven't seen a use case for the instructions in AVX512 VNNI yet (they seem better suited to be supported in the FBGEMM kernels directly), so let's hold off on adding support for those.

Collaborator

Is this section needed? Wouldn't it be covered by the code at line 109 above?

Contributor Author

But that's only set for the AVX2 case, no?

Collaborator

Is there a reason we want to disable split? It looks like it is beneficial on some CPU models but not on others. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49089

Contributor Author

Check the comments in the original file?

Collaborator

Which comment?

Collaborator

OK, thanks. I interpret that as: we want to disable split because we mostly run on aligned memory.

Collaborator

@llyfacebook are you aware of AVX512 machines that don't support AVX2?

Contributor Author

No, an AVX512 machine should support AVX2 and AVX, and an AVX2 machine supports AVX.

I think I agree that I need to get rid of the CPU_NO_AVX256_SPLIT_FLAGS definition here, since it is already defined and we always build for all CPU_CAPABILITIES.

@lly-zero-one (Contributor Author)

Btw, shall we cover AVX512 VNNI too? From what I understand they are actually very important for fast quantized kernels without saturation. Not sure whether it should be a separate flag or we can fold it with regular AVX512 (in which case we'd skip 512 on SkyLake)

Mostly only the fbgemm lib will use it and handle the dynamic checking, I guess.

@lly-zero-one lly-zero-one force-pushed the export-D17337251 branch 2 times, most recently from 8bcdca8 to cc8c37c Compare September 18, 2019 06:58
@pytorchbot pytorchbot added the module: ci Related to continuous integration label Sep 18, 2019
Contributor Author

Hi @kostmo, I am still not quite sure how CI guarantees that the allocated machine is consistent with the build environment. ATEN_CPU_CAPABILITY is used to override the runtime CPU detection, so what if the final machine does not have the AVX512 instructions? Can you comment on this? Thanks.

Contributor

If the final machine does not have AVX512 instructions, then you are SOL. We don't have that much flexibility with the machines that CircleCI provides us, but we can surface to them that we need AVX512; this may be something they can help us with.

Contributor Author

Thanks. Whom can I contact regarding this requirement? Or do we have a point of contact outside?

Collaborator

Um, I don't think adding a YES_ flag is the correct approach here, seeing as that's not the convention so far, and having flags of different parity is confusing. Let's make AVX512 the default and only have NO_* flags?

Contributor Author

Can you suggest one then? We need one flag to enable AVX512. Actually, the original setting only enables default and AVX.

Collaborator

Right now we have both a flag to enable AVX512 and a flag to disable AVX512. Can you please explain the current behavior?

@lly-zero-one (Contributor Author) Sep 18, 2019

The logic is not that straightforward here, and I didn't intend to change it. You need to check test.sh here:

if [[ "${BUILD_ENVIRONMENT}" == *-NO_AVX-* ]]; then
  export ATEN_CPU_CAPABILITY=default
elif [[ "${BUILD_ENVIRONMENT}" == *-NO_AVX2-* ]]; then
  export ATEN_CPU_CAPABILITY=avx
elif [[ "${BUILD_ENVIRONMENT}" == *-NO_AVX512-* ]]; then
  export ATEN_CPU_CAPABILITY=avx2
elif [[ "${BUILD_ENVIRONMENT}" == *-YES_AVX512-* ]]; then
  export ATEN_CPU_CAPABILITY=avx512
fi

We use "default" if "-NO_AVX-" appears in the environment string, and "avx" if "-NO_AVX2-" appears. That is, each NO_XXX flag caps the capability at one level below XXX.

There might be some contract between the name and machine/os allocation. Will check it out.
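For illustration, the mapping in the snippet above can be exercised standalone (a sketch; the authoritative logic is the if/elif chain in .jenkins/pytorch/test.sh):

```shell
# Map a BUILD_ENVIRONMENT string to the ATEN_CPU_CAPABILITY it implies
# (mirrors the if/elif chain in .jenkins/pytorch/test.sh).
capability_for() {
  case "$1" in
    *-NO_AVX-*)     echo default ;;
    *-NO_AVX2-*)    echo avx ;;
    *-NO_AVX512-*)  echo avx2 ;;
    *-YES_AVX512-*) echo avx512 ;;
    *)              echo "(machine default)" ;;
  esac
}

capability_for "pytorch-linux-xenial-NO_AVX2-test"    # -> avx
capability_for "pytorch-linux-xenial-NO_AVX512-test"  # -> avx2
```

Note that "-NO_AVX2-" does not match the "*-NO_AVX-*" pattern (the literal substring "-NO_AVX-" never occurs), so the branch order works as intended.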

Collaborator

Those strings enforce an upper bound on the capability that's being run. By default we'll run the highest the machine supports. Thus, no YES_XXX variants were needed, since that's the default.

@lly-zero-one (Contributor Author) Sep 19, 2019

First, getting rid of YES_XXX is fine by me, and I also tend to do that. The critical thing is that we need to be clear about how the machine is allocated according to the string (unless all the medium machines have AVX512 support). Basically, we want coverage of running the binary on AVX512, AVX2, AVX, and default CPUs.

Collaborator

Also commented on this on your avg_pool diff. Shouldn't -mavx already be added by the code at line 98 above?

Collaborator

@llyfacebook are you aware of AVX512 machines that don't support AVX2?

@ezyang (Contributor) commented Sep 20, 2019

Recapping internal discussion: CircleCI machines don't support AVX512, so we need some other way to test.

Our preference is to ask CircleCI for AVX512 machines (not yet done), and in the meantime maybe check whether somewhere else supports AVX512 (maybe Azure?).

@lly-zero-one lly-zero-one added this to the 1.3 milestone Sep 24, 2019
@xuhdev (Collaborator) commented Sep 25, 2019

@jamesr66a jamesr66a removed this from the 1.3 milestone Sep 30, 2019
@jamesr66a (Collaborator)

Did we figure out any way forward on this? Is it just blocked indefinitely because of the testing issue?

@ezyang (Contributor) commented Jan 21, 2020

Yes, this is stuck because of testing.

Summary: We already have some optimized implementations using AVX2 to improve quantized kernel performance. In this diff, we want to enable runtime dispatch.

Test Plan:
Sandcastle build and test

Also tested with a Python binary calling into a vectorized op.

torch.__config__.show()
PyTorch built with:
  - GCC 4.2
  - clang 8.0.20181009
  - Intel(R) Math Kernel Library Version 2017.0.3 Product Build 20170413 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.18.1 (Git Hash N/A)
  - OpenMP 1
  - **CPU capability usage: AVX2**
  - Build settings:

Differential Revision: D17337251

fbshipit-source-id: 8a69d204e11a6f6436e34d624f1768894a5d5697
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D17337251

@lly-zero-one lly-zero-one changed the title Add the build for runtime dispatch for AVX, AVX2 and AVX512 instruction set Add the build for runtime dispatch for AVX, AVX2 instruction set Feb 19, 2020
@facebook-github-bot (Contributor)

This pull request has been merged in 09296c3.

@EscapeZero (Contributor)

This PR does not work on Windows, as setenv is not an available function there; it should be #ifdef'd to use _putenv. How did this get past CircleCI? @peterjc123

@peterjc123 (Collaborator)

This PR does not work on windows as setenv is not an available function. Should be ifdef to use _putenv how did this get past CircleCI? @peterjc123

The file test/cpp/api/dispatch.cpp is not listed in CMakeLists.txt (https://github.com/pytorch/pytorch/blob/master/test/cpp/api/CMakeLists.txt), so it is actually not tested.

@lly-zero-one (Contributor Author)

Thanks. I added this feature and test mainly for our internal usage, since the dynamic dispatch already existed in OSS. Let me add the test to CMakeLists.txt.

@EscapeZero (Contributor)

@lly-zero-one we are building most tests through globs on the Windows side internally (way downstream), but relying on CircleCI for Windows land signals for devs. So it is very important that we make sure tests that are not in fb-specific folders go through the normal CMake testing flow.

Labels: Merged, module: build (Build system issues), module: ci (Related to continuous integration)