Conversation

@lly-zero-one (Contributor)

Summary: We already have some optimized implementations using AVX2 to improve quantized kernel performance. In this diff, we want to enable runtime dispatch.

Test Plan: Sandcastle build and test

Differential Revision: D17337251

@jamesr66a (Collaborator)

We should also make sure the CI build configs cover the correct dispatch keys

@lly-zero-one (Contributor Author) commented Sep 12, 2019

@ezyang , can you let me know how I can change the CI build configs?

@lly-zero-one lly-zero-one requested a review from ezyang September 12, 2019 21:34
@dzhulgakov (Collaborator)

Btw, shall we cover AVX512 VNNI too? From what I understand, those instructions are actually very important for fast quantized kernels without saturation. Not sure whether it should be a separate flag or we can fold it into regular AVX512 (in which case we'd skip AVX512 on Skylake).

@ezyang (Contributor) commented Sep 13, 2019

First, you'll need to check that the CircleCI machines actually support AVX512. Probably the easiest way is to "Rerun with SSH" one of the jobs, ssh in and then poke around to figure out the support.

Then, selection of particular kernels is done by way of ATEN_CPU_CAPABILITY in .jenkins/pytorch/test.sh. You need to edit the CircleCI config to add another job for AVX512. If you want someone to walk you through the CI scripts in person, @kostmo is probably a good person to go find.
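For the "poke around" step, one way to check SIMD support over SSH is to look at the CPU flags (a sketch; the sample flags line below stands in for a real /proc/cpuinfo):

```shell
# Check which SIMD levels a machine's CPU reports. Over SSH you would
# read /proc/cpuinfo; here a sample flags line stands in for it.
flags="fpu vme sse sse2 avx avx2 fma"   # e.g. from: grep -m1 '^flags' /proc/cpuinfo
for level in avx avx2 avx512f; do
  if printf '%s\n' $flags | grep -qx "$level"; then
    echo "$level: supported"
  else
    echo "$level: NOT supported"
  fi
done
```

On a machine with AVX512 support, the flags line would contain avx512f (plus extensions such as avx512dq).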

@jamesr66a (Collaborator)

From offline discussion with @dzhulgakov:

We haven't seen a use case for the instructions in AVX512 VNNI yet (they seem better suited to be supported in the FBGEMM kernels directly), so let's hold off on adding support for those.

Collaborator

Is this section needed? Wouldn't it be covered by the code at line 109 above?

Contributor Author

But that's only set for the AVX2 case, no?

Collaborator

Is there a reason we want to disable split? It looks like it is beneficial on some CPU models but not on others. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49089

Contributor Author

Check the comments in the original file?

Collaborator

Which comment?

Collaborator

OK, thanks. I interpret that as: we want to disable split because we mostly run on aligned memory.

Collaborator

@llyfacebook are you aware of AVX512 machines that don't support AVX2?

Contributor Author

No, an AVX512 machine should support AVX2 and AVX, and an AVX2 machine supports AVX.

I think I agree that I need to get rid of the CPU_NO_AVX256_SPLIT_FLAGS definition here, since it is already defined and we always build for all CPU_CAPABILITIES.

@lly-zero-one (Contributor Author)

Btw, shall we cover AVX512 VNNI too? From what I understand they are actually very important for fast quantized kernels without saturation. Not sure whether it should be a separate flag or we can fold it with regular AVX512 (in which case we'd skip 512 on SkyLake)

Mostly only the fbgemm lib will use it and handle the dynamic checking, I guess.

@lly-zero-one lly-zero-one force-pushed the export-D17337251 branch 2 times, most recently from 8bcdca8 to cc8c37c Compare September 18, 2019 06:58
@pytorchbot pytorchbot added the module: ci Related to continuous integration label Sep 18, 2019
Contributor Author

Hi @kostmo, I am still not quite sure how CI guarantees that the allocated machine is consistent with the build environment. ATEN_CPU_CAPABILITY is used to override the runtime CPU detection, so what if the final machine does not have the AVX512 instructions? Can you comment on this? Thanks.

Contributor

If the final machine does not have AVX512 instructions, then you are SOL. We don't have that much flexibility with the machines that CircleCI provides us, but we can surface to them that we need AVX512; this may be something they can help us with.

Contributor Author

Thanks. Whom can I contact regarding this requirement? Or do we have a point of contact outside?

Collaborator

Um, I don't think adding a YES_ flag is the correct approach here, seeing as that's not the convention so far, and having flags of different parity is confusing. Let's make AVX512 the default and only have NO_* flags?

Contributor Author

Can you suggest one then? We need one flag to enable AVX512. Actually, the original setting only enables default and AVX.

Collaborator

Right now we have both a flag to enable AVX512 and a flag to disable AVX512. Can you please explain the current behavior?

@lly-zero-one (Contributor Author) Sep 18, 2019

The logic is not that straightforward here, and I didn't intend to change it. You need to check test.sh here:

if [[ "${BUILD_ENVIRONMENT}" == *-NO_AVX-* ]]; then
  export ATEN_CPU_CAPABILITY=default
elif [[ "${BUILD_ENVIRONMENT}" == *-NO_AVX2-* ]]; then
  export ATEN_CPU_CAPABILITY=avx
elif [[ "${BUILD_ENVIRONMENT}" == *-NO_AVX512-* ]]; then
  export ATEN_CPU_CAPABILITY=avx2
elif [[ "${BUILD_ENVIRONMENT}" == *-YES_AVX512-* ]]; then
  export ATEN_CPU_CAPABILITY=avx512
fi

We use "default" if "-NO_AVX-" appears in the environment string, and "avx" if "-NO_AVX2-" appears. That is, each NO_XXX flag caps the capability at one level below XXX.

There might be some contract between the name and machine/os allocation. Will check it out.
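For illustration, the mapping in the snippet above can be exercised standalone (a sketch; the authoritative logic is the if/elif chain in .jenkins/pytorch/test.sh):

```shell
# Map a BUILD_ENVIRONMENT string to the ATEN_CPU_CAPABILITY it implies
# (mirrors the if/elif chain in .jenkins/pytorch/test.sh).
capability_for() {
  case "$1" in
    *-NO_AVX-*)     echo default ;;
    *-NO_AVX2-*)    echo avx ;;
    *-NO_AVX512-*)  echo avx2 ;;
    *-YES_AVX512-*) echo avx512 ;;
    *)              echo "(machine default)" ;;
  esac
}

capability_for "pytorch-linux-xenial-NO_AVX2-test"    # -> avx
capability_for "pytorch-linux-xenial-NO_AVX512-test"  # -> avx2
```

Note that "-NO_AVX2-" does not match the "*-NO_AVX-*" pattern (the literal substring "-NO_AVX-" never occurs), so the branch order works as intended.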

Collaborator

Those strings enforce an upper bound on the capability that's being run. By default we'll run the highest the machine supports. Thus, no YES_XXX variants were needed, since that's the default.

@lly-zero-one (Contributor Author) Sep 19, 2019

First, getting rid of YES_XXX is fine by me, and I also tend to do that. The critical thing is that we need to be clear about how the machine is allocated according to the string (unless all the medium machines have AVX512 support). Basically, we want coverage of running the binary on AVX512, AVX2, AVX, and default CPUs.

Collaborator

Also commented on this on your avg_pool diff. Shouldn't -mavx already be added by the code at line 98 above?

Collaborator

@llyfacebook are you aware of AVX512 machines that don't support AVX2?

@ezyang (Contributor) commented Sep 20, 2019

Recapping internal discussion: CircleCI machines don't support AVX512, so we need some other way to test.

Our preference is to ask CircleCI for AVX512 machines (not yet done), and in the meantime maybe check whether somewhere else supports AVX512 (maybe Azure?).

@lly-zero-one lly-zero-one added this to the 1.3 milestone Sep 24, 2019
@xuhdev (Collaborator) commented Sep 25, 2019

@jamesr66a jamesr66a removed this from the 1.3 milestone Sep 30, 2019
@jamesr66a (Collaborator)

Did we figure out any way forward on this? Is it just blocked indefinitely because of the testing issue?

@ezyang (Contributor) commented Jan 21, 2020

Yes, this is stuck because of testing.

Summary: We already have some optimized implementations using AVX2 to improve quantized kernel performance. In this diff, we want to enable runtime dispatch.

Test Plan:
Sandcastle build and test

Also tested with a Python binary calling into a vectorized op.

torch.__config__.show()
PyTorch built with:
  - GCC 4.2
  - clang 8.0.20181009
  - Intel(R) Math Kernel Library Version 2017.0.3 Product Build 20170413 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.18.1 (Git Hash N/A)
  - OpenMP 1
  - **CPU capability usage: AVX2**
  - Build settings:

Differential Revision: D17337251

fbshipit-source-id: 8a69d204e11a6f6436e34d624f1768894a5d5697
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D17337251

@lly-zero-one lly-zero-one changed the title Add the build for runtime dispatch for AVX, AVX2 and AVX512 instruction set Add the build for runtime dispatch for AVX, AVX2 instruction set Feb 19, 2020
@facebook-github-bot (Contributor)

This pull request has been merged in 09296c3.

@EscapeZero (Contributor)

This PR does not work on Windows, as setenv is not an available function there; it should be #ifdef'd to use _putenv. How did this get past CircleCI? @peterjc123

@peterjc123 (Collaborator)

This PR does not work on windows as setenv is not an available function. Should be ifdef to use _putenv how did this get past CircleCI? @peterjc123

The file test/cpp/api/dispatch.cpp is not listed in CMakeLists.txt (https://github.com/pytorch/pytorch/blob/master/test/cpp/api/CMakeLists.txt), so it is actually not tested.

@lly-zero-one (Contributor Author)

Thanks. I added this feature and test mainly for our internal usage, since the dynamic dispatch already existed in OSS. Let me add the test to CMakeLists.txt.

@EscapeZero (Contributor)

@lly-zero-one we are building most tests through globs on the Windows side internally (way downstream), but relying on CircleCI for Windows land signals for devs. So it is very important that we make sure tests that are not in fb-specific folders go through the normal CMake testing flow.

Labels: Merged, module: build (Build system issues), module: ci (Related to continuous integration)