build with mkl-dnn by default #13303

Closed (4 commits proposed for merge)

gujinghui
Collaborator

build with mkl-dnn by default

@gujinghui
Collaborator Author

@Jianhui-Li @jgong5

@soumith
Member

soumith commented Oct 30, 2018

[E ideep_operator.h:62] IDEEP error:could not initialize a memory descriptor
Oct 30 09:25:58 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/tmp/ccfAyF9m.s: Assembler messages:\n/tmp/ccfAyF9m.s:424: Error: no such instruction: `vextracti32x8 $0x1,%zmm0,%ymm0\'\n/tmp/ccfAyF9m.s:426: Error: no such instruction: `vpmullq %zmm0,%zmm1,%zmm0\'\n/tmp/ccfAyF9m.s:427: Error: no such instruction: `vpmullq %zmm0,%zmm2,%zmm2\'\n/tmp/ccfAyF9m.s:431: Error: no such instruction: `vpmullq %zmm0,%zmm2,%zmm0\'\n/tmp/ccfAyF9m.s:434: Error: no such instruction: `vpmullq %zmm2,%zmm0,%zmm2\'\n/tmp/ccfAyF9m.s:437: Error: no such instruction: `vpmullq %zmm0,%zmm2,%zmm2\'\n/tmp/ccfAyF9m.s:552: Error: no such instruction: `vextracti32x8 $0x1,%zmm0,%ymm1\'\n/tmp/ccfAyF9m.s:555: Error: no such instruction: `vpmullq %zmm0,%zmm1,%zmm0\'\n/tmp/ccfAyF9m.s:556: Error: no such instruction: `vpmullq %zmm0,%zmm2,%zmm2\'\n/tmp/ccfAyF9m.s:562: Error: no such instruction: `vpmullq %zmm2,%zmm0,%zmm2\'\n/tmp/ccfAyF9m.s:565: Error: no such instruction: `vpmullq %zmm2,%zmm0,%zmm2\'\n/tmp/ccfAyF9m.s:568: Error: no such instruction: `vpmullq %zmm2,%zmm0,%zmm0\'\n/tmp/ccfAyF9m.s:679: Error: no such instruction: `vextracti32x8 $0x1,%zmm0,%ymm1\'\n/tmp/ccfAyF9m.s:682: Error: no such instruction: `vpmullq %zmm0,%zmm1,%zmm0\'\n/tmp/ccfAyF9m.s:683: Error: no such instruction: `vpmullq %zmm0,%zmm2,%zmm2\'\n/tmp/ccfAyF9m.s:688: Error: no such instruction: `vpmullq %zmm2,%zmm1,%zmm1\'\n/tmp/ccfAyF9m.s:690: Error: no such instruction: `vpmullq %zmm1,%zmm0,%zmm1\'\n/tmp/ccfAyF9m.s:693: Error: no such instruction: `vpmullq %zmm1,%zmm0,%zmm0\'\n" }
Oct 30 09:25:47 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/tmp/ccjDDAUX.s: Assembler messages:\n/tmp/ccjDDAUX.s:598: Error: operand size mismatch for `vmovdqu64\'\n/tmp/ccjDDAUX.s:621: Error: operand size mismatch for `vmovdqu64\'\n/tmp/ccjDDAUX.s:1157: Error: operand size mismatch for `vmovdqu64\'\n/tmp/ccjDDAUX.s:3545: Error: operand type mismatch for `vxorps\'\n/tmp/ccjDDAUX.s:3560: Error: operand type mismatch for `vxorps\'\n/tmp/ccjDDAUX.s:5656: Error: no such instruction: `vpermi2w %zmm17,%zmm15,%zmm18\'\n/tmp/ccjDDAUX.s:5671: Error: no such instruction: `vpmovwb %zmm18,%ymm3\'\n/tmp/ccjDDAUX.s:5672: Error: no such instruction: `vpermi2w %zmm19,%zmm16,%zmm4\'\n/tmp/ccjDDAUX.s:5673: Error: no such instruction: `vpmovwb %zmm4,%ymm4\'\n/tmp/ccjDDAUX.s:6736: Error: no such instruction: `vpermi2w %zmm6,%zmm4,%zmm2\'\n/tmp/ccjDDAUX.s:6741: Error: no such instruction: `vpmovwb %zmm2,%ymm2\'\n/tmp/ccjDDAUX.s:6754: Error: no such instruction: `vpermi2w %zmm7,%zmm5,%zmm16\'\n/tmp/ccjDDAUX.s:6755: Error: no such instruction: `vpmovwb %zmm16,%ymm1\'\n/tmp/ccjDDAUX.s:7899: Error: operand size mismatch for `vmovdqu64\'\n/tmp/ccjDDAUX.s:7916: Error: operand size mismatch for `vmovdqu64\'\n/tmp/ccjDDAUX.s:10975: Error: operand size mismatch for `vmovdqu64\'\n/tmp/ccjDDAUX.s:11530: Error: operand size mismatch for `vmovdqa64\'\n/tmp/ccjDDAUX.s:11614: Error: operand size mismatch for `vmovdqa64\'\n/tmp/ccjDDAUX.s:11722: Error: operand size mismatch for `vmovdqa64\'\n/tmp/ccjDDAUX.s:11803: Error: operand size mismatch for `vmovdqa64\'\n/tmp/ccjDDAUX.s:15557: Error: operand size mismatch for `vmovdqa64\'\n" }
Oct 30 09:25:47 

There are many other failures across CI. Maybe iDeep is not ready to be enabled by default yet?

@gujinghui
Collaborator Author

@soumith
Trying to root-cause these issues. :)

The first error (the IDEEP memory descriptor failure) should be caused by the ideep test case copy_op_test. I will fix it soon.

The other compilation errors, based on our earlier investigation, are likely caused by the build environment used when building mkl-dnn, for example the binutils version. Could you tell me who can help with this kind of issue?
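
To see at a glance which instructions the assembler rejected, the sccache output can be reduced to its distinct mnemonics. The following is a minimal sketch, not part of the PyTorch CI scripts; the function name `unknown_mnemonics` and the file name `build.log` are hypothetical.

```shell
# Hypothetical helper (not part of the PyTorch CI scripts): reads assembler
# output on stdin and prints each distinct mnemonic that was reported as
# "no such instruction", one per line.
unknown_mnemonics() {
  grep -o 'no such instruction: `[a-z0-9]*' |
    sed 's/.*`//' |
    sort -u
}

# Example use (build.log is a placeholder file name):
#   unknown_mnemonics < build.log
```

Applied to the two log excerpts pasted earlier in this thread, this lists vextracti32x8, vpmullq, vpermi2w and vpmovwb. All of them are AVX-512 instructions, which is consistent with an assembler that is too old to know them.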

@soumith
Member

soumith commented Oct 30, 2018

@gujinghui for these cases, you can reproduce the build environment using shell scripts in https://github.com/pytorch/pytorch/tree/master/.jenkins/pytorch

I have just pasted the errors from the logs of the continuous build (you can press the "Details" button in the entries below that say "ci/circleci: caffe2_onnx_py2_gcc5_ubuntu16_04_test", etc.).

@gujinghui
Collaborator Author

@soumith
Thanks for the reply.

The test cases that hit the failure below have been modified here.

[E ideep_operator.h:62] IDEEP error:could not initialize a memory descriptor

The reason is that zero dim is not yet supported in either ideep or mkl-dnn,
so I have to skip these cases for now to avoid the failures.

The upcoming upgrade will add zero dim support,
and the cases will be re-enabled after that upgrade.

Besides that, three kinds of failures remain in this PR:

  1. Illegal instructions during compilation or in test cases.
Oct 30 09:25:58 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/tmp/ccfAyF9m.s: Assembler messages:\n/tmp/ccfAyF9m.s:424: Error: no such instruction: `vextracti32x8 $0x1,%zmm0,%ymm0\'\n/tmp/ccfAyF9m.s:426: Error: no such instruction: `vpmullq %zmm0,%zmm1,%zmm0\'\n/tmp/ccfAyF9m.s:427: Error: no such instruction: `vpmullq %zmm0,%zmm2,%zmm2\'\n/tmp/ccfAyF9m.s:431: Error: no such instruction: `vpmullq %zmm0,%zmm2,%zmm0\'\n/tmp/ccfAyF9m.s:434: Error: no such instruction: `vpmullq %zmm2,%zmm0,%zmm2\'\n/tmp/ccfAyF9m.s:437: Error: no such instruction: `vpmullq %zmm0,%zmm2,%zmm2\'\n/tmp/ccfAyF9m.s:552: Error: no such instruction: `vextracti32x8 $0x1,%zmm0,%ymm1\'\n/tmp/ccfAyF9m.s:555: Error: no such instruction: `vpmullq %zmm0,%zmm1,%zmm0\'\n/tmp/ccfAyF9m.s:556: Error: no such instruction: `vpmullq %zmm0,%zmm2,%zmm2\'\n/tmp/ccfAyF9m.s:562: Error: no such instruction: `vpmullq %zmm2,%zmm0,%zmm2\'\n/tmp/ccfAyF9m.s:565: Error: no such instruction: `vpmullq %zmm2,%zmm0,%zmm2\'\n/tmp/ccfAyF9m.s:568: Error: no such instruction: `vpmullq %zmm2,%zmm0,%zmm0\'\n/tmp/ccfAyF9m.s:679: Error: no such instruction: `vextracti32x8 $0x1,%zmm0,%ymm1\'\n/tmp/ccfAyF9m.s:682: Error: no such instruction: `vpmullq %zmm0,%zmm1,%zmm0\'\n/tmp/ccfAyF9m.s:683: Error: no such instruction: `vpmullq %zmm0,%zmm2,%zmm2\'\n/tmp/ccfAyF9m.s:688: Error: no such instruction: `vpmullq %zmm2,%zmm1,%zmm1\'\n/tmp/ccfAyF9m.s:690: Error: no such instruction: `vpmullq %zmm1,%zmm0,%zmm1\'\n/tmp/ccfAyF9m.s:693: Error: no such instruction: `vpmullq %zmm1,%zmm0,%zmm0\'\n" }

I'm trying to reproduce this locally, but I still suspect it is a build environment issue.
Once I have solid evidence, I will report back here.

  2. libmkldnn.dylib is not found on macOS.
    This is probably a FindMKLDNN.cmake issue, since that file provides the library path for libmkldnn.
    The library path needs to be corrected for macOS.

  3. A crash due to a wrong destructor of mkldnn_engine.
    Still debugging. I will post any updates here.

@kleisauke

FYI: Building PyTorch (using CMake's ExternalProject) with the -DUSE_MKLDNN=ON flag causes linkage errors for me (after deleting the build directory):

make[2]: *** No rule to make target 'third_party/pytorch/src/pytorch-build/lib/libmkldnn.so', needed by '../lib/libtest.so.5.0.0'.  Stop.
make[1]: *** [CMakeFiles/Makefile2:109: src/api/CMakeFiles/test.dir/all] Error 2
make: *** [Makefile:95: all] Error 2

It seems that libmkldnn.so is installed into the build folder (e.g. /home/pytorch-test/build/third_party/pytorch/src/pytorch-build/lib/libmkldnn.so), see:

LIST(APPEND MKLDNN_LIBRARIES "${PROJECT_BINARY_DIR}/lib/${MKLDNN_LIB}")

Maybe MKL-DNN needs to be installed in CMAKE_INSTALL_PREFIX?

@gujinghui
Collaborator Author

What's your build command? Do you have the full build log?

@kleisauke

Build log: https://gist.github.com/kleisauke/00df397ffea14b353fbdf125a01fe150#file-build-log
Build flags: https://github.com/kleisauke/pytorch-mkldnn-issue/blob/master/third_party/pytorch/CMakeLists.txt#L5-L29
Reproduction repo: https://github.com/kleisauke/pytorch-mkldnn-issue

From the build log I see libmkldnn.so is correctly installed:

-- Set runtime path of "/usr/local/lib/libmkldnn.so.0.14.0" to "/usr/local/lib"

However, it looks like the wrong library path is appended to INTERFACE_LINK_LIBRARIES in the generated CMake Caffe2 targets file /usr/local/share/cmake/Caffe2/Caffe2Targets.cmake, which causes this linkage error:
https://gist.github.com/kleisauke/00df397ffea14b353fbdf125a01fe150#file-caffe2targets-cmake-L67
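
One way to confirm this kind of problem is to list the absolute library paths in the generated targets file that no longer exist on disk. This is only a sketch under the assumption that libraries appear as ;-separated entries inside quoted strings; the helper name `missing_link_libraries` is hypothetical.

```shell
# Hypothetical helper: prints every absolute shared-library path listed in a
# CMake targets file that does not exist on disk.
missing_link_libraries() {
  # Split the ;-separated, quoted library lists into one entry per line,
  # keep absolute .so/.dylib paths, and report the ones that are missing.
  tr ';"' '\n\n' < "$1" |
    grep -E '^/.*\.(so|dylib)[0-9.]*$' |
    sort -u |
    while IFS= read -r lib; do
      [ -e "$lib" ] || printf '%s\n' "$lib"
    done
}

# Example use, with the path taken from the comment above:
#   missing_link_libraries /usr/local/share/cmake/Caffe2/Caffe2Targets.cmake
```

A path that points into a deleted build directory would show up in the output, while entries that start with a CMake variable such as an import prefix are deliberately ignored.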

@gujinghui force-pushed the default_mkldnn branch 2 times, most recently from 8c22da5 to 7795490 on November 2, 2018 15:33
Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>
Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>
@gujinghui
Collaborator Author

Rebased onto the latest code base.

Finally, we passed all test cases. 🎉 🎉 🎉
@yinghai @soumith
Please help review.

@Jianhui-Li @jgong5

@kleisauke This PR should fix the issue in your code.
Please give it a try.

@kleisauke

@gujinghui I can confirm that this PR resolves the issue that I had, thanks!

@yinghai
Contributor

yinghai commented Nov 6, 2018

Thanks, @gujinghui! @orionr, do we have a CI setup to test this?

@orionr
Contributor

orionr commented Nov 6, 2018

@yinghai it doesn't look like we have OSS CI with MKLDNN enabled, and we likely want to test that. However, we do have internal Sandcastle testing of it.

If you want to add it, feel free to tweak .jenkins/pytorch/build.sh with something like

# Outside of conda environments, install MKL from pip and turn MKLDNN on.
if ! which conda; then
  pip install -q mkl mkl-devel
  export USE_MKLDNN=1
fi

@soumith
Member

soumith commented Nov 6, 2018

I wouldn't want MKLDNN tested in all CI configurations, because it's quite important to make sure the non-MKLDNN path keeps working. @orionr where in the config do we enable it for only some configurations, maybe 2 or 3?

@orionr
Contributor

orionr commented Nov 6, 2018

Same place, but you can also add a test similar to the [[ "$BUILD_ENVIRONMENT" == *trusty-py3.6-gcc5.4* ]] check below, which you'll see sets up a DEBUG build.

https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/build.sh#L96
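
Restricting MKLDNN to a few configurations could look roughly like the sketch below. It is not the change that was eventually pushed: the function name `mkldnn_enabled_for` is hypothetical, and of the two patterns only trusty-py3.6-gcc5.4 appears in this thread; the other is a placeholder.

```shell
# Hypothetical allow-list of CI configurations that build with MKLDNN.
# Only the first pattern comes from this thread; the second is a placeholder.
mkldnn_enabled_for() {
  case "$1" in
    *trusty-py3.6-gcc5.4*|*example-mkldnn-config*) return 0 ;;
    *) return 1 ;;
  esac
}

# Turn MKLDNN on only for the selected configurations, so that all other
# jobs keep exercising the non-MKLDNN path.
if mkldnn_enabled_for "${BUILD_ENVIRONMENT:-}"; then
  export USE_MKLDNN=1
fi
```

A `case` statement is used instead of `[[ ... ]]` so the sketch also runs under a plain POSIX shell; inside build.sh the `[[ ... ]]` form would work equally well.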

@gujinghui
Collaborator Author

Hi all,

@soumith @yinghai @orionr
Are any changes needed for this PR?

@orionr
Contributor

orionr commented Nov 8, 2018

@gujinghui let me push a change to the CI so MKLDNN gets tested.

@orionr
Contributor

orionr commented Nov 8, 2018

Changes pushed. Let's see how CI does.

@facebook-github-bot
Contributor

facebook-github-bot left a comment

@orionr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Nov 8, 2018
Summary:
build with mkl-dnn by default
Pull Request resolved: pytorch/pytorch#13303

Reviewed By: yinghai

Differential Revision: D12979633

Pulled By: orionr

fbshipit-source-id: 00d23fa27c0d13e82f7e5acb3ebd00ed7ba1d5dc
@avmgithub
Contributor

ppc64le does not support mkl-dnn. I suspect this is why the ppc64le build is now breaking.

-- USE_MKLDNN : ON <<<<<<<<<<<<<<<<<

c++: error: unrecognized command line option '-march=native'
third_party/ideep/mkl-dnn/src/CMakeFiles/mkldnn.dir/build.make:62: recipe for target 'third_party/ideep/mkl-dnn/src/CMakeFiles/mkldnn.dir/common/batch_normalization.cpp.o' failed
make[2]: *** [third_party/ideep/mkl-dnn/src/CMakeFiles/mkldnn.dir/common/batch_normalization.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
c++: error: unrecognized command line option '-march=native'
third_party/ideep/mkl-dnn/src/CMakeFiles/mkldnn.dir/build.make:75: recipe for target 'third_party/ideep/mkl-dnn/src/CMakeFiles/mkldnn.dir/common/convolution.cpp.o' failed
make[2]: *** [third_party/ideep/mkl-dnn/src/CMakeFiles/mkldnn.dir/common/convolution.cpp.o] Error 1
CMakeFiles/Makefile2:1342: recipe for target 'third_party/ideep/mkl-dnn/src/CMakeFiles/mkldnn.dir/all' failed
make[1]: *** [third_party/ideep/mkl-dnn/src/CMakeFiles/mkldnn.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....

Makefile:140: recipe for target 'all' failed
make: *** [all] Error 2
Failed to run 'bash ../tools/build_pytorch_libs.sh --use-cuda --use-fbgemm --use-nnpack --use-mkldnn --use-qnnpack caffe2'

@gujinghui
Collaborator Author

Please try disabling MKLDNN by setting NO_MKLDNN=1:
NO_MKLDNN=1 python setup.py build_deps
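
The same workaround can be applied automatically based on the machine architecture. This is a rough sketch under the assumption that only x86_64 hosts should build MKLDNN; the thread itself only confirms that ppc64le is unsupported, and the function name `mkldnn_supported_on` is hypothetical.

```shell
# Hypothetical helper: succeeds only for architectures assumed to support
# MKLDNN. Treating x86_64/amd64 as the only supported ones is an assumption.
mkldnn_supported_on() {
  case "$1" in
    x86_64|amd64) return 0 ;;
    *) return 1 ;;
  esac
}

# Disable MKLDNN on any other architecture (for example ppc64le) before
# running the build.
if ! mkldnn_supported_on "$(uname -m)"; then
  export NO_MKLDNN=1
fi
```

After sourcing this snippet, `python setup.py build_deps` picks up NO_MKLDNN from the environment in the same way as the manual command above.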

facebook-github-bot pushed a commit that referenced this pull request Nov 9, 2018
…13759)

Summary:
MKLDNN is not supported on ppc64le, so change USE_MKLDNN to OFF for ppc64le.
Pull Request resolved: #13759

Differential Revision: D12993121

Pulled By: soumith

fbshipit-source-id: 539d5cfcff2c03b59fa71e10b52fac333a64c381