
Conversation

xuzhao9
Contributor

@xuzhao9 xuzhao9 commented Apr 26, 2021

This PR adds a TorchBench (pytorch/benchmark) CI workflow to pytorch. It tests PRs whose body contains a line starting with "RUN_TORCHBENCH: " followed by a list of TorchBench model names. For example, this PR will create a TorchBench job that runs the pytorch_mobilenet_v3 and yolov3 models.

For security reasons, the workflow only runs on branches in pytorch/pytorch. It will not work on forked repositories.

The model names have to exactly match the names in pytorch/benchmark/torchbenchmark/models, separated by commas. Only the first line starting with "RUN_TORCHBENCH: " is respected. If nothing is specified after the magic word, no test will run.

Known issues:

  1. The workflow builds PyTorch from scratch and does not reuse build artifacts from other workflows, because the GHA migration is still in progress.
  2. Currently there is only one worker, so jobs are serialized. We will review the capacity issue after this is deployed.
  3. To rerun the test, the user has to push to the PR; simply updating the PR body won't work.
  4. Only the CUDA 10.2 + Python 3.7 environment is supported.

RUN_TORCHBENCH: yolov3, pytorch_mobilenet_v3
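The triggering rule above — take only the first line starting with "RUN_TORCHBENCH: ", split the rest on commas, and run nothing if the spec is empty — can be sketched in Python. This is an illustrative helper under the stated rules, not the actual script added by the PR:

```python
def extract_torchbench_models(pr_body: str) -> list:
    """Return model names from the first 'RUN_TORCHBENCH: ' line of a PR body.

    An empty list means no TorchBench test should run. Illustrative sketch;
    the real CI script in this PR may differ.
    """
    magic = "RUN_TORCHBENCH: "
    for line in pr_body.splitlines():
        if line.startswith(magic):
            # Only the first magic line is respected; later ones are ignored.
            spec = line[len(magic):]
            # Names are comma-separated; drop stray spaces and empty fragments.
            return [name.strip() for name in spec.split(",") if name.strip()]
    return []
```

For example, `extract_torchbench_models("Fix a bug\nRUN_TORCHBENCH: yolov3, pytorch_mobilenet_v3")` yields `["yolov3", "pytorch_mobilenet_v3"]`, while a body with no magic line, or a magic line with nothing after it, yields `[]`.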

@facebook-github-bot
Contributor

facebook-github-bot commented Apr 26, 2021

💊 CI failures summary and remediations

As of commit 08924bb (more details on the Dr. CI page):



🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (1/3)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/docker/ubuntu-cuda/Dockerfile
Auto-merging .circleci/docker/ubuntu-cuda/Dockerfile
CONFLICT (add/add): Merge conflict in .circleci/docker/common/install_conda.sh
Auto-merging .circleci/docker/common/install_conda.sh
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/windows_build_definitions.py
Auto-merging .circleci/cimodel/data/windows_build_definitions.py
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py
Auto-merging .circleci/cimodel/data/pytorch_build_data.py
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (2/3)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/docker/ubuntu-cuda/Dockerfile
Auto-merging .circleci/docker/ubuntu-cuda/Dockerfile
CONFLICT (add/add): Merge conflict in .circleci/docker/common/install_conda.sh
Auto-merging .circleci/docker/common/install_conda.sh
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/windows_build_definitions.py
Auto-merging .circleci/cimodel/data/windows_build_definitions.py
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py
Auto-merging .circleci/cimodel/data/pytorch_build_data.py
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build (3/3)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

FAILED: lib/libtorch_cpu.so
[5178/5410] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/jit/passes/onnx/eval_peephole.cpp.o
[5179/5410] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/jit/passes/onnx/pattern_conversion/pattern_encapsulation.cpp.o
[5180/5410] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/jit/passes/onnx/pattern_conversion/pattern_conversion.cpp.o
[5181/5410] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/jit/passes/onnx/unpack_quantized_weights.cpp.o
[5182/5410] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/utils/object_ptr.cpp.o
[5183/5410] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/utils/crash_handler.cpp.o
[5184/5410] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/jit/passes/onnx/remove_inplace_ops_for_onnx.cpp.o
[5185/5410] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/jit/passes/onnx/fold_if_node.cpp.o
[5186/5410] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/utils/cuda_lazy_init.cpp.o
[5187/5410] Linking CXX shared library lib/libtorch_cpu.so
FAILED: lib/libtorch_cpu.so 
: && /usr/bin/c++ -fPIC -Wno-deprecated-declarations -D_GLIBCXX_USE_CXX11_ABI=1 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O2 -g -DNDEBUG  -Wl,--no-as-needed -rdynamic -shared -Wl,-soname,libtorch_cpu.so -o lib/libtorch_cpu.so caffe2/quantization/server/CMakeFiles/caffe2_dnnlowp_avx2_ops.dir/elementwise_sum_dnnlowp_op_avx2.cc.o caffe2/quantization/server/CMakeFiles/caffe2_dnnlowp_avx2_ops.dir/fully_connected_fake_lowp_op_avx2.cc.o caffe2/quantization/server/CMakeFiles/caffe2_dnnlowp_avx2_ops.dir/group_norm_dnnlowp_op_avx2.cc.o caffe2/quantization/server/CMakeFiles/caffe2_dnnlowp_avx2_ops.dir/pool_dnnlowp_op_avx2.cc.o caffe2/quantization/server/CMakeFiles/caffe2_dnnlowp_avx2_ops.dir/relu_dnnlowp_op_avx2.cc.o caffe2/quantization/server/CMakeFiles/caffe2_dnnlowp_avx2_ops.dir/spatial_batch_norm_dnnlowp_op_avx2.cc.o caffe2/quantization/server/CMakeFiles/caffe2_dnnlowp_avx2_ops.dir/transpose.cc.o caffe2/quantization/server/CMakeFiles/caffe2_dnnlowp_avx2_ops.dir/norm_minimization_avx2.cc.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/BatchedFallback.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/BatchedTensorImpl.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/BatchingRegistrations.cpp.o 
caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/CPUGeneratorImpl.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Context.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/DLConvertor.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/DynamicLibrary.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/ExpandUtils.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/LegacyTHFunctionsCPU.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/MemoryOverlap.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/NamedTensorUtils.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/ParallelCommon.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/ParallelNative.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/ParallelNativeTBB.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/ParallelOpenMP.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/ParallelThreadPoolNative.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/ScalarOps.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/SequenceNumber.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/SparseCsrTensorImpl.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/SparseTensorImpl.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/SparseTensorUtils.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/TensorGeometry.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/TensorIndexing.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/TensorIterator.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/TensorMeta.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/TensorNames.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/TensorUtils.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/ThreadLocalState.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Utils.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Version.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/VmapMode.cpp.o 
caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/VmapModeRegistrations.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/VmapTransforms.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/autocast_mode.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/FlushDenormal.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/detail/CPUGuardImpl.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/detail/CUDAHooksInterface.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/detail/HIPHooksInterface.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/detail/MetaGuardImpl.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/record_function.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/core/ATenGeneral.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/core/BackendSelectFallbackKernel.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/core/DeprecatedTypeProperties.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/core/DeprecatedTypePropertiesRegistry.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/core/Dict.cpp.o caffe2/CMakeFiles/torch_cpu.dir/__/at
collect2: fatal error: ld terminated with signal 9 [Killed]
compilation terminated.
[5188/5410] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/fx/fx_init.cpp.o
[5189/5410] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/jit/python/python_arg_flatten.cpp.o
[5190/5410] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/multiprocessing/init.cpp.o
[5191/5410] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/jit/python/pybind_utils.cpp.o
[5192/5410] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/jit/python/python_interpreter.cpp.o
[5193/5410] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/utils/invalid_arguments.cpp.o
[5194/5410] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/tensor/python_tensor.cpp.o

1 failure not recognized by patterns:

Job: CircleCI pytorch_macos_10_13_py3_lite_interpreter_build_test
Step: Checkout code
Action: 🔁 rerun

🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your pull requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@xuzhao9 xuzhao9 changed the title from "Add CI workflow and script to test torchbench." to "[WIP] Add CI workflow and script to test torchbench." Apr 26, 2021
@xuzhao9 xuzhao9 force-pushed the xz9/add-torchbench-ci branch from ed22055 to cc4c4f1 Compare April 26, 2021 22:26
@@ -0,0 +1,48 @@
name: TorchBench CI (pytorch-linux-py3.7-cu102)
on:
pull_request:
Member

Based on the idea that we'll only have one runner for this, are we worried that there might be a runner bottleneck?

Contributor Author

The idea is to only run the job when people explicitly specify the magic line "RUN_TORCHBENCH:" as part of their PR body. Is there a way to quickly skip this workflow when the magic line is missing?

Contributor Author

@xuzhao9 xuzhao9 Apr 26, 2021

Changed the job condition so that it runs only when the PR body contains the keyword "RUN_TORCHBENCH:". Currently, we don't handle the capacity issue when too many PRs specify this keyword.
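A gate like the one described here can be expressed with a job-level condition in the workflow file. The following is a hedged sketch of the idea, not necessarily the exact expression used in this PR (the job name and runner label are illustrative):

```yaml
jobs:
  run-torchbench:
    # Hypothetical gate: skip the whole job unless the PR body
    # mentions the magic keyword.
    if: contains(github.event.pull_request.body, 'RUN_TORCHBENCH:')
    runs-on: [self-hosted]
```

GitHub Actions evaluates the `if:` expression before starting the job, so PRs without the keyword skip it almost instantly rather than occupying the runner.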

Member

I see, would it be applicable here to maybe only search for a specific label instead of predicating it on a magic string inside of the pull request body?

ci/torchbench for example?

Contributor Author

@xuzhao9 xuzhao9 Apr 27, 2021

That could also work, but we also require the user to specify a list of models to benchmark in the PR body, for example: RUN_TORCHBENCH: yolov3, pytorch_mobilenet_v3. With only a label, the user cannot specify the list of model names they want to run.

Do you suggest that the user should manually add both the label and the magic string in the PR body to trigger the test? Or should we still use the PR body magic word as the trigger, but automatically apply the ci/torchbench label?

Member

Oh I see, that makes sense

@xuzhao9 xuzhao9 requested review from janeyx99 and malfet April 27, 2021 00:27
@xuzhao9 xuzhao9 changed the title from "[WIP] Add CI workflow and script to test torchbench." to "Add CI workflow and script to test torchbench." Apr 27, 2021
@xuzhao9 xuzhao9 requested a review from wconstab April 28, 2021 16:25
@xuzhao9 xuzhao9 requested a review from seemethere April 28, 2021 22:37
@seemethere seemethere added the "module: ci" (Related to continuous integration) label Apr 28, 2021
@facebook-github-bot
Contributor

@xuzhao9 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@wconstab
Contributor

Do we want to encourage developers to run only specific models on their PRs?

I think for capacity reasons we could find out that we can't afford to run all the models for every one of our users, but it sounds like we might not know that yet without trying?

It is also nice to provide an override for the experienced developer who wants to run something specific.

But by default, for most people isn't it better to run the whole suite? I imagine for some users, they won't know which models they want to run, and in other cases they may miss important signal by thinking they only care about one model, and defeating some of the value of this infra.

@xuzhao9
Contributor Author

xuzhao9 commented Apr 29, 2021

Do we want to encourage developers to run only specific models on their PRs?

I think for capacity reasons we could find out that we can't afford to run all the models for every one of our users, but it sounds like we might not know that yet without trying?

It is also nice to provide an override for the experienced developer who wants to run something specific.

But by default, for most people isn't it better to run the whole suite? I imagine for some users, they won't know which models they want to run, and in other cases they may miss important signal by thinking they only care about one model, and defeating some of the value of this infra.

Thanks for the feedback! I think we should definitely add a feature where people can specify "RUN_TORCHBENCH: ALL" to run the entire suite.

Although ideally we would like to test the entire suite, we would also like to give developers fast feedback signals. Currently, because we can't reuse the build artifacts from other GHA workflows, we have to rebuild both the PR base and head commits from scratch, which is already very slow. The data shows that even testing only two models (yolov3 and pytorch_mobilenet_v3) takes about 1 hour to finish. Given that TorchBench master already has ~45 models and that we have only one runner, running the entire suite would be so slow that the signal becomes almost useless.

Also, as a new feature, I think it is better to "beta test" it with experts who understand what they want to test, get some feedback from them, and then make it more complete. We can still provide the regression detection feature with a nightly CI.
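The proposed "RUN_TORCHBENCH: ALL" extension could build on the same parsing idea: treat ALL as a request for the full suite, and otherwise keep only names that actually exist in the benchmark. A hypothetical sketch — the function and parameter names here are illustrative, not from the PR:

```python
def resolve_models(requested, all_models):
    """Expand the special 'ALL' keyword to the full suite; otherwise keep
    only the requested names that exist in the suite.

    `requested` is the parsed list from the PR body; `all_models` is the
    set of directory names under torchbenchmark/models. Hypothetical sketch.
    """
    if requested == ["ALL"]:
        # Sort for a deterministic run order across CI invocations.
        return sorted(all_models)
    # Silently dropping unknown names is one policy choice; failing loudly
    # on a typo would be another.
    return [m for m in requested if m in all_models]
```

Whether unknown model names should be dropped or should fail the job is a real design decision for such a feature; the sketch above takes the permissive option.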

@facebook-github-bot
Contributor

@xuzhao9 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@xuzhao9 merged this pull request in c72f01a.

@xuzhao9 xuzhao9 deleted the xz9/add-torchbench-ci branch April 29, 2021 19:39
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
Summary:
This PR adds a TorchBench (pytorch/benchmark) CI workflow to pytorch. It tests PRs whose body contains a line starting with "RUN_TORCHBENCH: " followed by a list of TorchBench model names. For example, this PR will create a TorchBench job that runs the pytorch_mobilenet_v3 and yolov3 models.

For security reasons, the workflow only runs on branches in pytorch/pytorch. It will not work on forked repositories.

The model names have to exactly match the names in pytorch/benchmark/torchbenchmark/models, separated by commas. Only the first line starting with "RUN_TORCHBENCH: " is respected. If nothing is specified after the magic word, no test will run.

Known issues:
1. The workflow builds PyTorch from scratch and does not reuse build artifacts from other workflows, because the GHA migration is still in progress.
2. Currently there is only one worker, so jobs are serialized. We will review the capacity issue after this is deployed.
3. To rerun the test, the user has to push to the PR; simply updating the PR body won't work.
4. Only the CUDA 10.2 + Python 3.7 environment is supported.

RUN_TORCHBENCH: yolov3, pytorch_mobilenet_v3

Pull Request resolved: pytorch#56957

Reviewed By: janeyx99

Differential Revision: D28079077

Pulled By: xuzhao9

fbshipit-source-id: e9ea73bdd9f35e650b653009060d477b22174bba

Labels

cla signed · Merged · module: ci (Related to continuous integration)
