
Conversation

@suo (Member) commented Feb 17, 2022

Stack from ghstack:

Today, we have two pieces that conspire to determine what workflows we run:

  • generate_ci_workflows.py, which takes a declarative description of what we want the workflow to do and uses jinja to generate a workflow yaml file
  • generate-test-matrix, which runs at CI time to dynamically generate test jobs.

This is bad:

  • Having one layer of code generation is unfortunate; having two is confusing.
  • You cannot tell from a workflow yaml file what test jobs will be run.
  • We have to do this careful dance of plumbing the args to generate-test-matrix through setting env vars and other such ugliness.
  • In cases where the build job fails and prevents generate-test-matrix from running, a ghost test job that doesn't actually exist noises up the HUD and our stats.
  • A bunch of useless generate-test-matrix jobs (8 on PRs) noise up our signal.

As far as I can tell, this complexity is unnecessary: we have all the information we need to generate the test matrix statically. There does not appear to be an advantage in retaining `generate-test-matrix`, so I am removing it to simplify the CI.

The only place where we were actually doing something dynamic is our Windows GPU workflow, where we would check at runtime whether the workflow was triggered from a PR or from master and behave accordingly. This is more simply done by having two separate workflows with different trigger conditions, which avoids the madness of parsing labels and forking behavior dynamically, a pattern that has been a source of confusion in the past.
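To make the static-generation point above concrete, here is a minimal sketch, assuming an invented declarative schema (this is not the actual `generate_ci_workflows.py` code; the class, field, and runner names below are illustrative only). Every test job is computed at codegen time, so the full matrix ends up written into the checked-in workflow YAML rather than being computed by a `generate-test-matrix` job in CI:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WorkflowConfig:
    """Hypothetical declarative description of one CI workflow."""
    build_environment: str
    test_runner_type: str
    num_test_shards: int = 1
    # Extra test configs and their shard counts, e.g. {"distributed": 1}.
    extra_test_configs: Dict[str, int] = field(default_factory=dict)

    def test_jobs(self) -> List[Dict[str, str]]:
        """Expand the declarative config into a concrete list of test jobs at codegen time."""
        jobs = []
        for shard in range(1, self.num_test_shards + 1):
            jobs.append({
                "id": f"test_default_{shard}_{self.num_test_shards}",
                "name": f"test (default, {shard}, {self.num_test_shards}, {self.test_runner_type})",
            })
        for config, num_shards in self.extra_test_configs.items():
            for shard in range(1, num_shards + 1):
                jobs.append({
                    "id": f"test_{config}_{shard}_{num_shards}",
                    "name": f"test ({config}, {shard}, {num_shards}, {self.test_runner_type})",
                })
        return jobs


# The generator would hand this list to the Jinja template, so every test job
# appears verbatim in the committed workflow YAML instead of being computed in CI.
cfg = WorkflowConfig(
    build_environment="linux-xenial-py3.7-gcc5.4",
    test_runner_type="linux.2xlarge",
    num_test_shards=2,
    extra_test_configs={"distributed": 1},
)
for job in cfg.test_jobs():
    print(job["id"], "->", job["name"])
```

Under this kind of scheme, the Windows PR-vs-master split described above would simply be two such configs with different trigger conditions, rendered into two separate workflow files.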

@pytorch-bot (bot) commented Feb 17, 2022

CI Flow Status


Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/83aa1c88821b7b7fed367c0d5cf2d4fe2696368b/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default
Add ciflow labels to this PR to trigger more builds:

**Triggered Workflows**

| Workflow | Labels (bold = enabled) | Status |
| --- | --- | --- |
| linux-binary-conda | ciflow/binaries, ciflow/binaries_conda, ciflow/default | ✅ triggered |
| linux-binary-libtorch-cxx11-abi | ciflow/binaries, ciflow/binaries_libtorch, ciflow/default | ✅ triggered |
| linux-binary-libtorch-pre-cxx11 | ciflow/binaries, ciflow/binaries_libtorch, ciflow/default | ✅ triggered |
| linux-binary-manywheel | ciflow/binaries, ciflow/binaries_wheel, ciflow/default | ✅ triggered |
| linux-bionic-py3.7-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk | ✅ triggered |
| linux-bionic-rocm4.5-py3.7 | ciflow/all, ciflow/default, ciflow/linux, ciflow/rocm, ciflow/trunk | ✅ triggered |
| linux-docs | ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-vulkan-bionic-py3.7-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan | ✅ triggered |
| linux-xenial-cuda11.3-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-cuda11.3-py3.7-gcc7-bazel-test | ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3-clang5-mobile-build | ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk | ✅ triggered |
| linux-xenial-py3-clang5-mobile-custom-build-static | ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-clang7-asan | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-clang7-onnx | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc7 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc7-no-ops | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| macos-arm64-binary-conda | ciflow/binaries, ciflow/binaries_conda, ciflow/default | ✅ triggered |
| macos-arm64-binary-wheel | ciflow/binaries, ciflow/binaries_wheel, ciflow/default | ✅ triggered |
| macos-binary-conda | ciflow/binaries, ciflow/binaries_conda, ciflow/default | ✅ triggered |
| macos-binary-libtorch-cxx11-abi | ciflow/binaries, ciflow/binaries_libtorch, ciflow/default | ✅ triggered |
| macos-binary-libtorch-pre-cxx11 | ciflow/binaries, ciflow/binaries_libtorch, ciflow/default | ✅ triggered |
| macos-binary-wheel | ciflow/binaries, ciflow/binaries_wheel, ciflow/default | ✅ triggered |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single | ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit | ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| win-vs2019-cpu-py3 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win | ✅ triggered |
| win-vs2019-cuda11.3-py3 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win | ✅ triggered |
| windows-binary-libtorch-cxx11-abi | ciflow/binaries, ciflow/binaries_libtorch, ciflow/default | ✅ triggered |
| windows-binary-libtorch-pre-cxx11 | ciflow/binaries, ciflow/binaries_libtorch, ciflow/default | ✅ triggered |
| windows-binary-wheel | ciflow/binaries, ciflow/binaries_wheel, ciflow/default | ✅ triggered |

**Skipped Workflows**

| Workflow | Labels (bold = enabled) | Status |
| --- | --- | --- |
| caffe2-linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| docker-builds | ciflow/all, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64 | ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled | 🚫 skipped |
| ios-12-5-1-arm64-coreml | ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled | 🚫 skipped |
| ios-12-5-1-arm64-custom-ops | ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled | 🚫 skipped |
| ios-12-5-1-arm64-metal | ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled | 🚫 skipped |
| ios-12-5-1-x86-64 | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-x86-64-coreml | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| libtorch-linux-xenial-cuda10.2-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk | 🚫 skipped |
| libtorch-linux-xenial-cuda11.3-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk | 🚫 skipped |
| linux-bionic-cuda10.2-py3.9-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk | 🚫 skipped |
| linux-docs-push | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| linux-xenial-cuda11.3-py3.7-gcc7-no-ops | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk | 🚫 skipped |
| macos-10-15-py3-arm64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| macos-10-15-py3-lite-interpreter-x86-64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| macos-11-py3-x86-64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| parallelnative-linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-bionic-cuda11.5-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck | 🚫 skipped |
| periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-win-vs2019-cuda11.1-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |
| periodic-win-vs2019-cuda11.5-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build | ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| pytorch-xla-linux-bionic-py3.7-clang8 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk, ciflow/xla | 🚫 skipped |

@facebook-github-bot (Contributor) commented Feb 17, 2022


💊 CI failures summary and remediations

As of commit 0bf0ef1 (more details on the Dr. CI page):


  • 8/8 failures introduced in this PR

🕵️ 7 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build linux-xenial-py3.7-clang7-asan / test (smoke_tests, 1, 1, linux.2xlarge) (1/7)

Step: "Unknown" (full log | diagnosis details | 🔁 rerun)

2022-02-17T19:36:10.9661946Z SUMMARY: Undefined.../jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in
2022-02-17T19:36:10.9161931Z     #10 0x556d4d84d801 in run_mod /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:1037
2022-02-17T19:36:10.9162928Z     #11 0x556d4d8587a9 in PyRun_StringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:961
2022-02-17T19:36:10.9163361Z     #12 0x556d4d85880b in PyRun_SimpleStringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:455
2022-02-17T19:36:10.9164893Z     #13 0x556d4d858908 in pymain_run_command /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:420
2022-02-17T19:36:10.9165457Z     #14 0x556d4d858908 in pymain_run_python /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:2907
2022-02-17T19:36:10.9165968Z     #15 0x556d4d858908 in pymain_main /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3460
2022-02-17T19:36:10.9166512Z     #16 0x556d4d858ccb in _Py_UnixMain /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3495
2022-02-17T19:36:10.9660757Z     #17 0x7ff79580183f in __libc_start_main /build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291
2022-02-17T19:36:10.9661113Z     #18 0x556d4d7fd554 in _start (/opt/conda/bin/python3.7+0x1d7554)
2022-02-17T19:36:10.9661271Z 
2022-02-17T19:36:10.9661946Z SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in 
2022-02-17T19:36:10.9847773Z + retcode=1
2022-02-17T19:36:10.9848077Z + set -e
2022-02-17T19:36:10.9848217Z + return 1
2022-02-17T19:36:10.9851394Z + [[ linux-xenial-py3.7-clang7-asan-smoke_tests == *-NO_AVX-* ]]
2022-02-17T19:36:10.9851684Z + [[ smoke_tests == \n\o\g\p\u\_\N\O\_\A\V\X ]]
2022-02-17T19:36:10.9852052Z + [[ linux-xenial-py3.7-clang7-asan-smoke_tests == *-NO_AVX2-* ]]
2022-02-17T19:36:10.9852318Z + [[ smoke_tests == \n\o\g\p\u\_\N\O\_\A\V\X\2 ]]
2022-02-17T19:36:10.9852639Z + [[ linux-xenial-py3.7-clang7-asan-smoke_tests == *-NO_AVX512-* ]]
2022-02-17T19:36:10.9852937Z + [[ smoke_tests == \n\o\g\p\u\_\N\O\_\A\V\X\5\1\2 ]]
2022-02-17T19:36:10.9856580Z + [[ linux-xenial-py3.7-clang7-asan-smoke_tests == *tbb* ]]

See GitHub Actions build pytorch-xla-linux-bionic-py3.7-clang8 / test (smoke_tests, 1, 1, linux.2xlarge) (2/7)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-02-17T20:31:43.4637553Z RuntimeError: test_quantization failed!
2022-02-17T20:31:42.7977315Z Generated XML report: test-reports/python-unittest/test_quantization/TEST-quantization.core.test_quantized_tensor.TestQuantizedTensor-20220217202233.xml
2022-02-17T20:31:42.7980281Z Generated XML report: test-reports/python-unittest/test_quantization/TEST-quantization.core.test_workflow_module.TestRecordHistogramObserver-20220217202233.xml
2022-02-17T20:31:42.7996561Z Generated XML report: test-reports/python-unittest/test_quantization/TEST-quantization.bc.test_backward_compatibility.TestSerialization-20220217202233.xml
2022-02-17T20:31:42.8015894Z Generated XML report: test-reports/python-unittest/test_quantization/TEST-quantization.core.test_quantized_module.TestStaticQuantizedModule-20220217202233.xml
2022-02-17T20:31:42.8029441Z Generated XML report: test-reports/python-unittest/test_quantization/TEST-quantization.fx.test_subgraph_rewriter.TestSubgraphRewriter-20220217202233.xml
2022-02-17T20:31:43.4633020Z Traceback (most recent call last):
2022-02-17T20:31:43.4633276Z   File "test/run_test.py", line 1098, in <module>
2022-02-17T20:31:43.4635023Z     main()
2022-02-17T20:31:43.4635262Z   File "test/run_test.py", line 1076, in main
2022-02-17T20:31:43.4637227Z     raise RuntimeError(err_message)
2022-02-17T20:31:43.4637553Z RuntimeError: test_quantization failed!
2022-02-17T20:31:43.7562560Z 
2022-02-17T20:31:43.7563056Z real	53m23.814s
2022-02-17T20:31:43.7563456Z user	123m44.067s
2022-02-17T20:31:43.7563740Z sys	10m55.447s
2022-02-17T20:31:43.7564203Z + cleanup
2022-02-17T20:31:43.7564361Z + retcode=1
2022-02-17T20:31:43.7564501Z + set +x
2022-02-17T20:31:43.7603571Z ##[error]Process completed with exit code 1.
2022-02-17T20:31:43.7633324Z ##[group]Run # Ensure the working directory gets chowned back to the current user
2022-02-17T20:31:43.7633685Z # Ensure the working directory gets chowned back to the current user

See GitHub Actions build linux-bionic-rocm4.5-py3.7 / test (smoke_tests, 1, 1, linux.rocm.gpu) (3/7)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-02-17T22:10:36.7916440Z FAIL [0.014s]: test_reduce_add_coalesced (__main__.TestCudaComm)
2022-02-17T22:10:36.7290520Z   test_scatter_cpu_dim (__main__.TestCudaComm) ... ok (0.008s)
2022-02-17T22:10:36.7372254Z   test_scatter_cpu_neg_dim (__main__.TestCudaComm) ... ok (0.008s)
2022-02-17T22:10:36.7468548Z   test_scatter_cpu_sizes (__main__.TestCudaComm) ... ok (0.010s)
2022-02-17T22:10:36.7557918Z   test_scatter_gpu (__main__.TestCudaComm) ... ok (0.009s)
2022-02-17T22:10:36.7645387Z   test_scatter_gpu_dim (__main__.TestCudaComm) ... ok (0.009s)
2022-02-17T22:10:36.7730714Z   test_scatter_gpu_neg_dim (__main__.TestCudaComm) ... ok (0.009s)
2022-02-17T22:10:36.7831548Z   test_scatter_gpu_sizes (__main__.TestCudaComm) ... ok (0.010s)
2022-02-17T22:10:36.7914629Z   test_scatter_namedtuple (__main__.TestCudaComm) ... ok (0.008s)
2022-02-17T22:10:36.7915273Z 
2022-02-17T22:10:36.7915628Z ======================================================================
2022-02-17T22:10:36.7916440Z FAIL [0.014s]: test_reduce_add_coalesced (__main__.TestCudaComm)
2022-02-17T22:10:36.7918470Z ----------------------------------------------------------------------
2022-02-17T22:10:36.7919780Z Traceback (most recent call last):
2022-02-17T22:10:36.7921321Z   File "test_cuda.py", line 3977, in test_reduce_add_coalesced
2022-02-17T22:10:36.7922701Z     self._test_reduce_add_coalesced(tensors, num_bytes * 5 // 2)
2022-02-17T22:10:36.7923664Z   File "test_cuda.py", line 3943, in _test_reduce_add_coalesced
2022-02-17T22:10:36.7924463Z     self.assertEqual(r, t * 2)
2022-02-17T22:10:36.7926821Z   File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 2145, in assertEqual
2022-02-17T22:10:36.7927981Z     msg=msg,
2022-02-17T22:10:36.7929636Z   File "/opt/conda/lib/python3.7/site-packages/torch/testing/_comparison.py", line 1064, in assert_equal
2022-02-17T22:10:36.7930725Z     raise error_metas[0].to_error()

See GitHub Actions build win-vs2019-cuda11.3-py3 / test (distributed, 1, 1, windows.8xlarge.nvidia.gpu) (4/7)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-02-18T00:14:35.0064661Z RuntimeError: test_optim failed!
2022-02-18T00:14:34.5019879Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\optim\adam.py", line 204, in adam
2022-02-18T00:14:34.5020560Z     func(params,
2022-02-18T00:14:34.5021354Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\optim\adam.py", line 307, in _multi_tensor_adam
2022-02-18T00:14:34.5022231Z     max_exp_avg_sq_sqrt = torch._foreach_sqrt(max_exp_avg_sqs)
2022-02-18T00:14:34.5022773Z KeyboardInterrupt
2022-02-18T00:14:35.0061625Z Traceback (most recent call last):
2022-02-18T00:14:35.0062570Z   File "run_test.py", line 1098, in <module>
2022-02-18T00:14:35.0063032Z     main()
2022-02-18T00:14:35.0063514Z   File "run_test.py", line 1076, in main
2022-02-18T00:14:35.0064107Z     raise RuntimeError(err_message)
2022-02-18T00:14:35.0064661Z RuntimeError: test_optim failed!
2022-02-18T00:14:35.3890024Z Terminate batch job (Y/N)? 
2022-02-18T00:14:35.3891908Z 
2022-02-18T00:14:35.3892511Z (base) C:\actions-runner\_work\pytorch\pytorch\test>popd
2022-02-18T00:14:35.3899436Z 
2022-02-18T00:14:35.3900213Z (base) C:\actions-runner\_work\pytorch\pytorch>if ERRORLEVEL 1 exit /b 1 
2022-02-18T00:14:35.3936052Z + cleanup
2022-02-18T00:14:35.3936825Z + retcode=1
2022-02-18T00:14:35.3937243Z + set +x
2022-02-18T00:14:35.4009369Z ##[error]The action has timed out.
2022-02-18T00:14:35.4499622Z ##[group]Run # -ir => recursive include all files in pattern

See GitHub Actions build periodic-win-vs2019-cuda11.1-py3 / test (distributed, 1, 1, windows.8xlarge.nvidia.gpu) (5/7)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-02-18T00:18:35.2435032Z RuntimeError: test_quantization failed!
2022-02-18T00:18:30.8621237Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\_tensor_str.py", line 221, in _tensor_str_with_formatter
2022-02-18T00:18:30.8622375Z     slices = ([_tensor_str_with_formatter(self[i], indent + 1, summarize, formatter1, formatter2)
2022-02-18T00:18:30.8623446Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\_tensor_str.py", line 221, in <listcomp>
2022-02-18T00:18:30.8624480Z     slices = ([_tensor_str_with_formatter(self[i], indent + 1, summarize, formatter1, formatter2)
2022-02-18T00:18:30.8625192Z KeyboardInterrupt
2022-02-18T00:18:35.2431946Z Traceback (most recent call last):
2022-02-18T00:18:35.2432895Z   File "run_test.py", line 1098, in <module>
2022-02-18T00:18:35.2433360Z     main()
2022-02-18T00:18:35.2433850Z   File "run_test.py", line 1076, in main
2022-02-18T00:18:35.2434429Z     raise RuntimeError(err_message)
2022-02-18T00:18:35.2435032Z RuntimeError: test_quantization failed!
2022-02-18T00:18:35.5937179Z Terminate batch job (Y/N)? 
2022-02-18T00:18:35.5939025Z 
2022-02-18T00:18:35.5939619Z (base) C:\actions-runner\_work\pytorch\pytorch\test>popd
2022-02-18T00:18:35.5946295Z 
2022-02-18T00:18:35.5946935Z (base) C:\actions-runner\_work\pytorch\pytorch>if ERRORLEVEL 1 exit /b 1 
2022-02-18T00:18:35.5980909Z + cleanup
2022-02-18T00:18:35.5981567Z + retcode=1
2022-02-18T00:18:35.5982101Z + set +x
2022-02-18T00:18:35.6047932Z ##[error]The action has timed out.
2022-02-18T00:18:35.6597596Z ##[group]Run # -ir => recursive include all files in pattern

See GitHub Actions build win-vs2019-cuda11.3-py3-smoke / test (distributed, 1, 1, windows.8xlarge.nvidia.gpu) (6/7)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-02-18T00:14:05.1404535Z RuntimeError: test_optim failed!
2022-02-18T00:14:04.5499352Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\autograd\__init__.py", line 166, in backward
2022-02-18T00:14:04.5500314Z     grad_tensors_ = _make_grads(tensors, grad_tensors_, is_grads_batched=False)
2022-02-18T00:14:04.5501301Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\autograd\__init__.py", line 30, in _make_grads
2022-02-18T00:14:04.5502350Z     def _make_grads(outputs: Sequence[torch.Tensor], grads: Sequence[_OptionalTensor],
2022-02-18T00:14:04.5503086Z KeyboardInterrupt
2022-02-18T00:14:05.1401471Z Traceback (most recent call last):
2022-02-18T00:14:05.1402415Z   File "run_test.py", line 1098, in <module>
2022-02-18T00:14:05.1402876Z     main()
2022-02-18T00:14:05.1403393Z   File "run_test.py", line 1076, in main
2022-02-18T00:14:05.1403981Z     raise RuntimeError(err_message)
2022-02-18T00:14:05.1404535Z RuntimeError: test_optim failed!
2022-02-18T00:14:05.5728515Z Terminate batch job (Y/N)? 
2022-02-18T00:14:05.5730623Z 
2022-02-18T00:14:05.5731670Z (base) C:\actions-runner\_work\pytorch\pytorch\test>popd
2022-02-18T00:14:05.5738291Z 
2022-02-18T00:14:05.5739119Z (base) C:\actions-runner\_work\pytorch\pytorch>if ERRORLEVEL 1 exit /b 1 
2022-02-18T00:14:05.5775589Z + cleanup
2022-02-18T00:14:05.5776299Z + retcode=1
2022-02-18T00:14:05.5776697Z + set +x
2022-02-18T00:14:05.5844764Z ##[error]The action has timed out.
2022-02-18T00:14:05.6426015Z ##[group]Run # -ir => recursive include all files in pattern

See GitHub Actions build periodic-win-vs2019-cuda11.5-py3 / test (distributed, 1, 1, windows.8xlarge.nvidia.gpu) (7/7)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-02-17T21:47:25.3864236Z test_add_done_ca...arg() takes 0 positional arguments but 1 was given
2022-02-17T21:47:25.3848087Z 
2022-02-17T21:47:25.3849037Z For more information about alternatives visit: ('https://numba.pydata.org/numba-doc/latest/cuda/overview.html', '#cudatoolkit-lookup')
2022-02-17T21:47:25.3850176Z   warnings.warn(errors.NumbaWarning(msg))
2022-02-17T21:47:25.3851028Z C:\Jenkins\Miniconda3\lib\site-packages\numba\cuda\envvars.py:17: NumbaWarning:
2022-02-17T21:47:25.3852390Z Environment variables with the 'NUMBAPRO' prefix are deprecated and consequently ignored, found use of NUMBAPRO_LIBDEVICE=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.5\nvvm\libdevice.
2022-02-17T21:47:25.3853322Z 
2022-02-17T21:47:25.3854264Z For more information about alternatives visit: ('https://numba.pydata.org/numba-doc/latest/cuda/overview.html', '#cudatoolkit-lookup')
2022-02-17T21:47:25.3855407Z   warnings.warn(errors.NumbaWarning(msg))
2022-02-17T21:47:25.3855907Z ok (0.739s)
2022-02-17T21:47:25.3856507Z   test_add_done_callback_maintains_callback_order (__main__.TestFuture) ... ok (0.000s)
2022-02-17T21:47:25.3864236Z   test_add_done_callback_no_arg_error_is_ignored (__main__.TestFuture) ... [E pybind_utils.h:201] Got the following error when running the callback: TypeError: no_arg() takes 0 positional arguments but 1 was given
2022-02-17T21:47:25.3865319Z ok (0.000s)
2022-02-17T21:47:25.3882491Z   test_add_done_callback_simple (__main__.TestFuture) ... ok (0.000s)
2022-02-17T21:47:25.3950646Z   test_chained_then (__main__.TestFuture) ... ok (0.000s)
2022-02-17T21:47:25.5100414Z   test_collect_all (__main__.TestFuture) ... ok (0.130s)
2022-02-17T21:47:25.5111358Z   test_done (__main__.TestFuture) ... ok (0.000s)
2022-02-17T21:47:25.5130055Z   test_done_exception (__main__.TestFuture) ... ok (0.000s)
2022-02-17T21:47:25.5155240Z   test_interleaving_then_and_add_done_callback_maintains_callback_order (__main__.TestFuture) ... ok (0.000s)
2022-02-17T21:47:25.5169854Z   test_interleaving_then_and_add_done_callback_propagates_error (__main__.TestFuture) ... [E pybind_utils.h:201] Got the following error when running the callback: ValueError: Expected error
2022-02-17T21:47:25.5170745Z 
2022-02-17T21:47:25.5171033Z At:

1 failure not recognized by patterns:

| Job | Step | Action |
| --- | --- | --- |
| GitHub Actions ios-12-5-1-arm64-custom-ops / build | Unknown | 🔁 rerun |

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@facebook-github-bot added the `module: rocm` (AMD GPU support for Pytorch) label Feb 17, 2022
suo added a commit that referenced this pull request Feb 17, 2022
ghstack-source-id: 749ff94
Pull Request resolved: #73001
@suo changed the title from "[wip/ci] delete generate-test-matrix" to "[ci] delete generate-test-matrix" Feb 17, 2022
suo added a commit that referenced this pull request Feb 17, 2022
ghstack-source-id: 9f387b3
Pull Request resolved: #73001
@suo requested a review from malfet February 17, 2022 18:14
test_jobs.append(
    {
        "id": f"test_default_{shard}_{config['num_shards']}",
        "name": f"test (default, {shard}, {self.num_test_shards}, {self.test_runner_type})",
Member:

It might also be beneficial to simplify the names here so they don't include the test_runner_type? I find that information might not be very useful to the majority of people, and we can derive it from the logs.

Member Author:

Yeah, at the moment I am just trying to replicate the same job name, to avoid churning metrics and the HUD. We can definitely change it if we want, though.

Contributor:

That's actually a long-term improvement I wanted to have in HUD: the ability to combine history over renames (for example, today we have old XLA job names and new XLA job names, and there is no continuation between them).

Member Author:

Yeah, one thing that might be interesting is that we are allowed to control the display name separately from the ID. So we could come up with some stable scheme for the ID and use that to identify things in HUD, rather than the display name.
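As a small, hypothetical illustration of that idea (the keys and values below are invented, not the PR's actual scheme), each generated matrix entry could carry a stable machine-readable id alongside a display name that is free to change:

```python
# Hypothetical matrix entry: "id" is the stable key a dashboard would track
# history by, while "name" is display-only and can be renamed freely.
job = {
    "id": "test_default_1_2",                       # config, shard index, shard count
    "name": "test (default, 1, 2, linux.2xlarge)",  # human-friendly, may change later
}

# HUD-style tooling could key its history on job["id"] and only render
# job["name"], so renaming the display string keeps historical continuity.
print(job["id"], "->", job["name"])
```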

@malfet (Contributor) commented Feb 17, 2022

Shellcheck failures are real

enable_xla_test: YamlShellBool = "''"
enable_noarch_test: YamlShellBool = "''"
enable_force_on_cpu_test: YamlShellBool = "''"
enable_default_test: bool = True
Contributor:

We no longer need explicit type annotation here, do we?

Member Author:

I have no idea lol, the rules for .github are different and stricter than for other folders. I can try to remove it.

Comment on lines +269 to +270
if self.enable_nogpu_no_avx_test:
    configs["nogpu_NO_AVX"] = {"num_shards": 1, "runner": NOGPU_RUNNER_TYPE}
Contributor:

Unrelated: isn't NO_AVX dead (since we only have AVX2 and AVX512 now)?

Member Author:

We still run it on linux-bionic-cuda10.2-py3.9-gcc7 at least, so I'm not sure.

suo added a commit that referenced this pull request Feb 17, 2022
ghstack-source-id: e21c0cb
Pull Request resolved: #73001
@suo (Member Author) commented Feb 17, 2022

@pytorchbot merge this

@suo linked an issue Feb 17, 2022 that may be closed by this pull request
facebook-github-bot pushed a commit that referenced this pull request Feb 17, 2022
Summary:
Pull Request resolved: #73001

Test Plan: automation

Reviewed By: malfet, seemethere

Differential Revision: D34315415

fbshipit-source-id: 164281a10b0692312e90edebdda174c5175cdfdd
suo added a commit that referenced this pull request Feb 18, 2022
These accidentally got turned on by #73001. Turn them off.

[ghstack-poisoned]
suo added a commit that referenced this pull request Feb 18, 2022
These accidentally got turned on by #73001. Turn them off.

ghstack-source-id: 8867002
Pull Request resolved: #73064
facebook-github-bot pushed a commit that referenced this pull request Feb 18, 2022
Summary:
Pull Request resolved: #73064

These accidentally got turned on by #73001. Turn them off.

Test Plan: Imported from OSS

Reviewed By: shannonzhu

Differential Revision: D34332530

Pulled By: suo

fbshipit-source-id: a6493b7d94465fa9141f1527648dbbec09c5706d
pytorchmergebot pushed a commit that referenced this pull request Feb 18, 2022
Summary:
Pull Request resolved: #73064

These accidentally got turned on by #73001. Turn them off.

Test Plan: Imported from OSS

Reviewed By: shannonzhu

Differential Revision: D34332530

Pulled By: suo

fbshipit-source-id: a6493b7d94465fa9141f1527648dbbec09c5706d
(cherry picked from commit b18c95e)
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Feb 21, 2022
Pull Request resolved: pytorch/pytorch#73001
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Feb 21, 2022
Summary:
Pull Request resolved: pytorch/pytorch#73064

These accidentally got turned on by pytorch/pytorch#73001. Turn them off.

Test Plan: Imported from OSS

Reviewed By: shannonzhu

Differential Revision: D34332530

Pulled By: suo

fbshipit-source-id: a6493b7d94465fa9141f1527648dbbec09c5706d
(cherry picked from commit b18c95e4a68e7d96e617edfb83a3e55780b49f4c)
@facebook-github-bot facebook-github-bot deleted the gh/suo/489/head branch February 21, 2022 15:17
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 3, 2022
Pull Request resolved: pytorch/pytorch#73001
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 3, 2022
Summary:
Pull Request resolved: pytorch/pytorch#73064

These accidentally got turned on by pytorch/pytorch#73001. Turn them off.

Test Plan: Imported from OSS

Reviewed By: shannonzhu

Differential Revision: D34332530

Pulled By: suo

fbshipit-source-id: a6493b7d94465fa9141f1527648dbbec09c5706d
(cherry picked from commit b18c95e4a68e7d96e617edfb83a3e55780b49f4c)
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 3, 2022
Pull Request resolved: pytorch/pytorch#73001
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 3, 2022
Summary:
Pull Request resolved: pytorch/pytorch#73064

These accidentally got turned on by pytorch/pytorch#73001. Turn them off.

Test Plan: Imported from OSS

Reviewed By: shannonzhu

Differential Revision: D34332530

Pulled By: suo

fbshipit-source-id: a6493b7d94465fa9141f1527648dbbec09c5706d
(cherry picked from commit b18c95e4a68e7d96e617edfb83a3e55780b49f4c)

Labels

cla signed, `module: rocm` (AMD GPU support for Pytorch), `topic: not user facing` (topic category)


Development

Successfully merging this pull request may close these issues.

ciflow/win no longer schedules full testsuite run on Windows

4 participants