Add CI workflow and script to test torchbench. #56957
Conversation
💊 CI failures summary and remediations
As of commit 08924bb (more details on the Dr. CI page):

🕵️ 3 new failures recognized by patterns
The following CI failures do not appear to be due to upstream breakages:
| Job | Step | Action |
|---|---|---|
|  | Checkout code | 🔁 rerun |
🚧 1 fixed upstream failure:
These were probably caused by upstream breakages that were already fixed.
Please rebase on the viable/strict branch. If your commit is older than viable/strict, run these commands:

```
git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD
```
- pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_build from Apr 26 until Apr 27 (a90a3ac - 0d777a8)
This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the (internal) Dr. CI Users group.
Force-pushed from ed22055 to cc4c4f1.
@@ -0,0 +1,48 @@
name: TorchBench CI (pytorch-linux-py3.7-cu102)
on:
  pull_request:
Based on the idea that we'll only have one runner for this, are we worried that there might be a runner bottleneck?
The idea is to only run the job when people explicitly specify the magic line "RUN_TORCHBENCH:" as part of their PR body. Is there a way to quickly skip this workflow when the magic line is missing?
Changed the job condition so that it runs only when the PR body contains the keyword "RUN_TORCHBENCH:". Currently, we don't handle the capacity issue that arises when too many PRs specify this keyword.
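For reference, that kind of condition can be written at the job level in GitHub Actions. A minimal sketch, assuming a self-hosted runner (the job and step names here are illustrative, not the actual workflow):

```yaml
jobs:
  run-torchbench:
    # Skip the whole job unless the PR body contains the magic keyword.
    if: contains(github.event.pull_request.body, 'RUN_TORCHBENCH:')
    runs-on: self-hosted
    steps:
      - name: Checkout
        uses: actions/checkout@v2
```

With a job-level `if`, the job is skipped before any steps execute, so PRs without the keyword cost essentially nothing on the runner.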
I see. Would it be applicable here to maybe only search for a specific label instead of predicating it on a magic string inside of the pull request body? `ci/torchbench`, for example?
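For comparison, a label-based trigger would look roughly like this, assuming the `ci/torchbench` label proposed above (a sketch, not tested against this workflow):

```yaml
jobs:
  run-torchbench:
    # Run only when the PR carries the ci/torchbench label.
    if: contains(github.event.pull_request.labels.*.name, 'ci/torchbench')
    runs-on: self-hosted
    steps:
      - run: echo "label-triggered run"
```

Note that for adding a label to start a run, the workflow's `on: pull_request` trigger would also need `types: [labeled]` in addition to the default activity types.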
That could also work, but we also require the user to specify a list of models to benchmark in the PR body, for example: RUN_TORCHBENCH: yolov3, pytorch_mobilenet_v3. If we only use a label, users cannot specify the list of model names they want to run.

Do you suggest that users should manually add both the label and the magic string in the PR body to trigger the test? Or should we still use the PR-body magic word as the trigger but automatically apply the `ci/torchbench` label?
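To illustrate why the body line matters, the model list could be extracted in a step like the following (a hypothetical sketch; the step name and MODELS variable are made up, not part of this PR):

```yaml
- name: Extract model list
  env:
    PR_BODY: ${{ github.event.pull_request.body }}
  run: |
    # Keep only the first line starting with the magic prefix,
    # then strip the prefix to get the comma-separated model names.
    MODELS=$(echo "$PR_BODY" | grep -m1 '^RUN_TORCHBENCH:' | sed 's/^RUN_TORCHBENCH:[[:space:]]*//')
    echo "Models to benchmark: $MODELS"
```

A bare label carries no arguments, which is why the magic string doubles as both the trigger and the parameter list.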
Oh, I see. That makes sense.
@xuzhao9 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Do we want to encourage developers to run only specific models on their PRs? I think for capacity reasons we could find out that we can't afford to run all the models for every one of our users, but it sounds like we might not know that yet without trying. It is also nice to provide an override for the experienced developer who wants to run something specific. But by default, isn't it better for most people to run the whole suite? I imagine some users won't know which models they want to run, and in other cases they may miss important signal by assuming they only care about one model, defeating some of the value of this infra.
Thanks for the feedback! I think we should definitely add a feature so that people can specify "RUN_TORCHBENCH: ALL" to run the entire suite. Although ideally we would like to test the entire suite, we would also like to give developers fast feedback signals. Currently, because we can't reuse the build artifacts from other GHA workflows, we have to rebuild the entire PR base and head commits, which is already very slow. The data shows that even testing only two models (yolov3 and pytorch_mobilenet_v3) takes about 1 hour to finish. Given that TorchBench master already has ~45 models and that we have only one runner, running the entire suite would be so slow that the signal would become almost useless. Also, as a new feature, I think it is better to "beta test" it with experts who understand what they want to test, get some feedback from them, and then make it more complete. We can still provide regression detection via the nightly CI.
Summary: This PR adds a TorchBench (pytorch/benchmark) CI workflow to pytorch. It tests PRs whose body contains a line starting with "RUN_TORCHBENCH:" followed by a list of torchbench model names. For example, this PR will create a TorchBench job that runs the pytorch_mobilenet_v3 and yolov3 models. For security reasons, it only runs on branches of pytorch/pytorch; it will not work on forked repositories. The model names have to match the exact names in pytorch/benchmark/torchbenchmark/models, separated by commas. Only the first line starting with "RUN_TORCHBENCH:" is respected. If nothing is specified after the magic word, no test will run.

Known issues:
1. The workflow builds PyTorch from scratch and does not reuse build artifacts from other workflows, because the GHA migration is still in progress.
2. Currently there is only one worker, so jobs are serialized. We will review the capacity issue after this is deployed.
3. If the user would like to rerun the test, she has to push to the PR. Simply updating the PR body won't work.
4. Only the CUDA 10.2 + Python 3.7 environment is supported.

RUN_TORCHBENCH: yolov3, pytorch_mobilenet_v3

Pull Request resolved: pytorch#56957
Reviewed By: janeyx99
Differential Revision: D28079077
Pulled By: xuzhao9
fbshipit-source-id: e9ea73bdd9f35e650b653009060d477b22174bba
This PR adds a TorchBench (pytorch/benchmark) CI workflow to pytorch. It tests PRs whose body contains a line starting with "RUN_TORCHBENCH:" followed by a list of torchbench model names. For example, this PR will create a TorchBench job that runs the pytorch_mobilenet_v3 and yolov3 models.
For security reasons, it only runs on branches of pytorch/pytorch; it will not work on forked repositories.
The model names have to match the exact names in pytorch/benchmark/torchbenchmark/models, separated by commas. Only the first line starting with "RUN_TORCHBENCH:" is respected. If nothing is specified after the magic word, no test will run.
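As a sketch of what that exact-name matching implies, each requested model could be checked against the directory layout (hypothetical shell, assuming pytorch/benchmark is checked out at ./benchmark and MODELS holds the comma-separated list):

```yaml
- name: Validate model names
  run: |
    # Every requested model must match a directory under torchbenchmark/models.
    for model in $(echo "$MODELS" | tr ',' ' '); do
      if [ ! -d "benchmark/torchbenchmark/models/$model" ]; then
        echo "Unknown model: $model"
        exit 1
      fi
    done
```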
Known issues:
1. The workflow builds PyTorch from scratch and does not reuse build artifacts from other workflows, because the GHA migration is still in progress.
2. Currently there is only one worker, so jobs are serialized. We will review the capacity issue after this is deployed.
3. If the user would like to rerun the test, she has to push to the PR. Simply updating the PR body won't work.
4. Only the CUDA 10.2 + Python 3.7 environment is supported.
RUN_TORCHBENCH: yolov3, pytorch_mobilenet_v3