Enable running TOSA reference_model e2e tests on executorch public CI #1335
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/1335
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (5 Unrelated Failures) As of commit b24c7ce with merge base 663882f:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk.
BROKEN TRUNK - The following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
ciflow/trunk
examples/arm/run.sh (Outdated)
    done

    # Run the TOSA Reference_Model e2e tests
    cd $et_root_dir
These are effectively different lowering and execution paths, and since they are in the same script with `-e`, if something fails before this point, this part won't run. Do you want to either selectively apply `-e` or split this into running different individual scripts, so we can get more complete signal when something fails?
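As a rough illustration of the first option, selectively scoping `-e` (a sketch only; run_tosa_e2e.sh and the combined-status handling are assumptions, not part of this PR):

```bash
#!/bin/bash
# Sketch: keep -e for shared setup, but let each flow report its own failure so
# CI still runs the TOSA reference_model tests when the FVP example flow breaks.
set -eu

# ... shared setup runs under -e and aborts early on failure ...

status=0
set +e  # from here on, collect failures instead of exiting immediately

# Example/FVP flow (invocation matches the trunk.yml job in this PR).
PYTHON_EXECUTABLE=python bash examples/arm/run.sh /opt/arm-sdk buck2 || status=1

# TOSA reference_model flow (run_tosa_e2e.sh is the hypothetical split-out script).
PYTHON_EXECUTABLE=python bash examples/arm/run_tosa_e2e.sh /opt/arm-sdk || status=1

exit "${status}"
```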
I agree we can split this into separate files (run.sh and run_tosa_e2e.sh), which will be much cleaner. If this sounds good, I will push the changes.
Another question: do I need to do anything else to trigger the running of the new shell script run_tosa_e2e.sh?
Hopefully we can add another backend under executorch/.github/workflows/trunk.yml (Line 140 in c44dadf):
    test-arm-backend-delegation:
Or, if preferred from the CI flow, we can add it as a script.
I think we should restructure this a bit; I view things as:
- examples/arm/ are typically user-run scripts and tutorials
- backends/arm/test/* are unit tests and test scripts
We should add `backends/arm/test/test.sh`, which tests both the example (as we don't want it to regress either) and the TOSA AoT flow; a rough sketch follows below.
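A minimal sketch of what such a `backends/arm/test/test.sh` could look like (the arm_tosa_reference.py entry point and its location are assumptions based on later comments in this thread; only examples/arm/run.sh and its arguments come from this PR):

```bash
#!/bin/bash
# Hypothetical backends/arm/test/test.sh: exercise both the example flow and the
# TOSA AoT/reference_model flow so neither regresses.
set -eu

# Resolve the ExecuTorch repo root from this script's location.
et_root_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")/../../.." && pwd)"
cd "${et_root_dir}"

# 1) Example flow (Arm backend -> Vela -> PTE -> binary -> FVP), as users run it today.
PYTHON_EXECUTABLE=python bash examples/arm/run.sh /opt/arm-sdk buck2

# 2) TOSA AoT flow: capture the TOSA flatbuffer and run it on the reference model.
python backends/arm/test/arm_tosa_reference.py
```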
Can we refactor this like @robell said about where these scripts should live, but also take a step back and be more intentional and clear about the objective of these tests, (A) vs. (B)?

Python --> Graph --> Arm Backend --> TOSA --> Vela --> PTE --> Binary --> FVP        (A)
                                       |
                                       +----> Binary --> TOSA_Ref_model              (B)

Things like script names, job names, which deps are installed, or which docker image (if separate), etc.: it would be better to call out shared things as arm- or tosa-specific, and backend-specific things as such, e.g. ethos or vela vs. tosa-reference-model. Names like "e2e test" or "delegate test" don't really help someone reading them without in-depth knowledge (almost everyone except a few of us, really) figure out what we are talking about.
How we actually run things on CI - @huydhn is our expert here - we should consult him before doing a major refactor around CI jobs, IMHO.
Thanks Jerry, the changes look sensible to me. Being nit-picky: having the same name for the invocation script and the .py would be cleaner? It might also be easier if we pattern-match it in automation for some reason later.
I don't think it's worth doing now, but we might want to rename slightly when we have more test flows, as we have test_tosa.py (unit tests which test edge-to-TOSA) and arm_tosa_reference.py, which captures the intermediate compiler output (the TOSA flatbuffer, usually a wire format in AoT) and runs it on the reference model. So, being consistent as we add more, we could look to something like the following for the AoT compiler flows (split by the backend options, i.e. which runtime target, that are passed):
unit_tosa
unit_vela
unit_<other/many>
test_tosa_reference
test_u55 (e.g. like the examples/arm/ flow, but for when we have substantial correctness testing of a u55/65 backend)
Now it's down to how the CI team would prefer to integrate this.
Yeah, let me ping Huy.
Also, I am OK with the change overall besides the name bikeshedding, which we should converge on as we develop more.
.github/workflows/trunk.yml (Outdated)
    source /opt/arm-sdk/setup_path.sh
    PYTHON_EXECUTABLE=python bash examples/arm/run.sh /opt/arm-sdk buck2

    test-arm-e2e-delegation:
@huydhn - if we have two different scripts (doing different things), is this the best approach, or should they run under a single job?
As this just needs a common linux.2xlarge runner, let's keep them separate for now. On the other hand, having too many tiny jobs that take less than 15 minutes to finish is a waste, so we might want to consolidate them into a bigger chunk. But that's an optimization step that could be done independently of this change, IMO.
Force-pushed from 1b58ae6 to a83941b
Force-pushed from 504bc48 to dd461ec
The CI change looks fine, but the job is still failing due to a minor issue with a wrong path here: https://github.com/pytorch/executorch/actions/runs/7212751191/job/19651117657?pr=1335
Yep, let me fix this quickly.
Force-pushed from 26f746a to f344f73
Let me know when ready and I can pull this in. @Jerry-Ge
@digantdesai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@digantdesai and @huydhn, it seems we're still missing something here: https://github.com/pytorch/executorch/actions/runs/7213201076/job/19652543330?pr=1335#step:11:471
I see that flatc is installed in /opt/conda/envs/py_3.10/bin/flatc (https://github.com/pytorch/executorch/actions/runs/7213201076/job/19652543330?pr=1335#step:11:259), so maybe the path of the input file is wrong.
Yes, that's right. I updated the path; let's wait for the new run to finish.
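For reference, a quick local sanity check along these lines (illustrative only; the input path below is a placeholder, not the actual file the job uses):

```bash
# Confirm which flatc the environment resolves and that the input path exists
# relative to the working directory the CI step runs from.
command -v flatc      # e.g. /opt/conda/envs/py_3.10/bin/flatc
flatc --version

input_file="path/to/expected/input.tosa"   # placeholder path
[[ -f "${input_file}" ]] || echo "missing: ${input_file} (cwd: $(pwd))"
```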
Force-pushed from a20ec45 to 0e53a76
@pytorchbot rebase -b main (Darn, this doesn't work on ET PRs yet)
Happy New Year! And sorry for missing your message last week. I'm updating the docker-build workflow a bit to correctly rebuild the Docker image used by the job. Let's see if this works now. Also, the base commit is rather old, so please help do a rebase to the latest ExecuTorch main commit. I realize that my
examples/arm/setup.sh (Outdated)
    cd "${root_dir}"
    if [[ ! -e reference_model ]]; then
-       git clone https://review.mlplatform.org/tosa/reference_model -b v0.80.0
+       git clone https://review.mlplatform.org/tosa/reference_model -b 0.90.0d
I see only v0.90a0 on https://review.mlplatform.org/plugins/gitiles/tosa/reference_model/, so the build is failing with "Remote branch 0.90.0d not found in upstream origin". Let me try to update the tag here to the correct value and rerun CI.
Suggested change:
-   git clone https://review.mlplatform.org/tosa/reference_model -b 0.90.0d
+   git clone https://review.mlplatform.org/tosa/reference_model -b v0.90a0
I tried it out with this v0.90a0 branch and ended up with a slightly different version-mismatch error than with git clone https://review.mlplatform.org/tosa/reference_model -b 0.90.0d:
Read flatbuffer version 0.90.0d is not compatible with serializer version 0.80.0d (instead of 0.80.0)
cc @guangy10 @Jerry-Ge Do you know where to find that version, as it's not listed on https://review.mlplatform.org/plugins/gitiles/tosa/reference_model?
Yea, reference_model:v0.90a0 is still using the 0.80.0d third-party serialization_lib, which is causing the issue above.
I changed to git clone https://review.mlplatform.org/tosa/reference_model -b main in the latest commit, which currently uses the 0.90.0d serialization_lib: https://review.mlplatform.org/plugins/gitiles/tosa/reference_model/+/refs/heads/main/thirdparty/serialization_lib
But it's still getting ERROR: Read flatbuffer version 0.90.0d is not compatible with serializer version 0.80.0d. It feels like the CI is still picking up the v0.80 branch?
Thanks for the help. This is a little bit convoluted now...
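One illustrative way to confirm which serialization_lib a given reference_model checkout actually pins (the submodule path comes from the link above; the grep is just a rough check):

```bash
# Clone the reference model at a given ref and inspect its pinned serialization_lib.
ref="main"   # or a release tag, once one exists for 0.90
git clone --recurse-submodules https://review.mlplatform.org/tosa/reference_model -b "${ref}"
cd reference_model

# The serializer is a third-party submodule; its pinned commit determines the
# "serializer version" reported in the flatbuffer mismatch error.
git submodule status thirdparty/serialization_lib
grep -rni "version" thirdparty/serialization_lib | head
```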
I suspect this is due to the TODO that I mention in .ci/docker/build.sh, so I'm pushing a comment change there to trigger the rebuild. Once the build is done (https://github.com/pytorch/executorch/actions/runs/7453481163), could you help do another rebase or push any minor change to update the PR one more time? That will pull the latest image from https://github.com/pytorch/executorch/actions/runs/7453481163.
You can see in the current run of the job https://github.com/pytorch/executorch/actions/runs/7453481711/job/20278944604 that it's looking for a new Docker image and the image isn't there yet because https://github.com/pytorch/executorch/actions/runs/7453481163 is still building it. I would need to figure out a better way to handle this case.
yes, for sure. I will try that once the new build is done. Thanks!
It finally looks correct now https://github.com/pytorch/executorch/actions/runs/7454335305/job/20281469352?pr=1335, but I need to hardcode the Docker image to the correct version from https://github.com/pytorch/executorch/actions/runs/7453481163. To unblock this change, let me import this and follow up on that Docker issue in a separate PR. Does this sound good to you?
Oh, I know a cleaner workaround. I could delete the upstream Docker image to ensure that the correct one is found and used, instead of hard-coding the version. No action needed on your end though. Please squash the commits as usual and I will reimport it. Thank you for your patience :)
no problem. thanks
@huydhn this finally runs! Thanks. Let me squash the commits, then we can merge.
- Renamed the arm_tosa_e2e tests to arm_tosa_reference
- Refactored the arm_tosa_reference tests under the backend/arm/test directory
Signed-off-by: Jerry Ge <jerry.ge@arm.com>
Squash and rebase done. Let me know if I need to do anything else. When the hard-coded Docker image commit was reverted, the error came back.
@huydhn has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
I got it sorted out by deleting the old image: https://github.com/pytorch/executorch/actions/runs/7454572246/job/20284947302?pr=1335. This looks good now. @digantdesai @guangy10 If this looks OK to you, please help stamp D52173335 to land this; I imported the PR, so my approval doesn't count :)
@huydhn sorry, I missed this party; are we planning to use a specific version as opposed to the main branch once the Docker issue is fixed?
Thanks Digant! Looks like it's a Vela problem for Conv2d and I need to find the right person to fix that. Yea, please create an issue and I will relay that to the right person.
TOSA 0.90 is about to be released. Once that's done, we can change everything to 0.90, and then things should be smooth.
That failure also happens on trunk at the moment (https://hud.pytorch.org/hud/pytorch/executorch/main/1?per_page=50&name_filter=arm) from what I see, so it might be unrelated to the change here and can be fixed separately. If that's the case, I think we can:
.github/workflows/docker-builds.yml paths filter:
    - .ci/docker/**
    - .github/workflows/docker-builds.yml
    - requirements-lintrunner.txt
    - examples/arm/**
@huydhn IIUC, the Docker image rebuild will be triggered if anything changes in this dir. Should we narrow the triggering condition down to detecting changes in examples/arm/setup.sh and examples/arm/ethos-u-setup/**, e.g. a reference_model version bump? @Jerry-Ge only these two places matter for setting up the test env, right?
You're right that a Docker rebuild will be triggered for all changes in the examples/arm folder. I could narrow it down to only these places if only those are needed. Although, maybe it's easier to just do a blanket include here.
Hmm, thinking a bit more, I think we should keep the rule as it is because it reflects the way the whole directory is copied into Docker: cp -r ../../examples/arm/ ./arm.
@huydhn Remind me whether we really need to copy the entire examples/arm folder. IIUC, those *.py files and the run.sh script are not needed by the docker image, and they may be updated more frequently than the actual dependencies for the docker image build. Maybe a better way is to collect all deps the docker image needs in a single dir, e.g. examples/arm/setup/**, and monitor the changes there for docker image rebuilds?
Yeah, putting them all together in a subdirectory like examples/arm/setup/** makes sense to me. As I'm reorganizing these scripts a bit to fix the Docker build part, let me include that change there too.
.ci/docker/build.sh
    # TODO(huydhn): Figure out a way to rebuild the Docker image automatically
    # with a new image hash when the content here is updated
    cp -r ../../examples/arm/ ./arm
@huydhn IIUC, it will re-run this step (copying the new setup.sh and its deps) when the docker image rebuild is triggered, in this case when Arm bumps the reference model version in setup.sh. So if we are tracking changes in setup.sh, we should get the automatic docker image rebuild as needed, right? Just curious if you meant something different.
Yeah, the rebuild is done alright, but there is a complication: the Docker image hash tag is computed from the commit hash of the .ci/docker directory, the standard location of everything docker-related. So what we have here is that the Docker image is rebuilt, but its hash tag remains the same because examples/arm/ is not under .ci/docker, so the job doesn't reupload the image, thinking that nothing has changed. That's the reason why @Jerry-Ge tried to update the branch but didn't see the new change take effect.
I could apply the same workaround that PyTorch is using here by removing the cp and creating a softlink.
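To make the failure mode concrete, a rough sketch of why the tag doesn't move (assuming the tag is derived from the tracked .ci/docker tree, as described in the calculate-docker-image action linked later in this thread; the commands are illustrative):

```bash
# The image tag is derived only from the tracked contents of .ci/docker.
DOCKER_TAG="$(git rev-parse HEAD:.ci/docker)"

# build.sh then copies examples/arm into the build context...
cp -r examples/arm .ci/docker/arm
# ...but that copy is untracked, so HEAD:.ci/docker (and therefore DOCKER_TAG)
# is unchanged and the rebuilt image is skipped as "already exists".
echo "tag unchanged: ${DOCKER_TAG}"
```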
Oh I see, the hash here is for the docker image, not the hash of the reference_model. @huydhn Maybe you could elaborate more on when the docker image hash tag is checked and compared? I understand that once those files are copied, the hash of the docker image won't change, but shouldn't the docker image build be triggered based on .github/workflows/docker-builds.yml, where we set it to monitor the examples/arm/ dir? In that case, why is it still necessary to check and compare the docker image hash?
The logic behind that is in this GitHub action: https://github.com/pytorch/test-infra/blob/main/.github/actions/calculate-docker-image/action.yml#L161-L163. You can see that the Docker image is not pushed again if the same hash tag already exists, and the hash tag comes from the content of the .ci/docker folder itself: https://github.com/pytorch/test-infra/blob/main/.github/actions/calculate-docker-image/action.yml#L70
I think this option should help: pytorch/test-infra#4866. It force-pushes the rebuilt Docker image even if the hash tag already exists.
examples/arm/setup.sh
    cd "${root_dir}"
    if [[ ! -e reference_model ]]; then
-       git clone https://review.mlplatform.org/tosa/reference_model -b v0.80.0
+       git clone https://review.mlplatform.org/tosa/reference_model -b main
I understand that the change here is a temporary workaround to unblock this PR, just because the new version 0.90.0d is not available. I think the right way for future updates will still be version-based (stable) instead of attempting to detect and build a new hash from main. So the process would be as simple as:
- Make a new version available in https://review.mlplatform.org/plugins/gitiles/tosa/reference_model/
- Create a PR to update examples/arm/setup.sh to pick up the new version (see the sketch below). In that PR, we should expect to see the docker image rebuilt using the updated version and `test-arm-reference-delegation` using the updated version.
cc: @huydhn @Jerry-Ge Let me know if there is a need to build against the main branch.
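A minimal sketch of what that setup.sh bump could look like once a stable tag exists (the tag name and the version variable are placeholders):

```bash
# Hypothetical pinned version; replace with the actual tag published at
# https://review.mlplatform.org/plugins/gitiles/tosa/reference_model/
tosa_reference_model_rev="v0.90.0"   # placeholder tag name

cd "${root_dir}"
if [[ ! -e reference_model ]]; then
    git clone --recurse-submodules \
        https://review.mlplatform.org/tosa/reference_model -b "${tosa_reference_model_rev}"
fi
```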
Agree, pinning the stable version is definitely the way to go on CI once the version is ready.
@huydhn Thanks for digging into the issue and the suggested fixes. This PR looks good to me. I just want to double-click on the expected process for future version bumps; I commented on it inline, so let me know if it makes sense.
@Jerry-Ge The follow-up items are added:
IIRC, we used to have this option, but nothing was using it. I have found a use case for this in pytorch/executorch#1335 (comment), where the ExecuTorch team copies the content of `examples/arm` into `.ci/docker` before starting the build (https://github.com/pytorch/executorch/blob/main/.ci/docker/build.sh#L40-L46):
* The docker image was rebuilt, but its hash tag remained the same because the copy step above is not version tracked.
* The new image was not pushed.
I have tried to move `examples/arm` into `.ci/docker` and set up softlinks, but the result looks weird (pytorch/executorch#1574 (comment)), so I'll probably drop that approach. The remaining option here is to add a `force-push` flag that can be used to, well, force push the image even if its hash tag already exists.
Summary: Instead of #1574, the other option to resolve the stale Docker image issue from #1335 (comment) is to use the new `force-push` flag added by pytorch/test-infra#4866. This ensures that a rebuilt image will always be pushed upstream even if its hash tag already exists.
### Testing
With the new GH action, the image is pushed correctly: https://github.com/pytorch/executorch/actions/runs/7483947498/job/20369991072?pr=1582#step:6:49039
Pull Request resolved: #1582
Reviewed By: digantdesai
Differential Revision: D52684112
Pulled By: huydhn
fbshipit-source-id: a7642c47be3681129802a4405e5cdcdc2b328560
Enable running TOSA ref_model e2e tests and some refactoring