
Conversation

Contributor

@Jerry-Ge Jerry-Ge commented Dec 1, 2023

Enable running TOSA ref_model e2e tests and some refactoring


pytorch-bot bot commented Dec 1, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/1335

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (5 Unrelated Failures)

As of commit b24c7ce with merge base 663882f:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 1, 2023
Contributor Author

Jerry-Ge commented Dec 1, 2023

ciflow/trunk

done

# Run the TOSA Reference_Model e2e tests
cd $et_root_dir
Contributor

These are effectively different lowering and execution paths, and since they are in the same script with -e, if something fails before this, then this won't run. Do you want to either selectively apply -e or split this into separate individual scripts, so we get a more complete signal when something fails?
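A minimal sketch of the kind of split being suggested (the script names, paths, and arguments below are illustrative, not the actual layout):

```bash
#!/bin/bash
# Hypothetical top-level driver: run each flow in its own script and collect
# exit codes, so a failure in one path does not prevent the TOSA reference
# model tests from running (which is what happens under a blanket `set -e`).
set -u
status=0

bash examples/arm/run.sh /opt/arm-sdk buck2 || status=1            # Ethos-U/FVP example flow
bash backends/arm/test/run_tosa_e2e.sh /opt/arm-sdk || status=1    # TOSA reference model flow

exit "${status}"
```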

Contributor Author

I agree we can split into separate files (run.sh and run_tosa_e2e.sh) which will be much cleaner. If this sounds good, I will push the changes.

Another question: do I need to do anything else to trigger the running of the new shell script run_tosa_e2e.sh?

Collaborator

Hopefully we can add another backend under

test-arm-backend-delegation:

Or if preferred from the CI flow we can add it as a script.

I think we should restructure this a bit; I view things as:
examples/arm/ are typically user run scripts and tutorials
backends/arm/test/* are unit tests and test scripts

We should add a `backends/arm/test/test.sh` which tests both the example (as we don't want it to regress either) and the TOSA AoT flow.
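A rough sketch of what such a `backends/arm/test/test.sh` could look like (entirely illustrative; the exact scripts and arguments were still under discussion at this point):

```bash
#!/bin/bash
# Hypothetical backends/arm/test/test.sh: exercise both the examples/arm
# tutorial flow (so it does not regress) and the TOSA AoT flow that runs the
# lowered output on the TOSA reference model.
set -eu

et_root_dir="$(git rev-parse --show-toplevel)"
cd "${et_root_dir}"
arm_sdk_dir="${1:-/opt/arm-sdk}"

# 1. Example/tutorial flow (Vela + FVP path).
bash examples/arm/run.sh "${arm_sdk_dir}" buck2

# 2. TOSA AoT flow: lower to TOSA and run on the reference model.
bash backends/arm/test/run_tosa_e2e.sh "${arm_sdk_dir}"
```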

Contributor

Can we refactor this along the lines of what @robell said about where these scripts should live, but also take a step back and be more intentional and clear about the objective of these tests for (A) vs. (B)?

Python --> Graph --> Arm Backend --> TOSA --> Vela --> PTE --> Binary --> FVP (A)
                                      |-----------> Binary --> TOSA_Ref_model (B)

For things like script names, job names, which deps are installed, or which Docker image (if separate), it would be better to call out shared things as Arm- or TOSA-specific, and backend-specific things as such, e.g. Ethos/Vela vs. the TOSA reference model.

Names like "e2e test" or "delegate test" don't really help someone reading them without in-depth knowledge (almost everyone except a few of us, really) figure out what we are talking about.

As for how we actually run things on CI - @huydhn is our expert here - we should consult him before doing a major refactor around CI jobs, IMHO.

Collaborator

Thanks Jerry, the changes look sensible to me. Being nit-picky, wouldn't having the same name for the invocation script and the .py be cleaner? It might also be easier if we pattern-match them in automation for some reason later.

I don't think it's worth doing now, but we might want to rename slightly when we have more test flows, as we have test_tosa.py (unit tests which test edge to TOSA) and arm_tosa_reference.py, which captures the intermediate compiler output (the TOSA flatbuffer, usually a wire format in AoT) and runs it on the reference model. So, being consistent as we add more, I'm looking at something like:

For the AoT compiler flows (by the backend options, i.e. which runtime target, that are passed):
unit_tosa
unit_vela
unit_<other/many>

test_tosa_reference
test_u55 (e.g. like the examples/arm/ flow but for when we have substantial correctness testing of a u55/65 backend).

Now it's down to how the CI team would prefer to integrate?

Contributor

Yeah let me ping Huy.

Also, I am OK with the change overall, besides the name bikeshedding, which we should converge on as we develop more.

@digantdesai digantdesai added partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module labels Dec 4, 2023
source /opt/arm-sdk/setup_path.sh
PYTHON_EXECUTABLE=python bash examples/arm/run.sh /opt/arm-sdk buck2
test-arm-e2e-delegation:
Contributor

@huydhn - if we have two different scripts (doing different things), is this the best approach, or should they run under a single job?

Contributor

As this just needs a common linux.2xlarge runner, let's keep them separate for now. On the other hand, having too many tiny jobs that take less than 15 minutes to finish is a waste, so we might want to consolidate them into a bigger chunk. But that's an optimization step that could be done independently of this change IMO.


@Jerry-Ge Jerry-Ge force-pushed the arm_ci branch 2 times, most recently from 1b58ae6 to a83941b Compare December 13, 2023 22:40
@Jerry-Ge Jerry-Ge force-pushed the arm_ci branch 2 times, most recently from 504bc48 to dd461ec Compare December 14, 2023 18:02
Contributor

huydhn commented Dec 14, 2023

The CI change looks fine, but the job is still failing due to a minor issue with the wrong path here https://github.com/pytorch/executorch/actions/runs/7212751191/job/19651117657?pr=1335

@Jerry-Ge
Contributor Author

> The CI change looks fine, but the job is still failing due to a minor issue with the wrong path here https://github.com/pytorch/executorch/actions/runs/7212751191/job/19651117657?pr=1335

Yep, let me fix this quickly.

@Jerry-Ge Jerry-Ge force-pushed the arm_ci branch 2 times, most recently from 26f746a to f344f73 Compare December 14, 2023 18:45
Contributor

@digantdesai digantdesai left a comment

Let me know when ready and I can pull this in. @Jerry-Ge

@facebook-github-bot
Contributor

@digantdesai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Contributor Author

Jerry-Ge commented Dec 14, 2023

> Let me know when ready and I can pull this in. @Jerry-Ge

@digantdesai and @huydhn, it seems we're still missing something here: https://github.com/pytorch/executorch/actions/runs/7213201076/job/19652543330?pr=1335#step:11:471

Contributor

huydhn commented Dec 14, 2023

I see that flatc is installed in /opt/conda/envs/py_3.10/bin/flatc https://github.com/pytorch/executorch/actions/runs/7213201076/job/19652543330?pr=1335#step:11:259, so maybe the path of the input file is wrong: `unable to load file: ./backends/arm/third-party/serialization_lib/schema/tosa.fbs`?

@Jerry-Ge
Contributor Author

> I see that flatc is installed in /opt/conda/envs/py_3.10/bin/flatc https://github.com/pytorch/executorch/actions/runs/7213201076/job/19652543330?pr=1335#step:11:259, so maybe the path of the input file is wrong: `unable to load file: ./backends/arm/third-party/serialization_lib/schema/tosa.fbs`?

Yes, that's right. I updated the path; let's wait for the new run to finish.
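For context, a hedged sketch of the kind of fix involved: resolve the schema path from the repo root rather than the current working directory (the variable names and the flatc invocation below are illustrative, not the actual test script):

```bash
# Resolve the TOSA flatbuffer schema relative to the ExecuTorch root so the
# script works regardless of which directory the CI job runs from.
et_root_dir="$(git rev-parse --show-toplevel)"
tosa_schema="${et_root_dir}/backends/arm/third-party/serialization_lib/schema/tosa.fbs"

if [[ ! -f "${tosa_schema}" ]]; then
    echo "TOSA schema not found: ${tosa_schema}" >&2
    exit 1
fi

# Example use: decode a serialized TOSA flatbuffer to JSON for inspection.
flatc --json --raw-binary -o "${output_dir}" "${tosa_schema}" -- "${tosa_fb_file}"
```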

@Jerry-Ge Jerry-Ge force-pushed the arm_ci branch 2 times, most recently from a20ec45 to 0e53a76 Compare December 14, 2023 19:47
Contributor

huydhn commented Jan 7, 2024

@pytorchbot rebase -b main

(Darn, this doesn't work on ET PR yet)

Contributor

huydhn commented Jan 7, 2024

> @huydhn happy new year! could you help take another look on this PR : )

Happy New Year! And sorry for missing your message last week. I'm updating the docker-build workflow a bit to correctly rebuild the Docker image used by the job. Let's see if this works now.

Also, the base commit is rather old, so please help do a rebase onto the latest ExecuTorch main commit. I realize that my @pytorchbot rebase -b main command doesn't work on ExecuTorch PRs yet and will create an issue to fix that.

cd "${root_dir}"
if [[ ! -e reference_model ]]; then
git clone https://review.mlplatform.org/tosa/reference_model -b v0.80.0
git clone https://review.mlplatform.org/tosa/reference_model -b 0.90.0d
Contributor

I see only v0.90a0 on https://review.mlplatform.org/plugins/gitiles/tosa/reference_model/, so the build is failing with `Remote branch 0.90.0d not found in upstream origin`. Let me try to update the tag here to the correct value and rerun CI.

Suggested change
git clone https://review.mlplatform.org/tosa/reference_model -b 0.90.0d
git clone https://review.mlplatform.org/tosa/reference_model -b v0.90a0

Contributor

@huydhn huydhn Jan 7, 2024

I tried it out with this v0.90a0 branch and ended up with a slightly different version mismatch error than for `git clone https://review.mlplatform.org/tosa/reference_model -b 0.90.0d`:

`Read flatbuffer version 0.90.0d is not compatible with serializer version 0.80.0d` (instead of 0.80.0)

cc @guangy10 @Jerry-Ge Do you know where to find that version as it's not listed on https://review.mlplatform.org/plugins/gitiles/tosa/reference_model?

Contributor Author

Yeah, reference_model:v0.90a0 is still using the 0.80.0d third-party serialization_lib, which is causing the issue above.

I changed it to `git clone https://review.mlplatform.org/tosa/reference_model -b main` in the latest commit; main is currently using the 0.90.0d serialization_lib: https://review.mlplatform.org/plugins/gitiles/tosa/reference_model/+/refs/heads/main/thirdparty/serialization_lib

But it's still failing with `ERROR: Read flatbuffer version 0.90.0d is not compatible with serializer version 0.80.0d`. Feels like CI is still picking up the v0.80 branch?

Thanks for the help. This is a little bit convoluted now...
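One way to rule out a stale checkout (a hedged sketch; it assumes serialization_lib is pulled in as a git submodule, which is what the gitiles link above suggests):

```bash
# Print which serialization_lib commit the cloned reference_model actually
# contains, to confirm CI is not reusing an old v0.80 checkout from a cached
# Docker layer.
cd "${root_dir}/reference_model"
git submodule update --init thirdparty/serialization_lib
git -C thirdparty/serialization_lib log -1 --oneline
```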

Contributor

@huydhn huydhn Jan 8, 2024

I suspect this is due to the TODO that I mentioned in .ci/docker/build.sh, so I'm pushing a comment change there to trigger the rebuild. Once the build is done https://github.com/pytorch/executorch/actions/runs/7453481163, could you do another rebase or make any minor change to update the PR one more time? That will pull the latest image from https://github.com/pytorch/executorch/actions/runs/7453481163.

You can see in the current run of the job https://github.com/pytorch/executorch/actions/runs/7453481711/job/20278944604 that it's looking for a new Docker image and the image isn't there yet because https://github.com/pytorch/executorch/actions/runs/7453481163 is still building it. I would need to figure out a better way to handle this case.
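For the "rebase or any minor change" step, even an empty commit is enough to retrigger CI once the rebuild has finished (illustrative commands; the remote name `upstream` for pytorch/executorch is an assumption):

```bash
# Rebase onto the latest main and push an update so the job re-resolves the
# freshly built Docker image tag.
git fetch upstream main
git rebase upstream/main
git commit --allow-empty -m "Retrigger CI to pick up rebuilt Docker image"
git push --force-with-lease origin arm_ci
```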

Contributor Author

Yes, for sure. I will try that once the new build is done. Thanks!

Contributor

It finally looks correct now https://github.com/pytorch/executorch/actions/runs/7454335305/job/20281469352?pr=1335, but I need to hardcode the Docker image to the correct version from https://github.com/pytorch/executorch/actions/runs/7453481163. To unblock this change, let me import this and follow up on that Docker issue in a separate PR. Does this sound good to you?

Contributor

@huydhn huydhn Jan 8, 2024

Oh, I know a cleaner workaround. I could delete the upstream Docker image to ensure that the correct one is found and used, instead of hardcoding the version. No action needed on your end though. Please squash the commits as usual and I will reimport it. Thank you for your patience :)
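One common way to do the squash before the re-import (sketch only; again assuming the pytorch/executorch remote is named `upstream`):

```bash
# Collapse the branch into a single signed-off commit on top of the latest main.
git fetch upstream main
git rebase upstream/main
git reset --soft "$(git merge-base HEAD upstream/main)"
git commit -s -m "Enable running TOSA ref_model e2e tests and some refactoring"
git push --force-with-lease origin arm_ci
```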

Contributor Author

No problem, thanks!

Contributor Author

Jerry-Ge commented Jan 8, 2024

@huydhn this finally runs! thanks. Let me squash the commits then we can merge.

- Renamed the arm_tosa_e2e tests to arm_tosa_reference
- Refactored the arm_tosa_reference tests under backend/arm/test
  directory

Signed-off-by: Jerry Ge <jerry.ge@arm.com>
Contributor Author

Jerry-Ge commented Jan 8, 2024

> @huydhn this finally runs! thanks. Let me squash the commits then we can merge.

Squash and rebase done. Let me know if I need to do anything else. When the hard-coded Docker image commit was reverted, the error came back.

@facebook-github-bot
Contributor

@huydhn has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Contributor

huydhn commented Jan 9, 2024

> Squash and rebase done. Let me know if I need to do anything else. When the hard-coded Docker image commit was reverted, the error came back.

I got it sorted out by deleting the old image: https://github.com/pytorch/executorch/actions/runs/7454572246/job/20284947302?pr=1335. This looks good now.

@digantdesai @guangy10 If this looks OK to you, please help stamp D52173335 to land this; I imported the PR, so my approval doesn't count :)

@digantdesai
Contributor

LGTM. Looking at the CI, @Jerry-Ge, #1336 broke the test-arm-backend-delegate job. Can you please fix it? I can create an issue for this if you want.

@digantdesai
Contributor

@huydhn sorry, I missed this party; are we planning to use a specific version as opposed to the main branch once the Docker issue is fixed?

Contributor Author

Jerry-Ge commented Jan 9, 2024

> LGTM. Looking at the CI, @Jerry-Ge, #1336 broke the test-arm-backend-delegate job. Can you please fix it? I can create an issue for this if you want.

Thanks Digant! Looks like it's a Vela problem for Conv2d, and I need to find the right person to fix that. Yeah, please create an issue and I will relay it to the right person.

Contributor Author

Jerry-Ge commented Jan 9, 2024

> @huydhn sorry, I missed this party; are we planning to use a specific version as opposed to the main branch once the Docker issue is fixed?

TOSA 0.90 is about to be released. Once that's done, we can change everything to 0.90 and things should be smooth.

Contributor

huydhn commented Jan 9, 2024

> LGTM. Looking at the CI, @Jerry-Ge, #1336 broke the test-arm-backend-delegate job. Can you please fix it? I can create an issue for this if you want.

That failure also happens in trunk atm https://hud.pytorch.org/hud/pytorch/executorch/main/1?per_page=50&name_filter=arm from what I see, so it might be unrelated to the change here and can be fixed separately. If that's the case, I think we can:

  1. Merge this
  2. Fix the Docker issue
  3. Pin to 0.90 branch once it's ready

- .ci/docker/**
- .github/workflows/docker-builds.yml
- requirements-lintrunner.txt
- examples/arm/**
Contributor

@huydhn IIUC, the Docker image rebuild will be triggered if anything changes in this dir. Should we narrow the triggering condition down to detecting changes in examples/arm/setup.sh and examples/arm/ethos-u-setup/**, e.g. a reference_model version bump? @Jerry-Ge, only these two places matter for setting up the test env, right?

Contributor

@huydhn huydhn Jan 9, 2024

You're right that a Docker rebuild will be triggered for any change in the examples/arm folder. I could narrow it down to only these places if they are the only ones needed. Although, maybe it's easier to just do a blanket include here.

Contributor

Hmm, thinking a bit more, I think we should keep the rule as it is, because it reflects the way the whole directory is copied into the Docker context: `cp -r ../../examples/arm/ ./arm`.

Contributor

@huydhn Remind me whether we really need to copy the entire examples/arm folder. IIUC, those *.py files and the run.sh script are not needed by the Docker image, and they may be updated more frequently than the actual dependencies for the Docker image build. Maybe a better way is to collect all deps the Docker image needs in a single dir, e.g. examples/arm/setup/**, and monitor the changes there for Docker image rebuilds?

Contributor

Yeah, putting them all together in a subdirectory like examples/arm/setup/** makes sense to me. As I'm reorganizing these scripts a bit to fix the Docker build part, let me include that change there too.

Comment on lines +44 to 46
# TODO(huydhn): Figure out a way to rebuild the Docker image automatically
# with a new image hash when the content here is updated
cp -r ../../examples/arm/ ./arm
Contributor

@huydhn IIUC, it will re-run this step (copying the new setup.sh and its deps) when the Docker image rebuild is triggered, in this case when Arm bumps the reference model version in setup.sh. So if we are tracking the changes in setup.sh, we should get the automatic Docker image rebuild as needed, right? Just curious if you mean something different.

Contributor

Yeah, the rebuild is done alright, but there is a complication: the Docker image hash tag is computed from the commit hash of the .ci/docker directory, the standard location of everything Docker-related. So what we have here is that the Docker image is rebuilt, but its hash tag remains the same because examples/arm/ is not under .ci/docker, so the job doesn't reupload the image, thinking that nothing has changed. That's the reason why @Jerry-Ge tried to update the branch but didn't see the new change take effect.

I could apply the same workaround that PyTorch is using here by removing cp and creating a softlink.
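The softlink idea would look roughly like this (a sketch of the approach that was tried; as noted further down the thread, the result looked weird and it was dropped in favour of a force-push flag):

```bash
# Move the Arm setup scripts under .ci/docker so they are covered by the
# directory the image tag is computed from, and leave a symlink behind so the
# examples/arm path keeps working for users and tutorials.
git mv examples/arm .ci/docker/arm
ln -s ../.ci/docker/arm examples/arm
git add examples/arm
# .ci/docker/build.sh could then use ./arm directly instead of copying it.
```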

Contributor

Oh I see, the hash here is for the Docker image, not the hash of the reference_model. @huydhn, maybe you could elaborate more on when the Docker image hash tag is checked and compared? I understand that once those files are copied, the hash of the Docker image won't change, but shouldn't the Docker image build be triggered based on .github/workflows/docker-builds.yml, where we set it to monitor the examples/arm/ dir? In that case, why is it still needed to check and compare the Docker image hash?

Contributor

The logic behind that is in this GitHub action: https://github.com/pytorch/test-infra/blob/main/.github/actions/calculate-docker-image/action.yml#L161-L163. You can see that the Docker image is not pushed again if the same hash tag already exists. And the hash tag comes from the content of the .ci/docker folder itself: https://github.com/pytorch/test-infra/blob/main/.github/actions/calculate-docker-image/action.yml#L70
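Conceptually, the tag is derived from the tree under .ci/docker, roughly like this (an illustration only, not the actual action code):

```bash
# Anything outside .ci/docker -- such as examples/arm -- does not change this
# value, which is why the rebuilt image kept the same tag and was not re-pushed.
docker_tag="$(git rev-parse HEAD:.ci/docker)"
echo "Docker image tag: ${docker_tag}"
```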

Contributor

I think this option should help: pytorch/test-infra#4866. It force-pushes the rebuilt Docker image even if the hash tag already exists.

cd "${root_dir}"
if [[ ! -e reference_model ]]; then
git clone https://review.mlplatform.org/tosa/reference_model -b v0.80.0
git clone https://review.mlplatform.org/tosa/reference_model -b main
Contributor

@guangy10 guangy10 Jan 9, 2024

I understand that the change here is a temporary workaround to unblock this PR, just because the new version 0.90.0d is not available. I think the right way to handle future updates is still version-based (stable) instead of attempting to detect and build a new hash from main. So the process would be as simple as:

  1. Make a new version available in https://review.mlplatform.org/plugins/gitiles/tosa/reference_model/
  2. Create a PR to update examples/arm/setup.sh to pick up the new version. In that PR, we should expect to see the Docker image rebuilt with the updated version and test-arm-reference-delegation using the updated version.

cc: @huydhn @Jerry-Ge Let me know if there is a need to build against the main branch.
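For reference, the version bump in step 2 would be a one-line change in the setup script along these lines (the tag name below is a placeholder until the 0.90 release actually exists):

```bash
# Pin a released reference_model tag instead of main so the Docker image
# rebuild and the CI job always agree on one TOSA version.
tosa_reference_model_rev="v0.90.0"   # placeholder; use the real tag once published
git clone https://review.mlplatform.org/tosa/reference_model -b "${tosa_reference_model_rev}"
```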

Contributor

Agree, pinning the stable version is definitely the way to go on CI once the version is ready.

Contributor

@guangy10 guangy10 left a comment

@huydhn Thanks for digging into the issue and suggesting fixes. This PR looks good to me. I just want to double-click on the expected process for future version bumps; I commented on it inline, so let me know if it makes sense.

@facebook-github-bot
Contributor

@huydhn merged this pull request in 8012187.

Contributor

guangy10 commented Jan 9, 2024

@Jerry-Ge The follow-up items are added:

huydhn added a commit to pytorch/test-infra that referenced this pull request Jan 23, 2024
IIRC, we used to have this option, but nothing was using it. I have
found a use case for this in
pytorch/executorch#1335 (comment)
where the ExecuTorch team copies the content of `examples/arm` into
`.ci/docker` before starting the build
https://github.com/pytorch/executorch/blob/main/.ci/docker/build.sh#L40-L46:

* The docker image was rebuilt, but its hash tag remained the same
because the copy step above is not version tracked.
* The new image was not pushed.

I have tried to move `examples/arm` into `.ci/docker` and set up
softlinks, but the result looks weird
pytorch/executorch#1574 (comment),
so I'll probably drop that approach.

The remaining option here is to add a `force-push` flag that can be used
to, well, force push the image even if its hash tag already exists.
facebook-github-bot pushed a commit that referenced this pull request Jan 23, 2024
Summary:
Instead of #1574, the other option to resolve the stale Docker image issue from #1335 (comment) is to use the new `force-push` flag added by pytorch/test-infra#4866.  This ensures that a rebuilt image will always be pushed upstream even if its hash tag already exists.

### Testing

With the new GH action, the image is pushed correctly https://github.com/pytorch/executorch/actions/runs/7483947498/job/20369991072?pr=1582#step:6:49039

Pull Request resolved: #1582

Reviewed By: digantdesai

Differential Revision: D52684112

Pulled By: huydhn

fbshipit-source-id: a7642c47be3681129802a4405e5cdcdc2b328560

Labels

ciflow/trunk CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module


6 participants