Add golden artifact generation to nightly backend test suite #17663
Conversation
After successful model correctness verification (torch.allclose against eager), dump the input tensors, eager reference output, and serialized .pte as golden files. These artifacts are packaged into per-model zips and a combined golden_artifacts_yymmddhh.zip, then uploaded to S3 via the existing test-infra artifact pipeline. Controlled by the GOLDEN_ARTIFACTS_DIR environment variable — when unset, behavior is unchanged. The test_backend.sh script sets this automatically. Disclosure: PR authored with assistance from Claude.
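For illustration, the env-var gate described above could be resolved roughly like this (a sketch, not the PR's actual `conftest.py`; only the `GOLDEN_ARTIFACTS_DIR` variable name comes from this PR — the helper name is hypothetical):

```python
import os
from typing import Optional


def resolve_golden_artifacts_dir() -> Optional[str]:
    """Return the golden-artifact output directory, or None to disable dumping.

    When GOLDEN_ARTIFACTS_DIR is unset (or empty), golden artifact
    generation is skipped entirely and test behavior is unchanged.
    """
    artifact_dir = os.environ.get("GOLDEN_ARTIFACTS_DIR")
    if not artifact_dir:
        return None
    # Create the directory eagerly so filesystem problems surface early.
    os.makedirs(artifact_dir, exist_ok=True)
    return artifact_dir
```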
Temporarily widen the trigger to include changes to test_backend.sh, backends/test/harness/, and backends/test/suite/ so the golden artifacts pipeline runs on this PR. Disclosure: PR authored with assistance from Claude.
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17663
Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
The reusable workflow caller (test-backend-xnnpack.yml) does not grant id-token: write, so nested jobs cannot request it. Remove the permissions block and upload-artifact-to-s3 for now — golden artifacts will be available as regular GH artifacts. S3 upload can be re-enabled once the caller workflows grant the necessary permissions. Disclosure: PR authored with assistance from Claude.
Add permissions (id-token: write, contents: read) to the caller workflow test-backend-xnnpack.yml so the nested package-golden-artifacts job can use OIDC for S3 uploads. Restore permissions and upload-artifact-to-s3 on the nested job. Disclosure: PR authored with assistance from Claude.
Replace linux_job_v2 with a plain runs-on: ubuntu-22.04 job that uses actions/download-artifact and actions/upload-artifact directly. This avoids the GITHUB_TOKEN unavailability issue inside the linux_job_v2 docker container and removes the need for id-token:write permissions. Disclosure: PR authored with assistance from Claude.
Switch package-golden-artifacts runner to linux.2xlarge (self-hosted with S3 access) and add seemethere/upload-artifact-s3 step to persist golden artifacts to the gha-artifacts S3 bucket with 90-day retention.
Pull request overview
Adds optional “golden artifact” generation to the backend test suite so nightly runs can capture inputs, eager reference outputs, and serialized .pte files for downstream validation/debugging and upload them via the existing artifact pipeline.
Changes:
- Plumb `artifact_dir`/`artifact_name` through the test runner into the harness to dump input/output `.bin` files after successful correctness checks.
- Dump the serialized `.pte` as a golden artifact after successful comparisons.
- Add CI packaging/upload steps to combine golden artifacts into a timestamped zip and upload to GitHub Actions artifacts + S3.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `backends/test/suite/runner.py` | Adds artifact naming and `.pte` dumping after successful runs. |
| `backends/test/suite/conftest.py` | Resolves golden artifact output directory from `GOLDEN_ARTIFACTS_DIR`. |
| `backends/test/harness/tester.py` | Dumps golden input/output tensors as raw `.bin` files. |
| `.github/workflows/test-backend-xnnpack.yml` | Expands PR path triggers to include harness/suite/script changes. |
| `.github/workflows/_test_backend.yml` | Adds a job to package and upload combined golden artifacts. |
| `.ci/scripts/test_backend.sh` | Sets `GOLDEN_ARTIFACTS_DIR` and zips per-model golden artifacts. |
```bash
GOLDEN_DIR="${ARTIFACT_DIR}/golden-artifacts"
export GOLDEN_ARTIFACTS_DIR="${GOLDEN_DIR}"
```
GOLDEN_ARTIFACTS_DIR is exported unconditionally, so the operators suite will also generate golden inputs/outputs and .pte files even though the packaging job only collects *-models artifacts. This will increase artifact size and I/O for operators runs; consider only setting this env var (or only zipping) when SUITE=models (or when a separate opt-in flag is set).
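One way to scope generation to the models suite, as suggested (a sketch; `SUITE` and `GOLDEN_DIR` follow the variable names visible in this PR, and the default values here are purely illustrative):

```shell
# Hypothetical defaults for illustration; in CI these come from test_backend.sh.
SUITE="${SUITE:-models}"
GOLDEN_DIR="${GOLDEN_DIR:-/tmp/artifacts/golden-artifacts}"

# Only export the opt-in variable for the models suite, so the
# operators suite does not pay the extra artifact generation I/O.
if [[ "${SUITE}" == "models" ]]; then
  export GOLDEN_ARTIFACTS_DIR="${GOLDEN_DIR}"
fi
```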
```bash
# Group files by model name prefix and zip each model's artifacts.
for pte in *.pte; do
  [[ -f "$pte" ]] || continue
  model_name="${pte%.pte}"
  zip -j "${GOLDEN_DIR}/${model_name}_golden.zip" \
    "${model_name}.pte" \
    ${model_name}_input*.bin \
    ${model_name}_expected_output*.bin \
    2>/dev/null || true
```
Per-model zips are written to ${GOLDEN_DIR}/${model_name}_golden.zip (outside the per-flow directory). In the workflow matrix, multiple flows can produce the same model_name, which will silently overwrite zips from earlier flows. Include $FLOW in the zip filename or keep the per-model zips under ${GOLDEN_DIR}/${FLOW}/ to avoid collisions.
```bash
  -exec cp {} golden_combined/ \;

if ls golden_combined/*.pte 1>/dev/null 2>&1; then
  (cd golden_combined && zip -r "../golden_artifacts_${TIMESTAMP}.zip" .)
  echo "Created golden_artifacts_${TIMESTAMP}.zip with $(ls golden_combined/*.pte | wc -l) models."
```
The packaging step flattens all .pte/.bin files from downloaded/ into a single golden_combined/ directory via cp. Since artifacts are produced per-flow, identical filenames across flows (same model/test name) will overwrite each other and the combined zip will silently drop files. Preserve directory structure (e.g. copy with --parents or zip from the original tree) or prefix filenames with flow/suite to keep them unique.
Suggested change:
```diff
-  -exec cp {} golden_combined/ \;
-if ls golden_combined/*.pte 1>/dev/null 2>&1; then
-  (cd golden_combined && zip -r "../golden_artifacts_${TIMESTAMP}.zip" .)
-  echo "Created golden_artifacts_${TIMESTAMP}.zip with $(ls golden_combined/*.pte | wc -l) models."
+  -exec cp --parents {} golden_combined/ \;
+if find golden_combined -name '*.pte' -print -quit | grep -q .; then
+  (cd golden_combined && zip -r "../golden_artifacts_${TIMESTAMP}.zip" .)
+  echo "Created golden_artifacts_${TIMESTAMP}.zip with $(find golden_combined -name '*.pte' | wc -l) models."
```
```python
self._dump_golden_artifacts(
    artifact_dir, artifact_name, inputs_to_run, reference_output
)
```
Golden artifact dumping can raise and fail the test run: the call to _dump_golden_artifacts(...) isn’t wrapped, so any filesystem/serialization issue (permissions, full disk, unsupported dtype -> .numpy(), etc.) will turn an otherwise-successful correctness check into a test failure. Since artifacts are optional, catch exceptions around this call and log a warning (similar to the .pte dump logic in runner.py).
Suggested change:
```diff
-self._dump_golden_artifacts(
-    artifact_dir, artifact_name, inputs_to_run, reference_output
-)
+try:
+    self._dump_golden_artifacts(
+        artifact_dir, artifact_name, inputs_to_run, reference_output
+    )
+except Exception as e:
+    logger = logging.getLogger(__name__)
+    logger.warning(
+        "Failed to dump golden artifacts for '%s': %s",
+        artifact_name,
+        e,
+    )
```
```diff
@@ -210,11 +218,25 @@ def build_result(
     statistics_callback=lambda stats: error_statistics.append(stats),
     atol=1e-1,
     rtol=4e-2,
+    artifact_dir=artifact_dir,
+    artifact_name=artifact_name,
 )
```
artifact_name is derived only from test_base_name (the un-parameterized pytest node name). For parameterized tests (e.g. static vs dynamic shapes, dtype variants) this will produce identical filenames and will overwrite previously saved inputs/outputs/.pte within the same run. Consider incorporating test_name (includes params) and/or subtest_index into the artifact name (sanitized for filesystem) to avoid collisions and make artifacts traceable back to the exact test variant.
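A sketch of one way to derive collision-free names as suggested (the helper and its signature are illustrative, not the PR's code; `subtest_index` follows the field name mentioned above):

```python
import re


def make_artifact_name(test_name: str, subtest_index: int) -> str:
    """Sanitize a parameterized pytest node name into a filesystem-safe
    artifact prefix, keeping the params so test variants do not collide.

    e.g. "test_mv3[dtype=fp16]" with subtest 0 -> "test_mv3_dtype_fp16_0"
    """
    # Collapse any run of characters unsafe in filenames into "_".
    safe = re.sub(r"[^A-Za-z0-9_.-]+", "_", test_name).strip("_")
    return f"{safe}_{subtest_index}"
```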
```bash
export GOLDEN_ARTIFACTS_DIR="${GOLDEN_DIR}"

EXIT_CODE=0
${CONDA_RUN_CMD} pytest -c /dev/nul -n auto backends/test/suite/$SUITE/ -m flow_$FLOW --json-report --json-report-file="$REPORT_FILE" || EXIT_CODE=$?
```
pytest -c /dev/nul looks like a typo: /dev/nul typically doesn’t exist and will cause pytest to fail to start when trying to load the config file. This should likely be /dev/null (commonly used to ignore repo pytest.ini).
Suggested change:
```diff
-${CONDA_RUN_CMD} pytest -c /dev/nul -n auto backends/test/suite/$SUITE/ -m flow_$FLOW --json-report --json-report-file="$REPORT_FILE" || EXIT_CODE=$?
+${CONDA_RUN_CMD} pytest -c /dev/null -n auto backends/test/suite/$SUITE/ -m flow_$FLOW --json-report --json-report-file="$REPORT_FILE" || EXIT_CODE=$?
```
- Prefix per-model golden zips with flow name to avoid cross-flow filename collisions (e.g. `xnnpack_mobilenet_v3_small_golden.zip`)
- Collect pre-packaged golden zips in workflow instead of flattening raw `.pte`/`.bin` files that would overwrite across flows
- Wrap `_dump_golden_artifacts` in try/except so filesystem errors don't fail otherwise-passing correctness tests
- Append `subtest_index` to artifact name for parameterized test variants
- Fix `/dev/nul` typo to `/dev/null` in pytest config override
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
```python
for i, out in enumerate(reference_output):
    if isinstance(out, torch.Tensor):
        suffix = "" if len(reference_output) == 1 else f"_{i}"
        path = os.path.join(
            artifact_dir, f"{artifact_name}_expected_output{suffix}.bin"
        )
        out.contiguous().numpy().tofile(path)
        logger.info(f"Saved golden output to {path}")
```
Similar to the input handling, the loop only saves outputs that are torch.Tensor instances. If reference_output contains non-tensor elements after being converted to a tuple, those elements will be silently skipped. This could result in incomplete output files. Consider logging a warning when non-tensor outputs are encountered and skipped.
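The warn-and-skip pattern suggested here can be sketched torch-free by injecting the tensor predicate and save callback (both parameters are illustrative, not the harness's actual API):

```python
import logging

logger = logging.getLogger(__name__)


def dump_outputs(reference_output, is_tensor, save_fn):
    """Save tensor outputs and warn, rather than silently skip, on
    anything else. Returns the indices that were actually saved."""
    saved = []
    for i, out in enumerate(reference_output):
        if not is_tensor(out):
            logger.warning(
                "Skipping non-tensor output %d of type %s", i, type(out).__name__
            )
            continue
        save_fn(i, out)
        saved.append(i)
    return saved
```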
```python
if isinstance(reference_output, torch.Tensor):
    reference_output = (reference_output,)
elif isinstance(reference_output, OrderedDict):
    reference_output = tuple(reference_output.values())
```
The function does not handle the case where reference_output is already a tuple. According to the existing _compare_outputs method (lines 474-477), the code handles torch.Tensor and OrderedDict, but if reference_output is already a tuple (which is a valid case), it will not be normalized. This could lead to issues if the tuple contains non-tensor elements or needs further processing. Consider adding a check for tuple type or ensuring all possible output types are handled consistently.
Suggested change:
```diff
-        reference_output = tuple(reference_output.values())
+        reference_output = tuple(reference_output.values())
+    elif isinstance(reference_output, (list, tuple)):
+        reference_output = tuple(reference_output)
```
```bash
zip -j "${GOLDEN_DIR}/${FLOW}_${model_name}_golden.zip" \
  "${model_name}.pte" \
  ${model_name}_input*.bin \
  ${model_name}_expected_output*.bin \
  2>/dev/null || true
```
The bash script does not properly quote the glob patterns when they could be empty. If ${model_name}_input*.bin or ${model_name}_expected_output*.bin don't match any files, the zip command will receive literal strings with asterisks. While the 2>/dev/null || true handles errors, this could silently create zips with missing artifacts. Consider explicitly checking for the existence of at least one input and output file before creating the zip, or using a more robust glob pattern that ensures matching files exist.
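A sketch of the existence check described here, using bash's `nullglob` so unmatched patterns expand to nothing (the example model name is hypothetical; `GOLDEN_DIR` and the filename patterns follow the snippet above):

```shell
shopt -s nullglob
model_name="${model_name:-mobilenet_v3_small}"   # illustrative example model
GOLDEN_DIR="${GOLDEN_DIR:-/tmp/golden-artifacts}"

inputs=( "${model_name}"_input*.bin )
outputs=( "${model_name}"_expected_output*.bin )
if (( ${#inputs[@]} == 0 || ${#outputs[@]} == 0 )); then
  # Warn instead of silently creating an incomplete zip.
  echo "WARNING: incomplete golden set for ${model_name}; skipping zip" >&2
else
  zip -j "${GOLDEN_DIR}/${model_name}_golden.zip" \
    "${model_name}.pte" "${inputs[@]}" "${outputs[@]}"
fi
shopt -u nullglob
```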
```python
    reference_output,
):
    logger = logging.getLogger(__name__)
    os.makedirs(artifact_dir, exist_ok=True)
```
The artifact directory creation should be done earlier to catch errors during the actual test run rather than silently failing later. Currently, if os.makedirs fails, the exception is caught and logged as a warning, but the test continues. Since this is called after successful output comparison, there's a risk that test results could be marked as successful even though artifact generation failed. Consider whether artifact generation failures should be treated as test failures, or at minimum, ensure that the directory creation happens before the comparison so that filesystem issues are caught early.
```yaml
- name: Download model test artifacts
  uses: actions/download-artifact@v4
  with:
    pattern: test-report-*-models
```
The pattern `test-report-*-models` only downloads artifacts from the 'models' suite, but not from the 'operators' suite. According to the test-backend-linux job matrix, both 'models' and 'operators' suites are run (line 47), and both could potentially generate golden artifacts. If golden artifacts are also expected from operator tests, this pattern should be `test-report-*` to include both suites, or the pattern should explicitly include operators as well.
Suggested change:
```diff
-    pattern: test-report-*-models
+    pattern: test-report-*
```
```yaml
- name: Upload golden artifacts to S3
  uses: seemethere/upload-artifact-s3@v5
  if: ${{ hashFiles('golden_artifacts_*.zip') != '' }}
```
The condition checks for the existence of golden_artifacts_*.zip files to determine whether to upload to S3, but this check happens in the step itself (line 98). If for some reason the file doesn't exist at that point, the step will be skipped silently. However, the step name suggests it should "Upload golden artifacts to S3" unconditionally if the package-golden-artifacts job succeeded. Consider whether the conditional should be on the job level (line 63) rather than the step level, or if the conditional logic needs adjustment to match the intended behavior.
Suggested change:
```diff
-  if: ${{ hashFiles('golden_artifacts_*.zip') != '' }}
```
```python
for i, inp in enumerate(inputs):
    if isinstance(inp, torch.Tensor):
        suffix = "" if len(inputs) == 1 else f"_{i}"
        path = os.path.join(artifact_dir, f"{artifact_name}_input{suffix}.bin")
        inp.contiguous().numpy().tofile(path)
        logger.info(f"Saved golden input to {path}")
```
The loop only saves inputs that are torch.Tensor instances, silently skipping any non-tensor inputs. This could lead to incomplete golden artifact sets if models accept mixed tensor and non-tensor inputs (e.g., integers, floats, booleans). While this might be intentional for simplicity, it should be documented or a warning should be logged when non-tensor inputs are skipped, so that users are aware that the golden artifacts may not fully represent the test case.
Remove per-model zip step from test_backend.sh — raw .pte/.bin files stay in the per-flow directory. The workflow packaging step now collects them preserving flow directory structure, producing a single zip that consumers only need to unzip once.
```diff
 EXIT_CODE=0
-${CONDA_RUN_CMD} pytest -c /dev/nul -n auto backends/test/suite/$SUITE/ -m flow_$FLOW --json-report --json-report-file="$REPORT_FILE" || EXIT_CODE=$?
+${CONDA_RUN_CMD} pytest -c /dev/null -n auto backends/test/suite/$SUITE/ -m flow_$FLOW --json-report --json-report-file="$REPORT_FILE" || EXIT_CODE=$?
```
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
```bash
cp "$flow_dir"/*.pte "$flow_dir"/*_input*.bin "$flow_dir"/*_expected_output*.bin \
  "golden_combined/${flow_name}/" 2>/dev/null || true
```
The shell command uses a glob pattern that could fail silently if there are no matching files. The copy command with 2>/dev/null || true suppresses all errors, which could hide legitimate issues like permission problems or disk space errors. Consider checking if the source files exist before attempting to copy, and only suppress the expected "file not found" error.
Suggested change:
```diff
-cp "$flow_dir"/*.pte "$flow_dir"/*_input*.bin "$flow_dir"/*_expected_output*.bin \
-  "golden_combined/${flow_name}/" 2>/dev/null || true
+cp_sources=()
+for pattern in "$flow_dir"/*.pte "$flow_dir"/*_input*.bin "$flow_dir"/*_expected_output*.bin; do
+  for f in $pattern; do
+    [ -e "$f" ] || continue
+    cp_sources+=("$f")
+  done
+done
+if [ "${#cp_sources[@]}" -gt 0 ]; then
+  cp "${cp_sources[@]}" "golden_combined/${flow_name}/"
+fi
```
```bash
done

if find golden_combined -name '*.pte' | grep -q .; then
  (cd golden_combined && zip -r "../golden_artifacts_${TIMESTAMP}.zip" .)
```
The PR description mentions "These artifacts are packaged into per-model zips and a combined golden_artifacts_yymmddhh.zip", but the implementation only creates a combined zip file (line 92). There are no per-model zips being created. Either update the PR description to match the implementation, or add the per-model zip creation step if it was intended.
Summary
After successful model correctness verification (torch.allclose against eager), dump the input tensors, eager reference output, and serialized .pte as golden files. These artifacts are packaged into per-model zips and a combined golden_artifacts_yymmddhh.zip, then uploaded to S3 via the existing test-infra artifact pipeline.
Controlled by the GOLDEN_ARTIFACTS_DIR environment variable — when unset, behavior is unchanged. The test_backend.sh script sets this automatically.
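Since the dumped `.bin` files are raw tensor bytes with no header, a consumer has to know the dtype and shape out of band. A minimal stdlib round-trip sketch of that format (the file name and values are illustrative; float32 is assumed, matching `tensor.numpy().tofile(path)` for a float32 tensor):

```python
import os
import struct
import tempfile

# Write three float32 values the way a raw .tofile() dump lays them out:
# packed element bytes, no dtype or shape metadata ("<" = little-endian,
# the typical layout on CI hosts).
values = [1.0, 2.0, 3.0]
path = os.path.join(tempfile.mkdtemp(), "model_input0.bin")
with open(path, "wb") as f:
    f.write(struct.pack(f"<{len(values)}f", *values))

# Reading back requires knowing the element type (4-byte float32 here).
with open(path, "rb") as f:
    raw = f.read()
restored = list(struct.unpack(f"<{len(raw) // 4}f", raw))
```

If the format ever needs to be self-describing, a small sidecar JSON with dtype and shape per file would avoid this out-of-band coupling.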
Test plan
CI