Skip to content

Add golden artifact generation to nightly backend test suite#17663

Merged
kirklandsign merged 12 commits into
mainfrom
android/nightly-golden-artifacts
Feb 25, 2026
Merged

Add golden artifact generation to nightly backend test suite#17663
kirklandsign merged 12 commits into
mainfrom
android/nightly-golden-artifacts

Conversation

@kirklandsign
Copy link
Copy Markdown
Contributor

Summary

After successful model correctness verification (torch.allclose against eager), dump the input tensors, eager reference output, and serialized .pte as golden files. These artifacts are packaged into per-model zips and a combined golden_artifacts_yymmddhh.zip, then uploaded to S3 via the existing test-infra artifact pipeline.

Controlled by the GOLDEN_ARTIFACTS_DIR environment variable — when unset, behavior is unchanged. The test_backend.sh script sets this automatically.

Test plan

CI

After successful model correctness verification (torch.allclose against
eager), dump the input tensors, eager reference output, and serialized
.pte as golden files. These artifacts are packaged into per-model zips
and a combined golden_artifacts_yymmddhh.zip, then uploaded to S3 via
the existing test-infra artifact pipeline.

Controlled by the GOLDEN_ARTIFACTS_DIR environment variable — when unset,
behavior is unchanged. The test_backend.sh script sets this automatically.

Disclosure: PR authored with assistance from Claude.
Temporarily widen the trigger to include changes to test_backend.sh,
backends/test/harness/, and backends/test/suite/ so the golden artifacts
pipeline runs on this PR.

Disclosure: PR authored with assistance from Claude.
@pytorch-bot
Copy link
Copy Markdown

pytorch-bot Bot commented Feb 24, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17663

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 24, 2026
@github-actions
Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

The reusable workflow caller (test-backend-xnnpack.yml) does not grant
id-token: write, so nested jobs cannot request it. Remove the permissions
block and upload-artifact-to-s3 for now — golden artifacts will be
available as regular GH artifacts. S3 upload can be re-enabled once the
caller workflows grant the necessary permissions.

Disclosure: PR authored with assistance from Claude.
Add permissions (id-token: write, contents: read) to the caller workflow
test-backend-xnnpack.yml so the nested package-golden-artifacts job can
use OIDC for S3 uploads. Restore permissions and upload-artifact-to-s3
on the nested job.

Disclosure: PR authored with assistance from Claude.
Replace linux_job_v2 with a plain runs-on: ubuntu-22.04 job that uses
actions/download-artifact and actions/upload-artifact directly. This
avoids the GITHUB_TOKEN unavailability issue inside the linux_job_v2
docker container and removes the need for id-token:write permissions.

Disclosure: PR authored with assistance from Claude.
Switch package-golden-artifacts runner to linux.2xlarge (self-hosted
with S3 access) and add seemethere/upload-artifact-s3 step to persist
golden artifacts to the gha-artifacts S3 bucket with 90-day retention.
@kirklandsign kirklandsign marked this pull request as ready for review February 24, 2026 20:04
Copilot AI review requested due to automatic review settings February 24, 2026 20:04
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds optional “golden artifact” generation to the backend test suite so nightly runs can capture inputs, eager reference outputs, and serialized .pte files for downstream validation/debugging and upload them via the existing artifact pipeline.

Changes:

  • Plumb artifact_dir/artifact_name through the test runner into the harness to dump input/output .bin files after successful correctness checks.
  • Dump the serialized .pte as a golden artifact after successful comparisons.
  • Add CI packaging/upload steps to combine golden artifacts into a timestamped zip and upload to GitHub Actions artifacts + S3.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
backends/test/suite/runner.py Adds artifact naming and .pte dumping after successful runs.
backends/test/suite/conftest.py Resolves golden artifact output directory from GOLDEN_ARTIFACTS_DIR.
backends/test/harness/tester.py Dumps golden input/output tensors as raw .bin files.
.github/workflows/test-backend-xnnpack.yml Expands PR path triggers to include harness/suite/script changes.
.github/workflows/_test_backend.yml Adds a job to package and upload combined golden artifacts.
.ci/scripts/test_backend.sh Sets GOLDEN_ARTIFACTS_DIR and zips per-model golden artifacts.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +88 to +90
GOLDEN_DIR="${ARTIFACT_DIR}/golden-artifacts"
export GOLDEN_ARTIFACTS_DIR="${GOLDEN_DIR}"

Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GOLDEN_ARTIFACTS_DIR is exported unconditionally, so the operators suite will also generate golden inputs/outputs and .pte files even though the packaging job only collects *-models artifacts. This will increase artifact size and I/O for operators runs; consider only setting this env var (or only zipping) when SUITE=models (or when a separate opt-in flag is set).

Copilot uses AI. Check for mistakes.
Comment thread .ci/scripts/test_backend.sh Outdated
Comment on lines +99 to +107
# Group files by model name prefix and zip each model's artifacts.
for pte in *.pte; do
[[ -f "$pte" ]] || continue
model_name="${pte%.pte}"
zip -j "${GOLDEN_DIR}/${model_name}_golden.zip" \
"${model_name}.pte" \
${model_name}_input*.bin \
${model_name}_expected_output*.bin \
2>/dev/null || true
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per-model zips are written to ${GOLDEN_DIR}/${model_name}_golden.zip (outside the per-flow directory). In the workflow matrix, multiple flows can produce the same model_name, which will silently overwrite zips from earlier flows. Include $FLOW in the zip filename or keep the per-model zips under ${GOLDEN_DIR}/${FLOW}/ to avoid collisions.

Copilot uses AI. Check for mistakes.
Comment thread .github/workflows/_test_backend.yml Outdated
Comment on lines +80 to +84
-exec cp {} golden_combined/ \;

if ls golden_combined/*.pte 1>/dev/null 2>&1; then
(cd golden_combined && zip -r "../golden_artifacts_${TIMESTAMP}.zip" .)
echo "Created golden_artifacts_${TIMESTAMP}.zip with $(ls golden_combined/*.pte | wc -l) models."
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The packaging step flattens all .pte/.bin files from downloaded/ into a single golden_combined/ directory via cp. Since artifacts are produced per-flow, identical filenames across flows (same model/test name) will overwrite each other and the combined zip will silently drop files. Preserve directory structure (e.g. copy with --parents or zip from the original tree) or prefix filenames with flow/suite to keep them unique.

Suggested change
-exec cp {} golden_combined/ \;
if ls golden_combined/*.pte 1>/dev/null 2>&1; then
(cd golden_combined && zip -r "../golden_artifacts_${TIMESTAMP}.zip" .)
echo "Created golden_artifacts_${TIMESTAMP}.zip with $(ls golden_combined/*.pte | wc -l) models."
-exec cp --parents {} golden_combined/ \;
if find golden_combined -name '*.pte' -print -quit | grep -q .; then
(cd golden_combined && zip -r "../golden_artifacts_${TIMESTAMP}.zip" .)
echo "Created golden_artifacts_${TIMESTAMP}.zip with $(find golden_combined -name '*.pte' | wc -l) models."

Copilot uses AI. Check for mistakes.
Comment thread backends/test/harness/tester.py Outdated
Comment on lines +355 to +357
self._dump_golden_artifacts(
artifact_dir, artifact_name, inputs_to_run, reference_output
)
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Golden artifact dumping can raise and fail the test run: the call to _dump_golden_artifacts(...) isn’t wrapped, so any filesystem/serialization issue (permissions, full disk, unsupported dtype -> .numpy(), etc.) will turn an otherwise-successful correctness check into a test failure. Since artifacts are optional, catch exceptions around this call and log a warning (similar to the .pte dump logic in runner.py).

Suggested change
self._dump_golden_artifacts(
artifact_dir, artifact_name, inputs_to_run, reference_output
)
try:
self._dump_golden_artifacts(
artifact_dir, artifact_name, inputs_to_run, reference_output
)
except Exception as e:
logger = logging.getLogger(__name__)
logger.warning(
"Failed to dump golden artifacts for '%s': %s",
artifact_name,
e,
)

Copilot uses AI. Check for mistakes.
Comment thread backends/test/suite/runner.py Outdated
Comment on lines 207 to 223
@@ -210,11 +218,25 @@ def build_result(
statistics_callback=lambda stats: error_statistics.append(stats),
atol=1e-1,
rtol=4e-2,
artifact_dir=artifact_dir,
artifact_name=artifact_name,
)
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

artifact_name is derived only from test_base_name (the un-parameterized pytest node name). For parameterized tests (e.g. static vs dynamic shapes, dtype variants) this will produce identical filenames and will overwrite previously saved inputs/outputs/.pte within the same run. Consider incorporating test_name (includes params) and/or subtest_index into the artifact name (sanitized for filesystem) to avoid collisions and make artifacts traceable back to the exact test variant.

Copilot uses AI. Check for mistakes.
Comment thread .ci/scripts/test_backend.sh Outdated
export GOLDEN_ARTIFACTS_DIR="${GOLDEN_DIR}"

EXIT_CODE=0
${CONDA_RUN_CMD} pytest -c /dev/nul -n auto backends/test/suite/$SUITE/ -m flow_$FLOW --json-report --json-report-file="$REPORT_FILE" || EXIT_CODE=$?
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pytest -c /dev/nul looks like a typo: /dev/nul typically doesn’t exist and will cause pytest to fail to start when trying to load the config file. This should likely be /dev/null (commonly used to ignore repo pytest.ini).

Suggested change
${CONDA_RUN_CMD} pytest -c /dev/nul -n auto backends/test/suite/$SUITE/ -m flow_$FLOW --json-report --json-report-file="$REPORT_FILE" || EXIT_CODE=$?
${CONDA_RUN_CMD} pytest -c /dev/null -n auto backends/test/suite/$SUITE/ -m flow_$FLOW --json-report --json-report-file="$REPORT_FILE" || EXIT_CODE=$?

Copilot uses AI. Check for mistakes.
- Prefix per-model golden zips with flow name to avoid cross-flow
  filename collisions (e.g. xnnpack_mobilenet_v3_small_golden.zip)
- Collect pre-packaged golden zips in workflow instead of flattening
  raw .pte/.bin files that would overwrite across flows
- Wrap _dump_golden_artifacts in try/except so filesystem errors don't
  fail otherwise-passing correctness tests
- Append subtest_index to artifact name for parameterized test variants
- Fix /dev/nul typo to /dev/null in pytest config override
Copilot AI review requested due to automatic review settings February 24, 2026 21:33
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +393 to +400
for i, out in enumerate(reference_output):
if isinstance(out, torch.Tensor):
suffix = "" if len(reference_output) == 1 else f"_{i}"
path = os.path.join(
artifact_dir, f"{artifact_name}_expected_output{suffix}.bin"
)
out.contiguous().numpy().tofile(path)
logger.info(f"Saved golden output to {path}")
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the input handling, the loop only saves outputs that are torch.Tensor instances. If reference_output contains non-tensor elements after being converted to a tuple, those elements will be silently skipped. This could result in incomplete output files. Consider logging a warning when non-tensor outputs are encountered and skipped.

Copilot uses AI. Check for mistakes.
if isinstance(reference_output, torch.Tensor):
reference_output = (reference_output,)
elif isinstance(reference_output, OrderedDict):
reference_output = tuple(reference_output.values())
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function does not handle the case where reference_output is already a tuple. According to the existing _compare_outputs method (lines 474-477), the code handles torch.Tensor and OrderedDict, but if reference_output is already a tuple (which is a valid case), it will not be normalized. This could lead to issues if the tuple contains non-tensor elements or needs further processing. Consider adding a check for tuple type or ensuring all possible output types are handled consistently.

Suggested change
reference_output = tuple(reference_output.values())
reference_output = tuple(reference_output.values())
elif isinstance(reference_output, (list, tuple)):
reference_output = tuple(reference_output)

Copilot uses AI. Check for mistakes.
Comment thread .ci/scripts/test_backend.sh Outdated
Comment on lines +102 to +106
zip -j "${GOLDEN_DIR}/${FLOW}_${model_name}_golden.zip" \
"${model_name}.pte" \
${model_name}_input*.bin \
${model_name}_expected_output*.bin \
2>/dev/null || true
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bash script does not properly quote the glob patterns when they could be empty. If ${model_name}_input*.bin or ${model_name}_expected_output*.bin don't match any files, the zip command will receive literal strings with asterisks. While the 2>/dev/null || true handles errors, this could silently create zips with missing artifacts. Consider explicitly checking for the existence of at least one input and output file before creating the zip, or using a more robust glob pattern that ensures matching files exist.

Copilot uses AI. Check for mistakes.
reference_output,
):
logger = logging.getLogger(__name__)
os.makedirs(artifact_dir, exist_ok=True)
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The artifact directory creation should be done earlier to catch errors during the actual test run rather than silently failing later. Currently, if os.makedirs fails, the exception is caught and logged as a warning, but the test continues. Since this is called after successful output comparison, there's a risk that test results could be marked as successful even though artifact generation failed. Consider whether artifact generation failures should be treated as test failures, or at minimum, ensure that the directory creation happens before the comparison so that filesystem issues are caught early.

Copilot uses AI. Check for mistakes.
- name: Download model test artifacts
uses: actions/download-artifact@v4
with:
pattern: test-report-*-models
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pattern 'test-report--models' only downloads artifacts from the 'models' suite, but not from the 'operators' suite. According to the test-backend-linux job matrix, both 'models' and 'operators' suites are run (line 47), and both could potentially generate golden artifacts. If golden artifacts are also expected from operator tests, this pattern should be 'test-report-' to include both suites, or the pattern should explicitly include operators as well.

Suggested change
pattern: test-report-*-models
pattern: test-report-*

Copilot uses AI. Check for mistakes.

- name: Upload golden artifacts to S3
uses: seemethere/upload-artifact-s3@v5
if: ${{ hashFiles('golden_artifacts_*.zip') != '' }}
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition checks for the existence of golden_artifacts_*.zip files to determine whether to upload to S3, but this check happens in the step itself (line 98). If for some reason the file doesn't exist at that point, the step will be skipped silently. However, the step name suggests it should "Upload golden artifacts to S3" unconditionally if the package-golden-artifacts job succeeded. Consider whether the conditional should be on the job level (line 63) rather than the step level, or if the conditional logic needs adjustment to match the intended behavior.

Suggested change
if: ${{ hashFiles('golden_artifacts_*.zip') != '' }}

Copilot uses AI. Check for mistakes.
Comment on lines +381 to +386
for i, inp in enumerate(inputs):
if isinstance(inp, torch.Tensor):
suffix = "" if len(inputs) == 1 else f"_{i}"
path = os.path.join(artifact_dir, f"{artifact_name}_input{suffix}.bin")
inp.contiguous().numpy().tofile(path)
logger.info(f"Saved golden input to {path}")
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop only saves inputs that are torch.Tensor instances, silently skipping any non-tensor inputs. This could lead to incomplete golden artifact sets if models accept mixed tensor and non-tensor inputs (e.g., integers, floats, booleans). While this might be intentional for simplicity, it should be documented or a warning should be logged when non-tensor inputs are skipped, so that users are aware that the golden artifacts may not fully represent the test case.

Copilot uses AI. Check for mistakes.
Remove per-model zip step from test_backend.sh — raw .pte/.bin files
stay in the per-flow directory. The workflow packaging step now collects
them preserving flow directory structure, producing a single zip that
consumers only need to unzip once.

EXIT_CODE=0
${CONDA_RUN_CMD} pytest -c /dev/nul -n auto backends/test/suite/$SUITE/ -m flow_$FLOW --json-report --json-report-file="$REPORT_FILE" || EXIT_CODE=$?
${CONDA_RUN_CMD} pytest -c /dev/null -n auto backends/test/suite/$SUITE/ -m flow_$FLOW --json-report --json-report-file="$REPORT_FILE" || EXIT_CODE=$?
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha

Copilot AI review requested due to automatic review settings February 25, 2026 00:17
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +86 to +87
cp "$flow_dir"/*.pte "$flow_dir"/*_input*.bin "$flow_dir"/*_expected_output*.bin \
"golden_combined/${flow_name}/" 2>/dev/null || true
Copy link

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shell command uses a glob pattern that could fail silently if there are no matching files. The copy command with 2>/dev/null || true suppresses all errors, which could hide legitimate issues like permission problems or disk space errors. Consider checking if the source files exist before attempting to copy, and only suppress the expected "file not found" error.

Suggested change
cp "$flow_dir"/*.pte "$flow_dir"/*_input*.bin "$flow_dir"/*_expected_output*.bin \
"golden_combined/${flow_name}/" 2>/dev/null || true
cp_sources=()
for pattern in "$flow_dir"/*.pte "$flow_dir"/*_input*.bin "$flow_dir"/*_expected_output*.bin; do
for f in $pattern; do
[ -e "$f" ] || continue
cp_sources+=("$f")
done
done
if [ "${#cp_sources[@]}" -gt 0 ]; then
cp "${cp_sources[@]}" "golden_combined/${flow_name}/"
fi

Copilot uses AI. Check for mistakes.
done

if find golden_combined -name '*.pte' | grep -q .; then
(cd golden_combined && zip -r "../golden_artifacts_${TIMESTAMP}.zip" .)
Copy link

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description mentions "These artifacts are packaged into per-model zips and a combined golden_artifacts_yymmddhh.zip", but the implementation only creates a combined zip file (line 92). There are no per-model zips being created. Either update the PR description to match the implementation, or add the per-model zip creation step if it was intended.

Copilot uses AI. Check for mistakes.
@kirklandsign kirklandsign merged commit 7e1e6b4 into main Feb 25, 2026
325 of 348 checks passed
@kirklandsign kirklandsign deleted the android/nightly-golden-artifacts branch February 25, 2026 00:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants