Add golden artifact generation to nightly backend test suite #17663
Conversation
After successful model correctness verification (torch.allclose against eager), dump the input tensors, eager reference output, and serialized .pte as golden files. These artifacts are packaged into per-model zips and a combined golden_artifacts_yymmddhh.zip, then uploaded to S3 via the existing test-infra artifact pipeline. Controlled by the GOLDEN_ARTIFACTS_DIR environment variable — when unset, behavior is unchanged. The test_backend.sh script sets this automatically. Disclosure: PR authored with assistance from Claude.
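For illustration, the env-var gate described above could be resolved roughly like this (a sketch, not the PR's actual `conftest.py`; only the `GOLDEN_ARTIFACTS_DIR` variable name comes from this PR — the helper name is hypothetical):

```python
import os
from typing import Optional


def resolve_golden_artifacts_dir() -> Optional[str]:
    """Return the golden-artifact output directory, or None to disable dumping.

    When GOLDEN_ARTIFACTS_DIR is unset (or empty), golden artifact
    generation is skipped entirely and test behavior is unchanged.
    """
    artifact_dir = os.environ.get("GOLDEN_ARTIFACTS_DIR")
    if not artifact_dir:
        return None
    # Create the directory eagerly so filesystem problems surface early.
    os.makedirs(artifact_dir, exist_ok=True)
    return artifact_dir
```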
Temporarily widen the trigger to include changes to test_backend.sh, backends/test/harness/, and backends/test/suite/ so the golden artifacts pipeline runs on this PR. Disclosure: PR authored with assistance from Claude.
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17663
Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
The reusable workflow caller (test-backend-xnnpack.yml) does not grant id-token: write, so nested jobs cannot request it. Remove the permissions block and upload-artifact-to-s3 for now — golden artifacts will be available as regular GH artifacts. S3 upload can be re-enabled once the caller workflows grant the necessary permissions. Disclosure: PR authored with assistance from Claude.
Add permissions (id-token: write, contents: read) to the caller workflow test-backend-xnnpack.yml so the nested package-golden-artifacts job can use OIDC for S3 uploads. Restore permissions and upload-artifact-to-s3 on the nested job. Disclosure: PR authored with assistance from Claude.
Replace linux_job_v2 with a plain runs-on: ubuntu-22.04 job that uses actions/download-artifact and actions/upload-artifact directly. This avoids the GITHUB_TOKEN unavailability issue inside the linux_job_v2 docker container and removes the need for id-token:write permissions. Disclosure: PR authored with assistance from Claude.
Switch package-golden-artifacts runner to linux.2xlarge (self-hosted with S3 access) and add seemethere/upload-artifact-s3 step to persist golden artifacts to the gha-artifacts S3 bucket with 90-day retention.
Pull request overview
Adds optional “golden artifact” generation to the backend test suite so nightly runs can capture inputs, eager reference outputs, and serialized .pte files for downstream validation/debugging and upload them via the existing artifact pipeline.
Changes:
- Plumb `artifact_dir`/`artifact_name` through the test runner into the harness to dump input/output `.bin` files after successful correctness checks.
- Dump the serialized `.pte` as a golden artifact after successful comparisons.
- Add CI packaging/upload steps to combine golden artifacts into a timestamped zip and upload to GitHub Actions artifacts + S3.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `backends/test/suite/runner.py` | Adds artifact naming and `.pte` dumping after successful runs. |
| `backends/test/suite/conftest.py` | Resolves golden artifact output directory from `GOLDEN_ARTIFACTS_DIR`. |
| `backends/test/harness/tester.py` | Dumps golden input/output tensors as raw `.bin` files. |
| `.github/workflows/test-backend-xnnpack.yml` | Expands PR path triggers to include harness/suite/script changes. |
| `.github/workflows/_test_backend.yml` | Adds a job to package and upload combined golden artifacts. |
| `.ci/scripts/test_backend.sh` | Sets `GOLDEN_ARTIFACTS_DIR` and zips per-model golden artifacts. |
```bash
GOLDEN_DIR="${ARTIFACT_DIR}/golden-artifacts"
export GOLDEN_ARTIFACTS_DIR="${GOLDEN_DIR}"
```
GOLDEN_ARTIFACTS_DIR is exported unconditionally, so the operators suite will also generate golden inputs/outputs and .pte files even though the packaging job only collects *-models artifacts. This will increase artifact size and I/O for operators runs; consider only setting this env var (or only zipping) when SUITE=models (or when a separate opt-in flag is set).
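One way to scope generation to the models suite, as suggested (a sketch; `SUITE` and `GOLDEN_DIR` follow the variable names visible in this PR, and the default values here are purely illustrative):

```shell
# Hypothetical defaults for illustration; in CI these come from test_backend.sh.
SUITE="${SUITE:-models}"
GOLDEN_DIR="${GOLDEN_DIR:-/tmp/artifacts/golden-artifacts}"

# Only export the opt-in variable for the models suite, so the
# operators suite does not pay the extra artifact generation I/O.
if [[ "${SUITE}" == "models" ]]; then
  export GOLDEN_ARTIFACTS_DIR="${GOLDEN_DIR}"
fi
```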
```bash
# Group files by model name prefix and zip each model's artifacts.
for pte in *.pte; do
  [[ -f "$pte" ]] || continue
  model_name="${pte%.pte}"
  zip -j "${GOLDEN_DIR}/${model_name}_golden.zip" \
    "${model_name}.pte" \
    ${model_name}_input*.bin \
    ${model_name}_expected_output*.bin \
    2>/dev/null || true
```
Per-model zips are written to ${GOLDEN_DIR}/${model_name}_golden.zip (outside the per-flow directory). In the workflow matrix, multiple flows can produce the same model_name, which will silently overwrite zips from earlier flows. Include $FLOW in the zip filename or keep the per-model zips under ${GOLDEN_DIR}/${FLOW}/ to avoid collisions.
```bash
  -exec cp {} golden_combined/ \;

if ls golden_combined/*.pte 1>/dev/null 2>&1; then
  (cd golden_combined && zip -r "../golden_artifacts_${TIMESTAMP}.zip" .)
  echo "Created golden_artifacts_${TIMESTAMP}.zip with $(ls golden_combined/*.pte | wc -l) models."
```
The packaging step flattens all .pte/.bin files from downloaded/ into a single golden_combined/ directory via cp. Since artifacts are produced per-flow, identical filenames across flows (same model/test name) will overwrite each other and the combined zip will silently drop files. Preserve directory structure (e.g. copy with --parents or zip from the original tree) or prefix filenames with flow/suite to keep them unique.
Suggested change:
```diff
-  -exec cp {} golden_combined/ \;
-if ls golden_combined/*.pte 1>/dev/null 2>&1; then
-  (cd golden_combined && zip -r "../golden_artifacts_${TIMESTAMP}.zip" .)
-  echo "Created golden_artifacts_${TIMESTAMP}.zip with $(ls golden_combined/*.pte | wc -l) models."
+  -exec cp --parents {} golden_combined/ \;
+if find golden_combined -name '*.pte' -print -quit | grep -q .; then
+  (cd golden_combined && zip -r "../golden_artifacts_${TIMESTAMP}.zip" .)
+  echo "Created golden_artifacts_${TIMESTAMP}.zip with $(find golden_combined -name '*.pte' | wc -l) models."
```
```python
self._dump_golden_artifacts(
    artifact_dir, artifact_name, inputs_to_run, reference_output
)
```
Golden artifact dumping can raise and fail the test run: the call to _dump_golden_artifacts(...) isn’t wrapped, so any filesystem/serialization issue (permissions, full disk, unsupported dtype -> .numpy(), etc.) will turn an otherwise-successful correctness check into a test failure. Since artifacts are optional, catch exceptions around this call and log a warning (similar to the .pte dump logic in runner.py).
Suggested change:
```diff
-self._dump_golden_artifacts(
-    artifact_dir, artifact_name, inputs_to_run, reference_output
-)
+try:
+    self._dump_golden_artifacts(
+        artifact_dir, artifact_name, inputs_to_run, reference_output
+    )
+except Exception as e:
+    logger = logging.getLogger(__name__)
+    logger.warning(
+        "Failed to dump golden artifacts for '%s': %s",
+        artifact_name,
+        e,
+    )
```
```diff
@@ -210,11 +218,25 @@ def build_result(
     statistics_callback=lambda stats: error_statistics.append(stats),
     atol=1e-1,
     rtol=4e-2,
+    artifact_dir=artifact_dir,
+    artifact_name=artifact_name,
 )
```
artifact_name is derived only from test_base_name (the un-parameterized pytest node name). For parameterized tests (e.g. static vs dynamic shapes, dtype variants) this will produce identical filenames and will overwrite previously saved inputs/outputs/.pte within the same run. Consider incorporating test_name (includes params) and/or subtest_index into the artifact name (sanitized for filesystem) to avoid collisions and make artifacts traceable back to the exact test variant.
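A sketch of one way to derive collision-free names as suggested (the helper and its signature are illustrative, not the PR's code; `subtest_index` follows the field name mentioned above):

```python
import re


def make_artifact_name(test_name: str, subtest_index: int) -> str:
    """Sanitize a parameterized pytest node name into a filesystem-safe
    artifact prefix, keeping the params so test variants do not collide.

    e.g. "test_mv3[dtype=fp16]" with subtest 0 -> "test_mv3_dtype_fp16_0"
    """
    # Collapse any run of characters unsafe in filenames into "_".
    safe = re.sub(r"[^A-Za-z0-9_.-]+", "_", test_name).strip("_")
    return f"{safe}_{subtest_index}"
```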
```bash
export GOLDEN_ARTIFACTS_DIR="${GOLDEN_DIR}"

EXIT_CODE=0
${CONDA_RUN_CMD} pytest -c /dev/nul -n auto backends/test/suite/$SUITE/ -m flow_$FLOW --json-report --json-report-file="$REPORT_FILE" || EXIT_CODE=$?
```
pytest -c /dev/nul looks like a typo: /dev/nul typically doesn’t exist and will cause pytest to fail to start when trying to load the config file. This should likely be /dev/null (commonly used to ignore repo pytest.ini).
Suggested change:
```diff
-${CONDA_RUN_CMD} pytest -c /dev/nul -n auto backends/test/suite/$SUITE/ -m flow_$FLOW --json-report --json-report-file="$REPORT_FILE" || EXIT_CODE=$?
+${CONDA_RUN_CMD} pytest -c /dev/null -n auto backends/test/suite/$SUITE/ -m flow_$FLOW --json-report --json-report-file="$REPORT_FILE" || EXIT_CODE=$?
```
- Prefix per-model golden zips with flow name to avoid cross-flow filename collisions (e.g. `xnnpack_mobilenet_v3_small_golden.zip`)
- Collect pre-packaged golden zips in workflow instead of flattening raw `.pte`/`.bin` files that would overwrite across flows
- Wrap `_dump_golden_artifacts` in try/except so filesystem errors don't fail otherwise-passing correctness tests
- Append `subtest_index` to artifact name for parameterized test variants
- Fix `/dev/nul` typo to `/dev/null` in pytest config override
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
```python
for i, out in enumerate(reference_output):
    if isinstance(out, torch.Tensor):
        suffix = "" if len(reference_output) == 1 else f"_{i}"
        path = os.path.join(
            artifact_dir, f"{artifact_name}_expected_output{suffix}.bin"
        )
        out.contiguous().numpy().tofile(path)
        logger.info(f"Saved golden output to {path}")
```
Similar to the input handling, the loop only saves outputs that are torch.Tensor instances. If reference_output contains non-tensor elements after being converted to a tuple, those elements will be silently skipped. This could result in incomplete output files. Consider logging a warning when non-tensor outputs are encountered and skipped.
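The warn-and-skip pattern suggested here can be sketched torch-free by injecting the tensor predicate and save callback (both parameters are illustrative, not the harness's actual API):

```python
import logging

logger = logging.getLogger(__name__)


def dump_outputs(reference_output, is_tensor, save_fn):
    """Save tensor outputs and warn, rather than silently skip, on
    anything else. Returns the indices that were actually saved."""
    saved = []
    for i, out in enumerate(reference_output):
        if not is_tensor(out):
            logger.warning(
                "Skipping non-tensor output %d of type %s", i, type(out).__name__
            )
            continue
        save_fn(i, out)
        saved.append(i)
    return saved
```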
```python
if isinstance(reference_output, torch.Tensor):
    reference_output = (reference_output,)
elif isinstance(reference_output, OrderedDict):
    reference_output = tuple(reference_output.values())
```
The function does not handle the case where reference_output is already a tuple. According to the existing _compare_outputs method (lines 474-477), the code handles torch.Tensor and OrderedDict, but if reference_output is already a tuple (which is a valid case), it will not be normalized. This could lead to issues if the tuple contains non-tensor elements or needs further processing. Consider adding a check for tuple type or ensuring all possible output types are handled consistently.
Suggested change:
```diff
-        reference_output = tuple(reference_output.values())
+        reference_output = tuple(reference_output.values())
+    elif isinstance(reference_output, (list, tuple)):
+        reference_output = tuple(reference_output)
```
```bash
zip -j "${GOLDEN_DIR}/${FLOW}_${model_name}_golden.zip" \
  "${model_name}.pte" \
  ${model_name}_input*.bin \
  ${model_name}_expected_output*.bin \
  2>/dev/null || true
```
The bash script does not properly quote the glob patterns when they could be empty. If ${model_name}_input*.bin or ${model_name}_expected_output*.bin don't match any files, the zip command will receive literal strings with asterisks. While the 2>/dev/null || true handles errors, this could silently create zips with missing artifacts. Consider explicitly checking for the existence of at least one input and output file before creating the zip, or using a more robust glob pattern that ensures matching files exist.
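A sketch of the existence check described here, using bash's `nullglob` so unmatched patterns expand to nothing (the example model name is hypothetical; `GOLDEN_DIR` and the filename patterns follow the snippet above):

```shell
shopt -s nullglob
model_name="${model_name:-mobilenet_v3_small}"   # illustrative example model
GOLDEN_DIR="${GOLDEN_DIR:-/tmp/golden-artifacts}"

inputs=( "${model_name}"_input*.bin )
outputs=( "${model_name}"_expected_output*.bin )
if (( ${#inputs[@]} == 0 || ${#outputs[@]} == 0 )); then
  # Warn instead of silently creating an incomplete zip.
  echo "WARNING: incomplete golden set for ${model_name}; skipping zip" >&2
else
  zip -j "${GOLDEN_DIR}/${model_name}_golden.zip" \
    "${model_name}.pte" "${inputs[@]}" "${outputs[@]}"
fi
shopt -u nullglob
```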
```python
    reference_output,
):
    logger = logging.getLogger(__name__)
    os.makedirs(artifact_dir, exist_ok=True)
```
The artifact directory creation should be done earlier to catch errors during the actual test run rather than silently failing later. Currently, if os.makedirs fails, the exception is caught and logged as a warning, but the test continues. Since this is called after successful output comparison, there's a risk that test results could be marked as successful even though artifact generation failed. Consider whether artifact generation failures should be treated as test failures, or at minimum, ensure that the directory creation happens before the comparison so that filesystem issues are caught early.
```yaml
- name: Download model test artifacts
  uses: actions/download-artifact@v4
  with:
    pattern: test-report-*-models
```
The pattern `test-report-*-models` only downloads artifacts from the 'models' suite, but not from the 'operators' suite. According to the test-backend-linux job matrix, both 'models' and 'operators' suites are run (line 47), and both could potentially generate golden artifacts. If golden artifacts are also expected from operator tests, this pattern should be `test-report-*` to include both suites, or the pattern should explicitly include operators as well.
Suggested change:
```diff
-    pattern: test-report-*-models
+    pattern: test-report-*
```
```yaml
- name: Upload golden artifacts to S3
  uses: seemethere/upload-artifact-s3@v5
  if: ${{ hashFiles('golden_artifacts_*.zip') != '' }}
```
The condition checks for the existence of golden_artifacts_*.zip files to determine whether to upload to S3, but this check happens in the step itself (line 98). If for some reason the file doesn't exist at that point, the step will be skipped silently. However, the step name suggests it should "Upload golden artifacts to S3" unconditionally if the package-golden-artifacts job succeeded. Consider whether the conditional should be on the job level (line 63) rather than the step level, or if the conditional logic needs adjustment to match the intended behavior.
Suggested change:
```diff
-  if: ${{ hashFiles('golden_artifacts_*.zip') != '' }}
```
```python
for i, inp in enumerate(inputs):
    if isinstance(inp, torch.Tensor):
        suffix = "" if len(inputs) == 1 else f"_{i}"
        path = os.path.join(artifact_dir, f"{artifact_name}_input{suffix}.bin")
        inp.contiguous().numpy().tofile(path)
        logger.info(f"Saved golden input to {path}")
```
The loop only saves inputs that are torch.Tensor instances, silently skipping any non-tensor inputs. This could lead to incomplete golden artifact sets if models accept mixed tensor and non-tensor inputs (e.g., integers, floats, booleans). While this might be intentional for simplicity, it should be documented or a warning should be logged when non-tensor inputs are skipped, so that users are aware that the golden artifacts may not fully represent the test case.
Remove per-model zip step from test_backend.sh — raw .pte/.bin files stay in the per-flow directory. The workflow packaging step now collects them preserving flow directory structure, producing a single zip that consumers only need to unzip once.
```diff
 EXIT_CODE=0
-${CONDA_RUN_CMD} pytest -c /dev/nul -n auto backends/test/suite/$SUITE/ -m flow_$FLOW --json-report --json-report-file="$REPORT_FILE" || EXIT_CODE=$?
+${CONDA_RUN_CMD} pytest -c /dev/null -n auto backends/test/suite/$SUITE/ -m flow_$FLOW --json-report --json-report-file="$REPORT_FILE" || EXIT_CODE=$?
```
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
```bash
cp "$flow_dir"/*.pte "$flow_dir"/*_input*.bin "$flow_dir"/*_expected_output*.bin \
  "golden_combined/${flow_name}/" 2>/dev/null || true
```
The shell command uses a glob pattern that could fail silently if there are no matching files. The copy command with 2>/dev/null || true suppresses all errors, which could hide legitimate issues like permission problems or disk space errors. Consider checking if the source files exist before attempting to copy, and only suppress the expected "file not found" error.
Suggested change:
```diff
-cp "$flow_dir"/*.pte "$flow_dir"/*_input*.bin "$flow_dir"/*_expected_output*.bin \
-  "golden_combined/${flow_name}/" 2>/dev/null || true
+cp_sources=()
+for pattern in "$flow_dir"/*.pte "$flow_dir"/*_input*.bin "$flow_dir"/*_expected_output*.bin; do
+  for f in $pattern; do
+    [ -e "$f" ] || continue
+    cp_sources+=("$f")
+  done
+done
+if [ "${#cp_sources[@]}" -gt 0 ]; then
+  cp "${cp_sources[@]}" "golden_combined/${flow_name}/"
+fi
```
```bash
done

if find golden_combined -name '*.pte' | grep -q .; then
  (cd golden_combined && zip -r "../golden_artifacts_${TIMESTAMP}.zip" .)
```
The PR description mentions "These artifacts are packaged into per-model zips and a combined golden_artifacts_yymmddhh.zip", but the implementation only creates a combined zip file (line 92). There are no per-model zips being created. Either update the PR description to match the implementation, or add the per-model zip creation step if it was intended.
Summary
After successful model correctness verification (torch.allclose against eager), dump the input tensors, eager reference output, and serialized .pte as golden files. These artifacts are packaged into per-model zips and a combined golden_artifacts_yymmddhh.zip, then uploaded to S3 via the existing test-infra artifact pipeline.
Controlled by the GOLDEN_ARTIFACTS_DIR environment variable — when unset, behavior is unchanged. The test_backend.sh script sets this automatically.
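Since the dumped `.bin` files are raw tensor bytes with no header, a consumer has to know the dtype and shape out of band. A minimal stdlib round-trip sketch of that format (the file name and values are illustrative; float32 is assumed, matching `tensor.numpy().tofile(path)` for a float32 tensor):

```python
import os
import struct
import tempfile

# Write three float32 values the way a raw .tofile() dump lays them out:
# packed element bytes, no dtype or shape metadata ("<" = little-endian,
# the typical layout on CI hosts).
values = [1.0, 2.0, 3.0]
path = os.path.join(tempfile.mkdtemp(), "model_input0.bin")
with open(path, "wb") as f:
    f.write(struct.pack(f"<{len(values)}f", *values))

# Reading back requires knowing the element type (4-byte float32 here).
with open(path, "rb") as f:
    raw = f.read()
restored = list(struct.unpack(f"<{len(raw) // 4}f", raw))
```

If the format ever needs to be self-describing, a small sidecar JSON with dtype and shape per file would avoid this out-of-band coupling.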
Test plan
CI