5 changes: 4 additions & 1 deletion .ci/scripts/test_backend.sh
@@ -85,7 +85,10 @@ else
fi
CMAKE_ARGS="$EXTRA_BUILD_ARGS" ${CONDA_RUN_CMD} $SETUP_SCRIPT --build-tool cmake --build-mode Release --editable true

GOLDEN_DIR="${ARTIFACT_DIR}/golden-artifacts"
export GOLDEN_ARTIFACTS_DIR="${GOLDEN_DIR}"

Comment on lines +88 to +90

Copilot AI Feb 24, 2026

GOLDEN_ARTIFACTS_DIR is exported unconditionally, so the operators suite will also generate golden inputs/outputs and .pte files even though the packaging job only collects *-models artifacts. This will increase artifact size and I/O for operators runs; consider only setting this env var (or only zipping) when SUITE=models (or when a separate opt-in flag is set).
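A minimal sketch of that gating, assuming the SUITE variable is already set at this point in test_backend.sh (it is used by the pytest invocation below):

    # Hypothetical: only generate golden artifacts for the models suite,
    # since the packaging job only collects *-models artifacts.
    if [[ "$SUITE" == "models" ]]; then
      GOLDEN_DIR="${ARTIFACT_DIR}/golden-artifacts"
      export GOLDEN_ARTIFACTS_DIR="${GOLDEN_DIR}"
    fi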
EXIT_CODE=0
-${CONDA_RUN_CMD} pytest -c /dev/nul -n auto backends/test/suite/$SUITE/ -m flow_$FLOW --json-report --json-report-file="$REPORT_FILE" || EXIT_CODE=$?
+${CONDA_RUN_CMD} pytest -c /dev/null -n auto backends/test/suite/$SUITE/ -m flow_$FLOW --json-report --json-report-file="$REPORT_FILE" || EXIT_CODE=$?
Contributor

haha

# Generate markdown summary.
${CONDA_RUN_CMD} python -m executorch.backends.test.suite.generate_markdown_summary_json "$REPORT_FILE" > ${GITHUB_STEP_SUMMARY:-"step_summary.md"} --exit-code $EXIT_CODE
55 changes: 55 additions & 0 deletions .github/workflows/_test_backend.yml
@@ -59,6 +59,61 @@ jobs:

source .ci/scripts/test_backend.sh "${{ matrix.suite }}" "${{ matrix.flow }}" "${RUNNER_ARTIFACT_DIR}"

  package-golden-artifacts:
    if: ${{ inputs.run-linux }}
    needs: test-backend-linux
    runs-on: linux.2xlarge
    steps:
      - name: Download model test artifacts
        uses: actions/download-artifact@v4
        with:
          pattern: test-report-*-models
Copilot AI Feb 24, 2026

The pattern 'test-report-*-models' only downloads artifacts from the 'models' suite, but not from the 'operators' suite. According to the test-backend-linux job matrix, both 'models' and 'operators' suites are run (line 47), and both could potentially generate golden artifacts. If golden artifacts are also expected from operator tests, this pattern should be 'test-report-*' to include both suites, or the pattern should explicitly include operators as well.

Suggested change
-          pattern: test-report-*-models
+          pattern: test-report-*
          path: downloaded/

      - name: Package golden artifacts
        run: |
          set -eux
          TIMESTAMP=$(date -u +%y%m%d%H)
          mkdir -p golden_combined

          # Collect golden artifacts preserving flow directory structure.
          # Raw files live under downloaded/*/golden-artifacts/{flow}/.
          for flow_dir in downloaded/*/golden-artifacts/*/; do
            [ -d "$flow_dir" ] || continue
            flow_name=$(basename "$flow_dir")
            if ls "$flow_dir"/*.pte 1>/dev/null 2>&1; then
              mkdir -p "golden_combined/${flow_name}"
              cp "$flow_dir"/*.pte "$flow_dir"/*_input*.bin "$flow_dir"/*_expected_output*.bin \
                "golden_combined/${flow_name}/" 2>/dev/null || true
Comment on lines +86 to +87

Copilot AI Feb 25, 2026

The shell command uses a glob pattern that could fail silently if there are no matching files. The copy command with 2>/dev/null || true suppresses all errors, which could hide legitimate issues like permission problems or disk space errors. Consider checking if the source files exist before attempting to copy, and only suppress the expected "file not found" error.

Suggested change
-              cp "$flow_dir"/*.pte "$flow_dir"/*_input*.bin "$flow_dir"/*_expected_output*.bin \
-                "golden_combined/${flow_name}/" 2>/dev/null || true
+              cp_sources=()
+              for pattern in "$flow_dir"/*.pte "$flow_dir"/*_input*.bin "$flow_dir"/*_expected_output*.bin; do
+                for f in $pattern; do
+                  [ -e "$f" ] || continue
+                  cp_sources+=("$f")
+                done
+              done
+              if [ "${#cp_sources[@]}" -gt 0 ]; then
+                cp "${cp_sources[@]}" "golden_combined/${flow_name}/"
+              fi
            fi
          done

          if find golden_combined -name '*.pte' | grep -q .; then
            (cd golden_combined && zip -r "../golden_artifacts_${TIMESTAMP}.zip" .)
Copilot AI Feb 25, 2026

The PR description mentions "These artifacts are packaged into per-model zips and a combined golden_artifacts_yymmddhh.zip", but the implementation only creates a combined zip file (line 92). There are no per-model zips being created. Either update the PR description to match the implementation, or add the per-model zip creation step if it was intended.
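If per-model zips were intended, one possible sketch, assuming each model's files share the {artifact_name} prefix used by the test harness (model.pte, model_input*.bin, model_expected_output*.bin); note a prefix that is itself a prefix of another model's name would over-collect:

    # Hypothetical per-model packaging, run after golden_combined is populated.
    shopt -s nullglob
    for pte in golden_combined/*/*.pte; do
      flow=$(basename "$(dirname "$pte")")
      model=$(basename "$pte" .pte)
      # -j junks directory paths; the bin glob may legitimately be empty.
      files=("golden_combined/${flow}/${model}.pte" "golden_combined/${flow}/${model}"_*.bin)
      zip -j "golden_${flow}_${model}_${TIMESTAMP}.zip" "${files[@]}"
    done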
echo "Created golden_artifacts_${TIMESTAMP}.zip"
find golden_combined -type f | head -20
else
echo "No golden artifacts found."
fi

      - name: Upload combined golden artifacts
        uses: actions/upload-artifact@v4
        with:
          name: golden-artifacts-${{ inputs.backend }}
          path: golden_artifacts_*.zip
          if-no-files-found: ignore

      - name: Upload golden artifacts to S3
        uses: seemethere/upload-artifact-s3@v5
        if: ${{ hashFiles('golden_artifacts_*.zip') != '' }}
Copilot AI Feb 24, 2026

The condition checks for the existence of golden_artifacts_*.zip files to determine whether to upload to S3, but this check happens in the step itself (line 98). If for some reason the file doesn't exist at that point, the step will be skipped silently. However, the step name suggests it should "Upload golden artifacts to S3" unconditionally if the package-golden-artifacts job succeeded. Consider whether the conditional should be on the job level (line 63) rather than the step level, or if the conditional logic needs adjustment to match the intended behavior.

Suggested change
-        if: ${{ hashFiles('golden_artifacts_*.zip') != '' }}
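One way to make the gating explicit rather than relying on hashFiles at step level: have the packaging step record whether a zip was produced, then condition the upload on that output. A sketch with a hypothetical step id and output name:

      # Hypothetical addition to the "Package golden artifacts" step:
      - name: Package golden artifacts
        id: package
        run: |
          # ... existing packaging logic ...
          if compgen -G "golden_artifacts_*.zip" > /dev/null; then
            echo "have_zip=true" >> "$GITHUB_OUTPUT"
          fi

      - name: Upload golden artifacts to S3
        uses: seemethere/upload-artifact-s3@v5
        if: ${{ steps.package.outputs.have_zip == 'true' }}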
        with:
          s3-bucket: gha-artifacts
          s3-prefix: |
            ${{ github.repository }}/test-backend-artifacts/golden-artifacts-${{ inputs.backend }}
          retention-days: 90
          if-no-files-found: ignore
          path: golden_artifacts_*.zip

  test-backend-macos:
    if: ${{ inputs.run-macos }}
    strategy:
3 changes: 3 additions & 0 deletions .github/workflows/test-backend-xnnpack.yml
@@ -12,6 +12,9 @@ on:
    paths:
      - .github/workflows/test-backend-xnnpack.yml
      - .github/workflows/_test_backend.yml
      - .ci/scripts/test_backend.sh
      - backends/test/harness/**
      - backends/test/suite/**
  workflow_dispatch:

concurrency:
51 changes: 51 additions & 0 deletions backends/test/harness/tester.py
@@ -3,6 +3,8 @@
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

import logging
import os
import random
from collections import Counter, OrderedDict
from typing import Any, Callable, Dict, List, Optional, Tuple
@@ -317,11 +319,14 @@ def run_method_and_compare_outputs(
        rtol=1e-03,
        qtol=0,
        statistics_callback: Callable[[ErrorStatistics], None] | None = None,
        artifact_dir: Optional[str] = None,
        artifact_name: Optional[str] = None,
    ):
        number_of_runs = 1 if inputs is not None else num_runs
        reference_stage = self.stages[StageType.EXPORT]

        stage = stage or self.cur
        artifacts_saved = False

        for _ in range(number_of_runs):
            inputs_to_run = inputs if inputs else next(self.generate_random_inputs())
@@ -346,8 +351,54 @@
                statistics_callback,
            )

            if artifact_dir and artifact_name and not artifacts_saved:
                try:
                    self._dump_golden_artifacts(
                        artifact_dir,
                        artifact_name,
                        inputs_to_run,
                        reference_output,
                    )
                except Exception:
                    logging.getLogger(__name__).warning(
                        f"Failed to dump golden artifacts for {artifact_name}",
                        exc_info=True,
                    )
                artifacts_saved = True

        return self

    @staticmethod
    def _dump_golden_artifacts(
        artifact_dir: str,
        artifact_name: str,
        inputs: Tuple[torch.Tensor],
        reference_output,
    ):
        logger = logging.getLogger(__name__)
        os.makedirs(artifact_dir, exist_ok=True)
Copilot AI Feb 24, 2026

The artifact directory creation should be done earlier to catch errors during the actual test run rather than silently failing later. Currently, if os.makedirs fails, the exception is caught and logged as a warning, but the test continues. Since this is called after successful output comparison, there's a risk that test results could be marked as successful even though artifact generation failed. Consider whether artifact generation failures should be treated as test failures, or at minimum, ensure that the directory creation happens before the comparison so that filesystem issues are caught early.
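A sketch of the fail-fast variant: create the directory in run_method_and_compare_outputs before the comparison loop, so filesystem errors surface immediately instead of being swallowed by the warning-only handler (placement and behavior change are assumptions, not part of this PR):

    # Hypothetical: before the for-loop over runs.
    if artifact_dir and artifact_name:
        # Raises OSError (permissions, disk full, ...) instead of logging a warning.
        os.makedirs(artifact_dir, exist_ok=True)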

        for i, inp in enumerate(inputs):
            if isinstance(inp, torch.Tensor):
                suffix = "" if len(inputs) == 1 else f"_{i}"
                path = os.path.join(artifact_dir, f"{artifact_name}_input{suffix}.bin")
                inp.contiguous().numpy().tofile(path)
                logger.info(f"Saved golden input to {path}")
Comment on lines +381 to +386

Copilot AI Feb 24, 2026

The loop only saves inputs that are torch.Tensor instances, silently skipping any non-tensor inputs. This could lead to incomplete golden artifact sets if models accept mixed tensor and non-tensor inputs (e.g., integers, floats, booleans). While this might be intentional for simplicity, it should be documented or a warning should be logged when non-tensor inputs are skipped, so that users are aware that the golden artifacts may not fully represent the test case.
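A minimal form of the suggested warning for the input loop, as a sketch (the same pattern would apply to the output loop below):

    # Hypothetical replacement for the input-saving loop.
    for i, inp in enumerate(inputs):
        if not isinstance(inp, torch.Tensor):
            # Surface the gap instead of silently dropping the value.
            logger.warning(
                f"Skipping non-tensor input {i} ({type(inp).__name__}) for "
                f"{artifact_name}; golden artifacts may be incomplete."
            )
            continue
        suffix = "" if len(inputs) == 1 else f"_{i}"
        path = os.path.join(artifact_dir, f"{artifact_name}_input{suffix}.bin")
        inp.contiguous().numpy().tofile(path)
        logger.info(f"Saved golden input to {path}")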

        if isinstance(reference_output, torch.Tensor):
            reference_output = (reference_output,)
        elif isinstance(reference_output, OrderedDict):
            reference_output = tuple(reference_output.values())
Copilot AI Feb 24, 2026

The function does not handle the case where reference_output is already a tuple. According to the existing _compare_outputs method (lines 474-477), the code handles torch.Tensor and OrderedDict, but if reference_output is already a tuple (which is a valid case), it will not be normalized. This could lead to issues if the tuple contains non-tensor elements or needs further processing. Consider adding a check for tuple type or ensuring all possible output types are handled consistently.

Suggested change
-            reference_output = tuple(reference_output.values())
+            reference_output = tuple(reference_output.values())
+        elif isinstance(reference_output, (list, tuple)):
+            reference_output = tuple(reference_output)

        for i, out in enumerate(reference_output):
            if isinstance(out, torch.Tensor):
                suffix = "" if len(reference_output) == 1 else f"_{i}"
                path = os.path.join(
                    artifact_dir, f"{artifact_name}_expected_output{suffix}.bin"
                )
                out.contiguous().numpy().tofile(path)
                logger.info(f"Saved golden output to {path}")
Comment on lines +393 to +400

Copilot AI Feb 24, 2026

Similar to the input handling, the loop only saves outputs that are torch.Tensor instances. If reference_output contains non-tensor elements after being converted to a tuple, those elements will be silently skipped. This could result in incomplete output files. Consider logging a warning when non-tensor outputs are encountered and skipped, as in the input-loop sketch above.

    @staticmethod
    def _assert_outputs_equal(
        model_output,
9 changes: 9 additions & 0 deletions backends/test/suite/conftest.py
@@ -1,3 +1,4 @@
import os
from typing import Any

import pytest
@@ -32,6 +33,13 @@ def __init__(self, flow, test_name, test_base_name):
        self._test_base_name = test_base_name
        self._subtest = 0
        self._results = []
        self._artifact_dir = self._resolve_artifact_dir()

    def _resolve_artifact_dir(self) -> str | None:
        base = os.environ.get("GOLDEN_ARTIFACTS_DIR")
        if not base:
            return None
        return os.path.join(base, self._flow.name)

    def lower_and_run_model(
        self,
@@ -50,6 +58,7 @@ def lower_and_run_model(
            None,
            generate_random_test_inputs=generate_random_test_inputs,
            dynamic_shapes=dynamic_shapes,
            artifact_dir=self._artifact_dir,
        )

        self._subtest += 1
22 changes: 22 additions & 0 deletions backends/test/suite/runner.py
@@ -1,6 +1,8 @@
import argparse
import hashlib
import importlib
import logging
import os
import random
import re
import time
@@ -92,6 +94,7 @@ def run_test( # noqa: C901
    params: dict | None,
    dynamic_shapes: Any | None = None,
    generate_random_test_inputs: bool = True,
    artifact_dir: str | None = None,
) -> TestCaseSummary:
    """
    Top-level test run function for a model, input set, and tester. Handles test execution
@@ -201,6 +204,11 @@ def build_result(
        # We can do this if we ever see to_executorch() or serialize() fail due to a backend issue.
        return build_result(TestResult.UNKNOWN_FAIL, e)

    artifact_name = None
    if artifact_dir:
        base = test_base_name.removeprefix("test_")
        artifact_name = f"{base}_{subtest_index}" if subtest_index > 0 else base

    # TODO We should consider refactoring the tester slightly to return more signal on
    # the cause of a failure in run_method_and_compare_outputs. We can look for
    # AssertionErrors to catch output mismatches, but this might catch more than that.
@@ -210,11 +218,25 @@
                statistics_callback=lambda stats: error_statistics.append(stats),
                atol=1e-1,
                rtol=4e-2,
                artifact_dir=artifact_dir,
                artifact_name=artifact_name,
            )
        except AssertionError as e:
            return build_result(TestResult.OUTPUT_MISMATCH_FAIL, e)
        except Exception as e:
            return build_result(TestResult.PTE_RUN_FAIL, e)

        # Dump .pte after successful comparison.
        if artifact_dir and artifact_name and flow.supports_serialize:
            logger = logging.getLogger(__name__)
            try:
                pte_path = os.path.join(artifact_dir, f"{artifact_name}.pte")
                tester.stages[StageType.SERIALIZE].dump_artifact(pte_path)
                logger.info(f"Saved golden .pte to {pte_path}")
            except Exception:
                logger.warning(
                    f"Failed to save .pte for {artifact_name}", exc_info=True
                )
    else:
        # Skip the test if nothing is delegated
        return build_result(TestResult.SUCCESS_UNDELEGATED)