Add a separate benchmark comparison pipeline for Python and .NET #15
sharpninja merged 2 commits into `main`
Conversation
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com> Agent-Logs-Url: https://github.com/sharpninja/graphrag/sessions/041ee5aa-dc22-44d5-b2e8-5608209d11ac
Pull request overview
Adds a standalone CI workflow and shared Python harness to benchmark Python vs .NET implementations using the existing smoke-test fixtures, producing a single comparison report artifact.
Changes:
- Added a `Benchmark Comparison` GitHub Actions workflow to run Python + .NET benchmarks and publish JSON + markdown report artifacts.
- Added `scripts/benchmark_smoke.py` to discover fixtures, run indexing + queries, and render a consolidated comparison report.
- Added unit tests covering fixture loading, command generation, and report rendering.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `.github/workflows/benchmark-comparison.yml` | New workflow that builds/runs both implementations and uploads benchmark artifacts. |
| `scripts/benchmark_smoke.py` | New benchmark harness for fixture discovery, execution, JSON output, and markdown comparison rendering. |
| `tests/unit/test_benchmark_smoke.py` | Unit tests for key benchmark harness behaviors. |
| `.semversioner/next-release/patch-20260320154154809493.json` | Semversioner entry documenting the patch-level change. |
```python
)
expected_artifacts = tuple(
    artifact
    for workflow in workflow_config.values()
```

`load_fixture_cases()` assumes every `workflow_config` value is a dict and calls `workflow.get(...)`, but `tests/fixtures/azure/config.json` contains non-dict entries (e.g., `"skip_assert": true`). This will raise `AttributeError` when the benchmark workflow loads all fixtures. Filter to dict values (or explicitly skip known non-workflow keys like `azure/skip_assert`) before reading `expected_artifacts`.

```suggestion
for workflow in workflow_config.values()
if isinstance(workflow, dict)
```
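The failure mode and the suggested guard can be reproduced in isolation. The `index` key and artifact name below are invented for illustration; only the `skip_assert` scalar mirrors the real azure fixture:

```python
# Minimal config mixing a workflow dict with a scalar flag, shaped like
# tests/fixtures/azure/config.json's "skip_assert": true entry.
workflow_config = {
    "skip_assert": True,
    "index": {"expected_artifacts": ["community_reports.csv"]},
}

# Unguarded comprehension: the scalar True has no .get(), so this raises
# AttributeError, which is the bug the review comment describes.
try:
    tuple(
        artifact
        for workflow in workflow_config.values()
        for artifact in workflow.get("expected_artifacts", [])
    )
except AttributeError as exc:
    print(f"unguarded loader fails: {exc}")

# Guarded comprehension: scalar flags are skipped before .get() is called.
expected_artifacts = tuple(
    artifact
    for workflow in workflow_config.values()
    if isinstance(workflow, dict)
    for artifact in workflow.get("expected_artifacts", [])
)
print(expected_artifacts)
```

The `isinstance` clause sits between the two `for` clauses so non-dict values are filtered out before the inner loop ever touches them.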
```python
def test_load_fixture_cases_uses_smoke_configs():
    module = load_module()
    repo_root = Path(__file__).resolve().parents[2]

    fixtures = module.load_fixture_cases(repo_root, ["text", "min-csv"])

    assert [fixture.name for fixture in fixtures] == ["min-csv", "text"]
    assert fixtures[0].queries[0].method == "local"
    assert fixtures[1].queries[-1].method == "basic"
    assert "community_reports.csv" in fixtures[1].expected_artifacts
```
Unit tests only exercise the text/min-csv fixtures, so the non-dict `workflow_config` values in the azure fixture (e.g., `skip_assert`) aren't covered, and the current loader bug would slip through. Add a test case that loads the azure fixture (or a minimal config with `skip_assert`) to ensure fixture parsing works for all fixture shapes.
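A test along the lines the reviewer suggests might look like the sketch below. The loader here is an illustrative stand-in, not the real `scripts/benchmark_smoke.py` function, and the config contents are a minimal invented shape that keeps only the azure fixture's `skip_assert` scalar:

```python
import json
import tempfile
from pathlib import Path

def load_expected_artifacts(config_path: Path) -> tuple:
    # Illustrative stand-in for the harness's fixture parsing; only the
    # isinstance guard against scalar config values matters here.
    config = json.loads(config_path.read_text())
    return tuple(
        artifact
        for workflow in config.values()
        if isinstance(workflow, dict)
        for artifact in workflow.get("expected_artifacts", [])
    )

def test_fixture_parsing_tolerates_scalar_flags():
    # Minimal config shaped like tests/fixtures/azure/config.json, which
    # mixes workflow dicts with a scalar "skip_assert" flag.
    with tempfile.TemporaryDirectory() as tmp:
        config = Path(tmp) / "config.json"
        config.write_text(json.dumps({
            "skip_assert": True,
            "index": {"expected_artifacts": ["community_reports.csv"]},
        }))
        assert load_expected_artifacts(config) == ("community_reports.csv",)
```

Pointing the real test at the checked-in azure fixture instead of a temp config would also cover any other non-workflow keys that fixture grows later.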
Description
Adds a dedicated benchmark workflow that runs both implementations against the shared smoke-test fixtures and publishes a single comparison report. The pipeline measures indexing plus all configured query paths so reviewers can see a like-for-like Python vs .NET result set in one place.
Related Issues
Addresses the benchmarking request for:
Proposed Changes
**Benchmark workflow**: `.github/workflows/benchmark-comparison.yml`

**Shared benchmark harness**: `scripts/benchmark_smoke.py`, driven by the existing `tests/fixtures/*/config.json` smoke-test configs

**Comparison report**: `benchmark-comparison.md`, including `stats.json` data when available

**Fixture-driven scope**

**Focused test coverage**: `tests/unit/test_benchmark_smoke.py`

Example of the generated benchmark flow:
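The indexing-plus-queries flow described above can be sketched as follows. This is a hypothetical outline, not the actual `scripts/benchmark_smoke.py` API; every function and field name here is invented for illustration:

```python
import json
import subprocess
import sys
import time
from pathlib import Path

def run_timed(cmd: list) -> dict:
    # Time a single indexing or query command and record its status.
    start = time.perf_counter()
    result = subprocess.run(cmd, capture_output=True, text=True)
    return {
        "cmd": cmd,
        "seconds": round(time.perf_counter() - start, 3),
        "status": "ok" if result.returncode == 0 else "failed",
    }

def benchmark(fixtures: list, out_path: Path) -> list:
    # Run indexing plus every configured query for each fixture and
    # persist the raw results as JSON for the comparison renderer.
    results = []
    for fixture in fixtures:
        results.append({
            "fixture": fixture["name"],
            "index": run_timed(fixture["index_cmd"]),
            "queries": [run_timed(cmd) for cmd in fixture["query_cmds"]],
        })
    out_path.write_text(json.dumps(results, indent=2))
    return results
```

Running this shape once per implementation (Python and .NET command lists per fixture) yields two JSON result sets that a renderer can merge into the single markdown comparison.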
Checklist
Additional Notes
The report is intentionally status-aware, not timing-only. For the current .NET CLI, query timings are recorded, while indexing is reported with its missing expected outputs, so the comparison stays honest and actionable instead of implying feature parity where it does not yet exist.
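As a sketch of what status-aware rendering could mean in practice (the column layout and example values here are invented for illustration, not the actual report format):

```python
def render_comparison(rows: list) -> str:
    # Each cell carries a status alongside (or instead of) a timing, so a
    # stage with missing expected outputs is reported rather than hidden.
    lines = [
        "| Fixture | Stage | Python | .NET |",
        "|---|---|---|---|",
    ]
    for row in rows:
        lines.append(
            "| {fixture} | {stage} | {python} | {dotnet} |".format(**row)
        )
    return "\n".join(lines)

report = render_comparison([
    {"fixture": "min-csv", "stage": "index",
     "python": "12.3s (ok)", "dotnet": "missing expected outputs"},
])
print(report)
```

A timing-only table would force a misleading number (or a blank) into the .NET indexing cell; carrying the status string keeps the gap visible in the rendered markdown.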