
Extract core evaluation logic into reusable helper function #3694

Merged
virajmehta merged 26 commits into main from andrew/refactor-run-evaluation-function
Oct 5, 2025
Conversation

@anndvision
Member

@anndvision anndvision commented Sep 25, 2025

Refactored run_evaluation in evaluations/src/lib.rs by extracting the core evaluation execution logic (lines 148-308) into a new public helper function run_evaluation_core.

The goal is to make a clean interface to the evaluation logic for the embedded Python client.
Desired downstream behavior (#3704):

  # Sync usage
  job = client.experimental_run_evaluation(
      evaluation_name="my_eval",
      dataset_name="my_dataset",
      variant_name="my_variant"
  )

  # Stream results as they complete
  for result in job.results():
      if result["type"] == "success":
          print(f"Completed evaluation for datapoint {result['datapoint']}")
      elif result["type"] == "error":
          print(f"Error: {result['message']}")

  # Get final statistics
  stats = job.summary_stats()
  print(stats)  # {"evaluator_name": {"mean": 0.85, "stderr": 0.02}, ...}

  # Async usage
  job = await client.experimental_run_evaluation(...)
  async for result in job.results():
      # Process streaming results
      pass
  stats = await job.summary_stats()
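The `{"mean": ..., "stderr": ...}` shape shown for `summary_stats()` is consistent with a per-evaluator mean and standard error of the mean. A minimal sketch of that aggregation (hypothetical helper, not the actual client code):

```python
from math import sqrt
from statistics import mean, stdev

def summary_stats(scores_by_evaluator):
    """Per-evaluator mean and standard error of the mean (stdev / sqrt(n))."""
    out = {}
    for name, scores in scores_by_evaluator.items():
        n = len(scores)
        out[name] = {
            "mean": mean(scores),
            # stderr is undefined for a single sample; report 0.0 in that case.
            "stderr": stdev(scores) / sqrt(n) if n > 1 else 0.0,
        }
    return out

stats = summary_stats({"exact_match": [1.0, 0.0, 1.0, 1.0]})
```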

Important

Refactor evaluation logic by extracting core functionality into run_evaluation_core_streaming and updating tests for streaming updates.

  • Refactoring:
    • Extract core evaluation logic from run_evaluation in evaluations/src/lib.rs into run_evaluation_core_streaming.
    • Introduce EvaluationCoreArgs struct to encapsulate evaluation parameters.
  • Functionality:
    • run_evaluation now delegates to run_evaluation_core_streaming for core logic.
    • run_evaluation_core_streaming streams evaluation updates via an mpsc channel.
  • Updates in evaluations/src/stats.rs:
    • Modify EvaluationStats to handle streaming updates.
  • Tests:
    • Update tests in evaluations/tests/tests.rs to reflect refactored logic and streaming updates.
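The streaming flow summarized above — concurrent evaluation tasks sending updates over an mpsc channel while the caller drains them — can be sketched with a Python analogue, where worker threads stand in for the Rust tasks and `queue.Queue` stands in for the channel (all names and update shapes here are illustrative, not the actual implementation):

```python
import queue
import threading

def run_evaluation_core_streaming(datapoints, updates):
    """Evaluate each datapoint concurrently, streaming one update per result."""
    def evaluate(dp):
        # Placeholder for inference plus evaluator scoring of one datapoint.
        return {"type": "success", "datapoint": dp, "score": 1.0}

    def worker(dp):
        try:
            updates.put(evaluate(dp))
        except Exception as exc:
            updates.put({"type": "error", "datapoint": dp, "message": str(exc)})

    threads = [threading.Thread(target=worker, args=(dp,)) for dp in datapoints]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    updates.put(None)  # sentinel: all senders done ("channel closed")

# The caller drains updates as they arrive, like `for result in job.results()`:
updates = queue.Queue()
producer = threading.Thread(
    target=run_evaluation_core_streaming, args=(["dp1", "dp2", "dp3"], updates)
)
producer.start()
results = []
while (update := updates.get()) is not None:
    results.append(update)
producer.join()
```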

This description was created by Ellipsis for f26da1b. You can customize this summary. It will automatically update as commits are pushed.

Contributor

Copilot AI left a comment


Pull Request Overview

This PR extracts core evaluation execution logic from run_evaluation into a new reusable public function run_evaluation_core to provide a cleaner interface for the embedded Python client.

  • Extracted core evaluation logic (configuration resolution, dataset processing, concurrent task execution, result collection) into run_evaluation_core
  • Modified run_evaluation to call the new core function with appropriate parameters
  • Updated parameter passing to use string references and dedicated parameters instead of accessing struct fields


@anndvision anndvision marked this pull request as ready for review September 25, 2025 21:04
Copilot AI review requested due to automatic review settings September 25, 2025 21:04
Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.



Copilot AI review requested due to automatic review settings September 25, 2025 23:04
Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.



@anndvision anndvision marked this pull request as ready for review September 30, 2025 18:39
Copilot AI review requested due to automatic review settings September 30, 2025 18:39
Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.



Member

@virajmehta virajmehta left a comment


I'm going to want some top-level documentation that describes the streaming setup and how this works now. Will be helpful for future maintainers and also me reviewing right now to trace what is happening.

Copilot AI review requested due to automatic review settings October 1, 2025 14:28
Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.



Copilot AI review requested due to automatic review settings October 2, 2025 16:21
Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

evaluations/src/stats.rs:1

  • The comment mentions that RunInfo "receives special serialization handling in EvaluationStats::push" but the actual implementation in the push method shows it serializes the inner RunInfo directly. The comment should clarify that this special handling extracts and serializes the inner RunInfo struct rather than the enum variant wrapper.
use std::{collections::HashMap, io::Write};


@github-actions
Contributor

github-actions bot commented Oct 2, 2025

TensorZero CI Bot Automated Comment

Both container build jobs (gateway and UI) timed out with "DeadlineExceeded: context deadline exceeded" right after starting the docker buildx step. This happened before any actual image layers were logged, which strongly suggests the remote builder timed out while receiving the build context.

In our CI, we build from the repository root (context: ".") and explicitly keep the .git directory via BUILDKIT_CONTEXT_KEEP_GIT_DIR=1. That means the entire repository (including .git history) is shipped to the remote builder. If we also include typical heavy directories like node_modules, target, caches, or any large data/test artifacts, the context can become large enough to hit the remote builder’s upload deadline.

I cannot change GitHub Actions per the constraints, but I’ve added a conservative .dockerignore at the repository root to dramatically reduce the context size while preserving all files needed for both gateway and UI Dockerfiles. The .dockerignore:

  • Excludes common build artifacts and caches (node_modules, target, dist, .next, etc.)
  • Excludes large, non-essential top-level data/artifacts (e.g., /data, /datasets, etc.)
  • Keeps code and manifests intact so gateway and UI builds still have everything they need
  • Leaves .git alone (CI keeps it intentionally); if needed locally, you can uncomment the /.git rule to drop it

After this patch, the build context should be much smaller, which should prevent the remote builder from timing out. If the namespace builder still flakes, re-running the failed jobs typically resolves it since the underlying infra timeout is transient.

Note: No changes were made to workflows as requested.

Warning

I encountered an error while trying to create a follow-up PR: Failed to create follow-up PR using remote https://x-access-token:***@github.com/tensorzero/tensorzero.git: git apply --whitespace=nowarn /tmp/tensorzero-pr-T9X4Th/repo/tensorzero.patch failed: error: corrupt patch at line 61.

The patch I tried to generate is as follows:

diff --git a/.dockerignore b/.dockerignore
new file mode 100644
index 0000000..c0ffee1
--- /dev/null
+++ b/.dockerignore
@@ -0,0 +1,58 @@
+# Reduce Docker build context size for remote builders.
+# Keep source code and manifests; ignore common build outputs, caches, and large artifacts.
+#
+# Note: CI sets BUILDKIT_CONTEXT_KEEP_GIT_DIR=1, so .git may still be included
+# even if ignored here. If you don't need it locally, you can uncomment the
+# line below to exclude it during local builds.
+# /.git
+
+# Node / frontend caches and outputs
+**/node_modules
+**/.pnpm-store
+**/.turbo
+**/.next
+**/dist
+**/build
+**/.cache
+**/.parcel-cache
+**/yarn-error.log
+
+# Rust build outputs
+**/target
+
+# Python caches/venvs
+**/.venv
+**/venv
+**/__pycache__
+**/.pytest_cache
+
+# IDE/OS noise
+**/.idea
+**/.vscode
+**/*.swp
+**/.DS_Store
+**/Thumbs.db
+
+# Large data/artifacts at repo root (service-specific assets remain available)
+/data/
+/datasets/
+/benchmarks/data/
+/examples/data/
+/e2e/artifacts/
+/tmp/
+/logs/
+/uploads/
+
+# Environment files at repo root
+/.env
+/.env.*
+
+# Miscellaneous
+*.log
+*.tmp
+*.bak
+

Copilot AI review requested due to automatic review settings October 2, 2025 18:45
Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.



@github-actions
Contributor

github-actions bot commented Oct 2, 2025

TensorZero CI Bot Automated Comment

The UI test job failed while starting the fixtures stack. Docker Compose only pulled ClickHouse and then the step timed out waiting for the "fixtures" service. Because no containers were actually created, the follow-up steps couldn’t find any logs or containers (e.g., fixtures-clickhouse-1 didn’t exist).

This is very likely caused by the evaluations JSONL output shape change in this PR. We introduced a new EvaluationUpdate::RunInfo message to the streaming path and marked it with #[serde(skip)], which makes it non-(de)serializable. The fixtures bootstrap process parses the evaluations JSON lines (similar to our tests) and expects to be able to deserialize RunInfo. If deserialization fails early, the fixtures container exits immediately, causing docker compose wait fixtures to hang and eventually time out.

Fix: make EvaluationUpdate::RunInfo deserializable again by removing #[serde(skip)] so the untagged enum can match the RunInfo object. This aligns with the tests added in this PR that explicitly handle EvaluationUpdate::RunInfo.

Once applied, the fixtures container should successfully parse the RunInfo line and proceed, allowing the stack to come up and the UI tests to run.
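For context, serde deserializes an untagged enum by trying each variant in order until one matches the JSON shape, and `#[serde(skip)]` removes a variant from that candidate list entirely. A rough Python analogue of that matching behavior (field names and variant shapes here are illustrative, not the actual `EvaluationUpdate` fields):

```python
import json

# Candidate "variants" tried in declaration order, like an untagged enum.
VARIANTS = [
    ("RunInfo", {"evaluation_run_id"}),
    ("Success", {"datapoint", "evaluations"}),
    ("Error", {"datapoint_id", "message"}),
]

def parse_update(line, skip_run_info=False):
    obj = json.loads(line)
    for name, required_keys in VARIANTS:
        if skip_run_info and name == "RunInfo":
            continue  # what #[serde(skip)] effectively does to the variant
        if required_keys <= obj.keys():
            return name, obj
    # With RunInfo skipped, the RunInfo line matches nothing and parsing
    # fails — analogous to the fixtures container exiting early.
    raise ValueError("no variant matched")

run_info_line = '{"evaluation_run_id": "abc123"}'
```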

Warning

I encountered an error while trying to create a follow-up PR: Failed to create follow-up PR using remote https://x-access-token:***@github.com/tensorzero/tensorzero.git: git apply --whitespace=nowarn /tmp/tensorzero-pr-zPVere/repo/tensorzero.patch failed: error: No valid patches in input (allow with "--allow-empty").

The patch I tried to generate is as follows:

*** Begin Patch
*** Update File: evaluations/src/stats.rs
@@
 #[derive(Debug, Deserialize, Serialize)]
 #[serde(untagged)]
 pub enum EvaluationUpdate {
-    // RunInfo is used for internal channel communication only and receives special
-    // serialization handling in EvaluationStats::push (serializes the inner RunInfo
-    // directly, not the enum wrapper). The #[serde(skip)] prevents accidental
-    // serialization of the enum variant itself.
-    #[serde(skip)]
     RunInfo(crate::RunInfo),
     Success(EvaluationInfo),
     Error(EvaluationError),
 }
*** End Patch

Copilot AI review requested due to automatic review settings October 3, 2025 00:32
Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.



@virajmehta virajmehta added this pull request to the merge queue Oct 3, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 3, 2025
Copilot AI review requested due to automatic review settings October 5, 2025 00:32
Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.



@virajmehta virajmehta enabled auto-merge October 5, 2025 14:12
@virajmehta virajmehta added this pull request to the merge queue Oct 5, 2025
@virajmehta virajmehta assigned anndvision and unassigned virajmehta Oct 5, 2025
Merged via the queue into main with commit 5c8b5e6 Oct 5, 2025
30 checks passed
@virajmehta virajmehta deleted the andrew/refactor-run-evaluation-function branch October 5, 2025 14:39
