
Extract core evaluation logic into reusable helper function #3694

Merged
virajmehta merged 26 commits into main from andrew/refactor-run-evaluation-function
Oct 5, 2025
Conversation

@anndvision
Member

@anndvision anndvision commented Sep 25, 2025

Refactored run_evaluation in evaluations/src/lib.rs by extracting the core evaluation execution logic (lines 148-308) into a new public helper function run_evaluation_core.

The goal is to make a clean interface to the evaluation logic for the embedded Python client.
Desired downstream behavior (#3704):

  # Sync usage
  job = client.experimental_run_evaluation(
      evaluation_name="my_eval",
      dataset_name="my_dataset",
      variant_name="my_variant"
  )

  # Stream results as they complete
  for result in job.results():
      if result["type"] == "success":
          print(f"Completed evaluation for datapoint {result['datapoint']}")
      elif result["type"] == "error":
          print(f"Error: {result['message']}")

  # Get final statistics
  stats = job.summary_stats()
  print(stats)  # {"evaluator_name": {"mean": 0.85, "stderr": 0.02}, ...}

  # Async usage
  job = await client.experimental_run_evaluation(...)
  async for result in job.results():
      # Process streaming results
      pass
  stats = await job.summary_stats()
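The `{"mean": ..., "stderr": ...}` shape shown for `summary_stats()` is consistent with a per-evaluator mean and standard error of the mean. A minimal sketch of that aggregation (hypothetical helper, not the actual client code):

```python
from math import sqrt
from statistics import mean, stdev

def summary_stats(scores_by_evaluator):
    """Per-evaluator mean and standard error of the mean (stdev / sqrt(n))."""
    out = {}
    for name, scores in scores_by_evaluator.items():
        n = len(scores)
        out[name] = {
            "mean": mean(scores),
            # stderr is undefined for a single sample; report 0.0 in that case.
            "stderr": stdev(scores) / sqrt(n) if n > 1 else 0.0,
        }
    return out

stats = summary_stats({"exact_match": [1.0, 0.0, 1.0, 1.0]})
```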

Important

Refactor evaluation logic by extracting core functionality into run_evaluation_core_streaming and updating tests for streaming updates.

  • Refactoring:
    • Extract core evaluation logic from run_evaluation in evaluations/src/lib.rs into run_evaluation_core_streaming.
    • Introduce EvaluationCoreArgs struct to encapsulate evaluation parameters.
  • Functionality:
    • run_evaluation now delegates to run_evaluation_core_streaming for core logic.
    • run_evaluation_core_streaming streams evaluation updates via an mpsc channel.
  • Updates in evaluations/src/stats.rs:
    • Modify EvaluationStats to handle streaming updates.
  • Tests:
    • Update tests in evaluations/tests/tests.rs to reflect refactored logic and streaming updates.
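The streaming flow summarized above — concurrent evaluation tasks sending updates over an mpsc channel while the caller drains them — can be sketched with a Python analogue, where worker threads stand in for the Rust tasks and `queue.Queue` stands in for the channel (all names and update shapes here are illustrative, not the actual implementation):

```python
import queue
import threading

def run_evaluation_core_streaming(datapoints, updates):
    """Evaluate each datapoint concurrently, streaming one update per result."""
    def evaluate(dp):
        # Placeholder for inference plus evaluator scoring of one datapoint.
        return {"type": "success", "datapoint": dp, "score": 1.0}

    def worker(dp):
        try:
            updates.put(evaluate(dp))
        except Exception as exc:
            updates.put({"type": "error", "datapoint": dp, "message": str(exc)})

    threads = [threading.Thread(target=worker, args=(dp,)) for dp in datapoints]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    updates.put(None)  # sentinel: all senders done ("channel closed")

# The caller drains updates as they arrive, like `for result in job.results()`:
updates = queue.Queue()
producer = threading.Thread(
    target=run_evaluation_core_streaming, args=(["dp1", "dp2", "dp3"], updates)
)
producer.start()
results = []
while (update := updates.get()) is not None:
    results.append(update)
producer.join()
```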

This description was created by Ellipsis for f26da1b. You can customize this summary. It will automatically update as commits are pushed.

Contributor

Copilot AI left a comment


Pull Request Overview

This PR extracts core evaluation execution logic from run_evaluation into a new reusable public function run_evaluation_core to provide a cleaner interface for the embedded Python client.

  • Extracted core evaluation logic (configuration resolution, dataset processing, concurrent task execution, result collection) into run_evaluation_core
  • Modified run_evaluation to call the new core function with appropriate parameters
  • Updated parameter passing to use string references and dedicated parameters instead of accessing struct fields


@anndvision anndvision marked this pull request as ready for review September 25, 2025 21:04
Copilot AI review requested due to automatic review settings September 25, 2025 21:04
Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.



Copilot AI review requested due to automatic review settings September 25, 2025 23:04
Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.



@anndvision anndvision marked this pull request as ready for review September 30, 2025 18:39
Copilot AI review requested due to automatic review settings September 30, 2025 18:39
Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.



Member

@virajmehta virajmehta left a comment


I'm going to want some top-level documentation that describes the streaming setup and how this works now. Will be helpful for future maintainers and also me reviewing right now to trace what is happening.

Copilot AI review requested due to automatic review settings October 1, 2025 14:28
Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.



Copilot AI review requested due to automatic review settings October 2, 2025 16:21
Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

evaluations/src/stats.rs:1

  • The comment mentions that RunInfo "receives special serialization handling in EvaluationStats::push" but the actual implementation in the push method shows it serializes the inner RunInfo directly. The comment should clarify that this special handling extracts and serializes the inner RunInfo struct rather than the enum variant wrapper.
use std::{collections::HashMap, io::Write};


@github-actions
Contributor

github-actions bot commented Oct 2, 2025

TensorZero CI Bot Automated Comment

Both container build jobs (gateway and UI) timed out with "DeadlineExceeded: context deadline exceeded" right after starting the docker buildx step. This happened before any actual image layers were logged, which strongly suggests the remote builder timed out while receiving the build context.

In our CI, we build from the repository root (context: ".") and explicitly keep the .git directory via BUILDKIT_CONTEXT_KEEP_GIT_DIR=1. That means the entire repository (including .git history) is shipped to the remote builder. If we also include typical heavy directories like node_modules, target, caches, or any large data/test artifacts, the context can become large enough to hit the remote builder’s upload deadline.

I cannot change GitHub Actions per the constraints, but I’ve added a conservative .dockerignore at the repository root to dramatically reduce the context size while preserving all files needed for both gateway and UI Dockerfiles. The .dockerignore:

  • Excludes common build artifacts and caches (node_modules, target, dist, .next, etc.)
  • Excludes large, non-essential top-level data/artifacts (e.g., /data, /datasets, etc.)
  • Keeps code and manifests intact so gateway and UI builds still have everything they need
  • Leaves .git alone (CI keeps it intentionally); if needed locally, you can uncomment the /.git rule to drop it

After this patch, the build context should be much smaller, which should prevent the remote builder from timing out. If the namespace builder still flakes, re-running the failed jobs typically resolves it since the underlying infra timeout is transient.

Note: No changes were made to workflows as requested.

Warning

I encountered an error while trying to create a follow-up PR: Failed to create follow-up PR using remote https://x-access-token:***@github.com/tensorzero/tensorzero.git: git apply --whitespace=nowarn /tmp/tensorzero-pr-T9X4Th/repo/tensorzero.patch failed: error: corrupt patch at line 61.

The patch I tried to generate is as follows:

diff --git a/.dockerignore b/.dockerignore
new file mode 100644
index 0000000..c0ffee1
--- /dev/null
+++ b/.dockerignore
@@ -0,0 +1,58 @@
+# Reduce Docker build context size for remote builders.
+# Keep source code and manifests; ignore common build outputs, caches, and large artifacts.
+#
+# Note: CI sets BUILDKIT_CONTEXT_KEEP_GIT_DIR=1, so .git may still be included
+# even if ignored here. If you don't need it locally, you can uncomment the
+# line below to exclude it during local builds.
+# /.git
+
+# Node / frontend caches and outputs
+**/node_modules
+**/.pnpm-store
+**/.turbo
+**/.next
+**/dist
+**/build
+**/.cache
+**/.parcel-cache
+**/yarn-error.log
+
+# Rust build outputs
+**/target
+
+# Python caches/venvs
+**/.venv
+**/venv
+**/__pycache__
+**/.pytest_cache
+
+# IDE/OS noise
+**/.idea
+**/.vscode
+**/*.swp
+**/.DS_Store
+**/Thumbs.db
+
+# Large data/artifacts at repo root (service-specific assets remain available)
+/data/
+/datasets/
+/benchmarks/data/
+/examples/data/
+/e2e/artifacts/
+/tmp/
+/logs/
+/uploads/
+
+# Environment files at repo root
+/.env
+/.env.*
+
+# Miscellaneous
+*.log
+*.tmp
+*.bak
+

Copilot AI review requested due to automatic review settings October 2, 2025 18:45
Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.



@github-actions
Contributor

github-actions bot commented Oct 2, 2025

TensorZero CI Bot Automated Comment

The UI test job failed while starting the fixtures stack. Docker Compose only pulled ClickHouse and then the step timed out waiting for the "fixtures" service. Because no containers were actually created, the follow-up steps couldn’t find any logs or containers (e.g., fixtures-clickhouse-1 didn’t exist).

This is very likely caused by the evaluations JSONL output shape change in this PR. We introduced a new EvaluationUpdate::RunInfo message to the streaming path and marked it with #[serde(skip)], which makes it non-(de)serializable. The fixtures bootstrap process parses the evaluations JSON lines (similar to our tests) and expects to be able to deserialize RunInfo. If deserialization fails early, the fixtures container exits immediately, causing docker compose wait fixtures to hang and eventually time out.

Fix: make EvaluationUpdate::RunInfo deserializable again by removing #[serde(skip)] so the untagged enum can match the RunInfo object. This aligns with the tests added in this PR that explicitly handle EvaluationUpdate::RunInfo.

Once applied, the fixtures container should successfully parse the RunInfo line and proceed, allowing the stack to come up and the UI tests to run.
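For context, serde deserializes an untagged enum by trying each variant in order until one matches the JSON shape, and `#[serde(skip)]` removes a variant from that candidate list entirely. A rough Python analogue of that matching behavior (field names and variant shapes here are illustrative, not the actual `EvaluationUpdate` fields):

```python
import json

# Candidate "variants" tried in declaration order, like an untagged enum.
VARIANTS = [
    ("RunInfo", {"evaluation_run_id"}),
    ("Success", {"datapoint", "evaluations"}),
    ("Error", {"datapoint_id", "message"}),
]

def parse_update(line, skip_run_info=False):
    obj = json.loads(line)
    for name, required_keys in VARIANTS:
        if skip_run_info and name == "RunInfo":
            continue  # what #[serde(skip)] effectively does to the variant
        if required_keys <= obj.keys():
            return name, obj
    # With RunInfo skipped, the RunInfo line matches nothing and parsing
    # fails — analogous to the fixtures container exiting early.
    raise ValueError("no variant matched")

run_info_line = '{"evaluation_run_id": "abc123"}'
```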

Warning

I encountered an error while trying to create a follow-up PR: Failed to create follow-up PR using remote https://x-access-token:***@github.com/tensorzero/tensorzero.git: git apply --whitespace=nowarn /tmp/tensorzero-pr-zPVere/repo/tensorzero.patch failed: error: No valid patches in input (allow with "--allow-empty").

The patch I tried to generate is as follows:

*** Begin Patch
*** Update File: evaluations/src/stats.rs
@@
 #[derive(Debug, Deserialize, Serialize)]
 #[serde(untagged)]
 pub enum EvaluationUpdate {
-    // RunInfo is used for internal channel communication only and receives special
-    // serialization handling in EvaluationStats::push (serializes the inner RunInfo
-    // directly, not the enum wrapper). The #[serde(skip)] prevents accidental
-    // serialization of the enum variant itself.
-    #[serde(skip)]
     RunInfo(crate::RunInfo),
     Success(EvaluationInfo),
     Error(EvaluationError),
 }
*** End Patch

Copilot AI review requested due to automatic review settings October 3, 2025 00:32
Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.



@virajmehta virajmehta added this pull request to the merge queue Oct 3, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 3, 2025
Copilot AI review requested due to automatic review settings October 5, 2025 00:32
Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.



@virajmehta virajmehta enabled auto-merge October 5, 2025 14:12
@virajmehta virajmehta added this pull request to the merge queue Oct 5, 2025
@virajmehta virajmehta assigned anndvision and unassigned virajmehta Oct 5, 2025
Merged via the queue into main with commit 5c8b5e6 Oct 5, 2025
30 checks passed
@virajmehta virajmehta deleted the andrew/refactor-run-evaluation-function branch October 5, 2025 14:39
