Extract core evaluation logic into reusable helper function #3694

virajmehta merged 26 commits into main from …
Conversation
Pull Request Overview
This PR extracts core evaluation execution logic from run_evaluation into a new reusable public function run_evaluation_core to provide a cleaner interface for the embedded Python client.
- Extracted core evaluation logic (configuration resolution, dataset processing, concurrent task execution, result collection) into `run_evaluation_core`
- Modified `run_evaluation` to call the new core function with appropriate parameters
- Updated parameter passing to use string references and dedicated parameters instead of accessing struct fields
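The extraction pattern described above can be sketched in miniature. This is a hypothetical, std-only illustration of the shape of the refactor, not the actual tensorzero code: the public entry point keeps its signature, while the reusable core takes dedicated parameters (string slices, a count) instead of reaching into the caller's struct.

```rust
/// Hypothetical argument struct; field names are illustrative only.
struct EvaluationArgs {
    evaluation_name: String,
    dataset_name: String,
    concurrency: usize,
}

/// Original entry point: unpacks the struct and delegates to the core.
fn run_evaluation(args: &EvaluationArgs) -> Vec<String> {
    run_evaluation_core(&args.evaluation_name, &args.dataset_name, args.concurrency)
}

/// Reusable core: takes string references and dedicated parameters,
/// so callers (e.g. an embedded client) need not build the full struct.
fn run_evaluation_core(evaluation: &str, dataset: &str, concurrency: usize) -> Vec<String> {
    (0..concurrency)
        .map(|i| format!("{evaluation}/{dataset}/task-{i}"))
        .collect()
}

fn main() {
    let args = EvaluationArgs {
        evaluation_name: "entity_extraction".into(),
        dataset_name: "foo".into(),
        concurrency: 2,
    };
    let results = run_evaluation(&args);
    assert_eq!(results.len(), 2);
    println!("{results:?}");
}
```

The point of the dedicated-parameter signature is that the core no longer depends on the caller's struct layout, which is what makes it callable from a second entry point.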
Pull Request Overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
Pull Request Overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
virajmehta left a comment:
I'm going to want some top-level documentation that describes the streaming setup and how this works now. Will be helpful for future maintainers and also me reviewing right now to trace what is happening.
…efactor-run-evaluation-function
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
Pull Request Overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
evaluations/src/stats.rs:1
- The comment mentions that RunInfo "receives special serialization handling in EvaluationStats::push", but the actual implementation in the `push` method serializes the inner RunInfo directly. The comment should clarify that this special handling extracts and serializes the inner RunInfo struct rather than the enum variant wrapper.
use std::{collections::HashMap, io::Write};
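The "serializes the inner RunInfo, not the enum wrapper" handling discussed in the comment above can be sketched as follows. This is a std-only illustration (hand-rolled JSON instead of serde, so it stays dependency-free); the type and field names echo the PR discussion but are not the actual implementation.

```rust
use std::io::Write;

#[derive(Debug)]
struct RunInfo {
    evaluation_run_id: String,
    num_datapoints: usize,
}

#[derive(Debug)]
enum EvaluationUpdate {
    RunInfo(RunInfo),
    Success(String),
    Error(String),
}

struct EvaluationStats<W: Write> {
    writer: W,
}

impl<W: Write> EvaluationStats<W> {
    fn push(&mut self, update: EvaluationUpdate) -> std::io::Result<()> {
        match update {
            // Special handling: write the inner RunInfo payload directly,
            // not a {"RunInfo": {...}} enum-variant wrapper.
            EvaluationUpdate::RunInfo(info) => writeln!(
                self.writer,
                "{{\"evaluation_run_id\":\"{}\",\"num_datapoints\":{}}}",
                info.evaluation_run_id, info.num_datapoints
            ),
            EvaluationUpdate::Success(line) => writeln!(self.writer, "{line}"),
            EvaluationUpdate::Error(err) => writeln!(self.writer, "{err}"),
        }
    }
}

fn main() -> std::io::Result<()> {
    let mut stats = EvaluationStats { writer: Vec::new() };
    stats.push(EvaluationUpdate::RunInfo(RunInfo {
        evaluation_run_id: "run-123".into(),
        num_datapoints: 2,
    }))?;
    let out = String::from_utf8(stats.writer).unwrap();
    assert!(out.contains("\"num_datapoints\":2"));
    print!("{out}");
    Ok(())
}
```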
TensorZero CI Bot Automated Comment

Both container build jobs (gateway and UI) timed out with "DeadlineExceeded: context deadline exceeded" right after starting the docker buildx step. This happened before any actual image layers were logged, which strongly suggests the remote builder timed out while receiving the build context.

In our CI, we build from the repository root (context: ".") and explicitly keep the .git directory via BUILDKIT_CONTEXT_KEEP_GIT_DIR=1. That means the entire repository (including .git history) is shipped to the remote builder. If we also include typical heavy directories like node_modules, target, caches, or any large data/test artifacts, the context can become large enough to hit the remote builder's upload deadline.

I cannot change GitHub Actions per the constraints, but I've added a conservative .dockerignore at the repository root to dramatically reduce the context size while preserving all files needed for both gateway and UI Dockerfiles. The .dockerignore:
After this patch, the build context should be much smaller, which should prevent the remote builder from timing out. If the namespace builder still flakes, re-running the failed jobs typically resolves it since the underlying infra timeout is transient.

Note: No changes were made to workflows as requested.

Warning: I encountered an error while trying to create a follow-up PR. Failed to create follow-up PR using remote https://x-access-token:***@github.com/tensorzero/tensorzero.git: git apply --whitespace=nowarn /tmp/tensorzero-pr-T9X4Th/repo/tensorzero.patch failed: error: corrupt patch at line 61

The patch I tried to generate is as follows:

```diff
diff --git a/.dockerignore b/.dockerignore
new file mode 100644
index 0000000..c0ffee1
--- /dev/null
+++ b/.dockerignore
@@ -0,0 +1,58 @@
+# Reduce Docker build context size for remote builders.
+# Keep source code and manifests; ignore common build outputs, caches, and large artifacts.
+#
+# Note: CI sets BUILDKIT_CONTEXT_KEEP_GIT_DIR=1, so .git may still be included
+# even if ignored here. If you don't need it locally, you can uncomment the
+# line below to exclude it during local builds.
+# /.git
+
+# Node / frontend caches and outputs
+**/node_modules
+**/.pnpm-store
+**/.turbo
+**/.next
+**/dist
+**/build
+**/.cache
+**/.parcel-cache
+**/yarn-error.log
+
+# Rust build outputs
+**/target
+
+# Python caches/venvs
+**/.venv
+**/venv
+**/__pycache__
+**/.pytest_cache
+
+# IDE/OS noise
+**/.idea
+**/.vscode
+**/*.swp
+**/.DS_Store
+**/Thumbs.db
+
+# Large data/artifacts at repo root (service-specific assets remain available)
+/data/
+/datasets/
+/benchmarks/data/
+/examples/data/
+/e2e/artifacts/
+/tmp/
+/logs/
+/uploads/
+
+# Environment files at repo root
+/.env
+/.env.*
+
+# Miscellaneous
+*.log
+*.tmp
+*.bak
+
```
Pull Request Overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.
TensorZero CI Bot Automated Comment

The UI test job failed while starting the fixtures stack. Docker Compose only pulled ClickHouse and then the step timed out waiting for the "fixtures" service. Because no containers were actually created, the follow-up steps couldn't find any logs or containers (e.g., fixtures-clickhouse-1 didn't exist).

This is very likely caused by the evaluations JSONL output shape change in this PR. We introduced a new EvaluationUpdate::RunInfo message to the streaming path and marked it with #[serde(skip)], which makes it non-(de)serializable. The fixtures bootstrap process parses the evaluations JSON lines (similar to our tests) and expects to be able to deserialize RunInfo. If deserialization fails early, the fixtures container exits immediately, causing docker compose wait fixtures to hang and eventually time out.

Fix: make EvaluationUpdate::RunInfo deserializable again by removing #[serde(skip)] so the untagged enum can match the RunInfo object. This aligns with the tests added in this PR that explicitly handle EvaluationUpdate::RunInfo. Once applied, the fixtures container should successfully parse the RunInfo line and proceed, allowing the stack to come up and the UI tests to run.

Warning: I encountered an error while trying to create a follow-up PR. Failed to create follow-up PR using remote https://x-access-token:***@github.com/tensorzero/tensorzero.git: git apply --whitespace=nowarn /tmp/tensorzero-pr-zPVere/repo/tensorzero.patch failed: error: No valid patches in input (allow with "--allow-empty")

The patch I tried to generate is as follows:

```diff
*** Begin Patch
*** Update File: evaluations/src/stats.rs
@@
 #[derive(Debug, Deserialize, Serialize)]
 #[serde(untagged)]
 pub enum EvaluationUpdate {
-    // RunInfo is used for internal channel communication only and receives special
-    // serialization handling in EvaluationStats::push (serializes the inner RunInfo
-    // directly, not the enum wrapper). The #[serde(skip)] prevents accidental
-    // serialization of the enum variant itself.
-    #[serde(skip)]
     RunInfo(crate::RunInfo),
     Success(EvaluationInfo),
     Error(EvaluationError),
 }
*** End Patch
```
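The failure mode the bot describes above follows from how untagged deserialization works: serde tries each variant in declaration order until one matches, and a skipped variant is never tried. The std-only sketch below (no serde, to keep it dependency-free) models that probe order; `skip_run_info` stands in for the effect of `#[serde(skip)]`, and all names are illustrative.

```rust
#[derive(Debug, PartialEq)]
enum Variant {
    RunInfo,
    Success,
    Error,
}

// Probe variants in declaration order, as an untagged enum does.
// When `skip_run_info` is true, the RunInfo variant is never probed,
// so a RunInfo JSON line can never match and parsing fails (None).
fn parse_update(line: &str, skip_run_info: bool) -> Option<Variant> {
    if !skip_run_info && line.contains("\"evaluation_run_id\"") {
        return Some(Variant::RunInfo);
    }
    if line.contains("\"evaluation_result\"") {
        return Some(Variant::Success);
    }
    if line.contains("\"error\"") {
        return Some(Variant::Error);
    }
    None // no variant matched: this line fails to deserialize
}

fn main() {
    let run_info_line = "{\"evaluation_run_id\":\"run-1\",\"num_datapoints\":2}";
    // With the skip in place, the RunInfo line fails to parse...
    assert_eq!(parse_update(run_info_line, true), None);
    // ...and removing it lets the RunInfo variant match again.
    assert_eq!(parse_update(run_info_line, false), Some(Variant::RunInfo));
    println!("ok");
}
```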
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
Refactored run_evaluation in evaluations/src/lib.rs by extracting the core evaluation execution logic (lines 148-308) into a new public helper function run_evaluation_core.
The goal is to make a clean interface to the evaluation logic for the embedded Python client.
Desired downstream behavior (#3704):
Important
Refactor evaluation logic by extracting core functionality into `run_evaluation_core_streaming` and updating tests for streaming updates.

- Extracted the core logic of `run_evaluation` in `evaluations/src/lib.rs` into `run_evaluation_core_streaming`.
- Added an `EvaluationCoreArgs` struct to encapsulate evaluation parameters.
- `run_evaluation` now delegates to `run_evaluation_core_streaming` for the core logic.
- `run_evaluation_core_streaming` streams evaluation updates via an mpsc channel.
- `evaluations/src/stats.rs`: updated `EvaluationStats` to handle streaming updates.
- Updated `evaluations/tests/tests.rs` to reflect the refactored logic and streaming updates.

This description was created for f26da1b. You can customize this summary. It will automatically update as commits are pushed.
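The streaming shape described above can be sketched with a std-only mpsc channel. This is a minimal illustration under assumed names (the PR uses tokio-style async internals; `run_evaluation_core_streaming` and the update variants below are simplified stand-ins): the core sends updates over the channel, and the non-streaming wrapper drains the receiver into a Vec.

```rust
use std::sync::mpsc;
use std::thread;

#[derive(Debug, PartialEq)]
enum EvaluationUpdate {
    RunInfo(String),
    Success(String),
}

// Core: streams updates over the channel as work completes.
fn run_evaluation_core_streaming(tx: mpsc::Sender<EvaluationUpdate>) {
    tx.send(EvaluationUpdate::RunInfo("run-1".into())).unwrap();
    for i in 0..2 {
        tx.send(EvaluationUpdate::Success(format!("datapoint-{i}")))
            .unwrap();
    }
    // Dropping tx closes the channel, which ends the receiver loop.
}

// Wrapper: delegates to the streaming core and collects all updates.
fn run_evaluation() -> Vec<EvaluationUpdate> {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || run_evaluation_core_streaming(tx));
    let updates: Vec<_> = rx.iter().collect();
    worker.join().unwrap();
    updates
}

fn main() {
    let updates = run_evaluation();
    assert_eq!(updates.len(), 3);
    assert_eq!(updates[0], EvaluationUpdate::RunInfo("run-1".into()));
    println!("collected {} updates", updates.len());
}
```

The design benefit is that a streaming consumer (such as an embedded client) can subscribe to the receiver directly, while the existing CLI path keeps its collect-everything behavior by draining the same channel.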