Add level-aware eval dashboard with agent/model breakdowns #481
Caution: Review failed. Pull request was closed or merged during review.

📝 Walkthrough

This pull request introduces level-aware evaluation dashboards and score breakdown by agent/model. It adds a new

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client
    participant API as API<br/>(Monitor Routes)
    participant Controller as GetGroupedScores<br/>Controller
    participant Service as MonitorScoresService
    participant Repo as ScoreRepository
    participant DB as Database
    Client->>API: GET /scores/breakdown<br/>(monitorName, startTime, endTime, level)
    API->>Controller: Route to GetGroupedScores
    Controller->>Controller: Validate timeRange
    Controller->>Controller: Extract level param
    Controller->>Service: GetGroupedScores(monitorID,<br/>timeRange, level)
    Service->>Repo: GetScoresGroupedByLabel<br/>(monitorID, startTime, endTime, level)
    Repo->>DB: SELECT span_label, evaluator_name,<br/>AVG(score), COUNT(*), skipped_count<br/>GROUP BY span_label, evaluator_name<br/>WHERE monitor_id = ? AND trace_start_time BETWEEN ?
    DB-->>Repo: [LabelAggregation]
    Repo-->>Service: []LabelAggregation
    Service->>Service: Build GroupedScoresResponse<br/>(groups: ScoreLabelGroup[],<br/>each with evaluators)
    Service-->>Controller: GroupedScoresResponse
    Controller->>Controller: Convert to spec<br/>representation
    Controller-->>Client: JSON GroupedScoresResponse
```
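The grouping step in the diagram can be sketched as a plain in-memory aggregation; this is a hedged illustration only, and the row field names (`span_label`, `evaluator_name`, `score`, `skipped`) are assumptions inferred from the diagram, not the actual schema:

```python
from collections import defaultdict

def group_scores(rows):
    """Aggregate score rows per (span_label, evaluator_name), mirroring the
    SQL in the diagram: total count, skipped count, and mean over the
    non-skipped scores (None when everything was skipped)."""
    groups = defaultdict(lambda: {"total": 0, "skipped": 0, "scores": []})
    for row in rows:
        g = groups[(row["span_label"], row["evaluator_name"])]
        g["total"] += 1
        if row.get("skipped"):
            g["skipped"] += 1
        else:
            g["scores"].append(row["score"])
    result = {}
    for key, g in groups.items():
        scores = g["scores"]
        result[key] = {
            "total_count": g["total"],
            "skipped_count": g["skipped"],
            # None (not 0.0) when every evaluation was skipped
            "mean": sum(scores) / len(scores) if scores else None,
        }
    return result
```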
```mermaid
sequenceDiagram
    participant Evaluator as Evaluator<br/>(Base/Builtin)
    participant Result as EvalResult
    participant Score as EvaluatorScore
    participant Runner as Runner
    participant Publisher as Publisher
    Evaluator->>Evaluator: evaluate(trace, task)
    Evaluator->>Result: Create EvalResult<br/>(score, explanation)
    Result-->>Evaluator: EvalResult
    Evaluator->>Score: EvaluatorScore.from_eval_result<br/>(result, trace_id,<br/>span_context)
    Score-->>Evaluator: EvaluatorScore<br/>(with span/task context)
    Evaluator-->>Runner: [EvaluatorScore]
    Runner->>Runner: Enrich with<br/>task_context if present
    Runner->>Publisher: publish([EvaluatorScore])
    Publisher->>Publisher: Build payload<br/>with spanContext,<br/>traceStartTime
    Publisher-->>Publisher: POST scores
```
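The final publish step can be sketched roughly as below. This is a minimal sketch, assuming camel-cased payload keys (`traceId`, `spanContext`, `traceStartTime`) inferred from the diagram; it is not the verified wire format:

```python
def build_score_payload(score):
    """Build a hypothetical POST body for one evaluator score, attaching the
    span context and trace start time when they are present."""
    payload = {
        "traceId": score["trace_id"],
        "score": score["score"],
        "explanation": score.get("explanation"),
    }
    if score.get("trace_start_time"):
        payload["traceStartTime"] = score["trace_start_time"]
    if score.get("span_context"):
        ctx = score["span_context"]
        payload["spanContext"] = {
            "spanId": ctx.get("span_id"),
            "agentName": ctx.get("agent_name"),
            "model": ctx.get("model"),
            "vendor": ctx.get("vendor"),
        }
    return payload
```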
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~40 minutes

Possibly related PRs
Actionable comments posted: 18
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
libs/amp-evaluation/src/amp_evaluation/trace/parser.py (1)
99-113: ⚠️ Potential issue | 🟠 Major: Direct semantic roots are never counted as orphans

Line 99 already filters to semantic kinds, so the Line 112 infrastructure check is unreachable. This misses direct semantic roots (`parentSpanId is None`) and can prevent synthetic-root creation when multiple semantic roots exist.

Suggested fix

```diff
 for span in semantic_spans:
     parent_id = span.parentSpanId
     if parent_id:
         # Walk up to find semantic parent
         final_parent = remap_map.get(parent_id, parent_id)
         if final_parent is None:
             orphans.append(span)
-    elif parent_id is None:
-        # Root span - check if it's infrastructure
-        kind = span.ampAttributes.kind
-        if kind in INFRASTRUCTURE_KINDS:
-            orphans.append(span)
+    else:
+        # Direct semantic root has no semantic parent
+        orphans.append(span)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@libs/amp-evaluation/src/amp_evaluation/trace/parser.py` around lines 99 - 113, The loop currently iterates only semantic_spans (filtered by SEMANTIC_KINDS), so the parentSpanId is None branch that checks INFRASTRUCTURE_KINDS is never reached and direct semantic roots are missed; change the loop to iterate over the original spans list (or iterate both semantic and non-semantic as appropriate) and within the loop: if span.ampAttributes.kind in SEMANTIC_KINDS perform the parent_id/remap_map parent-walk logic (appending to orphans when remap_map resolves to None), and separately if span.parentSpanId is None and span.ampAttributes.kind not in INFRASTRUCTURE_KINDS treat it as an orphan semantic root (append to orphans) so synthetic-root creation will work when multiple semantic roots exist.
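For illustration, the corrected orphan rule from this comment can be sketched as a standalone Python function; the `Span` class, the `SEMANTIC_KINDS` set, and the field names here are simplified stand-ins for the library's actual types, not code from this repository:

```python
from dataclasses import dataclass
from typing import Optional

SEMANTIC_KINDS = {"agent", "llm", "tool"}  # assumed set of semantic kinds

@dataclass
class Span:
    span_id: str
    parent_span_id: Optional[str]
    kind: str

def find_orphans(spans, remap_map):
    """A semantic span is an orphan when it has no parent at all (a direct
    semantic root) or when walking the remap table resolves its parent to
    None, i.e. no semantic ancestor exists."""
    orphans = []
    for span in spans:
        if span.kind not in SEMANTIC_KINDS:
            continue
        parent_id = span.parent_span_id
        if parent_id is None:
            # Direct semantic root: counted so a synthetic root can be
            # created when several semantic roots exist.
            orphans.append(span)
        elif remap_map.get(parent_id, parent_id) is None:
            orphans.append(span)
    return orphans
```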
🧹 Nitpick comments (15)
console/workspaces/pages/eval/src/utils/monitorFormUtils.ts (1)
70-75: Consider avoiding unconditional interval reset on future type switch.

Line 74 forces `intervalMinutes` to `10`, which can clobber user-selected values when toggling monitor type. Consider applying `10` only at initial form defaulting (or preserving the existing interval on toggle).

Possible adjustment

```diff
 if (type === "future") {
   return {
     traceStart: null,
     traceEnd: null,
-    intervalMinutes: 10,
+    intervalMinutes: undefined,
   };
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/utils/monitorFormUtils.ts` around lines 70 - 75, The branch handling type === "future" currently unconditionally sets intervalMinutes to 10 and will clobber a user-selected value; update that block in monitorFormUtils.ts (the type === "future" branch) to preserve an existing intervalMinutes if provided (e.g., use values?.intervalMinutes or a passed-in priorForm value) and only fall back to 10 when no existing interval is present, or move the 10 default to the initial form-defaulting path so toggling to "future" does not overwrite the user's interval.

console/workspaces/pages/eval/src/subComponents/MonitorRunList.tsx (2)
234-241: Use exhaustive status mapping with TypeScript enforcement instead of casting.

The cast on line 334 (`as keyof typeof statusIcons`) bypasses TypeScript's type checking. While all current `MonitorRunStatus` values are covered, if the status type is extended in the future, TypeScript won't warn that `statusIcons` is incomplete. Adopt the exhaustive mapping pattern already used in `MonitorRunDrawer`:

```tsx
const statusIcons: Record<MonitorRunStatus, React.ReactNode> = {
  failed: <CircleAlert size={20} color={palette?.error.main} />,
  success: <CheckCircle size={20} color={palette?.success.main} />,
  running: <Activity size={20} color={palette?.warning.main} />,
  pending: <CircularProgress size={20} />,
};
```

This makes TypeScript verify that all status values are handled, eliminating the cast at line 334.
Also applies to: 334-335
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/MonitorRunList.tsx` around lines 234 - 241, The statusIcons object is currently created with useMemo and later accessed via a cast which defeats TypeScript exhaustiveness checks; replace the useMemo + cast pattern with an explicit typed mapping to enforce all MonitorRunStatus values are handled (e.g. declare statusIcons as Record<MonitorRunStatus, React.ReactNode> and populate keys failed/success/running/pending), remove the "as keyof typeof statusIcons" cast where the icon is selected, and follow the same exhaustive pattern used in MonitorRunDrawer so future MonitorRunStatus additions will cause a compile-time error if not covered.
124-125: Add error handling to the `useRerunMonitor` mutation.

The mutation already invalidates monitor-runs queries on success, triggering a list refresh automatically. However, there is no error callback to surface failures to users. Add an `onError` handler to `useRerunMonitor`, or configure global error handlers on the `QueryClient` via `MutationCache`, to ensure errors are consistently displayed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/MonitorRunList.tsx` around lines 124 - 125, The useRerunMonitor mutation call lacks an onError handler so failures aren't surfaced; update the invocation of useRerunMonitor (and the identical pattern at the other occurrence around where lines 387-395 handle rerun) to pass an onError callback that receives the error and displays it to the user (for example by calling your existing notification/toast function or setting local error state), while keeping the existing onSuccess invalidation logic; alternatively, configure a global MutationCache on the QueryClient to funnel mutation errors to the same user-facing handler so all rerun errors are consistently shown.

console/workspaces/pages/eval/src/subComponents/MonitorRunDrawer.tsx (2)
113-113: Replace hardcoded pixel sizing with token/flex-based sizing.

Fixed pixel offsets (`96px`, `300px`, `80`) make this drawer brittle across viewport sizes. Prefer theme-token/flex sizing for consistent behavior.

As per coding guidelines, "Use theme tokens via the `sx` prop instead of hardcoded colors and spacing values".

♻️ Suggested refactor

```diff
- <Stack spacing={2} height="calc(100vh - 96px)">
+ <Stack spacing={2} sx={{ flex: 1, minHeight: 0 }}>
  ...
- sx={{ height: 80 }}
+ sx={{ height: (theme) => theme.spacing(10) }}
  ...
- maxHeight="calc(100vh - 300px)"
+ maxHeight="100%"
  ...
- <Box sx={{ overflowY: "auto", maxHeight: "calc(100vh - 300px)" }}>
+ <Box sx={{ overflowY: "auto", flex: 1, minHeight: 0 }}>
```

Also applies to: 133-133, 165-171
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/MonitorRunDrawer.tsx` at line 113, The component MonitorRunDrawer uses hardcoded pixel sizes (e.g. the Stack height "calc(100vh - 96px)", the drawer width ~300px, and the 80px value) which makes layout brittle; replace these with theme tokens and flexible layout rules by using the sx prop and theme.spacing or responsive breakpoints (e.g. use sx={{ height: `calc(100vh - ${theme.mixins.toolbar?.minHeight || theme.spacing(12)})` }} or better, use flex: 1 / maxHeight and theme.spacing(token) for gaps) and swap fixed width/height properties for flexbox/grow/shrink or theme.breakpoints-based values in the same MonitorRunDrawer component (look for the Stack element and the drawer/containers around it and the places with 300px/80 values) so the drawer sizes adapt to viewport and theme tokens.
72-78: Avoid silent empty `runId` fallback when fetching logs.

Passing `""` as `runId` can produce avoidable invalid fetches/error states. Prefer guarding the query with a valid `run.id` instead of coercing to an empty string.

♻️ Suggested direction

```diff
- const { data, isLoading, error } = useMonitorRunLogs({
+ const runId = run.id;
+ const { data, isLoading, error } = useMonitorRunLogs({
    orgName,
    projName: projectName,
    agentName,
    monitorName,
-   runId: run.id ?? "",
+   runId: runId ?? "",
  });
+ // If supported by the hook, also gate execution with `enabled: Boolean(runId)`.
```
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/MonitorRunDrawer.tsx` around lines 72 - 78, The current call to useMonitorRunLogs passes run.id ?? "" which sends an empty runId and triggers invalid fetches; change the component to only invoke useMonitorRunLogs when a valid run.id exists (or pass undefined and use the hook's enabled option) — locate the call to useMonitorRunLogs and replace the unconditional call with a guarded invocation (e.g., check run.id truthiness before calling or pass { enabled: !!run.id } into the hook) so that useMonitorRunLogs is not invoked with an empty string runId.

console/workspaces/pages/eval/src/subComponents/MonitorStartStopButton.tsx (1)
64-69: Minor redundancy in null checks.

The check `!orgId || !projectId || !agentId` on line 67 is already covered by `isDisabled` (lines 50-56). Consider simplifying:

♻️ Suggested simplification

```diff
 const handleClick = useCallback(
   (event: MouseEvent<HTMLButtonElement>) => {
     event.stopPropagation();
-    if (isDisabled || !orgId || !projectId || !agentId) {
+    if (isDisabled) {
       return;
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/MonitorStartStopButton.tsx` around lines 64 - 69, The handleClick useCallback currently redundantly checks !orgId || !projectId || !agentId even though those conditions are already encapsulated by the isDisabled variable; simplify the handler by removing the redundant null/undefined checks and only early-return on isDisabled (keep event.stopPropagation() and the existing guards relying on isDisabled), referencing handleClick, isDisabled, orgId, projectId, and agentId to locate the code to edit.

console/workspaces/pages/eval/src/subComponents/ScoreChip.tsx (1)
26-30: Consider using theme tokens for score colors.

The hardcoded hex colors violate the coding guideline to "use theme tokens via the `sx` prop instead of hardcoded colors." Consider using theme palette colors like `success.main`, `warning.main`, and `error.main`:

♻️ Suggested refactor using theme tokens

```diff
-export function scoreColor(score: number): string {
-  if (score >= 0.8) return "#22c55e";
-  if (score >= 0.6) return "#f59e0b";
-  return "#ef4444";
+import type { Theme } from "@wso2/oxygen-ui";
+
+export function scoreColor(score: number, theme: Theme): string {
+  if (score >= 0.8) return theme.palette.success.main;
+  if (score >= 0.6) return theme.palette.warning.main;
+  return theme.palette.error.main;
 }
```

This would require updating the component and callers to pass the theme. Alternatively, if these specific colors are intentional brand/semantic colors for score visualization, document this as an exception.

As per coding guidelines: "Use theme tokens via the `sx` prop instead of hardcoded colors and spacing values"
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/ScoreChip.tsx` around lines 26 - 30, Replace the hardcoded hex values in scoreColor with theme palette tokens: change scoreColor(score: number) to accept the theme (e.g., scoreColor(score: number, theme: Theme)) or return a palette key and update callers accordingly; map scores to theme.palette.success.main, theme.palette.warning.main and theme.palette.error.main (or to the string keys "success.main"/"warning.main"/"error.main" if you prefer to let callers resolve via sx), and update all callers of scoreColor (e.g., in ScoreChip component) to pass the theme or use the returned token in the sx prop; if these hex values are intentionally brand-specific, add a short comment documenting this exception instead.

console/workspaces/pages/eval/src/subComponents/levelConfig.ts (1)
21-28: Consider using theme tokens for level colors.

The hardcoded hex colors (`#7c3aed`, `#3f8cff`, `#22c55e`) work but deviate from the coding guideline to use theme tokens via the `sx` prop. If these level colors are intended to be themeable or respect dark/light mode palette variations, consider defining them in the theme's palette (e.g., `theme.palette.levels.trace`) and referencing them here.

If these are intentionally fixed brand colors for level distinction, this is acceptable as-is.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/levelConfig.ts` around lines 21 - 28, LEVEL_CONFIG currently uses hard-coded hex colors for each EvaluationLevel (trace, agent, llm); change it to reference theme tokens so colors respond to theme/dark-mode. Update LEVEL_CONFIG (and any consumers that set sx) to use keys like theme.palette.levels.trace / .agent / .llm (or similarly named tokens you add to the theme) instead of literal hex strings; add the corresponding entries to your theme palette (e.g., palette.levels = { trace: ..., agent: ..., llm: ... }) so components can read theme.palette.levels.<level> when rendering. Ensure EvaluationLevel keys still match and preserve label/unit values.

console/workspaces/libs/api-client/src/hooks/monitors.ts (1)
328-350: Use `GroupedScoresQueryParams` directly to avoid query-shape drift.

The hook currently redefines the query contract inline. Reusing the shared API type keeps this aligned with backend schema changes and avoids future silent divergence.

♻️ Suggested refactor

```diff
 import {
   type CreateMonitorPathParams,
   type CreateMonitorRequest,
   type DeleteMonitorPathParams,
   type EvaluationLevel,
   type GetMonitorPathParams,
   type GroupedScoresPathParams,
+  type GroupedScoresQueryParams,
   type GroupedScoresResponse,
@@
 export function useGroupedScores(
   params: GroupedScoresPathParams,
-  query: { level: EvaluationLevel; timeRange?: TraceListTimeRange },
+  query: GroupedScoresQueryParams & { timeRange?: TraceListTimeRange },
   options?: { enabled?: boolean },
 ) {
@@
-      const { timeRange, ...rest } = query;
-      let finalQuery = rest as { level: EvaluationLevel; startTime?: string; endTime?: string };
+      const { timeRange, ...rest } = query;
+      let finalQuery: GroupedScoresQueryParams = rest;
       if (timeRange) {
         const { startTime, endTime } = getTimeRange(timeRange);
         finalQuery = { ...finalQuery, startTime, endTime };
       }
       return getGroupedScores(params, finalQuery, getToken);
@@
       !!params.projName &&
       !!params.agentName &&
       !!params.monitorName &&
-      !!query.timeRange,
+      (!!query.timeRange || (!!query.startTime && !!query.endTime)),
   });
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/libs/api-client/src/hooks/monitors.ts` around lines 328 - 350, Replace the ad-hoc inline query shape with the shared API type GroupedScoresQueryParams to prevent drift: change the hook parameter type from the inline "{ level: EvaluationLevel; timeRange?: TraceListTimeRange }" to GroupedScoresQueryParams, update the local finalQuery typing to use GroupedScoresQueryParams (or a compatible extension with startTime/endTime), and pass that typed finalQuery into getGroupedScores; keep the getTimeRange, getGroupedScores, useQuery and queryKey logic intact but reference GroupedScoresQueryParams wherever the query shape is declared or cast.

evaluation-job/test_main.py (1)
123-131: Broaden span-context test helper inputs to cover agent/model propagation.

`_make_evaluator_score` hardcodes `agent_name`/`model`/`vendor` as `None`, so regressions in those fields won't be caught by tests even though the payload now supports them.

🧪 Suggested helper enhancement

```diff
 def _make_evaluator_score(
     trace_id,
     score,
     span_id=None,
+    agent_name=None,
+    model=None,
+    vendor=None,
     explanation=None,
     timestamp=None,
     error=None,
 ):
@@
-    if span_id is not None:
+    if any(v is not None for v in (span_id, agent_name, model, vendor)):
         span_ctx = MagicMock()
         span_ctx.span_id = span_id
-        span_ctx.agent_name = None
-        span_ctx.model = None
-        span_ctx.vendor = None
+        span_ctx.agent_name = agent_name
+        span_ctx.model = model
+        span_ctx.vendor = vendor
         s.span_context = span_ctx
     else:
         s.span_context = None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluation-job/test_main.py` around lines 123 - 131, The test helper _make_evaluator_score currently sets span_ctx.agent_name, span_ctx.model, and span_ctx.vendor to None when span_id is provided, which hides regressions; update _make_evaluator_score (and any helper constructing s.span_context) to accept optional parameters agent_name, model, and vendor (defaulting to None) and assign them to span_ctx.agent_name, span_ctx.model, span_ctx.vendor when creating the MagicMock span context; then update tests that rely on propagation to pass concrete agent/model/vendor values (preserving None defaults for existing callers).

console/workspaces/pages/eval/src/ViewMonitor.Component.tsx (1)
142-146: Avoid implicit return in forEach callback.

Same static analysis flag as in PerformanceByEvaluatorCard: the callback returns the result of `s.add()`. Use a block body to silence the linter.

♻️ Proposed fix

```diff
 const levelsPresent = useMemo(() => {
   const s = new Set<EvaluationLevel>();
-  evaluators.forEach((e) => s.add(e.level));
+  evaluators.forEach((e) => {
+    s.add(e.level);
+  });
   return s;
 }, [evaluators]);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/ViewMonitor.Component.tsx` around lines 142 - 146, The linter flags the implicit return inside the forEach callback in the useMemo that builds levelsPresent; update the callback in the useMemo (the one iterating evaluators and calling s.add) to use a block body (or a for...of loop) so it does not implicitly return s.add(), e.g., change the evaluators.forEach((e) => s.add(e.level)) to a block callback that calls s.add(e.level); and then return the Set as before.

agent-manager-service/services/monitor_scores_service.go (1)
394-397: Consider distinguishing "no data" from "zero score" in grouped results.

When `agg.MeanScore` is nil, the mean defaults to `0.0`. This conflates "evaluator returned a score of 0" with "no scores were computed for this label/evaluator combination." Downstream consumers (UI tables) may want to display these differently (e.g., "N/A" vs "0%").

If the API contract allows, consider making `Mean` a pointer or adding an explicit "has data" flag.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agent-manager-service/services/monitor_scores_service.go` around lines 394 - 397, The current code sets mean := 0.0 when agg.MeanScore is nil which conflates "no data" with a real zero; change the response to preserve nil by making the output Mean a *float64 (or add an explicit boolean like HasMean) and assign Mean = agg.MeanScore (or set HasMean = agg.MeanScore != nil) instead of defaulting to 0.0; update the struct used for grouped results (the response type that contains Mean) and any JSON tags/consumers to handle a null value (or the new flag) so downstream UI can distinguish "N/A" from 0.

console/workspaces/pages/eval/src/subComponents/PerformanceByEvaluatorCard.tsx (1)
166-170: Avoid implicit return in forEach callback.

Static analysis flags that the callback passed to `forEach()` returns a value (the result of `m.set()`). While functionally harmless, this can cause linter noise and confusion. Use a block body or explicit void.

♻️ Proposed fix

```diff
 const levelMap = useMemo(() => {
   const m = new Map<string, EvaluationLevel>();
-  evaluators.forEach((e) => m.set(e.name, e.level));
+  evaluators.forEach((e) => {
+    m.set(e.name, e.level);
+  });
   return m;
 }, [evaluators]);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/PerformanceByEvaluatorCard.tsx` around lines 166 - 170, The forEach callback used in computing levelMap implicitly returns the value from m.set(), causing linter noise; update the callback in the useMemo that builds levelMap so it uses a block body (e.g., evaluators.forEach((e) => { m.set(e.name, e.level); });) or explicitly void the call so the callback returns undefined, keeping the rest of the logic in levelMap, EvaluationLevel, evaluators and m.set unchanged.

libs/amp-evaluation/src/amp_evaluation/models.py (1)
256-284: `from_eval_result()` does not propagate `task_context`

`EvaluatorScore` now carries `task_context`, but the primary factory cannot set it. Consider adding it to keep creation paths symmetric and avoid losing dataset-run context.

Suggested refactor

```diff
 def from_eval_result(
     cls,
     eval_result: "EvalResult",
     trace_id: str,
     trace_start_time: Optional[datetime] = None,
     span_context: Optional["SpanContext"] = None,
+    task_context: Optional["TaskContext"] = None,
 ) -> "EvaluatorScore":
@@
         return cls(
             trace_id=trace_id,
             trace_start_time=trace_start_time,
             skip_reason=eval_result.skip_reason,
             span_context=span_context,
+            task_context=task_context,
         )
     return cls(
         trace_id=trace_id,
         trace_start_time=trace_start_time,
         score=eval_result.score,
         passed=eval_result.passed,
         explanation=eval_result.explanation,
         span_context=span_context,
+        task_context=task_context,
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@libs/amp-evaluation/src/amp_evaluation/models.py` around lines 256 - 284, The from_eval_result classmethod on EvaluatorScore does not accept or propagate task_context, causing EvaluatorScore instances to lose dataset-run context; update the from_eval_result signature to accept an optional task_context parameter (matching the EvaluatorScore.task_context type), and pass that task_context through into the returned cls(...) in both the is_skipped branch and the normal branch (use eval_result.task_context if the caller doesn't provide one) so task_context is preserved when constructing EvaluatorScore from an EvalResult.

agent-manager-service/repositories/evaluation_score_repository.go (1)
289-291: Filter clauses should be applied before GROUP BY.

The `EvaluatorName` and `Level` filters (lines 293-298) are added after the `.Group()` call. While GORM typically builds the query correctly, this ordering can be confusing and may lead to unexpected behavior in some edge cases. Consider moving the filter conditions before the aggregation setup for clarity.

♻️ Suggested reordering for clarity

```diff
 query := r.db.Table("scores s").
-  Select(`
-    mre.evaluator_name,
-    mre.level,
-    COUNT(*) as total_count,
-    COUNT(CASE WHEN s.skip_reason IS NOT NULL THEN 1 END) as skipped_count,
-    AVG(CASE WHEN s.skip_reason IS NULL THEN s.score END) as mean_score
-  `).
   Joins("JOIN monitor_run_evaluators mre ON s.run_evaluator_id = mre.id").
   Where("s.monitor_id = ?", monitorID).
   Where("s.trace_start_time BETWEEN ? AND ?", startTime, endTime)
-if filters.EvaluatorName != "" {
-  query = query.Where("mre.evaluator_name = ?", filters.EvaluatorName)
-}
-if filters.Level != "" {
-  query = query.Where("mre.level = ?", filters.Level)
-}
+if filters.EvaluatorName != "" {
+  query = query.Where("mre.evaluator_name = ?", filters.EvaluatorName)
+}
+if filters.Level != "" {
+  query = query.Where("mre.level = ?", filters.Level)
+}
+query = query.Select(`
+  mre.evaluator_name,
+  mre.level,
+  COUNT(*) as total_count,
+  COUNT(CASE WHEN s.skip_reason IS NOT NULL THEN 1 END) as skipped_count,
+  AVG(CASE WHEN s.skip_reason IS NULL THEN s.score END) as mean_score
+`).
   Group("mre.evaluator_name, mre.level").
   Order("mre.evaluator_name")
-if filters.EvaluatorName != "" {
-  query = query.Where("mre.evaluator_name = ?", filters.EvaluatorName)
-}
-if filters.Level != "" {
-  query = query.Where("mre.level = ?", filters.Level)
-}
 err := query.Find(&results).Error
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agent-manager-service/repositories/evaluation_score_repository.go` around lines 289 - 291, The query applies the EvaluatorName and Level filters after the .Group() call which is confusing; move the filter Where clauses for EvaluatorName and Level to precede the .Group() invocation so the sequence is Where(start/end time) -> Where(EvaluatorName) -> Where(Level) -> Group("mre.evaluator_name, mre.level") -> Order(...). Update the query built around the existing Where("s.trace_start_time BETWEEN ? AND ?", startTime, endTime), the subsequent EvaluatorName/Level Where calls, and the Group(...) / Order(...) chain in the function that constructs this GORM query to reflect this ordering.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@agent-manager-service/db_migrations/006_create_evaluation_scores.go`:
- Line 45: The unique constraint currently defined as uq_run_evaluator UNIQUE
(monitor_run_id, evaluator_name) must be changed to use the per-run instance
identity column instead of evaluator_name: alter the migration that creates the
monitor_run_evaluators table to define uq_run_evaluator as UNIQUE
(monitor_run_id, identifier) (or UNIQUE (monitor_run_id, display_name) if your
schema uses display_name as the per-instance key), ensure the identifier column
is created before the constraint, and update any referenced index/constraint
name usages that assume the old evaluator_name uniqueness; locate
monitor_run_evaluators, the uq_run_evaluator constraint, and the
evaluator_name/identifier/display_name column definitions to make this change.
In `@agent-manager-service/docs/api_v1_openapi.yaml`:
- Around line 9229-9251: Update the LabelEvaluatorSummary.mean schema so it can
be null while remaining required: change the mean property to accept nulls
(e.g., add nullable: true or make its type allow ["number","null"] depending on
OpenAPI version) and keep mean listed under required; target the
LabelEvaluatorSummary.mean definition to ensure skipped-only groups can
represent undefined means as null.
In `@agent-manager-service/models/score.go`:
- Around line 194-200: LabelEvaluatorSummary currently defines Mean as float64
which forces NULL DB means to become 0.0; change the Mean field to *float64 (and
add `json:"mean,omitempty"`) so NULL is preserved, then update the conversion
logic in monitor_scores_service.go (the block that maps
LabelAggregation.MeanScore into LabelEvaluatorSummary, around the existing
conversion that dereferences into 0.0) to assign the pointer value directly (or
pass through the existing *float64) instead of converting NULL to 0.0; ensure
any downstream code that reads LabelEvaluatorSummary.Mean handles nil checks.
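The nil-versus-zero distinction this fix targets can be illustrated with a small Python analogue, using `Optional[float]` in place of Go's `*float64`; `format_mean` is a hypothetical display helper for the consuming UI, not code from this PR:

```python
from typing import Optional

def format_mean(mean: Optional[float]) -> str:
    """Render a mean score for display, keeping 'no data' distinct from a
    genuine zero score."""
    if mean is None:
        return "N/A"  # nothing was scored (e.g. every evaluation was skipped)
    return f"{mean * 100:.0f}%"  # a real score, including an actual zero
```

A `0.0` default would make the first two cases below indistinguishable, which is exactly the conflation the comment warns about.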
In `@agent-manager-service/spec/model_time_series_point.go`:
- Line 29: The generated file model_time_series_point.go was manually edited but
will be overwritten; instead add the clarification about the "mean" key to the
OpenAPI spec so it persists: update the TimeSeriesPoint.aggregations field
description in docs/api_v1_openapi.yaml to include the sentence that the "mean"
key is always present and is a number (0–1) when scored data exists or null when
all evaluations were skipped, then regenerate the client so the comment appears
in the generated TimeSeriesPoint type.
In `@console/workspaces/pages/eval/src/CreateMonitor.Component.tsx`:
- Around line 101-114: The payload construction for CreateMonitorRequest omits
the user-entered description (values.description) so user input is dropped;
update the payload object in CreateMonitor.Component (the const payload:
CreateMonitorRequest block) to include a description property populated from
values.description (e.g., description: values.description?.trim() ?? undefined),
or if CreateMonitorRequest truly does not support description remove the
description field from the form and related form state to avoid collecting
unusable input; ensure you modify the CreateMonitorRequest usage and any form
validation in the same component accordingly.
In `@console/workspaces/pages/eval/src/EditMonitor.Component.tsx`:
- Around line 123-125: The sanitization currently only filters out entries whose
config.value is "****" but not those set to undefined; update the sanitation
logic around sanitizedLlmProviderConfigs (using values.llmProviderConfigs) to
filter out both config.value === "****" and config.value === undefined (or
config.value == null) before constructing the update payload so undefined-secret
entries are not included; if the resulting array is empty set
sanitizedLlmProviderConfigs to undefined so the payload omits the field
entirely.
In `@console/workspaces/pages/eval/src/form/schema.ts`:
- Around line 79-94: The samplingRate schema currently allows 0 but the error
message says "between 1 and 100"; update the refine predicate in samplingRate
(the .refine call on the samplingRate union/transform) to require value >= 1
instead of >= 0 and keep the message consistent; ensure the transform and refine
remain intact (transform that converts empty string/null/undefined to undefined
and the refine that checks Number.isInteger(value) && value >= 1 && value <=
100).
- Around line 63-77: The refine on intervalMinutes currently accepts value === 5
but the error message says "greater than 5 minutes"; update them to match:
either change the predicate in the refine on intervalMinutes (inside the .refine
callback) to require value === undefined || (Number.isInteger(value) && value >
5) if you intend to disallow 5, or change the message to "Interval must be at
least 5 minutes" if you want to allow 5; keep the existing .transform behavior
that converts ""/null/undefined to undefined.
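The two refine predicates described above can be sketched as plain functions (names are hypothetical; the real schema composes them via zod's `.transform`/`.refine`). This sketch assumes the "allow 5, fix the message" option for the interval check:

```typescript
// Hypothetical standalone versions of the refine predicates. Empty input is
// treated as "unset" because the transform maps ""/null/undefined to undefined.

function isValidSamplingRate(value: number | undefined): boolean {
  if (value === undefined) return true; // optional field passes
  return Number.isInteger(value) && value >= 1 && value <= 100; // >= 1, not >= 0
}

// Accepts exactly 5 so the predicate matches an "at least 5 minutes" message.
function isValidIntervalMinutes(value: number | undefined): boolean {
  if (value === undefined) return true;
  return Number.isInteger(value) && value >= 5;
}
```

Keeping the predicate and the error message in one place like this makes the mismatch between "between 1 and 100" and `>= 0` hard to reintroduce.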
In `@console/workspaces/pages/eval/src/index.ts`:
- Around line 51-56: The editMonitor page entry is missing its required path
property which breaks routing; update the editMonitor object (the entry with
component EditMonitorComponent and icon MonitorCheck) to include an appropriate
path string consistent with the other entries — e.g. a static route like
"edit-monitor" or a parameterized route like "edit/:id" if it needs an eval id —
and ensure the chosen path matches navigation and route registration conventions
used elsewhere in this pages map.
In `@console/workspaces/pages/eval/src/subComponents/CreateMonitorForm.tsx`:
- Around line 188-195: The TextField for intervalMinutes in CreateMonitorForm
currently passes event.target.value (a string) into onFieldChange, causing
formData.intervalMinutes to become a string; update the onChange handler in the
TextField with id "intervalMinutes" to parse the input to a number (e.g.,
Number(...) or parseInt) and handle empty string (set undefined/null) before
calling onFieldChange so formData.intervalMinutes remains a number type for API
payloads; update any validation in CreateMonitorForm/onFieldChange accordingly
to accept numeric values only.
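The string-to-number conversion described above can be sketched as a small helper (name and empty-input behavior are assumptions, not the component's actual code):

```typescript
// Hypothetical helper: converts the raw TextField string into the numeric
// value stored in formData.intervalMinutes. Empty input becomes undefined
// rather than "" so the API payload keeps a numeric type.
function parseIntervalMinutes(raw: string): number | undefined {
  const trimmed = raw.trim();
  if (trimmed === "") return undefined;
  const parsed = Number(trimmed);
  return Number.isNaN(parsed) ? undefined : parsed;
}
```

The onChange handler would then call `onFieldChange("intervalMinutes", parseIntervalMinutes(event.target.value))` instead of passing the raw string through.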
In `@console/workspaces/pages/eval/src/subComponents/MonitorRunDrawer.tsx`:
- Around line 139-141: The Alert for failed runs is using severity="warning"
which under-communicates errors; in MonitorRunDrawer update the Alert that
renders when run.errorMessage is present (the JSX referencing run.errorMessage
and the Alert component) to use severity="error" so failed runs are clearly
signalled.
In `@console/workspaces/pages/eval/src/subComponents/MonitorRunList.tsx`:
- Around line 214-228: The IconButton instances in MonitorRunList.tsx are
icon-only and missing accessible names; add an explicit accessible label (e.g.,
aria-label="Refresh runs" or aria-label="Pause run" as appropriate) to each
IconButton usage (the one wrapping CircularProgress/RefreshCcw and the other
occurrence around line ~380) so screen readers can announce the action, and
optionally add a title prop to surface a tooltip for sighted users; keep the
existing onClick/disabled logic intact.
In `@console/workspaces/pages/eval/src/subComponents/MonitorTable.tsx`:
- Line 281: Update the UI copy in MonitorTable.tsx where the extra-evaluators
label is rendered: replace the string literal `{`+${monitor.evaluators.length -
2} more..`}` with the correct punctuation (e.g., `more...` or simply `more`) so
the displayed text uses the intended ellipsis; locate the render that references
`monitor.evaluators.length` in the MonitorTable component and change the text
accordingly.
- Around line 297-343: The IconButton controls for the edit and delete actions
are icon-only and need accessible labels: add an aria-label prop to the edit
IconButton near the Edit icon (the IconButton that calls
navigate(generatePath(...)) with agentId/orgId/projectId/envId/monitorId) and to
the delete IconButton (the IconButton wrapped by Tooltip that calls
addConfirmation and deleteMonitor). Ensure the aria-labels are descriptive
(e.g., "Edit monitor" and "Delete monitor") so screen readers can identify the
buttons while keeping existing onClick behavior unchanged.
- Around line 132-141: The search haystack currently concatenates monitor.name
but omits monitor.displayName, so add monitor.displayName into the array used to
build haystack (alongside monitor.name, monitor.environment, monitor.dataSource,
...monitor.evaluators, monitor.status) and ensure it is included safely (e.g.,
fallback or conditional check) before joining and lowercasing so term matching
will find the visible display name.
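A minimal sketch of the haystack fix, assuming a simplified monitor shape (the real `Monitor` type has more fields):

```typescript
// Hypothetical row shape for illustration only.
interface MonitorRow {
  name: string;
  displayName?: string;
  environment: string;
  dataSource: string;
  evaluators: string[];
  status: string;
}

// Builds the lowercase search haystack, now including displayName with a
// safe fallback so the visible display name is matched by search terms.
function buildHaystack(monitor: MonitorRow): string {
  return [
    monitor.name,
    monitor.displayName ?? "",
    monitor.environment,
    monitor.dataSource,
    ...monitor.evaluators,
    monitor.status,
  ]
    .join(" ")
    .toLowerCase();
}
```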
In `@console/workspaces/pages/eval/src/subComponents/ScoreBreakdownCard.tsx`:
- Around line 47-58: The title lookup using CARD_TITLES[level as "agent" |
"llm"] is unsafe because ScoreBreakdownCard's level can be "trace" and may yield
undefined; fix by narrowing the ScoreBreakdownCard prop type so level is only
"agent" | "llm" (update the ScoreBreakdownCardProps definition) or add a safe
fallback when deriving title (check level against allowed keys before indexing
CARD_TITLES and use a default like "Score Breakdown" when not matched);
reference CARD_TITLES and the ScoreBreakdownCard level prop when making the
change.
- Around line 63-64: The nested forEach callbacks (data.groups.forEach and
g.evaluators.forEach) currently use concise arrow bodies that return the result
of nameSet.add(...), triggering the Biome useIterableCallbackReturn lint rule;
change both callbacks to block bodies with no returned value (e.g., replace the
concise arrow bodies with { nameSet.add(e.evaluatorName); } ) so the callbacks
perform the side-effect only. Ensure you update the callbacks that reference
data.groups, g.evaluators, nameSet.add, and e.evaluatorName to use
statement-style bodies and do not return anything.
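The title-lookup fallback described above can be sketched as follows (the `CARD_TITLES` strings here are placeholders, not the component's actual copy):

```typescript
// Placeholder titles; the real ScoreBreakdownCard defines its own.
const CARD_TITLES = {
  agent: "Scores by Agent",
  llm: "Scores by Model",
} as const;

type EvaluationLevel = "trace" | "agent" | "llm";

// Safe lookup: narrows to the allowed keys before indexing, so a "trace"
// level yields a default title instead of undefined.
function cardTitle(level: EvaluationLevel): string {
  return level === "agent" || level === "llm"
    ? CARD_TITLES[level]
    : "Score Breakdown";
}
```

Narrowing the prop type to `"agent" | "llm"` is the stricter alternative; the runtime fallback above is the lighter-touch fix.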
In `@libs/amp-evaluation/src/amp_evaluation/evaluators/builtin/standard.py`:
- Around line 232-235: The current mismatch explanation constructs raw text
previews (variables output_preview, expected_preview) and sets explanation to
include them; replace that with metadata-safe diagnostics: do not embed raw
output/expected excerpts in the EvalResult.explanation, instead include
non-sensitive attributes such as lengths (len(output), len(expected)), a boolean
exact_match flag, and a short stable hash fingerprint (e.g., first 8 chars of a
SHA256) or delta indicator; update the code that assigns explanation (and any
variables output_preview/expected_preview) in standard.py to produce a message
like "Output mismatch: exact_match=False, output_len=..., expected_len=...,
output_hash=..., expected_hash=..." so persisted explanations contain no raw
content.
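The target file is Python, but the metadata-safe explanation format can be illustrated in a TypeScript sketch (function names are hypothetical; only lengths, an exact-match flag, and short SHA-256 fingerprints are emitted, never raw content):

```typescript
import { createHash } from "node:crypto";

// Short, stable fingerprint: first 8 hex chars of SHA-256.
function fingerprint(text: string): string {
  return createHash("sha256").update(text).digest("hex").slice(0, 8);
}

// Builds a mismatch explanation with no raw output/expected excerpts.
function mismatchExplanation(output: string, expected: string): string {
  return (
    `Output mismatch: exact_match=${output === expected}, ` +
    `output_len=${output.length}, expected_len=${expected.length}, ` +
    `output_hash=${fingerprint(output)}, expected_hash=${fingerprint(expected)}`
  );
}
```

The fingerprint still lets operators tell whether two mismatching runs produced the same output, without persisting potentially sensitive text.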
---
Outside diff comments:
In `@libs/amp-evaluation/src/amp_evaluation/trace/parser.py`:
- Around line 99-113: The loop currently iterates only semantic_spans (filtered
by SEMANTIC_KINDS), so the parentSpanId is None branch that checks
INFRASTRUCTURE_KINDS is never reached and direct semantic roots are missed;
change the loop to iterate over the original spans list (or iterate both
semantic and non-semantic as appropriate) and within the loop: if
span.ampAttributes.kind in SEMANTIC_KINDS perform the parent_id/remap_map
parent-walk logic (appending to orphans when remap_map resolves to None), and
separately if span.parentSpanId is None and span.ampAttributes.kind not in
INFRASTRUCTURE_KINDS treat it as an orphan semantic root (append to orphans) so
synthetic-root creation will work when multiple semantic roots exist.
---
Nitpick comments:
In `@agent-manager-service/repositories/evaluation_score_repository.go`:
- Around line 289-291: The query applies the EvaluatorName and Level filters
after the .Group() call which is confusing; move the filter Where clauses for
EvaluatorName and Level to precede the .Group() invocation so the sequence is
Where(start/end time) -> Where(EvaluatorName) -> Where(Level) ->
Group("mre.evaluator_name, mre.level") -> Order(...). Update the query built
around the existing Where("s.trace_start_time BETWEEN ? AND ?", startTime,
endTime), the subsequent EvaluatorName/Level Where calls, and the Group(...) /
Order(...) chain in the function that constructs this GORM query to reflect this
ordering.
In `@agent-manager-service/services/monitor_scores_service.go`:
- Around line 394-397: The current code sets mean := 0.0 when agg.MeanScore is
nil which conflates "no data" with a real zero; change the response to preserve
nil by making the output Mean a *float64 (or add an explicit boolean like
HasMean) and assign Mean = agg.MeanScore (or set HasMean = agg.MeanScore != nil)
instead of defaulting to 0.0; update the struct used for grouped results (the
response type that contains Mean) and any JSON tags/consumers to handle a null
value (or the new flag) so downstream UI can distinguish "N/A" from 0.
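The service change is in Go, but the consumer-side distinction can be sketched in TypeScript (type and function names are hypothetical), mirroring a nullable `mean` in the JSON payload:

```typescript
// Mean is null (not 0) when every evaluation in the group was skipped,
// mirroring the *float64 / nullable JSON field suggested above.
interface GroupedScore {
  evaluatorName: string;
  mean: number | null;
  count: number;
  skippedCount: number;
}

// UI-side formatting that distinguishes "no data" from a genuine zero score.
function formatMean(mean: number | null): string {
  return mean === null ? "N/A" : mean.toFixed(2);
}
```

With `mean: 0.0` as the default, this distinction is impossible: a group of all-skipped evaluations renders identically to a group that genuinely scored zero.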
In `@console/workspaces/libs/api-client/src/hooks/monitors.ts`:
- Around line 328-350: Replace the ad-hoc inline query shape with the shared API
type GroupedScoresQueryParams to prevent drift: change the hook parameter type
from the inline "{ level: EvaluationLevel; timeRange?: TraceListTimeRange }" to
GroupedScoresQueryParams, update the local finalQuery typing to use
GroupedScoresQueryParams (or a compatible extension with startTime/endTime), and
pass that typed finalQuery into getGroupedScores; keep the getTimeRange,
getGroupedScores, useQuery and queryKey logic intact but reference
GroupedScoresQueryParams wherever the query shape is declared or cast.
In `@console/workspaces/pages/eval/src/subComponents/levelConfig.ts`:
- Around line 21-28: LEVEL_CONFIG currently uses hard-coded hex colors for each
EvaluationLevel (trace, agent, llm); change it to reference theme tokens so
colors respond to theme/dark-mode. Update LEVEL_CONFIG (and any consumers that
set sx) to use keys like theme.palette.levels.trace / .agent / .llm (or
similarly named tokens you add to the theme) instead of literal hex strings; add
the corresponding entries to your theme palette (e.g., palette.levels = { trace:
..., agent: ..., llm: ... }) so components can read theme.palette.levels.<level>
when rendering. Ensure EvaluationLevel keys still match and preserve label/unit
values.
In `@console/workspaces/pages/eval/src/subComponents/MonitorRunDrawer.tsx`:
- Line 113: The component MonitorRunDrawer uses hardcoded pixel sizes (e.g. the
Stack height "calc(100vh - 96px)", the drawer width ~300px, and the 80px value)
which makes layout brittle; replace these with theme tokens and flexible layout
rules by using the sx prop and theme.spacing or responsive breakpoints (e.g. use
sx={{ height: `calc(100vh - ${theme.mixins.toolbar?.minHeight ||
theme.spacing(12)})` }} or better, use flex: 1 / maxHeight and
theme.spacing(token) for gaps) and swap fixed width/height properties for
flexbox/grow/shrink or theme.breakpoints-based values in the same
MonitorRunDrawer component (look for the Stack element and the drawer/containers
around it and the places with 300px/80 values) so the drawer sizes adapt to
viewport and theme tokens.
- Around line 72-78: The current call to useMonitorRunLogs passes run.id ?? ""
which sends an empty runId and triggers invalid fetches; change the component to
only invoke useMonitorRunLogs when a valid run.id exists (or pass undefined and
use the hook's enabled option) — locate the call to useMonitorRunLogs and
replace the unconditional call with a guarded invocation (e.g., check run.id
truthiness before calling or pass { enabled: !!run.id } into the hook) so that
useMonitorRunLogs is not invoked with an empty string runId.
In `@console/workspaces/pages/eval/src/subComponents/MonitorRunList.tsx`:
- Around line 234-241: The statusIcons object is currently created with useMemo
and later accessed via a cast which defeats TypeScript exhaustiveness checks;
replace the useMemo + cast pattern with an explicit typed mapping to enforce all
MonitorRunStatus values are handled (e.g. declare statusIcons as
Record<MonitorRunStatus, React.ReactNode> and populate keys
failed/success/running/pending), remove the "as keyof typeof statusIcons" cast
where the icon is selected, and follow the same exhaustive pattern used in
MonitorRunDrawer so future MonitorRunStatus additions will cause a compile-time
error if not covered.
- Around line 124-125: The useRerunMonitor mutation call lacks an onError
handler so failures aren't surfaced; update the invocation of useRerunMonitor
(and the identical pattern at the other occurrence around where lines 387-395
handle rerun) to pass an onError callback that receives the error and displays
it to the user (for example by calling your existing notification/toast function
or setting local error state), while keeping the existing onSuccess invalidation
logic; alternatively, configure a global MutationCache on the QueryClient to
funnel mutation errors to the same user-facing handler so all rerun errors are
consistently shown.
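The typed status-icon mapping suggested above can be sketched as follows (icon strings stand in for the real `ReactNode` values):

```typescript
type MonitorRunStatus = "failed" | "success" | "running" | "pending";

// Because the map is declared as Record<MonitorRunStatus, ...>, TypeScript
// requires an entry for every status: adding a new member to the union
// without a matching icon becomes a compile-time error, and the
// "as keyof typeof statusIcons" cast at the lookup site is no longer needed.
const statusIcons: Record<MonitorRunStatus, string> = {
  failed: "x",
  success: "check",
  running: "spinner",
  pending: "clock",
};

function iconFor(status: MonitorRunStatus): string {
  return statusIcons[status]; // no cast required
}
```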
In `@console/workspaces/pages/eval/src/subComponents/MonitorStartStopButton.tsx`:
- Around line 64-69: The handleClick useCallback currently redundantly checks
!orgId || !projectId || !agentId even though those conditions are already
encapsulated by the isDisabled variable; simplify the handler by removing the
redundant null/undefined checks and only early-return on isDisabled (keep
event.stopPropagation() and the existing guards relying on isDisabled),
referencing handleClick, isDisabled, orgId, projectId, and agentId to locate the
code to edit.
In `@console/workspaces/pages/eval/src/subComponents/PerformanceByEvaluatorCard.tsx`:
- Around line 166-170: The forEach callback used in computing levelMap
implicitly returns the value from m.set(), causing linter noise; update the
callback in the useMemo that builds levelMap so it uses a block body (e.g.,
evaluators.forEach((e) => { m.set(e.name, e.level); });) or explicitly void the
call so the callback returns undefined, keeping the rest of the logic in
levelMap, EvaluationLevel, evaluators and m.set unchanged.
In `@console/workspaces/pages/eval/src/subComponents/ScoreChip.tsx`:
- Around line 26-30: Replace the hardcoded hex values in scoreColor with theme
palette tokens: change scoreColor(score: number) to accept the theme (e.g.,
scoreColor(score: number, theme: Theme)) or return a palette key and update
callers accordingly; map scores to theme.palette.success.main,
theme.palette.warning.main and theme.palette.error.main (or to the string keys
"success.main"/"warning.main"/"error.main" if you prefer to let callers resolve
via sx), and update all callers of scoreColor (e.g., in ScoreChip component) to
pass the theme or use the returned token in the sx prop; if these hex values are
intentionally brand-specific, add a short comment documenting this exception
instead.
In `@console/workspaces/pages/eval/src/utils/monitorFormUtils.ts`:
- Around line 70-75: The branch handling type === "future" currently
unconditionally sets intervalMinutes to 10 and will clobber a user-selected
value; update that block in monitorFormUtils.ts (the type === "future" branch)
to preserve an existing intervalMinutes if provided (e.g., use
values?.intervalMinutes or a passed-in priorForm value) and only fall back to 10
when no existing interval is present, or move the 10 default to the initial
form-defaulting path so toggling to "future" does not overwrite the user's
interval.
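The preserve-then-default behavior described above can be sketched as a small pure function (name and signature are hypothetical):

```typescript
// Hypothetical reduction of the "future" branch: keep a user-selected
// interval when one exists and only fall back to the 10-minute default
// when nothing was set yet, so toggling type does not clobber user input.
function resolveIntervalMinutes(
  type: string,
  current: number | undefined,
): number | undefined {
  if (type !== "future") return current;
  return current ?? 10;
}
```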
In `@console/workspaces/pages/eval/src/ViewMonitor.Component.tsx`:
- Around line 142-146: The linter flags the implicit return inside the forEach
callback in the useMemo that builds levelsPresent; update the callback in the
useMemo (the one iterating evaluators and calling s.add) to use a block body (or
a for...of loop) so it does not implicitly return s.add(), e.g., change the
evaluators.forEach((e) => s.add(e.level)) to a block callback that calls
s.add(e.level); and then return the Set as before.
In `@evaluation-job/test_main.py`:
- Around line 123-131: The test helper _make_evaluator_score currently sets
span_ctx.agent_name, span_ctx.model, and span_ctx.vendor to None when span_id is
provided, which hides regressions; update _make_evaluator_score (and any helper
constructing s.span_context) to accept optional parameters agent_name, model,
and vendor (defaulting to None) and assign them to span_ctx.agent_name,
span_ctx.model, span_ctx.vendor when creating the MagicMock span context; then
update tests that rely on propagation to pass concrete agent/model/vendor values
(preserving None defaults for existing callers).
In `@libs/amp-evaluation/src/amp_evaluation/models.py`:
- Around line 256-284: The from_eval_result classmethod on EvaluatorScore does
not accept or propagate task_context, causing EvaluatorScore instances to lose
dataset-run context; update the from_eval_result signature to accept an optional
task_context parameter (matching the EvaluatorScore.task_context type), and pass
that task_context through into the returned cls(...) in both the is_skipped
branch and the normal branch (use eval_result.task_context if the caller doesn't
provide one) so task_context is preserved when constructing EvaluatorScore from
an EvalResult.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 33e77ca3-add9-42e6-bdb8-af2b73054e8f
⛔ Files ignored due to path filters (1)
`console/common/config/rush/pnpm-lock.yaml` is excluded by `!**/pnpm-lock.yaml`
📒 Files selected for processing (65)
- agent-manager-service/.gitignore
- agent-manager-service/api/monitor_routes.go
- agent-manager-service/controllers/monitor_scores_controller.go
- agent-manager-service/db_migrations/006_create_evaluation_scores.go
- agent-manager-service/docs/api_v1_openapi.yaml
- agent-manager-service/models/score.go
- agent-manager-service/repositories/evaluation_score_repository.go
- agent-manager-service/services/monitor_scores_service.go
- agent-manager-service/spec/api_default.go
- agent-manager-service/spec/model_evaluation_score_response.go
- agent-manager-service/spec/model_evaluator_score_summary.go
- agent-manager-service/spec/model_grouped_scores_response.go
- agent-manager-service/spec/model_label_evaluator_summary.go
- agent-manager-service/spec/model_publish_aggregate_item.go
- agent-manager-service/spec/model_publish_score_item.go
- agent-manager-service/spec/model_score_label_group.go
- agent-manager-service/spec/model_span_context.go
- agent-manager-service/spec/model_time_series_point.go
- agent-manager-service/tests/monitor_scores_test.go
- agent-manager-service/tests/score_repository_test.go
- agent-manager-service/utils/makeresults.go
- console/workspaces/libs/api-client/src/apis/monitors.ts
- console/workspaces/libs/api-client/src/hooks/monitors.ts
- console/workspaces/libs/types/src/api/monitors.ts
- console/workspaces/pages/eval/src/CreateMonitor.Component.tsx
- console/workspaces/pages/eval/src/EditMonitor.Component.tsx
- console/workspaces/pages/eval/src/EvalMonitors.Component.tsx
- console/workspaces/pages/eval/src/ViewMonitor.Component.tsx
- console/workspaces/pages/eval/src/form/schema.ts
- console/workspaces/pages/eval/src/hooks/useValidatedForm.ts
- console/workspaces/pages/eval/src/index.ts
- console/workspaces/pages/eval/src/subComponents/AgentPerformanceCard.tsx
- console/workspaces/pages/eval/src/subComponents/CreateMonitorForm.tsx
- console/workspaces/pages/eval/src/subComponents/EvaluationSummaryCard.tsx
- console/workspaces/pages/eval/src/subComponents/EvaluatorDetailsDrawer.tsx
- console/workspaces/pages/eval/src/subComponents/EvaluatorLlmProviderSection.tsx
- console/workspaces/pages/eval/src/subComponents/MetricsTooltip.tsx
- console/workspaces/pages/eval/src/subComponents/MonitorFormWizard.tsx
- console/workspaces/pages/eval/src/subComponents/MonitorRunDrawer.tsx
- console/workspaces/pages/eval/src/subComponents/MonitorRunList.tsx
- console/workspaces/pages/eval/src/subComponents/MonitorStartStopButton.tsx
- console/workspaces/pages/eval/src/subComponents/MonitorTable.tsx
- console/workspaces/pages/eval/src/subComponents/PerformanceByEvaluatorCard.tsx
- console/workspaces/pages/eval/src/subComponents/RunSummaryCard.tsx
- console/workspaces/pages/eval/src/subComponents/ScoreBreakdownCard.tsx
- console/workspaces/pages/eval/src/subComponents/ScoreChip.tsx
- console/workspaces/pages/eval/src/subComponents/SelectPresetMonitors.tsx
- console/workspaces/pages/eval/src/subComponents/levelConfig.ts
- console/workspaces/pages/eval/src/utils/monitorFormUtils.ts
- evaluation-job/main.py
- evaluation-job/test_main.py
- libs/amp-evaluation/src/amp_evaluation/__init__.py
- libs/amp-evaluation/src/amp_evaluation/evaluators/base.py
- libs/amp-evaluation/src/amp_evaluation/evaluators/builtin/deepeval.py
- libs/amp-evaluation/src/amp_evaluation/evaluators/builtin/standard.py
- libs/amp-evaluation/src/amp_evaluation/models.py
- libs/amp-evaluation/src/amp_evaluation/runner.py
- libs/amp-evaluation/src/amp_evaluation/trace/fetcher.py
- libs/amp-evaluation/src/amp_evaluation/trace/parser.py
- libs/amp-evaluation/tests/test_base_evaluator_coverage.py
- libs/amp-evaluation/tests/test_eval_result.py
- libs/amp-evaluation/tests/test_evaluators_deepeval_agent.py
- libs/amp-evaluation/tests/test_llm_as_judge.py
- libs/amp-evaluation/tests/test_span_filtering.py
- libs/amp-evaluation/tests/test_trace_parser.py
💤 Files with no reviewable changes (1)
- console/workspaces/pages/eval/src/subComponents/EvaluatorLlmProviderSection.tsx
Actionable comments posted: 3
♻️ Duplicate comments (1)
agent-manager-service/docs/api_v1_openapi.yaml (1)
10119-10123: ⚠️ Potential issue | 🟠 Major

`LabelEvaluatorSummary.mean` should allow null for skipped-only groups.

When a group has only skipped evaluations, mean is undefined. Keeping `mean` as non-nullable forces coercion and can mislead analytics/UI.

Proposed schema fix

```diff
 LabelEvaluatorSummary:
   type: object
   required:
     - evaluatorName
     - mean
     - count
     - skippedCount
   properties:
     evaluatorName:
       type: string
     mean:
       type: number
       format: double
-      description: Mean score (0-1)
+      nullable: true
+      description: Mean score (0-1), null when all evaluations are skipped
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agent-manager-service/docs/api_v1_openapi.yaml` around lines 10119 - 10123, The schema for LabelEvaluatorSummary.mean must allow null for cases with only skipped evaluations; update the OpenAPI definition for the property (LabelEvaluatorSummary.mean) to be nullable by adding nullable: true (or change type to ["number","null"]) while keeping format: double and the description, so consumers can distinguish undefined means; ensure generated clients/validators reflect the nullable change.
🧹 Nitpick comments (7)
console/workspaces/pages/eval/src/form/schema.ts (1)
42-54: Consider removing redundant `.min(1)` checks.

The `.min(1)` validators for `displayName` (line 45) and `name` (line 51) are superseded by the subsequent `.min(3)` checks. An empty string or 1-2 character string will trigger the `.min(3)` error, so the `.min(1)` message is unreachable.

♻️ Suggested simplification

```diff
 displayName: z
   .string()
   .trim()
-  .min(1, "Monitor title is required")
-  .min(3, "Monitor title must be at least 3 characters")
+  .min(3, "Monitor title is required (at least 3 characters)")
   .max(120, "Monitor title must be at most 120 characters"),
 name: z
   .string()
   .trim()
-  .min(1, "Monitor identifier is required")
   .regex(/^[a-z0-9-]+$/, "Use lowercase letters, numbers, and hyphens only")
-  .min(3, "Identifier must be at least 3 characters")
+  .min(3, "Monitor identifier is required (at least 3 characters)")
   .max(60, "Identifier must be at most 60 characters"),
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/form/schema.ts` around lines 42 - 54, Remove the redundant .min(1) validators from the zod schema for displayName and name in this file: in the displayName chain (symbol displayName) drop the .min(1, "Monitor title is required") call and in the name chain (symbol name) drop the .min(1, "Monitor identifier is required") call so the existing .min(3) and other validators (trim, max, regex) remain and continue to enforce validation.

console/workspaces/pages/eval/src/subComponents/MonitorTable.tsx (2)
241-256: Remove unused `envId` from route params.

The route template `/org/:orgId/project/:projectId/agents/:agentId/evaluation/monitor/view/:monitorId` doesn't include `:envId`. The extra param is silently ignored by `generatePath`, but it's unnecessary and may mislead future readers.

Suggested fix

```diff
 navigate(
   generatePath(
     absoluteRouteMap.children.org.children.projects.children
       .agents.children.evaluation.children.monitor.children.view
       .path,
     {
       agentId,
       orgId,
       projectId,
-      envId,
       monitorId: monitor.name,
     },
   ),
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/MonitorTable.tsx` around lines 241 - 256, The route generation call using generatePath inside the onClick handler is passing an unnecessary envId param; remove envId from the params object so only the actual route params (agentId, orgId, projectId, monitorId) are passed (refer to the onClick handler, generatePath call, absoluteRouteMap...monitor.children.view.path, and the monitorId value derived from monitor.name).
297-316: Remove unused `envId` from edit route params.

Same as the view path—the edit route `/org/:orgId/project/:projectId/agents/:agentId/evaluation/monitor/edit/:monitorId` doesn't include `:envId`.

Suggested fix

```diff
 <IconButton
   onClick={() =>
     navigate(
       generatePath(
         absoluteRouteMap.children.org.children.projects
           .children.agents.children.evaluation.children
           .monitor.children.edit.path,
         {
           agentId,
           orgId,
           projectId,
-          envId,
           monitorId: monitor.name,
         },
       ),
     )
   }
 >
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/MonitorTable.tsx` around lines 297 - 316, The edit route call is passing an unused envId param; update the IconButton onClick that calls navigate(generatePath(...)) to remove envId from the params object so it matches the route shape used for edit (monitorId, agentId, orgId, projectId and any other required ids) — locate the IconButton onClick block referencing generatePath and absoluteRouteMap.children.org.children.projects.children.agents.children.evaluation.children.monitor.children.edit.path and delete the envId property from the params passed (keep monitorId: monitor.name and the other ids intact).

console/workspaces/pages/eval/src/subComponents/MonitorRunDrawer.tsx (2)
188-191: Use a collision-safe key for evaluator rows.

`key={ev.identifier}` can collide if identifiers repeat (especially with mixed evaluator levels), which can cause incorrect row reconciliation.

♻️ Proposed refactor

```diff
-{(run.evaluators ?? []).map((ev) => (
+{(run.evaluators ?? []).map((ev, index) => (
   <Box
-    key={ev.identifier}
+    key={`${ev.identifier}-${index}`}
     sx={{
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/MonitorRunDrawer.tsx` around lines 188 - 191, The evaluator list mapping in MonitorRunDrawer uses a non-unique key (key={ev.identifier}) which can collide; update the key used in the (run.evaluators ?? []).map(...) render to a collision-safe composite key that uniquely identifies each row (for example combine evaluator level, identifier and the iteration index or another unique property) so React can reconcile rows correctly; locate the map in MonitorRunDrawer and replace the single ev.identifier key with a composite key using ev.level, ev.identifier and index (or a true unique id if available).
133-133: Replace hardcoded pixel sizing with theme tokens in `sx`.

`height: 80` and `minWidth: 120` hardcode dimensions and make theme consistency harder.

♻️ Proposed refactor

```diff
-sx={{ height: 80 }}
+sx={{ height: (theme) => theme.spacing(10) }}
 ...
-sx={{ minWidth: 120 }}
+sx={{ minWidth: (theme) => theme.spacing(15) }}
```

As per coding guidelines: "Use theme tokens via the sx prop instead of hardcoded colors and spacing values."

Also applies to: 220-220
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/MonitorRunDrawer.tsx` at line 133, In MonitorRunDrawer's JSX (the sx prop where height: 80 and minWidth: 120 are set) replace the hardcoded numeric px values with theme tokens via the sx function callback: use theme spacing or size tokens for height and minWidth (e.g., derive values from theme.spacing(...) or theme.breakpoints/size tokens) so the component follows the design system and scales with the theme; update both occurrences of height: 80 and minWidth: 120 in MonitorRunDrawer.tsx to use theme-based tokens instead of raw numbers.

agent-manager-service/models/score.go (1)
83-83: Unify `level` validation for individual and aggregated score payloads.

`PublishAggregateItem.Level` is constrained to `trace|agent|llm`, but `PublishScoreItem.Level` currently accepts any non-empty string. Tightening this prevents invalid level values from entering the publish flow.

♻️ Suggested change

```diff
 type PublishScoreItem struct {
 	EvaluatorName  string              `json:"evaluatorName" validate:"required"`
-	Level          string              `json:"level" validate:"required"`
+	Level          string              `json:"level" validate:"required,oneof=trace agent llm"`
 	TraceID        string              `json:"traceId" validate:"required"`
 	Score          *float64            `json:"score,omitempty" validate:"omitempty,min=0,max=1"`
 	Explanation    *string             `json:"explanation,omitempty"`
 	TraceStartTime *time.Time          `json:"traceStartTime,omitempty"`
 	SkipReason     *string             `json:"skipReason,omitempty"`
 	SpanContext    *PublishSpanContext `json:"spanContext,omitempty"`
 }
```

Also applies to: 96-96
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agent-manager-service/models/score.go` at line 83, Update the validation tags on the Level fields to enforce the same allowed values across both payloads: change PublishScoreItem.Level and PublishAggregateItem.Level to use validate:"required,oneof=trace agent llm" so only "trace", "agent", or "llm" are accepted; leave the json tags as-is and ensure any downstream validation uses these struct tags.

agent-manager-service/tests/monitor_scores_test.go (1)
101-101: Add direct test coverage for the new `/scores/breakdown` endpoint.

Line 101 wires the handler, but this suite doesn't yet assert validation/behavior for that path (e.g., required `startTime`, `endTime`, and `level` constraints).
Verify each finding against the current code and only fix it if needed. In `@agent-manager-service/tests/monitor_scores_test.go` at line 101, Add direct test cases for the new "/scores/breakdown" route wired by mux.HandleFunc and served by ctrl.GetGroupedScores: create tests that call GET base+"/scores/breakdown" with missing parameters to assert 400 responses for missing required startTime, endTime, and level; test invalid formats for startTime/endTime and out-of-range or unsupported values for level to assert appropriate validation errors; also add a happy-path test providing valid startTime, endTime, and level to assert 200 and validate the response shape (grouped score structure) returned by GetGroupedScores.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@agent-manager-service/db_migrations/005_create_monitors.go`:
- Line 32: Do not modify the historical migration variable migration005 in
agent-manager-service/db_migrations/005_create_monitors.go; instead create a new
forward migration file (next unused ID) named e.g.
db_migrations/00X_add_monitor_description.go that registers a migration
(migration00X) which runs a transactional ALTER TABLE monitors ADD COLUMN IF NOT
EXISTS description VARCHAR(512) NOT NULL DEFAULT ''; ensure the new migration's
ID is unique and the Migrate function uses runSQL(tx, `ALTER TABLE monitors ADD
COLUMN IF NOT EXISTS description VARCHAR(512) NOT NULL DEFAULT ''`) so existing
databases get the new column without changing past migration 005.
In `@agent-manager-service/docs/api_v1_openapi.yaml`:
- Around line 10012-10033: The OpenAPI schema changed
TraceScoresResponse.monitors to use TraceMonitorGroup (which requires
monitorName, evaluators: TraceEvaluatorScore[], and spans: TraceSpanGroup[]) but
the client types still expect the old MonitorTraceGroup/EvaluatorTraceGroup
shape (with monitorId, runId, and nested scores), causing a breaking contract;
either update the client type definitions to match the new schema (replace
MonitorTraceGroup/EvaluatorTraceGroup usages with TraceMonitorGroup and adjust
fields to monitorName, evaluators, spans and their
TraceEvaluatorScore/TraceSpanGroup shapes) or revert the schema change to
restore monitorId/runId/scores; locate and update the client models and any code
referencing TraceScoresResponse.monitors, MonitorTraceGroup,
EvaluatorTraceGroup, monitorId, runId, and scores so the API contract and client
types are consistent.
In `@console/workspaces/pages/eval/src/subComponents/MonitorTable.tsx`:
- Around line 51-60: The getStatusColor function declares its parameter as
MonitorStatus but callers sometimes pass the string "Unknown"; update
getStatusColor to handle "Unknown" explicitly (e.g., add a case returning
"default" or "warning") and/or widen the parameter type to include "Unknown" so
TypeScript is safe; ensure all call sites that may pass a fallback "Unknown"
value use the updated type or map that fallback before calling getStatusColor.
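A widened signature could look like the sketch below; the `MonitorStatus` union members and the returned color strings are assumptions, since the real definitions live in the console codebase.

```typescript
// Hypothetical MonitorStatus union; the real one lives in the console types package.
type MonitorStatus = "Active" | "Paused" | "Failed";

// Widened to accept the "Unknown" fallback that some call sites pass,
// with an explicit case instead of an unsafe cast.
function getStatusColor(status: MonitorStatus | "Unknown"): string {
  switch (status) {
    case "Active":
      return "success";
    case "Paused":
      return "default";
    case "Failed":
      return "error";
    case "Unknown":
      return "warning"; // explicit fallback keeps the switch exhaustive
  }
}
```

Widening the parameter type (rather than mapping "Unknown" away at each call site) keeps the fallback handling in one place.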
---
Duplicate comments:
In `@agent-manager-service/docs/api_v1_openapi.yaml`:
- Around line 10119-10123: The schema for LabelEvaluatorSummary.mean must allow
null for cases with only skipped evaluations; update the OpenAPI definition for
the property (LabelEvaluatorSummary.mean) to be nullable by adding nullable:
true (or change type to ["number","null"]) while keeping format: double and the
description, so consumers can distinguish undefined means; ensure generated
clients/validators reflect the nullable change.
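A minimal sketch of the nullable property, assuming OpenAPI 3.0 syntax and the field names quoted in the comment:

```yaml
LabelEvaluatorSummary:
  type: object
  properties:
    mean:
      type: number
      format: double
      nullable: true
      description: Mean score; null when all evaluations were skipped.
```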
---
Nitpick comments:
In `@agent-manager-service/models/score.go`:
- Line 83: Update the validation tags on the Level fields to enforce the same
allowed values across both payloads: change PublishScoreItem.Level and
PublishAggregateItem.Level to use validate:"required,oneof=trace agent llm" so
only "trace", "agent", or "llm" are accepted; leave the json tags as-is and
ensure any downstream validation uses these struct tags.
In `@agent-manager-service/tests/monitor_scores_test.go`:
- Line 101: Add direct test cases for the new "/scores/breakdown" route wired by
mux.HandleFunc and served by ctrl.GetGroupedScores: create tests that call GET
base+"/scores/breakdown" with missing parameters to assert 400 responses for
missing required startTime, endTime, and level; test invalid formats for
startTime/endTime and out-of-range or unsupported values for level to assert
appropriate validation errors; also add a happy-path test providing valid
startTime, endTime, and level to assert 200 and validate the response shape
(grouped score structure) returned by GetGroupedScores.
In `@console/workspaces/pages/eval/src/form/schema.ts`:
- Around line 42-54: Remove the redundant .min(1) validators from the zod schema
for displayName and name in this file: in the displayName chain (symbol
displayName) drop the .min(1, "Monitor title is required") call and in the name
chain (symbol name) drop the .min(1, "Monitor identifier is required") call so
the existing .min(3) and other validators (trim, max, regex) remain and continue
to enforce validation.
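To illustrate why the `.min(1)` calls are redundant, here is a dependency-free stand-in for a zod-style check chain (not the real zod API): any string failing a min-length-1 check necessarily fails min-length-3, so `.min(3)` alone still rejects empty and too-short input.

```typescript
// Tiny stand-in for a zod-style validator chain, for illustration only.
// A check returns null on success or an error message on failure.
type Check = (s: string) => string | null;

const minLen =
  (n: number, msg: string): Check =>
  (s) => (s.trim().length >= n ? null : msg);

// Runs checks in order and returns the first failure, like a zod chain.
const runChecks = (s: string, checks: Check[]): string | null => {
  for (const c of checks) {
    const err = c(s);
    if (err) return err;
  }
  return null;
};

// With only the .min(3)-equivalent check, the empty string is still rejected.
const nameChecks = [minLen(3, "Monitor identifier must be at least 3 characters")];
```

The only observable difference after removing `.min(1)` is that empty input now surfaces the `.min(3)` message instead of the "required" message.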
In `@console/workspaces/pages/eval/src/subComponents/MonitorRunDrawer.tsx`:
- Around line 188-191: The evaluator list mapping in MonitorRunDrawer uses a
non-unique key (key={ev.identifier}) which can collide; update the key used in
the (run.evaluators ?? []).map(...) render to a collision-safe composite key
that uniquely identifies each row (for example combine evaluator level,
identifier and the iteration index or another unique property) so React can
reconcile rows correctly; locate the map in MonitorRunDrawer and replace the
single ev.identifier key with a composite key using ev.level, ev.identifier and
index (or a true unique id if available).
- Line 133: In MonitorRunDrawer's JSX (the sx prop where height: 80 and
minWidth: 120 are set) replace the hardcoded numeric px values with theme tokens
via the sx function callback: use theme spacing or size tokens for height and
minWidth (e.g., derive values from theme.spacing(...) or theme.breakpoints/size
tokens) so the component follows the design system and scales with the theme;
update both occurrences of height: 80 and minWidth: 120 in MonitorRunDrawer.tsx
to use theme-based tokens instead of raw numbers.
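The collision-safe composite key suggested for the evaluator rows can be sketched as a small helper; `EvaluatorRow` here is a hypothetical slice of the run payload, not the real type.

```typescript
// Hypothetical shape of one evaluator entry from the run payload.
interface EvaluatorRow {
  level: string;
  identifier: string;
}

// Combines level, identifier, and index so duplicate identifiers at
// different levels (or repeated entries) still yield distinct keys.
const rowKey = (ev: EvaluatorRow, index: number): string =>
  `${ev.level}:${ev.identifier}:${index}`;

// In MonitorRunDrawer this would replace key={ev.identifier}:
//   (run.evaluators ?? []).map((ev, i) => <Row key={rowKey(ev, i)} ... />)
```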
In `@console/workspaces/pages/eval/src/subComponents/MonitorTable.tsx`:
- Around line 241-256: The route generation call using generatePath inside the
onClick handler is passing an unnecessary envId param; remove envId from the
params object so only the actual route params (agentId, orgId, projectId,
monitorId) are passed (refer to the onClick handler, generatePath call,
absoluteRouteMap...monitor.children.view.path, and the monitorId value derived
from monitor.name).
- Around line 297-316: The edit route call is passing an unused envId param;
update the IconButton onClick that calls navigate(generatePath(...)) to remove
envId from the params object so it matches the route shape used for edit
(monitorId, agentId, orgId, projectId and any other required ids) — locate the
IconButton onClick block referencing generatePath and
absoluteRouteMap.children.org.children.projects.children.agents.children.evaluation.children.monitor.children.edit.path
and delete the envId property from the params passed (keep monitorId:
monitor.name and the other ids intact).
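Both `generatePath` fixes amount to building a params object without `envId`. A hypothetical helper (names assumed from the comment) makes the intended shape explicit:

```typescript
// Route params the monitor view/edit paths actually declare (assumed).
interface MonitorRouteParams {
  agentId: string;
  orgId: string;
  projectId: string;
  monitorId: string;
}

// Builds the params for generatePath, deliberately omitting envId
// since the route template has no :envId segment.
const toMonitorParams = (
  ctx: { agentId: string; orgId: string; projectId: string; envId?: string },
  monitorName: string,
): MonitorRouteParams => ({
  agentId: ctx.agentId,
  orgId: ctx.orgId,
  projectId: ctx.projectId,
  monitorId: monitorName, // monitorId is derived from monitor.name
});
```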
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a5ec79cf-f6f4-49ef-b0f9-34b0e1f0d5e7
⛔ Files ignored due to path filters (1)
console/common/config/rush/pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (23)
- agent-manager-service/api/monitor_routes.go
- agent-manager-service/controllers/evaluator_controller.go
- agent-manager-service/controllers/monitor_controller.go
- agent-manager-service/controllers/monitor_scores_controller.go
- agent-manager-service/controllers/monitor_scores_publisher_controller.go
- agent-manager-service/db_migrations/005_create_monitors.go
- agent-manager-service/docs/api_v1_openapi.yaml
- agent-manager-service/models/monitor.go
- agent-manager-service/models/score.go
- agent-manager-service/services/monitor_manager.go
- agent-manager-service/spec/api_default.go
- agent-manager-service/spec/model_create_monitor_request.go
- agent-manager-service/spec/model_monitor_response.go
- agent-manager-service/tests/monitor_scores_test.go
- agent-manager-service/utils/constants.go
- agent-manager-service/utils/makeresults.go
- console/workspaces/libs/api-client/src/apis/monitors.ts
- console/workspaces/libs/types/src/api/monitors.ts
- console/workspaces/pages/eval/src/CreateMonitor.Component.tsx
- console/workspaces/pages/eval/src/form/schema.ts
- console/workspaces/pages/eval/src/subComponents/MonitorRunDrawer.tsx
- console/workspaces/pages/eval/src/subComponents/MonitorTable.tsx
- console/workspaces/pages/eval/src/subComponents/ScoreBreakdownCard.tsx
🚧 Files skipped from review as they are similar to previous changes (3)
- console/workspaces/libs/api-client/src/apis/monitors.ts
- agent-manager-service/api/monitor_routes.go
- console/workspaces/pages/eval/src/CreateMonitor.Component.tsx
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@agent-manager-service/spec/model_trace_scores_response.go`:
- Line 24: Rewrite ConvertToTraceScoresResponse in
agent-manager-service/utils/makeresults.go so it constructs
spec.TraceMonitorGroup instances (not the removed
spec.MonitorTraceGroup/spec.EvaluatorTraceGroup types): iterate your existing
model monitors and for each create a spec.TraceMonitorGroup, omit MonitorId and
RunId (they no longer exist), convert the model's evaluator entries into
spec.TraceEvaluatorScore items (map fields accordingly to the new
TraceEvaluatorScore shape), and populate a new Spans field by mapping model span
data into spec.TraceSpanGroup slices; ensure all references to
spec.MonitorTraceGroup and spec.EvaluatorTraceGroup are removed and replaced
with spec.TraceMonitorGroup, spec.TraceEvaluatorScore and spec.TraceSpanGroup
mappings inside ConvertToTraceScoresResponse.
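For reference when aligning the client types mentioned earlier, the new response shape can be sketched in TypeScript terms; field details beyond the names quoted in the review are assumptions, not the generated types.

```typescript
// Assumed shapes based on the schema names in the review.
interface TraceEvaluatorScore {
  evaluatorName: string;
  score: number | null; // null when the evaluation was skipped
}

interface TraceSpanGroup {
  spanLabel: string;
  evaluators: TraceEvaluatorScore[];
}

interface TraceMonitorGroup {
  monitorName: string; // monitorId and runId no longer exist here
  evaluators: TraceEvaluatorScore[];
  spans: TraceSpanGroup[];
}
```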
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2ab0b20d-f712-44c7-b29b-680d219aa838
📒 Files selected for processing (7)
- agent-manager-service/spec/model_evaluator_trace_group.go
- agent-manager-service/spec/model_monitor_trace_group.go
- agent-manager-service/spec/model_score_item.go
- agent-manager-service/spec/model_trace_evaluator_score.go
- agent-manager-service/spec/model_trace_monitor_group.go
- agent-manager-service/spec/model_trace_scores_response.go
- agent-manager-service/spec/model_trace_span_group.go
💤 Files with no reviewable changes (3)
- agent-manager-service/spec/model_monitor_trace_group.go
- agent-manager-service/spec/model_evaluator_trace_group.go
- agent-manager-service/spec/model_score_item.go
Force-pushed f57d083 to 8d48211
Summary
Test plan
make dev-test passes for agent-manager-service
Resolves #470
Summary by CodeRabbit
New Features
API Updates
displayName → evaluatorName, traceTimestamp → traceStartTime for consistency.