Add level-aware eval dashboard with agent/model breakdowns #481
Caution: Review failed. Pull request was closed or merged during review.

📝 Walkthrough

This pull request introduces level-aware evaluation dashboards and score breakdown by agent/model. It adds a new

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client
    participant API as API<br/>(Monitor Routes)
    participant Controller as GetGroupedScores<br/>Controller
    participant Service as MonitorScoresService
    participant Repo as ScoreRepository
    participant DB as Database
    Client->>API: GET /scores/breakdown<br/>(monitorName, startTime, endTime, level)
    API->>Controller: Route to GetGroupedScores
    Controller->>Controller: Validate timeRange
    Controller->>Controller: Extract level param
    Controller->>Service: GetGroupedScores(monitorID,<br/>timeRange, level)
    Service->>Repo: GetScoresGroupedByLabel<br/>(monitorID, startTime, endTime, level)
    Repo->>DB: SELECT span_label, evaluator_name,<br/>AVG(score), COUNT(*), skipped_count<br/>GROUP BY span_label, evaluator_name<br/>WHERE monitor_id = ? AND trace_start_time BETWEEN ?
    DB-->>Repo: [LabelAggregation]
    Repo-->>Service: []LabelAggregation
    Service->>Service: Build GroupedScoresResponse<br/>(groups: ScoreLabelGroup[],<br/>each with evaluators)
    Service-->>Controller: GroupedScoresResponse
    Controller->>Controller: Convert to spec<br/>representation
    Controller-->>Client: JSON GroupedScoresResponse
```
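The grouping step in the diagram can be sketched as a plain in-memory aggregation; this is a hedged illustration only, and the row field names (`span_label`, `evaluator_name`, `score`, `skipped`) are assumptions inferred from the diagram, not the actual schema:

```python
from collections import defaultdict

def group_scores(rows):
    """Aggregate score rows per (span_label, evaluator_name), mirroring the
    SQL in the diagram: total count, skipped count, and mean over the
    non-skipped scores (None when everything was skipped)."""
    groups = defaultdict(lambda: {"total": 0, "skipped": 0, "scores": []})
    for row in rows:
        g = groups[(row["span_label"], row["evaluator_name"])]
        g["total"] += 1
        if row.get("skipped"):
            g["skipped"] += 1
        else:
            g["scores"].append(row["score"])
    result = {}
    for key, g in groups.items():
        scores = g["scores"]
        result[key] = {
            "total_count": g["total"],
            "skipped_count": g["skipped"],
            # None (not 0.0) when every evaluation was skipped
            "mean": sum(scores) / len(scores) if scores else None,
        }
    return result
```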
```mermaid
sequenceDiagram
    participant Evaluator as Evaluator<br/>(Base/Builtin)
    participant Result as EvalResult
    participant Score as EvaluatorScore
    participant Runner as Runner
    participant Publisher as Publisher
    Evaluator->>Evaluator: evaluate(trace, task)
    Evaluator->>Result: Create EvalResult<br/>(score, explanation)
    Result-->>Evaluator: EvalResult
    Evaluator->>Score: EvaluatorScore.from_eval_result<br/>(result, trace_id,<br/>span_context)
    Score-->>Evaluator: EvaluatorScore<br/>(with span/task context)
    Evaluator-->>Runner: [EvaluatorScore]
    Runner->>Runner: Enrich with<br/>task_context if present
    Runner->>Publisher: publish([EvaluatorScore])
    Publisher->>Publisher: Build payload<br/>with spanContext,<br/>traceStartTime
    Publisher-->>Publisher: POST scores
```
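The final publish step can be sketched roughly as below. This is a minimal sketch, assuming camel-cased payload keys (`traceId`, `spanContext`, `traceStartTime`) inferred from the diagram; it is not the verified wire format:

```python
def build_score_payload(score):
    """Build a hypothetical POST body for one evaluator score, attaching the
    span context and trace start time when they are present."""
    payload = {
        "traceId": score["trace_id"],
        "score": score["score"],
        "explanation": score.get("explanation"),
    }
    if score.get("trace_start_time"):
        payload["traceStartTime"] = score["trace_start_time"]
    if score.get("span_context"):
        ctx = score["span_context"]
        payload["spanContext"] = {
            "spanId": ctx.get("span_id"),
            "agentName": ctx.get("agent_name"),
            "model": ctx.get("model"),
            "vendor": ctx.get("vendor"),
        }
    return payload
```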
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~40 minutes

Possibly related PRs
Actionable comments posted: 18
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
libs/amp-evaluation/src/amp_evaluation/trace/parser.py (1)
99-113: ⚠️ Potential issue | 🟠 Major: Direct semantic roots are never counted as orphans

Line 99 already filters to semantic kinds, so the Line 112 infrastructure check is unreachable. This misses direct semantic roots (`parentSpanId is None`) and can prevent synthetic-root creation when multiple semantic roots exist.

Suggested fix

```diff
 for span in semantic_spans:
     parent_id = span.parentSpanId
     if parent_id:
         # Walk up to find semantic parent
         final_parent = remap_map.get(parent_id, parent_id)
         if final_parent is None:
             orphans.append(span)
-    elif parent_id is None:
-        # Root span - check if it's infrastructure
-        kind = span.ampAttributes.kind
-        if kind in INFRASTRUCTURE_KINDS:
-            orphans.append(span)
+    else:
+        # Direct semantic root has no semantic parent
+        orphans.append(span)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@libs/amp-evaluation/src/amp_evaluation/trace/parser.py` around lines 99 - 113, The loop currently iterates only semantic_spans (filtered by SEMANTIC_KINDS), so the parentSpanId is None branch that checks INFRASTRUCTURE_KINDS is never reached and direct semantic roots are missed; change the loop to iterate over the original spans list (or iterate both semantic and non-semantic as appropriate) and within the loop: if span.ampAttributes.kind in SEMANTIC_KINDS perform the parent_id/remap_map parent-walk logic (appending to orphans when remap_map resolves to None), and separately if span.parentSpanId is None and span.ampAttributes.kind not in INFRASTRUCTURE_KINDS treat it as an orphan semantic root (append to orphans) so synthetic-root creation will work when multiple semantic roots exist.
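For illustration, the corrected orphan rule from this comment can be sketched as a standalone Python function; the `Span` class, the `SEMANTIC_KINDS` set, and the field names here are simplified stand-ins for the library's actual types, not code from this repository:

```python
from dataclasses import dataclass
from typing import Optional

SEMANTIC_KINDS = {"agent", "llm", "tool"}  # assumed set of semantic kinds

@dataclass
class Span:
    span_id: str
    parent_span_id: Optional[str]
    kind: str

def find_orphans(spans, remap_map):
    """A semantic span is an orphan when it has no parent at all (a direct
    semantic root) or when walking the remap table resolves its parent to
    None, i.e. no semantic ancestor exists."""
    orphans = []
    for span in spans:
        if span.kind not in SEMANTIC_KINDS:
            continue
        parent_id = span.parent_span_id
        if parent_id is None:
            # Direct semantic root: counted so a synthetic root can be
            # created when several semantic roots exist.
            orphans.append(span)
        elif remap_map.get(parent_id, parent_id) is None:
            orphans.append(span)
    return orphans
```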
🧹 Nitpick comments (15)
console/workspaces/pages/eval/src/utils/monitorFormUtils.ts (1)
70-75: Consider avoiding unconditional interval reset on future type switch.

Line 74 forces `intervalMinutes` to `10`, which can clobber user-selected values when toggling monitor type. Consider applying `10` only at initial form defaulting (or preserving the existing interval on toggle).

Possible adjustment

```diff
 if (type === "future") {
   return {
     traceStart: null,
     traceEnd: null,
-    intervalMinutes: 10,
+    intervalMinutes: undefined,
   };
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/utils/monitorFormUtils.ts` around lines 70 - 75, The branch handling type === "future" currently unconditionally sets intervalMinutes to 10 and will clobber a user-selected value; update that block in monitorFormUtils.ts (the type === "future" branch) to preserve an existing intervalMinutes if provided (e.g., use values?.intervalMinutes or a passed-in priorForm value) and only fall back to 10 when no existing interval is present, or move the 10 default to the initial form-defaulting path so toggling to "future" does not overwrite the user's interval.

console/workspaces/pages/eval/src/subComponents/MonitorRunList.tsx (2)
234-241: Use exhaustive status mapping with TypeScript enforcement instead of casting.

The cast on line 334 (`as keyof typeof statusIcons`) bypasses TypeScript's type checking. While all current `MonitorRunStatus` values are covered, if the status type is extended in the future, TypeScript won't warn that `statusIcons` is incomplete. Adopt the exhaustive mapping pattern already used in `MonitorRunDrawer`:

```tsx
const statusIcons: Record<MonitorRunStatus, React.ReactNode> = {
  failed: <CircleAlert size={20} color={palette?.error.main} />,
  success: <CheckCircle size={20} color={palette?.success.main} />,
  running: <Activity size={20} color={palette?.warning.main} />,
  pending: <CircularProgress size={20} />,
};
```

This makes TypeScript verify that all status values are handled, eliminating the cast at line 334.
Also applies to: 334-335
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/MonitorRunList.tsx` around lines 234 - 241, The statusIcons object is currently created with useMemo and later accessed via a cast which defeats TypeScript exhaustiveness checks; replace the useMemo + cast pattern with an explicit typed mapping to enforce all MonitorRunStatus values are handled (e.g. declare statusIcons as Record<MonitorRunStatus, React.ReactNode> and populate keys failed/success/running/pending), remove the "as keyof typeof statusIcons" cast where the icon is selected, and follow the same exhaustive pattern used in MonitorRunDrawer so future MonitorRunStatus additions will cause a compile-time error if not covered.
124-125: Add error handling to the `useRerunMonitor` mutation.

The mutation already invalidates monitor-runs queries on success, triggering a list refresh automatically. However, there is no error callback to surface failures to users. Add an `onError` handler to `useRerunMonitor`, or configure global error handlers on the `QueryClient` via `MutationCache`, to ensure errors are consistently displayed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/MonitorRunList.tsx` around lines 124 - 125, The useRerunMonitor mutation call lacks an onError handler so failures aren't surfaced; update the invocation of useRerunMonitor (and the identical pattern at the other occurrence around where lines 387-395 handle rerun) to pass an onError callback that receives the error and displays it to the user (for example by calling your existing notification/toast function or setting local error state), while keeping the existing onSuccess invalidation logic; alternatively, configure a global MutationCache on the QueryClient to funnel mutation errors to the same user-facing handler so all rerun errors are consistently shown.

console/workspaces/pages/eval/src/subComponents/MonitorRunDrawer.tsx (2)
113-113: Replace hardcoded pixel sizing with token/flex-based sizing.

Fixed pixel offsets (`96px`, `300px`, `80`) make this drawer brittle across viewport sizes. Prefer theme-token/flex sizing for consistent behavior.

As per coding guidelines, "Use theme tokens via the `sx` prop instead of hardcoded colors and spacing values".

♻️ Suggested refactor

```diff
- <Stack spacing={2} height="calc(100vh - 96px)">
+ <Stack spacing={2} sx={{ flex: 1, minHeight: 0 }}>
  ...
- sx={{ height: 80 }}
+ sx={{ height: (theme) => theme.spacing(10) }}
  ...
- maxHeight="calc(100vh - 300px)"
+ maxHeight="100%"
  ...
- <Box sx={{ overflowY: "auto", maxHeight: "calc(100vh - 300px)" }}>
+ <Box sx={{ overflowY: "auto", flex: 1, minHeight: 0 }}>
```

Also applies to: 133-133, 165-171
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/MonitorRunDrawer.tsx` at line 113, The component MonitorRunDrawer uses hardcoded pixel sizes (e.g. the Stack height "calc(100vh - 96px)", the drawer width ~300px, and the 80px value) which makes layout brittle; replace these with theme tokens and flexible layout rules by using the sx prop and theme.spacing or responsive breakpoints (e.g. use sx={{ height: `calc(100vh - ${theme.mixins.toolbar?.minHeight || theme.spacing(12)})` }} or better, use flex: 1 / maxHeight and theme.spacing(token) for gaps) and swap fixed width/height properties for flexbox/grow/shrink or theme.breakpoints-based values in the same MonitorRunDrawer component (look for the Stack element and the drawer/containers around it and the places with 300px/80 values) so the drawer sizes adapt to viewport and theme tokens.
72-78: Avoid silent empty `runId` fallback when fetching logs.

Passing `""` as `runId` can produce avoidable invalid fetches/error states. Prefer guarding the query with a valid `run.id` instead of coercing to an empty string.

♻️ Suggested direction

```diff
- const { data, isLoading, error } = useMonitorRunLogs({
+ const runId = run.id;
+ const { data, isLoading, error } = useMonitorRunLogs({
    orgName,
    projName: projectName,
    agentName,
    monitorName,
-   runId: run.id ?? "",
+   runId: runId ?? "",
  });
+ // If supported by the hook, also gate execution with `enabled: Boolean(runId)`.
```
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/MonitorRunDrawer.tsx` around lines 72 - 78, The current call to useMonitorRunLogs passes run.id ?? "" which sends an empty runId and triggers invalid fetches; change the component to only invoke useMonitorRunLogs when a valid run.id exists (or pass undefined and use the hook's enabled option) — locate the call to useMonitorRunLogs and replace the unconditional call with a guarded invocation (e.g., check run.id truthiness before calling or pass { enabled: !!run.id } into the hook) so that useMonitorRunLogs is not invoked with an empty string runId.

console/workspaces/pages/eval/src/subComponents/MonitorStartStopButton.tsx (1)
64-69: Minor redundancy in null checks.

The check `!orgId || !projectId || !agentId` on line 67 is already covered by `isDisabled` (lines 50-56). Consider simplifying:

♻️ Suggested simplification

```diff
 const handleClick = useCallback(
   (event: MouseEvent<HTMLButtonElement>) => {
     event.stopPropagation();
-    if (isDisabled || !orgId || !projectId || !agentId) {
+    if (isDisabled) {
       return;
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/MonitorStartStopButton.tsx` around lines 64 - 69, The handleClick useCallback currently redundantly checks !orgId || !projectId || !agentId even though those conditions are already encapsulated by the isDisabled variable; simplify the handler by removing the redundant null/undefined checks and only early-return on isDisabled (keep event.stopPropagation() and the existing guards relying on isDisabled), referencing handleClick, isDisabled, orgId, projectId, and agentId to locate the code to edit.

console/workspaces/pages/eval/src/subComponents/ScoreChip.tsx (1)
26-30: Consider using theme tokens for score colors.

The hardcoded hex colors violate the coding guideline to "use theme tokens via the `sx` prop instead of hardcoded colors." Consider using theme palette colors like `success.main`, `warning.main`, and `error.main`:

♻️ Suggested refactor using theme tokens

```diff
-export function scoreColor(score: number): string {
-  if (score >= 0.8) return "#22c55e";
-  if (score >= 0.6) return "#f59e0b";
-  return "#ef4444";
+import type { Theme } from "@wso2/oxygen-ui";
+
+export function scoreColor(score: number, theme: Theme): string {
+  if (score >= 0.8) return theme.palette.success.main;
+  if (score >= 0.6) return theme.palette.warning.main;
+  return theme.palette.error.main;
 }
```

This would require updating the component and callers to pass the theme. Alternatively, if these specific colors are intentional brand/semantic colors for score visualization, document this as an exception.

As per coding guidelines: "Use theme tokens via the `sx` prop instead of hardcoded colors and spacing values"
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/ScoreChip.tsx` around lines 26 - 30, Replace the hardcoded hex values in scoreColor with theme palette tokens: change scoreColor(score: number) to accept the theme (e.g., scoreColor(score: number, theme: Theme)) or return a palette key and update callers accordingly; map scores to theme.palette.success.main, theme.palette.warning.main and theme.palette.error.main (or to the string keys "success.main"/"warning.main"/"error.main" if you prefer to let callers resolve via sx), and update all callers of scoreColor (e.g., in ScoreChip component) to pass the theme or use the returned token in the sx prop; if these hex values are intentionally brand-specific, add a short comment documenting this exception instead.

console/workspaces/pages/eval/src/subComponents/levelConfig.ts (1)
21-28: Consider using theme tokens for level colors.

The hardcoded hex colors (`#7c3aed`, `#3f8cff`, `#22c55e`) work but deviate from the coding guideline to use theme tokens via the `sx` prop. If these level colors are intended to be themeable or respect dark/light mode palette variations, consider defining them in the theme's palette (e.g., `theme.palette.levels.trace`) and referencing them here.

If these are intentionally fixed brand colors for level distinction, this is acceptable as-is.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/levelConfig.ts` around lines 21 - 28, LEVEL_CONFIG currently uses hard-coded hex colors for each EvaluationLevel (trace, agent, llm); change it to reference theme tokens so colors respond to theme/dark-mode. Update LEVEL_CONFIG (and any consumers that set sx) to use keys like theme.palette.levels.trace / .agent / .llm (or similarly named tokens you add to the theme) instead of literal hex strings; add the corresponding entries to your theme palette (e.g., palette.levels = { trace: ..., agent: ..., llm: ... }) so components can read theme.palette.levels.<level> when rendering. Ensure EvaluationLevel keys still match and preserve label/unit values.

console/workspaces/libs/api-client/src/hooks/monitors.ts (1)
328-350: Use `GroupedScoresQueryParams` directly to avoid query-shape drift.

The hook currently redefines the query contract inline. Reusing the shared API type keeps this aligned with backend schema changes and avoids future silent divergence.

♻️ Suggested refactor

```diff
 import {
   type CreateMonitorPathParams,
   type CreateMonitorRequest,
   type DeleteMonitorPathParams,
   type EvaluationLevel,
   type GetMonitorPathParams,
   type GroupedScoresPathParams,
+  type GroupedScoresQueryParams,
   type GroupedScoresResponse,
@@
 export function useGroupedScores(
   params: GroupedScoresPathParams,
-  query: { level: EvaluationLevel; timeRange?: TraceListTimeRange },
+  query: GroupedScoresQueryParams & { timeRange?: TraceListTimeRange },
   options?: { enabled?: boolean },
 ) {
@@
-      const { timeRange, ...rest } = query;
-      let finalQuery = rest as { level: EvaluationLevel; startTime?: string; endTime?: string };
+      const { timeRange, ...rest } = query;
+      let finalQuery: GroupedScoresQueryParams = rest;
       if (timeRange) {
         const { startTime, endTime } = getTimeRange(timeRange);
         finalQuery = { ...finalQuery, startTime, endTime };
       }
       return getGroupedScores(params, finalQuery, getToken);
@@
       !!params.projName &&
       !!params.agentName &&
       !!params.monitorName &&
-      !!query.timeRange,
+      (!!query.timeRange || (!!query.startTime && !!query.endTime)),
   });
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/libs/api-client/src/hooks/monitors.ts` around lines 328 - 350, Replace the ad-hoc inline query shape with the shared API type GroupedScoresQueryParams to prevent drift: change the hook parameter type from the inline "{ level: EvaluationLevel; timeRange?: TraceListTimeRange }" to GroupedScoresQueryParams, update the local finalQuery typing to use GroupedScoresQueryParams (or a compatible extension with startTime/endTime), and pass that typed finalQuery into getGroupedScores; keep the getTimeRange, getGroupedScores, useQuery and queryKey logic intact but reference GroupedScoresQueryParams wherever the query shape is declared or cast.

evaluation-job/test_main.py (1)
123-131: Broaden span-context test helper inputs to cover agent/model propagation.

`_make_evaluator_score` hardcodes `agent_name`/`model`/`vendor` as `None`, so regressions in those fields won't be caught by tests even though the payload now supports them.

🧪 Suggested helper enhancement

```diff
 def _make_evaluator_score(
     trace_id,
     score,
     span_id=None,
+    agent_name=None,
+    model=None,
+    vendor=None,
     explanation=None,
     timestamp=None,
     error=None,
 ):
@@
-    if span_id is not None:
+    if any(v is not None for v in (span_id, agent_name, model, vendor)):
         span_ctx = MagicMock()
         span_ctx.span_id = span_id
-        span_ctx.agent_name = None
-        span_ctx.model = None
-        span_ctx.vendor = None
+        span_ctx.agent_name = agent_name
+        span_ctx.model = model
+        span_ctx.vendor = vendor
         s.span_context = span_ctx
     else:
         s.span_context = None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluation-job/test_main.py` around lines 123 - 131, The test helper _make_evaluator_score currently sets span_ctx.agent_name, span_ctx.model, and span_ctx.vendor to None when span_id is provided, which hides regressions; update _make_evaluator_score (and any helper constructing s.span_context) to accept optional parameters agent_name, model, and vendor (defaulting to None) and assign them to span_ctx.agent_name, span_ctx.model, span_ctx.vendor when creating the MagicMock span context; then update tests that rely on propagation to pass concrete agent/model/vendor values (preserving None defaults for existing callers).

console/workspaces/pages/eval/src/ViewMonitor.Component.tsx (1)
142-146: Avoid implicit return in forEach callback.

Same static analysis flag as in PerformanceByEvaluatorCard: the callback returns the result of `s.add()`. Use a block body to silence the linter.

♻️ Proposed fix

```diff
 const levelsPresent = useMemo(() => {
   const s = new Set<EvaluationLevel>();
-  evaluators.forEach((e) => s.add(e.level));
+  evaluators.forEach((e) => {
+    s.add(e.level);
+  });
   return s;
 }, [evaluators]);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/ViewMonitor.Component.tsx` around lines 142 - 146, The linter flags the implicit return inside the forEach callback in the useMemo that builds levelsPresent; update the callback in the useMemo (the one iterating evaluators and calling s.add) to use a block body (or a for...of loop) so it does not implicitly return s.add(), e.g., change the evaluators.forEach((e) => s.add(e.level)) to a block callback that calls s.add(e.level); and then return the Set as before.

agent-manager-service/services/monitor_scores_service.go (1)
394-397: Consider distinguishing "no data" from "zero score" in grouped results.

When `agg.MeanScore` is nil, the mean defaults to `0.0`. This conflates "evaluator returned a score of 0" with "no scores were computed for this label/evaluator combination." Downstream consumers (UI tables) may want to display these differently (e.g., "N/A" vs "0%").

If the API contract allows, consider making `Mean` a pointer or adding an explicit "has data" flag.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agent-manager-service/services/monitor_scores_service.go` around lines 394 - 397, The current code sets mean := 0.0 when agg.MeanScore is nil which conflates "no data" with a real zero; change the response to preserve nil by making the output Mean a *float64 (or add an explicit boolean like HasMean) and assign Mean = agg.MeanScore (or set HasMean = agg.MeanScore != nil) instead of defaulting to 0.0; update the struct used for grouped results (the response type that contains Mean) and any JSON tags/consumers to handle a null value (or the new flag) so downstream UI can distinguish "N/A" from 0.

console/workspaces/pages/eval/src/subComponents/PerformanceByEvaluatorCard.tsx (1)
166-170: Avoid implicit return in forEach callback.

Static analysis flags that the callback passed to `forEach()` returns a value (the result of `m.set()`). While functionally harmless, this can cause linter noise and confusion. Use a block body or explicit void.

♻️ Proposed fix

```diff
 const levelMap = useMemo(() => {
   const m = new Map<string, EvaluationLevel>();
-  evaluators.forEach((e) => m.set(e.name, e.level));
+  evaluators.forEach((e) => {
+    m.set(e.name, e.level);
+  });
   return m;
 }, [evaluators]);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/PerformanceByEvaluatorCard.tsx` around lines 166 - 170, The forEach callback used in computing levelMap implicitly returns the value from m.set(), causing linter noise; update the callback in the useMemo that builds levelMap so it uses a block body (e.g., evaluators.forEach((e) => { m.set(e.name, e.level); });) or explicitly void the call so the callback returns undefined, keeping the rest of the logic in levelMap, EvaluationLevel, evaluators and m.set unchanged.

libs/amp-evaluation/src/amp_evaluation/models.py (1)
256-284: `from_eval_result()` does not propagate `task_context`

`EvaluatorScore` now carries `task_context`, but the primary factory cannot set it. Consider adding it to keep creation paths symmetric and avoid losing dataset-run context.

Suggested refactor

```diff
 def from_eval_result(
     cls,
     eval_result: "EvalResult",
     trace_id: str,
     trace_start_time: Optional[datetime] = None,
     span_context: Optional["SpanContext"] = None,
+    task_context: Optional["TaskContext"] = None,
 ) -> "EvaluatorScore":
@@
         return cls(
             trace_id=trace_id,
             trace_start_time=trace_start_time,
             skip_reason=eval_result.skip_reason,
             span_context=span_context,
+            task_context=task_context,
         )
     return cls(
         trace_id=trace_id,
         trace_start_time=trace_start_time,
         score=eval_result.score,
         passed=eval_result.passed,
         explanation=eval_result.explanation,
         span_context=span_context,
+        task_context=task_context,
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@libs/amp-evaluation/src/amp_evaluation/models.py` around lines 256 - 284, The from_eval_result classmethod on EvaluatorScore does not accept or propagate task_context, causing EvaluatorScore instances to lose dataset-run context; update the from_eval_result signature to accept an optional task_context parameter (matching the EvaluatorScore.task_context type), and pass that task_context through into the returned cls(...) in both the is_skipped branch and the normal branch (use eval_result.task_context if the caller doesn't provide one) so task_context is preserved when constructing EvaluatorScore from an EvalResult.

agent-manager-service/repositories/evaluation_score_repository.go (1)
289-291: Filter clauses should be applied before GROUP BY.

The `EvaluatorName` and `Level` filters (lines 293-298) are added after the `.Group()` call. While GORM typically builds the query correctly, this ordering can be confusing and may lead to unexpected behavior in some edge cases. Consider moving the filter conditions before the aggregation setup for clarity.

♻️ Suggested reordering for clarity

```diff
 query := r.db.Table("scores s").
-  Select(`
-    mre.evaluator_name,
-    mre.level,
-    COUNT(*) as total_count,
-    COUNT(CASE WHEN s.skip_reason IS NOT NULL THEN 1 END) as skipped_count,
-    AVG(CASE WHEN s.skip_reason IS NULL THEN s.score END) as mean_score
-  `).
   Joins("JOIN monitor_run_evaluators mre ON s.run_evaluator_id = mre.id").
   Where("s.monitor_id = ?", monitorID).
   Where("s.trace_start_time BETWEEN ? AND ?", startTime, endTime)
-if filters.EvaluatorName != "" {
-  query = query.Where("mre.evaluator_name = ?", filters.EvaluatorName)
-}
-if filters.Level != "" {
-  query = query.Where("mre.level = ?", filters.Level)
-}
+if filters.EvaluatorName != "" {
+  query = query.Where("mre.evaluator_name = ?", filters.EvaluatorName)
+}
+if filters.Level != "" {
+  query = query.Where("mre.level = ?", filters.Level)
+}
+query = query.Select(`
+  mre.evaluator_name,
+  mre.level,
+  COUNT(*) as total_count,
+  COUNT(CASE WHEN s.skip_reason IS NOT NULL THEN 1 END) as skipped_count,
+  AVG(CASE WHEN s.skip_reason IS NULL THEN s.score END) as mean_score
+`).
   Group("mre.evaluator_name, mre.level").
   Order("mre.evaluator_name")
-if filters.EvaluatorName != "" {
-  query = query.Where("mre.evaluator_name = ?", filters.EvaluatorName)
-}
-if filters.Level != "" {
-  query = query.Where("mre.level = ?", filters.Level)
-}
 err := query.Find(&results).Error
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agent-manager-service/repositories/evaluation_score_repository.go` around lines 289 - 291, The query applies the EvaluatorName and Level filters after the .Group() call which is confusing; move the filter Where clauses for EvaluatorName and Level to precede the .Group() invocation so the sequence is Where(start/end time) -> Where(EvaluatorName) -> Where(Level) -> Group("mre.evaluator_name, mre.level") -> Order(...). Update the query built around the existing Where("s.trace_start_time BETWEEN ? AND ?", startTime, endTime), the subsequent EvaluatorName/Level Where calls, and the Group(...) / Order(...) chain in the function that constructs this GORM query to reflect this ordering.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@agent-manager-service/db_migrations/006_create_evaluation_scores.go`:
- Line 45: The unique constraint currently defined as uq_run_evaluator UNIQUE
(monitor_run_id, evaluator_name) must be changed to use the per-run instance
identity column instead of evaluator_name: alter the migration that creates the
monitor_run_evaluators table to define uq_run_evaluator as UNIQUE
(monitor_run_id, identifier) (or UNIQUE (monitor_run_id, display_name) if your
schema uses display_name as the per-instance key), ensure the identifier column
is created before the constraint, and update any referenced index/constraint
name usages that assume the old evaluator_name uniqueness; locate
monitor_run_evaluators, the uq_run_evaluator constraint, and the
evaluator_name/identifier/display_name column definitions to make this change.
In `@agent-manager-service/docs/api_v1_openapi.yaml`:
- Around line 9229-9251: Update the LabelEvaluatorSummary.mean schema so it can
be null while remaining required: change the mean property to accept nulls
(e.g., add nullable: true or make its type allow ["number","null"] depending on
OpenAPI version) and keep mean listed under required; target the
LabelEvaluatorSummary.mean definition to ensure skipped-only groups can
represent undefined means as null.
In `@agent-manager-service/models/score.go`:
- Around line 194-200: LabelEvaluatorSummary currently defines Mean as float64
which forces NULL DB means to become 0.0; change the Mean field to *float64 (and
add `json:"mean,omitempty"`) so NULL is preserved, then update the conversion
logic in monitor_scores_service.go (the block that maps
LabelAggregation.MeanScore into LabelEvaluatorSummary, around the existing
conversion that dereferences into 0.0) to assign the pointer value directly (or
pass through the existing *float64) instead of converting NULL to 0.0; ensure
any downstream code that reads LabelEvaluatorSummary.Mean handles nil checks.
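The nil-versus-zero distinction this fix targets can be illustrated with a small Python analogue, using `Optional[float]` in place of Go's `*float64`; `format_mean` is a hypothetical display helper for the consuming UI, not code from this PR:

```python
from typing import Optional

def format_mean(mean: Optional[float]) -> str:
    """Render a mean score for display, keeping 'no data' distinct from a
    genuine zero score."""
    if mean is None:
        return "N/A"  # nothing was scored (e.g. every evaluation was skipped)
    return f"{mean * 100:.0f}%"  # a real score, including an actual zero
```

A `0.0` default would make the first two cases below indistinguishable, which is exactly the conflation the comment warns about.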
In `@agent-manager-service/spec/model_time_series_point.go`:
- Line 29: The generated file model_time_series_point.go was manually edited but
will be overwritten; instead add the clarification about the "mean" key to the
OpenAPI spec so it persists: update the TimeSeriesPoint.aggregations field
description in docs/api_v1_openapi.yaml to include the sentence that the "mean"
key is always present and is a number (0–1) when scored data exists or null when
all evaluations were skipped, then regenerate the client so the comment appears
in the generated TimeSeriesPoint type.
In `@console/workspaces/pages/eval/src/CreateMonitor.Component.tsx`:
- Around line 101-114: The payload construction for CreateMonitorRequest omits
the user-entered description (values.description) so user input is dropped;
update the payload object in CreateMonitor.Component (the const payload:
CreateMonitorRequest block) to include a description property populated from
values.description (e.g., description: values.description?.trim() ?? undefined),
or if CreateMonitorRequest truly does not support description remove the
description field from the form and related form state to avoid collecting
unusable input; ensure you modify the CreateMonitorRequest usage and any form
validation in the same component accordingly.
In `@console/workspaces/pages/eval/src/EditMonitor.Component.tsx`:
- Around line 123-125: The sanitization currently only filters out entries whose
config.value is "****" but not those set to undefined; update the sanitation
logic around sanitizedLlmProviderConfigs (using values.llmProviderConfigs) to
filter out both config.value === "****" and config.value === undefined (or
config.value == null) before constructing the update payload so undefined-secret
entries are not included; if the resulting array is empty set
sanitizedLlmProviderConfigs to undefined so the payload omits the field
entirely.
In `@console/workspaces/pages/eval/src/form/schema.ts`:
- Around line 79-94: The samplingRate schema currently allows 0 but the error
message says "between 1 and 100"; update the refine predicate in samplingRate
(the .refine call on the samplingRate union/transform) to require value >= 1
instead of >= 0 and keep the message consistent; ensure the transform and refine
remain intact (transform that converts empty string/null/undefined to undefined
and the refine that checks Number.isInteger(value) && value >= 1 && value <=
100).
- Around line 63-77: The refine on intervalMinutes currently accepts value === 5
but the error message says "greater than 5 minutes"; update them to match:
either change the predicate in the refine on intervalMinutes (inside the .refine
callback) to require value === undefined || (Number.isInteger(value) && value >
5) if you intend to disallow 5, or change the message to "Interval must be at
least 5 minutes" if you want to allow 5; keep the existing .transform behavior
that converts ""/null/undefined to undefined.
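The two refine predicates described above can be sketched as plain functions (names are hypothetical; the real schema composes them via zod's `.transform`/`.refine`). This sketch assumes the "allow 5, fix the message" option for the interval check:

```typescript
// Hypothetical standalone versions of the refine predicates. Empty input is
// treated as "unset" because the transform maps ""/null/undefined to undefined.

function isValidSamplingRate(value: number | undefined): boolean {
  if (value === undefined) return true; // optional field passes
  return Number.isInteger(value) && value >= 1 && value <= 100; // >= 1, not >= 0
}

// Accepts exactly 5 so the predicate matches an "at least 5 minutes" message.
function isValidIntervalMinutes(value: number | undefined): boolean {
  if (value === undefined) return true;
  return Number.isInteger(value) && value >= 5;
}
```

Keeping the predicate and the error message in one place like this makes the mismatch between "between 1 and 100" and `>= 0` hard to reintroduce.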
In `@console/workspaces/pages/eval/src/index.ts`:
- Around line 51-56: The editMonitor page entry is missing its required path
property which breaks routing; update the editMonitor object (the entry with
component EditMonitorComponent and icon MonitorCheck) to include an appropriate
path string consistent with the other entries — e.g. a static route like
"edit-monitor" or a parameterized route like "edit/:id" if it needs an eval id —
and ensure the chosen path matches navigation and route registration conventions
used elsewhere in this pages map.
In `@console/workspaces/pages/eval/src/subComponents/CreateMonitorForm.tsx`:
- Around line 188-195: The TextField for intervalMinutes in CreateMonitorForm
currently passes event.target.value (a string) into onFieldChange, causing
formData.intervalMinutes to become a string; update the onChange handler in the
TextField with id "intervalMinutes" to parse the input to a number (e.g.,
Number(...) or parseInt) and handle empty string (set undefined/null) before
calling onFieldChange so formData.intervalMinutes remains a number type for API
payloads; update any validation in CreateMonitorForm/onFieldChange accordingly
to accept numeric values only.
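The string-to-number conversion described above can be sketched as a small helper (name and empty-input behavior are assumptions, not the component's actual code):

```typescript
// Hypothetical helper: converts the raw TextField string into the numeric
// value stored in formData.intervalMinutes. Empty input becomes undefined
// rather than "" so the API payload keeps a numeric type.
function parseIntervalMinutes(raw: string): number | undefined {
  const trimmed = raw.trim();
  if (trimmed === "") return undefined;
  const parsed = Number(trimmed);
  return Number.isNaN(parsed) ? undefined : parsed;
}
```

The onChange handler would then call `onFieldChange("intervalMinutes", parseIntervalMinutes(event.target.value))` instead of passing the raw string through.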
In `@console/workspaces/pages/eval/src/subComponents/MonitorRunDrawer.tsx`:
- Around line 139-141: The Alert for failed runs is using severity="warning"
which under-communicates errors; in MonitorRunDrawer update the Alert that
renders when run.errorMessage is present (the JSX referencing run.errorMessage
and the Alert component) to use severity="error" so failed runs are clearly
signalled.
In `@console/workspaces/pages/eval/src/subComponents/MonitorRunList.tsx`:
- Around line 214-228: The IconButton instances in MonitorRunList.tsx are
icon-only and missing accessible names; add an explicit accessible label (e.g.,
aria-label="Refresh runs" or aria-label="Pause run" as appropriate) to each
IconButton usage (the one wrapping CircularProgress/RefreshCcw and the other
occurrence around line ~380) so screen readers can announce the action, and
optionally add a title prop to surface a tooltip for sighted users; keep the
existing onClick/disabled logic intact.
In `@console/workspaces/pages/eval/src/subComponents/MonitorTable.tsx`:
- Line 281: Update the UI copy in MonitorTable.tsx where the extra-evaluators
label is rendered: replace the string literal `{`+${monitor.evaluators.length -
2} more..`}` with the correct punctuation (e.g., `more...` or simply `more`) so
the displayed text uses the intended ellipsis; locate the render that references
`monitor.evaluators.length` in the MonitorTable component and change the text
accordingly.
- Around line 297-343: The IconButton controls for the edit and delete actions
are icon-only and need accessible labels: add an aria-label prop to the edit
IconButton near the Edit icon (the IconButton that calls
navigate(generatePath(...)) with agentId/orgId/projectId/envId/monitorId) and to
the delete IconButton (the IconButton wrapped by Tooltip that calls
addConfirmation and deleteMonitor). Ensure the aria-labels are descriptive
(e.g., "Edit monitor" and "Delete monitor") so screen readers can identify the
buttons while keeping existing onClick behavior unchanged.
- Around line 132-141: The search haystack currently concatenates monitor.name
but omits monitor.displayName, so add monitor.displayName into the array used to
build haystack (alongside monitor.name, monitor.environment, monitor.dataSource,
...monitor.evaluators, monitor.status) and ensure it is included safely (e.g.,
fallback or conditional check) before joining and lowercasing so term matching
will find the visible display name.
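A minimal sketch of the haystack fix, assuming a simplified monitor shape (the real `Monitor` type has more fields):

```typescript
// Hypothetical row shape for illustration only.
interface MonitorRow {
  name: string;
  displayName?: string;
  environment: string;
  dataSource: string;
  evaluators: string[];
  status: string;
}

// Builds the lowercase search haystack, now including displayName with a
// safe fallback so the visible display name is matched by search terms.
function buildHaystack(monitor: MonitorRow): string {
  return [
    monitor.name,
    monitor.displayName ?? "",
    monitor.environment,
    monitor.dataSource,
    ...monitor.evaluators,
    monitor.status,
  ]
    .join(" ")
    .toLowerCase();
}
```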
In `@console/workspaces/pages/eval/src/subComponents/ScoreBreakdownCard.tsx`:
- Around line 47-58: The title lookup using CARD_TITLES[level as "agent" |
"llm"] is unsafe because ScoreBreakdownCard's level can be "trace" and may yield
undefined; fix by narrowing the ScoreBreakdownCard prop type so level is only
"agent" | "llm" (update the ScoreBreakdownCardProps definition) or add a safe
fallback when deriving title (check level against allowed keys before indexing
CARD_TITLES and use a default like "Score Breakdown" when not matched);
reference CARD_TITLES and the ScoreBreakdownCard level prop when making the
change.
- Around line 63-64: The nested forEach callbacks (data.groups.forEach and
g.evaluators.forEach) currently use concise arrow bodies that return the result
of nameSet.add(...), triggering the Biome useIterableCallbackReturn lint rule;
change both callbacks to block bodies with no returned value (e.g., replace the
concise arrow bodies with { nameSet.add(e.evaluatorName); } ) so the callbacks
perform the side-effect only. Ensure you update the callbacks that reference
data.groups, g.evaluators, nameSet.add, and e.evaluatorName to use
statement-style bodies and do not return anything.
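The title-lookup fallback described above can be sketched as follows (the `CARD_TITLES` strings here are placeholders, not the component's actual copy):

```typescript
// Placeholder titles; the real ScoreBreakdownCard defines its own.
const CARD_TITLES = {
  agent: "Scores by Agent",
  llm: "Scores by Model",
} as const;

type EvaluationLevel = "trace" | "agent" | "llm";

// Safe lookup: narrows to the allowed keys before indexing, so a "trace"
// level yields a default title instead of undefined.
function cardTitle(level: EvaluationLevel): string {
  return level === "agent" || level === "llm"
    ? CARD_TITLES[level]
    : "Score Breakdown";
}
```

Narrowing the prop type to `"agent" | "llm"` is the stricter alternative; the runtime fallback above is the lighter-touch fix.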
In `@libs/amp-evaluation/src/amp_evaluation/evaluators/builtin/standard.py`:
- Around line 232-235: The current mismatch explanation constructs raw text
previews (variables output_preview, expected_preview) and sets explanation to
include them; replace that with metadata-safe diagnostics: do not embed raw
output/expected excerpts in the EvalResult.explanation, instead include
non-sensitive attributes such as lengths (len(output), len(expected)), a boolean
exact_match flag, and a short stable hash fingerprint (e.g., first 8 chars of a
SHA256) or delta indicator; update the code that assigns explanation (and any
variables output_preview/expected_preview) in standard.py to produce a message
like "Output mismatch: exact_match=False, output_len=..., expected_len=...,
output_hash=..., expected_hash=..." so persisted explanations contain no raw
content.
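The target file is Python, but the metadata-safe explanation format can be illustrated in a TypeScript sketch (function names are hypothetical; only lengths, an exact-match flag, and short SHA-256 fingerprints are emitted, never raw content):

```typescript
import { createHash } from "node:crypto";

// Short, stable fingerprint: first 8 hex chars of SHA-256.
function fingerprint(text: string): string {
  return createHash("sha256").update(text).digest("hex").slice(0, 8);
}

// Builds a mismatch explanation with no raw output/expected excerpts.
function mismatchExplanation(output: string, expected: string): string {
  return (
    `Output mismatch: exact_match=${output === expected}, ` +
    `output_len=${output.length}, expected_len=${expected.length}, ` +
    `output_hash=${fingerprint(output)}, expected_hash=${fingerprint(expected)}`
  );
}
```

The fingerprint still lets operators tell whether two mismatching runs produced the same output, without persisting potentially sensitive text.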
---
Outside diff comments:
In `@libs/amp-evaluation/src/amp_evaluation/trace/parser.py`:
- Around line 99-113: The loop currently iterates only semantic_spans (filtered
by SEMANTIC_KINDS), so the parentSpanId is None branch that checks
INFRASTRUCTURE_KINDS is never reached and direct semantic roots are missed;
change the loop to iterate over the original spans list (or iterate both
semantic and non-semantic as appropriate) and within the loop: if
span.ampAttributes.kind in SEMANTIC_KINDS perform the parent_id/remap_map
parent-walk logic (appending to orphans when remap_map resolves to None), and
separately if span.parentSpanId is None and span.ampAttributes.kind not in
INFRASTRUCTURE_KINDS treat it as an orphan semantic root (append to orphans) so
synthetic-root creation will work when multiple semantic roots exist.
---
Nitpick comments:
In `@agent-manager-service/repositories/evaluation_score_repository.go`:
- Around line 289-291: The query applies the EvaluatorName and Level filters
after the .Group() call which is confusing; move the filter Where clauses for
EvaluatorName and Level to precede the .Group() invocation so the sequence is
Where(start/end time) -> Where(EvaluatorName) -> Where(Level) ->
Group("mre.evaluator_name, mre.level") -> Order(...). Update the query built
around the existing Where("s.trace_start_time BETWEEN ? AND ?", startTime,
endTime), the subsequent EvaluatorName/Level Where calls, and the Group(...) /
Order(...) chain in the function that constructs this GORM query to reflect this
ordering.
In `@agent-manager-service/services/monitor_scores_service.go`:
- Around line 394-397: The current code sets mean := 0.0 when agg.MeanScore is
nil which conflates "no data" with a real zero; change the response to preserve
nil by making the output Mean a *float64 (or add an explicit boolean like
HasMean) and assign Mean = agg.MeanScore (or set HasMean = agg.MeanScore != nil)
instead of defaulting to 0.0; update the struct used for grouped results (the
response type that contains Mean) and any JSON tags/consumers to handle a null
value (or the new flag) so downstream UI can distinguish "N/A" from 0.
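The service change is in Go, but the consumer-side distinction can be sketched in TypeScript (type and function names are hypothetical), mirroring a nullable `mean` in the JSON payload:

```typescript
// Mean is null (not 0) when every evaluation in the group was skipped,
// mirroring the *float64 / nullable JSON field suggested above.
interface GroupedScore {
  evaluatorName: string;
  mean: number | null;
  count: number;
  skippedCount: number;
}

// UI-side formatting that distinguishes "no data" from a genuine zero score.
function formatMean(mean: number | null): string {
  return mean === null ? "N/A" : mean.toFixed(2);
}
```

With `mean: 0.0` as the default, this distinction is impossible: a group of all-skipped evaluations renders identically to a group that genuinely scored zero.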
In `@console/workspaces/libs/api-client/src/hooks/monitors.ts`:
- Around line 328-350: Replace the ad-hoc inline query shape with the shared API
type GroupedScoresQueryParams to prevent drift: change the hook parameter type
from the inline "{ level: EvaluationLevel; timeRange?: TraceListTimeRange }" to
GroupedScoresQueryParams, update the local finalQuery typing to use
GroupedScoresQueryParams (or a compatible extension with startTime/endTime), and
pass that typed finalQuery into getGroupedScores; keep the getTimeRange,
getGroupedScores, useQuery and queryKey logic intact but reference
GroupedScoresQueryParams wherever the query shape is declared or cast.
In `@console/workspaces/pages/eval/src/subComponents/levelConfig.ts`:
- Around line 21-28: LEVEL_CONFIG currently uses hard-coded hex colors for each
EvaluationLevel (trace, agent, llm); change it to reference theme tokens so
colors respond to theme/dark-mode. Update LEVEL_CONFIG (and any consumers that
set sx) to use keys like theme.palette.levels.trace / .agent / .llm (or
similarly named tokens you add to the theme) instead of literal hex strings; add
the corresponding entries to your theme palette (e.g., palette.levels = { trace:
..., agent: ..., llm: ... }) so components can read theme.palette.levels.<level>
when rendering. Ensure EvaluationLevel keys still match and preserve label/unit
values.
In `@console/workspaces/pages/eval/src/subComponents/MonitorRunDrawer.tsx`:
- Line 113: The component MonitorRunDrawer uses hardcoded pixel sizes (e.g. the
Stack height "calc(100vh - 96px)", the drawer width ~300px, and the 80px value)
which makes layout brittle; replace these with theme tokens and flexible layout
rules by using the sx prop and theme.spacing or responsive breakpoints (e.g. use
sx={{ height: `calc(100vh - ${theme.mixins.toolbar?.minHeight ||
theme.spacing(12)})` }} or better, use flex: 1 / maxHeight and
theme.spacing(token) for gaps) and swap fixed width/height properties for
flexbox/grow/shrink or theme.breakpoints-based values in the same
MonitorRunDrawer component (look for the Stack element and the drawer/containers
around it and the places with 300px/80 values) so the drawer sizes adapt to
viewport and theme tokens.
- Around line 72-78: The current call to useMonitorRunLogs passes run.id ?? ""
which sends an empty runId and triggers invalid fetches; change the component to
only invoke useMonitorRunLogs when a valid run.id exists (or pass undefined and
use the hook's enabled option) — locate the call to useMonitorRunLogs and
replace the unconditional call with a guarded invocation (e.g., check run.id
truthiness before calling or pass { enabled: !!run.id } into the hook) so that
useMonitorRunLogs is not invoked with an empty string runId.
In `@console/workspaces/pages/eval/src/subComponents/MonitorRunList.tsx`:
- Around line 234-241: The statusIcons object is currently created with useMemo
and later accessed via a cast which defeats TypeScript exhaustiveness checks;
replace the useMemo + cast pattern with an explicit typed mapping to enforce all
MonitorRunStatus values are handled (e.g. declare statusIcons as
Record<MonitorRunStatus, React.ReactNode> and populate keys
failed/success/running/pending), remove the "as keyof typeof statusIcons" cast
where the icon is selected, and follow the same exhaustive pattern used in
MonitorRunDrawer so future MonitorRunStatus additions will cause a compile-time
error if not covered.
- Around line 124-125: The useRerunMonitor mutation call lacks an onError
handler so failures aren't surfaced; update the invocation of useRerunMonitor
(and the identical pattern at the other occurrence around where lines 387-395
handle rerun) to pass an onError callback that receives the error and displays
it to the user (for example by calling your existing notification/toast function
or setting local error state), while keeping the existing onSuccess invalidation
logic; alternatively, configure a global MutationCache on the QueryClient to
funnel mutation errors to the same user-facing handler so all rerun errors are
consistently shown.
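The typed status-icon mapping suggested above can be sketched as follows (icon strings stand in for the real `ReactNode` values):

```typescript
type MonitorRunStatus = "failed" | "success" | "running" | "pending";

// Because the map is declared as Record<MonitorRunStatus, ...>, TypeScript
// requires an entry for every status: adding a new member to the union
// without a matching icon becomes a compile-time error, and the
// "as keyof typeof statusIcons" cast at the lookup site is no longer needed.
const statusIcons: Record<MonitorRunStatus, string> = {
  failed: "x",
  success: "check",
  running: "spinner",
  pending: "clock",
};

function iconFor(status: MonitorRunStatus): string {
  return statusIcons[status]; // no cast required
}
```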
In `@console/workspaces/pages/eval/src/subComponents/MonitorStartStopButton.tsx`:
- Around line 64-69: The handleClick useCallback currently redundantly checks
!orgId || !projectId || !agentId even though those conditions are already
encapsulated by the isDisabled variable; simplify the handler by removing the
redundant null/undefined checks and only early-return on isDisabled (keep
event.stopPropagation() and the existing guards relying on isDisabled),
referencing handleClick, isDisabled, orgId, projectId, and agentId to locate the
code to edit.
In `@console/workspaces/pages/eval/src/subComponents/PerformanceByEvaluatorCard.tsx`:
- Around line 166-170: The forEach callback used in computing levelMap
implicitly returns the value from m.set(), causing linter noise; update the
callback in the useMemo that builds levelMap so it uses a block body (e.g.,
evaluators.forEach((e) => { m.set(e.name, e.level); });) or explicitly void the
call so the callback returns undefined, keeping the rest of the logic in
levelMap, EvaluationLevel, evaluators and m.set unchanged.
In `@console/workspaces/pages/eval/src/subComponents/ScoreChip.tsx`:
- Around line 26-30: Replace the hardcoded hex values in scoreColor with theme
palette tokens: change scoreColor(score: number) to accept the theme (e.g.,
scoreColor(score: number, theme: Theme)) or return a palette key and update
callers accordingly; map scores to theme.palette.success.main,
theme.palette.warning.main and theme.palette.error.main (or to the string keys
"success.main"/"warning.main"/"error.main" if you prefer to let callers resolve
via sx), and update all callers of scoreColor (e.g., in ScoreChip component) to
pass the theme or use the returned token in the sx prop; if these hex values are
intentionally brand-specific, add a short comment documenting this exception
instead.
In `@console/workspaces/pages/eval/src/utils/monitorFormUtils.ts`:
- Around line 70-75: The branch handling type === "future" currently
unconditionally sets intervalMinutes to 10 and will clobber a user-selected
value; update that block in monitorFormUtils.ts (the type === "future" branch)
to preserve an existing intervalMinutes if provided (e.g., use
values?.intervalMinutes or a passed-in priorForm value) and only fall back to 10
when no existing interval is present, or move the 10 default to the initial
form-defaulting path so toggling to "future" does not overwrite the user's
interval.
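The preserve-then-default behavior described above can be sketched as a small pure function (name and signature are hypothetical):

```typescript
// Hypothetical reduction of the "future" branch: keep a user-selected
// interval when one exists and only fall back to the 10-minute default
// when nothing was set yet, so toggling type does not clobber user input.
function resolveIntervalMinutes(
  type: string,
  current: number | undefined,
): number | undefined {
  if (type !== "future") return current;
  return current ?? 10;
}
```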
In `@console/workspaces/pages/eval/src/ViewMonitor.Component.tsx`:
- Around line 142-146: The linter flags the implicit return inside the forEach
callback in the useMemo that builds levelsPresent; update the callback in the
useMemo (the one iterating evaluators and calling s.add) to use a block body (or
a for...of loop) so it does not implicitly return s.add(), e.g., change the
evaluators.forEach((e) => s.add(e.level)) to a block callback that calls
s.add(e.level); and then return the Set as before.
In `@evaluation-job/test_main.py`:
- Around line 123-131: The test helper _make_evaluator_score currently sets
span_ctx.agent_name, span_ctx.model, and span_ctx.vendor to None when span_id is
provided, which hides regressions; update _make_evaluator_score (and any helper
constructing s.span_context) to accept optional parameters agent_name, model,
and vendor (defaulting to None) and assign them to span_ctx.agent_name,
span_ctx.model, span_ctx.vendor when creating the MagicMock span context; then
update tests that rely on propagation to pass concrete agent/model/vendor values
(preserving None defaults for existing callers).
In `@libs/amp-evaluation/src/amp_evaluation/models.py`:
- Around line 256-284: The from_eval_result classmethod on EvaluatorScore does
not accept or propagate task_context, causing EvaluatorScore instances to lose
dataset-run context; update the from_eval_result signature to accept an optional
task_context parameter (matching the EvaluatorScore.task_context type), and pass
that task_context through into the returned cls(...) in both the is_skipped
branch and the normal branch (use eval_result.task_context if the caller doesn't
provide one) so task_context is preserved when constructing EvaluatorScore from
an EvalResult.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 33e77ca3-add9-42e6-bdb8-af2b73054e8f
⛔ Files ignored due to path filters (1)
`console/common/config/rush/pnpm-lock.yaml` is excluded by `!**/pnpm-lock.yaml`
📒 Files selected for processing (65)
- agent-manager-service/.gitignore
- agent-manager-service/api/monitor_routes.go
- agent-manager-service/controllers/monitor_scores_controller.go
- agent-manager-service/db_migrations/006_create_evaluation_scores.go
- agent-manager-service/docs/api_v1_openapi.yaml
- agent-manager-service/models/score.go
- agent-manager-service/repositories/evaluation_score_repository.go
- agent-manager-service/services/monitor_scores_service.go
- agent-manager-service/spec/api_default.go
- agent-manager-service/spec/model_evaluation_score_response.go
- agent-manager-service/spec/model_evaluator_score_summary.go
- agent-manager-service/spec/model_grouped_scores_response.go
- agent-manager-service/spec/model_label_evaluator_summary.go
- agent-manager-service/spec/model_publish_aggregate_item.go
- agent-manager-service/spec/model_publish_score_item.go
- agent-manager-service/spec/model_score_label_group.go
- agent-manager-service/spec/model_span_context.go
- agent-manager-service/spec/model_time_series_point.go
- agent-manager-service/tests/monitor_scores_test.go
- agent-manager-service/tests/score_repository_test.go
- agent-manager-service/utils/makeresults.go
- console/workspaces/libs/api-client/src/apis/monitors.ts
- console/workspaces/libs/api-client/src/hooks/monitors.ts
- console/workspaces/libs/types/src/api/monitors.ts
- console/workspaces/pages/eval/src/CreateMonitor.Component.tsx
- console/workspaces/pages/eval/src/EditMonitor.Component.tsx
- console/workspaces/pages/eval/src/EvalMonitors.Component.tsx
- console/workspaces/pages/eval/src/ViewMonitor.Component.tsx
- console/workspaces/pages/eval/src/form/schema.ts
- console/workspaces/pages/eval/src/hooks/useValidatedForm.ts
- console/workspaces/pages/eval/src/index.ts
- console/workspaces/pages/eval/src/subComponents/AgentPerformanceCard.tsx
- console/workspaces/pages/eval/src/subComponents/CreateMonitorForm.tsx
- console/workspaces/pages/eval/src/subComponents/EvaluationSummaryCard.tsx
- console/workspaces/pages/eval/src/subComponents/EvaluatorDetailsDrawer.tsx
- console/workspaces/pages/eval/src/subComponents/EvaluatorLlmProviderSection.tsx
- console/workspaces/pages/eval/src/subComponents/MetricsTooltip.tsx
- console/workspaces/pages/eval/src/subComponents/MonitorFormWizard.tsx
- console/workspaces/pages/eval/src/subComponents/MonitorRunDrawer.tsx
- console/workspaces/pages/eval/src/subComponents/MonitorRunList.tsx
- console/workspaces/pages/eval/src/subComponents/MonitorStartStopButton.tsx
- console/workspaces/pages/eval/src/subComponents/MonitorTable.tsx
- console/workspaces/pages/eval/src/subComponents/PerformanceByEvaluatorCard.tsx
- console/workspaces/pages/eval/src/subComponents/RunSummaryCard.tsx
- console/workspaces/pages/eval/src/subComponents/ScoreBreakdownCard.tsx
- console/workspaces/pages/eval/src/subComponents/ScoreChip.tsx
- console/workspaces/pages/eval/src/subComponents/SelectPresetMonitors.tsx
- console/workspaces/pages/eval/src/subComponents/levelConfig.ts
- console/workspaces/pages/eval/src/utils/monitorFormUtils.ts
- evaluation-job/main.py
- evaluation-job/test_main.py
- libs/amp-evaluation/src/amp_evaluation/__init__.py
- libs/amp-evaluation/src/amp_evaluation/evaluators/base.py
- libs/amp-evaluation/src/amp_evaluation/evaluators/builtin/deepeval.py
- libs/amp-evaluation/src/amp_evaluation/evaluators/builtin/standard.py
- libs/amp-evaluation/src/amp_evaluation/models.py
- libs/amp-evaluation/src/amp_evaluation/runner.py
- libs/amp-evaluation/src/amp_evaluation/trace/fetcher.py
- libs/amp-evaluation/src/amp_evaluation/trace/parser.py
- libs/amp-evaluation/tests/test_base_evaluator_coverage.py
- libs/amp-evaluation/tests/test_eval_result.py
- libs/amp-evaluation/tests/test_evaluators_deepeval_agent.py
- libs/amp-evaluation/tests/test_llm_as_judge.py
- libs/amp-evaluation/tests/test_span_filtering.py
- libs/amp-evaluation/tests/test_trace_parser.py
💤 Files with no reviewable changes (1)
- console/workspaces/pages/eval/src/subComponents/EvaluatorLlmProviderSection.tsx
Actionable comments posted: 3
♻️ Duplicate comments (1)
agent-manager-service/docs/api_v1_openapi.yaml (1)
10119-10123: ⚠️ Potential issue | 🟠 Major

`LabelEvaluatorSummary.mean` should allow null for skipped-only groups.

When a group has only skipped evaluations, mean is undefined. Keeping `mean` as non-nullable forces coercion and can mislead analytics/UI.

Proposed schema fix

```diff
 LabelEvaluatorSummary:
   type: object
   required:
     - evaluatorName
     - mean
     - count
     - skippedCount
   properties:
     evaluatorName:
       type: string
     mean:
       type: number
       format: double
-      description: Mean score (0-1)
+      nullable: true
+      description: Mean score (0-1), null when all evaluations are skipped
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agent-manager-service/docs/api_v1_openapi.yaml` around lines 10119 - 10123, The schema for LabelEvaluatorSummary.mean must allow null for cases with only skipped evaluations; update the OpenAPI definition for the property (LabelEvaluatorSummary.mean) to be nullable by adding nullable: true (or change type to ["number","null"]) while keeping format: double and the description, so consumers can distinguish undefined means; ensure generated clients/validators reflect the nullable change.
🧹 Nitpick comments (7)
console/workspaces/pages/eval/src/form/schema.ts (1)
42-54: Consider removing redundant `.min(1)` checks.

The `.min(1)` validators for `displayName` (line 45) and `name` (line 51) are superseded by the subsequent `.min(3)` checks. An empty string or 1-2 character string will trigger the `.min(3)` error, so the `.min(1)` message is unreachable.

♻️ Suggested simplification

```diff
 displayName: z
   .string()
   .trim()
-  .min(1, "Monitor title is required")
-  .min(3, "Monitor title must be at least 3 characters")
+  .min(3, "Monitor title is required (at least 3 characters)")
   .max(120, "Monitor title must be at most 120 characters"),
 name: z
   .string()
   .trim()
-  .min(1, "Monitor identifier is required")
   .regex(/^[a-z0-9-]+$/, "Use lowercase letters, numbers, and hyphens only")
-  .min(3, "Identifier must be at least 3 characters")
+  .min(3, "Monitor identifier is required (at least 3 characters)")
   .max(60, "Identifier must be at most 60 characters"),
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/form/schema.ts` around lines 42 - 54, Remove the redundant .min(1) validators from the zod schema for displayName and name in this file: in the displayName chain (symbol displayName) drop the .min(1, "Monitor title is required") call and in the name chain (symbol name) drop the .min(1, "Monitor identifier is required") call so the existing .min(3) and other validators (trim, max, regex) remain and continue to enforce validation.

console/workspaces/pages/eval/src/subComponents/MonitorTable.tsx (2)
241-256: Remove unused `envId` from route params.

The route template `/org/:orgId/project/:projectId/agents/:agentId/evaluation/monitor/view/:monitorId` doesn't include `:envId`. The extra param is silently ignored by `generatePath`, but it's unnecessary and may mislead future readers.

Suggested fix

```diff
 navigate(
   generatePath(
     absoluteRouteMap.children.org.children.projects.children
       .agents.children.evaluation.children.monitor.children.view
       .path,
     {
       agentId,
       orgId,
       projectId,
-      envId,
       monitorId: monitor.name,
     },
   ),
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/MonitorTable.tsx` around lines 241 - 256, The route generation call using generatePath inside the onClick handler is passing an unnecessary envId param; remove envId from the params object so only the actual route params (agentId, orgId, projectId, monitorId) are passed (refer to the onClick handler, generatePath call, absoluteRouteMap...monitor.children.view.path, and the monitorId value derived from monitor.name).
297-316: Remove unused `envId` from edit route params.

Same as the view path—the edit route `/org/:orgId/project/:projectId/agents/:agentId/evaluation/monitor/edit/:monitorId` doesn't include `:envId`.

Suggested fix

```diff
 <IconButton
   onClick={() =>
     navigate(
       generatePath(
         absoluteRouteMap.children.org.children.projects
           .children.agents.children.evaluation.children
           .monitor.children.edit.path,
         {
           agentId,
           orgId,
           projectId,
-          envId,
           monitorId: monitor.name,
         },
       ),
     )
   }
 >
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/MonitorTable.tsx` around lines 297 - 316, The edit route call is passing an unused envId param; update the IconButton onClick that calls navigate(generatePath(...)) to remove envId from the params object so it matches the route shape used for edit (monitorId, agentId, orgId, projectId and any other required ids) — locate the IconButton onClick block referencing generatePath and absoluteRouteMap.children.org.children.projects.children.agents.children.evaluation.children.monitor.children.edit.path and delete the envId property from the params passed (keep monitorId: monitor.name and the other ids intact).

console/workspaces/pages/eval/src/subComponents/MonitorRunDrawer.tsx (2)
188-191: Use a collision-safe key for evaluator rows.

`key={ev.identifier}` can collide if identifiers repeat (especially with mixed evaluator levels), which can cause incorrect row reconciliation.

♻️ Proposed refactor

```diff
-{(run.evaluators ?? []).map((ev) => (
+{(run.evaluators ?? []).map((ev, index) => (
   <Box
-    key={ev.identifier}
+    key={`${ev.identifier}-${index}`}
     sx={{
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/MonitorRunDrawer.tsx` around lines 188 - 191, The evaluator list mapping in MonitorRunDrawer uses a non-unique key (key={ev.identifier}) which can collide; update the key used in the (run.evaluators ?? []).map(...) render to a collision-safe composite key that uniquely identifies each row (for example combine evaluator level, identifier and the iteration index or another unique property) so React can reconcile rows correctly; locate the map in MonitorRunDrawer and replace the single ev.identifier key with a composite key using ev.level, ev.identifier and index (or a true unique id if available).
133-133: Replace hardcoded pixel sizing with theme tokens in `sx`.

`height: 80` and `minWidth: 120` hardcode dimensions and make theme consistency harder.

♻️ Proposed refactor

```diff
-sx={{ height: 80 }}
+sx={{ height: (theme) => theme.spacing(10) }}
 ...
-sx={{ minWidth: 120 }}
+sx={{ minWidth: (theme) => theme.spacing(15) }}
```

As per coding guidelines: "Use theme tokens via the sx prop instead of hardcoded colors and spacing values."

Also applies to: 220-220
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@console/workspaces/pages/eval/src/subComponents/MonitorRunDrawer.tsx` at line 133, In MonitorRunDrawer's JSX (the sx prop where height: 80 and minWidth: 120 are set) replace the hardcoded numeric px values with theme tokens via the sx function callback: use theme spacing or size tokens for height and minWidth (e.g., derive values from theme.spacing(...) or theme.breakpoints/size tokens) so the component follows the design system and scales with the theme; update both occurrences of height: 80 and minWidth: 120 in MonitorRunDrawer.tsx to use theme-based tokens instead of raw numbers.

agent-manager-service/models/score.go (1)
83-83: Unify `level` validation for individual and aggregated score payloads.

`PublishAggregateItem.Level` is constrained to `trace|agent|llm`, but `PublishScoreItem.Level` currently accepts any non-empty string. Tightening this prevents invalid level values from entering the publish flow.

♻️ Suggested change

```diff
 type PublishScoreItem struct {
 	EvaluatorName  string              `json:"evaluatorName" validate:"required"`
-	Level          string              `json:"level" validate:"required"`
+	Level          string              `json:"level" validate:"required,oneof=trace agent llm"`
 	TraceID        string              `json:"traceId" validate:"required"`
 	Score          *float64            `json:"score,omitempty" validate:"omitempty,min=0,max=1"`
 	Explanation    *string             `json:"explanation,omitempty"`
 	TraceStartTime *time.Time          `json:"traceStartTime,omitempty"`
 	SkipReason     *string             `json:"skipReason,omitempty"`
 	SpanContext    *PublishSpanContext `json:"spanContext,omitempty"`
 }
```

Also applies to: 96-96
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agent-manager-service/models/score.go` at line 83, Update the validation tags on the Level fields to enforce the same allowed values across both payloads: change PublishScoreItem.Level and PublishAggregateItem.Level to use validate:"required,oneof=trace agent llm" so only "trace", "agent", or "llm" are accepted; leave the json tags as-is and ensure any downstream validation uses these struct tags.

agent-manager-service/tests/monitor_scores_test.go (1)
101-101: Add direct test coverage for the new `/scores/breakdown` endpoint.

Line 101 wires the handler, but this suite doesn't yet assert validation/behavior for that path (e.g., required `startTime`, `endTime`, and `level` constraints).
Verify each finding against the current code and only fix it if needed. In `@agent-manager-service/tests/monitor_scores_test.go` at line 101, Add direct test cases for the new "/scores/breakdown" route wired by mux.HandleFunc and served by ctrl.GetGroupedScores: create tests that call GET base+"/scores/breakdown" with missing parameters to assert 400 responses for missing required startTime, endTime, and level; test invalid formats for startTime/endTime and out-of-range or unsupported values for level to assert appropriate validation errors; also add a happy-path test providing valid startTime, endTime, and level to assert 200 and validate the response shape (grouped score structure) returned by GetGroupedScores.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@agent-manager-service/db_migrations/005_create_monitors.go`:
- Line 32: Do not modify the historical migration variable migration005 in
agent-manager-service/db_migrations/005_create_monitors.go; instead create a new
forward migration file (next unused ID) named e.g.
db_migrations/00X_add_monitor_description.go that registers a migration
(migration00X) which runs a transactional ALTER TABLE monitors ADD COLUMN IF NOT
EXISTS description VARCHAR(512) NOT NULL DEFAULT ''; ensure the new migration's
ID is unique and the Migrate function uses runSQL(tx, `ALTER TABLE monitors ADD
COLUMN IF NOT EXISTS description VARCHAR(512) NOT NULL DEFAULT ''`) so existing
databases get the new column without changing past migration 005.
In `@agent-manager-service/docs/api_v1_openapi.yaml`:
- Around line 10012-10033: The OpenAPI schema changed
TraceScoresResponse.monitors to use TraceMonitorGroup (which requires
monitorName, evaluators: TraceEvaluatorScore[], and spans: TraceSpanGroup[]) but
the client types still expect the old MonitorTraceGroup/EvaluatorTraceGroup
shape (with monitorId, runId, and nested scores), causing a breaking contract;
either update the client type definitions to match the new schema (replace
MonitorTraceGroup/EvaluatorTraceGroup usages with TraceMonitorGroup and adjust
fields to monitorName, evaluators, spans and their
TraceEvaluatorScore/TraceSpanGroup shapes) or revert the schema change to
restore monitorId/runId/scores; locate and update the client models and any code
referencing TraceScoresResponse.monitors, MonitorTraceGroup,
EvaluatorTraceGroup, monitorId, runId, and scores so the API contract and client
types are consistent.
In `@console/workspaces/pages/eval/src/subComponents/MonitorTable.tsx`:
- Around line 51-60: The getStatusColor function declares its parameter as
MonitorStatus but callers sometimes pass the string "Unknown"; update
getStatusColor to handle "Unknown" explicitly (e.g., add a case returning
"default" or "warning") and/or widen the parameter type to include "Unknown" so
TypeScript is safe; ensure all call sites that may pass a fallback "Unknown"
value use the updated type or map that fallback before calling getStatusColor.
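A widened signature could look like the sketch below; the `MonitorStatus` union members and the returned color strings are assumptions, since the real definitions live in the console codebase.

```typescript
// Hypothetical MonitorStatus union; the real one lives in the console types package.
type MonitorStatus = "Active" | "Paused" | "Failed";

// Widened to accept the "Unknown" fallback that some call sites pass,
// with an explicit case instead of an unsafe cast.
function getStatusColor(status: MonitorStatus | "Unknown"): string {
  switch (status) {
    case "Active":
      return "success";
    case "Paused":
      return "default";
    case "Failed":
      return "error";
    case "Unknown":
      return "warning"; // explicit fallback keeps the switch exhaustive
  }
}
```

Widening the parameter type (rather than mapping "Unknown" away at each call site) keeps the fallback handling in one place.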
---
Duplicate comments:
In `@agent-manager-service/docs/api_v1_openapi.yaml`:
- Around line 10119-10123: The schema for LabelEvaluatorSummary.mean must allow
null for cases with only skipped evaluations; update the OpenAPI definition for
the property (LabelEvaluatorSummary.mean) to be nullable by adding nullable:
true (or change type to ["number","null"]) while keeping format: double and the
description, so consumers can distinguish undefined means; ensure generated
clients/validators reflect the nullable change.
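A minimal sketch of the nullable property, assuming OpenAPI 3.0 syntax and the field names quoted in the comment:

```yaml
LabelEvaluatorSummary:
  type: object
  properties:
    mean:
      type: number
      format: double
      nullable: true
      description: Mean score; null when all evaluations were skipped.
```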
---
Nitpick comments:
In `@agent-manager-service/models/score.go`:
- Line 83: Update the validation tags on the Level fields to enforce the same
allowed values across both payloads: change PublishScoreItem.Level and
PublishAggregateItem.Level to use validate:"required,oneof=trace agent llm" so
only "trace", "agent", or "llm" are accepted; leave the json tags as-is and
ensure any downstream validation uses these struct tags.
In `@agent-manager-service/tests/monitor_scores_test.go`:
- Line 101: Add direct test cases for the new "/scores/breakdown" route wired by
mux.HandleFunc and served by ctrl.GetGroupedScores: create tests that call GET
base+"/scores/breakdown" with missing parameters to assert 400 responses for
missing required startTime, endTime, and level; test invalid formats for
startTime/endTime and out-of-range or unsupported values for level to assert
appropriate validation errors; also add a happy-path test providing valid
startTime, endTime, and level to assert 200 and validate the response shape
(grouped score structure) returned by GetGroupedScores.
In `@console/workspaces/pages/eval/src/form/schema.ts`:
- Around line 42-54: Remove the redundant .min(1) validators from the zod schema
for displayName and name in this file: in the displayName chain (symbol
displayName) drop the .min(1, "Monitor title is required") call and in the name
chain (symbol name) drop the .min(1, "Monitor identifier is required") call so
the existing .min(3) and other validators (trim, max, regex) remain and continue
to enforce validation.
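To illustrate why the `.min(1)` calls are redundant, here is a dependency-free stand-in for a zod-style check chain (not the real zod API): any string failing a min-length-1 check necessarily fails min-length-3, so `.min(3)` alone still rejects empty and too-short input.

```typescript
// Tiny stand-in for a zod-style validator chain, for illustration only.
// A check returns null on success or an error message on failure.
type Check = (s: string) => string | null;

const minLen =
  (n: number, msg: string): Check =>
  (s) => (s.trim().length >= n ? null : msg);

// Runs checks in order and returns the first failure, like a zod chain.
const runChecks = (s: string, checks: Check[]): string | null => {
  for (const c of checks) {
    const err = c(s);
    if (err) return err;
  }
  return null;
};

// With only the .min(3)-equivalent check, the empty string is still rejected.
const nameChecks = [minLen(3, "Monitor identifier must be at least 3 characters")];
```

The only observable difference after removing `.min(1)` is that empty input now surfaces the `.min(3)` message instead of the "required" message.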
In `@console/workspaces/pages/eval/src/subComponents/MonitorRunDrawer.tsx`:
- Around line 188-191: The evaluator list mapping in MonitorRunDrawer uses a
non-unique key (key={ev.identifier}) which can collide; update the key used in
the (run.evaluators ?? []).map(...) render to a collision-safe composite key
that uniquely identifies each row (for example combine evaluator level,
identifier and the iteration index or another unique property) so React can
reconcile rows correctly; locate the map in MonitorRunDrawer and replace the
single ev.identifier key with a composite key using ev.level, ev.identifier and
index (or a true unique id if available).
- Line 133: In MonitorRunDrawer's JSX (the sx prop where height: 80 and
minWidth: 120 are set) replace the hardcoded numeric px values with theme tokens
via the sx function callback: use theme spacing or size tokens for height and
minWidth (e.g., derive values from theme.spacing(...) or theme.breakpoints/size
tokens) so the component follows the design system and scales with the theme;
update both occurrences of height: 80 and minWidth: 120 in MonitorRunDrawer.tsx
to use theme-based tokens instead of raw numbers.
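The collision-safe composite key suggested for the evaluator rows can be sketched as a small helper; `EvaluatorRow` here is a hypothetical slice of the run payload, not the real type.

```typescript
// Hypothetical shape of one evaluator entry from the run payload.
interface EvaluatorRow {
  level: string;
  identifier: string;
}

// Combines level, identifier, and index so duplicate identifiers at
// different levels (or repeated entries) still yield distinct keys.
const rowKey = (ev: EvaluatorRow, index: number): string =>
  `${ev.level}:${ev.identifier}:${index}`;

// In MonitorRunDrawer this would replace key={ev.identifier}:
//   (run.evaluators ?? []).map((ev, i) => <Row key={rowKey(ev, i)} ... />)
```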
In `@console/workspaces/pages/eval/src/subComponents/MonitorTable.tsx`:
- Around line 241-256: The route generation call using generatePath inside the
onClick handler is passing an unnecessary envId param; remove envId from the
params object so only the actual route params (agentId, orgId, projectId,
monitorId) are passed (refer to the onClick handler, generatePath call,
absoluteRouteMap...monitor.children.view.path, and the monitorId value derived
from monitor.name).
- Around line 297-316: The edit route call is passing an unused envId param;
update the IconButton onClick that calls navigate(generatePath(...)) to remove
envId from the params object so it matches the route shape used for edit
(monitorId, agentId, orgId, projectId and any other required ids) — locate the
IconButton onClick block referencing generatePath and
absoluteRouteMap.children.org.children.projects.children.agents.children.evaluation.children.monitor.children.edit.path
and delete the envId property from the params passed (keep monitorId:
monitor.name and the other ids intact).
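Both `generatePath` fixes amount to building a params object without `envId`. A hypothetical helper (names assumed from the comment) makes the intended shape explicit:

```typescript
// Route params the monitor view/edit paths actually declare (assumed).
interface MonitorRouteParams {
  agentId: string;
  orgId: string;
  projectId: string;
  monitorId: string;
}

// Builds the params for generatePath, deliberately omitting envId
// since the route template has no :envId segment.
const toMonitorParams = (
  ctx: { agentId: string; orgId: string; projectId: string; envId?: string },
  monitorName: string,
): MonitorRouteParams => ({
  agentId: ctx.agentId,
  orgId: ctx.orgId,
  projectId: ctx.projectId,
  monitorId: monitorName, // monitorId is derived from monitor.name
});
```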
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a5ec79cf-f6f4-49ef-b0f9-34b0e1f0d5e7
⛔ Files ignored due to path filters (1)
console/common/config/rush/pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (23)
- agent-manager-service/api/monitor_routes.go
- agent-manager-service/controllers/evaluator_controller.go
- agent-manager-service/controllers/monitor_controller.go
- agent-manager-service/controllers/monitor_scores_controller.go
- agent-manager-service/controllers/monitor_scores_publisher_controller.go
- agent-manager-service/db_migrations/005_create_monitors.go
- agent-manager-service/docs/api_v1_openapi.yaml
- agent-manager-service/models/monitor.go
- agent-manager-service/models/score.go
- agent-manager-service/services/monitor_manager.go
- agent-manager-service/spec/api_default.go
- agent-manager-service/spec/model_create_monitor_request.go
- agent-manager-service/spec/model_monitor_response.go
- agent-manager-service/tests/monitor_scores_test.go
- agent-manager-service/utils/constants.go
- agent-manager-service/utils/makeresults.go
- console/workspaces/libs/api-client/src/apis/monitors.ts
- console/workspaces/libs/types/src/api/monitors.ts
- console/workspaces/pages/eval/src/CreateMonitor.Component.tsx
- console/workspaces/pages/eval/src/form/schema.ts
- console/workspaces/pages/eval/src/subComponents/MonitorRunDrawer.tsx
- console/workspaces/pages/eval/src/subComponents/MonitorTable.tsx
- console/workspaces/pages/eval/src/subComponents/ScoreBreakdownCard.tsx
🚧 Files skipped from review as they are similar to previous changes (3)
- console/workspaces/libs/api-client/src/apis/monitors.ts
- agent-manager-service/api/monitor_routes.go
- console/workspaces/pages/eval/src/CreateMonitor.Component.tsx
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@agent-manager-service/spec/model_trace_scores_response.go`:
- Line 24: Rewrite ConvertToTraceScoresResponse in
agent-manager-service/utils/makeresults.go so it constructs
spec.TraceMonitorGroup instances (not the removed
spec.MonitorTraceGroup/spec.EvaluatorTraceGroup types): iterate your existing
model monitors and for each create a spec.TraceMonitorGroup, omit MonitorId and
RunId (they no longer exist), convert the model's evaluator entries into
spec.TraceEvaluatorScore items (map fields accordingly to the new
TraceEvaluatorScore shape), and populate a new Spans field by mapping model span
data into spec.TraceSpanGroup slices; ensure all references to
spec.MonitorTraceGroup and spec.EvaluatorTraceGroup are removed and replaced
with spec.TraceMonitorGroup, spec.TraceEvaluatorScore and spec.TraceSpanGroup
mappings inside ConvertToTraceScoresResponse.
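For reference when aligning the client types mentioned earlier, the new response shape can be sketched in TypeScript terms; field details beyond the names quoted in the review are assumptions, not the generated types.

```typescript
// Assumed shapes based on the schema names in the review.
interface TraceEvaluatorScore {
  evaluatorName: string;
  score: number | null; // null when the evaluation was skipped
}

interface TraceSpanGroup {
  spanLabel: string;
  evaluators: TraceEvaluatorScore[];
}

interface TraceMonitorGroup {
  monitorName: string; // monitorId and runId no longer exist here
  evaluators: TraceEvaluatorScore[];
  spans: TraceSpanGroup[];
}
```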
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2ab0b20d-f712-44c7-b29b-680d219aa838
📒 Files selected for processing (7)
- agent-manager-service/spec/model_evaluator_trace_group.go
- agent-manager-service/spec/model_monitor_trace_group.go
- agent-manager-service/spec/model_score_item.go
- agent-manager-service/spec/model_trace_evaluator_score.go
- agent-manager-service/spec/model_trace_monitor_group.go
- agent-manager-service/spec/model_trace_scores_response.go
- agent-manager-service/spec/model_trace_span_group.go
💤 Files with no reviewable changes (3)
- agent-manager-service/spec/model_monitor_trace_group.go
- agent-manager-service/spec/model_evaluator_trace_group.go
- agent-manager-service/spec/model_score_item.go
Force-pushed f57d083 to 8d48211
Summary
Test plan
make dev-test passes for agent-manager-service
Resolves #470
Summary by CodeRabbit
New Features
API Updates
displayName → evaluatorName, traceTimestamp → traceStartTime for consistency.