fix: make assistant E2E tests model-agnostic and upgrade to qwen3 by peppescg · Pull Request #498 · stacklok/toolhive-cloud-ui

peppescg · 2026-04-30T09:14:03Z

Summary

- Replace two flaky assistant E2E tests with a single deterministic arithmetic test ("What is 1 + 1?" → check response contains "2")
- Increase the `beforeAll` warmup timeout from 30s to 120s: qwen3:1.7b with thinking mode takes ~40s for first inference on CI runners
- Upgrade the E2E model from qwen2.5:1.5b to qwen3:1.7b in the workflow and all source file defaults

Root cause

The previous tests asserted on specific LLM output patterns:

- `/(hello|hi|hey|greetings).*username/i`: expected the model to greet with a specific username
- `/\b1\s+2\s+3\b/`: expected sequential numbers separated by whitespace

These patterns break with models that have thinking mode (qwen3 prepends <think>...</think> blocks) or different output formatting. Additionally, test.slow() triples test timeouts but not beforeAll hooks, so the 30s default was too short for qwen3 model loading.
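
As a rough illustration of the failure mode, the sketch below runs the two quoted patterns against a few made-up responses. The sample strings and the `matches` helper are hypothetical and were not taken from real model output; only the regular expressions come from this PR.

```typescript
// Patterns quoted in the PR description.
const greetingPattern = /(hello|hi|hey|greetings).*username/i;
const sequencePattern = /\b1\s+2\s+3\b/;
// Looser pattern mentioned in the follow-up commit for the arithmetic test.
const answerPattern = /\b2\b/;

// Hypothetical responses, invented for illustration only.
const samples = {
  plainSequence: "1 2 3",
  // Same numbers, but comma-separated and preceded by a thinking block.
  commaSequence: "<think>\nThe user wants me to count.\n</think>\n\n1, 2, 3",
  plainGreeting: "Hello username!",
  // Greeting and name end up on different lines; `.` does not cross newlines.
  splitGreeting:
    "<think>\nI should be polite.\n</think>\n\nHello!\nNice to meet you, username.",
  arithmetic: "<think>\nSimple addition.\n</think>\n\n1 + 1 = 2",
};

// Small wrapper so each check reads the same way.
function matches(pattern: RegExp, text: string): boolean {
  return pattern.test(text);
}

const report = {
  plainSequence: matches(sequencePattern, samples.plainSequence),
  commaSequence: matches(sequencePattern, samples.commaSequence),
  plainGreeting: matches(greetingPattern, samples.plainGreeting),
  splitGreeting: matches(greetingPattern, samples.splitGreeting),
  arithmetic: matches(answerPattern, samples.arithmetic),
};
console.log(report);
```

The strict patterns accept only one layout of an otherwise reasonable answer, while the single-token check tolerates a thinking block and surrounding text.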

Changes

| File | Change |
| --- | --- |
| `tests/e2e/assistant.spec.ts` | Replace 2 flaky tests with 1 arithmetic test; add 120s `beforeAll` timeout |
| `playwright.config.mts` | Default `qwen2.5:1.5b` → `qwen3:1.7b` |
| `src/app/api/chat/route.ts` | Default `qwen2.5:1.5b` → `qwen3:1.7b` |
| `.github/workflows/e2e.yml` | Upgrade model + cache key |

Test plan

- CI E2E tests pass with qwen3:1.7b
- Verified in enterprise repo (stacklok/stacklok-enterprise-platform#745: same fix, CI green)

🤖 Generated with Claude Code

The assistant E2E tests were flaky because they asserted on specific LLM output patterns (regex for sequential numbers, greeting with username). This breaks with models that have thinking mode (qwen3) or different output formatting.

Changes:
- Replace two flaky tests with a single deterministic arithmetic test ("What is 1 + 1?" → check response contains "2")
- Increase beforeAll warmup timeout from 30s to 120s: qwen3 with thinking mode takes ~40s for first inference on CI runners
- Upgrade E2E model from qwen2.5:1.5b to qwen3:1.7b in workflow and all source file defaults

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR updates the assistant E2E testing setup to be less sensitive to model-specific output formatting and switches the default E2E Ollama model to qwen3:1.7b, aligning local defaults, runtime defaults (E2E mode), and CI configuration.

Changes:

- Replaces two output-pattern-based assistant E2E tests with a single arithmetic-based assertion.
- Extends assistant E2E warmup timeout to accommodate slower first inference on CI.
- Updates default E2E model to qwen3:1.7b across Playwright config, API route E2E default, and GitHub Actions workflow (including cache key).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `tests/e2e/assistant.spec.ts` | Consolidates flaky assistant tests into a single arithmetic check and adjusts warmup behavior/timeouts. |
| `src/app/api/chat/route.ts` | Updates the default E2E Ollama model ID used when `USE_E2E_MODEL=true`. |
| `playwright.config.mts` | Updates Playwright webServer env default `E2E_MODEL_NAME` to `qwen3:1.7b`. |
| `.github/workflows/e2e.yml` | Switches CI E2E model pull + env to `qwen3:1.7b` and updates the Ollama cache key. |
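
To make the default-model change concrete, here is a minimal sketch of how such a fallback could be resolved. The environment variable names `USE_E2E_MODEL` and `E2E_MODEL_NAME` and the model ID come from the review above; the `resolveE2EModel` function, the `Env` type, and the exact precedence are assumptions for illustration, not the code in `route.ts` or `playwright.config.mts`.

```typescript
// Default model ID after this PR (previously "qwen2.5:1.5b").
const DEFAULT_E2E_MODEL = "qwen3:1.7b";

// Shape of an environment map such as process.env.
type Env = Record<string, string | undefined>;

// Hypothetical helper: returns the model ID to use in E2E mode,
// or undefined when E2E mode is not enabled.
function resolveE2EModel(env: Env): string | undefined {
  if (env.USE_E2E_MODEL !== "true") {
    return undefined;
  }
  // An explicit, non-empty E2E_MODEL_NAME wins; otherwise fall back to the default.
  return env.E2E_MODEL_NAME || DEFAULT_E2E_MODEL;
}

console.log(resolveE2EModel({ USE_E2E_MODEL: "true" }));
console.log(resolveE2EModel({ USE_E2E_MODEL: "true", E2E_MODEL_NAME: "qwen2.5:1.5b" }));
```

Keeping the default in one constant would mean a future model upgrade touches a single line per file, though the actual files in this PR each carry their own default.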

peppescg · 2026-04-30T10:34:07Z

security fix on #497

- Use testInfo.setTimeout in beforeAll with biome-ignore for the required empty destructuring (Playwright 1.58 does not support the { timeout } option on beforeAll)
- Scope the /\b2\b/ assertion to the assistant sidebar (data-side="right") to avoid false positives from unrelated page content like pagination or counts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
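
The two points in this commit can be sketched as follows. This is a standalone approximation, not the contents of `assistant.spec.ts`: `TestInfoLike`, `warmUp`, and `sidebarHasAnswer` are hypothetical names, and the Playwright wiring appears only in comments so the block runs without the test runner installed.

```typescript
// Local stand-in for the single TestInfo method used here. Playwright's
// TestInfo exposes setTimeout; this type only keeps the sketch standalone.
interface TestInfoLike {
  setTimeout(timeoutMs: number): void;
}

const WARMUP_TIMEOUT_MS = 120_000; // 120s, as stated in the PR summary
const SIDEBAR_SELECTOR = '[data-side="right"]'; // attribute named in the commit message
const ANSWER_PATTERN = /\b2\b/;

// Hypothetical helper: raise the hook timeout first, then run the warmup request.
async function warmUp(
  testInfo: TestInfoLike,
  sendWarmupPrompt: () => Promise<void>,
): Promise<void> {
  testInfo.setTimeout(WARMUP_TIMEOUT_MS);
  await sendWarmupPrompt();
}

// Hypothetical helper: the check is applied to sidebar text only.
function sidebarHasAnswer(sidebarText: string): boolean {
  return ANSWER_PATTERN.test(sidebarText);
}

// Possible wiring in a Playwright spec (assumed, not executed here):
//
//   // biome-ignore lint/correctness/noEmptyPattern: Playwright needs the destructuring
//   test.beforeAll(async ({}, testInfo) => {
//     await warmUp(testInfo, async () => { /* send a first prompt to the model */ });
//   });
//
//   await expect(page.locator(SIDEBAR_SELECTOR)).toContainText(ANSWER_PATTERN);

// Why scoping matters: unrelated page text can also contain a standalone "2".
console.log(ANSWER_PATTERN.test("Page 2 of 5"), sidebarHasAnswer("1 + 1 = 2"));
```

Matching against the whole page would let pagination text such as "Page 2 of 5" satisfy the assertion, which is the false positive the scoped locator is meant to avoid.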

Copilot AI review requested due to automatic review settings April 30, 2026 09:14

github-actions Bot added the size/XS Extra small PR: < 100 lines changed label Apr 30, 2026

Copilot started reviewing on behalf of peppescg April 30, 2026 09:14

peppescg self-assigned this Apr 30, 2026

Copilot AI reviewed Apr 30, 2026

samuv approved these changes Apr 30, 2026

peppescg merged commit f61c1f3 into main Apr 30, 2026
8 of 10 checks passed

peppescg deleted the fix/assistant-e2e-flaky-qwen3 branch April 30, 2026 10:53

claude Bot mentioned this pull request Apr 30, 2026

Update stacklok/toolhive-cloud-ui to v0.6.0 stacklok/docs-website#834

peppescg merged 2 commits into main from fix/assistant-e2e-flaky-qwen3
