fix: make assistant E2E tests model-agnostic and upgrade to qwen3 #498

Merged
peppescg merged 2 commits into main from fix/assistant-e2e-flaky-qwen3 on Apr 30, 2026

@peppescg
Collaborator

Summary

  • Replace two flaky assistant E2E tests with a single deterministic arithmetic test ("What is 1 + 1?" → check response contains "2")
  • Increase beforeAll warmup timeout from 30s to 120s — qwen3:1.7b with thinking mode takes ~40s for first inference on CI runners
  • Upgrade E2E model from qwen2.5:1.5b to qwen3:1.7b in workflow and all source file defaults

Root cause

The previous tests asserted on specific LLM output patterns:

  • /(hello|hi|hey|greetings).*username/i — expected the model to greet with a specific username
  • /\b1\s+2\s+3\b/ — expected sequential numbers separated by whitespace

These patterns break with models that have thinking mode (qwen3 prepends <think>...</think> blocks) or different output formatting. Additionally, test.slow() triples test timeouts but not beforeAll hooks, so the 30s default was too short for qwen3 model loading.
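The formatting sensitivity described above can be reproduced without a model. The sketch below is illustrative only: the reply strings are hand-written samples rather than captured model output, and `stripThinkBlocks` and `containsStandaloneTwo` are hypothetical helpers that do not appear in the PR.

```typescript
// Hypothetical replies, written by hand to illustrate the failure mode.
const plainReply = "1 2 3";
const thinkingReply =
  "<think>\nThe user wants me to count to three.\n</think>\n\n1, 2, 3";
const arithmeticReply =
  "<think>\n1 + 1 equals 2.\n</think>\n\nThe answer is 2.";

// The old assertion quoted in the PR description.
const sequentialNumbers = /\b1\s+2\s+3\b/;

// Hypothetical helper: drop <think>...</think> blocks before inspecting a reply.
function stripThinkBlocks(reply: string): string {
  return reply.replace(/<think>[\s\S]*?<\/think>/g, "").trim();
}

// New-style check: a standalone "2" anywhere in the reply.
function containsStandaloneTwo(reply: string): boolean {
  return /\b2\b/.test(reply);
}

console.log(sequentialNumbers.test(plainReply)); // true
console.log(sequentialNumbers.test(thinkingReply)); // false: commas break \s+
console.log(containsStandaloneTwo(stripThinkBlocks(arithmeticReply))); // true
```

The old pattern depends on the exact separator between the numbers, while the arithmetic check only needs the digit to appear as its own token, so it tolerates punctuation, surrounding prose, and a leading thinking block.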

Changes

| File | Change |
| --- | --- |
| tests/e2e/assistant.spec.ts | Replace 2 flaky tests with 1 arithmetic test; add 120s beforeAll timeout |
| playwright.config.mts | Default qwen2.5:1.5b → qwen3:1.7b |
| src/app/api/chat/route.ts | Default qwen2.5:1.5b → qwen3:1.7b |
| .github/workflows/e2e.yml | Upgrade model + cache key |

Test plan

  • CI E2E tests pass with qwen3:1.7b
  • Verified in enterprise repo (stacklok/stacklok-enterprise-platform#745 — same fix, CI green)

🤖 Generated with Claude Code

The assistant E2E tests were flaky because they asserted on specific
LLM output patterns (regex for sequential numbers, greeting with
username). This breaks with models that have thinking mode (qwen3)
or different output formatting.

Changes:
- Replace two flaky tests with a single deterministic arithmetic
  test ("What is 1 + 1?" → check response contains "2")
- Increase beforeAll warmup timeout from 30s to 120s — qwen3 with
  thinking mode takes ~40s for first inference on CI runners
- Upgrade E2E model from qwen2.5:1.5b to qwen3:1.7b in workflow
  and all source file defaults

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 30, 2026 09:14
@github-actions github-actions Bot added the size/XS Extra small PR: < 100 lines changed label Apr 30, 2026
@peppescg peppescg self-assigned this Apr 30, 2026

Copilot AI left a comment

Pull request overview

This PR updates the assistant E2E testing setup to be less sensitive to model-specific output formatting and switches the default E2E Ollama model to qwen3:1.7b, aligning local defaults, runtime defaults (E2E mode), and CI configuration.

Changes:

  • Replaces two output-pattern-based assistant E2E tests with a single arithmetic-based assertion.
  • Extends assistant E2E warmup timeout to accommodate slower first inference on CI.
  • Updates default E2E model to qwen3:1.7b across Playwright config, API route E2E default, and GitHub Actions workflow (including cache key).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/e2e/assistant.spec.ts | Consolidates flaky assistant tests into a single arithmetic check and adjusts warmup behavior/timeouts. |
| src/app/api/chat/route.ts | Updates the default E2E Ollama model ID used when USE_E2E_MODEL=true. |
| playwright.config.mts | Updates Playwright webServer env default E2E_MODEL_NAME to qwen3:1.7b. |
| .github/workflows/e2e.yml | Switches CI E2E model pull + env to qwen3:1.7b and updates the Ollama cache key. |

Comment thread tests/e2e/assistant.spec.ts
Comment thread tests/e2e/assistant.spec.ts
@peppescg
Collaborator Author

security fix on #497

- Use testInfo.setTimeout in beforeAll with biome-ignore for the
  required empty destructuring (Playwright 1.58 does not support
  the { timeout } option on beforeAll)
- Scope the /\b2\b/ assertion to the assistant sidebar
  (data-side="right") to avoid false positives from unrelated page
  content like pagination or counts
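
To see why scoping the /\b2\b/ assertion matters, here is a minimal sketch that models the page as plain strings. It does not use the Playwright API: the `PageText` type, the region names, and the sample text are all assumptions made for illustration. In the real test, the equivalent step would be restricting the check to the element marked data-side="right".

```typescript
// Hypothetical page model: the visible text of each region. Not a Playwright API.
type PageText = { main: string; assistantSidebar: string };

const standaloneTwo = /\b2\b/;

// Unscoped check: looks at everything rendered on the page.
function pageWideMatch(page: PageText): boolean {
  return standaloneTwo.test(`${page.main}\n${page.assistantSidebar}`);
}

// Scoped check: looks only at the assistant sidebar.
function sidebarMatch(page: PageText): boolean {
  return standaloneTwo.test(page.assistantSidebar);
}

// Hand-written sample states: before and after the assistant replies.
const beforeReply: PageText = {
  main: "Page 2 of 5",
  assistantSidebar: "Thinking...",
};
const afterReply: PageText = {
  main: "Page 2 of 5",
  assistantSidebar: "1 + 1 = 2",
};

console.log(pageWideMatch(beforeReply)); // true: false positive from pagination
console.log(sidebarMatch(beforeReply)); // false: no answer yet
console.log(sidebarMatch(afterReply)); // true: the answer is in the sidebar
```

An unscoped assertion can pass before the assistant has answered at all, which would hide exactly the regressions the test is meant to catch.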

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size/XS Extra small PR: < 100 lines changed and removed size/XS Extra small PR: < 100 lines changed labels Apr 30, 2026
@peppescg peppescg merged commit f61c1f3 into main Apr 30, 2026
8 of 10 checks passed
@peppescg peppescg deleted the fix/assistant-e2e-flaky-qwen3 branch April 30, 2026 10:53
Labels

size/XS Extra small PR: < 100 lines changed

3 participants