fix(tests): reduce flakiness in LLM-dependent tests #171
Conversation
The PR effectively addresses flaky LLM-dependent tests through retry logic and deterministic temperature settings. The changes are well-structured and should significantly reduce test flakiness.
🛡️ Jit Security Scan Results: ✅ No security findings were detected in this PR.
Pull request overview
This PR reduces flakiness in LLM-dependent test suites by adding retry/skip behavior around memory extraction calls and making LLM-as-a-judge scoring more deterministic via temperature=0.0, while re-enabling previously skipped flaky tests.
Changes:
- Added an `extract_with_retry()` helper in LLM-dependent tests to retry extraction up to 3 times before skipping when results are empty.
- Updated `LLMContextualGroundingJudge` and `MemoryExtractionJudge` to accept a `temperature` parameter (default `0.0`) and pass it into `LLMClient.create_chat_completion()`.
- Re-enabled `test_multi_entity_conversation` and `test_judge_comprehensive_grounding_evaluation`.
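The retry-then-skip pattern described above can be sketched roughly as follows. The exact signature of the real helper and of the extraction callable are assumptions; only the "retry up to 3 times, then `pytest.skip()` on empty results" behavior comes from this PR.

```python
import pytest


async def extract_with_retry(extract_fn, *args, retries=3, **kwargs):
    """Retry a flaky async LLM extraction call; skip the test if it stays empty.

    `extract_fn` is any awaitable extraction callable (hypothetical signature);
    the real helper in this PR lives in the affected test modules.
    """
    for _ in range(retries):
        results = await extract_fn(*args, **kwargs)
        if results:  # non-empty extraction: success
            return results
    # All attempts came back empty: treat it as LLM non-determinism,
    # not a test failure, and skip instead of asserting.
    pytest.skip(f"extraction returned empty results after {retries} attempts")
```

Skipping (rather than failing) keeps genuinely non-deterministic empty responses from turning CI red, at the cost of occasionally not exercising the assertions.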
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `tests/test_thread_aware_grounding.py` | Adds extraction retry helper and re-enables the multi-entity grounding test using it. |
| `tests/test_contextual_grounding_integration.py` | Adds extraction retry helper and uses it across grounding integration tests; sets judge temperature for determinism. |
| `tests/test_llm_judge_evaluation.py` | Sets extraction judge temperature and re-enables the comprehensive grounding judge test. |
Comments suppressed due to low confidence (1)
tests/test_llm_judge_evaluation.py:392
`test_judge_comprehensive_grounding_evaluation` doesn't call `skip_if_timeout(evaluation)` like the other judge tests. Because `LLMContextualGroundingJudge.evaluate_grounding()` returns default mid-range scores on timeout, this test can silently pass when the LLM call fails. Add `skip_if_timeout(evaluation)` after the evaluation call (before the assertions) to keep behavior consistent and avoid false positives.
```python
async def test_judge_comprehensive_grounding_evaluation(self):
    """Test LLM judge on complex example with multiple grounding types"""
    judge = LLMContextualGroundingJudge()
```
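A minimal sketch of the guard the reviewer is asking for. How the evaluation object signals a timeout is an assumption here (a hypothetical `timed_out` key on a dict-like result); the real tests' helper may detect it differently.

```python
import pytest


def skip_if_timeout(evaluation):
    """Skip the test instead of asserting against placeholder scores.

    Assumes the judge marks timed-out evaluations via a `timed_out`
    flag (hypothetical field name). Default mid-range scores produced
    on timeout would otherwise let assertions pass vacuously.
    """
    if evaluation.get("timed_out", False):
        pytest.skip("LLM judge call timed out; default scores are not meaningful")
```

Placed immediately after the `evaluate_grounding()` call, this turns an external-service timeout into a skip rather than a false positive.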
Cursor Bugbot has reviewed your changes and found 1 potential issue.
- Move `extract_with_retry()` to `conftest.py` to avoid duplication
- Remove redundant `len` assertions (`extract_with_retry` guarantees >= 1)
- Add `skip_if_timeout()` to `test_judge_comprehensive_grounding_evaluation`
Address review feedback: callers can now opt into receiving empty results instead of auto-skipping, which is useful for tests that assert extraction correctly returns nothing.
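The opt-out described above could look like the following sketch. The `allow_empty` parameter name and the helper's signature are assumptions; only the behavior (return empty results to the caller instead of skipping) comes from the review reply.

```python
import pytest


async def extract_with_retry(extract_fn, *args, retries=3, allow_empty=False, **kwargs):
    """Retry extraction; on persistent emptiness, either return or skip.

    Hypothetical signature sketching the opt-out: with allow_empty=True,
    a test that expects no extracted memories gets the empty list back
    and can assert on it directly.
    """
    results = []
    for _ in range(retries):
        results = await extract_fn(*args, **kwargs)
        if results:
            return results
    if allow_empty:
        # Caller explicitly wants to assert that nothing was extracted.
        return results
    pytest.skip(f"extraction returned empty results after {retries} attempts")
```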
Summary
- `extract_with_retry()` helper that retries LLM extraction up to 3 times before `pytest.skip()`, handling non-deterministic empty results
- `temperature=0.0` on `LLMContextualGroundingJudge` and `MemoryExtractionJudge` for consistent scoring
- Re-enabled `test_multi_entity_conversation` and `test_judge_comprehensive_grounding_evaluation`

Fixes #74, #54, #48
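The temperature threading for the judges can be sketched as below. The constructor shape, the client injection, and the message format are assumptions; only the `temperature` parameter with a `0.0` default, passed through to `LLMClient.create_chat_completion()`, is described by this PR.

```python
class LLMContextualGroundingJudge:
    """Sketch of the temperature threading described in this PR.

    The real class's other parameters and the LLMClient API details
    are assumptions; only the 0.0 default comes from the PR summary.
    """

    def __init__(self, client, temperature: float = 0.0):
        self.client = client
        # temperature=0.0 makes judge scoring (near-)deterministic
        self.temperature = temperature

    async def evaluate_grounding(self, prompt: str):
        # Pass the configured temperature through to the chat completion call.
        return await self.client.create_chat_completion(
            messages=[{"role": "user", "content": prompt}],
            temperature=self.temperature,
        )
```

Defaulting to `0.0` at the judge level means every test gets deterministic scoring without touching call sites, while still allowing an override where sampling diversity is wanted.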
Test plan
- `uv run pre-commit run --all-files` passes
- `uv run pytest tests/test_contextual_grounding_integration.py tests/test_thread_aware_grounding.py tests/test_llm_judge_evaluation.py -v --run-api-tests` — previously-skipped tests now run
- `uv run pytest` — all tests pass

Note
Medium Risk
Medium risk because it changes behavior of LLM-integration tests (retries, skipping, and re-enabling previously skipped tests), which may affect CI stability and runtime depending on external LLM variability.
Overview
Reduces flakiness in LLM-dependent tests by adding `extract_with_retry()` (retries thread-aware extraction and skips the test if results stay empty) and switching several tests to use it instead of single-shot extraction/assertions.

Makes LLM-as-a-judge scoring more deterministic by threading a `temperature` parameter (default `0.0`) through `LLMContextualGroundingJudge` and `MemoryExtractionJudge` calls.

Re-enables previously skipped LLM tests (`test_multi_entity_conversation` and the comprehensive judge grounding test) and adds timeout-based skipping to avoid failures from external service timeouts.

Written by Cursor Bugbot for commit 8c4fccc.