
fix(tests): reduce flakiness in LLM-dependent tests #171

Merged
bsbodden merged 3 commits into main from fix/flaky-llm-tests
Feb 26, 2026
Conversation

Collaborator

@bsbodden bsbodden commented Feb 26, 2026

Summary

  • Add extract_with_retry() helper that retries LLM extraction up to 3 times before pytest.skip(), handling non-deterministic empty results
  • Set temperature=0.0 on LLMContextualGroundingJudge and MemoryExtractionJudge for consistent scoring
  • Re-enable previously skipped test_multi_entity_conversation and test_judge_comprehensive_grounding_evaluation

Fixes #74, #54, #48

Test plan

  • uv run pre-commit run --all-files passes
  • uv run pytest tests/test_contextual_grounding_integration.py tests/test_thread_aware_grounding.py tests/test_llm_judge_evaluation.py -v --run-api-tests — previously-skipped tests now run
  • uv run pytest — all tests pass

Note

Medium Risk
Medium risk because it changes behavior of LLM-integration tests (retries, skipping, and re-enabling previously skipped tests), which may affect CI stability and runtime depending on external LLM variability.

Overview
Reduces flakiness in LLM-dependent tests by adding extract_with_retry() (retries thread-aware extraction and skips the test if results stay empty) and switching several tests to use it instead of single-shot extraction/assertions.

Makes LLM-as-a-judge scoring more deterministic by threading a temperature parameter (default 0.0) through LLMContextualGroundingJudge and MemoryExtractionJudge calls.

Re-enables previously skipped LLM tests (test_multi_entity_conversation and the comprehensive judge grounding test) and adds timeout-based skipping to avoid failures from external service timeouts.

Written by Cursor Bugbot for commit 8c4fccc.

Add extraction retry helper to handle non-deterministic empty results
from LLM extraction, and set temperature=0.0 on LLM judge calls for
consistent scoring. Re-enables previously skipped tests.

Fixes #74, #54, #48
Copilot AI review requested due to automatic review settings February 26, 2026 15:43

@github-actions github-actions Bot left a comment


The PR effectively addresses flaky LLM-dependent tests through retry logic and deterministic temperature settings. The changes are well-structured and should significantly reduce test flakiness.


🤖 Automated review complete.

Comment thread tests/test_contextual_grounding_integration.py Outdated
Comment thread tests/test_contextual_grounding_integration.py Outdated

jit-ci Bot commented Feb 26, 2026

🛡️ Jit Security Scan Results


✅ No security findings were detected in this PR


Security scan by Jit

Contributor

Copilot AI left a comment


Pull request overview

This PR reduces flakiness in LLM-dependent test suites by adding retry/skip behavior around memory extraction calls and making LLM-as-a-judge scoring more deterministic via temperature=0.0, while re-enabling previously skipped flaky tests.

Changes:

  • Added an extract_with_retry() helper in LLM-dependent tests to retry extraction up to 3 times before skipping when results are empty.
  • Updated LLMContextualGroundingJudge and MemoryExtractionJudge to accept a temperature parameter (default 0.0) and pass it into LLMClient.create_chat_completion().
  • Re-enabled test_multi_entity_conversation and test_judge_comprehensive_grounding_evaluation.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
tests/test_thread_aware_grounding.py Adds extraction retry helper and re-enables multi-entity grounding test using it.
tests/test_contextual_grounding_integration.py Adds extraction retry helper and uses it across grounding integration tests; sets judge temperature for determinism.
tests/test_llm_judge_evaluation.py Sets extraction judge temperature and re-enables comprehensive grounding judge test.
Comments suppressed due to low confidence (1)

tests/test_llm_judge_evaluation.py:392

  • test_judge_comprehensive_grounding_evaluation doesn’t call skip_if_timeout(evaluation) like the other judge tests. Because LLMContextualGroundingJudge.evaluate_grounding() returns default mid-range scores on timeout, this test can silently pass when the LLM call fails. Add skip_if_timeout(evaluation) after the evaluation call (before assertions) to keep behavior consistent and avoid false positives.
    async def test_judge_comprehensive_grounding_evaluation(self):
        """Test LLM judge on complex example with multiple grounding types"""

        judge = LLMContextualGroundingJudge()



Comment thread tests/test_thread_aware_grounding.py Outdated
Comment thread tests/test_contextual_grounding_integration.py Outdated

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Comment thread tests/test_llm_judge_evaluation.py
- Move extract_with_retry() to conftest.py to avoid duplication
- Remove redundant len assertions (extract_with_retry guarantees >= 1)
- Add skip_if_timeout() to test_judge_comprehensive_grounding_evaluation
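The `skip_if_timeout()` guard added in this commit might be sketched as below. This assumes the evaluation result carries an explicit `timed_out` flag; as the Copilot review notes, the real detection may instead key off the judge's default mid-range scores:

```python
import pytest


def skip_if_timeout(evaluation: dict) -> None:
    # Hypothetical sketch: skip the current test instead of asserting
    # against the placeholder scores a timed-out judge call would
    # produce, which would otherwise let a failed LLM call pass silently.
    if evaluation.get("timed_out"):
        pytest.skip("LLM judge call timed out; skipping to avoid a false result")
```

Calling it immediately after the judge call, before any score assertions, keeps the re-enabled tests consistent with the other judge tests.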
@bsbodden bsbodden self-assigned this Feb 26, 2026
Contributor

@vishal-bala vishal-bala left a comment


LGTM 👍

Comment thread tests/conftest.py
Address review feedback — callers can now opt into receiving empty
results instead of auto-skipping, useful for tests that assert
extraction correctly returns nothing.
@bsbodden bsbodden merged commit b160585 into main Feb 26, 2026
23 checks passed
@bsbodden bsbodden deleted the fix/flaky-llm-tests branch February 26, 2026 18:56


Development

Successfully merging this pull request may close these issues.

Flaky test: test_temporal_grounding_integration_last_year

3 participants