fix(tests): reduce flakiness in LLM-dependent tests #171
Conversation
The PR effectively addresses flaky LLM-dependent tests through retry logic and deterministic temperature settings. The changes are well-structured and should significantly reduce test flakiness.
🛡️ Jit Security Scan Results: ✅ No security findings were detected in this PR.
Pull request overview
This PR reduces flakiness in LLM-dependent test suites by adding retry/skip behavior around memory extraction calls and making LLM-as-a-judge scoring more deterministic via temperature=0.0, while re-enabling previously skipped flaky tests.
Changes:
- Added an `extract_with_retry()` helper in LLM-dependent tests to retry extraction up to 3 times before skipping when results are empty.
- Updated `LLMContextualGroundingJudge` and `MemoryExtractionJudge` to accept a `temperature` parameter (default `0.0`) and pass it into `LLMClient.create_chat_completion()`.
- Re-enabled `test_multi_entity_conversation` and `test_judge_comprehensive_grounding_evaluation`.
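The retry-then-skip pattern described above can be sketched roughly as follows. The exact signature of the real helper and of the extraction callable are assumptions; only the "retry up to 3 times, then `pytest.skip()` on empty results" behavior comes from this PR.

```python
import pytest


async def extract_with_retry(extract_fn, *args, retries=3, **kwargs):
    """Retry a flaky async LLM extraction call; skip the test if it stays empty.

    `extract_fn` is any awaitable extraction callable (hypothetical signature);
    the real helper in this PR lives in the affected test modules.
    """
    for _ in range(retries):
        results = await extract_fn(*args, **kwargs)
        if results:  # non-empty extraction: success
            return results
    # All attempts came back empty: treat it as LLM non-determinism,
    # not a test failure, and skip instead of asserting.
    pytest.skip(f"extraction returned empty results after {retries} attempts")
```

Skipping (rather than failing) keeps genuinely non-deterministic empty responses from turning CI red, at the cost of occasionally not exercising the assertions.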
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `tests/test_thread_aware_grounding.py` | Adds extraction retry helper and re-enables the multi-entity grounding test using it. |
| `tests/test_contextual_grounding_integration.py` | Adds extraction retry helper and uses it across grounding integration tests; sets judge temperature for determinism. |
| `tests/test_llm_judge_evaluation.py` | Sets extraction judge temperature and re-enables the comprehensive grounding judge test. |
Comments suppressed due to low confidence (1)
tests/test_llm_judge_evaluation.py:392
`test_judge_comprehensive_grounding_evaluation` doesn't call `skip_if_timeout(evaluation)` like the other judge tests. Because `LLMContextualGroundingJudge.evaluate_grounding()` returns default mid-range scores on timeout, this test can silently pass when the LLM call fails. Add `skip_if_timeout(evaluation)` after the evaluation call (before the assertions) to keep behavior consistent and avoid false positives.
```python
async def test_judge_comprehensive_grounding_evaluation(self):
    """Test LLM judge on complex example with multiple grounding types"""
    judge = LLMContextualGroundingJudge()
```
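A minimal sketch of the guard the reviewer is asking for. How the evaluation object signals a timeout is an assumption here (a hypothetical `timed_out` key on a dict-like result); the real tests' helper may detect it differently.

```python
import pytest


def skip_if_timeout(evaluation):
    """Skip the test instead of asserting against placeholder scores.

    Assumes the judge marks timed-out evaluations via a `timed_out`
    flag (hypothetical field name). Default mid-range scores produced
    on timeout would otherwise let assertions pass vacuously.
    """
    if evaluation.get("timed_out", False):
        pytest.skip("LLM judge call timed out; default scores are not meaningful")
```

Placed immediately after the `evaluate_grounding()` call, this turns an external-service timeout into a skip rather than a false positive.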
Cursor Bugbot has reviewed your changes and found 1 potential issue.
- Move `extract_with_retry()` to `conftest.py` to avoid duplication
- Remove redundant `len` assertions (`extract_with_retry` guarantees >= 1)
- Add `skip_if_timeout()` to `test_judge_comprehensive_grounding_evaluation`
Address review feedback: callers can now opt into receiving empty results instead of auto-skipping, which is useful for tests that assert extraction correctly returns nothing.
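The opt-out described above could look like the following sketch. The `allow_empty` parameter name and the helper's signature are assumptions; only the behavior (return empty results to the caller instead of skipping) comes from the review reply.

```python
import pytest


async def extract_with_retry(extract_fn, *args, retries=3, allow_empty=False, **kwargs):
    """Retry extraction; on persistent emptiness, either return or skip.

    Hypothetical signature sketching the opt-out: with allow_empty=True,
    a test that expects no extracted memories gets the empty list back
    and can assert on it directly.
    """
    results = []
    for _ in range(retries):
        results = await extract_fn(*args, **kwargs)
        if results:
            return results
    if allow_empty:
        # Caller explicitly wants to assert that nothing was extracted.
        return results
    pytest.skip(f"extraction returned empty results after {retries} attempts")
```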
Summary
- `extract_with_retry()` helper that retries LLM extraction up to 3 times before `pytest.skip()`, handling non-deterministic empty results
- `temperature=0.0` on `LLMContextualGroundingJudge` and `MemoryExtractionJudge` for consistent scoring
- Re-enabled `test_multi_entity_conversation` and `test_judge_comprehensive_grounding_evaluation`

Fixes #74, #54, #48
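The temperature threading for the judges can be sketched as below. The constructor shape, the client injection, and the message format are assumptions; only the `temperature` parameter with a `0.0` default, passed through to `LLMClient.create_chat_completion()`, is described by this PR.

```python
class LLMContextualGroundingJudge:
    """Sketch of the temperature threading described in this PR.

    The real class's other parameters and the LLMClient API details
    are assumptions; only the 0.0 default comes from the PR summary.
    """

    def __init__(self, client, temperature: float = 0.0):
        self.client = client
        # temperature=0.0 makes judge scoring (near-)deterministic
        self.temperature = temperature

    async def evaluate_grounding(self, prompt: str):
        # Pass the configured temperature through to the chat completion call.
        return await self.client.create_chat_completion(
            messages=[{"role": "user", "content": prompt}],
            temperature=self.temperature,
        )
```

Defaulting to `0.0` at the judge level means every test gets deterministic scoring without touching call sites, while still allowing an override where sampling diversity is wanted.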
Test plan
- `uv run pre-commit run --all-files` passes
- `uv run pytest tests/test_contextual_grounding_integration.py tests/test_thread_aware_grounding.py tests/test_llm_judge_evaluation.py -v --run-api-tests` — previously-skipped tests now run
- `uv run pytest` — all tests pass

Note
Medium Risk
Medium risk because it changes behavior of LLM-integration tests (retries, skipping, and re-enabling previously skipped tests), which may affect CI stability and runtime depending on external LLM variability.
Overview
Reduces flakiness in LLM-dependent tests by adding `extract_with_retry()` (retries thread-aware extraction and skips the test if results stay empty) and switching several tests to use it instead of single-shot extraction/assertions.

Makes LLM-as-a-judge scoring more deterministic by threading a `temperature` parameter (default `0.0`) through `LLMContextualGroundingJudge` and `MemoryExtractionJudge` calls.

Re-enables previously skipped LLM tests (`test_multi_entity_conversation` and the comprehensive judge grounding test) and adds timeout-based skipping to avoid failures from external service timeouts.

Written by Cursor Bugbot for commit 8c4fccc.