
test: fix integration test failures#110

Merged
Kamilbenkirane merged 3 commits into main from ci/skip-xai-grok-3-mini
Jan 15, 2026

Conversation

@Kamilbenkirane
Member

Summary

  • Convert max_tokens assertions to warnings (xAI reasoning tokens can exceed limit)
  • Fix images test_sync_generate to handle list content from Google Imagen

Changes

  • test_stream_generate.py: assert → warnings.warn for max_tokens
  • test_stream_analyze_image.py: assert → warnings.warn for max_tokens
  • test_generate.py (text): assert → warnings.warn for max_tokens
  • test_analyze_image.py: assert → warnings.warn for max_tokens
  • test_generate.py (images): handle list content in test_sync_generate

@Kamilbenkirane merged commit b849276 into main on Jan 15, 2026
11 checks passed
@claude

claude Bot commented Jan 15, 2026

PR Review: Integration Test Fixes

Summary

This PR addresses two important issues in the integration test suite:

  1. Converts strict max_tokens assertions to warnings to accommodate reasoning models (xAI Grok) that can exceed token limits due to hidden reasoning tokens
  2. Fixes the test_sync_generate image test to handle both single artifact and list responses from Google Imagen

Code Quality: ✅ Good

Strengths:

  • Clean, consistent changes across all affected test files
  • Well-documented PR description explaining the reasoning
  • Follows existing code patterns in the codebase
  • Maintains test coverage while making tests more resilient

Minor Observations:

  • The warning messages are clear and informative
  • Consistent use of stacklevel=1 for warnings is appropriate for test files

Logic & Correctness: ✅ Excellent

1. Token Limit Handling (Warnings Instead of Assertions)

Converting the max_tokens assertions to warnings is the right approach, because:

  • Reasoning models (like xAI's Grok) can legitimately exceed max_tokens due to internal reasoning tokens that don't count toward the limit
  • I verified in src/celeste/providers/xai/responses/client.py:115 and other providers that reasoning_tokens is tracked separately via UsageField.REASONING_TOKENS
  • The tests already handle empty content from reasoning models (comments like "Empty/None content is valid for reasoning models")

2. Image Content List Handling

The fix in tests/integration_tests/images/test_generate.py correctly handles the type union, because:

  • I verified in src/celeste/types.py:13 that ImageContent = ImageArtifact | list[ImageArtifact]
  • Looking at src/celeste/modalities/images/providers/google/imagen.py:68-74, the Imagen provider returns a single ImageArtifact when num_images == 1 or a list[ImageArtifact] when num_images > 1
  • The test now explicitly sets num_images=1 but correctly handles the list case defensively
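The defensive handling described above can be illustrated with a small sketch. The `ImageArtifact` dataclass and `first_artifact` helper here are stand-ins for illustration, not the library's actual types or the test's actual code; only the union shape `ImageArtifact | list[ImageArtifact]` comes from the source.

```python
from dataclasses import dataclass

# Stand-in for the library's ImageArtifact type (assumption, simplified).
@dataclass
class ImageArtifact:
    data: bytes

def first_artifact(content):
    """Normalize ImageArtifact-or-list content to a single artifact.

    Imagen returns a single ImageArtifact when num_images == 1 and a
    list[ImageArtifact] otherwise, so the test handles both defensively.
    """
    if isinstance(content, list):
        assert len(content) >= 1, "expected at least one artifact"
        return content[0]
    return content
```

With a normalizer like this, the same assertions can run regardless of which branch of the union the provider returned.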

Best Practices: ✅ Good

Follows Repository Standards:

  • Aligns with tests/testing_guidelines.md principles: tests remain deterministic, clear assertion messages, appropriate use of warnings for non-critical violations
  • Uses warnings module correctly (already imported at top of files)

Good practices demonstrated:

  • Defensive programming in the image test (handles both possible types)
  • Informative warning messages include provider, model, and actual vs. expected values
  • Consistent application of the fix across all affected test files

Potential Issues: ⚠️ Minor Considerations

1. Warning Visibility in CI

  • Warnings may not be visible in CI output unless pytest is configured to display them
  • Recommendation: Verify warnings are visible in CI output, or document that these are expected behavior
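One possible way to make these warnings visible in CI is a pytest filterwarnings setting; this is a hedged sketch to be adapted to the project's existing pytest configuration, not a statement of what the repo currently uses.

```
# pytest.ini (or the [tool.pytest.ini_options] table in pyproject.toml)
[pytest]
filterwarnings =
    always::UserWarning
```

Pytest already prints a warnings summary by default, but an explicit `always` filter guards against the warnings being suppressed by other filters in CI.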

2. Test Coverage for List Case

  • The image test now handles lists defensively, but only tests with num_images=1
  • Suggestion: Consider adding a test case for num_images > 1 to explicitly test the list path

3. Async Test Coverage

  • The test_generate async test in the same file lacks the list-handling fix
  • Looking at lines 49-50, it asserts isinstance(response.content, ImageArtifact) directly
  • Question: Should this test also handle the list case, or is it guaranteed to return a single artifact based on the parametrization?

Performance: ✅ No Impact

The changes have no performance implications.

Security: ✅ No Concerns

No security implications from these changes.

Testing: ✅ Adequate

Test Coverage:

  • Changes affect test files themselves
  • The fixes allow tests to pass that were previously failing
  • Integration tests will validate the behavior across actual providers

Recommendation:

  • Run the integration test suite to confirm all tests pass
  • Verify that warnings are actually emitted for reasoning models that exceed tokens

Documentation: ✅ Good

  • PR description clearly explains the changes and rationale
  • Inline comments in the test code explain the conditional logic
  • Warning messages are self-documenting

Overall Assessment: ✅ Approve with Suggestions

Verdict: This is a solid, well-reasoned PR that correctly addresses real integration test failures.

Required Changes: None

Recommended Improvements:

  1. Verify warnings are visible in CI output
  2. Consider adding explicit test coverage for num_images > 1 case
  3. Review if test_generate async test needs the same list handling as test_sync_generate

Nice work on:

  • Identifying the root cause (reasoning tokens behavior)
  • Choosing warnings over silently ignoring the issue
  • Consistent application across all affected files
  • Clear documentation in the PR

Action Items for Reviewer:

  • Verify integration tests pass in CI
  • Check if warnings are displayed in test output
  • Decide if additional test coverage for num_images > 1 is needed
