This PR addresses two important issues in the integration test suite:
Converts strict max_tokens assertions to warnings to accommodate reasoning models (xAI Grok) that can exceed token limits due to hidden reasoning tokens
Fixes the test_sync_generate image test to handle both single artifact and list responses from Google Imagen
Code Quality: ✅ Good
Strengths:
Clean, consistent changes across all affected test files
Well-documented PR description explaining the reasoning
Follows existing code patterns in the codebase
Maintains test coverage while making tests more resilient
Minor Observations:
The warning messages are clear and informative
Consistent use of stacklevel=1 for warnings is appropriate for test files
Logic & Correctness: ✅ Excellent
1. Token Limit Handling (Warnings Instead of Assertions)
Changing the max_tokens assertions to warnings is the right approach because:
Reasoning models (like xAI's Grok) can legitimately exceed max_tokens due to internal reasoning tokens that don't count toward the limit
I verified in src/celeste/providers/xai/responses/client.py:115 and other providers that reasoning_tokens is tracked separately via UsageField.REASONING_TOKENS
The tests already handle empty content from reasoning models (comments like "Empty/None content is valid for reasoning models")
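The assert-to-warning pattern described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the helper name check_max_tokens and its parameters are hypothetical, and the message format is only an assumption about what an informative warning might contain.

```python
import warnings


def check_max_tokens(output_tokens: int, max_tokens: int, provider: str, model: str) -> bool:
    """Warn, rather than fail, when reported output tokens exceed max_tokens.

    Hypothetical helper. Returns True when the count is within the limit,
    False when a warning was issued.
    """
    if output_tokens <= max_tokens:
        return True
    # A reasoning model may report more tokens than requested, so the test
    # surfaces the overshoot without failing.
    warnings.warn(
        f"{provider}/{model}: output_tokens={output_tokens} exceeds "
        f"max_tokens={max_tokens} (possibly due to reasoning tokens)",
        UserWarning,
        stacklevel=1,
    )
    return False
```

Returning a boolean keeps the overshoot observable to the caller, so a stricter test for a non-reasoning model could still assert on it.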
2. Image Content List Handling
The fix in tests/integration_tests/images/test_generate.py handles the type union correctly:
I verified in src/celeste/types.py:13 that ImageContent = ImageArtifact | list[ImageArtifact]
Looking at src/celeste/modalities/images/providers/google/imagen.py:68-74, the Imagen provider returns a single ImageArtifact when num_images == 1 or a list[ImageArtifact] when num_images > 1
The test now explicitly sets num_images=1 but correctly handles the list case defensively
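One way to handle the union defensively is to normalize the content to a list before asserting on it. The sketch below uses a stand-in ImageArtifact dataclass and a hypothetical helper as_artifact_list; neither is taken from the celeste codebase, and the Union alias only mirrors the ImageArtifact | list[ImageArtifact] definition quoted above.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ImageArtifact:
    """Stand-in for the project's ImageArtifact type (assumption)."""

    data: bytes


# Mirrors the union quoted in the review: ImageArtifact | list[ImageArtifact].
ImageContent = Union[ImageArtifact, List[ImageArtifact]]


def as_artifact_list(content: ImageContent) -> List[ImageArtifact]:
    """Return the content as a list, wrapping a single artifact if needed."""
    if isinstance(content, list):
        return content
    return [content]
```

With this shape, a test can iterate over as_artifact_list(response.content) and apply the same per-artifact assertions regardless of which branch the provider took.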
Best Practices: ✅ Good
Follows Repository Standards:
Aligns with tests/testing_guidelines.md principles: tests remain deterministic, clear assertion messages, appropriate use of warnings for non-critical violations
Uses warnings module correctly (already imported at top of files)
Good practices demonstrated:
Defensive programming in the image test (handles both possible types)
Informative warning messages include provider, model, and actual vs. expected values
Consistent application of the fix across all affected test files
Potential Issues: ⚠️ Minor Considerations
1. Warning Visibility in CI
Warnings may not be visible in CI output unless pytest is configured to display them
Recommendation: Verify warnings are visible in CI output, or document that these are expected behavior
2. Test Coverage for List Case
The image test now handles lists defensively, but only tests with num_images=1
Suggestion: Consider adding a test case for num_images > 1 to explicitly test the list path
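A check that exercises both return shapes might look like the sketch below. It runs against a hypothetical fake_generate function that imitates the single-versus-list behavior described for Imagen; a real test would call the provider instead, and the names here are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ImageArtifact:
    """Stand-in for the project's ImageArtifact type (assumption)."""

    data: bytes


def fake_generate(num_images: int) -> Union[ImageArtifact, List[ImageArtifact]]:
    """Hypothetical stand-in: one artifact for num_images == 1, a list otherwise.

    Assumes num_images >= 1.
    """
    artifacts = [ImageArtifact(data=bytes([i])) for i in range(num_images)]
    return artifacts[0] if num_images == 1 else artifacts


def check_content_shape(num_images: int) -> None:
    """Assert that the returned shape matches the requested image count."""
    content = fake_generate(num_images)
    if num_images == 1:
        assert isinstance(content, ImageArtifact)
    else:
        assert isinstance(content, list)
        assert len(content) == num_images
        assert all(isinstance(item, ImageArtifact) for item in content)


# Exercise both the single-artifact path and the list path.
for n in (1, 3):
    check_content_shape(n)
```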
3. Async Test Coverage
The test_generate async test in the same file doesn't have the list handling fix
At lines 49-50, it asserts isinstance(response.content, ImageArtifact) directly
Question: Should this test also handle the list case, or is it guaranteed to return a single artifact based on the parametrization?
Performance: ✅ No Impact
The changes have no performance implications.
Security: ✅ No Concerns
No security implications from these changes.
Testing: ✅ Adequate
Test Coverage:
Changes affect test files themselves
The fixes allow tests to pass that were previously failing
Integration tests will validate the behavior across actual providers
Recommendation:
Run the integration test suite to confirm all tests pass
Verify that warnings are actually emitted for reasoning models that exceed tokens
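To confirm that a warning is really emitted, the standard library's warnings.catch_warnings(record=True) can capture it for inspection. The sketch below is an assumption about how such a check could be written; emit_token_warning is a hypothetical stand-in for the logic in the tests, not code from this PR.

```python
import warnings
from typing import List


def emit_token_warning(output_tokens: int, max_tokens: int) -> None:
    """Hypothetical stand-in for the max_tokens check performed in the tests."""
    if output_tokens > max_tokens:
        warnings.warn(
            f"output_tokens={output_tokens} exceeds max_tokens={max_tokens}",
            UserWarning,
            stacklevel=1,
        )


def collect_warnings(output_tokens: int, max_tokens: int) -> List[str]:
    """Run the check and return the messages of any warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        # "always" prevents the default once-per-location filter from
        # hiding repeated warnings.
        warnings.simplefilter("always")
        emit_token_warning(output_tokens, max_tokens)
    return [str(w.message) for w in caught]
```

For CI visibility, pytest normally prints a warnings summary at the end of a run unless it is suppressed (for example with --disable-warnings), so it is worth checking whether the CI invocation passes such a flag.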
Documentation: ✅ Good
PR description clearly explains the changes and rationale
Inline comments in the test code explain the conditional logic
Warning messages are self-documenting
Overall Assessment: ✅ Approve with Suggestions
Verdict: This is a solid, well-reasoned PR that correctly addresses real integration test failures.
Required Changes: None
Recommended Improvements:
Verify warnings are visible in CI output
Consider adding explicit test coverage for num_images > 1 case
Review if test_generate async test needs the same list handling as test_sync_generate
Nice work on:
Identifying the root cause (reasoning tokens behavior)
Choosing warnings over silently ignoring the issue
Consistent application across all affected files
Clear documentation in the PR
Action Items for Reviewer:
Verify integration tests pass in CI
Check if warnings are displayed in test output
Decide if additional test coverage for num_images > 1 is needed
Summary
Changes
test_stream_generate.py: assert → warnings.warn for max_tokens
test_stream_analyze_image.py: assert → warnings.warn for max_tokens
test_generate.py (text): assert → warnings.warn for max_tokens
test_analyze_image.py: assert → warnings.warn for max_tokens
test_generate.py (images): handle list content in test_sync_generate