Add configurable judge model support to cody-bench #7979
🎯 Problem
The cody-bench command currently hardcodes the LLM judge model to anthropic/claude-3-5-sonnet-20240620. This limits users who want to experiment with different evaluation models or reduce costs by using smaller models such as Claude Haiku.
💡 Solution
This PR adds a new --judge-model CLI option that lets users specify which model to use for LLM-as-a-judge evaluations. When the option is omitted, the existing default model is used, so backward compatibility is preserved.
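As a rough illustration of the option's fallback behavior, here is a minimal sketch. The helper `resolveJudgeModel` and the constant `DEFAULT_JUDGE_MODEL` are hypothetical names chosen for this example and are not taken from the PR's diff; the actual change may wire the option through the existing CLI parser instead.

```typescript
// Hypothetical sketch: these names are illustrative, not the PR's actual code.
const DEFAULT_JUDGE_MODEL = 'anthropic/claude-3-5-sonnet-20240620'

// Returns the value passed via "--judge-model <id>" or "--judge-model=<id>",
// falling back to the previously hardcoded default when the option is absent.
function resolveJudgeModel(argv: string[]): string {
    const prefix = '--judge-model='
    for (let i = 0; i < argv.length; i++) {
        const arg = argv[i]
        if (arg === undefined) {
            continue
        }
        const next = argv[i + 1]
        if (arg === '--judge-model' && next !== undefined) {
            return next
        }
        if (arg.startsWith(prefix) && arg.length > prefix.length) {
            return arg.slice(prefix.length)
        }
    }
    return DEFAULT_JUDGE_MODEL
}
```

Calling `resolveJudgeModel([])` yields the default model, while `resolveJudgeModel(['--judge-model', 'example/some-model'])` yields the placeholder id `example/some-model`.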
🔄 Backward Compatibility
✅ No breaking changes - existing code continues to work unchanged
✅ Default behavior preserved - same model used when option not specified
✅ Constructor backward compatibility - existing LlmJudge instantiation works
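One common way to keep existing instantiations working is an optional constructor parameter with a default value. The sketch below assumes that approach; `LlmJudgeSketch` and `FALLBACK_JUDGE_MODEL` are hypothetical stand-ins and do not reproduce the real `LlmJudge` signature.

```typescript
// Hypothetical stand-in for LlmJudge; the real class takes other arguments.
const FALLBACK_JUDGE_MODEL = 'anthropic/claude-3-5-sonnet-20240620'

class LlmJudgeSketch {
    public readonly model: string

    // The model parameter is optional, so call sites written before the
    // change (which pass no model) keep the previous default.
    constructor(model: string = FALLBACK_JUDGE_MODEL) {
        this.model = model
    }
}

// Existing call site: unchanged, uses the default model.
const legacyJudge = new LlmJudgeSketch()
// New call site: passes a custom model id (placeholder value).
const customJudge = new LlmJudgeSketch('example/some-model')
```

Because the parameter is defaulted rather than required, no existing caller needs to be edited.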
📋 Validation Checklist
[x] CLI option parsing works correctly
[x] Default model behavior maintained
[x] Custom models passed through correctly
[x] Strategy integration functions properly
[x] TypeScript types are correct
[x] All tests pass
[x] Linting passes
🎯 Benefits
Cost optimization - Users can choose cheaper models like Claude Haiku for large-scale evaluations
Quality tuning - Users can select Claude Opus for highest quality judging when needed
Experimentation - Researchers can compare different models' judging capabilities
Future-proofing - Easy to add support for new models as they become available
Test plan