Evaluate the performance of LLMs on Q&A tasks in any domain
Topics: benchmarking, reproducible-research, evaluation, embeddings, gemini, data-analysis, code-execution, rag, huggingface, openai-api, llm, prompt-engineering, runpod, anthropic, openrouter, langsmith, togetherai, evaluate-llm, smolagents, automated-evaluation
Updated Jul 3, 2025 - Python