ai-benchmark

Star

Here are 7 public repositories matching this topic...

microsoft / WindowsAgentArena

Star

Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of multi-modal AI agents.

windows ai computer ai-research ai-agent agentic ai-benchmark desktop-agent computer-use

Updated Apr 30, 2025
Python

TheAgentCompany / TheAgentCompany

Star

An agent benchmark with tasks in a simulated software company.

agent benchmark ai ai-research llm ai-benchmark

Updated Jun 13, 2025
Python

kaykycampos / gta-benchmark

Star

GTA (Guess The Algorithm) Benchmark - A tool for testing AI reasoning capabilities

python docker flask benchmark machine-learning puzzle reverse-engineering educational pattern-recognition ctf binary-analysis algorithm-analysis computational-thinking algorithmic-reasoning ai-benchmark

Updated Mar 23, 2025

Habitante / gta-benchmark

Star

GTA (Guess The Algorithm) Benchmark - A tool for testing AI reasoning capabilities

Updated Jan 12, 2025
Python

chaosync-org / awesome-ai-agent-testing

Star

🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems

testing qa benchmark machine-learning evaluation chaos artificial-intelligence chaos-monkey testing-tools awesome-list quality-assurance ai-safety ai-agents chaos-engineering llm llm-evaluation agentic-ai ai-benchmark agent-evaluation

Updated May 28, 2025

PlayBench is a platform that evaluates AI models by having them compete in various games and creative tasks. Unlike traditional benchmarks that focus on text generation quality or factual knowledge, PlayBench tests models on skills like strategic thinking, pattern recognition, and creative problem-solving.

svg chess ai rock-paper-scissors ai-benchmark

Updated Jun 16, 2025
Blade

petmal / MindTrial

Star

MindTrial: Evaluate and compare AI language models (LLMs) on text-based tasks with optional file/image attachments. Supports multiple providers (OpenAI, Google, Anthropic, DeepSeek), custom tasks in YAML, and HTML/CSV reports.

Updated Jun 18, 2025
Go

Improve this page

Add a description, image, and links to the ai-benchmark topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with the ai-benchmark topic, visit your repo's landing page and select "manage topics."

Learn more

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ai-benchmark

Here are 7 public repositories matching this topic...

microsoft / WindowsAgentArena

TheAgentCompany / TheAgentCompany

kaykycampos / gta-benchmark

Habitante / gta-benchmark

chaosync-org / awesome-ai-agent-testing

playsaurus-inc / play-bench

petmal / MindTrial

Improve this page

Add this topic to your repo