System Intelligence Benchmark is a comprehensive benchmark suite for evaluating the performance of Large Language Models (LLMs) and AI systems across critical system capabilities. It provides tutorials and example benchmarks, and offers both CLI tools and an SDK for further development.
A benchmark is a standard or point of reference against which things may be compared or assessed. In the context of AI and LLMs, benchmarks are essential for evaluating model capabilities, guiding research directions, and measuring progress.
To advance benchmark development, we propose the System Intelligence Benchmark, a modular and extensible framework designed to support diverse research domains and problem types. As shown in the figure below, the framework comprises four abstractions: task set, environment, executor, and evaluator. Each task is associated with a specific environment, wherein the executor generates a solution that is subsequently assessed by the evaluator, which returns the evaluation metrics. This design enables the flexible integration of heterogeneous agents and their systematic evaluation. Additionally, the framework includes built-in executors (agents), evaluators (methodologies and grading rubrics), and tutorials. In the ideal case, users need only supply tasks that represent specific capabilities, select an evaluator, and quickly create and run a new benchmark. See benchmark_abstraction.md for details.
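For intuition, the Python sketch below shows how the four abstractions fit together. It is illustrative only: the class and method names are assumptions, not the actual SDK interfaces.

```python
# Illustrative sketch of the four abstractions -- NOT the actual SDK API;
# all class and method names here are hypothetical.
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Task:
    """One unit of work (e.g., an exam question or a lab assignment)."""
    task_id: str
    prompt: str
    metadata: Dict[str, Any]


class Environment:
    """Provides the context a task runs in (files, containers, tools)."""
    def setup(self, task: Task) -> None: ...
    def teardown(self) -> None: ...


class Executor:
    """The agent under test: produces a candidate solution for a task."""
    def solve(self, task: Task, env: Environment) -> str:
        raise NotImplementedError


class Evaluator:
    """Grades a solution and returns metrics (e.g., pass/fail, score)."""
    def evaluate(self, task: Task, solution: str) -> Dict[str, float]:
        raise NotImplementedError


def run_benchmark(tasks: List[Task], env: Environment,
                  executor: Executor, evaluator: Evaluator) -> List[Dict[str, float]]:
    """Wire the pieces together: execute each task, then grade the result."""
    results = []
    for task in tasks:
        env.setup(task)
        try:
            solution = executor.solve(task, env)
            results.append(evaluator.evaluate(task, solution))
        finally:
            env.teardown()
    return results
```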
The benchmark framework is still under development. If you have any questions, feel free to open an issue or contact us directly.
System Intelligence Benchmark currently includes the following example benchmarks. Each benchmark assesses specific capabilities across multiple levels within a given research direction. Some benchmarks are still under development — we're actively updating them. Stay tuned!
- System Exam Benchmark (benchmarks/course_exam_bench/) - Tests LLM understanding of system concepts through university course exams (54 questions across 4 exams)
- System Lab Benchmark (benchmarks/course_lab_bench/) - Assesses AI capability on practical system course labs and projects
- System Artifact Benchmark (benchmarks/arteval_bench/) - Evaluates AI performance on artifact evaluation
- System Modeling Benchmark (benchmarks/sysmobench/) - Evaluates an agent's ability to produce correct TLA+ models for real-world concurrent and distributed systems, covering capabilities in system comprehension, abstraction, and potentially tool fluency.
- Example Benchmark (benchmarks/example_bench/) - Template and reference implementation for creating new benchmarks
- Benchmarks (benchmarks/) - Contains individual benchmark implementations, each with its own source code, tests, and configuration
- CLI Tools (cli/) - Command-line interface for running benchmarks and managing evaluations
- SDK (sdk/) - Software development kit providing evaluators, LLM interfaces, and utility functions
- Documentation (doc/) - Guides and documentation for using and contributing to SysCapBench
- Python 3.9+
- Docker (optional, for containerized execution)
Docker images currently only support the x86_64/AMD64 architecture. ARM64 (Apple Silicon M1/M2/M3) is not yet supported.
- Clone the repository:

  ```bash
  git clone https://github.com/sys-intelligence/system-intelligence-benchmark.git
  cd system-intelligence-benchmark
  ```

- Install dependencies for a specific benchmark:

  ```bash
  cd cli
  ./install.sh
  ```

- Each benchmark includes an env.toml file for configuration. You should add your own LLM endpoint URL and key there (an illustrative sketch follows this list).
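For reference, an env.toml might look roughly like the sketch below. The key names here are hypothetical; check the env.toml shipped with each benchmark for the actual fields.

```toml
# Hypothetical example -- the actual keys in each benchmark's env.toml may differ.
[llm]
endpoint = "https://your-llm-endpoint.example.com/v1"  # your own LLM endpoint URL
api_key  = "sk-..."                                    # your API key
model    = "gpt-4o"                                    # default model name
```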
To run all benchmarks sequentially:

```bash
cd cli
./run_all_local.sh <model_name>
```

To run just one benchmark locally:

```bash
cd benchmarks/<benchmark_name>
./install.sh  # Only needed the first time
./run.sh <model_name>
```

Benchmarks generate standardized outputs in cli/outputs/{benchmark_name}__{model_name}__{agent}_{timestamp}/:

- result.jsonl: Detailed evaluation results
- summary.json: Aggregated performance metrics
- Test-specific breakdowns and comparisons
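One convenient way to inspect a run programmatically is to load these files with a few lines of Python. The snippet below is only a sketch: the run directory name is made up, and the fields inside each JSON record are benchmark-specific.

```python
# Minimal sketch for inspecting benchmark outputs.
# The run directory name below is a placeholder -- use the one produced by your run.
import json
from pathlib import Path

run_dir = Path("cli/outputs") / "example_bench__gpt-4o__default_20250101-000000"  # hypothetical run

# result.jsonl: one JSON object per evaluated task
with open(run_dir / "result.jsonl") as f:
    results = [json.loads(line) for line in f if line.strip()]
print(f"Loaded {len(results)} per-task results")

# summary.json: aggregated metrics for the whole run
with open(run_dir / "summary.json") as f:
    summary = json.load(f)
print(json.dumps(summary, indent=2))
```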
You can find more detailed usage guides in the CLI README.md.
We welcome community contributions to enrich existing benchmarks (e.g., by adding more exam problems to the System Exam benchmark and more system artifacts to the System Artifact and System Modeling benchmarks), to port your existing benchmarks, and, most importantly, to create new system intelligence benchmarks with our framework. See below for detailed instructions. We believe that such collective community effort will advance AI to its next level and help realize System Intelligence, unlocking the potential of AI-driven computing system innovations. If you are interested in contributing or already have good system benchmarks, please let us know. We have set up a Slack channel at sys-intelligence.slack.com.
Note
We suggest getting started by walking through the basic concept of an AI benchmark: Benchmark Abstraction. After understanding the basic concept, you can decide whether to Contribute to Existing Benchmarks, Port Existing Benchmarks, or Create New Benchmarks.
The easiest way to contribute is to add more tasks to existing benchmarks. Currently, the following two are highly recommended. You can simply follow the provided guidelines to submit your data—once that’s done, you’re all set.
- SystemExam: If you are a professor teaching one or more courses, we highly recommend contributing more exam problems to SystemExam (see this doc for step-by-step guidance).
- SystemArtifact: If you are a researcher submitting artifacts, or an AE chair involved in artifact evaluation, we highly recommend contributing more system artifacts to SystemArtifact (see this doc for step-by-step guidance).
In addition, you can also help review the existing benchmarks to propose improvement ideas or directly enhance them—for example, by adding more advanced evaluators or incorporating improved metrics.
Note
See porting_benchmark.md for step-by-step guidelines.
For integrating existing, independently-developed benchmark projects while maintaining synchronization with upstream:
- Use Git Subtree/Submodule to incorporate upstream code
- Write a bridge layer to connect upstream evaluators with framework SDK
- Configure bidirectional sync for pulling updates and contributing fixes
Example: SysMoBench - ported from SysSpecBench
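To illustrate the bridge-layer idea, the sketch below adapts a hypothetical upstream evaluator to a framework-style evaluate() interface. Every module, class, and function name here is a placeholder, not the real SDK or upstream API.

```python
# Hypothetical bridge layer: adapts an upstream benchmark's evaluator to a
# framework-style evaluate() interface. All module, class, and function names
# are placeholders for illustration only.
from typing import Dict

# import upstream_bench  # e.g., code pulled in via git subtree/submodule


class UpstreamEvaluatorBridge:
    """Translate between the upstream project's API and the framework SDK."""

    def evaluate(self, task, solution: str) -> Dict[str, float]:
        # 1. Convert the framework task into whatever the upstream tool expects.
        upstream_input = {"id": task.task_id, "candidate": solution}
        # 2. Call the upstream grading logic (placeholder call).
        # upstream_result = upstream_bench.grade_solution(upstream_input)
        upstream_result = {"passed": True, "score": 1.0}  # stubbed for the sketch
        # 3. Map the upstream result back to framework-style metrics.
        return {"score": float(upstream_result["score"]),
                "passed": 1.0 if upstream_result["passed"] else 0.0}
```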
Note
See custom_benchmark.md for step-by-step guidelines.
To create a new benchmark, follow these steps:
- Create a new benchmark directory in benchmarks/
- Based on your specific requirements, select and copy an example benchmark as a starting point
- Update the src/main.py file with your specific evaluation logic (your executor and evaluator); a minimal sketch follows this list
- Add test cases in the tests/ directory
- Update the README.md with benchmark-specific details
- Implement install.sh and run.sh scripts
- Update the benchmark list in run_all_local.sh and run_docker.sh if needed
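To make these steps concrete, here is a rough sketch of what a new benchmark's src/main.py could look like. The command-line arguments, file layout, and helper functions are assumptions for illustration, not the framework's actual entry-point contract.

```python
# Hypothetical skeleton for a new benchmark's src/main.py.
# Argument names, file layout, and helper functions are illustrative only.
import argparse
import json
from pathlib import Path


def load_tasks(task_file: Path):
    """Read benchmark tasks from a JSONL file (one task per line)."""
    with open(task_file) as f:
        return [json.loads(line) for line in f if line.strip()]


def solve(task, model_name: str) -> str:
    """Executor: call your model/agent here and return its answer."""
    return f"answer from {model_name} for task {task['task_id']}"  # placeholder


def evaluate(task, solution: str) -> dict:
    """Evaluator: grade the solution and return metrics."""
    return {"task_id": task["task_id"], "correct": solution == task.get("answer")}


def main():
    parser = argparse.ArgumentParser(description="Run the example benchmark")
    parser.add_argument("--model", required=True)
    parser.add_argument("--tasks", default="tasks/tasks.jsonl")
    parser.add_argument("--output", default="outputs/result.jsonl")
    args = parser.parse_args()

    results = [evaluate(t, solve(t, args.model)) for t in load_tasks(Path(args.tasks))]

    out = Path(args.output)
    out.parent.mkdir(parents=True, exist_ok=True)
    with open(out, "w") as f:
        for r in results:
            f.write(json.dumps(r) + "\n")


if __name__ == "__main__":
    main()
```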
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
