Formal models are essential for specifying large, complex computer systems and verifying their correctness, but they are notoriously expensive to write and maintain. Recent advances in generative AI show promise in generating certain forms of specifications. However, existing work mostly targets small pieces of code rather than complete systems. It remains unclear whether AI can handle realistic system artifacts, which requires abstracting their complex behavioral properties into formal models.
SysMoBench is a benchmark that evaluates AI's ability to formally model large, complex systems. We focus on concurrent and distributed systems, which are keystones of today's critical computing infrastructure, from operating systems to cloud platforms. We use TLA+, the de facto specification language for concurrent and distributed systems, though the benchmark can be extended to other specification languages.
We address the primary challenge of evaluating AI-generated models by automating metrics such as syntactic and runtime correctness, conformance to system code, and invariant correctness. SysMoBench currently includes nine diverse system artifacts, including the Raft implementations of Etcd and Redis and the Spinlock and Mutex in the Asterinas OS; more artifacts are being actively added.
SysMoBench sources its tasks from real-world systems and automatically evaluates AI-generated TLA+ models with different metrics,
as illustrated in the following figure.

- Automated Quality Metrics: Four automated phases evaluate AI-generated models from multiple dimensions: syntax correctness, runtime correctness, conformance to system implementation, and invariant correctness.
- Real-World System Artifacts: Nine diverse, widely used systems, including distributed consensus systems (Etcd Raft, Redis Raft), concurrent synchronization systems (Asterinas Spinlock/Mutex/Rwmutex), distributed replication systems (Xline CURP), and PGo-compiled systems.
- Extensible Framework: Adding a new system artifact requires no reference model, only the system code, instrumentation for trace collection, and invariant templates, making the benchmark easy to extend with additional systems.
- Python 3.8+
- Java 11+ (for TLA+ tools)
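To confirm the prerequisites are available before installing, a quick sanity check (exact output varies by platform):

```bash
python3 --version   # should report Python 3.8 or newer
java -version       # should report version 11 or newer (required by the TLA+ tools)
```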
- Navigate to the project directory and install:

```bash
cd SysMoBench
pip install -e .
```

- Download TLA+ tools:

```bash
sysmobench-setup
```

- Configure API keys:

```bash
export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"
export GEMINI_API_KEY="your-key"
export DEEPSEEK_API_KEY="your-key"
```

- Verify installation:

```bash
sysmobench --list-tasks
```

General usage:

```bash
sysmobench --task <task> --method <method> --model <model> --metric <metric>
```

This example demonstrates how to evaluate an AI-generated TLA+ model for the Asterinas Spinlock system using the `compilation_check` metric:

```bash
sysmobench --task spin --method direct_call --model claude --metric compilation_check
```

Expected Output:

```
[INFO] Starting benchmark evaluation for task: spin
[INFO] Using method: direct_call, model: claude
[INFO] Generating TLA+ specification...
[INFO] Running syntax correctness evaluation with SANY...
Syntax Evaluation Results: ✓ PASS
Generation time: 12.34s
Compilation time: 0.52s
Syntax errors: 0
Semantic errors: 0
Compilation success rate: PASS (100.0%)
Results saved to: output/spin/direct_call/claude/
```

For detailed usage instructions, see the Usage Guide.
SysMoBench includes 9 diverse real-world system artifacts from concurrent and distributed systems:
| System | Type | Description | Source Lang. | Source LoC | TLA+ LoC |
|---|---|---|---|---|---|
| Asterinas Spinlock | Concurrent | Synchronization | Rust | 213 | 151 |
| Asterinas Mutex | Concurrent | Synchronization | Rust | 186 | 219 |
| Asterinas Rwmutex | Concurrent | Synchronization | Rust | 395 | 250 |
| Etcd Raft | Distributed | Consensus (Raft) | Go | 2,159 | 385 |
| Redis Raft | Distributed | Consensus (Raft) | C | 2,394 | 349 |
| Xline CURP | Distributed | Replication (CURP) | Rust | 4,064 | 100 |
| PGo dqueue | Distributed | Distributed Queue | Go | 175 | 75 |
| PGo locksvc | Distributed | Lock Server | Go | 281 | 93 |
| PGo raftkvs | Distributed | Consensus (Raft) | Go | 3,163 | 508 |
To list all available tasks:

```bash
sysmobench --list-tasks
```

SysMoBench provides four automated phases to evaluate AI-generated TLA+ models with different metrics: syntax correctness, runtime correctness, conformance to system implementation, and invariant correctness.
The paper uses the following metrics for evaluation:

- Syntax Correctness: Weighted combination (50% each) of `compilation_check` and `action_decomposition`
- Runtime Correctness: `runtime_coverage`
- Conformance to System Implementation:
  - Non-PGo systems (spin, mutex, rwmutex, etcd, redis, xline): `trace_validation`
  - PGo systems (raftkvs, locksvc, dqueue): `pgo_trace_validation`
- Invariant Correctness: `invariant_verification`
Validates that the generated TLA+ model follows TLA+ grammar rules, uses operators correctly, and has a proper module structure, using the SANY Syntactic Analyzer. The evaluation includes both full-model compilation and per-action analysis to provide fine-grained feedback on syntax errors.
Available metrics:
- `compilation_check` - Basic TLA+ compilation checking using the SANY parser
- `action_decomposition` - Evaluate individual actions separately for better granularity
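For instance, the per-action analysis can be run on the Spinlock task with the same method and model as the earlier example:

```bash
# Per-action syntax analysis of the generated Spinlock model
sysmobench --task spin --method direct_call --model claude --metric action_decomposition
```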
Evaluates whether the syntactically correct model can execute without runtime errors. The TLC Model Checker performs bounded model checking and simulation to explore the state space without invariant checking, recording covered actions and any runtime errors encountered (e.g., type mismatches, invalid operations).
Available metrics:
- `runtime_check` - Model checking with TLC
- `coverage` - TLA+ specification coverage analysis using TLC coverage statistics
- `runtime_coverage` - Runtime coverage using simulation mode to identify successful vs. error-prone actions
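For instance, runtime coverage for a generated Etcd Raft model can be measured as follows (same method and model as above):

```bash
# Simulation-based runtime coverage for the Etcd Raft task
sysmobench --task etcd --method direct_call --model claude --metric runtime_coverage
```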
Measures whether the model conforms to actual system behavior through trace validation. Execution traces collected from instrumented system code are validated against the model to verify if each trace corresponds to a valid path in the model's state space, ensuring behavioral consistency between model and implementation.
Available metrics:
- `trace_validation` - Full trace generation and validation pipeline
- `pgo_trace_validation` - Full trace generation and validation pipeline (PGo version)
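For instance, conformance can be checked for a non-PGo system with `trace_validation` and for a PGo-compiled system with `pgo_trace_validation`:

```bash
# Trace validation for Redis Raft (non-PGo)
sysmobench --task redis --method direct_call --model claude --metric trace_validation

# Trace validation for the PGo-compiled raftkvs system
sysmobench --task raftkvs --method direct_call --model claude --metric pgo_trace_validation
```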
Verifies that the model satisfies system-specific safety and liveness properties. Using expert-written invariant templates that are automatically concretized for the generated model, TLC model checking validates correctness properties such as mutual exclusion, log consistency, and progress guarantees across the explored state space.
Available metrics:
- `invariant_verification` - Model checking with invariants
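For instance, the invariant check can be run on the Asterinas Mutex task:

```bash
# Model checking with expert-written invariants for the Mutex task
sysmobench --task mutex --method direct_call --model claude --metric invariant_verification
```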
SysMoBench is designed to be extensible. To add a new system artifact:
- Prepare system artifact: Collect repository links, branch names, and any relevant materials
- Create task definition: Specify modeling requirements, task configuration, and related files in `task.yaml`, and define invariant templates for correctness properties
- Instrument for trace collection: Add logging statements to the system code to collect execution traces for conformance validation (see the layout sketch below)
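As a rough sketch, a new task (here a hypothetical `mylock`; the name is illustrative) adds files in the locations already used by the existing tasks, mirroring the project layout shown further below:

```
tla_eval/tasks/mylock/task.yaml    # task configuration and modeling requirements
tla_eval/tasks/mylock/prompts/     # system-specific prompts
data/invariant_templates/          # add invariant templates for the new system here
data/traces/                       # add execution traces from the instrumented system here
```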
For detailed instructions, see Adding New Systems Guide.
SysMoBench/
├── scripts/
│ └── run_benchmark.py # Main entry script for running benchmarks
├── tla_eval/
│ ├── tasks/ # Task definitions for each system artifact
│ │ ├── spin/ # Spinlock task with prompts, configs
│ │ │ ├── prompts/ # System-specific prompts
│ │ │ └── task.yaml # Task configuration (system info)
│ │ ├── mutex/
│ │ └── ...
│ ├── models/ # LLM model interfaces and wrappers
│ ├── evaluation/ # Evaluator implementations organized by metric type
│ └── config.py # Configuration management (API keys, LLM model configs)
├── data/
│ ├── invariant_templates/ # Expert-written invariant templates for each system
│ └── traces/ # System execution traces for conformance evaluation
└── lib/ # TLA+ toolchain (tla2tools.jar for SANY and TLC)
