A formal VM benchmark and inspectable reasoning runtime for testing whether language models can follow machine-like execution semantics on synthetic tasks.
benchmark vm semantics mcp language-models copilot vm-benchmark mcp-server copilot-coding-agent dataset-factory eval-toolkit reasoning-runtime execution-semantics
Updated Mar 20, 2026 - Python