A New Stan-to-C++ Compiler
This repo contains work in progress on a new compiler for Stan, written in OCaml. To read more about why we're building this, see this introductory blog post. For some discussion as to how we chose OCaml, see this accidental flamewar. We're currently able to successfully compile, link, and run these models (listed under Test Results), but not much else.
High-level concepts, invariants, and 30,000-ft view
Stanc3 has 3 main src packages: frontend, middle, and stan_math_backend. The Middle contains the MIR and currently any types or functions used by the two ends.
The entrypoint for the compiler is in src/stanc/stanc.ml, which sequences the various components together.
Distinct Stanc Phases
- Lex the Stan language into tokens.
- Parse Stan language into AST that represents the syntax quite closely and aids in development of pretty-printers and linters. Run stanc --debug-ast to print this out.
- Typecheck & add type information (Semantic_check.ml).
- Desugaring phase (AST -> AST).
- Lower into Middle Intermediate Representation (AST -> MIR)
- Analyze & optimize (MIR -> MIR)
- Backend MIR transform (MIR -> MIR) (Transform_Mir.ml)
- Hand off to a backend to emit C++ (or LLVM IR, or Tensorflow, or interpret it!).
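As a rough illustration of how these phases chain together, here is a minimal OCaml sketch. Every type and function name in it is a hypothetical placeholder, not something taken from the stanc3 sources; the real sequencing lives in src/stanc/stanc.ml. The only point is the shape: each phase is a function, and the arrows in the list above are its input and output types.

```ocaml
(* Hypothetical placeholder types: the real data structures are far richer. *)
type tokens = Tokens of string list
type ast = Ast of string
type typed_ast = Typed_ast of string
type mir = Mir of string

(* Each phase is a plain function; the bodies are stand-ins. *)
let lex (src : string) : tokens = Tokens (String.split_on_char ' ' src)
let parse (Tokens ts) : ast = Ast (String.concat " " ts)
let typecheck (Ast s) : typed_ast = Typed_ast s
let desugar (t : typed_ast) : typed_ast = t        (* AST -> AST *)
let lower (Typed_ast s) : mir = Mir s              (* AST -> MIR *)
let optimize (m : mir) : mir = m                   (* MIR -> MIR *)
let backend_transform (m : mir) : mir = m          (* MIR -> MIR *)
let emit_cpp (Mir s) : string = "// C++ for: " ^ s

(* The compiler driver is then just the composition of the phases. *)
let compile (src : string) : string =
  src |> lex |> parse |> typecheck |> desugar |> lower |> optimize
  |> backend_transform |> emit_cpp
```

Because the MIR -> MIR phases share one type, extra optimization passes can be slotted into the pipeline without touching the frontend or the backend.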
The two central data structures
src/frontend/Ast.ml defines the AST. The AST is intended to have a direct 1-1 mapping with the syntax, so there are things like parentheses being kept around. The pretty-printer in the frontend uses the AST and attempts to keep user syntax the same while just adjusting whitespace (maybe that's the wrong idea and we should move to a canonicalizer like go fmt soon; TBD). The AST uses a particular functional programming trick to add metadata to the AST (and its other tree types), sometimes called the "two-level types" pattern. Essentially, many of the tree variant types are parameterized by something that ends up being a placeholder not for just metadata but for the recursive type including metadata, sometimes called the fixed point. So instead of recursively referencing expression you would instead reference type parameter 'e, which will later be filled in with something like type expr_with_meta = metadata expression. The AST intends to keep very close to Stan-level semantics and syntax in every way.
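The two-level types pattern can be sketched in a few lines of OCaml. The definitions below are a simplified, hypothetical illustration and not the actual contents of src/frontend/Ast.ml: the constructor names, the with_meta record, and the location type are all assumptions made for the example.

```ocaml
(* Level one: the tree shape. Recursive positions use the type parameter 'e
   instead of referring to [expression] directly. *)
type 'e expression =
  | IntLit of int
  | Var of string
  | Add of 'e * 'e
  | Paren of 'e

(* Level two: tie the knot. A node is a shape plus metadata, and the children
   of the shape are again nodes carrying the same kind of metadata. *)
type 'm with_meta = { expr : 'm with_meta expression; meta : 'm }

(* Example metadata (hypothetical): a source location. *)
type location = { line : int; col : int }

type located_expr = location with_meta

(* Swapping the metadata on every node leaves the tree shape untouched. *)
let rec map_meta (f : 'a -> 'b) (e : 'a with_meta) : 'b with_meta =
  let expr =
    match e.expr with
    | IntLit n -> IntLit n
    | Var v -> Var v
    | Add (l, r) -> Add (map_meta f l, map_meta f r)
    | Paren inner -> Paren (map_meta f inner)
  in
  { expr; meta = f e.meta }

(* x + (2), with a location attached to every node. *)
let example : located_expr =
  let at line col expr = { expr; meta = { line; col } } in
  at 1 1 (Add (at 1 1 (Var "x"), at 1 5 (Paren (at 1 6 (IntLit 2)))))

(* The same tree, now annotated with line numbers only. *)
let lines_only : int with_meta = map_meta (fun loc -> loc.line) example
```

The benefit of this design is that one definition of the tree shape can serve several stages, for example an untyped tree carrying only locations and a typed tree carrying locations plus types, without duplicating the variant.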
src/middle/Mir.ml contains the MIR (Middle Intermediate Representation - we're saving room at the bottom for later). src/frontend/Ast_to_Mir.ml performs the lowering and attempts to strip out as much Stan-specific semantics and syntax as possible, though this is still something of a work-in-progress. The MIR uses the same two-level types pattern to add metadata, notably expression types and autodiff levels as well as locations on many things. The MIR is used as the output data type from the frontend and the input for dataflow analysis, optimization (which also outputs MIR), and code generation.
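To make the idea of lowering concrete, here is a toy OCaml sketch under stated assumptions. The ast, mir, and ty types and the lower function are hypothetical stand-ins, not the definitions in Ast.ml, Mir.ml, or Ast_to_Mir.ml. For brevity the sketch computes expression types while lowering, whereas in the phases described above typechecking happens earlier, on the AST.

```ocaml
(* Hypothetical syntax-level tree: parentheses are kept as explicit nodes. *)
type ast =
  | A_int of int
  | A_real of float
  | A_add of ast * ast
  | A_paren of ast

(* Hypothetical expression types attached as MIR metadata. *)
type ty = T_int | T_real

(* Hypothetical MIR: no parenthesis node, and every node carries its type. *)
type mir = { pattern : mir_pattern; ty : ty }

and mir_pattern =
  | M_int of int
  | M_real of float
  | M_add of mir * mir

let rec lower (e : ast) : mir =
  match e with
  | A_int n -> { pattern = M_int n; ty = T_int }
  | A_real r -> { pattern = M_real r; ty = T_real }
  (* Purely syntactic nodes disappear during lowering. *)
  | A_paren inner -> lower inner
  | A_add (l, r) ->
    let l = lower l and r = lower r in
    (* int + int stays int; anything involving a real is real. *)
    let ty = if l.ty = T_int && r.ty = T_int then T_int else T_real in
    { pattern = M_add (l, r); ty }

(* (1) + 2.0 lowers to an addition node of type real with no Paren inside. *)
let lowered = lower (A_add (A_paren (A_int 1), A_real 2.0))
```

Later passes that consume the MIR then never have to handle parentheses, and they can read the type straight off each node.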
Getting development on stanc3 up and running locally
To build, test, and run
To be able to build the project, make sure you have GNU make installed.
To install OCaml and the dependencies we need to build and do development, run make. The binary will be built in
To run tests, run dune runtest (and if there are changes you think are correct now, use dune promote to accept them).
To run e.g. only the integration tests, run dune runtest test/integration.
There are some git hooks in scripts/hooks; install with
To auto-format the OCaml code (sadly, this does not work for the two ocamllex and menhir files), run dune build @fmt or
To accept the changes proposed by ocamlformat, run
Run ./_build/default/src/stanc/stanc.exe on an individual .stan file to compile it. Use -? to get command line options.
Run dune build @update_messages to see if your additions to the parser have added any new error message possibilities, and dune promote to accept them.
Development on Windows
Having tried both native Windows development and development through Ubuntu on WSL, the Ubuntu on WSL route seems vastly smoother and it is what we recommend as a default. Its only downside seems to be that it builds Ubuntu binaries rather than Windows binaries. If Windows binaries are preferred, OCaml for Windows can be used.
For working on this project, we recommend using either VSCode or Emacs as an editor, due to their good OCaml support through Merlin: syntax highlighting, auto-completion, type inference, automatic case splitting, and more. For people who prefer a GUI and have not memorized all Emacs or Vim keystrokes, VSCode might have the gentler learning curve. Anything with Merlin support and keyboard shortcuts should be okay.
Setting up VSCode
Install instructions for VSCode can be found here.
For Windows users: please note that we advise following the Linux install instructions through WSL.
Seeing that VSCode is a GUI application, you will need to install an XServer and add export DISPLAY=:0.0 to
We recommend Mobaxterm.
In case you are using a high-res display, it may be worth overriding the high DPI setting of Mobaxterm (right click Mobaxterm binary > properties > Compatibility > Change high DPI settings > Override high DPI scaling behaviour > Application) and adding export GDK_SCALE=3 or export GDK_SCALE=2 to
We also advise setting "window.titleBarStyle": "native" in VSCode under settings to have proper control over the window.
Once in VSCode (on any platform), simply install the OCaml extension and you should be ready to go.
- Multiple phases, each with human-readable intermediate representations for easy debugging and optimization design.
- Optimizing - takes advantage of info known at the Stan language level. Minimize information we must teach users for them to write performant code.
- Holistic - bring as much of the code as possible into the MIR for whole-program optimization.
- Research platform - enable a new class of optimizations based on probability theory.
- Modular - architect & build in a way that makes it easy to outsource things like symbolic differentiation to external libraries and to use parts of the compiler as the basis for other tools built around the Stan language.
- Simplicity first - When making a choice between correct simplicity and a perceived performance benefit, we want to make the choice for simplicity unless we can show significant (> 5%) benchmark improvements to compile times or run times. Premature optimization is the root of all evil.