Synopsis

compiler.pod - overview of Niecza compiler pipeline

Setup

The Perl 6 script src/niecza is the command line wrapper that controls the startup of Niecza. After reading command line options it constructs a compiler object, tweaks its properties and calls one of its compile methods - compile_string, compile_file, or compile_module.

The compiler object is defined in src/NieczaCompiler.pm6. Each compile method translates its arguments into a common format and then delegates to !compile, which sets up some important contextuals like $*backend and in turn delegates to parse in src/NieczaFrontendSTD.pm6. Between them, !compile and parse perform the necessary setup and teardown and start the compilation process; they may be worth combining.
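The delegation shape described above can be sketched in Python. This is only an illustration of the pattern, not Niecza code: the `Compiler` class, the `_compile` and `parse` stand-ins, and the use of `contextvars` to imitate a Perl 6 contextual such as $*backend are all assumptions made for the example.

```python
import contextvars

# Stand-in for a dynamically scoped variable such as $*backend (assumption).
backend = contextvars.ContextVar("backend")


def parse(name, source):
    # Placeholder parser: reports which backend it would use and what it saw.
    return (backend.get(), name, len(source))


class Compiler:
    def __init__(self, backend_impl):
        self.backend_impl = backend_impl

    # Public entry points translate their arguments into one common form.
    def compile_string(self, source):
        return self._compile("<string>", source)

    def compile_file(self, path):
        with open(path, encoding="utf-8") as handle:
            return self._compile(path, handle.read())

    def _compile(self, name, source):
        # Setup: make the backend visible to everything called from here.
        token = backend.set(self.backend_impl)
        try:
            return parse(name, source)
        finally:
            # Teardown: restore the previous state even if parsing fails.
            backend.reset(token)
```

Keeping setup and teardown in one private method means each public entry point only has to normalize its arguments.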

The compilation process proper is best thought of as a succession of stages, although for BEGIN reasons the stages are actually interleaved - a sub must be completely ready to run soon after the closing brace because a BEGIN might need to call it.

Data types

The program is passed between the components as an abstract syntax tree (AST) of some type. Four types of AST are used in Niecza. "Op" tree nodes are objects in subclasses of the base class Op from src/Op.pm6. "CgOp" tree nodes are actually Array instances, for some combination of flexibility and historical reasons; they are constructed by the methods in class CgOp from src/CgOp.pm6. CgOp tree nodes can be flattened into a JSON-like form, which is used to move them from Perl 6 space to C# space, and also as part of the bootstrap procedure (see below).
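To make the array-based node idea concrete, here is a minimal Python sketch. It is not Niecza's actual node layout or wire format: the `node` helper and the op names "call" and "int" are hypothetical, and plain JSON stands in for the "JSON-like form".

```python
import json


def node(op, *args):
    """Build a tree node as a plain list: [op_name, arg1, arg2, ...]."""
    return [op, *args]


# Hypothetical op names, chosen only for illustration.
tree = node("call", "add", node("int", 1), node("int", 2))

# Because nodes are plain lists, the whole tree serializes directly.
wire = json.dumps(tree)

# The receiving side can rebuild an equivalent tree from the text.
rebuilt = json.loads(wire)
```

The appeal of this representation is that no per-class serialization code is needed to move a tree across a language boundary.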

In C# space there are two more AST types we are concerned with, CpsOp and ClrOp, both defined in lib/CodeGen.cs. CpsOp nodes exist in direct correspondence with sections of the CgOp tree and can do any Perl 6 task. ClrOp nodes are created as the output of the ANF-converter, and are restricted to only the kinds of control flow that the CLR natively supports. After the ClrOp stage, the data is handed off to Mono and is no longer our concern.

Note that all of these data types are used only for executable statements and expressions. Structural information from the source code is passed directly to the backend, where it is used to create the ClassHOW, SubInfo, etc. objects that will be used at runtime.

The pipeline

The first stage is the parser, which accepts source code (read in by src/NieczaPathSearch.pm6) and converts it into a tree of Match objects while calling action methods. The parser exists mostly in src/STD.pm6, with some Niecza extensions coded in src/NieczaGrammar.pm6 and src/NieczaFrontendSTD.pm6. The grammar itself is a branch of Larry Wall's standard Perl 6 grammar, which continues to evolve at https://github.com/perl6/std; Niecza tries to track relevant changes. It is worth noting that there is a significant degree of feedback into the parser, especially for disambiguating types and function calls.

The actions module in src/NieczaActions.pm6 accepts Match objects and method calls from the grammar; it uses a collection of other modules (Op, RxOp, Sig, CClass, OpHelpers, Operator) to create the Op AST from Perl 6 source code. The parser triggers each action when it matches the corresponding token in the grammar. The actions system directly calls into the backend to create metaobjects for non-code grammatical constructs, and can even run code for BEGIN and constants.

While constructing subs, NieczaActions uses two external tree walkers that are perhaps best regarded as stages in their own right. These perform specific optimizations; it should be noted that they are applied to one sub at a time, again for BEGIN reasons.

The two external walkers are in src/NieczaPassSimplifier.pm6 and src/OptRxSimple.pm6. They use a combination of top-down and bottom-up analysis to convert certain expressions or regex subterms, respectively, into simpler forms. In particular NieczaPassSimplifier handles inlining of some simple functions that just wrap a single runtime operator, like return and infix:<+>.
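The following Python sketch shows one way a tree walker can inline calls to simple wrapper functions. It is an assumption-laden toy, not the logic of NieczaPassSimplifier: it works on nested lists, does a bottom-up pass only, and the `INLINABLE` table and the "prim_add" operator name are invented for the example.

```python
# Hypothetical table: wrapper function name -> primitive operator name.
INLINABLE = {"infix:<+>": "prim_add"}


def simplify(tree):
    """Rewrite a nested-list expression tree bottom-up."""
    if not isinstance(tree, list) or not tree:
        return tree  # atoms and empty nodes are left alone
    head, *args = tree
    # Simplify the children first, then inspect this node.
    args = [simplify(arg) for arg in args]
    is_known_wrapper = (
        head == "call"
        and args
        and isinstance(args[0], str)
        and args[0] in INLINABLE
    )
    if is_known_wrapper:
        # Replace the call with the operator it wraps, keeping the operands.
        return [INLINABLE[args[0]], *args[1:]]
    return [head, *args]
```

Calls to functions that are not in the table pass through unchanged, so the rewrite is safe to apply to one sub at a time.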

After simplification the Op tree must be converted into a CgOp tree. This is done recursively by the code and cgop methods on Op and RxOp objects. You should implement code in your subclasses but call cgop; cgop should not be overridden because it is responsible for adding line number annotations, and possibly more stuff later. (Think of it as the augment/inner emulation pattern.)
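The code/cgop split is an instance of the template method pattern, sketched here in Python. Only the method names `code` and `cgop` come from the text above; the `IntLiteral` and `Add` classes and the "ann", "int" and "add" node names are hypothetical.

```python
class Op:
    """Base class: subclasses implement code(); callers use cgop()."""

    def __init__(self, line):
        self.line = line

    def code(self):
        raise NotImplementedError

    def cgop(self):
        # Every caller goes through this wrapper, so the line annotation
        # is added in exactly one place and cannot be forgotten.
        return ["ann", self.line, self.code()]


class IntLiteral(Op):
    def __init__(self, line, value):
        super().__init__(line)
        self.value = value

    def code(self):
        return ["int", self.value]


class Add(Op):
    def __init__(self, line, left, right):
        super().__init__(line)
        self.left, self.right = left, right

    def code(self):
        # Recurse through cgop(), not code(), so children get annotated too.
        return ["add", self.left.cgop(), self.right.cgop()]
```

If more bookkeeping is needed later, it can be added to `cgop` without touching any subclass.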

Once the code is converted to CgOp it is passed to the backend via the finish method on static sub objects. The code is then marshalled over to the C# side of the fence and saved.

The final code generation step is postponed to after UNITCHECK time in order that as much information as possible be available for optimizations. However, the code generator is integrated with the runloop, so if a function needs to be invoked early it can be code-generated early. Due to limitations of the CLR (essentially, a class must be closed before it can be used), functions which are used early will need to be code-generated a total of twice if they are to be saved - one copy going into the saved class, which cannot be used yet, and one copy to be used immediately, which cannot be very usefully saved.

The code generation process is controlled by NamProcessor.Scan in lib/CodeGen.cs, which walks over the C# version of the CgOp tree bottom-up, mapping it into CpsOp. Simultaneously, the CpsOp tree is converted into a ClrOp tree by the smart constructors in the CpsOp class. (The CpsOp tree only exists in a notional sense, as the data flow graph of the constructor calls.) The smart constructors rearrange the code so that it can be used in a context with language-defined control flow, such as resumable exceptions and gather/take. The process is often inaccurately referred to as "Continuation Passing Style" (CPS) conversion; a better term would most likely be "A-Normal Form" (ANF) conversion, since the functions are not actually being split into separate continuation blocks.
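For readers unfamiliar with ANF, the Python sketch below shows the general idea on a toy expression language. It is generic textbook ANF conversion under assumed names (`to_anf`, temporaries `t0`, `t1`, ...), not what the CpsOp smart constructors actually do. The point is that once every intermediate value has a name, a computation can be interrupted and resumed between steps.

```python
import itertools


def to_anf(expr):
    """Flatten a nested-tuple expression into a list of simple bindings.

    Returns (bindings, result). Each binding is (temp_name, op, *operands)
    and every operand is an atom: a constant, a variable, or a temp name.
    """
    counter = itertools.count()
    bindings = []

    def atomize(e):
        if not isinstance(e, tuple):
            return e  # already an atom
        op, *args = e
        # Evaluate arguments left to right, naming each intermediate result.
        atoms = [atomize(arg) for arg in args]
        name = f"t{next(counter)}"
        bindings.append((name, op, *atoms))
        return name

    result = atomize(expr)
    return bindings, result
```

For example, `to_anf(("add", ("mul", 2, 3), ("neg", "x")))` names the product and the negation first, then adds the two temporaries.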

At last the ClrOp data is made executable by the CodeGen methods on the various ClrOp subclasses, which produce MSIL in the form of calls to methods on a System.Reflection.Emit.ILGenerator object. Actually this is a two-step process. Language-defined control flow requires the use of a master switch statement at the beginning of each CLR-level function. ListCases methods on the op nodes are called first to calculate the correct indexes into the switch.
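The two-step structure can be illustrated with a toy Python emitter. This is a sketch under stated assumptions: the op tuples, the pseudo-instruction strings, and the rule that every "call" is a resume point are invented for the example, and the real code emits IL through ILGenerator rather than strings. Only the idea of a first pass that assigns switch indexes (here `list_cases`) followed by a second pass that emits code (here `code_gen`) is taken from the text.

```python
def list_cases(ops):
    """Pass 1: assign a switch index to every op that can be resumed after."""
    cases = {}
    for position, (kind, _arg) in enumerate(ops):
        if kind == "call":
            # Index 0 is reserved for "start at the top of the function".
            cases[position] = len(cases) + 1
    return cases


def code_gen(ops):
    """Pass 2: emit pseudo-instructions using the indexes from pass 1."""
    cases = list_cases(ops)
    labels = ["start"] + [f"resume{index}" for index in cases.values()]
    # The master switch must list every label, so it needs pass 1 first.
    out = [f"switch ip -> {', '.join(labels)}", "start:"]
    for position, (kind, arg) in enumerate(ops):
        if kind == "call":
            index = cases[position]
            out.append(f"ip = {index}; call {arg}")
            out.append(f"resume{index}:")
        else:
            out.append(f"{kind} {arg}")
    return out
```

The switch is emitted before any of the code it jumps into, which is why the indexes have to be computed in a separate pass.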

The IL generated is then dealt with appropriately by the underlying runtime. If we are precompiling a module, it will be saved (by the call to AssemblyBuilder.Save in lib/Kernel.cs) into a .dll or .exe file. Otherwise, it will be converted into native code by the JIT (involving several more intermediate stages: Mono method IR, SSA forms, possibly even LLVM IR).

The corresponding metaobjects and constants are then saved alongside the module into a .ser file.

Metacircularity concerns and bootstrapping

There is an important subtlety with regard to the Perl 6 / C# transition. The compiler is, itself, compiled using (an earlier version of) Niecza, and is running on top of a C# kernel with a copy of CodeGen. However, that copy of CodeGen cannot be used. Why not? In order to continue evolving Niecza we need the flexibility to make incompatible changes to the runtime library! So the compiler must access a kernel from the current Niecza while simultaneously running on the old Niecza.

This is accomplished by a renaming trick. All files associated with the current Niecza are prefixed with "Run.". The CLR will of course happily load two files, one named Kernel.dll and one named Run.Kernel.dll, and allow them to be independent. Just renaming the file isn't enough, though: it has to be compiled twice to get the "assembly name" correct. (Previous versions of Niecza used a more general feature called application domains instead. This was changed because it was too slow.)

So the compiler, running on Kernel.dll, can compile user code using Run.Kernel.dll while referencing Run.CORE.dll, etc. This works fine, but there is one remaining catch. When the compiler is to compile a new version of itself, it needs to generate an assembly linked against a new version of Kernel.dll, which it cannot load. The workaround used here is to allow the compiler to only partially compile itself, generating .ser files; then the newly-compiled Kernel.dll can finish the job, creating Niecza.exe and Niecza.ser linked against itself, without loading the old compiler or old Kernel.dll. It turned out to be more convenient to merge all modules into a single output file at the same time.

It would also be worth pointing out src/CompilerBlob.cs, which is a C# module that extends the compiling compiler's kernel, but can be updated along with the current compiler. I try not to think about it too hard, but it's very useful.
