Both lexer and parser are hand-written.
The lexer is written as a "generator". Generator is not available in stable rust yet, so it is emulated by an async pipe instead.
The parser is a recursive descent parser. Each parse_xxx(...)
function roughly corresponds to the non-terminal xxx
in the grammer. Left recursion is implemented using loop instead of actual recursive functions. Operator precedence is implemented by specifying unambiguous grammer with expression "levels". Each parse_expr
n(...)
handles only operators in the n-th level. The longest look-ahead is k = 2 that happens in distinguish variable declarations and statements.
chocopy-rs supports intermediate AST (typed or untyped) in JSON format that conforms to the course project specification. The internal format of the AST, however, is different from the one provided in the Java starter code. The type hierarchy from the java code is devirtualized into trees of struct
s and enum
s for idiomatic Rust code. Similarly, the dispatch pattern in the semantic analyser is devirtualized into pattern matching on enum
s.
The code generation part was worked on before consulting the implementation guide from the course. Together with the fact that chocopy-rs targets a different architecture, the implementation detail is different from the one in the guide in many ways.
Symbol names for functions and classes follow a similar rule to the guide, except that user-defined functions are not prefixed with $
. The $
prefix is generally reserved for built-in / hidden functions and data blocks.
- The main procedure is named
$chocopy_main
. - The global variable section is named
$global
. - Global functions use their function names directly as symbol names.
- Constructors use type names as symbol names.
- Methods use
<TypeName>.<FuncName>
as symbol names. - Prototypes use
<TypeName>.$proto
as symbol names. - Nested functions use
<ParentSymbolName>.<FuncName>
as symbol names. - Functions from the standard library all have symbol names starts with
$
, except for the program entry pointmain
.
All the names above are visible to the linker, and also appear as debugging symbols. There are also symbols for debugging only:
- All variables and attributes are named as-is.
- Hidden attributes are prefixed with
$
. See sections below for detail.
RSP
and RBP
are used in their conventional ways: stack pointer and frame base pointer. Other registers are for general use.
Since chocopy-rs targets a 64-bit architecture, objects are represented by 64-bit address, with address 0 reserved for None
as usual.
This compiler also make use of unwrapped values for int
and bool
, which are 4-byte and 1-byte, respectively. They are always stored in 8-byte slots on the stack. However, in the global section and in object attributes, they are stored in a losely "packed" way: they are only aligned to their own size. Because of this, object layout does not have uniform slots any more.
An object is a record containing a 24-byte header followed by object attributes. The object header contains a 8-byte pointer to the object prototype ($proto
), and a 16-byte reserved space for garbage collection ($gc_count
and $gc_next
).
Array-like objects (str
and [T]
) has a 8-byte $len
attribute after the object header, followed by the array data. Note that [int]
and [bool]
has packed layout where each element is only 4 or 1 byte. str
is also packed with 1-byte ASCII characters. Strings in str
are not null-terminated, and \0
is allowed as a valid ASCII character in strings.
For each type with name C
(including all primitive types), the global symbol C.$proto
points to a prototype object of that class. All objects of type C
also has its $proto
attribute pointed to this prototype object. Prototype objects in chocopy-rs share two purposes with prototype objects in the reference implementation: the type tag and the dispatch table. The third purpose of prototype objects in the reference implementation, object initial values, is not included in the prototype here. Please refer the next section Constructors for this.
A prototype object starts with a 4-byte signed integer $size
that describes the memory layout of objects of type C
. When it is a positive number, it is the object size without the object header. When it is a negative number, it indicates C
is an array-like type and the value without the sign is the size of one element.
Next is a 4-byte signed integer $tag
for type tag. The type tag value follows the a similar convention to the one in the implementation guide, except that
object
and user-defined classes usesOther = 0
, and- List objects is separated into
PlainList = -1
(for[int]
or[bool]
) andRefList = -2
(for other list), making it easier for the tracing GC to distinguish.
Next is the pointer to reference map $map
, which points to a bit string, indicating whether each 8 bytes in the attribute is a reference for GC tracing.
Next is the list of function pointers to methods. The first function pointer points to the __init__
method, and so on for other user-defined methods.
For each non-list type with name C
, the global symbol C
is the constructor for the type. The constructor allocates memory for the object, initialize its attributes, and call its __init__
method. Instead of copying initial value from a "prototype" object as described in the implementation guide, constructors generated from chocopy-rs simply assign value for each attribute one by one.
chocopy-rs compiles program using the following calling convention:
- push all arguments to the stack. The rightmost argument is placed at the highest address.
- pass static link in
R10
if it is a nested function. - ensure the stack pointer is aligned at 8 mod 16 at the beginning of a function.
- pass return value in
RAX
. - caller restores stack.
RAX
,RCX
,RDX
,RSI
,RDI
,R8
,R9
,R10
,R11
are volatile across function call.
All functions are called using this calling convention, except for the followings:
- the main procedure
$chocopy_main
- functions in the standard library (all
$
-prefixed functions andmain
)
These functions are called using platform's default C ABI (System V ABI for linux, and Microsoft Windows ABI for Windows).
Stack frame layout is similar to the one described in the implementation guide, except for the position of the static link: instead of being pushed below all parameters, it is passed in R10
upon function call, and pushed immediately above the dynamic link, and below all local variables. So the stack frame layout in top-to-bottom order is:
- actual parameters to the function it is calling (if applicable),
- padding bytes to align the callee stack frame to 16 (if applicable),
- temporary storage for intermediate results of expression evaluation or other operations,
- the values of the local variables of the function,
- the static link of the current function (if applicable),
- the saved frame pointer of its caller,
- the saved return address. Its own actual parameters are in the frame immediately below it.
The local variables are stored in the same order as described in the implementation guide: vairables declared first has the highest address towards the bottom of the stack. However, the parameters are stored in a different order: the leftmost parameter has the lowest address towards the top of the stack.
The final executable produced by the compiler can be run directly under the target system. The executable is made from linking 3 major components:
- the compiled ChocoPy object (
program.o
) - the standard library (
chocopy_rs_std
) - the system C runtime (
libc
)
Their dependency on each other is shown in the following example call chain:
libc
--(`main`)-> chocopy_rs_std
--(`$chocopy_main`)-> program.o
--(`print`)-> program.o
--(`$print`)-> chocopy_rs_std
--(`fwrite`)-> libc
All ChocoPy programs are linked against a small standard library chocopy_rs_std
. The library contains the implementation for built-in functions, object (de)allocation, and error reporting. Contrary to the implementation guide, all functions from the standard library are $
-prefixed, while user-defined ChocoPy functions don't. Note that for each user-invocable built-in function such as print
, the standard library provides $print
, and the compiler also generates a wrapper function print
in the compiled object program.o
. The purpose of the wrapper is to convert the calling convention from ChocoPy convention to system C convention.
The standard library also provides the program entry point main
, which directly calls into the main procedure $chocopy_main
from the compiled object. Doing so is to avoid conflict when the ChocoPy program contains a user-defined function named main
, which in the current implementation will be a local symbol only visible to the compiled object itself.
chocopy-rs implements simple mark-and-sweep tracing garbage collection. When the program allocates new object by calling $alloc
and a certain threshold is reached, the garbage collector will walk through all objects and free unreachable ones.
$alloc
uses native system allocator to allocate memory, and chains all objects into a linked list using the $gc_next
field in the object header. On garbage collection, all live objects are marked as 1 in $gc_count
, and then all objects with 0 in $gc_count
are removed from the linked list and deallocated. All live objects resets $gc_count
to 0 in the end.
To determine live objects, garbage collector walks through the following live reference paths:
- Global references
- Local references
- Member references
Because there are also non-reference (i.e. unboxed) values, the garbage collector needs the information of which variable is a reference. This information is embedded in the program code, know as "reference map". A reference map encodes reference locations by a bit string. The garbage collector finds reference maps by:
- For global references, the map is stored in a global section and provided to the library during
$init
. - For local references, the map is attached to after every function call instruction (unless it can be proven the function call can't lead to
$alloc
). It is done by insertingPREFETCHNTA
instruction, which points to the reference map data. One local reference map only describes the current stack frame. The garbage collector does a full stack backtrace to gather all local reference maps. - For member reference, each class prototype contains a pointer to the reference map
$map
.