

# The Conversion of Source Code to Machine Code

Explaining the Basics of Compiler Construction Using a Self-Made Compiler

Silas Groh, Mik Müller Supported by Sonja Sokolović

March 25, 2023

#### **Abstract**

Programming languages are undoubtedly of great importance for various aspects of modern-day life. Even if they remain unnoticed, digital systems running programs written in some sort of programming language are ubiquitous. However, there are numerous ways of implementing a programming language. A language designer could choose an interpreted or a compiled approach for their language's implementation. Both ways of program execution come with their own advantages and disadvantages.

This paper aims to inform the reader about different means of program execution, focusing on compiler construction. However, we will only focus on the basics since implementing a programming language is often a demanding task. In order to include practical examples, we will explain concepts on the basis of *rush*, our own programming language, which was purposefully designed for this paper. During the implementation of rush, the focus for this paper has shifted slightly. As the title suggests, we originally planned to only implement and explain one compiler. However, there are numerous architectures which a compiler could target and settling on just one felt like the reader would miss out on too much. Therefore, we have implemented rush using two interpreters machine, one transpiler, and four compilers.

In Chapter 1, we will give an introduction to implementing a programming language. Moreover, the rush programming language and its characteristics are presented. In Chapter 2, the process of analyzing the program's syntax and semantics is explained. Chapter 3 focuses on how interpreters can be used in order to implement a programming language. Here, we will differentiate between a tree-walking interpreter and a virtual machine. Chapter 4 illustrates how compilation to high-level targets¹ works. As examples for high-level targets, we will present a compiler targeting the rush virtual machine, a compiler targeting WebAssembly, and a compiler which uses the LLVM framework. Chapter 5 focuses on how compilers targeting low-level targets² can be implemented. For this, we will present a compiler targeting RISC-V assembly and another compiler targeting x86\_64 assembly. Lastly, Chapter 6 presents final thoughts and conclusions.

<sup>&</sup>lt;sup>1</sup>High-level targets: in this case machine independent; many abstractions are provided.

<sup>&</sup>lt;sup>2</sup>Low-level targets: specific to one CPU architecture and operating system; little abstraction is provided.

# **Contents**

| 1. | Intro | oductio  | <b>n</b> 1                                       |
|----|-------|----------|--------------------------------------------------|
|    | 1.1.  | Stages   | of Compilation                                   |
|    | 1.2.  | The ru   | sh Programming Language                          |
|    |       | 1.2.1.   | Features                                         |
| 2. | Ana   | lyzing t | the Source 6                                     |
|    |       | -        | l and Syntactical Analysis                       |
|    |       | 2.1.1.   | Formal Syntactical Definition by a Grammar       |
|    |       | 2.1.2.   | Grouping Characters Into Tokens                  |
|    |       | 2.1.3.   | Constructing a Tree                              |
|    |       |          | Operator Precedence                              |
|    |       |          | Pratt Parsing                                    |
|    |       |          | Parser Generators                                |
|    | 2.2.  | Seman    | tic Analysis                                     |
|    |       | 2.2.1.   | Defining the Semantics of a Programming Language |
|    |       | 2.2.2.   | The Semantic Analyzer                            |
|    |       |          | Implementation                                   |
|    |       |          | Early Optimizations                              |
|    |       |          |                                                  |
| 3. |       |          | g the Program 22                                 |
|    | 3.1.  |          | Valking Interpreters                             |
|    |       | 3.1.1.   | 1                                                |
|    |       | 3.1.2.   | How the Interpreter Executes a Program           |
|    |       | 3.1.3.   | Supporting Pointers                              |
|    | 3.2.  |          | a Virtual Machine                                |
|    |       | 3.2.1.   | Defining a Virtual Machine                       |
|    |       | 3.2.2.   | Register-Based and Stack-Based Machines          |
|    |       | 3.2.3.   | The rush Virtual Machine                         |
|    |       | 3.2.4.   | How the Virtual Machine Executes a rush Program  |
|    |       | 3.2.5.   | Fetch-Decode-Execute Cycle of the VM             |
|    |       | 3.2.6.   | Comparing the VM to the Tree-Walking Interpreter |
| 4. | Com   | piling t | to High-Level Targets 35                         |
|    | 4.1.  | How a    | Compiler Translates the AST                      |
|    | 4.2.  | The C    | ompiler Targeting the rush VM                    |
|    | 4.3.  |          | ilation to WebAssembly                           |
|    |       | 4.3.1.   | WebAssembly Modules                              |
|    |       | 4.3.2.   | The WebAssembly System Interface                 |
|    |       | 4.3.3.   | Implementation                                   |
|    |       |          | Function Calls                                   |
|    |       |          | Logical Operators                                |
|    |       | 4.3.4.   | Considering an Example rush Program              |
|    | 4.4.  |          | LLVM for Code Generation                         |
|    |       | _        | The Role of LLVM in a Compiler                   |

|     |        | 4.4.2.   | The LLVM Intermediate Representation       |     |
|-----|--------|----------|--------------------------------------------|-----|
|     |        |          | Structure of a Compiled rush Program       |     |
|     |        | 4.4.3.   | The rush Compiler Using LLVM               |     |
|     |        | 4.4.4.   | Final Code Generation: The Linker          |     |
|     |        | 4.4.5.   | Conclusions                                |     |
|     | 4.5.   | Transp   | pilers                                     | 57  |
| 5.  | Com    | piling   | to Low-Level Targets                       | 58  |
|     |        |          | Level Programming Concepts                 | 58  |
|     |        | 5.1.1.   | Sections of an ELF File                    |     |
|     |        | 5.1.2.   | Assemblers and Assembly Language           |     |
|     |        | 5.1.3.   | Registers                                  |     |
|     |        | 5.1.4.   | Using Memory: The Stack                    |     |
|     |        |          | Alignment                                  |     |
|     |        | 5.1.5.   | Calling Conventions                        |     |
|     |        | 5.1.6.   | Referencing Variables Using Pointers       |     |
|     | 5.2.   | RISC-    | V: Compiling to a Modern RISC Architecture |     |
|     |        | 5.2.1.   | Register Layout                            |     |
|     |        | 5.2.2.   | Memory Access Through the Stack            |     |
|     |        | 5.2.3.   | Calling Convention                         |     |
|     |        | 5.2.4.   | The Core Library                           |     |
|     |        | 5.2.5.   | RISC-V Assembly                            |     |
|     |        | 5.2.6.   | Supporting Pointers                        |     |
|     |        | 5.2.7.   | Implementation                             |     |
|     |        |          | Struct Fields                              |     |
|     |        |          | Data Flow and Register Allocation          |     |
|     |        |          | Functions                                  |     |
|     |        |          | Let Statements                             |     |
|     |        |          | Function Calls and Returns                 |     |
|     |        |          | Loops                                      |     |
|     | 5.3.   | x86 6    | 34: Compiling to a CISC Architecture       |     |
|     |        |          | x64 Assembly                               | 84  |
|     |        | 5.3.2.   | Registers                                  |     |
|     |        | 5.3.3.   | Stack Layout and Calling Convention        |     |
|     |        | 5.3.4.   | Implementation                             |     |
|     |        |          | Struct Fields                              |     |
|     |        |          | Memory Management                          |     |
|     |        |          | Register Allocation                        |     |
|     |        |          | Functions                                  |     |
|     |        |          | Function Calls                             |     |
|     |        |          | Control Flow                               |     |
|     |        |          | Integer Division and Float Comparisons     |     |
|     |        |          | Pointers                                   |     |
|     | 5.4.   | Conclu   | usion: RISC vs. CISC Architectures         |     |
| 6   | Fina   | l Thou   | ights and Conclusions                      | 97  |
| Ο.  | ı ıııa | 1 11100  | gitts and Conclusions                      | 31  |
| Lis | t of l | Figures  | 5                                          | 100 |
| Lis | t of   | Tables   |                                            | 101 |
| Lis | t of l | Listings | S                                          | 102 |

| Bibliography                                          | 105        |
|-------------------------------------------------------|------------|
| Appendix A. Complete Grammar of rush in EBNF Notation | <b>107</b> |

# 1. Introduction

Nowadays, computer programs are often written in formal, purposefully designed languages. These languages introduce many constraints, such as syntactic and semantic rules, in order to allow programmers to implement algorithms in a structured and precise manner. Advantages of high-level languages are that program development is faster and easier, that programs are easier to maintain, and that programs are portable, meaning that the program can be executed on different architectures [Dan05a, p. 9]. Since programming languages should be easy to write for a human while being easy to understand for a computer, they are often regarded as complicated. The fundamental challenge is that a computer is only able to interpret a sequence of CPU instructions, instead of a written program. Therefore, the source program has to be translated into such a sequence of instructions before it can be executed by the computer. Because programming languages are strictly defined by formal constraints, the translation process can be defined formally as well. Therefore, this translation can be automated and implemented as an algorithm on its own. This process is referred to as compilation, and is usually performed by a program called a compiler. It is apparent that compilation requires significant effort and must obey complex rules since it should translate the source program precisely, without altering its meaning.

Another common method of program execution is to implement a program, often referred to as an *interpreter*, which evaluates the source code directly. Although compilers and interpreters share some of their core principles, the major difference is that the interpreter omits translation. The implementation of an interpreter is often significantly easier and smaller since the interpreter only has to comprehend the source program in order to execute it. In other words, the step of translating the source program into another form can be avoided completely. However, implementing an interpreter is only sensible if it is written using a high-level language like C or Rust, since the implementation language is able to save the programmer a lot of work. If an interpreter was to be implemented using a low-level language like assembly, there would not be much work done by the implementation language. Compared to interpreters, compilers played an essential role in the early days of computing as high-level languages were yet to be developed.

The first compiler was implemented around 1956 and aimed to translate *Fortran* to computer instructions. However, the success of this programming endeavor was not assured until the program was completed. In total, the project involved roughly 18 man-years of work and is thereby regarded as one of the largest programming projects of the time. To this day, new compilers are created, and innovations in the field of programming languages can be observed regularly. Therefore, compiler construction can still be considered a fundamental and relevant topic in computer science [Wir05, p. 6].

# 1.1. Stages of Compilation

In order to minimize intricacy and to maximize modularity, compilation often involves several individual steps. Here, the output of a step serves as the input for the next one. However, partitioning the compiler into too many steps is prone to cause inefficiencies during compilation. Separating the process of compilation into individual steps was the predominant technique until about 1980. Due to limited memory of the computers at the time, a single-step compiler could not be implemented. Therefore, only the individual steps of the

compiler would fit, as each step occupied a considerate amount of machine memory. These types of compilers are called *multi-pass compilers*. However, the output of each step would be serialized and written to disk, ready to be read by the next step. It is obvious that this partitioning leads to a lot of performance overhead, since disk access is significantly slower than memory access. Nowadays, one can mitigate these performance issues by implementing the compiler as a single program. Therefore, the compiler can avoid slow disk operations by keeping intermediate structures solely in memory.



Figure 1.1 – Different steps of compilation [Wir05, pp. 6–7].

Usual steps of modern compilers, as shown in Figure 1.1, are as follows [Wir05, pp. 6–7]:

- 1. The lexical analysis (*lexing*) translates sequences of characters of the source program into their corresponding symbols (*tokens*). Tokens, such as identifiers, operators, and delimiters are recognized by examining each character of the source program in sequential order.
- 2. The syntactical analysis (parsing) transforms the previously generated sequence of tokens into a tree data structure which directly represents the structure of the source program.
- 3. The semantic analysis (analyzing) is responsible for validating that the source program follows the semantic rules of the language. Furthermore, this step often generates a new, similar tree which contains additional type annotations and is affected by early optimizations.
- 4. Code generation traverses the previously generated tree in order to emit a sequence of target-machine instructions. Due to likely constraints considering the target instruction set, the code generation is often considered to be the most involved step of compilation.

Many modern compilers tend to combine steps 1 and 2 into a single step. Using this approach, the parser accesses a lexer directly, instructing it to return the next token when the parser demands it. Using this approach, memory usage is minimized since the parser only considers a few tokens instead of a complete sequence [Mog17, Chapter 1]. Figure 1.2 shows an altered chart considering this change.



Figure 1.2 – Different steps of compilation. (altered)

## 1.2. The rush Programming Language

For this paper, we have developed and implemented a simple programming language called  $rush^1$ . The language features a *static type system*, six different data types<sup>2</sup>, *arithmetic operators*, *logical operators*, *local and global variables*, *pointers*, *if-else expressions*, *loops*, and *functions*. In order to introduce the language, we will now consider the code in Listing 1.1.

```
fn main() {
2
        exit(fib(10));
3
4
    fn fib(n: int) -> int {
5
        if n < 2 {
6
7
             n
        } else {
8
             fib(n - 2) + fib(n - 1)
9
10
    }
11
```

**Listing 1.1** – Generating Fibonacci numbers using rush.

This rush program can be used to generate numbers included in the Fibonacci sequence. In the code, a function named 'fib' is defined using the 'fn' keyword. This function accepts the parameter 'n', which denotes the position in the Fibonacci sequence of the number to be calculated. Since 'int' is specified as the type of the parameter, the function may be called using any integer value as its argument. However, the constraint  $n \in \mathbb{N}_0$  must be valid in order for this function to return the correct result<sup>3</sup>. In this example, the 'main' function calls the 'fib' function using the natural number '10'. In rush, every valid program must contain exactly one 'main' function since that is where program execution will start. Even though rush provides a 'return' statement, the body of the 'fib' function does not contain one. This is because blocks, like the one of the 'fib' function, return the result of their last expression. The if-else construct is also an expression since it represents the last entity in the block, and is not followed by a semicolon. If 'n' is less than '2', it is returned without modification. Otherwise, the function calls itself recursively in order to calculate the sum of the preceding Fibonacci numbers 'n-2' and 'n-1'. The result value of the entire if-expression is calculated by using one of the two branches. Since the if-else construct is also an expression, there is no need for a redundant 'return' statement. In line 2, the 'exit' function is called. Even though this function is not defined anywhere, the code still executes without any errors. This is due to the fact that the 'exit' function is a built-in function that is used to exit a program with the specified exit code.

#### 1.2.1. Features

As outlined previously, rush implements various features, which are also found in many other relevant programming languages today. Generally, rush is a *procedural* programming language, meaning that code is partitioned into *procedures*, which are commonly referred to as "functions". This paradigm has been selected as it is also common across most of today's popular programming languages [TN07, p. 278]. The language provides many features which are meant to increase developer productivity. For instance, a program might use a 'loop' to

<sup>&</sup>lt;sup>1</sup>Capitalization of the name is omitted intentionally.

<sup>&</sup>lt;sup>2</sup>Including the 'int', 'float', 'bool', 'char', unit, and never type.

<sup>&</sup>lt;sup>3</sup>Assuming the function should comply with the original Fibonacci definition.

implement code repetition easily. For reference, Table 1.1 shows some features of the rush programming language.

Just like most other compiled languages, rush also includes a *static type system*, meaning that every variable has a type, which cannot change during runtime. For instance, after a new variable, which can hold integers, was defined, only integer values may be assigned to it. This way, the compiler can validate the semantic correctness of the program, showing error messages if an erroneous program is provided. Apart from mandating a program's correctness, a static type system also makes the program easier to read for humans. Therefore, the rush programming language is designed in a way that promotes correctness and programmer efficiency. Table 1.2 provides an overview of all the data types rush provides.

Table 1.3 – Lines of code of the project's components in commit '93ee068'.

| Component                | Lines |
|--------------------------|-------|
| lexer and parser         | 2737  |
| tree-walking interpreter | 578   |
| VM compiler and runtime  | 1288  |
| Wasm compiler            | 1584  |
| LLVM compiler            | 1450  |
| RISC-V compiler          | 2234  |
| x64 compiler             | 2751  |

While the first four types are self-explanatory, the last two demand special explanation. The unit type is used in order to denote that a function returns no value. This concept is comparable to *void* in other languages like C or Java. In rush however, the unit type is treated like a value at runtime. Even though the unit type can be saved inside a variable or used as a function parameter, all rush compilers simply ignore it since it holds no value. In Listing 1.1, the 'main' function also returns the unit type. A function named 'foo' can be defined like 'fn foo() -> () {}', as well as 'fn foo() {}', because the unit type is the default. The never type is used in order to denote that an expression, or statement, interrupts the normal control flow, so that the expression never yields a value. Because the built-in 'exit' function terminates

a program, its result will never be of relevance at runtime. However, the fact that the never type is returned is relevant as this information is valuable for the semantic analysis and the compilers. Furthermore, 'break', 'continue', and 'return' also result in the never type. It is to be noted that the type cannot be used explicitly. That means the usage of syntax like 'let a: ! = exit(0);' is illegal.

In the rush project, most of the previously presented stages of compilation are implemented as their own individual crates<sup>4</sup>. This way, each component of the programming language can be developed, tested, and maintained separately. In the Git commit '93ee068', the entire rush project consisted of 17460 lines of source code<sup>5</sup>. On the first sight, this might seem like a large number for a simple programming language. However, the rush project includes a lexer, a parser, a semantic analyzer, two interpreters, four compilers, and several other tools like a language server for editor support. Table 1.3 shows the lines of code of the project's individual components. Here, it becomes apparent that the tree-walking interpreter contains the least amount of code, as it is the simplest. The compilers targeting high level targets both require around 1500 lines. As for the low-level compilers, they both require over 2000 lines of code, with x64 requiring around 500 lines more than RISC-V. By considering this table, the complexity of the individual components becomes apparent.

<sup>&</sup>lt;sup>4</sup>A crate is a software library in Rust terms.

<sup>&</sup>lt;sup>5</sup>Blank lines and comments are not considered for all specifications of line count.

**Table 1.1** – Most important features of the rush programming language.

| Name                | Description                                                           | Example                       |
|---------------------|-----------------------------------------------------------------------|-------------------------------|
| unconditional loop  | Executes code repeatedly.                                             | loop { }                      |
| while-loop          | Like 'loop', but checks a condition before each iteration.            | while x < 5 { }               |
| for-loop            | Like 'while', but executes an update-expression after each iteration. | for i = 0; i < 5; i += 1 { }  |
| if-expression       | Executes different code based on a condition.                         | if true { } else { }          |
| function definition | Defines a function, which can be called later.                        | <pre>fn foo(n: int) { }</pre> |
| infix-expression    | Performs mathematical and logical operations using two values.        | 1 + n; 5 ** 2                 |
| prefix-expression   | Performs mathematical and logical operations using one value.         | !false; -n                    |
| let-statement       | Defines a variable with an initial value.                             | let mut answer = 42;          |
| cast-expression     | Converts a value into a different type.                               | 42 as float                   |

 ${\bf Table} \ {\bf 1.2} - {\rm Data} \ {\rm types} \ {\rm in} \ {\rm the} \ {\rm rush} \ {\rm programming} \ {\rm language}.$ 

| Notation | Example                      | Description                                                                                                                                                                                                                               |
|----------|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 'int'    | let a: int = 0;              | Values of the <i>int</i> type can represent 64-bit signed integers, that is, $[-2^{63}; 2^{63} - 1]$ .                                                                                                                                    |
| 'float'  | let b: float = 3.14;         | Values of the <i>float</i> type can represent 64-bit double precision floating-point numbers, as defined by the <i>IEEE 754–2008</i> standard.                                                                                            |
| 'bool'   | let c: bool = true;          | Values of the <i>bool</i> type can represent either 'true' or 'false'.                                                                                                                                                                    |
| 'char'   | let d: char = 'a';           | Values of the <i>char</i> type can be any integer in the range [0; 127], representing any valid ASCII character.                                                                                                                          |
| '()'     | let e: () = main();          | The <i>unit</i> type is an empty tuple representing nothing, comparable to <i>void</i> in other languages. It is the implicit return type of functions in rush and is commonly used in functional languages like Haskell [Ser19, p. 208]. |
| '!'      | <pre>let f = exit(42);</pre> | The <i>never</i> type can never be materialized, it exists to express that an expression diverges and therefore yields no value.                                                                                                          |

# 2. Analyzing the Source

# 2.1. Lexical and Syntactical Analysis

As previously mentioned, the first step during compilation or program execution consists of the *lexical* and *syntactical analysis*. Program source text is, without previous processing, just *text*, i.e., a sequence of characters. Before the computer can even begin to analyze the semantics and meaning of a program, it first has to *parse* the program source text into an appropriate data structure. This is done in two steps that are closely related and often combined. The *lexical analysis*, performed by a *lexer*, and the *syntactical analysis*, performed by a *parser*.

#### 2.1.1. Formal Syntactical Definition by a Grammar

Just like every natural language, most programming languages also conform to a grammar. However, grammars for programming languages are most often type 2 or 3 in the Chomsky hierarchy, that is, context-free and regular languages [Wat17, pp. 23f]. Additionally, it is not uncommon for parser writers to formally define the grammar using some notation. Popular options include  $BNF^1$  and  $EBNF^2$ , the latter of which is used here. Although this paper does not fully explain these notations, Listing 2.1 shows a short example grammar notated using EBNF. For reference, Appendix A contains the full grammar of rush.

**Listing 2.1** – Grammar for basic arithmetic in EBNF notation.

One thing worth explaining is the '-' at the end of line 5. It is used to exclude a set of symbols from the preceding rule. As in this case, there is no symbol following the dash, the empty word, called  $\varepsilon$ , is excluded from the preceding repetition. Thus, the repetition cannot be repeated zero times here, as that would produce the empty word, which is excluded. So the notation '{ ... }-' is used to represent repetitions of one or more times, whereas the same without the trailing dash represents repetitions of zero or more times.

## 2.1.2. Grouping Characters Into Tokens

Before the syntax of a program is validated, it is common to have a lexer group certain sequences of characters into *tokens*. The set of tokens a language provides is the union of the set of all terminal symbols used in context-free grammar rules and the set of regular

<sup>&</sup>lt;sup>1</sup>"Backus-Naur Form", named after the two main inventors [Bac+60].

<sup>&</sup>lt;sup>2</sup>"Extended Backus-Naur Form", an extended version of *BNF* with added support for repetitions and options without relying on recursion, first proposed by Niklaus Wirth in 1977 [Wir77] followed by many slight alterations. The version used in this paper is defined by the ISO/IEC 14977 standard.

grammar rules. For the language defined in Listing 2.1 these are the five operators '+', '-', '\*', '/', '\*\*', and the 'integer' non-terminal.

Although, the specifics of implementing a lexer are not explored in this paper, a basic overview is still provided. The base principle of a lexer is to iterate over the characters of the input to produce tokens. Depending on the target language, it might be required to scan the input using an *n*-sized window, i.e., observing *n* characters at a time. In the case of rush, this *n* is '2', resulting in the 'Lexer' struct not only storing the current character, but also the next one, as seen in Listing 2.2. For clarity, Table 2.1 shows the values of 'curr\_char' and 'next\_char' while processing of the input '1+2\*\*3'. Here, every row in the

```
crates/rush-parser/src/lexer.rs

pub struct Lexer<'src> {
    input: &'src str,
    reader: Chars<'src>,
    location: Location<'src>,
    curr_char: Option<char>,
    next_char: Option<char>,
}
```

**Listing 2.2** – The rush 'Lexer' struct definition.

table represents one point in time displaying the lexer's current state.

| Calls | State       | curr_char | next_char | Output Token |
|-------|-------------|-----------|-----------|--------------|
| 0     | 1 + 2 * * 3 | None      | None      |              |
| 0     | 1 + 2 * * 3 | None      | Some('1') |              |
| 0     | 1 + 2 * * 3 | Some('1') | Some('+') |              |
| 1     | 1 + 2 * * 3 | Some('+') | Some('2') | Int(1)       |
| 2     | 1 + 2 * * 3 | Some('2') | Some('*') | Plus         |
| 3     | 1 + 2 * * 3 | Some('*') | Some('*') | Int(2)       |
| 4     | 1 + 2 * * 3 | Some('*') | Some('3') |              |
| 4     | 1 + 2 * * 3 | Some('3') | None      | Pow          |
| 5     | 1 + 2 * * 3 | None      | None      | Int(3)       |

Table 2.1 – Advancing window of a lexer.

As explained in Section 1.1, many modern language implementations have the lexer produce tokens on demand. Thus, a lexer requires one public method, 'next\_token' for instance, reading and returning the next token. In Table 2.1, the column on the left displays how many times the 'next\_token' method has been called by the parser. In the first three rows, this count is still '0', as this happens during initialization of the lexer in order to fill the 'curr\_char' and 'next\_char' fields with sensible values before the first token is requested. The 'Pow' token, composed of two '\*' characters, requires the lexer to advance two times before it can be returned, which is represented by the two rows in which the call count is '4'. A simplified 'Token' struct definition for the example language from Listing 2.1 is shown as part of Listing 2.3.

In addition to the current and next character, a lexer also has to keep track of the current position in the source text for it to generate helpful messages with locations in case the program is syntactically malformed. This is done in the 'location' field, which is incremented every time the lexer advances to the next character. While producing a token, the lexer can read this field, once at the start, and once after having read the token, in order to save the two values as the token's span.

A special case worth mentioning are comments. As explained later in Section 2.1.3, depending on the parser, comments may simply be ignored and skipped during lexical analysis, or could get their own token kind and be treated similarly to string literals.

#### 2.1.3. Constructing a Tree

The parser uses the generated tokens in order to construct a tree which represents the program's syntactic structure. Depending on how the parser should be used, this can either be a *Concrete Syntax Tree* (CST) or an *Abstract Syntax Tree* (AST). The former still contains

```
struct Token {
         kind: TokenKind,
2
3
         span: Span,
    }
4
5
    enum TokenKind {
6
7
         Int(u64),
8
         Plus,
9
         Minus, //
10
         Star,
11
         Slash,
12
         Pow,
13
    }
14
15
16
    struct Span {
17
         start: Location,
         end: Location,
18
    }
19
```

**Listing 2.3** – Simplified 'Token' struct definition.

information about all input tokens with their respective locations, while the latter only stores the abstract program structure, focusing on just the relevant information for basic analysis and execution. Therefore, a CST is usually used for development tools like formatters and intricate linters<sup>3</sup> where it is important to preserve stylistic choices made by the programmer, for which knowing the exact location of every token is beneficial. However, an AST is enough for interpretation and compilation as it preserves the semantic meaning of the program. Figure 2.1 shows an AST for the program '1+2\*\*3' in the language notated in Listing 2.1 on page 6. In the case of rush, an AST with limited location information is used because rush's semantic analyzer is basic enough to not require precise locations of every token, and, as discussed, execution and compilation requires no CST.

However, not every parser is the same, and there are different strategies for implementing one, depending on the requirements of the language. These strategies are categorized into top-down and bottom-up parsers. The main difference between them is the order in which they construct the tree. Top-down parsers construct the syntax tree from top to bottom, starting with the root node. Bottom-up parsers instead construct the

tree from the leafs at the bottom upwards to the root.

Top-down and bottom-up parsers are further categorized into many more subcategories. The two discussed here are so-called LL(k) parsers and LR(k) parsers. These are named after the direction in which the tokens are read, L being from left to right, and the kind of derivation they perform. L is the leftmost derivation and R the rightmost derivation [Wat17, pp. 75f]. The parenthesized k represents a natural number with  $k \in \mathbb{N}_0$ , describing the number of tokens for lookahead. Often, k is either '1' or '0', forming a window of width two or one re-



Figure 2.1 – Abstract syntax tree for '1+2\*\*3'.

spectively. This window moves just like previously explained for the lexer, and observes k tokens, not characters, simultaneously. Since in most cases k is '1', it is common to omit specifying it and to just speak of LL and LR parsers [Wat17, pp. 86–88].

Alternatively to using lookahead tokens, it is possible to create parsers utilizing backtracking. With that approach, a parser has to guess which syntax construct follows if it cannot decide based on the first token. Later, when the parser detects an unexpected symbol, instead of throwing a syntax error, it cancels and returns to the point of decision-making to try the next possible construct. This usually comes with overhead and unclear error messages, which is why this method is rarely used and the superior lookahead method is preferable.

<sup>&</sup>lt;sup>3</sup>A linter is a tool which suggests how code quality might be improved.

An example for LR parsing is the *shift-reduce* parsing approach, which is outlined later. However, it is to be noted that LR parsers are generally very complicated to implement manually. On the contrary, LL parsers are usually much simpler to implement, but come with a limitation. By design, they must recognize a node by its first n = k + 1 tokens, where n is the window size. However, due to that restriction, not every context-free language can be parsed by an LL parser. An example for that is given in Listing 2.4.

```
Expression = Expression , ( '+' | '-' | '*' | '/' | '**' ) , Expression
| '(' , Expression , ')'
| integer ;
| integer = { '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' }-;
```

**Listing 2.4** – Example language a traditional LL(1) parser cannot parse.

Here, the problem lies in the left recursion introduced by the 'Expression' rule. For instance, the input '1+2\*3' would already cause difficulties. LL parsers must be able to determine which variant of a rule follows based on the first token they encounter. However, when encountering the 'Int(1)' token, both the 'integer' and 'Expression' variants would present valid substitutions, therefore introducing ambiguity. The left recursion causes the 'Expression' rule to include the same start tokens as the 'integer' rule since an 'Expression' can also start with an 'integer'. In other words, the parser cannot determine whether the '1' is an entire 'Expression' on its own, or whether it is just the start of another 'Expression' [Wat17, p. 85].

Most LL parsers, including rush's, are recursive-descent parsers. Implementation of such a parser is rather uncomplicated. Assuming the grammar respects the mentioned limitation and that an object-oriented approach is used, every context-free grammar rule is mapped to one method on a 'Parser' struct. In the example grammar from Listing 2.1 on page 6, these are all the capitalized rules highlighted in yellow. Additionally, a matching node type is defined for each context-free rule, holding the relevant information for execution. For Rust, mapping from EBNF grammar notation to type definitions is very simple. Some examples are displayed in Table 2.2.

| <b>Table 2.2</b> – Mapping from EBNF grammar to Rust type definitio | Table 2.2 - | - Mapping | from EE | 3NF gramma | r to Rust | type definition |
|---------------------------------------------------------------------|-------------|-----------|---------|------------|-----------|-----------------|
|---------------------------------------------------------------------|-------------|-----------|---------|------------|-----------|-----------------|

```
EBNF
                                       Rust
A = B , C ;
                                       struct A { b: B, c: C }
A = B , [ C ] ;
                                       struct A { b: B, c: Option<C> }
A = B , \{C\};
                                       struct A { b: B, c: Vec<C> }
                                       struct A { b: B, c: Vec<C> }
A = B , { C }-;
A = B \mid C;
                                       enum A \{ B(B), C(C) \}
                                       struct A { b: B, op: Op, c: C }
A = B , ( '+' | '-' ) , C ;
                                       enum Op { Plus, Minus }
                                       struct A { b: B, c: Option<(XOrY, C)> }
A = B , [ ( X | Y ) , C ] ;
                                       enum XOrY { X(X), Y(Y) }
```

#### **Operator Precedence**

As discussed previously, a traditional LL parser cannot parse the language described by the grammar in Listing 2.4. However, when comparing the grammar to the one shown in Listing 2.1, it might become obvious that the two grammars notate the same language. The first one simply provides additional information about the order of nesting different kinds of expressions, called *precedence*. For example, when parsing the expression '1+2\*3', the '2\*3' part should be nested deeper in the tree for it to be evaluated first. In Listing 2.1, this is achieved by recognizing multiplicative expressions as 'Term's and having additive expressions be composed of multiple 'Term's. Listing 2.4 does not indicate this order itself, so it must be provided externally.

Additionally, a precedence may be either left- or right-associative. The input '1\*2\*3' should be evaluated from left to right, so first '1\*2', and then the result times three. Now considering '1\*\*2\*\*3'4, the '2\*\*3' should be evaluated first, and afterward '1' should be raised to the power of the result. That means, while most operators are evaluated from left to right, that is, they are left-associative, some operators, like the power operator, are evaluated from right to left and are therefore right-associative. In Listing 2.1, left-associativity is achieved by allowing simple repetition of the operator for an indefinite amount of times. Right-associativity instead uses recursion on the right-hand side of the operator.

For LR parsers, the precedence and associativity for each operator is encoded within the parser table, which is explained later. However, there is also a technique called *Pratt parsing*<sup>5</sup> that allows slightly modified recursive-descent LL parsers to parse such languages, given a map of tokens with their respective precedence and associativity. Often, the grammars without included precedence are preferred, because they usually result in a simpler structure of the resulting syntax tree. This can be seen when compar-



Figure 2.2 – Abstract syntax tree for '1+2\*\*3' using Pratt parsing.

ing Figure 2.1 from earlier to Figure 2.2 which shows the resulting AST for the same input using the alternative language representation. Most notably, the rather long sequences of nodes with just a single child, like the path on the left simply resolving to a single 'Int(1)' token, are gone in Figure 2.2.

#### **Pratt Parsing**

As the rush parser makes use of Pratt parsing, the following code snippets are taken from there. First, a mapping from a token kind to its precedence must be defined. The one for rush is found in Listing 2.5. It shows the 'prec' method implemented on the 'TokenKind' enumeration. The return type is a tuple of two integers, one for left and one for right precedence. For all but one token kind, the left precedence is lower than the right one, resulting in left-associativity. The higher a precedence is, the deeper in the tree the resulting expression will be, and the earlier it will be evaluated. All unrelated tokens are simply assigned a precedence of '0' for left and right.

The 'expression' method on the 'Parser' struct is then modified to take a parameter for the current precedence as seen in Listing 2.6. It then first considers the current token kind to decide which expression to parse, and stores the result in the 'Lhs' variable. Afterward it checks whether the left precedence of the now current token is larger than the 'prec' argument. When called from elsewhere, like in the condition of a while-loop or in a grouped expression, the 'prec' argument has its minimum value of '0', as shown in Listing 2.7. In

<sup>4&#</sup>x27;\*\* is the power operator here, so the input would be written as 1<sup>23</sup> using mathematical notation.

<sup>&</sup>lt;sup>5</sup>First introduced by Vaughan Pratt in 1973 [Pra73].

<sup>&</sup>lt;sup>6</sup>Short for "left-hand side".

```
crates/rush-parser/src/token.rs _
    pub(crate) fn prec(&self) -> (u8, u8) {
172
173
         match self {
             // ...
194
             TokenKind::Plus | TokenKind::Minus => (19, 20),
             TokenKind::Star | TokenKind::Slash | TokenKind::Percent => (21, 22),
195
196
             TokenKind::As \Rightarrow (23, 24),
             TokenKind::Pow => (26, 25), // inverse order for right associativity
197
             TokenKind::LParen => (28, 29), // for calls
198
199
             _ => (0, 0),
200
    }
201
```

**Listing 2.5** – Token precedences in rush.

that case, this check will only fail when the whole expression is over, as every non-operator token is assigned a precedence higher than 0. If it does not fail, the 'infix\_expr' method is called with the matching operator and 'lhs'. Afterward, 'lhs' is overwritten with the returned value.

```
crates/rush-parser/src/parser.rs
     fn expression(&mut self, prec: u8) -> Result<'src, Expression<'src>> {
551
         let start_loc = self.curr_tok.span.start;
552
553
554
         let mut lhs = match self.curr_tok.kind {
555
             TokenKind::Int(num) => Expression::Int(self.atom(num)?),
             TokenKind::LParen => Expression::Grouped(self.grouped_expr()?),
568
             invalid => {
569
                 return Err(Error::new_boxed(/* ... */));
570
575
         };
577
578
         while self.curr_tok.kind.prec().0 > prec {
579
             lhs = match self.curr_tok.kind {
                 TokenKind::Plus => self.infix_expr(start_loc, lhs, InfixOp::Plus)?,
                 TokenKind::Star => self.infix_expr(start_loc, lhs, InfixOp::Mul)?,
                 TokenKind::Slash => self.infix_expr(start_loc, lhs, InfixOp::Div)?,
582
                 TokenKind::Pow => self.infix_expr(start_loc, lhs, InfixOp::Pow)?,
583
                 // ...
                 _ => return Ok(lhs),
613
             };
614
         }
615
616
         Ok(lhs)
617
     }
618
```

**Listing 2.6** – Pratt-Parser: Implementation for expressions.

The 'infix\_expr' method in Listing 2.8 simply stores the precedence for the right side of the operator token, advances to the next token, and calls the 'expression' method again for its right-hand side, but this time with the stored 'right\_prec' as the minimum precedence. These simple calls and checks of precedences automatically result in correct grouping and nesting of the expressions.

In order to understand the concept better, Figure 2.3 can be considered. It shows the tokens with their respective left and right precedences for the input '(1+2\*3)/4\*\*5', which represents the mathematical expression  $\frac{1+2\cdot3}{4^5}$ . Precedences which are irrelevant for this example have been colored in gray. Starting from the left, the parser first encounters an 'LParen' token and therefore decides to parse a grouped expression. Inside the 'grouped\_expr'

```
crates/rush-parser/src/parser.rs -
     fn grouped_expr(&mut self) -> Result<'src, Spanned<'src, Box<Expression<'src>>>> {
733
734
         let start_loc = self.curr_tok.span.start;
735
         // skip the opening parenthesis
736
         self.next()?;
737
         let expr = self.expression(0)?;
738
739
         self.expect_recoverable(
             TokenKind::RParen,
740
             "missing closing parenthesis",
741
             self.curr_tok.span,
742
         )?;
743
         // ...
    }
749
```

**Listing 2.7** – Pratt-Parser: Implementation for grouped expressions.

```
fn infix_expr(/* ... */) -> Result<'src, Expression<'src>> {
   let right_prec = self.curr_tok.kind.prec().1;
   self.next()?;
   let rhs = self.expression(right_prec)?;
   // ...
}
```

**Listing 2.8** – Pratt-Parser: Implementation for infix-expressions.

method, 'expression' is called again, with 'prec' being '0' once more. The difference is that now the current token is 'Int(1)'. That token is then parsed as a simple integer expression and stored in 'lhs'. Now, the left precedence of the 'Plus' token, '19', is higher than the value of 'prec', '0', so the while-loop is entered. In there, the 'infix\_expr' method is called, which queries the right precedence of the 'Plus' token and calls 'expression' again, this time with a precedence of '20'. Skipping to the precedence check, the current token's left precedence is higher than 'prec' again, that is, '21' for the 'Star' token is higher than '20' for 'prec'. Therefore, 'infix\_expression' is called once again, which calls 'expression' again, but now during the next precedence check, the left precedence of the following 'RParen' token, '0', is not higher than the current precedence of '22'. That means, the innermost 'expression' call returns with a single '3'. This '3' is then used as the right-hand side for the multiplicative infix-expression. The same 'RParen' precedence check fails again for the 'expression' call with a precedence of '20'. Thus, the '2\*3' part together forms the right-hand side of the addition expression. Once more, the left precedence of 'RParen', '0', is not larger than the initial precedence of '0', hence the parser returns to the 'grouped\_expr' call with the entire contents of the parentheses. Here, the right parenthesis is skipped, and the method returns to the outermost 'expression' call, assigning the entire grouped expression to 'lhs'.

In order to understand the right-associativity of the 'Pow' token, the reader is advised to continue going through this procedure for the rest of the example input. A curious reader might additionally want to parse the inputs '1\*2\*3', '1\*\*2\*\*3', and '1+(2+3)' by hand. It



Figure 2.3 – Token precedences for the input '(1+2\*3)/4\*\*5'.

should then become clear how Pratt parsing achieves parsing with precedences without them being encoded in the grammar, and with that, the node types.

#### **Parser Generators**

For most purposes, it is generally not necessary to implement parsers, and with that, lexers, manually. Instead, there are so-called *parser generators* that generate parsers based on some specification of the desired syntax and the required precedences. Often, parser generators define a domain specific grammar notation for the syntax specification, although some parser generators also accept pre-existing grammar notations as their input. Most parser generators target some variation of *table-driven* LR parsers. Table-driven parsers use a table, mapping all possible combinations of tokens on the parse stack and lookahead tokens to an action. For shift-reduce parsers, the possible actions are shifting, that is, reading the next input token and pushing it to the stack, and reducing, which pops some number of elements off the stack, and in place pushes a node to it. The input syntax must therefore either be unambiguous or augmented by precedence rules, so that a complete parser table can be generated [Jef21, pp. 76–81].

# 2.2. Semantic Analysis

Before compilation can begin, both the syntax and the semantics of the program have to be validated. The *semantic analysis* is responsible for validating that the structure and logic of the program complies with the rules of the programming language. Often, semantic analysis directly follows the syntax analysis since the parser generates the input for the semantic analysis step.

#### 2.2.1. Defining the Semantics of a Programming Language

Often, a programming language is not just defined by its grammar since it cannot specify how the programming language should behave. This behavior can be defined through a so-called semantic specification. Apart from describing behavior, the specification regularly states which semantic rules discriminate a valid program from an invalid one. Common rules include type checking, context of statements, or integer overflow behavior. Another example of a semantic rule is that a variable has to be declared before it is used. Defining the semantic rules of a programming language is often a demanding task since not all requirements are clear from the beginning. Furthermore, they usually cannot be defined formally without significant effort [Hol12, pp. 5f]. A language designer typically chooses to write their specification in a natural language. However, due to the specification being written in such a language, it can sometimes be ambiguous. Therefore, a well-written semantic specification should avoid ambiguity as much as possible. Since those rules define when a program is valid, they have to be checked and enforced before program compilation can start [Wat17, p. 21].

## 2.2.2. The Semantic Analyzer

Because rush shares its semantic rules across all backends, it would be cumbersome to implement semantic validation in each backend individually. Therefore, it is reasonable to implement a separate compilation step which is responsible for validating the source program's semantics. Among other checks, the so-called *semantic analyzer*<sup>7</sup> validates types and variable references while adding type annotations to the AST. The last aspect is of particular importance since all compiler backends rely on type information. For obtaining

<sup>&</sup>lt;sup>7</sup>Later referred to as "analyzer".

type information, the AST of the source program has to be traversed so that the analyzer can examine and validate the program thoroughly. In order to preserve a clear boundary between the individual compilation steps, the parser only validates the program's syntax without performing further validation.

```
fn main() {
    let two = 2;
    let three = 3;
    exit(two + three);
}
```

**Listing 2.9** – A rush program which adds two integers.

In order to understand why type information is required at compile time, Listing 2.9 should be considered. It displays a basic rush program calculating the sum of two integers, using the result as its exit code. In this example, the exit code of the program is '5', as '2' and '3' are added together. The analyzer will first check whether the program contains a 'main' function. If this is not the case, the analyzer rejects the program because it violates rush's semantic specification. Furthermore, the analyzer assures that the 'main' function takes no parameters and returns no value. In this example, there is a valid 'main' function which complies with the previously

explained constraints. Now, the analyzer traverses the function body of the 'main' function. First, the analyzer examines the statements in lines 2 and 3. Since 'let' statements are used to define variables, the analyzer will add the variables 'two' and 'three' to its current scope. However, unlike an interpreter, the analyzer does not insert the variable's value into its scope. Instead of concrete values, the analyzer only considers types. Therefore, in this example, the analyzer remembers that the variables 'two' and 'three' store integer values. In line 4, the analyzer checks that the identifiers 'two' and 'three' refer to valid variables. Just like most other programming languages, rush does not allow the addition of two boolean values for example. Therefore, the analyzer checks that the operands of the '+' operator have the same type, and that this type is valid in additions. Because this validation requires information about types, the analyzer accesses its scope when looking up the identifiers 'two' and 'three'. Since those names were previously associated with the 'int' type, the analyzer is now aware of the operand types and can check their validity. Calculating the sum of two integers is allowed and results in another integer value. Since rush's semantic specification states that the 'exit' function requires exactly one integer parameter, the analyzer has to check that it is called correctly. Apart from this call to the 'exit' function, the analyzer validates all function calls and definitions, not just the ones of built-in functions.

As indicated previously, most compilers require type information while generating target code. For simplicity, we will consider a fictional compiler which can compile both integer and float additions. However, the fictional target machine requires different instructions for addition depending on the type of the operands. For instance, integer addition uses the 'intadd' instruction while float addition uses the 'floatadd' instruction. Here, type ambiguity would cause difficulties. If there was no semantic analysis step, the compilation step would have to implement its own way of determining the types of the operands. However, determining these types requires a complete tree traversal of the operand expressions. Due to the recursive design of the AST, implementing this tree traversal would require a large amount of source code in the compiler. However, the implementation of this algorithm would be nearly identical across all of rush's compiler backends. In order to prevent this duplication, rush's semantic analyzer also annotates the AST with type information which is utilized by later steps of compilation.

#### **Implementation**

In order to obtain a deeper understanding of how the analyzer works, parts of its implementation and how they behave when analyzing the previous example from Listing 2.9 should be considered. Firstly, the most important struct fields should be discussed.

```
pub struct Analyzer<'src> {
12
        functions: HashMap<&'src str, Function<'src>>,
13
        diagnostics: Vec<Diagnostic<'src>>
14
        scopes: Vec<HashMap<&'src str, Variable<'src>>>,
15
        curr_func_name: &'src str,
16
        /// Specifies the depth of loops, `break` / `continue` legal if > 0.
17
18
        loop_count: usize,
        // ...
    }
26
```

**Listing 2.10** - Fields of the 'Analyzer' struct.

Listing 2.10 displays the struct fields of the analyzer. The field 'functions' in line 13 associates a function name with the function's sig-Therefore, if a function is nature. called at a later point in time, the analyzer checks whether the function exists so that it can validate the arguments. The next field, 'diagnostics', contains a list of diagnostics. Α 'Diagnostic' is a struct which represents a message; it is intended to be displayed to a programmer using the language. Each diagnostic has a severity, such as warning and error. After the analyzer has finished the tree traversal, all diagnostics are displayed in a user-friendly manner. The 'scopes' field in line 15 is responsible

**Listing 2.11** – Output when compiling an invalid rush program.

for managing variables. In rush, blocks created with braces ('{}') also introduce new scopes. If the analyzer enters such a block, a new scope is pushed onto the 'scopes' stack. Each scope maps a variable identifier to some variable-specific data. For instance, the analyzer keeps track of variable types, whether variables have been used or mutated later, and the location of their definition. By saving this much information about each declared variable, the analyzer can produce very helpful and accurate error messages or warnings. For reference, compiler output which incorporates diagnostics is displayed in Listing 2.11, it occurs when a new value is assigned to an immutable variable.

Moreover, the field 'curr\_func\_name' saves the name of the current function. This field is used in order to highlight unused functions. If a function is called from a different function, it is marked as used. However, if the call originates from the same function, it is still considered unused. Furthermore, the 'loop\_count' field is used to validate the uses of the 'break' and 'continue' statements. Because these statements are only valid inside loop bodies, the value of 'loop\_count' must be greater than '0' at the point when the analyzer encounters such a statement. This counter is incremented as soon as the analyzer begins traversal of a loop body, and is decremented after the body has been traversed. This field is implemented as a counter, rather than a boolean, in order to support nested loops.

Now that important attributes have been highlighted, the example in Listing 2.9 can be considered. First, the analyzer traverses and examines all functions and their bodies. For every rush function, the analyzer invokes an internal method responsible for validating them. Among other tasks, this method inserts a new entry into the 'functions' map. Because a

'main' function is mandatory in every rush program, the analyzer simply checks that the 'functions' map contains an entry for it after all functions have been traversed. Listing 2.12 shows how validating the 'main' function's signature works.

```
crates/rush-analyzer/src/analyzer.rs
     fn function_definition(
396
         &mut self,
397
         node: FunctionDefinition<'src>,
398
399
     ) -> AnalyzedFunctionDefinition<'src> {
         // set the function name
400
         self.curr_func_name = node.name.inner;
401
402
         if node.name.inner == "main" {
403
             // the main function must have 0 parameters
404
             if !node.params.inner.is_empty() {
405
                 self.error(/* ... */);
406
419
420
             // the main function must return `()`
421
             if let Some(return_type) = node.return_type.inner {
422
                 if return_type != Type::Unit {
423
                      self.error(/* ... */);
424
433
             }
434
435
         let block = self.block(node.block, false);
490
535
         AnalyzedFunctionDefinition {
             used: true, // is modified in Self::analyze()
536
537
             name: node.name.inner,
538
             params,
             return_type: node.return_type.inner.unwrap_or(Type::Unit),
539
540
             block.
         }
541
    }
542
```

**Listing 2.12** – Analyzer: Validation of the 'main' function's signature.

This code displays the 'function\_definition' method of the analyzer. In this listing, only the code relevant for analyzing the 'main' function is shown. However, this method is used to analyze any function, not just 'main'. The method takes a 'FunctionDefinition' as its input and returns an 'AnalyzedFunctionDefinition'. Since a rush file might contain multiple functions, this method is invoked for each defined function.

In line 401, the method updates the current function name. The if-clause in line 403 checks whether the method is currently analyzing the 'main' function. If this is the case, additional checks are performed. The next if-expression in line 405 checks whether the node's 'params' vector contains any items. If this is the case, the 'main' function's declaration includes parameters, and the analyzer generates an appropriate error message. However, error handling in the analyzer has not been discussed so far. As the method's signature states, the method cannot return any errors. Instead, the 'error' method is invoked in line 406. This method uses information about the error to be generated in order to append a new 'Diagnostic' to the 'diagnostics' vector.

Secondly, in rush, the 'main' function's return type always has to be '()'. The analyzer validates this constraint in lines 422–423. The first if-expression checks whether the function contains a manually defined result type. In rush, every function definition defaults to the unit type. Therefore, the inner if-expression is only executed if the user has manually specified a result type for their 'main' function. The analyzer then checks whether the manually specified

type differs from the required unit type. In case it does, the analyzer generates another error which describes the issue in line 424.

After the signature of the 'main' function has been validated, the method begins traversal of the function body. In line 490, the 'block' method of the analyzer is invoked with the function body as its first argument. The second argument specifies that the method should not push another scope onto the stack since this is already handled by the current method. The return value of this method call is assigned to a variable called 'block'. It represents the completely analyzed and annotated function body. Lastly, in lines 535–541, an 'AnalyzedFunctionDefinition' is returned. Here, the function's attributes, like its analyzed parameters, return type, name, and body are specified.

During the traversal of the 'main' function's body, the analyzer encounters two let-statements in lines 2–3. For analyzing this type of statement, the 'let\_stmt' method, which is shown in Listing 2.13, is invoked.

```
_ crates/rush-analyzer/src/analyzer.rs -
     fn let_stmt(&mut self, node: LetStmt<'src>) -> AnalyzedStatement<'src> {
612
         let expr = self.expression(node.expr);
617
         if let Some(shadowed) = self.scope mut().insert(
644
             node.name.inner,
645
             Variable {/* ... */},
646
         ) {/* ... */}
657
         AnalyzedStatement::Let(AnalyzedLetStmt {
682
             name: node.name.inner,
683
684
             expr,
685
             mutable: node.mutable,
686
             used: true,
687
         })
     }
688
```

**Listing 2.13** – Analyzer: The 'let\_stmt' method.

In line 617, the initializing expression of the let-statement is analyzed first in order to obtain information about its result data type. The analyzer now inserts a new entry for the variable's name (e.g., 'two') into its current scope in line 644. Even though the contents of the pushed variable are hidden, among other information, it includes the variable's type and span. Since the span includes the location of where the variable was defined, it can later be used in error messages, like the one previously displayed in Listing 2.11. Furthermore, the inserted information includes whether the variable was declared as mutable. If it is, reassignments are allowed, meaning that it can be updated to hold another value at a later point. In case a variable was not declared as mutable and reassigning to it was attempted, output comparable to the one in Listing 2.11 would be generated as a result. However, it seems very unusual that the insertion happens in the condition of an if-expression. If the insertion returns 'true', the variable's name was already present in the current scope, and its previously associated data has now been overwritten. The process of overwriting variables by redefining them is sometimes called variable shadowing [KN19, p. 34]. Here, the analyzer should display some additional hints or warnings, depending on whether the shadowed variable has been referenced before it was shadowed.

In order to understand how type determination and annotation work, the traversal of expressions should be considered. The code in Listing 2.14 is part of the method responsible for analyzing expressions.

```
crates/rush-analyzer/src/analyzer.rs
     fn expression(&mut self, node: Expression<'src>) -> AnalyzedExpression<'src> {
1007
1008
          let res = match node {
              Expression::Int(node) => AnalyzedExpression::Int(node.inner),
1009
1010
              Expression::Float(node) => AnalyzedExpression::Float(node.inner),
              Expression::Bool(node) => AnalyzedExpression::Bool(node.inner),
1011
              Expression::If(node) => self.if_expr(*node),
1032
              Expression::Block(node) => self.block_expr(*node),
1033
              Expression::Grouped(node) => {
1034
                  let expr = self.expression(*node.inner);
1035
1040
1046
1047
1048
          res
```

**Listing 2.14** – Analysis of expressions during semantic analysis.

This method consumes a non-analyzed expression and transforms it into an analyzed one. For simple types of expressions, like integer, float, or boolean literals, further analysis is omitted, and an 'AnalyzedExpression' can be constructed directly. For more complex types of expressions, like if-expressions, this method calls appropriate methods responsible for analyzing specific types of expressions.

In this function, the recursive tree traversal algorithm used in the analyzer is clearly visible. For instance, if the current expression is a grouped expression, like '(1 + 2)', the code in lines 1034–1040 is called. In line 1035, the 'expression' method calls itself recursively using the inner expression of the grouped expression as the call argument. Most of the other tree traversing methods implement a similar recursive behavior.

```
crates/rush-analyzer/src/ast.rs
    pub fn result_type(&self) -> Type {
130
         match self {
131
             Self::Int(_) => Type::Int(0),
132
133
             Self::Float(_) => Type::Float(0),
             Self::Bool(_) => Type::Bool(0),
134
             Self::If(expr) => expr.result_type,
142
             Self::Block(expr) => expr.result_type,
143
             Self::Grouped(expr) => expr.result_type(),
144
         }
145
    }
146
```

**Listing 2.15** – Obtaining the type of expressions.

Listing 2.15 shows how the type of any analyzed expression can be obtained. For constant expressions, like 'Int(\_)', the determination of its type is straight-forward, as seen in line 132. Here, the 'result\_type' method returns 'Type::Int(0)'. In this implementation, the 'Type' enumeration saves a count which specifies the amount of pointer indirection. For instance, the rush type '\*\*int' is represented as 'Type::Int(2)' because there are two levels of pointer indirection. However, if the method is called on an integer literal, the resulting level of pointer indirection is zero. For more complex constructs, like if-expressions, the corresponding analyzed AST node saves its result type on its own. In this case, the type can then be accessed as seen in lines 142–143. For instance, during analysis of block expressions, the responsible function checks whether the block contains a trailing expression, and if it does, the result type of the block is identical to the one of its trailing expression. The last

}

1049

case can be seen in line 144. Here, the result type of a grouped expression is obtained by calling the 'result\_type' method recursively. By using this method, the analyzer is able to determine type information about each node of the tree, assuming that it has been analyzed previously.

In the case of a semantically malformed program, the analyzer must continue tree traversal. Otherwise, only one error could be reported since every traversing method could potentially return an error which would terminate the tree traversal. To mitigate this issue, the 'Unknown' type was implemented. For instance, the rush expression 'm + 42' would cause the 'm' variable to have the 'Unknown' type if it was undefined. If the analyzer encounters a type conflict where one of the conflicting types is unknown, it does not report another error since the unknown type has to originate from a previous error. Therefore, errors do not cascade, meaning that an undeclared variable will not cause another type error.

In Listing 2.9 on page 14, below the let-statements in the source program, the 'exit' function is called. Here, the analyzer uses the 'call\_expr' method in order to analyze the validity of this function call. For this, the analyzer iterates over the provided arguments, validating several constraints during each iteration. Among others, these include that each argument matches its corresponding parameter.

In this example, the argument expression 'two + three' is traversed during this analysis. Since the identifiers on the left- and right-hand side have been declared by the two let-statements previously, obtaining their data types merely involves a lookup of the identifier names inside the current scope. If an unknown variable was provided, the lookup would yield no value, thus causing an error message to be generated.

Because the two variables should be added, the 'infix\_expr' method is called. It is responsible for analyzing any kind of infix-expression by validating several constraints. For instance, the operands must both be of the same type. In this example, both operands of the addition are integers. Therefore, the analyzer accepts this infix-expression and is now aware that it yields another integer. After the infix-expression's result type has been determined, it is saved in its own 'result\_type' struct field. Infix-expressions are another example for tree nodes that save their result type as a struct field on their own. Now that the analysis of the argument expression has completed, its compatibility with the declared parameter must be validated.

```
crates/rush-analyzer/src/analyzer.rs
     fn arg(/* ... */) -> AnalyzedExpression<'src> {
1807
          let arg_span = arg.span();
1814
          let arg = self.expression(arg);
1815
1816
1817
          match (arg.result_type(), param_type) {
1818
              (Type::Unknown, _) | (_, Type::Unknown) => {}
1819
              (Type::Never, _) => {
                  self.warn_unreachable(call_span, arg_span, true);
1820
                  *result_type = Type::Never;
1821
1822
              (arg_type, param_type) if arg_type != param_type => self.error(/* ... */),
1823
              => {}
1829
1830
1831
1832
          arg
     }
1833
```

**Listing 2.16** – Validation of argument type compatibility in the analyzer.

Listing 2.16 shows the 'arg' function, which is responsible for validating that a function call argument is compatible with the declared parameter. This code will produce an er-

ror message if the type of the call argument deviates from the expected one. In order to validate the compatibility between the provided argument and the declared parameter, the method differentiates between several possible scenarios. In line 1818, the method detects the scenario in which the type of either the argument or the parameter is 'Unknown'. Here, the method should ignore this argument without producing another error.

The next match-arm in line 1819 presents the scenario in which the provided argument has the 'Never' type. In that case, the analyzer should only add a warning that the call-expression is unreachable. Furthermore, the result type of the entire call-expression is updated to reflect the never type. Line 1823 handles the scenario in which the type of the argument differs from the expected type of the parameter. In that case, the method will generate an error describing the situation. Again, the concrete error message is omitted for better readability.

In the case of the example program, the analyzer did not generate any error messages since the code presents both a syntactically and semantically valid rush program. Therefore, the analyzer accepts it and returns its syntax-tree with type annotations.

#### **Early Optimizations**

Another task of the analyzer can be to perform early optimizations. In compiler design, most of the optimizations are often performed with the target machine in mind. Therefore, the effects of these target-machine dependent optimizations can excel the ones caused by earlier optimizations. However, it is still sensible to perform trivial optimizations, such as *constant folding* and *loop conversion* inside the analyzer. For instance, the rush expression '2 + 3' evaluates to '5' during compile time instead of run time. This evaluation of expressions during compile time is referred to as constant folding. It is typically used in order to avoid the emission of otherwise redundant arithmetic instructions. As a result of this, the compiled program will run slightly faster since less computation is being performed when the program is executed [Wir05, p. 54].

**Listing 2.17** – Method for determining whether an expression is constant.

```
3 while true {
4    a += 1
5 }
```

**Listing 2.18** – Redundant 'while' loop inside a rush program.

In order to make such optimizations possible, each expression node in the analyzed AST has a method named 'constant'. It is shown in Listing 2.17 and is responsible for determining whether an expression is constant. It returns 'true' if the expression is an integer, float, boolean, or character literal. Other types of expressions, such as call-expressions, cannot be constant since such a function call may cause side effects which cannot be determined during compile time. This method is vital for constant

folding since both the left- and right-hand side of infix-expressions need to be constant in order to allow compile-time evaluation, and the analyzer has to check whether they are.

Among other optimizations implemented in the analyzer, loop transformation can also have a positive effect on the program's performance during runtime. Listing 2.18 displays part of a rush program which uses a while-loop even though an unconditional loop would suffice. A 'loop' implementation is more efficient since the condition check before each

iteration is omitted. However, this is only the case because the condition of the while-loop is a constant 'true'. If the analyzer detects such a scenario during analysis of a while-loop, the output node will be converted into an unconditional loop. Detection of this scenario is implemented in line 855 of Listing 2.19. In line 853, the match-arm is called if the analyzed loop will never iterate; it deletes the entire loop from the AST. This optimization improves runtime efficiency by a small amount since the code performing the very first condition check will not be compiled. Furthermore, the resulting output code will also be slightly smaller in size since the entire loop compilation can be omitted. The last match-arm in line 860 represents the unoptimized fallback case in which the loop is returned without modification.

```
crates/rush-analyzer/src/analyzer.rs
     fn while_stmt(/* ... */) -> Option<AnalyzedStatement<'src>> {
771
         match (never_loops, condition_is_const_true) {
851
             // if the condition is always `false`, return nothing
852
             (true, _) => None,
853
             // if the condition is always `true`, return an `AnalyzedLoopStmt
854
             (false, true) => Some(AnalyzedStatement::Loop(AnalyzedLoopStmt {
855
856
857
                  never_terminates,
858
859
             // otherwise, return an `AnaluzedWhileStmt`
             (false, false) => Some(AnalyzedStatement::While(AnalyzedWhileStmt {
860
                  cond,
861
                  block,
862
                 never_terminates,
863
             })),
864
865
    }
866
```

**Listing 2.19** – Loop optimization in the analyzer.

Implementing such trivial optimizations can significantly contribute to a more efficient output program without relying on the nuances of a target architecture. However, compiler writers often implement significantly more of those early optimizations than the aforementioned examples. In order to understand how early optimizations may impact the resulting AST, the two trees in Figure 2.4 should be considered.



Figure 2.4 – How semantic analysis affects the abstract syntax tree.

Both trees in Figure 2.4 represent the rush expression '1 + 42 - n'. The left tree has been generated by the parser, closely representing the structure of the code. On the right, the same tree after it has been transformed by the analyzer is shown. Most notably, the sub-expression '1 + 42' has been evaluated to '43' during constant folding. Furthermore, all nodes of the right tree now contain type annotations, *int* in this case.

# 3. Interpreting the Program

After the syntactical and semantic analysis, there are now multiple different ways to continue. The first one shown here is an *interpreter*, which, contrary to a compiler, executes the analyzed program directly. That means, no output to be run by another process is produced and written to disk. Instead, everything is kept in memory and immediately evaluated by the interpreter [Mak09, Chapter 1]. Therefore, compared to a compiler, the *interpreter* not only has to be installed on the developer's machine, but also on the target machine. This chapter presents two fundamentally different ways of implementing an interpreted programming language.

# 3.1. Tree-Walking Interpreters

A tree-walking interpreter is probably the simplest form of programming language implementation, since it accepts the AST as its input directly, which is why it is the first one explained here. No further intermediate steps after the analysis are required. Just like the parser and analyzer, it walks (traverses) the tree, hence the name, and therefore requires one method per node type.

As seen in Listing 3.1, the struct definition is also rather small. It stores a list of 'scopes', that being maps of variable names to their value at runtime, starting with the global scope at index '0' and ending with the current scope at any point in time. Why the 'Value' is wrapped inside an 'Rc<RefCell<\_>>' is explained later in Section 3.1.3. Additionally, a map of function names to their tree node is saved. The 'Rc<\_>' here is only necessary to conform to Rust's borrowing rules without unnecessary cloning.

```
crates/rush-interpreter-tree/src/interpreter.rs -
   type ExprResult = Result<Value, InterruptKind>;
   type StmtResult = Result<(), InterruptKind>;
9
   type Scope<'src> = HashMap<&'src str, Rc<RefCell<Value>>>;
10
11
   #[derive(Debug, Default)]
12
   pub struct Interpreter<'src> {
13
        scopes: Vec<Scope<'src>>,
14
        functions: HashMap<&'src str, Rc<AnalyzedFunctionDefinition<'src>>>,
15
16
```

**Listing 3.1** – Tree-walking interpreter: Type definitions.

In addition to the struct itself, result types for expressions and statements are defined. Statements either return '()' on success or an 'InterruptKind' on failure. Expressions use the same error type, but return a 'Value', holding the evaluated result, on success. The definitions for both 'Value' and 'InterruptKind' are visible in Listing 3.2. It is clearly visible that the interpreter simply makes use of Rust's types and does not need to implement hidden details, like comparisons between floats, manually.

The enumeration 'InterruptKind' describes the different ways the interpreter can be interrupted. The first three variants, 'return', 'break', and 'continue', are only partial interrupts, i.e., they do not necessarily stop execution of the whole program. For example, when the interpreter encounters a 'break' statement, it creates an 'InterruptKind::Break' and tracks

back until it reached the innermost loop, where the interrupt is then caught and execution is continued after the loop. The second to last variant is for runtime errors, produced by events like division by zero, and the last one, 'InterruptKind::Exit(i64)', is constructed by rush's built-in 'exit' function and terminates the program. They cause the interpreter to backtrack all the way to the AST's root, and with that, cancel execution, as this is the desired behavior for errors and 'exit' calls.

The way 'Value' is implemented also makes it very easy to support dynamic typing<sup>1</sup>, since it can be a value of any type. However, the rush interpreter can make use of the analyzer's guarantees of result types in order to omit certain checks internally, which could not be done with dynamic typing.

#### 3.1.1. Implementation

Listing 3.3 displays the 'run' method of the interpreter that serves as its entry point. Since the order of functions in rush is irrelevant and a function can be called before its definition, the interpreter must first populate its 'functions' map before any function code is executed. Afterward, the global variables are assigned their initial values. Only then, the code inside the 'main' function is called via the 'call\_func' method. After it returns, the entire interpreter either returns a runtime error, an explicit exit code, or the default exit code '0' indicating success.

The 'call\_func' method, visible in Listing 3.4, first checks whether a built-in function was called. As the only built-in function in rush is 'exit', a simple equality check is sufficient. In case the called function is the 'exit' function, the first call argument is as-

```
crates/rush-interpreter-tree/src/value.rs
pub enum Value {
                                                  6
    Int(i64),
                                                  7
    Float(f64),
                                                  8
    Char(u8),
                                                  9
    Bool(bool),
                                                  10
    Unit,
                                                  11
    Ptr(Rc<RefCell<Value>>),
                                                  12
                                                  13
pub enum InterruptKind {
                                                  22
    Return(Value),
                                                  23
    Break,
                                                  24
    Continue,
                                                  25
    Error(interpreter::Error),
                                                  26
    Exit(i64),
                                                  27
                                                  28
```

Listing 3.2 - Tree-Walking Interpreter: 'Value' and 'InterruptKind' Definitions

serted to be an integer via the 'unwrap\_int' method on 'Value', and an exit interrupt kind containing that integer is returned immediately. Otherwise, the previously saved tree node of the function to be executed is retrieved from the 'functions' map. A new variable scope for the function body is then initialized and filled with the passed arguments. Afterward, the function's block is evaluated by calling the 'visit\_block' method in a scoped environment. If the block returns a partial interrupt, it is turned into the appropriate return value by the 'into\_value' method call. For 'continue' and 'break', this is '()' and for 'InterruptKind::Return(Value)', it is the wrapped value. The other two fatal interrupt kinds are simply passed along when encountered.

## 3.1.2. How the Interpreter Executes a Program

To provide a basic overview of a program's execution without too many implementation details, the evaluation of the rush program displayed in Listing 3.5 is now explained.

Firstly, the defined 'main' function is called by the interpreter's entry point via 'call\_func'. In there, 'visit\_block' is called, using the 'main' function's block as the argument. The

<sup>&</sup>lt;sup>1</sup>In contrast to static typing, with dynamic typing the types of variables and results of expressions are only known at runtime rather than during compile time.

```
crates/rush-interpreter-tree/src/interpreter.rs
    pub fn run(mut self, tree: AnalyzedProgram<'src>) -> Result<i64, Error> {
23
         for func in tree.functions.into_iter().filter(|f| f.used) {
24
25
             self.functions.insert(func.name, func.into());
26
27
         let mut global_scope = HashMap::new();
28
         for global in tree.globals.iter().filter(|g| g.used) {
29
             global_scope.insert(/* ... */);
30
40
         self.scopes.push(global_scope);
41
         match self.call_func("main", vec![]) {
56
             Err(InterruptKind::Error(msq)) => Err(msq),
57
             Err(InterruptKind::Exit(code)) => Ok(code),
58
             Ok(\underline{\ }) \mid Err(\underline{\ }) \Rightarrow Ok(\underline{\ }),
59
         }
60
    }
61
```

**Listing 3.3** – Tree-Walking Interpreter: Beginning of Execution

```
crates/rush-interpreter-tree/src/interpreter.rs
    fn call_func(&mut self, func_name: &'src str, mut args: Vec<Value>) -> ExprResult {
81
        if func_name == "exit" {
82
            return Err(InterruptKind::Exit(args.swap_remove(0).unwrap_int()));
83
84
85
        let func = Rc::clone(&self.functions[func_name]);
86
87
        let mut scope = HashMap::new();
        for (param, arg) in func.params.iter().zip(args) {
89
            scope.insert(param.name, arg.wrapped());
90
91
92
        self.scoped(scope, |self_| match self_.visit_block(&func.block, false) {
93
            Ok(val) => Ok(val),
95
            Err(interrupt) => Ok(interrupt.into_value()?),
        })
96
97
   }
```

**Listing 3.4** – Tree-Walking Interpreter: Calling of Functions

'visit\_block' method then iterates over the statements contained in the block, and evaluates each one in order using the 'visit\_statement' method. In this case, that is only the call to 'exit'. Since calls in rush are considered expressions, and expressions can also be used wherever statements are allowed by appending a ';', the 'visit\_statement' method only forwards to 'visit\_expression', which itself forwards to 'visit\_expression' again. Here, that involves another call-expression, this time of the 'plus\_two' function. It again evaluates its arguments, that being the access of the global variable 'global' here, and then runs the function's block. The call stack at the point of reaching the 'return' statement is displayed in Figure 3.1. Now, the expression 'num + 2' is evaluated by first evaluating both sides of the '+' operator on their own, and then adding the results together.

Following that, an 'InterruptKind::Return(\_)' is constructed, holding the computed sum. By making use of Rust's '?' operator<sup>2</sup>, the interrupt causes early returns in all function calls up to the bottom 'call\_func'. Thus, the statement 'global += 4;' is never reached. All

<sup>&</sup>lt;sup>2</sup>Can be used after expressions returning a 'Result<\_, \_>' to return early in case they are the 'Err(\_)' variant.

```
let mut global = 40;
1
2
   fn main() {
3
4
        exit(plus_two(global));
5
6
   fn plus_two(num: int) -> int {
7
8
        return num + 2;
9
        global += 4;
   }
10
```

**Listing 3.5** – Example rush program.

other functions up to the top 'call\_func' then also return, as they now resolved to a value. After all this, the 'exit' function call now has a value for its argument and can be performed. However, as described earlier, 'exit' is built-in and does not call 'visit\_block', but instead constructs an 'InterruptKind::Exit(\_)' with the result value, so '42' in this case.

#### 3.1.3. Supporting Pointers

Adding support for pointers in a tree-walking interpreter is actually not as straight forward as it is for the other backends, which all have a manually managed memory layout. It also depends a lot on the implementation language. In languages with a garbage collector, for example Go or Java, the pointer functionality of that language can just be reused, and the cleanup is already managed. However, unlike many modern languages, Rust does not have a garbage collector. Instead, it makes use of the so-called borrow checker that validates all references at compile time. Using default Rust references for pointers while conforming to Rust's borrowing rules turns out to be quite complicated though, but it is not required to use them. Rust provides an additional way of having pointers to values besides their reference system. The 'Rc<\_>' type, short for reference counter, stores its contained value on the heap and provides shared access to it, without any particular owner. The contained value is freed as soon as no more references to it exist [McN21,

```
run(/* ... */)
call_func("main", vec![])
visit_block(/* ... */)
visit_statement(/* ... */)
visit_expression(/* ... */)
visit_call_expr(/* ... */)
visit_expression(/* ... */)
visit_expression(/* ... */)
visit_call_expr(/* ... */)
visit_call_expr(/* ... */)
visit_statement(/* ... */)
visit_statement(/* ... */)
```

Figure 3.1 - Call stack at the point of processing the 'return' statement.

pp. 131–133]. This makes it ideal for implementing pointers in a tree-walking interpreter.

However, as mentioned, a reference counter only provides **shared** access to the contained value. In Rust that implies not being able to mutate the value, as that would again break the borrowing rules. In order to still support mutable variables while having pointers to them, one can make use of another type the Rust standard library provides. A 'RefCell<\_>' implements so-called *interior mutability*<sup>3</sup> by enforcing Rust's borrowing rules at runtime. By wrapping values inside both a reference counter and a 'RefCell', it is possible to support mutable variables and pointers [McN21, pp. 131–133].

<sup>&</sup>lt;sup>3</sup>Types in Rust that have interior mutability allow mutation through shared references

# 3.2. Using a Virtual Machine

Just like a tree-walking interpreter, a virtual machine presents a way of implementing an interpreter for a programming language. However, the way a virtual machine operates differs fundamentally from a tree-walking interpreter. In order to compare the two, we have implemented a virtual machine backend for rush.

#### 3.2.1. Defining a Virtual Machine

Often, one might encounter the term  $virtual\ machine\ (VM)$  when talking about emulating an existing type of computer using a software system. This emulation typically includes additional devices like the computer's display or its disk, whereas in this context, the term only describes a software entity which emulates how a processor interprets instructions. Since the VM is unable to traverse the AST, it depends on a compiler generating its input instructions.

Because a physical processor and a virtual machine share some fundamental traits, the architecture of a virtual machine is often closely resembling the von Neumann architecture. This architecture was first introduced by John Neumann in the year 1945. Von Neumann originally presented a design which allows implementing a computer using relatively few components. Following the von Neumann architecture, a processor would usually contain components like an arithmetic logic unit (ALU), a control unit, multiple registers, memory, and basic IO [Led20, p. 172]. The ALU's task is to perform logical and mathematical operations. Which ones is being told by the control unit within the fetch-decode-execute cycle. This cycle is a simplification of the steps a processor performs in order to execute instructions. The list below explains its individual steps [Led20, pp. 208–209].

- **Fetch**: The processor's control unit loads the next instruction from the adequate memory location. The instruction is then placed into the processor's internal instruction register where it is available for further analysis.
- **Decode**: The processor's control unit examines the fetched instruction in order to determine whether additional steps must be taken before or after instruction execution. Such steps may involve accessing additional registers or memory locations.
- Execute: The control unit dispatches the instruction to a specialized component of the processor. The target component is often dependent on the type of instruction since each processor component is designed with one specific type in mind. For instance, the control unit may invoke the ALU to execute a mathematical instruction.

A computer's processor performs this fetch-decode-execute cycle repeatedly from the moment it powers on to the point in time when it shuts down again. For relatively simple processors, each cycle is executed in isolation because instructions are executed in a sequential order. This means that the execution of the instruction i is delayed until execution of i-1 has completed [Led20, pp. 208–209]. More sophisticated processors often use many performance-enhancing techniques like SIMD processing or simultanious multithreading [Led20, pp. 217f].

In most cases, a virtual machine executes instructions similarly to the fetch-decode-execute cycle. Although the von Neumann architecture is relatively simple, one does not always have to adopt it when implementing a virtual machine. Since virtual machines are purely abstract constructs without physical limitations, design constraints are usually kept to a minimum. Therefore, a virtual machine can also be implemented with the high-level abstractions of its input language in mind. For instance, the VM might feature specialized 'break', 'continue', or 'loop' instructions which are not present in modern-day CPUs. Designing the architecture of a virtual machine can sometimes be a challenging task since choosing an adequate set of

features may involve a lot of testing iterations. Because neither the compiler nor the VM exist in the beginning, one should carefully plan the implementation of their VM's architecture.

#### 3.2.2. Register-Based and Stack-Based Machines

One of the main decisions to be made during the design of the VM is how it implements temporary storage. Physical processors often use registers in order to make larger computations feasible. Registers are a limited set of very fast, low capacity storage units. On most relevant architectures today, like x86 64, each general-purpose register is able to hold as much as 64 bits of information. However, there is always only a limited amount of registers available since they are physical components of the CPU. Therefore, programs typically only utilize registers for storing temporary values, such as intermediate results of a large computation. The main alternative to registers is a stack-based design. A popular example for a stack-based virtual machine is WebAssembly [Sen22, p. 44]. For more information on WebAssembly, Section 4.3 presents a compiler targeting it. For compiler writers, managing registers is often a demanding task. This problem is described thoroughly in Chapter 5. On the other hand, targeting stack-based machines is usually significantly easier compared to register-based ones. Therefore, one might choose to implement a stack-based virtual machine in order to minimize the complexity of both the compiler and the interpreter. However, a stack-based design also introduces several issues on its own. For example, register-based machines might regularly outperform stack-based machines. A possible reason for this is that utilizing the stack typically requires a lot of seemingly redundant push and pop operations.

#### 3.2.3. The rush Virtual Machine

The rush virtual machine is a stack-based machine implemented using the Rust programming language. The machine's architecture was solely developed for this project and includes a stack for storing short-term data,  $linear\ memory$  for storing variables, and another separate  $call\ stack$  for managing function calls. Like most virtual machines, the rush VM uses a fetch-decode-execute cycle in order to interpret programs.

```
crates/rush-interpreter-vm/src/vm.rs _
   pub struct Vm<const MEM SIZE: usize> {
16
        /// Working memory for temporary values
17
        stack: Vec<Value>,
18
        /// Linear memory for variables.
19
        mem: [Option<Value>; MEM_SIZE],
20
        /// The memory pointer points to the last free location in memory.
21
22
        /// The value is always positive, but using `isize` does not require casts.
23
        /// Holds information about the current position (like ip / fp).
24
        call_stack: Vec<CallFrame>,
25
    }
26
```

**Listing 3.6** – Struct definition of the VM.

Listing 3.6 displays the struct definition of the rush VM. The 'stack' field in line 18 saves the main stack for temporary values. In line 20, the 'mem' field is declared. This field represents the linear memory used for storing variables. Since the 'Value' is wrapped in an 'Option', each cell may also hold a 'None' value representing uninitialized memory. In line 23, the 'mem\_ptr' field is declared. It holds the memory pointer, which saves the index of the last free cell in 'mem'. Lastly, a field named 'call\_stack' is declared. This field is responsible for managing function calls and returns.

The instructions in Listing 3.11 on page 30 can be interpreted by the rush VM. Programs for the rush VM are always partitioned into functions which each represents a sequence of instructions. Since function and variable names are replaced by indices, human-readable names causing inefficiencies can be omitted entirely. For better understanding, the individual functions have been manually annotated with their human-readable names.



Figure 3.2 – Linear memory of the rush VM.

The first block of instructions is called prelude. Its only task is to declare global variables and call the 'main' function. Global variables need to be initialized at the beginning of a program so that they can be accessed later. If the prelude was omitted, the 'main' function would instead have to contain these instructions since it is executed at program start. However, recursion of the 'main' function is allowed in rush. Therefore, each time the 'main' function recurses, all global variables would be restored to their ini-

tial values. In order to prevent this bug, the rush VM uses a prelude function which is guaranteed to only run once.

Linear memory in the VM is represented as an array of runtime values of variables. Each storage cell can be accessed through two different addressing modes. When using the absolute addressing mode, the exact index of the memory cell is specified. For instance, if the value of the variable 'num' in Figure 3.2 was to be retrieved, the VM would need to access the storage cell with the absolute index '4'. However, the absolute position of a variable in memory can only be determined at runtime. For example, in a recursive function, each recursion adds another scope containing variables, thus allocating more memory. However, the rush VM also implements a relative addressing mode which can be used without knowledge about the absolute position of the memory cell. For instance, the same variable can also be addressed through index '1' using the relative addressing mode. The mode is called *relative* because all addresses are specified relative to the memory pointer. In Figure 3.2, the memory pointer (shortened to mp) is set to '3'. Considering the value of the memory pointer, the absolute address of any relative address can be calculated at runtime. Here, the absolute address is the sum of the relative index rel and the runtime memory pointer mp. By also implementing this relative addressing mode, compilers targeting the rush VM can generate code without knowing the runtime behavior of a program. For better understanding, Figure 3.2 can be considered more closely. Here, the first column contains a character to identify each memory cell. The second column specifies the relative address of the cell, assuming that mp is set to '3'. Lastly, the third column contains the absolute address of the cell, calculated through mp and the relative address.

```
1 fn main() {
2    let mut num = 42;
3    let to_tum = #
4 }
```

**Listing 3.7** – Minimal pointer example in rush.

In order to get a deeper understanding of the addressing modes, a practical example can be considered. The code in Listing 3.7 displays a rush program in which a pointer to a variable is created. First, in line 2, the integer variable 'num' is created. In line 3, a pointer variable called 'to\_num' is created by referencing the 'num' variable.

In the rush VM, absolute addressing is only used for global variables and pointers. Since a pointer specifies the address of another variable, its runtime value will be the absolute

address of its target variable. In the VM, the absolute address of a variable is calculated as soon as it is referenced using the '&' operator. For this purpose, the 'reltoaddr' (relative to

address) instruction exists. This instruction calculates the absolute address of its operand and pushes the result onto the stack. Here, the operand is the relative address of the variable to be referenced. Listing 3.8 shows the VM instructions generated from the rush program in Listing 3.7.

The first instruction 'setmp' (set memory pointer) increases the memory pointer by two since the operand is a positive number. This is because the 'main' function contains two local variables whose space is to be allocated at the start of the function. Next, the 'push' instruction pushes the value '42' onto the stack. In line 4, the 'svari' (set variable immediate) instruction pops the top-most value from the stack, here '42', and assigns it to the specified relative address. Now, the variable 'num', with an initial value of '42', has been created. Next, the 'to\_num' variable is created by referencing the 'num' variable. In line 5, the 'reltoaddr' instruction is used to calculate the absolute memory address of the 'num' variable. The calculated

Listing 3.8 – VM Instructions for the minimal Pointer Example

absolute address is then pushed onto the stack where it can be accessed by the following instruction. Here, the relative address '0' is used since the 'svari' instruction has previously saved 'num' at this location. In line 6, another 'svari' instruction is used to save the value of the 'to\_num' variable, that being the absolute address of 'num' which is now on top of the stack, at the relative address -1. This is because the compiler targeting the VM assigns variables to higher relative addresses first. The compiler then progresses into lower relative memory as more variables of the function are initialized.

#### 3.2.4. How the Virtual Machine Executes a rush Program

By considering the previous pointer example, one now has a rough idea of how the VM executes a program. In order to get a better understanding, the execution of the program in Listing 3.9 will now be explained. For this, the instructions in Listing 3.11 on page 30 should be considered again. The first instruction of the prelude function is 'setmp'. This instruction adjusts the memory pointer by the amount specified in the instruction's operand. In this case however, the memory pointer remains unmodified since the operand of the instruction is '0'. Next, the 'call 1' instruction calls the 'main' function. In order to understand how function calls work in this VM, the call stack of the rush VM should be considered. Before the call-

**Listing 3.9** – A recursive rush program.

instruction, the caller pushes any arguments onto the stack so that they can be used as parameters by the callee. Figure 3.3 displays the state of the VM's call stack after the 'call 1' instruction has been executed. During execution of a call-instruction, the VM pushes a new stack frame onto its call stack. Listing 3.10 shows how the 'CallFrame' struct is implemented.

```
crates/rush-interpreter-vm/src/vm.rs

struct CallFrame {
    /// Specifies the instruction pointer relative to the function
    ip: usize,
    /// Specifies the function pointer
    fp: usize,
}
```

**Listing 3.10** – Struct definition of a 'CallFrame'.

|                                                                                            | , |
|--------------------------------------------------------------------------------------------|---|
| $     \begin{array}{l}       \text{main} \\       fp = 1 \\       ip = 0     \end{array} $ |   |

Figure 3.3 – Example call stack of the rush VM.

In this implementation, each call frame holds two important pieces of information. In line 31 of Listing 3.10, the 'ip' field is declared. It specifies the *instruction pointer*, which saves the index of the current instruction. This index is relative for each function, meaning that the instruction pointer '0' can refer to multiple instructions, each in another function. Since the 'call' instruction was interpreted previously, the

instruction pointer of the new call frame is set to '0' as execution should continue at the first instruction of the called function. The second important field, 'fp', is declared in line 33. It specifies the *function pointer*, which saves the index of the current function. Therefore, the combination of the instruction and function pointer specifies the instruction to be executed. After the function call, 'fp' is set to '1' since 'ip' should now refer to the instructions inside the 'main' function.

Function calls are managed in a stack in order to allow early returns from functions. If the VM encounters a 'ret' (return) instruction, it should leave the current function immediately. However, it should also know where to resume its fetch-decode-execute cycle. For this, the VM simply pops the top element off its call-stack, meaning that the call frame of the current function is removed. Now, the top element on the stack contains the call-frame of the caller function. In this call frame, 'ip' still points to the 'call' instruction which was responsible for calling the function. Since 'ip' is incremented automatically after most instructions, the VM resumes instruction execution at the first instruction after the 'call' instruction. Therefore, the call frame of the caller function stays unaffected as long as the called function is executed.

```
0: (prelude)
 1
2
              setmp 0
              call 1
 4
    1: (main)
 5
              setmp 0
              push 1000
 6
7
              call 2
              exit
8
    2: (rec)
9
10
              setmp 1
              svari *rel[0]
11
              push *rel[0]
12
13
              gvar
14
              push 0
15
              jmpfalse 9
16
              push 0
17
              jmp 14
18
              push *rel[0]
19
20
              gvar
21
              push 1
22
              sub
              call 2
23
              setmp -1
24
25
              ret
```

Listing 3.11 – VM instructions matching the AST in 3.4.

Now that the 'call' instruction has been interpreted, the VM begins executing the first instruction of the 'main' function. Since the 'main' function only calls the 'rec' function with the argument '1000', there are no new concepts to consider in this function. When the VM encounters the 'call' instruction in line 7, execution continues with the instructions of the 'rec' function. At the beginning of the 'rec' function, the memory pointer is incremented by '1'. This might seem erroneous since the 'rec' function contains no visible variable definitions in its body. However, this behavior is correct since function parameters are treated similarly to variables by the rush VM. Since the function takes one parameter, the memory pointer is incremented by one cell. Next, the 'svari' instruction saves the value of the parameter which was previously pushed onto the stack at the relative address '0'. In line 12, the relative address of the memory cell containing the value of the parameter is pushed onto the stack. At this point, the top element on the stack contains an address value referring to the target of the 'gvar' (get variable) instruction. It is then popped by the 'gvar' instruction in line 13. Therefore, the instruction first pops the top element from the stack in order to retrieve the value of the variable at the specified location, and pushes the retrieved value onto the stack. In this case, the value of the popped element is the relative address '0', meaning that the instruction retrieves the value of first parameter.

In line 14, the constant value '0' is pushed onto the stack.

Next, the 'eq' (is equal) instruction pops two elements off the stack in order to test them for equality. Then, the result of the equality test is pushed onto the stack as a boolean value. In other words, here, the instruction compares whether the current value of 'n' is equal to '0' in order to produce a boolean result. The 'jmpfalse' (jump if false) instruction in line 16 jumps to the specified instruction index if the boolean value on top of the stack is 'false'. In this example, if the value on the stack is 'false', the parameter 'n' was not equal to '0'. If this was the case, the VM would jump to the instruction in line 19 as it represents index '9' of the 'rec' function. Here, the value of the parameter 'n' is pushed onto the stack using the previously explained 'push' and 'gvar' instructions. Now, the top item on the stack is the value of the parameter 'n'. In line 21, the 'push' instruction pushes a constant '1' onto the stack. Next, the 'sub' (subtract) instruction pops the first two elements off the stack in order to subtract their values from each other. In this case, the instruction subtracts '1' from the value of 'n', pushing the result on the stack at the end. Next, the 'rec' function calls itself recursively using the aforementioned 'call' instruction. Since the call argument is the top element on the stack, the result of the subtraction is used as the argument of the recursive call. In line 24, the 'setmp' instruction decrements the memory pointer in order to deallocate used memory. At the end of a function, the memory pointer is always decremented by the amount it was incremented by at the beginning of the function. By deallocating the now unused memory, the compiler prevents leaking the memory at runtime. Lastly, the 'ret' instruction is used to return from the 'rec' function. Now, the case in which the value of 'n' is not equal to '0' was considered.

On the opposite, if the result of the comparison in line 15 was true, meaning that 'n' was equal to '0', the 'jmpfalse' instruction in line 16 would perform no operation. In this case, the VM continues execution at the 'push' instruction in line 17. Here, the constant value '0' is pushed onto the stack. Since functions also return values by placing them on top of the stack, the return value would be '0' in this case. Next, the VM interprets the 'jmp' (jump) instruction in line 18. Unlike 'jmpfalse', this instruction performs its jump without any condition. Here, the instruction jumps to the instruction at index '14' of the current function, meaning 'setmp' in line 24. The instructions in the lines 24 and 25 return '0' in case 'n' was equal to '0'.

# 3.2.5. Fetch-Decode-Execute Cycle of the VM

Now that the semantic meaning of the instructions in Listing 3.11 has been explained, the fetch-decode-execute cycle of the VM will be explained. Listing 3.12 displays the 'run' method of the rush VM.

```
crates/rush-interpreter-vm/src/vm.rs
     pub fn run(&mut self, program: Program) -> Result<i64> {
174
         while self.call_frame().ip < program.0[self.call_frame().fp].len() {</pre>
175
             let instruction = &program.0[self.call_frame().fp][self.call_frame().ip];
176
177
             // if the current instruction exists the VM, terminate execution
178
             if let Some(code) = self.run_instruction(instruction)? {
179
                 return Ok(code);
180
             };
181
182
183
184
         0k(0)
185
```

**Listing 3.12** – The 'run' method of the rush VM.

This method manages the entire fetch-decode-execute cycle of the VM. It is immediately apparent that this method looks relatively simple considering that it plays such of a vital

role in the VM. Since the fetch-decode-execute cycle executes instructions repeatedly, the method's main construct is the while-loop beginning in line 175. The condition of the loop checks whether the current instruction pointer refers to a legal instruction inside the current function. This way, the VM comes to a halt as soon as it reaches the end of an instruction sequence. In line 176, the next instruction to be interpreted is saved in the 'instruction' variable. This line represents the *fetch* step because the next instruction is fetched from memory and placed in a spot where it is accessible to the later steps of the cycle.

In the body of the loop, the current instruction is executed using the 'run\_instruction' method. This method is responsible for executing the previously fetched instruction. If execution of the instruction fails, the method returns a runtime error, such as integer overflow. Furthermore, this method may return an optional integer, representing the exit code of the program. If such an integer is returned, instruction execution comes to a halt instantly and the VM exists using it as the exit code. In case the method returns none of these two possible variants, the fetch-decode-execute cycle continues as usual. However, one cannot observe how the instruction pointer is incremented. Among other tasks, this is done by the code in Listing 3.13. It displays the 'run\_instruction' method. This method mainly consists of a

```
crates/rush-interpreter-vm/src/vm.rs
     fn run_instruction(&mut self, inst: &Instruction) -> Result<Option<i64>> {
187
         match inst {
188
             Instruction::Nop => {}
189
             Instruction::Push(value) => self.push(*value)?,
190
197
             Instruction::Jmp(idx) => {
                  self.call_frame_mut().ip = *idx;
198
                  return Ok(None);
199
200
             Instruction::Exit => {
267
                  return Ok(Some(self.pop().unwrap_int()));
268
269
             Instruction::Add => {
278
279
                  let rhs = self.pop();
                  let lhs = self.pop();
                  self.push(lhs.add(rhs))?;
281
282
             // ...
         }
363
         self.call_frame_mut().ip += 1;
364
365
         Ok(None)
    }
366
```

**Listing 3.13** – Parts of the 'run\_instruction' method of the rush VM.

match-expression that determines which code to run depending on the current instruction. Here, the implementation of several instructions is visible. In line 190, the 'push' instruction is handled. This instruction pushes its first operand onto the stack by calling the internal 'push' method, which first validates that the stack will not exceed its maximum capacity. If the push operation causes the capacity of the stack to be exceeded, the method returns a runtime error describing the issue. In line 197, the code responsible for executing the 'jmp' instruction can be seen. This instruction only sets 'ip' to the target index specified by the first operand. It is to be noted that this instruction does not perform any bound checks, therefore allowing illegal jump destinations. For some special instructions, such as this one, 'ip' should not be incremented since it would interfere with the jump operation. Since the instruction pointer is incremented at the end of the method, the 'return' statement in line 199 is used to terminate this method early, preventing modification of 'ip' at the end.

Line 267 contains the implementation of the 'exit' instruction, which is inserted by the compiler if it encounters a call to the 'exit' function. In this case, the instruction leads to the termination of the fetch-decode-execute cycle immediately. Line 278, handles the execution of 'add' instruction. This instruction first pops two elements of the stack since they represent the operands of the underlying mathematical computation. Then, the 'add' method is called on the left-hand side of the computation. It then performs the actual addition. This way, the 'run\_instruction' method stays organized and simple. Although just the code for addition is shown, most of the other infix-expressions are implemented similarly. As outlined previously, after an instruction has been executed, 'ip' is incremented in line 364. If no error occurred during the execution of these instructions, nothing is returned since execution should proceed with the next instruction. This method represents both the decode and execute steps since it first matches (decodes) and then evaluates (executes) the current instruction. The compiler targeting the rush VM is presented in Section 4.2.

### 3.2.6. Comparing the VM to the Tree-Walking Interpreter

One significant benefit of virtual machines is that they execute programs faster compared to tree-walking interpreters. A reason for this speedup is that tree traversal involves a lot of overhead which is omitted when instructions are executed directly. The aforementioned Listing 3.9 on page 29 displayed a recursive function implemented in rush. Figure 3.4 displays a heavily simplified syntax tree representing this function. The root node of the tree represents the 'rec' function. Since the function only contains a single expression, the if-expression node is the only child of the root node. The if-expression contains a condition, an if-branch, and an else-branch.

Since the function should not call itself again if 'n' is equal to '0', the if-branch returns '0'. In the elsebranch, the 'rec' function calls itself recursively. When the above program is executed using the tree-walking interpreter, the algorithm traverses the entire tree of the 'rec' function every time it recurses. In this example, the AST of the program is relatively simple. However, the complexity of the tree grows as the source program evolves. Since loops and recursive functions execute the code in their bodies repeatedly, the repeated tree



**Figure 3.4** – AST and VM instructions of the recursive rush program in Listing 3.9.

traversal causes inefficiency. Here, the inefficiency solely lies in the repeated tree traversal, not in the repetition introduced by an iterative or recursive algorithm.

In order to improve efficiency, an algorithm could traverse the AST once, saving its semantic meaning in the process. Then, the semantic meaning of the previously traversed tree could be interpreted repeatedly without the additional overhead. This behavior is used in the rush VM since it interprets instructions previously generated by a compiler. The compiler only traverses the AST once, generating VM instructions during the process. As mentioned previously, the instructions in Listing 3.11 on page 30 represent this recursive program. Every time the 'call' instruction in line 23 is executed, the VM only needs to jump to the instruction in line 10 in order to call the 'rec' function recursively. Since repeated traversal of the syntax tree is omitted, rush programs will run significantly faster when running on the VM compared to the tree-walking interpreter. Using the VM, executing the 'rec' function

using an input of n=1000 took around 160 µs. However, executing the identical code using the tree-walking interpreter took around 427 µs<sup>4</sup>. The rush VM executed the identical code roughly 2.7 times faster than the tree-walking interpreter. However, the initial delay caused by compilation was not considered in this benchmark.

As a conclusion, a VM is often a reasonable approach if an interpreted programming language is to be implemented. The main advantages of a VM are increased speed, reduced memory usage at runtime, and fewer runtime errors due to type checking performed by its compiler. The main downsides include the need for a compiler targeting the VM, thus making its implementation more involved compared to a tree-walking interpreter. Furthermore, debugging the VM is often significantly more demanding than debugging a tree-walking interpreter. In the past, commercial software has shown that implementing a programming language leveraging a virtual machine can indeed be successful and on a larger scale. For instance, the Java Virtual Machine (JVM) is the software component responsible for executing the compiled Java bytecode. Through the JVM, the compiled Java program preserves its platform independence while using the benefits introduced by a VM [Lin+14, Section 1.2].

<sup>&</sup>lt;sup>4</sup>Average from 10000 iterations. OS: Arch Linux, CPU: Ryzen 5 1500, RAM: 16 GB.

# 4. Compiling to High-Level Targets

In the previous chapter, the implementation of a programming language using an interpreter has been explained. However, another method of implementing a programming language is to create a compiler. This chapter presents three different compilers for high-level targets, and one transpiler.

# 4.1. How a Compiler Translates the AST

Often, a compiler traverses an AST generated by the analyzer in order to translate it to some sort of output. For each AST node, the compiler usually calls a separate function or method which is specialized in translating this specific node type, similar to the treewalking interpreter. These individual methods return some sort of value representing the translated



Figure 4.1 – Abstract syntax tree for '1 + 2 < 4'.

node. Otherwise, each individual method may also insert generated instructions into an internal field of the compiler. Here, the newly generated instructions are inserted into the output sequence. In this case, each method also returns metadata about the previously compiled node. For instance, this data may include the register or memory location of a previously compiled expression, so that other AST nodes can refer to it later. In transpilers, i.e., compilers translating one high-level language into another one, each node-specific method returns another tree node representing code of the output language.

Figure 4.1 displays a simplified syntax tree of the rush expression '1 + 2 < 4'. The encircled number after each expression represents the order in which most compilers would traverse this tree. Compilation of the expression starts at the root node of the tree. Here, most compilers will begin translation by first compiling the child nodes using post-order traversal. Post-order traversal is frequently used because the compilation of a node often depends on the output of its child nodes. In this example, translation of the root comparison expression depends on the information returned by compiling its left- and right-hand sides. Therefore, the compiler first considers the node 'Expression 3' which represents the infixexpression '1 + 2'. However, due to post-order traversal, this node is not actually processed since the compiler skips straight to its child nodes. Therefore, the left child 'Expression (1)' is traversed as the very first node. Next, its sibling node 'Expression (2)' is traversed. Since post-order traversal involves considering a root node after the traversal of its children, 'Expression (3)' is traversed after (1) and (2) have been considered. Here, the compiler considers the operator of the expression in order to generate the appropriate output instruction. Therefore, the instruction responsible for the addition is inserted after 'Expression 3' has been traversed. Since it and its children are now completed, and its output instruction has been inserted, the compiler now considers the right-hand side of the comparison. Here, the node 'Expression (4)' only consists of the integer literal '4'. Now, all child nodes of 'Expression (5)' have been traversed, thus the compiler now considers this node itself. Here, the compiler also

inserts the appropriate instruction, that is, a less-than comparison between integers. Since the compiler must be aware of the operands of the instruction, each method involved in the tree traversal returns an entity describing the location of the runtime value of its previously compiled node. For instance, if the target architecture uses registers, every method translating an expression must return the register or memory location containing the value of the expression at runtime. By returning information like this, a node will have information about its children after they have been traversed. It might rely on the values returned by its children, therefore it is traversed at last, thus creating the demand for post-order traversal.

```
1 r0 = 1
2 r1 = 2
3 r2 = add r0, r1
4 r3 = 4
5 r4 = lt r2, r3
```

**Listing 4.1** – Simple pseudoinstructions for a fictional architecture.

Listing 4.1 displays a sequence of instructions for a fictional architecture. This sequence could have been generated from the previously discussed tree in Figure 4.1. It is apparent that the order of instructions matches the order in which the tree was traversed. The instructions in lines 1 and 2 represent the tree-nodes 'Expression ①' and 'Expression ②' respectively. Here, the value '1' is assigned to a register named 'r0' while the value '2' is assigned to the register 'r1'. The 'add' instruction in line 3 appears after the instructions in lines 1 and 2 since their tree nodes were traversed first. Furthermore, the instruction uses the registers 'r0' and 'r1' as its

operands and therefore depends on them containing a value. Therefore, the 'add' instruction can use the registers returned by compiling its child nodes as its operands. Next, the integer '4' is assigned to the register 'r3'. Lastly, the comparison instruction 'lt' uses the result of the addition and the register containing '4' as its operands. Here, it is apparent that the instruction generated by the node which was traversed at last is also inserted at the end.

Therefore, using post-order traversal in order to generate output instructions targeting a register-based architecture is often required. This example illustrates how a simple compiler might operate. However, a similar algorithm is often found in even the most complex compilers.

# 4.2. The Compiler Targeting the rush VM

Since the rush VM interprets instructions directly, there must be a compiler targeting its architecture. For this purpose, a compiler translating rush source code into instructions, which can be executed by the VM, was implemented. Since the VM's architecture was developed with the features of rush in mind, the compiler sometimes requires surprisingly little effort for translating certain AST-nodes. For instance, the compiler translates infix-expressions, such as 'n + m', into instructions using the 'infix\_expr' method. This method is displayed in Listing 4.2.

```
crates/rush-interpreter-vm/src/compiler.rs
     fn infix_expr(&mut self, node: AnalyzedInfixExpr<'src>) {
522
523
         match node.op {
             InfixOp::Or | InfixOp::And => {/* ... */}
524
537
             } <= ao
                  self.expression(node.lhs);
538
                  self.expression(node.rhs);
539
                  self.insert(Instruction::from(op));
540
541
         }
542
543
```

**Listing 4.2** – Compilation of infix-expressions targeting the VM.

This method differentiates between the 'II', '&&', and other possible operators since the former require extra code to be emitted. Here, the focus only lies on the compilation of the latter, meaning any other type of infix-expression. In line 538, the left-hand side expression is compiled first. In the next line, the right-hand side is compiled too. Finally, in line 540, the appropriate instruction representing the infix-operator is inserted. The instruction is generated by a separate function, 'Instruction::from', which converts an infix-operator into a matching instruction. It is to be mentioned that most of the other rush compilers required significantly more code in order to implement the translation of infix-expressions. The reason for the simplicity of this implementation is the fact that the rush VM implements one instruction for each infix-operator. Since this VM implements all features of the source language, some complex operations are not required and thereby simplify the compiler's implementation.

```
_ crates/rush-interpreter-vm/src/compiler.rs
    fn expression(&mut self, node: AnalyzedExpression<'src>) {
445
         match node {
446
             AnalyzedExpression::Int(value) =>
447
         self.insert(Instruction::Push(Value::Int(value))),
457
             AnalyzedExpression::Infix(node) => self.infix_expr(*node),
             AnalyzedExpression::Call(node) => self.call_expr(*node),
458
             AnalyzedExpression::Grouped(node) => self.expression(*node),
459
460
    }
461
```

**Listing 4.3** – Compilation of expressions targeting the VM.

Listing 4.3 shows the 'expression' method which takes an 'AnalyzedExpression'. However, the method does not return anything which represents the runtime value of the expression. This is possible because the result of the expression is pushed onto the stack directly. In line 447, the code for translating an integer literal can be seen. Here, a 'push' instruction using the integer as its operand is inserted. By pushing the values of expressions onto the stack, most methods do not need to return values. Due to this, short and elegant code, like the one in Listing 4.2, can be implemented. In other compilers, the method responsible for compiling expressions would usually return the register which will contain the value of the compiled expression at runtime. Due to the values being saved on the VM's stack, other parts of the compiled program can still use the runtime values of compiled expressions. Similar to the 'expression' method of the analyzer, this method also dispatches the traversal of more complex expressions to specialized methods in order to keep its implementation simple. For instance, the method for compiling infix-expressions is invoked in line 457.

Another reason for the compiler's simplicity regarding infix-expressions is that the rush VM includes a special instruction for the mathematical power operation. In rush, the expression 'n \*\* m' can be used to denote the mathematical expression  $n^m$ . Since many real architectures lack an instruction for power operations, most other rush compilers demanded implementation of special edge-cases in order to make compiling power-expressions feasible. The fact that the VM provides one instruction per operator shows how a carefully chosen target architecture simplifies the implementation of its compiler by a great deal.

However, there is also one aspect of the VM which made implementation of the compiler targeting the VM more demanding than usual. For instance, in most assembly dialects, labels can be used to allow jumps between blocks of code. However, the VM intentionally does not support the use of such labels. Since the VM would have to look up the exact instruction index of a label at runtime, each jump targeting a label would involve some additional overhead. Like seen in the previous examples, VM jump instruction require the exact index of the destination instruction as their operand. Therefore, the exact target index

to which the instruction should jump must be determined at compile time. To illustrate this issue, Listing 4.4 shows a rush program containing a loop, and Listing 4.5 displays the corresponding compiler output. In the loop's body, the variable 'n' is incremented by '1'. Next, the 'break' statement is used to terminate loop execution. Therefore, the total amount of iterations is '1'.

```
fn main() {
    let mut n = 0;
    loop {
        n += 1;
        break;
    }
}
```

**Listing 4.4** – A rush program containing a loop.

```
1  setmp 1
2  push 0
3  svari *rel[0]
4  push *rel[0]
5  clone
6  gvar
7  push 1
8  add
9  svar
10  jmp 11
11  jmp 3
```

**Listing 4.5** – Compiler output for the rush program in Listing 4.4

In lines 2 and 3 of Listing 4.5 the variable 'n' is initialized. The instructions in lines 4–9 are used to increment 'n' by '1'. A new instruction which was not covered so far is the 'clone' instruction. This instruction duplicates the top item on the stack it without prior calls to 'pop'. Therefore, after the instruction has been executed, two identical values exist on the top of the stack. This instruction is only used in assign-expressions in order to duplicate the address value of the assignee.

After 'n' was incremented, the instruction in line 10 jumps to the instruction index '11'. However, the last valid index is '10', it is represented by the 'jmp 3' instruction in line 11. If this jump is executed, the VM has no next instruction to fetch and therefore stops its fetch-decode-execute cycle. Since it jumps to a position outside the loop, it represents the 'break' statement in line 5 of the source program. The 'jmp' instruction in line 11 is responsible for the repetition introduced by the loop. It jumps to the first instruction of the loop's body in line 4. Therefore, the instructions inside the loop's body are executed repeatedly. The difficulty presented by this design is that the index of the jump's target instruction must be determined before the target instruction is inserted. Listing 4.6 displays the method responsible for compiling loops.

In line 331, the loop's body is compiled and instructions generated during this process are inserted into the output sequence. The next statement in line 337 inserts an instruction responsible for jumping back to the start of the loop's body. The variable 'loop\_head\_pos' was previously defined and specifies the index of the first instruction of the loop's body.

In line 340, the top loop is popped off the 'loops' stack. In line 322, this loop was previously pushed onto the 'loops' stack. This stack is an internal field used by the compiler in order to save information about loops. The top item on it always represents the loop currently traversed by the compiler. Each loop item saves two lists, each containing the indices of jump instructions whose target index needs to be adjusted. The first list contains the indices of jump instructions generated by 'break' statements which were encountered during traversal of the loop's body. The second list saves the indices of jump instructions emitted by 'continue' statements. For instance, if the compiler encounters a 'break' statement, the code in Listing 4.7 is executed.

Here, the 'pos' variable saves the index of the jump instruction to be inserted. In line 271,

**Listing 4.6** – Implementation of loops in the rush VM compiler.

```
_ crates/rush-interpreter-vm/src/compiler.rs
     fn statement(&mut self, node: AnalyzedStatement<'src>) {
253
             AnalyzedStatement::Break => {
268
                 // the jmp instruction is corrected later
269
                 let pos = self.curr_fn().len();
270
                 self.curr_loop_mut().break_jmp_indices.push(pos);
271
272
                 self.insert(Instruction::Jmp(usize::MAX));
273
287
    }
288
```

**Listing 4.7** – Compilation of 'break' statements in the rush VM compiler.

this index is then inserted into the list containing the placeholder indices of the current loop. Lastly, a 'jmp' instruction is inserted containing a placeholder target index. In line 342 of Listing 4.6, the 'fill\_blank\_jmps' method is used to set the target indices of the specified jump instructions to 'pos'. The explanation of this method is omitted because it only iterates over the passed list of indices, replacing the target of each jump instruction during the process.

# 4.3. Compilation to WebAssembly

The first external compilation target<sup>1</sup> presented here is WebAssembly, or Wasm for short. It aims to be a safe, portable, low-level, efficient, and compact code format, mainly intended for enabling high performance applications in websites. However, unlike the name implies, WebAssembly is not only used in web applications. By itself, it is only a specification that can be implemented by runtimes in any context [Ros22, Section 1.1]. Most modern browsers include such a WebAssembly runtime, but there are also standalone ones, for example wasm-time and wasmer.

### 4.3.1. WebAssembly Modules

Every valid WebAssembly file must contain exactly one module. The WebAssembly specification defines two different representations for these modules. First, there is a human-readable text representation, called  $WAT^2$ , closely resembling S-Expressions<sup>3</sup>. This is comparable to assembly languages for CPU architectures and is the typical target for compilers. Secondly, WebAssembly modules can also be represented in a binary format, which is optimized for size, and comparable to the binary files produced by assemblers [Sen22, pp. 40–44]. Typically, these binary modules are constructed from a text module by using a tool such as wat2wasm from the  $WebAssembly\ Binary\ Toolkit\ (WABT)$  [Sen22, p. 57]. However, the rush WebAssembly compiler instead opts to target the binary format directly, highlighting a few reasons for why most compilers should not follow this approach. Listing 4.8 and Listing 4.9 on page 41 show the same basic WebAssembly module once as WAT and once as a commented hex dump of the same module in its binary representation as produced by wat2wasm.

Focusing on the text representation first, the module contains one function that takes two 'i32' parameters and returns a single 'i32'. An 'i32' in WebAssembly represents an uninterpreted 32-bit integer, meaning that it is not clear whether the integer is signed or unsigned from the type itself. Instead, values of this type can be interpreted as either signed or unsigned by different instructions [Ros22, Section 2.3.1]. For instance, the instruction 'i32.eq', which checks for equality between two 'i32' values, behaves the same disregarding how the integer is signed. In contrast, 'i32.lt\_s' and 'i32.lt\_u' are two instructions both querying whether one 'i32' is less than another, once for signed and once for unsigned integers, as denoted by the suffix [Sen22, p.46].

The mentioned function is exported by the module under the name 'addTwo' to make it accessible externally. What exactly 'externally' means depends on the context the module is executed in. WebAssembly is *stack based* and has one primary stack instructions can operate on. The first two instructions of the 'addTwo' function retrieve the local variable of the given index and push its value onto the stack. 'Locals' in WebAssembly are simple values separate from the main stack. Function parameters are always the first locals, but additional ones can be added too. After the two instructions have been executed, the stack now contains the values of the two function parameters. They are then added by the 'i32.add' instruction, which pops the top two elements off the stack and pushes the sum back onto it. The implicit return value is always what remains on the stack at the end of a function body, just like in the rush VM.

Now, the hex dump of the same module in binary, displayed in Listing 4.9, is to be considered. A WebAssembly binary file always starts with the four bytes '00 61 73 6d'

<sup>&</sup>lt;sup>1</sup>Here, "external" describes that the compiler emits an output file which is then executed or further processed by another, already existing tool.

<sup>&</sup>lt;sup>2</sup>Short for "WebAssembly Text" [Sen22, p. 40].

<sup>&</sup>lt;sup>3</sup>Short for "symbolic expression", used for representing nested, tree-structured data [Sen22, p. 41].

```
1 (module
2 (func (export "addTwo") (param i32 i32) (result i32)
3    local.get 0
4    local.get 1
5    i32.add))
```

**Listing 4.8** – Simple WebAssembly module in text representation.

```
0000000: 0061 736d
                                      ; WASM_BINARY_MAGIC
   0000004: 0100 0000
                                      ; WASM_BINARY_VERSION
   ; section "Type" (1)
                                      ; section code
   0000008: 01
                                     ; section size
   0000009: 07
   000000a: 01
                                     ; num types
   ; func type 0
7
  000000b: 60
                                     ; func
9 000000c: 02
                                     ; num params
10 000000d: 7f
                                     ; i32
11 000000e: 7f
                                     ; i32
  000000f: 01
                                     ; num results
13 0000010: 7f
                                      ; i32
   ; section "Function" (3)
14
  0000011: 03
                                     ; section code
15
16 0000012: 02
                                      ; section size
   0000013: 01
                                      ; num functions
17
  0000014: 00
                                      ; function 0 signature index
   ; section "Export" (7)
                                      ; section code
20 0000015: 07
                                      ; section size
21
   0000016: 0a
                                      ; num exports
22 0000017: 01
23 0000018: 06
                                     ; string length
24 0000019: 6164 6454 776f
                             addTwo ; export name
25 000001f: 00
                                     ; export kind
26 0000020: 00
                                      ; export func index
27 ; section "Code" (10)
28 0000021: 0a
                                      ; section code
                                     ; section size
29 0000022: 09
30 0000023: 01
                                     ; num functions
   ; function body 0
31
                                     ; func body size
   0000024: 07
32
                                      ; local decl count
33 0000025: 00
                                      ; local.get
34 0000026: 20
35 0000027: 00
                                      ; local index
   0000028: 20
                                     ; local.get
                                     ; local index
   0000029: 01
37
                                     ; i32.add
   000002a: 6a
                                     ; end
39
   000002b: 0b
40 ; section "name"
41 000002c: 00
                                      ; section code
                                     ; section size
42 000002d: 0a
43 000002e: 04
                                      ; string length
44 000002f: 6e61 6d65
                               name ; custom section name
45 0000033: 02
                                      ; local name type
46 0000034: 03
                                      ; subsection size
47
  0000035: 01
                                      ; num functions
  0000036: 00
                                      ; function index
48
   0000037: 00
                                      ; num locals
```

**Listing 4.9** – Simple WebAssembly module in binary representation.

called the *Wasm binary magic*, representing a zero byte followed by the string 'asm' using ASCII representation. This is used by other programs to easily identify binary files as WebAssembly modules. Following that is the version of the binary format, stored as a 32-bit integer in little-endian<sup>4</sup>. At the time of writing it is always '1' [Ros22, Section 5.5.16].

The binary module is then split into different sections, each containing one kind of information about the entire module. Empty sections can be omitted. Each section begins with its identifier, followed by the section size in bytes. Most sections contain one vector of relevant data, and vectors always start with the count of elements they contain, and continue with the elements themselves. The first section present here is the 'Type' section. It declares different types used by the module, most importantly, the function signatures. The 'Function' section then contains the number of functions of the current module and simply references the 'Type' section for each function's signature. The module's exports are declared in the 'Export' section. Finally, the 'Code' section contains the actual instructions for each function. It is again stored as a vector, containing function bodies for all functions defined in the 'Function' section in the same order. Each function body begins with its size in bytes, continues with the instructions, and ends with an 'end' instruction represented by a 'Ob' byte [Ros22, Sections 5.5, 5.1.3, 5.4.9].

The wat2wasm tool used here additionally adds a custom 'name' section. Custom sections always have the ID '0' and must provide a custom name using UTF-8 encoding. This 'name' section has its own specification separate from the main module specification, and is used to provide names for functions and variables that can then be used by development tools like debuggers [Ros22, Section 7.4.1].

Apart from exporting, WebAssembly modules can also import external functions. Only the name and type signature must be provided, and the WebAssembly runtime will then have to provide an implementation during program execution. Furthermore, WebAssembly does not only have local variables, but also global ones, accessible from every function. These must be initialized with a constant value and can either be mutable or immutable.

One may have already noticed that except for the version number at the start, all sizes, indices, lengths, and alike, have been stored using just a single byte. But, this is not because those can only reach a maximum of '255', but instead WebAssembly uses the LEB128 encoding for integer literals in binary modules. It is a space efficient way to store integers by only ever needing as many bytes as necessary for a number [Ros22, Section 5.2.2]. However, the encoding details are not explained here, as the rush Wasm compiler simply uses a pre-existing crate called "leb128".

# 4.3.2. The WebAssembly System Interface

Since WebAssembly itself is independent of a runtime, it cannot provide standardized ways to interact with the environment, except for module imports and exports. That is why an additional specification called the WebAssembly System Interface, short WASI, was created for WebAssembly modules that need to communicate with an operating system for instance. Any runtime supporting WASI must provide a set of functions comparable to system calls on Linux or Windows. These can then be imported from a WebAssembly module to, e.g, write to a console or exit with a specific exit code. Both wasmtime and wasmer implement the WASI interface.

A WebAssembly module making use of WASI must export one function under the name '\_start' that acts as the entry point of the program. The rush WebAssembly compiler only ever imports WASI's 'proc\_exit' function, which takes one 32-bit integer as an argument and terminates execution with the given code.

<sup>&</sup>lt;sup>4</sup>Little-endian starts with the least significant byte, whereas big-endian starts with the most significant one.

### 4.3.3. Implementation

As indicated, the rush WebAssembly compiler targets the binary format directly. This complicates compilation in a few ways, but removes the need for any external dependencies. First, public constants are defined for all instructions and all types in separate files. Listing 4.10 shows an extract containing the 'i64.add' and 'i64.sub' instructions.

**Listing 4.10** – Definition of instruction opcodes.

The 'Compiler' struct has a lot of fields for various purposes. Only some are shown in Listing 4.11 and explained here. To begin, multiple fields regarding the currently compiled function are defined. The field 'function\_body' contains the bytes with instructions for the current function. The field 'locals' stores which local variables the function has, along with their types. In the binary format, the locals are stored as a WebAssembly vector, starting with the number of locals, followed by each local. Since the compiler cannot know the count of local variables before compiling the function body, it stores them as a vector of byte vectors first. This way, in the end, it can first append the vector's length to the final output and then concatenate the vector's contents. The three following fields all map names to indices. One for local variable scopes, one for the global scope, and one for function names. Each index itself is stored as a vector of bytes, as it uses the aforementioned LEB128 encoding which can vary in length. Finally, one field for every supported section is defined. These are of the type 'Vec<vec<u8>>' for the same reason as 'locals'.

```
crates/rush-compiler-wasm/src/compiler.rs
   pub struct Compiler<'src> {
11
        /// The instructions of the currently compiling function
12
        pub(crate) function_body: Vec<u8>,
13
        /// The locals of the currently compiling function
14
        pub(crate) locals: Vec<Vec<u8>>,
15
        /// Maps variable names to `Option<local_idx>`, or `None` when of type `()`
26
        pub(crate) scopes: Vec<HashMap<&'src str, Option<Vec<u8>>>>,
27
        /// Maps global variable names to `global_idx`
28
        pub(crate) global_scope: HashMap<&'src str, Vec<u8>>,
29
        /// Maps function names to their index encoded as unsigned LEB128
30
        pub(crate) functions: HashMap<&'src str, Vec<u8>>,
31
        // ...
        pub(crate) type_section: Vec<Vec<u8>>,
37
        pub(crate) import_section: Vec<Vec<u8>>,
38
        pub(crate) function_section: Vec<Vec<u8>>, // 3
39
   }
58
```

**Listing 4.11** – The 'Compiler' struct definition of the WebAssembly compiler.

Listing 4.12 contains the compiler's entry point method. Some details are omitted, but essentially, it simply calls another method to compile the program itself, and then concatenates all sections together to form the final binary. It also imports all required functions from WASI and exports blank linear memory as required by WASI. The 'Self::section' function is used to add the section identifier, byte length, and element count in front of each section's contents.

Inside the 'program' method, the global variables are defined and initialized, and all function signatures are added to the 'Type' and 'Function' sections and the 'functions' map. Afterward, it calls 'function\_definition' for every defined function. This has to be parti-

```
pub fn compile(mut self, tree: AnalyzedProgram<'src>) -> Vec<u8> {
 70
         self.program(tree);
81
104
             &b"\0asm"[..],
                                    // magic
105
             &1_i32.to_le_bytes(), // spec version 1
106
             &Self::section(1, self.type_section),
107
             &Self::section(2, self.import_section),
108
             &Self::section(3, self.function_section),
109
         ]
124
         .concat()
125
    }
126
```

**Listing 4.12** – Entry point of the Wasm compiler.

tioned into two steps because rush allows a function to be called before its definition. For every function body, all statements and the optional trailing expression are compiled first, and then the values of 'function\_body' and 'locals', along with their combined length are appended to the 'Code' section.

All other nodes have a matching method defined again, like every time an AST is traversed. Each of those methods simply pushes instructions to 'function\_body'. Because WebAssembly is stack-based, none of them have to return any value, not even the expressions. They simply add their instructions to the resulting code and call methods for nested nodes beforehand. This will result in the intended behavior, just like in the rush VM.

#### **Function Calls**

Compared to the other compilers presented later, supporting functions and function calls is very straight forward for WebAssembly. It was already explained that WebAssembly modules have a concept of functions with parameters and return values, so mapping rush functions to these is the obvious strategy. Calling a declared function is as simple as compiling all argument expressions in order, causing all evaluated arguments to be on top of the stack, and emitting a 'call' instruction with the target function's index.

#### **Logical Operators**

One special case to highlight is the compilation of logical operators<sup>5</sup> in infix-expressions. This comes down to the fact that in many programming languages, and likewise in rush, these operators evaluate lazily, that is, they only evaluate their right-hand side if the result is not already predeterminable by the left-hand side. Listing 4.13 shows the relevant part of the 'infix\_expr' method. At the start of this method, there are special cases for the '&&' and '||' operators. Here, only the code for the '&&' operator is shown, but the '||' operator is handled similarly. It is evident that the operation is compiled as if it were an if-expression, like 'if !lhs { false } else { rhs }'. Using this strategy, the expected behavior of lazy evaluation is achieved, unlike if both sides had been compiled and the results were compared using an 'i32.and' instruction. The negation of the left-hand side is done in line 764 using the 'i32.eqz' instruction, which consumes an 'i32' from the top of the stack and replaces it with either '0' or '1' based on whether the value was equal to zero. Because the top-most value is already a boolean, the instruction has the effect of negating it. In line 777 an early return is issued to skip the code used for all other infix operators.

<sup>&</sup>lt;sup>5</sup>Operators that only operate on boolean values, typically '&&' and '||'.

```
fn infix_expr(&mut self, node: AnalyzedInfixExpr<'src>) {
758
         match node.op {
759
760
             InfixOp::And => {
761
                  self.expression(node.lhs);
                  self.function_body.extend([
762
                      // if lhs is not true
763
                      instructions::I32_EQZ,
764
                      instructions::IF,
765
                      types::I32,
766
                      // then return false
767
                      instructions::I32_CONST,
768
769
                      // else return rhs
770
                      instructions::ELSE,
771
                  ]);
772
                  self.expression(node.rhs);
773
774
                  // end if
                  self.function_body.push(instructions::END);
775
776
777
                  return;
778
             InfixOp::Or => {/* ... */}
779
797
798
864
```

**Listing 4.13** – Compilation of logical operators for Wasm.

The other compilers explained in this paper, while not mentioning it again, all use the same strategy for these logical operators.

### 4.3.4. Considering an Example rush Program

Listing 4.14 shows an example rush program containing a global integer variable 'a', a local boolean variable 'b', and a call to the built-in 'exit' function. The generated compiler output, converted to WAT using wasm2wat and manually commented, is shown in Listing 4.15. Since the compiler detects usage of the 'exit' function, it imports the corresponding 'proc\_exit' function from WASI under the name '\_\_wasi\_exit' in line 4. Below the 'main' function are the definition and the export of blank memory in lines 23 and 26 respectively, as to conform with WASI's requirements. In line 25, the 'main' function

```
let mut a = 2;

fn main() {
    a += 1;
    let b = true;
    exit(a + b as int);
}
```

**Listing 4.14** – Example rush program.

is also exported as '\_start' and also separately declared as the module's entry point in line 27. Both is done in order to have this function be the one to start execution. Line 24 declares the global 'a' as a mutable 'i64' and sets the initial value of '2'.

Inside the 'main' function, the first thing is the declaration of the local variable 'b' with the type 'i32' in line 6. The type is 'i32', even though a boolean would really only need one bit because WebAssembly only defines this and 'i64' as integer types, and therefore uses the smaller of the two for booleans. As indicated by the comments, the first four instructions represent the rush statement 'a += 1;' from line 4 of the source. They push both the current value of 'a' and a constant '1' onto the stack, which are then added together by the 'i64.add'

```
(module
1
      (type (;0;) (func (param i32)))
2
      (type (;1;) (func))
3
      (import "wasi_snapshot_preview1" "proc_exit" (func $__wasi_exit (type 0)))
4
      (func $main (type 1)
5
        (local $b i32)
6
7
        ;; a += 1;
                                ;; push the current value of `a`
8
        global.get $a
9
        i64.const 1
                                ;; push `1`
        i64.add
                                ;; add those two
10
        global.set $a
                                 ;; pop and set `a` to the result
11
        ;; let b = true;
12
        i32.const 1
                                 ;; push `true`
13
                                 ;; pop and set `b`
14
        local.set $b
15
        ;; exit(a + b as int);
                                 ;; push `a`
        global.get $a
16
                                ;; push `b`
        local.get $b
17
        i64.extend_i32_u
                                ;; cast bool to int
18
                                ;; add those two
19
        i64.add
                                ;; convert i64 to i32 (for `__wasi_exit`)
        i32.wrap_i64
20
        call $__wasi_exit
                                ;; call `__wasi_exit
21
        unreachable)
22
      (memory (;0;) 0)
23
      (global $a (mut i64) (i64.const 2))
24
      (export "_start" (func $main))
25
      (export "memory" (memory ₀))
26
      (start $main))
27
```

**Listing 4.15** – Commented compiler output of the rush program in Listing 4.14.

instruction. The resulting sum is then popped off the stack and set as the new value for 'a'. To set the value of 'b' to 'true', a constant '1' is pushed in line 13 and then used as the new value for 'b' in line 14.

The call to 'exit' is a bit more complex. First, the values of both 'a' and 'b' are retrieved. The cast from a boolean to an integer is performed by the single instruction 'i64.extend\_i32\_u', which zero-extends<sup>6</sup> the 32-bit value to a 64-bit one. The two integers are then added together in line 19. Because WASI's 'proc\_exit' function only takes a 32-bit integer for the exit code, but rush uses 64-bit integers for its 'int' type, the argument value must be converted into an 'i32' first. This happens in line 20 with the 'i32.wrap\_i64' instruction. In case the value is too large to fit into a 32-bit integer, it is wrapped, effectively only interpreting the lower 32 bits of the 64-bit number. After this, the call to '\_\_wasi\_exit' can now be performed without issues.

One might notice the trailing 'unreachable' instruction in line 22. The rush WebAssembly compiler inserts one such instruction after every expression that has the '!' type. As the name suggests, the instruction guarantees the WebAssembly runtime that it is never reached. This is helpful in cases where such expressions are used inside others, such as '1 + exit(2)', and the semantic analyzer ignores the invalid type of the right-hand side because it knows the outer expression will never be reached during runtime. The same information must be given to the WebAssembly runtime using the 'unreachable' instruction since the specification requires runtimes to validate types in modules before executing them.

<sup>&</sup>lt;sup>6</sup>Converting an *n*-bit sized integer to an *m*-bit sized one, where m > n, by prepending zeros to the binary representation.

# 4.4. Using LLVM for Code Generation

LLVM is a software project intended to simplify the construction of a compiler generating highly-performant output programs. It originally started as a research project by *Chris Lattner* for his master's thesis at the University of Illinois at Urbana-Champaign [Lat02]. Since then, the project has been widely adopted by the open source community. In 2012, the project was rewarded the *ACM Software System Award*, a prestigious recognition of significant software which contributed to science. From the point when popularity of the framework grew, it was renamed from *Low Level Virtual Machine* to the acronym it is known by today. Nowadays, it can be recognized as one of the largest open source projects [CA14, preface]. Among many other projects, the Rust programming language depends on LLVM in order to generate its target-specific code [McN21, p. 373]. Furthermore, the *Clang* C and C++ compiler uses LLVM as its code generating backend [Hsu21, preface]. Besides open source projects, many companies have also utilized LLVM in their commercial software. For instance, since 2005, Apple has started incorporating LLVM into some of its products [Fan10, pp. 11-15]. A recent example of software developed by Apple which uses LLVM is the *Swift* programming language which is mainly used for developing iOS apps [Hsu21, preface].

### 4.4.1. The Role of LLVM in a Compiler

In a compiler system, LLVM is responsible for generating target-specific code and performing optimizations. The framework is known for performing very effective optimizations during code generation so that the translated program executes faster at runtime, uses less memory and is smaller in size. In order to interact with LLVM, the system provides an interact with which is usable by other software. Typically, a compiler frontend only analyzes the source program to create an AST. Next, a separate step of compilation invokes the LLVM framework which carries on from this point. This component traverses the AST and uses the LLVM API in order to construct an intermediate representation (IR) of the program. This way, the framework will be able to understand the semantic meaning of the program. Next, LLVM compiles this IR to an output targeting any of its supported platforms. As of today, LLVM features many target platforms so that a compiler designer does not have to implement portability manually [Hsu21, preface]. Listing 4.2 shows how LLVM integrates into the previously discussed steps of compilation. Therefore, the framework represents the backend component of a compiler.



**Figure 4.2** – Steps of compilation when using LLVM.

The updated Figure 4.2 now includes two new steps: "LLVM IR generation" and "LLVM backend". The first step traverses the AST in order to generate a semantically equivalent program formulated using LLVM IR. The second step uses the IR as its input in order to generate target-specific code. A compiler writer must only implement the first three steps of this figure as the last step is represented by the framework itself.

# 4.4.2. The LLVM Intermediate Representation

The LLVM IR represents the source program in a low-level manner. However, even though this IR is low-level, it is still target-independent, and unlike many other low-level languages,

it also contains detailed type information. Therefore, high-level type information is preserved while the benefits of a low-level representation are introduced. This allows LLVM to perform significantly more aggressive optimizations compared to other compiler solutions or frameworks. Therefore, programs compiled using LLVM as the backend will often run significantly faster compared to those generated by simple compilers.

LLVM provides official APIs for interacting with the IR in memory. This way, IR can be created by a frontend without the need for a separate file. If a file containing the IR is written and read by the individual parts of the compiler, the same performance issues introduced by multipass compilation would emerge. Therefore, LLVM provides these APIs for the C++ and C programming languages. However, there are also many unofficial bindings for other languages, such as for Rust, Go, and Python. Since LLVM must be able to perform complex program analysis before it can optimize a program, its IR introduces many rules and constraints. For instance, a program formulated using the IR should always obey the following hierarchy [Hsu21, pp. 211–213]:

- The top-most hierarchical structure is the so-called *module*. It represents the current file being compiled.
- Each module contains several *functions*. Often, each function in the source program is represented using one function in the IR.
- Each function contains several *basic blocks*. A basic block contains a sequence of instructions. Blocks should always be terminated using a *jump*, *return*, or *unreachable* instruction. However, a basic block must never be terminated twice.
- Each *instruction* holds a semantic meaning and represents a part of the source program. For instance, LLVM provides instructions for integer addition or function calls.

Because the LLVM IR provides only little abstraction, numerous optimizations can be achieved in the early stages of compilation. Furthermore, due to the high-level type information contained in the IR, LLVM is able to perform more optimizations on the IR during later stages of compilation. Therefore, the virtual instruction set of LLVM is designed as a low-level representation with high-level type information. This instruction set describes a virtual architecture which is able to represent an abstraction over most of the common types of processors. Although it is low-level, the IR avoids machine specific constraints like a fixed set of registers or low-level calling conventions. Instead, the virtual architecture provides an infinite set of virtual registers which can hold the values of primitives like integers, booleans, floating-point numbers, and pointers. All registers in the IR use the  $SSA^7$  form in order to allow more optimizations. In order to enforce the correctness of the type information included in the IR, each program has to obey these rules [Lat02, p. 14-17].

#### Structure of a Compiled rush Program

In order to understand how a program can be formulated using the LLVM IR, we consider the rush program for calculating Fibonacci numbers again. For reference, the rush program used in this example can be found in Listing 1.1 on page 3. The code in Listing 4.16 displays the identical rush program formulated in LLVM IR, it was generated by the rush compiler targeting LLVM<sup>8</sup>. The code displayed in this listing shows a module named 'main'.

In the lines 5 and 23, two functions are defined using the 'define' keyword. It is apparent that the functions in the LLVM module represent the functions from the source rush program. When examining the signature of the 'fib' function in line 5 of the IR, it becomes apparent

<sup>&</sup>lt;sup>7</sup>Short for "static single assignment", widely used in optimizing compilers.

<sup>&</sup>lt;sup>8</sup>Generated using rush (Git commit '93ee068').

```
; ModuleID = 'main'
1
   source_filename = "main"
2
   target triple = "x86_64-alpine-linux-musl"
3
4
   define internal i64 @fib(i64 %0) {
5
6
   entry:
7
     %i_lt = icmp slt i64 %0, 2
8
      br i1 %i_lt, label %merge, label %else
9
                                                         ; preds = %entry, %else
10
   merge:
     %if_res = phi i64 [ %i_sum3, %else ], [ %0, %entry ]
11
      ret i64 %if_res
12
13
                                                         ; preds = %entry
14
      \%i_sum = add i64 \%0, -2
15
      %ret_fib = call i64 @fib(i64 %i_sum)
16
17
      \%i_sum1 = add i64 \%0, -1
      %ret_fib2 = call i64 @fib(i64 %i_sum1)
18
      %i_sum3 = add i64 %ret_fib2, %ret_fib
19
      br label %merge
20
   }
21
22
   define i32 @main() {
23
   entry:
24
25
      %ret_fib = call i64 @fib(i64 10)
      call void @exit(i64 %ret_fib)
26
27
      unreachable
   }
28
29
   declare void @exit(i64)
```

**Listing 4.16** – LLVM IR representation of the program in Listing 1.1.

that the function returns a runtime value of the type 'i64', meaning a signed 64-bit integer. In rush, the 'int' type represents a 64-bit signed number. Furthermore, we can observe that the function takes a parameter named '%0' of the type 'i64'. It represents the 'n' parameter in the rush source program.

In line 6, the start of the 'entry' block of the 'fib' function is declared using the block's name followed by a colon. Since LLVM can perform more optimizations on variables if they are declared in the 'entry' block of a function, the rush compiler places variable declarations solely in the 'entry' block. In line 7, the 'icmp slt' instruction is used in order to compare the runtime value of the parameter '%0' to a constant '2'. The boolean result is then saved in a virtual register named '%i\_lt'. Since LLVM's virtual registers may have arbitrary names, the rush compiler uses names which depend on the context of the translation. In line 8, the block is terminated using the 'br' instruction. The instruction will only jump to the 'merge' label under the condition that the value of '%i\_lt' is 'true'. Here, the 'merge' and the 'else' labels are used as operands of the branch-instruction. Conditional jumps in LLVM always require two jump targets, one for the case in which the condition is true, and one for when it is false. Due to constraints introduced by its internal optimizations, LLVM only allows jumps to target blocks contained in the same function. This branch-instruction presents the essential part of the if-expression in the source program. If the condition was true at runtime, the instruction would jump to the 'merge' block in line 10. What might seem odd is that there is no if-block, even though the rush frontend has previously translated the if-block into LLVM IR. However, since that block would only jump to the 'merge' block, LLVM's

<sup>&</sup>lt;sup>9</sup>Short for "integer compare (signed less than)".

<sup>&</sup>lt;sup>10</sup>Short for "branch", used to jump to the beginning of a basic block.

optimizations removed it entirely.

In line 11 of the 'merge' block, the 'phi' instruction is used. These so called  $\phi$ -nodes are necessary due to the SSA form used in the IR. In short, a  $\phi$ -node produces a different value depending on the basic block where control came from. Since the if-construct is an expression in rush, LLVM must know whether the result of the 'entry' or the 'else' branch is to be used as the result of the entire if-expression. As a solution to this problem, these  $\phi$ -nodes associate a value to an origin branch. In this example, the phi-node yields the value of the parameter 'n' (stored in '%0') if control came from the 'entry' block. Otherwise, in case control came from the 'else' block, the  $\phi$ -node yields the value of the virtual register "%i\_sum3". However, we have not covered where this virtual register is declared. For this, we consider the instructions in the 'else' block, starting in line 15 with the 'add' instruction. In this case, the instruction subtracts '2' from the parameter '%0' and saves the result in "". Here, an addition instruction using a negative operand is used to perform the subtraction. This is done in order to create the argument value for the first recursive call to 'fib' which is performed by the following 'call' instruction in line 16. The return value of the function call is saved in the '%ret\_fib' register. The same behavior is used in order to call 'fib(n - 1)'.

Next, the 'add' instruction in line 19 is used in order to calculate the sum of the two return values. This sum is then saved in the virtual register '%i\_sum3'. Therefore, when this register is used in the  $\phi$ -node in line 11, the result of the recursive calls can be used as the result of the entire if-expression. In line 20, the 'br' instruction jumps to the 'merge' block unconditionally as there is no condition provided in the operands. After the jump to the 'merge' block, the previously explained  $\phi$ -node is encountered. Finally, the 'ret' instruction in line 12 is used in order to use the result of the if-expression as the return-value of the function. Since the 'main' function does not introduce any new concepts, detailed explanation of its contents is omitted. In line 27, the 'unreachable' instruction is used in order to state that this instruction is never reached. This is necessary because LLVM requires that every basic block is terminated at its end. The 'exit' function terminates the program using a system call and therefore terminates the basic block. However, LLVM does not regard callinstructions as diverging and therefore disallows the call to 'exit' as a way to terminate the basic block. Since LLVM is not aware of the fact that the 'exit' function terminates program execution, an 'unreachable' instruction has to be added in order to signal that a block is terminated.

It is to be mentioned that the original IR generated by the rush compiler looks slightly different because LLVM has already performed aggressive optimizations on this code. By considering the example from above, it became apparent that the IR represents many source language constructs in a high-level way. For instance, function calls can be used without considering the complex rules introduced by low-level calling conventions. Here, calling and returning from a function can be implemented using very little effort. Furthermore, virtual registers allow the compiler frontend to omit the process of register allocation entirely. Lastly, the LLVM IR can subjectively be seen as very readable since registers, basic blocks, and functions may contain custom, human-readable labels.

# 4.4.3. The rush Compiler Using LLVM

In order to get acquainted with the LLVM framework, we have implemented a rush compiler which uses the framework as its backend. However, the entire rush project is implemented using Rust, and as mentioned previously, LLVM's official APIs only support C and C++. Therefore, a third-party Rust wrapper around LLVM is required. We have settled on using the *Inkwell* Rust crate since it exposes a safe Rust API for interacting with LLVM [Kol17].

This compiler uses the annotated AST in order to translate it into LLVM IR. Apart from its output, its principles do not differ from the other compilers presented so far. The code in Listing 4.17 displays the top part of the 'Compiler' struct definition. The 'context' field represents a container for all LLVM entities including modules. Next, the 'module' field contains the underlying LLVM module. The 'builder' field contains a helper struct provided by Inkwell which allows generation of IR in memory. In general, these fields provide an abstraction over LLVM's

```
_ crates/rush-compiler-llvm/src/compiler.rs _
pub struct Compiler<'ctx, 'src> {
    pub(crate) context: &'ctx Context,
    pub(crate) module: Module<'ctx>,
    pub(crate) builder: Builder<'ctx>,
    // ...
    pub(crate) curr_fn:
    Option<Function<'ctx, 'src>>,
    // ...
}
```

**Listing 4.17** – Parts of the struct definition of the rush LLVM 'Compiler'.

APIs which is used by this compiler. In order to get a deeper understanding of how this compiler works exactly, the rush program in Listing 4.18 should be considered. The LLVM IR generated from this input is displayed in Listing 4.19

```
fn main() {
    foo(2);
}

fn foo(n: int) {
    let mut m = 3;
    exit(n + m);
}
```

Listing 4.18 – Simple rush program containing two functions.

The program in Listing 4.18 contains the 'foo' and the 'main' functions. The corresponding definitions are found in the lines 5 and 14 of the output IR in Listing 4.19. The 'foo' function takes the two parameters 'n' and 'm'. It calculates their sum in order to use it as the exit code of the program. In line 7 of the IR, the parameter 'n' and the variable 'm' are added together. What might seem odd is that the definition of 'm' cannot be seen in the IR. Instead, the constant value '3' of the variable is used in the addition instruction. The result of this addition is then used for the function call of the 'exit' function in line 8 of the IR. Because 'n' is '2', the exit code of the program will be '5'.

```
At the beginning of translation, the compiler first iterates over all declared functions in order to add them to the LLVM module. Listing 4.20 displays
```

```
define internal i1 @foo(i64 %0) {
entry:
                                      6
  %i_sum = add i64 %0, 3
                                      7
  call void @exit(i64 %i_sum)
                                      8
  unreachable
                                      9
                                      10
                                      11
declare void @exit(i64)
                                      12
                                      13
define i32 @main() {
                                      14
entry:
                                      15
  %ret_foo = call i1 @foo(i64 2)
                                      16
  ret i32 0
                                      17
                                      18
```

Listing 4.19 – LLVM IR generated from the program in 4.18

parts of the method responsible for translating the 'main' function.

```
fn main_fn(&mut self, node: &'src AnalyzedBlock) {
321
334
         let fn_type = self
335
             .module
             .add_function(fn_name, fn_type, Some(Linkage::External));
336
337
         // create basic blocks for the function
338
339
         let entry_block = self.context.append_basic_block(fn_type, "entry");
         let body_block = self.context.append_basic_block(fn_type, "body");
340
341
         // set the current function to `main`
342
         self.curr_fn = Some(Function {
343
             name: "main",
344
             llvm_value: fn_type,
345
             entry_block,
346
         });
347
348
349
         // compile the body
         self.builder.position_at_end(body_block);
350
         self.block(node, true);
351
         self.builder.position_at_end(entry_block);
369
         self.builder.build_unconditional_branch(body_block);
370
    }
371
```

**Listing 4.20** - Compilation of the 'main' function using LLVM.

In the lines 334-336, the 'main' function is added to the current LLVM module. The definitions of the variables 'fn\_name' and 'fn\_type' are not visible. The first variable specifies the name of the function to be inserted, while the latter describes its signature. In most cases, the 'main' functions's return type is an integer so that a C standard library can use the function as the entry point. In cases where the generated code should not depend on C libraries, 'fn\_name' will be '\_start' and 'fn\_type' will state that the function returns 'void'. This concept is mentioned since the LLVM compiler will not always target the current architecture, meaning a C standard library is not always available. In the lines 339 and 340, the 'entry' and 'body' basic blocks are created and added to the function. Next, in the lines 343-347, the 'curr\_fn' field of the compiler is updated. This field holds information about the current function being compiled. In line 345, the 'llvm\_value' property is of particular importance since all later additions of basic blocks, e.g., during loop compilation, require this value. This property can then later be accessed in case a new basic block should be appended, even though it happens in another method. The 'entry\_block' property in line 346 is required every time a pointer is created, as will be explained later. However, the reason for this behavior is explained later.

Using the Inkwell library, most generated instructions will be automatically appended to the end of the current basic block. For that reason, the position of the instruction builder is changed to the newly created 'body' block, so that the following call to 'block' in line 351 adds instructions to it. The 'block' method first creates a new scope and then compiles all the block's statements and its optional trailing expression. Lastly, the method attempts to compile the block's optional expression. If the content of the body of the 'main' function does not lead to the insertion of more basic blocks, the 'body' block will contain the entire contents of the function after the method call.

In line 2 of the example rush program, the 'main' function calls the foo function using the argument value '2'. In order to understand how this compiler translates function calls, Listing 4.21 should now be considered.

```
crates/rush-compiler-llvm/src/compiler.rs
     fn call_expr(&mut self, node: &'src AnalyzedCallExpr) -> BasicValueEnum<'ctx> {
916
         let args: Vec<BasicMetadataValueEnum> = node
932
933
              .args
              .iter()
934
              .filter_map(|arg| {
935
936
                  if arg.result_type() == Type::Never {
                      self.expression(arg);
937
                      has_never_param = true;
938
                      None
939
                  } else {
940
                      Some(BasicMetadataValueEnum::from(self.expression(arg)))
941
                  }
942
             })
943
             .collect();
944
         // ...
951
         let res = self
952
              .builder
              .build_call(func, &args, &format!("ret_{}", node.func))
953
954
              .try_as_basic_value();
     }
957
```

**Listing 4.21** – Compilation of call-expressions using LLVM.

It displays a small part of the 'call\_expr' method. The shown statement inserts a LLVM 'call' instruction. For this, the 'build\_call' method of the builder is called using the target function, call arguments, and the name of the result register. The variable 'func' represents the called function which was previously declared by looking up the function name in the module. The 'args' variable contains a vector of Inkwell's 'BasicMetadataValueEnum' and therefore specifies the type of each argument. This variable was defined previously by compiling each expression in the 'node.args' vector. However, we cannot understand how results of expressions are handled in this compiler without considering Listing 4.22.

```
crates/rush-compiler-llvm/src/compiler.rs
     fn expression(&mut self, node: &'src AnalyzedExpression) -> BasicValueEnum<'ctx> {
873
874
         match node {
             AnalyzedExpression::Int(value) => self
875
                  .context
876
877
                  .i64_type()
878
                  .const_int(*value as u64, true)
                  .as_basic_value_enum(),
879
             AnalyzedExpression::Infix(node) => self.infix_expr(node),
910
         }
911
     }
912
```

**Listing 4.22** – Compilation of expressions Using LLVM.

The code in Listing 4.22 shows parts of the 'expression' method of this compiler. When considering the method's signature, it becomes apparent that it uses an 'AnalyzedExpression' in order to generate a 'BasicValueEnum'. This type represents a virtual register which will contain the expression's result at runtime. When using Inkwell, most inserted instructions yield such a value. The lines 875–879 show how a constant integer expression is compiled. Here, a constant int value of the 'i64' type is created and transformed into a 'BasicValueEnum' which is then returned. For more complex expressions, the 'expression' method invokes other methods which are specialized on certain types of expressions. For instance, if an infix-

expression like '3 \* n' is compiled, this method calls the 'infix\_expr' method in line 910. Here, the current AST node is passed to the specialized function as a call argument.

```
crates/rush-compiler-llvm/src/compiler.rs
     fn infix_helper(/* ... */) -> BasicValueEnum<'ctx> {
960
967
          match lhs_type {
              Type::Int(0) \mid Type::Bool(0) \mid Type::Char(0) => {
999
                  let lhs = lhs.into_int_value();
1000
1001
                  let rhs = rhs.into_int_value();
1002
                  match op {
                       InfixOp::Mul => self.builder.build_int_mul(lhs, rhs, "i_prod"),
1021
1022
                       InfixOp::Div => self.builder.build_int_signed_div(lhs, rhs,
          "i_prod"),
                       InfixOp::Rem => self.builder.build_int_signed_rem(lhs, rhs,
1023
           i_rem"),
                       InfixOp::Pow => self.__rush_internal_pow(lhs, rhs),
1024
1050
1059
     }
1060
```

**Listing 4.23** – Compilation of integer infix-expressions using LLVM.

The code in Listing 4.23 shows a part of the 'infix\_helper' method which is responsible for compiling parts of infix-expressions. Line 1021 contains the code for inserting the 'mul' instruction. Here, the variables 'lhs' and 'rhs' are used as arguments for the 'build\_int\_mul' method call. They, too, represent virtual registers which will contain the values of the leftand right-hand side at runtime. Furthermore, the string containing 'i\_prod' specifies the name of the virtual register that will contain the resulting product. Compiling basic integer multiplication has proven to be really simple since only one instruction needs to be inserted. This simplicity applies to most infix operations performed on integers. However, compiling mathematical powers has proven to be more demanding since LLVM does not provide an instruction for performing these operations. Line 1024 is executed if the method needs to compile such an integer power operation. In order to mitigate this issue, the '\_\_rush\_internal\_pow' method is called instead of a method provided by Inkwell. This method first declares the 'core::pow' function which implements an algorithm for calculating powers given an integer base and exponent. It was manually written in IR by using the appropriate calls to the Inkwell builder. This way, even complex calculations can be implemented, even though LLVM does not provide an existing solution. Although it might seem odd to implement IR functions manually, this approach is used to maintain target independence.

In line 6 of the source program, a let-statement is used to define the mutable variable 'm' with the initial value '3'. For demonstration purposes, this variable is marked as mutable so that the compiler uses stack memory for it. In order to understand how the compiler translates let-statements, Listing 4.24 should be considered.

The code displays parts of the 'let\_stmt' method of this compiler. In line 610, the initializing expression of the statement is compiled. The 'rhs' variable then specifies the virtual register which contains the result of this expression. The code in the lines 614–624 is only executed if the variable was declared as mutable. Otherwise, the variable would be constant and therefore require no space in memory. Therefore, in order to present relevant code in this example, the 'm' variable in the source program had to be declared as mutable. In line 616, the 'alloc\_ptr' method is used in order to create a new Inkwell pointer value that is declared

```
crates/rush-compiler-llvm/src/compiler.rs
    fn let_stmt(&mut self, node: &'src AnalyzedLetStmt) {
609
610
         let rhs = self.expression(&node.expr);
611
         // if the variable is mutable, a pointer allocation is required
612
         match node.mutable {
613
             true => {
614
                 // allocate a pointer for the value
615
                 let ptr = self.alloc_ptr(node.name, rhs.get_type());
616
617
                 // store the rhs value in the pointer
618
                 self.builder.build_store(ptr, rhs);
619
620
                 // insert the pointer into the current scope (for later reference)
621
                 let var = Variable::new_mut(ptr, node.expr.result_type());
622
                 self.scope_mut().insert(node.name, var);
623
             }
624
             // ...
         };
629
    }
630
```

**Listing 4.24** – Compilation of let-statements using LLVM.

in the 'entry' block. The first argument of the call specifies that the name of the pointer should be identical to the name of the variable. The second argument passes the type of the initializer expression to the method. In line 619, the expression's result is then stored in the newly created pointer. Since pointers present a way to use stack memory, also non-pointer variables in the source program are internally compiled to pointers. Finally, in line 622, the newly defined variable is inserted into the current scope of the compiler. Every variable inside the scope saves its Inkwell value and its type since these properties are required when the variable is used later.

#### 4.4.4. Final Code Generation: The Linker

After LLVM has compiled a program, it outputs an object file representing the compiled source program. Object files contain the binary machine code output of a compiler or an assembler. In the case of LLVM, they contain the target-specific machine code generated from the intermediate representation. There are many different formats for representing object files, such as ELF on Unix-like systems. However, object files are usually still  $relocatable^{11}$  and not directly executable. In order to create an executable program from object files, a linker is used.

A linker or link editor is a program which takes one or more object files in order to combine them into a single file. Often, the output of the linker is a file which can be executed by the operating system. For instance, a linker might take an object file generated by a compiler in order to create the final executable program. During linking, a linker often perform numerous tasks, such as relocation or symbol resolution. For instance, a linker might also include library code in the executable if the object file depends on external functionality provided by that library. A common example for this library code is the functionality provided by a C standard library. In order to combine these modules, an essential part of the liker's actions is presented by relocation and code modification [Lev00, pp. 1-15]. During relocation, the linker assigns definitive addresses to numerous parts of the program. Relocation is required, for instance, when the program to be translated consists of multiple modules referencing each other. Here, the order in which the individual parts of the program will be placed in memory is not known. Therefore, any absolute addresses in the program are not determined [Zhi17,

 $<sup>^{11}\</sup>mathrm{Load}$  addresses of position-dependent code may still be changed.

p. 74]. However, we will not explain these concepts further since they are not of particular relevance for understanding the purpose of a linker.

The shell command in Listing 4.25 presents an example liker invocation. In this example, the LLVM compiler has generated an object file named input.o. The flag -dynamic-linker is specified in order to tell the linker which dynamic linker to use. Next, some library files in the directory '/usr/lib/' are included by using the '-lc' flag. These files belong to an implementation of the C standard library and are required so that the 'exit' function works properly. Next, the 'input.o' file is specified so that the linker includes it. Lastly, the '-o' flag is used to specify the output file. This



Figure 4.3 – The linking process [Lev00, p. 7].

way, the shell command would generate an executable file named 'output' from an object file named 'input.o'. Therefore, a linker often presents the final step of translating a source program into an executable file which the computer can understand. However, the linker is completely independent of the previous stages of compilation and is therefore not displayed in the figures. Even though the linker program LD is used in this example, the choice of the linker is completely irrelevant as long as the linker supports the required input and output formats.

```
1    ld -dynamic-linker /lib/ld-linux-x86-64.so.2 \
2      -lc /usr/lib/crt1.o \
3      -lc /usr/lib/crti.o \
4      -lc /usr/lib/crtn.o \
5      input.o \
6      -o output
```

**Listing 4.25** – Invoking LD to link an LLVM output.

#### 4.4.5. Conclusions

To conclude, implementing a compiler which leverages LLVM presents a lot of advantages, like support for many target architectures and aggressive optimizations. Most of the demanding work, like optimizations, is being handled by LLVM, therefore implementing the compiler will prove to be less difficult and error-prone. LLVM presents a robust, production-ready, and scalable backend which is even used by very popular compilers. Finally, in order to give an example for how LLVM's optimizations can positively impact performance at runtime, the Fibonacci benchmark should be considered again. In this benchmark, the 42<sup>nd</sup> Fibonacci number is calculated using the algorithm displayed in Listing 1.1 on page 3. Running the program compiled using the rush LLVM compiler took around 1.3 seconds. However, executing the binary generated using the rush x86\_64 compiler took around 2.17 seconds<sup>12</sup>. This means that the program compiled using LLVM ran roughly 1.66 times faster.

<sup>&</sup>lt;sup>12</sup>Average from 100 iterations. OS: Arch Linux, CPU: Ryzen 5 1500, RAM: 16 GB.

# 4.5. Transpilers

Another, possible target for compilers is another high-level programming language. A compiler translating one language to another language is usually called a *transpiler*. Historically, the very first version of the C++ compiler was a transpiler which translated C++ into C. The main advantage of a transpiler is that its implementation is usually very simple, as low-level compilation steps can be omitted. A significant downside of a transpiler is that it often generates very inefficient code. Furthermore, the entire project would also depend on the target language [Jef21, p. 5].

For this paper, we have implemented a transpiler which generates C from rush source programs. Listing 4.26 shows a rush program which terminates with the code '42'. For this, the value of the block-expression in the lines 2–5 is saved in the variable 'num'. In the block, a temporary variable with a value of '40' is created. The block results in a value of '42' since '2' is added to this variable. In line 6, the 'exit' function is called, using 'num' as its argument. Listing 4.27 shows the identical program represented using C. The code was generated from the rush program in Listing 4.26 using the rush C transpiler<sup>13</sup>.

```
fn main() {
    let num = {
        let b = 40;
        b + 2
    };
    exit(num);
}
```

**Listing 4.26** – A rush program containing a block-expression.

```
#include <stdlib.h>
2
    #include <stdbool.h>
3
    int main();
4
5
6
    int main() {
7
        long long int b0 = 40;
8
        long long int num1 = b0 + 2;
9
        exit(num1):
    }
10
```

**Listing 4.27** – C transpiler output of rush program in Listing 4.26.

When examining the C output, it becomes apparent that the structure of the code has changed slightly. In the lines 1 and 2, the file contains '#include' directives in order to import certain functions and types from the current C standard library. In line 4, the 'main' function is declared, but not yet defined. Although this might not seem very reasonable in this example, C requires functions to be declared before their use, even if the definition follows later. Furthermore, the C program also does not include the aforementioned block because C does not have a structure comparable to rush's block-expressions. The lines 2–5 in the rush code are represented by the lines 7 and 8 in the C code. This examples demonstrates how a transpiler often alters the program's structure in order to represent the source program in another language with different semantic rules. Since transpilers are often not the best way of implementing a compiler, the internals of this transpiler will not be discussed further. It is to be mentioned that some successful and popular transpilers exist. For instance, the popular programming language Typescript uses a transpiler to generate Javascript [Che19, Chapter 2]. To conclude, transpilers can present an easy but often impractical way of implementing a compiler. Due to its disadvantages, a transpiler is often only considered during the prototype phase of compiler implementation [Jef21, p. 5].

<sup>&</sup>lt;sup>13</sup>Generated using rush (Git commit '93ee068').

# 5. Compiling to Low-Level Targets

In the previous chapter, compilation to high-level targets has been presented. The previously shown compilers generated outputs which were still platform independent and portable. However, compilers can also target a specific physical computer architecture directly, thus removing another layer of abstraction. Implementing compilers targeting the architectures presented in this chapter has proven to be a lot more demanding. Reasons for that mostly include target-specific constraints which were still irrelevant in the previous chapter. The term "low-level" is used to describe such target-specific outputs here.

# 5.1. Low-Level Programming Concepts

Programming using high-level languages does not require knowledge about the target architecture of the program. In this chapter, two compilers targeting low-level assembly are presented. In order to make the sections in which these compilers are explained more approachable, some of the most important low-level programming concepts are explained in this section.

#### 5.1.1. Sections of an ELF File

Since a program needs to be representable in bytes, special formats for organization are required. *ELF* stands for "executable and linkable format" and is often found on unix-like systems including Linux. Programs using the ELF format can be represented by three different types of files. Firstly, object files generated by a compiler, like in the previous LLVM section, might use the ELF format. Secondly, dynamic libraries using *shared object files* might also leverage the ELF format. Thirdly, executable programs use the format in order to represent a structured container for instructions, data, and additional information. This way, the unit is mostly self-contained and can be executed by the operating system easily. Therefore, ELF describes the format of a class of files and not just of an individual type of file [Zhi17, pp. 74–76].

Even though a processor only has access to one physical memory unit for both instructions and program data, most assembly programs like to separate these types of memory into their separate components. Therefore, an object file and assembly program is divided into so-called sections [Zhi17, p. 19]. Some of the important ELF sections are listed below [Zhi17, p. 76]:

- '.text' stores the logic of the program represented using CPU instructions.
- '.rodata' stores read-only global data, it is often used for global constants.
- '.data' stores mutable global data, such as mutable global variables.

### 5.1.2. Assemblers and Assembly Language

Assembly language describes a type of low-level programming language which are directly influenced by the target architecture. Since the assembly code provides a slight abstraction over the computer's hardware, the assembly code must be translated to machine code before it can be executed. This process is performed by a program named the assembler and is called assembly process.

For instance, the RISC-V assembly instruction 'add a0, a0, 2' would be used for integer addition. Often, the name of the instructions is a mnemonic roughly describing its functionality. In the case of RISC-V, 'addi' stands for "add immediate". Lastly, the instruction for adding integers differs for most CPU architectures. For instance, an equivalent instruction for the x86\_64 architecture is 'add %rdi, 2'.

As described earlier, the application code in form of instructions is placed in the '.text' section of an ELF binary. In most assembly dialects, the programmer is able to partition the code manually using sections. If the assembly code is assembled to an ELF object executable, the '.text' section should contain all the instructions. Just like parts of the LLVM IR, the '.text' section of an assembly program can be partitioned into blocks. This section contains many labels which mark the beginning of a new block of instructions. In assembly, labels rather serve as targets for jump instructions. However, these blocks do not come with



Figure 5.1 – Level of abstraction provided by assembly [Dan05b, p. 5-6].

all the constraints which are to be followed when using LLVM IR. For instance, a block in assembly might even contain no instructions, does not have to be terminated and could even be terminated twice. Here, terminating instructions refer to jump- and return-instructions. Therefore, the constraints in assembly are much weaker than the ones introduced by LLVM. Because an assembly usually omits complex optimizations, strict rules constraining the assembly code can be omitted too.

In order to understand how much abstraction is provided by assembly, Figure 5.1 should be considered. Here, the highest level of abstraction is provided by high-level programming languages like Rust, C and rush. Compared to the aforementioned languages, rush presents roughly the same level of abstraction. The next lower level of abstraction is represented by assembly languages. As of today, one only rarely encounters a programmer writing programs using this level of abstraction or less. Here, the program is no longer independent of a target architecture and consists of significantly more code. However, the next lower level of abstraction below assembly is machine language. Since the machine language program is represented in binary, it is very time-consuming for a human to write or understand. However, the machine language is also just an abstraction of the computer's hardware and operating system. Assembly provides enough abstraction to be comprehensible for a human while being a low-level representation of the program. Some benefits of using assembly languages are increased runtime efficiency and decreased code size, while maintaining finegrained control over the hardware and operating system. However, achieving these results is often very demanding and time-consuming. Due to the aforementioned reasons, it is often reasonable for a compiler to generate assembly code from the source program. This way, the program is translated into a low-level, target-specific representation which allows the program to be executed on the target machine directly. However, an assembler is still required in order to translate the assembly output of the compiler. Since most assemblers output object files, a linker is required to create the final executable program. Therefore, a compiler targeting the assembly of a specific architecture depends on these two additional steps before the program can be executed.

One might argue that the compiler could output object files directly. However, doing so rarely creates any significant benefits other than the omitted dependence on the assembler. Furthermore, implementing a compiler using this approach significantly increases the complexity of the compiler since it now has to perform the role of the assembler as well.

### 5.1.3. Registers

Most processors contain numerous registers in order to hold data, instructions, and state information. Unlike external memory, they are very limited in both size and count, and must therefore be used carefully. Since they are a physical part of the CPU, they are much faster to access than the external memory. Due to these traits, registers are very suitable for storing temporary values in larger computation [Wat17, pp. 212–214]. Figure 5.2 shows a simplified layout containing a CPU, registers, and memory. Here, the CPU and its registers are connected via a thick arrow, indicating a high throughput



**Figure 5.2** – Relationship between registers, memory, and the CPU [Dan05b, pp. 20–21].

connection. However, the CPU and the memory are connected via a thin arrow which indicates a slower connection. Nearly every CPU architecture will define an individual register layout, meaning that sizes, count, and types of its registers may vary. As a general rule, architectures with large number of registers usually outperform the ones that provide less registers. However, as a result of this, the CPU will be more expensive to manufacture. Since not every register is identical, there are platforms which distinguish between general-purpose, floating-point, and special-purpose registers. For the majority of architectures, a subset of the registers is additionally designated for use in specific operations like *procedure calls* [Dan05b, Chapter 2].

```
4 li a0, 40
5 li a1, 2
6 add a0, a0, a1
```

**Listing 5.1** – Example assembly program for explaining register allocation.

For optimizing compilers, the limited number of registers presents a challenge as performance of the output program shall be maximized. Some architectures may require that the operands of arithmetic or logical instructions have been explicitly loaded into registers beforehand. In this case, registers would be required in nearly all computations. In order to understand how register management can present a challenge, Listing 5.1 should be considered. It displays RISC-V assembly instructions which calculate

the sum of the two integers '40' and '2'. In line 4, the integer value '40' is placed in the register 'a0'. Next, in line 5, the integer '2' is placed in 'a1'. In line 6, the 'add' instruction is used to calculate the sum of these two integers. Since the first operand of the instruction specifies the register in which the result should be placed, the original value of '40' in the register 'a0' would be overwritten by the instruction [PW17, reference]. Through this example, it becomes apparent that instructions might use registers in a way that would interfere with other operations. Therefore, subtle bugs can emerge as soon as an instruction unintentionally overwrites a register.

In compilers, the process of *register allocation* is responsible for managing how registers are used. This process attempts to use registers in a way that leads to maximized efficiency of the output program. Since accessing registers is faster than accessing memory, a register

allocator will try to use registers as often as possible. Real compilers regularly follow this rule to the point at which normal variables are also kept in registers. It is apparent that in most programs, the number of variables will certainly exceed the capacity provided by registers. Therefore, register allocation has to detect when no free registers are available anymore. In this case, the compiler has to save data in memory instead of registers. This process is called register spilling. Since register spilling introduces a performance penalty, register allocators often attempt to use it as infrequently as possible. Therefore, sophisticated algorithms are mandatory if execution performance is regarded as non-trivial. Moreover, register allocation has to ensure that no conflicts between registers are introduced. Such a conflict may be that a register is accidentally overwritten by an instruction in a completely unrelated block. It is apparent that implementation of an algorithm performing all these tasks is usually very demanding. For instance, register allocation often requires complex graph algorithms [Wat17, pp.212-214]. Since register allocation represents a complex topic, it will not be explained any further. However, there are also heavily simplified approaches to register allocation, such as the ones used by the compilers presented in later sections of this chapter.

### 5.1.4. Using Memory: The Stack

As registers are limited in both size and count, additional storage, for example for variables is required. For this, most architectures provide a *stack*. In general, a stack is a last-in-first-out data structure, meaning the element that was last added via a *push* operation is the first to be removed by a *pop* operation. In computer hardware, the stack is usually physically separate from the CPU and therefore much slower to access. It is typically implemented as linear memory that can be accessed and modified freely. A so-called *stack pointer*, which is normally stored in a special register, saves the address of the top-most element of the stack. Thus, the principle of this memory being a stack data structure is merely a convention. A stack typically grows towards lower addresses, so that in order to push values, the stack pointer must be decreased [PH17, pp. 68, 99, 100].

It is typical for compiled procedures to allocate an entire region of the stack for themselves in a procedure's prologue by subtracting the amount of needed bytes from the stack pointer, and freeing it in batch in a procedure's epilogue. These regions are called stack frames, and some architectures, including both RISC-V and x64, define an additional register for pointing to the other end of the current stack frame. This pointer is usually called the frame pointer or base pointer [Wal98, p. 94].

#### Alignment

In order to save space, one might want to save variables of different sizes, say a boolean and an integer, directly next to each other on the stack. However, most architectures require values to be *aligned* to a multiple of their size. Figure 5.3 shows three example memory layouts of one 'u8', one 'u16', and one 'i64'<sup>1</sup>, taking up one, two, and eight bytes respectively. The first layout has two of the integers misaligned, marked in dark gray, thus being invalid. The second layout preserves the same order but inserts empty white areas, called *padding*, in two places, to correct the alignment. Firstly, there is one byte just before the 'u16' to align it to an integer multiple of two bytes, and secondly, there are four bytes before the 'i64' to have that be aligned to eight bytes. The third layout inverts the order, which causes all three values to be correctly aligned as is. **TODO: cite?** 

<sup>&</sup>lt;sup>1</sup>These types are defined by Rust, representing unsigned ('u') and signed ('i') integers with 8, 16, and 64 bits of storage respectively.



Figure 5.3 – Examples of memory alignment.

Listing 5.2 shows the implementation of the rush x64 compiler for aligning a pointer to a given size. The pointer must only be modified if it is not an integer multiple of the desired byte count yet, hence the outer if-expression. To calculate the amount of padding, the formula 'size - (ptr % size)' is used. For instance, say 'ptr' is '22' and 'size' is '8'. The stack memory is then thought of in chunks of eight bytes each, where the resulting pointer should point to the start of the next empty chunk. With the pointer currently being '22', there are two full chunks and one chunk with six of the eight bytes occupied. Hence, two bytes of padding are needed to reach the next chunk. The expression '22 % 8' yields '6', which is the number of bytes that are already in use in the last chunk. This result is then subtracted from '8' again to result in '2', the number of bytes required for padding that is then added to the pointer.

```
pub(crate) fn align(ptr: &mut i64, size: Size) {
    if *ptr % size.byte_count() != 0 {
        *ptr += size.byte_count() - *ptr % size.byte_count();
    }
}
```

**Listing 5.2** – Alignment of values on the stack.

# 5.1.5. Calling Conventions

Programmers often use procedures or functions in order to structure their programs, both to make the program easier to understand and to allow shared logic to be reused. Procedures allow the programmer to focus on one portion of the tasks at the time, thus representing one way of how a high-level programming language might provide abstraction. Often, parameters and returned values act as the interface between the procedure and the remaining code. During a procedure call in a high-level language like rush, many individual steps need to be performed in the program's low-level representation. Since assembly does not support the use of high-level functions, most architectures specify their own calling convention which describes how low-level function calls are to be performed. On most architectures, a low-level procedure call follows these steps [PH17, p. 98]:

- 1. The caller places the call arguments in a place where the procedure can access them.
- 2. Control is transferred to the procedure, often using a jump or specialized call instruction. This specialized instruction often saves the *return address*<sup>2</sup> in a special register, so that function returns are possible.
- 3. First, the procedure acquires local storage resources, like stack memory for its local variables. This step is usually referred to as the *prologue*.
- 4. Internal code of the procedure is executed by evaluating the instructions in its body.

<sup>&</sup>lt;sup>2</sup>Saves the address of the call-site, allows the called procedure to return back to the caller [PH17, p. 99].

- 5. The procedure's result value is placed somewhere so that the caller can access it. Here resources allocated in step 3 are also released again. This step is usually referred to as the *epilogue*.
- 6. Control is transferred back to the caller using a specialized return-instruction that jumps back to the instruction which had initiated the procedure call.

In order to leverage optimal performance during procedure calls, passed arguments are usually kept in registers. However, in the previous subsection regarding registers, it was stated that the majority of architectures only plan for a portion of their registers to be used as call arguments. If the number of call arguments exceeds the number of available argument registers, every remaining argument would have to be spilled to memory, meaning that they are placed on the stack so that the called function can access them. [PH17, p. 98].

For instance, a target architecture might plan for *four* registers ('r0'-'r3') to be used as call arguments. Now, a function is called using *six* arguments. Due to the fact that there are two more arguments than registers, register spilling is now required. In that case, the registers 'r0'-'r3' would contain the first four argument values while the fifth and sixth argument are spilled to the stack.

It is to be mentioned that implementing a compiler which respects the calling convention of its target architecture is not required. However, following the convention is often reasonable in order to preserve  $ABI^3$  compatibility of the compiled program.

### 5.1.6. Referencing Variables Using Pointers

A pointer is a variable which contains the memory address of another variable. Leveraging pointers often leads to more efficient and compact code. Furthermore, pointers are sometimes mandatory for solving a specific problem [KR88, p. 93]. A problem which can only be solved by using pointers is displayed in the rush program in Listing 5.3. This program contains the 'main' and 'modify' functions. In line 2, in the 'main' function, the mutable variable 'answer' is defined with an initial value of '42'. Next, the 'modify' function is called, passing the variable as the call argument. In line 4, the program exits with the value of 'answer'. In the signature of the 'modify' function, the 'n' parameter is declared as a mutable integer. Next, in line 8 of this function, the value of the parameter is

```
fn main() {
    let mut answer = 42;
    modify(answer);
    exit(answer);
    4
}

fn modify(mut n: int) {
    n += 1;
    8
}
```

**Listing 5.3** – A rush program trying to alter the variable behind an argument.

incremented by '1'. One might expect that the function call in line 3 causes the variable 'answer' to be incremented by '1'. This seems likely since both the parameter and the variable are declared as mutable using the 'mut' keyword. However, since rush passes function arguments by value and not by reference, the called function has no direct way of altering the variable behind the passed parameter<sup>4</sup>. In order to solve this problem, pointers are required. The code in Listing 5.4 displays a rush program solving this issue.

Most parts of this rush program look similar to the one displayed in Listing 5.3. However, some parts of the code have been adjusted so that they use a pointer. First, the signature of the 'modify' function looks slightly different. Now, the function takes a parameter of type '\*int' instead

```
fn main() {
    let mut answer = 42;
    modify(&answer);
    exit(answer);
}
```

63

3

5

7

8

<sup>&</sup>lt;sup>3</sup>Short for "application binary interface", allows calling foreign functions from another irregram.

<sup>&</sup>lt;sup>4</sup>Passing by value means that the value of the argument is copied for the function call.

of 'int'. The syntax '\*type' describes a pointer to the type specified after the '\*'. Thus, the function now takes a parameter which is a pointer to an integer variable. In rush, pointers allow full read-write access to the target of the variable. This time, the parameter itself is not declared as mutable since the target variable behind the pointer and not the pointer itself should be incremented. In line 8, the variable stored in the pointer is incremented. When assigning to the target variable behind a pointer, the pointer

first needs to be *dereferenced*. In rush, dereferencing a pointer is accomplished using the '\*' operator before the pointer's identifier. The process of dereferencing a pointer involves accessing the value of the variable the pointer points to [KR88, p. 94].

In rush, only pointers targeting an existing variable can be created. In order to create a pointer to a variable, rush's '&' prefix operator is used. This operator references the target variable in order to return a pointer value. Referencing a variable produces the memory address of that variable [KR88, p. 95]. Since the resulting value is only a normal integer, it could also be treated similarly.

However, some target architectures of a compiler might make the implementation of pointers more difficult. The statement in line 8 increments the value of the variable pointed to by 'n' by 1. This time, 'answer' does change since only the address, which still points to the same mutable variable, was copied during the function call. Due to this write-access, the analyzer only allows the referencing ('&') operator to be used on *mutable* variables. As seen in the updated program, there is no need for the resulting pointer to be mutable since it only stores a read-only address.

A special trait of pointers in rush is that they are able to introduce undefined behavior<sup>5</sup> into a program. However, this only occurs if the pointer is used as a return value of a function in which it references a local variable. This problem occurs because the memory used by that function is freed after the function has executed, resulting in the pointer referencing uninitialized memory. Therefore, pointers should be used with caution since the semantic analyzer cannot guarantee that they are used correctly<sup>6</sup>.

<sup>&</sup>lt;sup>5</sup>The runtime behavior of such scenarios is undefined and might vary per architecture

<sup>&</sup>lt;sup>6</sup>If additional constraints for pointers are introduced, the analyzer could also validate their usage.

# 5.2. RISC-V: Compiling to a Modern RISC Architecture

The RISC-V ISA<sup>7</sup> is a new and modern reduced instruction set architecture focusing on simplicity and expandability. The initial version was developed at UC Berkely in the context of another related research project. Since its introduction in 2011, the architecture has been rapidly growing in popularity while being managed and led by the RISC-V foundation. Today, corporate members of the RISC-V foundation include companies like Google, Microsoft, Samsung, and IBM. Therefore, the general popularity and commercial attraction of the technology is apparent. However, unlike most previous ISAs, the RISC-V architecture is a completely open source project and is therefore not controlled by a single large corporate entity. This can be regarded as a large competitive advantage over other popular RISC architectures like ARM. In the past, many ISAs have failed due to them being too restrictive with their licensing, thus preventing widespread commercial adoption. However, RISC-V is completely open and free to use, so that many companies can leverage the technology commercially while contributing to the project. Unlike most of the previous ISAs, which were developed during the 1970s and 80s, RISC-V is one of the few which were developed in the last decade. Therefore, it seems like RISC-V could be a significant architecture to be used in all sorts of devices in the near future [PW17, preface].

### 5.2.1. Register Layout

Compared to popular CISC architectures, RISC architectures typically have a large count of registers [Dan05b, Chapter 2]. For instance, while the 32-bit variant of the x86 architecture has 8 registers, the popular RISC architecture ARM-32 provides twice that amount, meaning 16 registers. However, a RISC-V CPU includes 32 registers, which is drastically more than the aforementioned architectures. Moreover, these 32 registers only include the general purpose ones. Just for floating-point numbers, the ISA provides another 32 registers. Due to this difference, a register allocator targeting the RISC-V architecture could be more aggressive compared to one targeting a CISC architecture [PW17, p. 10].

The Table 5.1 shows most of the registers provided by the RISC-V architecture. For this table, the official ABI names of the registers have been used. The special 'zero' register always holds a constant '0' Unlike other registers, it is read-only, meaning that it can never be overwritten. The 'ra' register saves the return address of a procedure. If a return instruction is used, the value in 'ra' is used as the address to jump back to. The 'sp' and 'fp' registers are used for managing stack memory. The registers 'a0' and 'a1'

**Table 5.1** – Common registers of the RISC-V architecture [WA19, p. 155].

| Register(s) | Purpose               |
|-------------|-----------------------|
| zero        | hardwired zero        |
| ra          | return address        |
| sp          | stack pointer         |
| t0-t6       | temporary storage     |
| fp          | frame Pointer         |
| a0, a1      | function arguments,   |
|             | return values         |
| a2-a7       | function arguments    |
| s1-s11      | saved register        |
| fa0, fa1    | float arguments, re-  |
|             | turn values           |
| fa2-fa7     | float arguments       |
| fs0-fs11    | float saved registers |
| ft0-ft11    | float temporaries     |

both serve as call arguments and return values of functions. The remaining 'an' registers ('a2'-'a7') should be primarily used as function call arguments. How functions are called using registers will be explained in Section 5.2.3. The next row contains the *saved* registers 's1'-'s11'. These registers are meant to be preserved across function calls, meaning a called function must not overwrite them. What the previously explained registers have in common

<sup>&</sup>lt;sup>7</sup>Short for: "instruction set architecture", it specifies an abstract model of a CPU.

is that they all hold integer values. Depending on the used RISC-V extensions, all registers, including floating-point registers, either hold 32 or 64 bits of information. The float registers are able to hold floating-point numbers according to the *IEEE 754–2008* standard [WA19, Chapter 11]. Just like their integer counterparts, the floating-point registers 'fa0' and 'fa1' are used as function call arguments and as return values. However, the other 'fan' registers ('fa2'-'fa7') are meant to be used as float function arguments. Just like the 'sn' registers, the 'fs0'-'fs11' registers are usually preserved across function calls. Lastly, the 'ft0'-'ft11' registers can be used as temporary registers for floating-point numbers. It is apparent that the floating-point registers are provisioned very similarly to the integer registers. Therefore, a compiler targeting the architecture can utilize roughly the same principles, regardless of the data-type [PW17, pp. 18f, p. 34], [WA19, p. 155].

Now, it has become apparent that RISC-V includes many registers grouped into semantic categories. Every category is meant to be used in the specified manner, however, these groups are mostly only a suggestion of how each register should be used.

### 5.2.2. Memory Access Through the Stack



Figure 5.4 – Stack layout of the RISC-V architecture TODO: cite.

Figure 5.4 shows the stack layout used by RISC-V. Stack frames are enclosed by the 'fp' and 'sp' register. Here, 'sp' always points to the lowest address in- $^{8n}$  (fg)d of the current stack frame while 'fp' points to 16 by tes<sup>8</sup> above the top of the current frame. The available space for functions therefore ranges from '-24(fp)' (sp)'. It is used for storing variables, following the -8(fpalignment rules that were explained in Section 5.1.4. he space between ' $\theta(fp)$ ' and '-24(fp)', the frame pointer of the previous stack frame and the return ad--24(fp) and '-16(fp)', and '-16(fp)', respectively. If stack arguments were present, they o(sp)wolld be placed upwards of 'fp'. Each stack argument always uses eight bytes of memory, regardless of the actual size of its content. Thus, the  $n^{\text{th}}$  stack argument would be placed at the position 8n(fp). To keep track of variable locations on the stack, the compiler maps a variable's name to a memory location. To be precise, the compiler contains a 'HashMap' which associates a variable's name with its 'fp' offset. When the

variable is referenced at a later point, the compiler performs a simple lookup of the variable's memory address.

# 5.2.3. Calling Convention

Just like previously explained, most architectures provide a calling convention which dictates how low-level function calls should be managed. For most architectures, the calling convention is part of the ISA's official specification. In the case of RISC-V, the calling convention consists of four main steps and is specified in a separate document [CCW22].

The first step of calling a function involves placing the arguments in a place where the function can access them. For RISC-V, this involves placing the arguments into certain registers. Like described in Table 5.1, only some registers are used for call arguments. For integer arguments, the first eight arguments are placed in the registers 'a0'-'a7' For instance,

<sup>&</sup>lt;sup>8</sup>Assuming the **TODO: RV64I** extension is used.

the first two arguments of the rush function call 'foo(40, 2, 3.14)' would be placed in the registers 'a0' and 'a1', respectively. However, the third argument is a floating-point number and can therefore not be placed inside an integer register. Instead, the first floating-point argument register 'fa0' is used. In this case, all arguments can be held in registers and spilling would not be required.

In case the function accepted nine or more integer arguments, all further integer arguments starting with the ninth one would have to be spilled onto the stack. Here, the successive registers 'a0'-'a7' would contain the first eight integer arguments of the called function. However, the argument at position nine is spilled onto the stack since there are no registers left which could contain the additional argument [CCW22, p. 8].

In the second step, the underlying procedure call is made using a specialized instruction. In RISC-V assembly, one typically uses the 'call' pseudoinstruction<sup>9</sup>. The instruction uses the name of its target label as its operand, resulting in labels being callable, just like functions. Internally, the instruction first saves the address of the following instruction in the 'ra' register, and then performs an unconditional jump [PW17, p. 22].

The third step takes place in the function's prologue. Here, the previous frame pointer and return address are saved on the stack, and space for storing local variables is allocated by decrementing the stack pointer. Following the prologue, the function's body is executed. After it has also been evaluated, the epilogue restores the changes made by the prologue. The previous values for both the frame pointer and the return address registers are taken from the stack, and the stack pointer is incremented back to its original value. The 'ret' instruction, which jumps to the address stored in 'ra', is finally used to return control back to the caller.

In case a function returns a value, it must be mad accessible to the caller. For integer types, the first return value of a function is placed in the register 'a0' while floating-point numbers are placed in the register 'fa0'. This way, the caller code can obtain a function's return value by accessing the 'a0' and 'fa0' registers respectively. If a function does not return a value, these steps are just omitted. It is to be mentioned that char and bool values are also placed inside the 'a0' register since these types are internally represented by integers.

## 5.2.4. The Core Library

As outlined in the section about the linker on page 55, a program might use functionality provided by external libraries. In case of the rush RISC-V compiler, external functions are used for character-arithmetic, the mathematical power operator, and the system exit call. Character-arithmetic requires external functionality since the bounds of the ASCII range should be regarded while performing operations. For instance, in rush, calculating ''a'+'a'' results in '66' because '194' (97 + 97) is larger than the maximum allowed value '127'. If the result is outside the valid range [0; 127], only the least significant seven bits are considered. This way, the ASCII range is always valid.

Since these concepts must introduce additional logic, the compiler should not be emitted their instructions every time they are used. In that case, the repeated emission of redundant instructions would result in enlarged and unnecessary complex output code. In order to mitigate these issues, the compiler simply inserts call-instructions referencing the appropriate external function. External functions are called just like any other function; however, their definition is not found in the same assembly file. As described previously, resolving these external calls is later handled by the linker. This target-specific libary code is later refered to by the term *corelib*.

<sup>&</sup>lt;sup>9</sup>A macro generating one or more instructions from one pseudoinstruction. This way, the actual count of ISA instructions remains low while convenience features can be used in assembly [Dan05b, p. 68].

For instance, a function for mathematical power operations is implemented in the corelib. Therefore, the compiler can emit a procedure call to this method every time rush's '\*\*' operator is used in the source program. For this project, the entire corelib is written in RISC-V assembly. However, it is also reasonable to implement a corelib using a high-level language like C. Since the corelib's functions are specified in separate files, they are packaged into an archive file which is later used by the linker. The assembly code in Listing 5.5 shows the implementation of the 'exit' subroutine in the RISC-V corelib. The task of this subroutine is to invoke a specific functionality of the Linux kernel by performing a system call.

```
s global exit

exit:

li a7, 93 # syscall type is `exit`

ecall # exit code is already in `a0`
```

**Listing 5.5** – The assembly implementation of the 'exit' subroutine.

A system call (often abbreviated to syscall) is an invocation of a function of operating system's kernel. In Linux, more than 90 percent of the available system calls are implemented on all architectures [Lov13, p. 3]. One common system call is 'exit'. It terminates the current process and performs various cleanup steps. Its only parameter is the exit code which can be checked by the shell or other programs [Lov13, p. 148]. In RISC-V Linux, the syscall id of 'exit' is '93' [Tor91]. In line 8 of Listing 5.5, the 'exit' label is declared as global using the '.global' directive. Next, the 'exit' label is declared in line 10. In line 11, the 'li' instruction is used in order to load '93' into the register 'a7'. On RISC-V Linux systems, the content of the 'a7' register specifies the type of syscall to be performed. In line 12, the 'ecall' instruction is then used to perform the underlying syscall [PW17, p. 23]

## 5.2.5. RISC-V Assembly

```
let mut m = 42;
1
2
    fn main() {
3
4
         m += 1;
5
         foo(m);
6
         return;
    }
7
8
    fn foo(n: int) {
9
10
         exit(n);
11
```

**Listing 5.6** – Example rush program containing two functions.

The Listing 5.6 shows a rush program containing two functions and a global variable. In line 1 of the rush program, the mutable global variable 'm' is defined with an initial value of '42'. In line 4, in the 'main' function, 'm' is incremented by '1'. Next, in line 5, the 'foo' function is called using 'm' as the only call argument. In line 6, a 'return' statement is used to terminate the 'main' function explicitly. The body of the 'foo' function only contains a call to the 'exit' function and therefore causes the program to exit with the code stored in 'n'. When executed, the exit code of the displayed program will be '43'. The code in Listing 5.7 shows the output assembly generated from this program by the rush RISC-V compiler<sup>10</sup>. The assembler code of the 'foo' function is intentionally omitted.

In line 1, the '.global' assembler directive is used to declare the global symbol '\_start' [PW17, p. 36]. On most architectures, the '\_start' label indicates a program's entry point, marking the first instruction to be executed [Zhi17, p. 19]. In line 6, the 'call' instruction is used to call the 'main..main' function. Here, the

 $<sup>^{10}</sup>$ Generated using rush (Git commit '93ee068').

already familiar 'main' function is prepended by the prefix 'main..'. Since this rush compiler implements name mangling<sup>11</sup>, every function defined in a rush program will contain this prefix.

In line 7, the 'li' ("load immediate") instruction is used to load the constant integer '0' into the register 'a0' [PW17, Reference Card]. As explained in the previous section, the register 'a0' is used for the first integer call argument. In line 8, the 'exit' function from rush's corelib is called, using '0' stored in 'a0' as the exit code. Therefore, the two instructions in the lines 7–8 are responsible for terminating the program with the exit code '0'. They are always inserted at the end of the '\_start' label in order to terminate the program appropriately in case the rush code does not call 'exit' on its own<sup>12</sup>.

Due to the function call in line 6, the 'main..main' label in line 10 should now be considered. In line 11, the first line of the 'main' function, a comment indicates the beginning of the function's prologue. The 'addi' ("add immediate") instruction in line 12 adds '-16' to the value stored in 'sp', thus allocating memory. Here, an addition instruction is used even though subtraction is required. In RISC-V, the 'addi' instruction's operands are one destination register, one source register and one immediate value. Since this immediate value can be negative, an additional instruction for immediate subtraction would be redundant [PW17, Reference Card]. In this case, 16 bytes are allocated since two 8-byte values are stored on the stack in the lines 13–14.

The comment in line 17 indicates the start of the function's body. First, the previously explained 'li' ("load immediate") instruction in line 18 places a constant '1' in the register 'a0'. Next, the 'ld' ("load doubleword" instruction in line 19 is used in order to load the value of the global variable 'm' into the register 'a1' [PW17, Reference Card]. Global variables, like 'm', are saved under the '.rodata' or the '.data' section, depending on their mutability. Since 'm' is declared as mutable, the compiler saves it under the

```
.global _start
                                       1
                                       2
.section .text
                                       3
                                       4
_start:
                                       5
    call main..main
                                       6
                                       7
    li a0, 0
    call exit
                                       8
                                       9
main..main:
                                       10
    # begin prologue
                                       11
    addi sp, sp, -16
                                       12
    sd fp, 8(sp)
                                       13
    sd ra, O(sp)
                                       14
    addi fp, sp, 16
                                       15
    # end prologue
                                       16
    # begin body
                                       17
    li a0, 1
                                       18
    ld a1, m
                        # m
                                       19
    add a2, a1, a0
                                       20
    sd a2, m, t6
                                       21
    ld a0, m
                                       22
    call main..foo
                                       23
                        # return
    j epiloque_0
                                       24
      end body
                                       25
                                       26
epiloque_0:
                                       27
    ld fp, 8(sp)
                                       28
    ld ra, 0(sp)
                                       29
    addi sp, sp, 16
                                       30
                                       31
.section .data
                                       53
                                       54
                                       55
    .dword 0x000000000000002a
                                       56
```

Listing 5.7 – Compiler output from the rush program in Listing 5.6.

= 42

'.rodata' section. The start of the '.rodata' section is marked by the '.section' assembler directive found in line 53. Here, a label called 'm' is defined. In this label, the '.dword' ("double word") directive is used to define the global initializer value of the variable. This directive stores 64 bits of data [PW17, p. 39]. The initializer value of the global variable is '42' and is represented as '0x2a' using hexadecimal in the assembly code. In line 56, the trailing comment shows the decimal representation of the number.

At this point, the register 'a0' contains '1', while 'a1' contains '42'. In line 20, the 'add' instruction is used to save the sum of 'a0' and 'a1' in the register 'a2'. Next, in line 21,

<sup>&</sup>lt;sup>11</sup>Compilers often mangle names in order to guarantee a unique name for every function [Lev00, pp. 119-120].

<sup>&</sup>lt;sup>12</sup>In order to prevent runtime errors, programs should always be exited properly.

<sup>&</sup>lt;sup>13</sup>Here, "doubleword" indicates that 64 bits of data shall be loaded.

the 'sd' ("store doubleword") instruction saves the value of the register 'a2' at the memory location of the global variable 'm', meaning that 'm' is updated to reflect its new value '43'. It now becomes apparent that these instructions are responsible for the add-assign expression in line 4 of the rush program. Another interesting observation is that the last operand of the 'sd' instruction specifies the temporary register 't6'. The instruction uses this register for saving temporary data during the process of saving data in 'm' [PW17, Reference Card].

The final instruction of the 'main' function's body is the 'j' ("jump") instruction in line 24. This instruction will cause the CPU to jump to the address of the specified label, here 'epilogue\_0' would be targeted [PW17, p. 17]. The rush compiler emits the 'call' and 'ret' instructions for jumps between functions, while the 'j' instruction is used for jumps between blocks of the current function. In the lines 27–31, inside the epilogue, the aforementioned task of restoring register values and returning are performed.

## **5.2.6.** Supporting Pointers

```
1 fn main() {
2    let mut a = 42;
3    let to_a = &a;
4    exit(*to_a);
5 }
```

**Listing 5.8** – Example rush program containing a pointer.

```
main..main:
10
         # ...
        li a0, 42
18
        sd a0, -24(fp)
                             # let \alpha = \alpha 0
19
        addi a0, fp, -24
                            # &a
20
                             # let to_a = a0
21
         sd a0, -32(fp)
        ld a0, -32(fp)
22
                             # to_a
        ld a0, 0(a0)
23
                             # deref
24
         call exit
```

**Listing 5.9** – RISC-V assembly output generated from Listing 5.8.

Listing 5.8 shows a rush program in which a variable is referenced to create a pointer which is then dereferenced. In line 3, the previously created variable 'a' is referenced in order to create the variable 'to\_a' which saves a pointer to 'a'. Listing 5.9 includes the relevant part of the compiler output which represents this rush program<sup>14</sup>. In line 18, the integer value '42' is placed in the register 'a0'. Next, 'a0' is saved on the stack at '-24(fp)' using the 'sd' instruction. As the comment suggests, the instruction in line 20 is used in order to reference 'a'. It does so by subtracting '24' from the value stored in the 'fp' register. This results in the absolute address of 'a' at runtime because the syntax '-24(fp)' describes the address stored in 'fp' offset by '-24' The calculation has to be performed at runtime since the value of 'fp' cannot be determined at compile time. In line 21, the 'a0' register, which contains this memory address, is also saved on the stack. Therefore, the 'a' variable is saved at '-24(fp)' and 'to\_a' is saved at '-32(fp)'.

In order to access the value of 'a', 'to\_a' is dereferenced in line 4 of the rush program. For this, the memory address stored in the variable 'to\_a' first needs to be loaded from the stack, which is accomplished by the 'ld' instruction in line 22. Next, in line 23, another 'ld' instruction is used in order to load the value of the variable at the address stored in 'a0'. Here, '0(a0)' is used instead of 'n(fp)' since the address stored in 'a0' offset by '0' shall be used. Because 'a0' contains the memory address of the 'a' variable, the instruction loads '42' into the 'a0' register.

<sup>&</sup>lt;sup>14</sup>Generated using rush (Git commit '93ee068').

#### 5.2.7. Implementation

Just like the other compilers presented in this paper, this one also traverses the annotated AST using post-order. Unlike the LLVM or Wasm compilers, this one emits RISC-V assembly files which can then be assembled by an assembler supporting RISC-V.

#### **Struct Fields**

Before any complex code samples can be considered, important struct fields of the compiler need to be explained. The Rust code in Listing 5.10 shows the 'Compiler' struct definition containing those fields.

```
crates/rush-compiler-risc-v/src/compiler.rs
   pub struct Compiler<'tree> {
11
        // ...
        pub(crate) blocks: Vec<Block<'tree>>,
15
        pub(crate) curr_block: usize,
19
        /// Data section for storing global variables.
20
        pub(crate) data_section: Vec<DataObj>,
21
        /// Read-only data section for storing constant values (like floats).
22
        pub(crate) rodata_section: Vec<DataObj>,
23
        /// Holds metadata about the current function
24
        pub(crate) curr_fn: Option<Function>,
25
26
        /// Holds metadata about the current loop(s)
        pub(crate) loops: Vec<Loop>,
27
        /// The last element is the current scope.
28
        pub(crate) scopes: Vec<HashMap<&'tree str, Variable>>,
29
        /// Holds the global variables of the program.
30
        pub(crate) globals: HashMap<&'tree str, Variable>,
31
        /// Specifies all currentlu used registers.
32
        pub(crate) used_registers: Vec<(Register, Size)>,
33
   }
34
```

**Listing 5.10** - Fields of the RISC-V 'Compiler' struct.

The 'blocks' field in line 15 holds a vector containing values of the type 'Block'. That type provides an abstraction for representing a label followed by instructions. The Listing 5.11 shows parts of the Rust code which declares the 'Instruction' enumeration.

Listing 5.11 - The 'Instruction' enum in the RISC-V compiler.

Although the listing only shows a few of the implemented instructions, this enumeration contains all instructions which the compiler might need at a later point. Depending on the type of instruction, the corresponding variant includes fields which define its operands. For instance, the 'Li' variant in line 70 represents the 'Li' instruction in assembly. Since this instruction loads the integer value specified by the second operand into the register specified by the first operand, it contains a field for a value of the type 'IntRegister' and a field for a 64-bit signed integer. Listing 5.12 shows parts of the 'IntRegister' enumeration. It specifies

all registers which RISC-V provides, assuming that the required extensions are considered. An equivalent enumeration called 'FloatRegister' is defined for the float registers of this RISC-V configuration.

**Listing 5.12** – The 'IntRegister' enumeration in the RISC-V compiler.

The 'curr\_block' field, defined in line 19 of Listing 5.10, saves the index of the block which is currently being inserted to. Next, the 'data\_section' and 'rodata\_section' fields are declared in the lines 21–22. They represent the ELF sections '.data' and '.rodata', respectively. Here, each vector might contain global variables, which are represented by the type 'DataObj'. Each 'DataObj' consists of a label, saved as a string, and the initial data. The data itself is represented by another enumeration holding either a 64-bit float, a 64-bit integer, or an 8-bit integer.

In line 25, the 'curr\_fn' field is declared. It saves a value of the type 'Function', which contains a counter for the bytes of stack space allocated by the current function. Furthermore, it contains the label of the current function's epilogue block. The former is incremented every time a variable declaration is encountered. This is required in order to allocate the correct amount of stack memory during a function's prologue.

Just like the other compilers, this one also features a 'loops' field which saves important labels of the current loop being compiled. It stores a vector of 'Loop' structs in order to support nested loops. Each 'Loop' struct saves the label of the loop's head and the label of the block which follows after the loop's body.

The 'scopes' field stores a list of nested scopes, the last element being the current one. Each scope is represented by a 'HashMap' associating variable names with important metadata. This metadata includes the variable's type and its memory offset from 'fp'. TODO: is this already improved? However, these 'HashMaps' are not used for saving global variables. Instead, the 'globals' field in line 31 is used. Just like the HashMaps in the 'scopes' field, this map also associates a variable's name to a value of the type 'Variable'. This time however, each variable contains the data label under which the global variable was declared. Lastly, the 'used\_registers' field is declared in line 33. It plays a vital role in the compiler's register allocation algorithm.

#### **Data Flow and Register Allocation**

The LLVM compiler represents runtime values by passing virtual registers, represented by Inkwell's 'BasicValueEnum', internally. Similarly, this compiler also passes registers in order to represent the data flow of the program. Unlike in LLVM, there is only a finite amount of registers available. Therefore, this compiler also manages register allocation so that programs can be represented using this limited number of registers. In order to understand the implementation, Listing 5.13 and Listing 5.13 are to be considered. The former contains a rush program which adds two integer variables together and uses the result as its exit code. The latter shows a part of the assembly output representing the logic in the 'main' function.

```
li a0, 1
                                                  18
                                                      sd a0, -24(fp)
                                                                            # let \alpha = \alpha 0
                                                  19
                                                      li a0, 2
                                                  20
   fn main() {
1
                                                  21
                                                      sd a0, -32(fp)
                                                                            # let b = a0
        let a = 1;
2
                                                      ld a0, -24(fp)
                                                                            # α
        let b = 2;
3
                                                      ld a1, -32(fp)
                                                                            # b
                                                  23
        exit(a + b);
4
                                                      add a0, a0, a1
                                                  24
5
                                                  25
                                                      call exit
```

**Listing 5.13** – A rush program calculating the sum of two integers.

**Listing 5.14** – Assembly output of the rush program in Listing 5.13.

In the lines 2–3 of the rush program, the integer variables 'a' and 'b' are declared. In line 4, the 'exit' function is invoked, using the sum of these variables as the call argument. In line 22 of the assembly output, the runtime value of the variable 'a' is loaded into the register 'a0'. Next, the same operation is performed for 'b'. However, this time, the instruction writes its result into the 'a1' register instead of the 'a0' register. This is because the previously loaded value in 'a0' should not be overwritten. Unlike the register allocation algorithm of a production-ready compiler, like LLVM, this one only aims to use registers without causing conflicts. All variables are saved on the stack in order to keep as most registers as possible free for temporary use in calculations and operations like this one. Since this algorithm only performs rudimentary register allocation, its implementation is also significantly simpler.

The core principle of the register allocator is that each time a register is required, a responsible helper method returns the next free one to be used. The helper method for managing int registers is called 'get\_int\_reg'. The Rust code in Listing 5.15 shows this method. It returns the first available register from the register pool that containing either

```
crates/rush-compiler-risc-v/src/utils.rs
     pub(crate) fn get_int_reg(&self) -> IntRegister {
175
         for reg in INT_REGISTERS {
176
              if !self
177
                  .used_registers
178
179
                  .any(|(register, _)| register == &Register::Int(*reg))
180
             {
181
182
                  return *reg;
183
184
         panic!("out of integer registers")
185
     }
186
```

**Listing 5.15** – The 'get\_int\_reg' method of the RISC-V rush compiler.

int or float registers. Each register pool contains all possible registers for either integer or floating-point numbers. In line 175, in the method's signature one can see that the it returns a value of type 'IntRegister'. The for-loop in line 176 is used to iterate over the entire 'INT\_REGISTERS' array. This array represents the pool containing integer registers. For each register in that pool, the if-expression in the lines 177–180 checks whether the current register 'reg' is contained in the 'used\_registers' vector. If this is not the case, the current register is still unused and can therefore be returned for usage. Since the 'return' statement causes the function to return immediate, the loop only iterates until a free register has been found.

Due to the fact that the register allocator only uses as few registers as required, the compiler should not run out of registers, therefore making the 'panic!' macro call in line 185



**Figure 5.5** – Simplified integer register pool of the RISC-V rush compiler.

unreachable. However, one can see that this method does not mark the returned register as used. This is because doing so is only required in some places, where overwriting a register would introduce a bug in the output program. Therefore, if this method was called repeatedly, it would always yield the same register. In order to mark a register as used, it is appended to the 'used\_registers' vector. For this, a helper method called 'use\_reg', which only performs this one operation, is implemented. When a register is no longer used, and can be overwritten again, it must be removed from the 'used\_registers' vector. For this, another simple helper method, called 'release\_reg', is implemented.

Figure 5.5 shows an abstract representation of the compiler's integer register pool. Only the first section of the pool, which contains the 'an' registers, can be seen. The gray cells specify registers that are currently in use by the compiler. The arrow points to 'a2', which is the first free register, meaning the one which follows 'a1', the last used register. Therefore, if the 'get\_int\_reg' method was called when the state of the registers is equivalent to the one displayed in Figure 5.5, it would return the register 'a2'.

```
crates/rush-compiler-risc-v/src/compiler.rs
596
         AnalyzedExpression::Bool(value) => {
597
             let dest_reg = self.get_int_reg();
598
             self.insert(Instruction::Li(dest_reg, value as i64));
599
             Some(Register::Int(dest_reg))
600
601
         AnalyzedExpression::Char(value) => {
602
662
     if node.op == PrefixOp::Ref {
663
```

**Listing 5.16** – Parts of the 'expression' method in the RISC-V rush compiler.

In order to understand how expressions are compiled, Listing 5.16, which shows parts of the 'expression' method, is to be considered. This method is responsible for compiling expressions. As the method's signature suggests, it takes an analyzed expression and returns an 'Option<Register>', which is 'None' if the expression results in the '!' or '()' types. Otherwise, the method returns the register in which the result of its computation will be contained at runtime. The code in the lines 598–602 is executed if the expression is an integer literal. First, a target register is retrieved by calling the previously shown 'get\_int\_reg' method. Now, the 'dest\_reg' variable holds the register which will contain the result of the expression. Next, an 'li' instruction is inserted which loads the integer contained in the variable 'value' into the register specified by 'dest\_reg'. Finally, this register is then returned. For more complex expressions, the specified methods are called, just like in the other compilers.

However, the question of *when* the compiler marks certain registers as used still remains. For this, the compilation of the rush expression 'n+m' is to be considered. The Rust code in Listing 5.17 shows a part of the 'infix\_expr' method of the rush compiler. It consumes an annotated tree-node representing an analyzed infix-expression. In line 853, the left-hand side expression is compiled, and its result register is saved in the variable 'lhs\_reg'. Next, in line 854, the previously discussed 'use\_reg' method is used in order to mark the register

```
fn infix_expr(&mut self, node: AnalyzedInfixExpr<'tree>) -> Option<Register> {
755
         match (node.lhs, node.rhs, node.op) {
791
             (lhs, rhs, op) => {
841
                 let lhs_type = lhs.result_type();
842
843
                 // mark LHS register as used and compile RHS
844
                 let lhs_reg = self.expression(lhs)?;
845
                 self.use_reg(lhs_reg, Size::from(lhs_type));
846
847
                 let rhs_reg = self.expression(rhs)?;
848
849
                 // set LHS register as unused
850
851
                 self.release_reg(lhs_reg);
852
                 let res = self.infix_helper(lhs_reg, rhs_reg, op, lhs_type);
853
854
                 Some(res)
855
             }
856
         }
857
    }
858
```

Listing 5.17 - Parts of the 'infix\_expr' method in the RISC-V rush compiler.

specified by 'lns\_reg' as used. The second argument to the method specifies the size of the data which the register holds, 64 bits for 'int', 8 bits for 'char' and 'bool'. This information is also saved in the 'used\_registers' vector and is accessed in case used registers need to be spilled.

In line 856, the right-hand side expression is compiled and its result register is saved in the variable 'rhs\_reg'. After this, 'lhs\_reg' is marked as unused again. Finally, in line 861, the 'infix\_helper' method is called in order to insert the instructions for the actual calculation. As discussed previously, calling the 'get\_int\_reg' or 'get\_float\_reg' methods repeatedly, without marking registers as used, results in same registers every time. In this scenario, this would create an issue since both the left-hand side and the right-hand side expressions would use the identical register. In order to mitigate this issue, the register of the left-hand side is marked as used before compilation of the right-hand side. Now, the right-hand side will respect that the register returned by the left-hand side is currently in use and will instead use the next free one.

Next, the implementation of the 'infix\_helper' helper method shall be considered. This method exists because translation of the underlying mathematical or logical operation is required during compilation of both infix-expressions and certain assign-expressions, 'a+=1' for instance. Listing 5.18 shows parts of this method. It takes the registers of the left- and right-hand side, the operator, and a type. As the registers do not specify one concrete type, it has to be provided additionally. The analyzer guarantees equal types on both sides, hence only one type parameter is needed. In the lines 874–875, an integer and a floating-point destination register are created, even though only one of them is used later. However, both are defined beforehand in order to avoid code duplication. In the lines 878–881, the code responsible for inserting an integer addition instruction can be seen. For this, the 'insert' method is used. It inserts a new instruction into the current block. Here, the 'add' instruction is used with the integer destination register specified in 'dest\_regi' and the left- and right-hand side registers as operands. Since the 'add' instruction demands it, the operand registers are converted to integer registers via the 'into' method. In the lines 987–990, the code for inserting a float subtraction instruction can be seen. Since floating-point operations require

Listing 5.18 - Parts of the 'infix\_helper' method of the RISC-V rush compiler.

self.insert(Instruction::Fsub(dest\_regf, lhs.into(), rhs.into()));

different instructions from the ones used for integers, the additional type information is passed to this method. Here, the 'fsub' instruction is used for subtraction. In this case, the variable 'dest\_regf' is used as the first operand because the resulting value is a float, and these 'into' calls convert the registers into float ones.

#### **Functions**

//

}

dest\_regf.into()

978 979

980

990

991

}

However, before the infix-expression in Listing 5.13 on page 5.13 is compiled, the compiler iteretes over the functions of the program. In this case, only the 'main' function is present. Listing 5.19 shows the 'define\_main\_fn' method of the rush RISC-V compiler. This method is only responsible for compilation of the 'main' function. All other rush functions are compiled by the method 'function\_definition'. However, these two methods only mainly differ in two ways: Firstly, the latter contains code which handles function parameters. Secondly, the special instructions for calling to the 'main' and 'exit' functions are not required.

The method takes a block, that is, a list of statements as its input. In the lines 124–126, the '\_start' label is created and exported so that the linker will later be able to refer to it. Next, in the lines 128–129, a new block with the 'main..main' label is created. As explained previously, this block represents the start of the body of the 'main' function. Next, the line 132 inserts a 'call' instruction, which is responsible for calling the 'main' function at the start of program execution, after the '\_start' label. In the lines 134–135, instructions for calling the 'exit' function using '0' as the exit code are inserted. Next, in the lines 137–139, a new epilogue label is generated. Then, the 'curr\_fn' field is updated so that it contains this newly created epilogue label. However, this label is not directly added to the 'blocks' vector since it should appear after function body.

Next, the 'main' function's body is compiled. For this, the current insertion position is first updated to the end of the previously created 'main..main' block by calling the method 'insert\_at'. Following that, a new variable scope is added for the function body. Then, the function body's content is compiled by calling the 'function\_body' method. After the body's compilation, the previously added scope is removed again.

In line 148, the 'prologue' method returns the instructions for the prologue. It is only inserted *after* the function body was compiled, even though it will execute *before* the function during runtime. This is because the prologue needs to allocate a certain amount of stack

Listing 5.19 - The 'define\_main\_fn' method of the RISC-V rush compiler.

self.blocks.push(Block::new(Rc::clone(&epilogue\_label)));

152

153154

155

}

self.epiloque()

space, which the compiler can only determine after the code for the body has been generated. Next, in line 149, the insertion position is moved back to the end of the 'main..main' label. The lines 150–151 are then responsible for prepending the generated prologue instructions to the instructions representing the function body. This is achieved by first appending the contents of the current block to the end of the prologue instructions. Then, the current block's instructions are overwritten with the result. In the lines 153–154, the code responsible for inserting the trailing epilogue code can be seen. For this, the previously created epilogue label is now appended to the 'blocks' vector, and the 'epilogue' method of the compiler is called.

Listing 5.20 shows the 'epilogue' method of the rush RISC-V compiler. This method generates the instructions for a function's epilogue. First, in the lines 67–68, the insertion position of the compiler is set to the epilogue block. The epilogue must exist in a separate label in order to be used when 'return' statements are encontered. The instruction insertion in line 70 is responsible for an 'ld' instruction. Here, the 'ld' instruction is used to restore the previously saved 'fp' register. In line 75, another 'ld' instruction for restoring the 'ra' register is inserted. In line 80, an 'addi' instruction is inserted in order to free the memory used for the function's stack frame. Finally, a 'ret' instruction is inserted which is responsible for transferring control back to the caller.

```
crates/rush-compiler-risc-v/src/call.rs _
   pub(crate) fn epilogue(&mut self) {
66
        let epilogue_label = Rc::clone(&self.curr_fn().epilogue_label);
67
68
        self.insert_at(&epilogue_label);
                    `fp` from the stack
69
        // restore
        self.insert(Instruction::Ld(
71
            IntRegister::Fp,
            Pointer::Register(IntRegister::Sp, self.curr_fn().stack_allocs + 8),
72
        ));
73
        // restore `ra` from the stack
74
        self.insert(Instruction::Ld(
75
            IntRegister::Ra,
76
            Pointer::Register(IntRegister::Sp, self.curr_fn().stack_allocs),
77
        ));
78
        // deallocate stack space
79
        self.insert(Instruction::Addi(
80
81
            IntRegister::Sp,
            IntRegister::Sp,
82
            self.curr_fn().stack_allocs + BASE_STACK_ALLOCATIONS,
83
        ));
84
        // return control back to caller
85
        self.insert(Instruction::Ret);
86
   }
87
```

Listing 5.20 - The 'epilogue' method of the RISC-V rush compiler.

#### Let Statements

In the lines 2–3 of the rush program in Listing 5.13 on page 73, let statements are used in order to save values on the stack. The method responsible for compiling these statements is called 'let\_statement'. It saves the initializing expression on the stack and saves the resulting pointer in the current scope. The main work is accomplished by the 'save\_expr\_on\_stack' method. The Rust code in Listing 5.21 displays parts of this method. This method takes

```
crates/rush-compiler-risc-v/src/compiler.rs
     fn save_expr_on_stack(/* ... */) -> Option<Pointer> {
557
         let type_ = node.result_type();
562
         let reg = self.expression(node)?;
563
         let comment = format!("{comment_prefix} = {reg}").into();
564
         let offset = self.get_offset(Size::from(type_));
565
566
         match reg {
567
             Register::Float(reg) => self.insert_with_comment(
580
                 Instruction::Fsd(reg, Pointer::Register(IntRegister::Fp, offset)),
581
582
                 comment,
583
             ),
         };
584
585
         Some(Pointer::Register(IntRegister::Fp, offset))
586
    }
587
```

Listing 5.21 - The 'save\_expr\_on\_stack' method in the RISC-V rush compiler.

an expression in order to compile it and save the resulting value on the stack. The method returns an 'Option<Pointer>', where 'None' again represents the '!' and '()' types. A 'Pointer' stores a relative memory address and an arbitrary base register, 'fp' in this case.

In line 569, the expression is compiled, and the resulting register is saved in the variable 'reg'. In line 571, the 'get\_offset' method is used in order to calculate the frame pointer offset of a variable. Since types like 'char' or 'bool' require less memory compared to 'int',

the size of the expressions's data type must also be available to this method. Now, the 'offset' variable contains the frame pointer offset at which the variable should be saved. Since the 'get\_offset' method is used every time values need to be saved on the stack, it also increments the 'stack\_allocs' variable, so that the prologue and epilogue are aware of this stack allocation.

The 'match' block in line 573 executes different code based on the register type which the expression yielded. In the lines 586–589, the code responsible for float registers can be seen. Here, the 'fsd' ("float store double") instruction is used in order to save the resulting register at the memory position relative to 'fp' which is specified in the variable 'offset'. The code for integer registers is similar, but was omitted since it also differentiates between 8-bit and 64-bit data. In line 592, the 'Pointer' abstraction representing the memory location of the expression is returned. Here, it is apparent that this variant of a 'Pointer' only saves a register and an offset to this register. In this case, 'fp' is specified as the register and the value of 'offset' is used as the offset.

#### **Function Calls and Returns**

For compiling call-expressions, also the ones of builtin functions, the compiler uses the 'call\_expr' method. The Rust code in Listing 5.22 shows the first part of this method.

```
crates/rush-compiler-risc-v/src/call.rs
    pub(crate) fn call_expr(&mut self, node: AnalyzedCallExpr<'tree>) ->
92
        Option<Register> {
         // before the function is called, all currently used registers are saved
93
         let regs_on_stack = self
 94
95
             .used_registers
96
             .clone()
97
             .iter()
98
             .map(|(reg, size)| (*reg, self.spill_reg(*reg, *size), *size))
             .collect();
99
100
         // save the previous state of the used registers
101
         let used_regs_prev = mem::take(&mut self.used_registers);
102
103
         // specifies the argument position of the specified register type
104
         // type dependent: (`a0` -> `int_cnt = 0`), (`fa0` -> `float_cnt = 0`)
105
         let mut float_cnt = 0;
106
         let mut int_cnt = 0;
107
         // calculate the total byte size of spilled params
108
109
         let mut spill_param_size = 0;
         // needed for freeing registers later
110
         let mut param_regs = vec![];
111
         // specifies the count of the current register spill
112
         let mut spill_cnt = 0;
113
         // ...
```

**Listing 5.22** – The 'call\_expr' method in the RISC-V rush compiler.

Due to the aforementioned low-level calling conventions, its implementation has proven to be very complex and demanding. The part of code shown here is only preparing the compilation of the function call. The method consumes an analyzed call-expression and optionally returns a register representing the call return value. For instance, if the 'exit' function was called, this method would return no register as the 'exit' function results in the '!' type.

In the lines 93–99, all registers which are used at the time of the function call are spilled and then marked as unused. This is required since the register allocator of this compiler is so simple that it only considers one function at the time. In production-ready compiler systems, a register allocation algorithm could also work interprocedurally. It becomes apparent that

saving all used registers does not produce very efficient output code. However, throughout the extensive testing conducted on the rush backends, this approach has presented itself as simple and reliable<sup>15</sup>. In the lines 93–99, the register spilling is performed by iterating over the currently used registers and calling the 'spill\_reg' method for each register. That method is structured very similarly to the aforementioned 'save\_expr\_on\_stack' method. After the register spilling, the contents of the 'used\_registers' vector are moved into the local 'used\_regs\_prev' variable. Therefore, after the assignment, the 'used\_registers' vector will be empty.

```
crates/rush-compiler-risc-v/src/call.rs
         // ...
         for arg in node.args {
115
             match arg.result_type() {
116
                  Type::Unit | Type::Never | Type::Unknown => {
117
118
                      self.expression(arg);
119
                  Type::Float(0) => {
120
                      let res_reg = self.expression(arg).expect("type is float");
121
122
                      if let Some(reg) = FloatRegister::nth_param(float_cnt) {
123
124
                          param_regs.push(reg.to_reg());
                          self.use_reg(reg.to_reg(), Size::Dword);
125
                      } else {
126
                          // no more param registers: spilling required
127
                          self.insert_with_comment(
128
                               Instruction::Fsd(
129
130
                                   res_reg.into(),
                                   Pointer::Register(IntRegister::Sp, spill_cnt * 8),
131
132
                               format!("{} byte param spill", /* ... */).into(),
133
                          );
134
                          spill_cnt += 1;
135
                          spill_param_size += 8;
136
137
                      float_cnt += 1;
138
139
             }
160
         }
161
```

**Listing 5.22** – The 'call\_expr' method of the RISC-V rush compiler. (cont.)

In line 136, the 'spill\_param\_size' variable is incremented. It specifies the additional storage required for placing arguments on the stack. For instance, a call with nine integer arguments would need eight additional bytes of memory since the 9<sup>th</sup> argument would have to be placed on the stack. The optional placement of arguments on the stack is performed by the loop in line 115. It iterates over each argument expression of the call, and for each expression, the match-block in line 116 executes different code depending on the expression's result data type. For instance, expressions which yield in '()' or '!' do not require registers and can thus be compiled without further consideration. However, for float values, the code in the lines 121–136 is executed. Firstly, the argument's expression is compiled and the resulting register is saved in the variable 'res\_reg'. Next, the 'nth\_param' method is called on the 'FloatRegister' enumeration. This method associates an index of a parameter to a register. For instance, the index 2 represents the third parameter, which is represented by the register 'fa2'. For unsupported indices, the method returns 'None'. Therefore, the if-expression in line 123 of Listing 5.22 checks whether there still is a register which could

 $<sup>^{15}</sup>$ For each new commit, 41 integration tests are conducted on every backend

represent the current argument position. This position is saved in the 'float\_cnt' and 'int\_cnt' variables which are incremented whenever an argument of the respective type is encountered, in line 138 for instance. If there is no register left, the 'spill\_param\_size' variable is incremented by 8 bytes as floats require eight bytes of information. Otherwise, in case a register is available, the code in the lines 124-125 is executed. Here, 'res\_reg' is only marked as used and pushed in the 'param\_regs' vector. There is no code validating that the correct register for the argument position is returned. This check is redundant since all registers have been marked as unused prior to this block. Therefore, the resulting value of argument expressions will always be in the correct registers. However, if there is no register for the current argument, the value has to be spilled onto the stack. For this, a 'fsd' instruction (or 'sd' instruction for integers) is inserted in order to save the result of the current argument on the stack. As described in a previous section about the RISC-V calling convention, spilled arguments are placed on the stack in a way so that the first spilled argument is at the lowest memory position. For reference, this process is described in detail in Figure?? in Section 5.2.3 on page 66. In order to calculate this increasing offset easily, the 'spill\_cnt' variable was declared in line 113 of Listing 5.22 Every time an argument is spilled, this counter increases. In line 131 of Listing??, the code calculating the offset relative to 'sp' can be seen. Here, the value of 'spill\_cnt' is simply multiplied by 8 in order to obtain the correct offset. Therefore, the memory offset o is calculated like this:  $o = \text{count}_{\text{spilled}} \times 8$ . In line 169, the 'spill\_cnt' variable is incremented like described previously. Although just the code for floats is shown here, the algorithm for integer-based types only deviates slightly from the one for floats.

The Rust code in Listing 5.22 shows the last part of the method.

```
crates/rush-compiler-risc-v/src/call.rs
         self.curr_fn_mut().stack_allocs += spill_param_size;
163
164
         // perform function call
165
         let func_label = match node.func {
166
             "exit" => {
167
                 // mark the current block as terminated (avoid useless jumps)
168
169
                 self.curr_block_mut().is_terminated = true;
                 "exit".into()
170
171
             func => format!("main..{func}").into(),
172
173
         self.insert(Instruction::Call(func_label));
174
175
         // restore the old list of used registers
176
         self.used_registers = used_regs_prev;
177
178
         let res reg = match node.result type {
179
             Type::Float(0) => Some(FloatRegister::Fa0.to_reg()),
180
             Type::Int(_) | Type::Char(_) | Type::Bool(_) | Type::Float(_) => {
181
                 Some(IntRegister::A0.to_reg())
182
183
             Type::Unit | Type::Never | Type::Unknown => None,
184
         };
185
186
         // restore all caller saved registers again
187
         self.restore_regs_after_call(res_reg, regs_on_stack)
188
    }
189
```

**Listing 5.22** – The 'call\_expr' method of the RISC-V rush compiler. (cont.)

The code in the lines 165–173 is responsible for mangling the name of the called function

so that it includes the 'main..' prefix. However, if the 'exit' function is called, the name is not mangled. In line 174, the 'call' instruction responsible for performing the actual call is inserted. At this point, the arguments are in their correct places so that the called function will be able to access them. In line 177, the previously emptied 'used\_registers' vector is restored to its state before the call. This is done in order to allow the register allocation for the call arguments to happen independently of the rest of the compiled code. The code in the lines 179–185 is responsible for selecting an appropriate output register based on the function's return type, storing it in 'res\_reg'. For instance, functions which return 'float' values place their return value inside the register 'fa0', and for functions which return the '()' or '!' types, no register is selected and 'None' is returned. In line 188, the method 'restore\_regs\_after\_call' later reloads the old register values from the stack. However, that method also requires the return register of the called function as its first argument. The reasons why 'restore\_regs\_after\_call' has to access the return register is that this register might habe been in use before the call. However, restoring the register would result in the function's return value to be overwritten, therefore creating a bug. In order to mitigate this issue, the method will alter the result register by inserting a 'mv' instruction if necessary. Since the method returns the final output register which contains the call's result, it is used as the last expression of the block, thus making it the return value of the 'call\_expr' method.

The last thing to mention is how a called function returns a value. Whenever an explicit or implicit return in a function is encountered, the compiler checks whether the expression already saved its result in the appropriate return register. In case it did not, a 'mv' ("("move)) instruction is used to move it there. Afterwards, a jump to the function's epilogue is issued, causing the function to return.

#### Loops

Since assembly dialects does not provide high-level control-flow constructs like loops, a compiler targeting assembly has to generate code which reproduces the behaviour. Listing 5.23 shows a rush program containing a while-loop. Next to it, Listing 5.24 shows the part of the compiler output that represents the loop and the 'exit' call.

```
while_head_0:
                                               21
                                               22
                                                       # while condition
                                                        ld a0, -24(fp)
                                                                           # count
                                               23
                                                        li a1, 10
                                               24
                                                        slt a0, a0, a1
                                               25
                                               26
                                                        begz a0, after_while_0
   fn main() {
1
                                                        # while body
                                               27
2
       let mut count = 0;
                                                        li a0, 1
                                               28
3
       while count < 10 {
                                                                           # count
                                               29
                                                        ld a1, -24(fp)
            count += 1;
4
                                                        add a2, a1, a0
                                               30
5
                                                        sd a2, -24(fp)
                                               31
6
       exit(count);
                                               32
                                                        j while_head_0
7
   }
                                               33
                                                   after_while_0:
                                               34
     Listing 5.23 – A rush program con-
                                               35
                                                        ld a0, -24(fp)
                                                                           # count
                                                        call exit
              taining a while-loop.
                                               36
```

**Listing 5.24** – Compiler output of the rush program in Listing 5.23.

The rush program contains a while-loop which iterates ten times. At the beginning of the program, the integer variable 'count' is declared. Before each iteration, the while-loop checks whether 'count' is less than '10'. If it is, count is incremented by '1', and the loop

begins its next iteration. Otherwise, the execution of the loop stops and the code after the loop is executed. In the assembly code, the content of the loop is represented by the instructions below the 'while\_head\_0' label. As the comment in line 22 suggests, the first four instructions represent the loop control code. These instructions are executed every time time loop begins a new iteration, therefore they check if the condition is true before any of the code inside the loop's body is executed. If it results in 'false', the execution of the loop should stop. This is accomplished by the 'beqz' ("branch (if) equal (to) zero") instruction in line 26. It is a pseudoinstruction which jumps to the specified label if the value inside the first operand is zero [WA19, p. 105]. Here, this means that the instruction would jump to the 'after\_while\_0' label if the condition of the loop evaluated to 'false'.

This label contains code which follows the while-statement, here only the call to the 'exit' function. Otherwise, if the condition evaluated to 'true', the 'beqz' instruction would not perform a jump and the CPU would just continue to the following instruction. The instructions following that conditional jump represent the loop body, i.e., the statement 'count +=1' In line 32, and unconditional jump, which jumps back to the beginning of the 'while\_head\_0' label, follows. Therefore, the loop begins its next iteration. Using this appropriate, the compiler is able to generate instructions for the 'loop', 'while', and 'for' statements.

# 5.3. x86\_64: Compiling to a CISC Architecture

In addition to reduced instruction set computers (RISC), there are also complex instruction set computers, short CISC. The main differences lie in the instruction count and complexity. For the purposes of this paper, the still widely used CISC architecture  $x86\_64^{16}$  is used as an example, just like RISC-V was used to represent RISC architectures in the previous section. While RISC-V with all extensions used by rush<sup>17</sup> provides about 100 different instructions [WA19, Chapter 24], x64 is estimated to have about 3600 instructions [RA17]. This is due to the fact that at the time when most CISC architectures were developed, many developers still programmed in assembly languages by hand, without the help of compilers. Additional instructions for higher level concepts were therefore very helpful [Dan05a, p. 9].

One example for such an instruction is the x64 'leave' instruction. It is used to release the current stack frame at the end of a procedure. Listing 5.20 on page 78 shows how multiple RISC-V instructions are used to accomplish the same manually. First, the frame pointer register is copied into the stack pointer register, freeing the stack space allocated at the start of the procedure. Then, the old frame pointer that was saved on the stack is loaded from the stack back into the frame pointer register. Restoring the return address and returning to the caller is both done by the 'ret' instruction in x64.

Another similar example is the 'call' instruction. Although it is also usable in RISC-V assembly, it is a pseudoinstruction provided by the assembler and resolves to multiple other RISC-V instructions during the assembly process. In x64 however, 'call' is a conventional instruction.

For x64 specifically, there are also some instructions that only operate on certain specific registers. For instance, the 'idiv' instruction divides a signed 128-bit integer stored in the two 64-bit registers '%rdx' and '%rax' by the given operand, storing the result in the '%rax' register and the remainder in the '%rdx' register. This makes usage of such instructions significantly more demanding, as values in these registers must be spilled to the stack in case they must be preserved.

The substantial growth of the x64 instruction set and its commitment to backwards compatibility have also led to a number of instructions that have long become obsolete. For example, 'aaa' (ASCII Adjust after Addition) is used in the context of adding two *Binary Coded Decimal (BCD)* values, but that technology is rarely used nowadays [PW17, p. 4]. TODO: maybe explain what BCD is?

## 5.3.1. x64 Assembly

```
1 let mut a = 2;
2
3 fn main() {
4    a += 1;
5    let b = true;
6    exit(a + b as int);
7 }
```

There are multiple different dialects of x64 assembly and multiple assemblers each supporting a different set of dialects. The rush x64 compiler emits assembly code using Intel syntax that can be assembled by the *GNU Assembler* (GAS).

Listing 5.25 contains another simple rush program. It defines a global variable called 'a' and assigns it the integer 2, increments it by one, defines a local boolean called 'b', using 'true' as its initial value, and exits with the sum of 'a' and 'b'. The corresponding assembly code generated by the x64 rush compiler is shown in Listing 5.26. It begins with the '.intel\_syntax'

directive, indicating that the following assembly code uses the Intel syntax. Then, similar

 $<sup>^{16}</sup>$ Later shortened to x64.

<sup>&</sup>lt;sup>17</sup>The used extensions are RV32I, RV64I, RV32M, RV64M, RV32D, and RV64D.

to the RISC-V assembly code shown in the previous section, the '\_start' label is marked as global, as it again represents the entry point. The sections are also the same, as these are defined by the ELF format which is independent of ISAs.

The main differences between x64 assembly and the previously shown RISC-V assembly, and also the other x64 assembly dialects, are the instruction mnemonics and the register names. When using the Intel syntax for x64, register names are always prepended by a percentage sign. Another difference is the syntax used for pointers. RISC-V assembly uses '-1(fp)' to specify the value in the 'fp' register as a memory address with an offset of '-1', Intel x64 uses 'byte ptr [%rbp-1]', as seen in line 16. It is obvious that the latter provides an additional constraint, the size of the value pointed to, here 'byte'. RISC-V assembly instead encodes this information in the instruction mnemonic. To store a byte, one would use 'sb' (store byte), whereas for storing a 64-bit integer, 'sd' (store dword) is used instead.

```
The naming of sizes other than 'byte' was not explained yet, and this, too, differs between RISC-V
```

```
.intel_syntax
                                                           1
.global _start
                                                           2
                                                           3
.section .text
                                                           4
                                                           5
_start:
                                                           6
                                                           7
                  main..main
    call
    mov
                  %rdi, 0
                                                           8
    call
                  exit
                                                           9
                                                           10
main..main:
                                                           11
                  %rbp
                                                           12
    push
    mov
                  %rbp, %rsp
                                                           13
                  %rsp, 16
    sub
                                                           14
    inc
                  qword ptr [%rip+main..a]
                                                           15
                  byte ptr [%rbp-1], 1 # let b = 1
                                                           16
    mov
                  %rdi, qword ptr [%rip+main..a]
    mov
                                                           17
                  %sil, byte ptr [%rbp-1]
    mov
                                                           18
                  %rsi, 255
    and
                                                           19
    add
                  %rdi, %rsi
                                                           20
                  exit
                                                           21
    call
main..main.return:
                                                           22
    leave
                                                           23
    ret
                                                           24
                                                           25
.section .data
                                                           26
main..a:
                                                           27
    .quad
                  0x0000000000000000000002
                                                           28
```

**Listing 5.26** – Compiler output from the rush program in Listing 5.25.

and x64. Architectures usually define the size of one so-called word. In RISC-V, a word is defined as 32 bits, so 4 bytes, in x64 it is defined as 16 bits, so 2 bytes. All other sizes, except for the byte, are then named based on the word size. Therefore, RISC-V assigns 16-bit, 32-bit, and 64-bit the names half word, word, and double word respectively [WA19, p. 6]. For x64, they are instead called word, double word, and quadruple word [Kus18, p. 3]. These names are often shortened to only include the first letter of the factor. For instance, a double word is called dword. In some instances, like in line 28, alternative short forms are used. Here, the directive '.quad' denotes a quadruple word.

Another important difference between x64 and RISC-V is the typical operand structure of instructions. RISC-V requires three operands for 'add', 'sub', and many other instructions, these being the destination, the first source, and the second source location. The norm in x64 is using just two operands where the first represents both the first source and the destination [Dan05a, pp. 14–20].

## 5.3.2. Registers

The base x64 ISA provides 16 general-purpose registers capable of holding 64-bit integers. These are shown in Table 5.2. Additionally, three more sets of 16 registers are defined, each half the size as the previous one. However, instead of these being additional storage, they simply provide an alias to the lower half of the respective larger register. For instance, when

Table 5.2 – Integer registers on the x64 architecture [Lu+22, pp. 20, 26].

| 64-bit | 32-bit | 16-bit | 8-bit | Caller-Saved | Purpose                                                              |
|--------|--------|--------|-------|--------------|----------------------------------------------------------------------|
| %rbp   | %ebp   | %bp    | %bpl  |              | base pointer / frame pointer                                         |
| %rsp   | %esp   | %sp    | %spl  |              | stack pointer                                                        |
| %rax   | %eax   | %ax    | %al   | $\checkmark$ | 1 <sup>st</sup> return register                                      |
| %rbx   | %ebx   | %bx    | %bl   |              |                                                                      |
| %rcx   | %ecx   | %сх    | %cl   | $\checkmark$ | 4 <sup>th</sup> argument register                                    |
| %rdx   | %edx   | %dx    | %dl   | $\checkmark$ | 2 <sup>nd</sup> return register<br>3 <sup>rd</sup> argument register |
| %rsi   | %esi   | %si    | %sil  | $\checkmark$ | 2 <sup>nd</sup> argument register                                    |
| %rdi   | %edi   | %di    | %dil  | $\checkmark$ | 1 <sup>st</sup> argument register                                    |
| %r8    | %r8d   | %r8w   | %r8b  | $\checkmark$ | 5 <sup>th</sup> argument register                                    |
| %r9    | %r9d   | %r9w   | %r9b  | $\checkmark$ | 6 <sup>th</sup> argument register                                    |
| %r10   | %r10d  | %r10w  | %r10b | $\checkmark$ |                                                                      |
| %r11   | %r11d  | %r11w  | %r11b | $\checkmark$ |                                                                      |
| %r12   | %r12d  | %r12w  | %r12b |              |                                                                      |
| %r13   | %r13d  | %r13w  | %r13b |              |                                                                      |
| %r14   | %r14d  | %r14w  | %r14b |              |                                                                      |
| %r15   | %r15d  | %r15w  | %r15b |              |                                                                      |

writing the byte  $(2a)_{16}$  into '%al', the least significant byte of '%ax', '%eax', and '%rax' changes to  $(2a)_{16}$ , too.

Considering line 18 from Listing 5.26, the move of the value behind the 'byte ptr' requires the destination register to be one byte in size, too. For this reason, the compiler uses the '%sil' register here, instead of its 64-bit equivalent '%rsi'. Line 19 then performs the cast from the boolean, saved as a byte, to a 64-bit integer. It does this by taking the 64-bit variant of the register, in this case '%rsi', and assuring that all bits, except for the least significant eight, are zeros. The latter is done using the bitwise AND operation with the number  $2^8 - 1 = 255$ , which in binary is represented by eight ones.

Another special register is '%rip'. It stores the instruction pointer, that is, the address of the current instruction being executed. The register is usually not modified directly, as branching can be achieved with instructions like 'call', 'ret', or 'jmp'. Nevertheless, it is sometimes accessed manually, as seen in the lines 15 and 17. In these lines, read and write operations on the global variable 'a', defined at the 'main..a' label, are performed. Pointers to such global variables and constants in the '.data' and '.rodata' sections require the instruction pointer as their base in x64 Assembly.

In addition to these general purpose registers, x64 also defines 16 registers for floating-point numbers, each being 128 bits in size and labelled '%xmm0' through '%xmm15'. For the purposes of rush, only the lower 64 bits are ever used, as this is enough to represent double precision floating-point numbers. The additional 64 bits are provided for *single instruction multiple data* (SIMD) instructions. They can be used by more optimized compilers to operate on multiple values at a time. Any single one of these registers would then for instance hold two 64-bit values or four 32-bit values [Kus18, pp. 5–7].

# 5.3.3. Stack Layout and Calling Convention

Alongside the instructions and registers defined by x64 itself, an additional ABI document defines how programs should structure the stack during execution. Figure 5.6 shows this layout. The overall outline is similar to that of RISC-V, but some details differ. For one, the order of the saved return address and the saved '%rbp' register is inverted. Secondly, the address '%rbp' points to differs by 16 bytes. Therefore, in x64 the first stack argument is accessed by offsetting '%rbp' by 16, whereas in RISC-V the 'fp' register already points to the first stack argument without any offset. In both calling conventions, every stack argument uses eight bytes of space. However, x64 has no special re-



**Figure 5.6** – Stack layout of the x64 architecture [Lu+22, p. 21].

turn address register, and instead only stores the return address on the stack and retrieves it from there when needed. At the moment of calling a function, the stack frame's size must be aligned to a multiple of 16 bytes. The unspecified space between '%rbp-8' and '%rsp' can be freely used by the current function to save local variables. Its internal layout may differ between programming languages [Lu+22, p. 21].

### 5.3.4. Implementation

Even though many implementation details could be the same as for the RISC-V compiler, there are some differences because we each created one separately without using any reference implementation. One such difference is that the x64 compiler's 'Instruction' definition, which is shown in Listing 5.27, not only includes instructions, but also directives and labels. This way, an entire assembly file can be represented as one 'Vec<Instruction' with every 'Instruction' representing one line of the assembly code. A matching 'Display' implementation 18 then

```
- crates/rush-compiler-x86-64/src/instruction.rs =
pub enum Instruction {
    // ...
    Section(Section),
    Symbol(Rc<str>, bool),
    // ...
    Add(IntValue, IntValue),
    Sub(IntValue, IntValue),
    // ...
}
```

**Listing 5.27** – The 'Instruction' definition in the rush x64 compiler.

simply emits the respective assembly representation as a string.

Line 20 shows the definition of the '.section' directive, simply holding an instance of another enumeration describing the valid sections. Line 21 defines an assembly label, holding its name and a boolean indicating whether a blank line should be added above. TODO: we might still have to explicitly state what labels are

The normal instructions, e.g., 'add' and 'sub', also cannot simply take registers as operands, as x64 often allows memory pointers and immediate values as operands. Instead, another enumeration, called 'IntValue', is introduced to allow these three variants. Listing 5.28 shows its definition.

<sup>&</sup>lt;sup>18</sup>In Rust, implementing the 'Display' trait on a type defines how instances of that type can be presented in text form to a user.

#### Struct Fields

```
_ crates/rush-compiler-x86-64/src/value.rs _
pub enum IntValue {
    Register(IntRegister),
    Ptr(Pointer),
    Immediate(i64),
}
```

**Listing 5.28** – The 'IntValue' definition in the rush x64 compiler.

The compiler struct itself, shown in Listing 5.29, is also a bit different. Both compilers have a 'used\_registers' field, but this compiler does not need to store the size for each register because the registers themselves already have a 'scopes' field for storing variable locations, but the RISC-V compiler has the 'Option<\_>' inside the 'Variable' struct (line 26). Both compilers store the global variables, but this compiler has

separate fields for every size (lines 46–56). The same applies for constants in the '.rodata' section (lines 58–70). And both compilers have one field for accumulating the final assembly code, but this compiler saves a list of 'Instruction's instead of 'Block's (line 44).

```
crates/rush-compiler-x86-64/src/compiler.rs -
   pub struct Compiler<'src> {
15
       /// The instructions for the current function body
16
       pub(crate) function_body: Vec<Instruction>,
17
        /// The currently used [`IntRegister`]s
18
19
        pub(crate) used_registers: Vec<IntRegister>,
        // ...
/// Maps variable names to `Option<Variable>`, or `None` for types `()` and `!`
        pub(crate) scopes: Vec<HashMap<&'src str, Option<Variable>>>,
        /// Internal stack pointer, separate from `%rsp`, used for pushing and popping
27
        /// inside the stack frame. Relative to `%rbp`, always positive.
28
        pub(crate) stack_pointer: i64,
29
        /// Size of current stack frame, always `>= self.stack_pointer`
       pub(crate) frame_size: i64,
31
        /// The text section (aka code section)
43
        pub(crate) text_section: Vec<Instruction>,
44
45
       ////// .data section ///////
46
        /// Globals with 64-bits
47
        pub(crate) quad_globals: Vec<(Rc<str>>, u64)>,
48
        /// Globals with 8-bits
55
        pub(crate) byte_qlobals: Vec<(Rc<str>>, u8)>,
56
57
        /////// .rodata section ///////
        /// Constants with 128-bits (maps value to name)
59
        pub(crate) octa_constants: HashMap<u128, Rc<str>>>,
60
69
        /// Constants with 8-bits (maps value to name)
       pub(crate) byte_constants: HashMap<u8, Rc<str>>>,
   }
71
```

**Listing 5.29** – Overview of the x64 'Compiler' struct definition.

Additional fields of this compiler are 'function\_body', 'stack\_pointer', and 'frame\_size'. The 'function\_body' field stores the instructions of the currently compiled function. During compilation of statements and expressions, emitted instructions are always appended to this vector. The method for compiling a function body then produces the prologue and afterward appends the vector contents to 'text\_section'. This is required because some values that are needed in the prologue, like the 'frame\_size', cannot be known before compilation of the function body, but must appear before the body in the assembly code.

#### **Memory Management**

The 'stack\_pointer' field is used to keep track of the current stack frame. Every time a variable definition is encountered, or a register is spilled, stack space for it is reserved by increasing this pointer. This process is done by the 'push\_to\_stack' method visible in Listing 5.30. It takes a value, its size, and an optional comment as parameters. At first, it aligns the stack pointer to a multiple of the values' size by the 'align' function. Then, it adds the number of bytes required for the value to it, and increases the frame size in case it is not big enough yet. Now, the pointer to this value is constructed using '%rbp' as the base and the negated stack pointer as the offset. Finally, a 'mov' instruction for inserting the value at the location of the pointer is added to the assembly code, and the pointer is returned for further use.

```
crates/rush-compiler-x86-64/src/compiler.rs
     pub(crate) fn push_to_stack(
254
255
         &mut self,
256
         size: Size,
         value: Value,
257
         comment: Option<impl Into<Cow<'static, str>>>,
258
     ) -> Pointer {
259
         // add padding for correct alignment
260
         Self::align(&mut self.stack_pointer, size);
261
262
         // allocate space on stack
263
         self.stack_pointer += size.byte_count();
         // possibly expand frame size
264
         self.frame_size = self.frame_size.max(self.stack_pointer);
265
         // get pointer to location in stack
266
         let ptr = Pointer::new(
267
268
             size,
269
             IntRegister::Rbp,
             Offset::Immediate(-self.stack_pointer),
270
         );
271
277
         self.function_body.push(match comment {/* ... */});
281
         ptr
282
     }
283
```

**Listing 5.30** – Stack space reservation for values.

#### TODO: example?

#### **Register Allocation**

The register allocation algorithm of the x64 compiler differs quite a lot from the RISC-V version, too. Instead of marking individual registers as used or unused and then searching through a specified order for the first free one, this compiler uses an almost stack-like behavior. Again, one specific order of registers is defined, starting with the return register '%rax', followed by the argument registers in their correct order, and ending with the remaining general purpose registers. Whenever a free register is required, the 'next' method, seen in Listing 5.31, is called on the last used register, returning the next register in line, which is then added to the 'used\_registers' field. For freeing a register, the compiler just removes the last register from this list. This only works because the rest of the compiler guarantees used registers to be freed in exactly the reverse order they were acquired in. With this guarantee, it is apparent that at any point in time during compilation, the registers in the 'used\_registers' field are in the exact order specified by Listing 5.31 without gaps. The same is done separately for the float registers.

```
257
258
259
260
261
262
263
264
265
270
271
272
273
274
```

```
fn main() {
    exit(foo(21));
}
fn foo(n: int) -> int {
    return n * 2;
    let b = n;
}
```

Listing 5.32 – Another example rush program with two functions.

To explain the processed compiling functions and function calls, there is another example produced assembly code in Listing 5.32, along with the produced assembly code in Listing 5.33. Although not suplicitly labelled; the x64 compiler also uses the concept of a function problem, body, and epilogue. Since the 'main' function in this example does not require any stack space for variables or register spilling, the prologue is automatically omitted, and the epilogue does not contain a 'leave' instruction for deaflocating the membery. The 'foo' function however does declare a local variable and thus requires stack space to be allocated in the prologue. The lines 20–22 are responsible for this. As previously shown in Figure 5.6, the stack should first contain the return address and then the previous base pointern the sortman of the stack should first contain the return address and then the

by the 'call' instruction. Therefore, the function's prologgie-begins with pushing '%rbp' onto the stack. Afterward, the base pointer's value is set to the current stack pointer. The stack pointer itself is then decremented by the amount of bytes needed for the function, which is saved in 'frame\_size', rounded up to a multiple of 16.

```
main..main:
11
                     %rdi, 21
12
        mov
        call
                     main..foo
13
                     %rdi, %rax
14
        mov
15
        call
                      exit
    main..main.return:
16
17
        ret
18
    main..foo:
19
20
        push
                     %rbp
                     %rbp, %rsp
21
        mov
                     %rsp, 16
22
        sub
                      qword ptr [%rbp-8], %rdi # param n = %rdi
23
        mov
                     %rax, qword ptr [%rbp-8]
24
        mov
25
        imul
                     %rax, 2
26
        jmp
                     main..foo.return
                                                   # return %rax;
27
        mov
                     %rax, qword ptr [%rbp-8]
28
        mov
                     qword ptr [%rbp-16], %rax # let b = %rax
29
    main..foo.return:
30
        leave
31
        ret
```

**Listing 5.33** – Trimmed compiler output for the rush program in Listing 5.32.

The epilogue ranges from line 29 to line 31. It defines a label where 'return' statements can jump to, releases stack frame using the 'leave' instruction, and returns control to caller with the 'ret' instruction. Usage of the return label can be seen in line 26, which does an unconditional jump to there, rendering all intermediate instructions as unreachable. This is the expected behavior because the jump instruction was emitted due to the 'return' statement in line 6 of the rush program, which should skip to the end of the function.

The return value was moved into the '%rax' register, or its correspondingly sized variant, in advance.

TODO: 'function\_body' definition?

#### **Function Calls**

To support function calls, two main aspects must be considered: passing arguments from the caller, and retrieving arguments in the callee.

Generally, for passing arguments, the compiler simply iterates over the argument expres-

sions, compiles each one, and moves its result into the appropriate register, or onto the stack if no registers are left. As the first six integer arguments are always passed through the same registers, a function call always requires access to these exact registers, or at least a subset of them. That means, if some of the registers happen to be in use already, their content has to be spilled to the stack and replaced with the argument value. An example for when this might happen is a function call as a non-first argument to another function call, like in 'foo(42, bar('x'))'. When the compiler has to move the first argument for the inner function call into '%rdi', here the argument ''x'' for the function 'bar', it notices this register is already used for the first argument of the outer function call, '42'.

Each function that takes parameters should move all passed arguments from their registers onto the stack in order to make the registers available for the rest of the function body. In Listing 5.33 this happens in line 23 for the one parameter called 'n'. More optimized compilers can of course try to keep the arguments in the registers for as long as they are not specifically needed elsewhere, but as the rush compilers only serve educational purposes, this is not done.

#### **Control Flow**

Just as with RISC-V, control flow constructs, like loops and if-branches, must be represented using simple jumps in assembly. There are both conditional and unconditional jumps, the latter of which was already outlined during explanation of 'return' statements on page 90. An unconditional jump, done in x64 with the 'jmp' instruction, only takes a label to jump to, and does so, as the name implies, without checking for any condition. Conditional jumps are rather a set of different instructions,

```
fn bar(a: bool) {
   let b = if a { 42f } else { 3.14 };
   let c = if b == 42f { 11 } else { 3 };
   loop {
      break;
      continue;
      3 / c;
   }
}
```

**Listing 5.34** – A rush example function containing ifexpressions and a loop.

each testing a different condition. Again, the way x64 handles conditions differs from RISC-V. It provides a set of flags<sup>19</sup> indicating various relations between two values, which are set by dedicated compare instructions like 'cmp' for integers, 'test' for booleans, and 'ucomisd' for floats. A conditional jump instruction like 'je' (jump if equal) then only has to query the flag that indicates equality, and optionally performs a jump based on its value.

Listing 5.34 shows an example rush function containing if-expressions and a loop with one 'break' and one 'continue' statement. The resulting assembly code is displayed in Listing 5.35. This function serves no purpose, and the semantic analyzer rightfully warns about unreachable statements and unused variables, but it still serves to show the translation of these constructs to assembly.

Compiled if-expressions always follow the same outline, as it is shown in Figure 5.7. First, the condition is compiled and an initial jump instruction that jumps to the elseblock in case the condition evaluates to



Figure 5.7 – Structure of if-expressions in assembly

<sup>&</sup>lt;sup>19</sup>A flag is a boolean value, typically stored in batch with one bit per flag.

```
17
    main..bar:
18
        push
                     %rbp
                     %rbp, %rsp
19
        mov
20
        sub
                     %rsp, 32
                     byte ptr [%rbp-1], %dil
                                                                  # param a = %dil
21
        mov
22
        test
                     byte ptr [%rbp-1], 1
23
                     .block_0
                                                                  # if a {
        jе
        movsd
                     %xmm0, qword ptr [%rip+.quad_constant_0] #
24
                     .block_1
25
        jmp
                                                                  # } else {
    .block_0:
26
                     %xmm0, qword ptr [%rip+.quad_constant_1] #
                                                                     3.14
27
        movsd
                                                                  # }
28
    .block_1:
                                                                  # let b = %xmm0
                     gword ptr [%rbp-16], %xmm0
29
        movsd
                     %xmm0, qword ptr [%rbp-16]
30
        movsd
                     %xmm0, qword ptr [%rip+.quad_constant_0]
31
        ucomisd
32
        jp
                     .block_2
        jne
                     .block_2
                                                                  # if b == 42f {
33
                     %rax, 11
                                                                     11
34
        mov
                     .block_3
35
        jmp
                                                                  # } else {
    .block_2:
36
                                                                  #
37
                     %rax, 3
                                                                      3
        mov
                                                                  # }
    .block_3:
38
                     gword ptr [%rbp-24], %rax
                                                                  # let c = %rax
39
        mov
    .block_4:
40
                                                                  # loop {
                     .block_5
                                                                  #
                                                                       break;
41
        jmp
                                                                  #
                                                                       continue;
42
        jmp
                     .block_4
43
        mov
                     %rax, 3
44
        cqo
                                                                  #
        idiv
                     qword ptr [%rbp-24]
                                                                       3 / c;
45
                     .block_4
                                                                  #
                                                                      continue;
46
        jmp
                                                                  # }
47
    .block_5:
    main..bar.return:
48
        leave
49
50
        ret
51
52
    .section .rodata
53
    .quad_constant_0:
54
        .quad
                     0x40450000000000000 # = 42.0
55
    .quad_constant_1:
56
        .quad
                     0x40091eb851eb851f # = 3.14
```

**Listing 5.35** – Trimmed compiler output for the rush function in Listing 5.34.

'false' is emitted. Following that are the instructions for the if-block, concluded by an unconditional jump to behind the else-block.

Now, the first label is inserted, here '.block\_x', followed by the else-block. The entire construct ends with the second label, here '.block\_y'. Any following code goes after that. It is apparent that this structure encodes the expected if-else branching behavior. The code from the if-block is only ever reached if the condition is true, and it then skips over the else-block to the following instructions. If the condition is false, the if-block is immediately skipped, and only the else-block is executed which then also continues with the following instructions, causing the two branches to merge.

Starting with the if-expression in line 6 of Listing 5.34, the condition here is merely a boolean variable. In order to execute a jump based on the value of this variable, the 'test' instruction is used with both the variable pointer and a constant '1' as operands in line 22 of Listing 5.35. This results in the equality flag being set to '1' if a bitwise AND operation of these operands results in '0', that is, if the boolean variable is 'false'. The following 'je'

instruction in line 23 therefore skips to the else-block in case the boolean value was 'false'.

The compiler could now simply compile all condition expressions as usual, always resulting in one 'IntValue' holding a boolean, and use the same simple boolean check every time. However, since x64 provides additional jump instructions for basic conditions like '==', '!=', '>=', or '<', the compiler will try to use one of these when the source rush expression is simple enough. This can be seen in lines 30–33 of the assembly code, which represent the condition of the if-expression in line 7 of the rush code. The compiler detects the condition to be a simple equality check between two floats, and therefore issues a 'ucomisd' instruction in line 31 to compare the floats, and a 'jne' instruction in line 33 to jump if the floats are not equal. The reason for the 'jp' instruction in between is explained later.

#### TODO: show Rust snippets?

The representation of the loop is much simpler. Only two labels must be defined, one before and one after the loop's contents, and an unconditional jump back to the former must be performed at the end of every iteration. This second point is the reason for the comments in Listing 5.35 showing two 'continue' statements, even though the rush function only contains one. An unconditional jump back to the loop's start is exactly the behavior of 'continue' statements. And 'break' statements are not much different either as they just jump to behind the loop instead. For conditional loops, that is, while-loops and for-loops, there is an additional head effectively performing 'if !condition { break; }' at the start of each iteration. For-loops also initialize a variable before the first iteration and call an update-expression at the end of each one.

#### Integer Division and Float Comparisons

This subsection's title might seem a little specific compared to the others, but it has the simple reason that this subsection highlights two unexpectedly complex difficulties which we encountered during the creation of this compiler. They were both already hinted at before. Firstly, there is integer division with the 'idiv' instruction, which was shortly explained in the introduction to x64 on page 84. It is used when compiling rush infix-expressions between two integers that use either the '/' or the '%' operator.

Listing 5.36 shows this compilation process. As 'idiv' always operates on the '%rax' and '%rdx' registers, it must first be assured that they are free for use. The left-hand side of the division, the numerator, is then moved into '%rax' and sign-extended<sup>20</sup> to 128 bits via the 'cqo' instruction. Now, the right-hand side, the denominator, is made sure to either be a register or a memory pointer, which is then used in the call to 'idiv' in line 93. The division result, either '%rax' or '%rdx' depending on the operator, is then moved into the register that previously contained the numerator, as this is guaranteed to be available now. To finish up, possibly spilled registers are reloaded, and the register is returned as the expression's result. The resulting assembly of this can be seen in lines 43–45 in Listing 5.35.

The second unexpected difficulty was presented by floating-point number comparisons. Floating-point numbers, as defined by the *IEEE 754–2008* standard, have the special property of not forming a total order **TODO:** cite. That means, not any two arbitrary values are comparable. For instance, the float standard defines a 'NaN'<sup>21</sup> value, which cannot be compared to itself [PH17, p. 216]. In rush an expression comparing two incomparable float values should result in 'false'. This is the reason for the additional 'jp' instruction in line 32 of Listing 5.35 that was previously hinted at on page 93. When comparing two floats with the 'ucomisd' instruction and there is no clear order defined, the *parity* flag is set to '1'. The 'jp' instruction then jumps straight to the else-block if this flag is set, hereby interpreting an expression like 'NaN == NaN' as false.

<sup>&</sup>lt;sup>20</sup>TODO: what is sign-extension

<sup>&</sup>lt;sup>21</sup>Short for "not a number".

```
(Some(Register::Int(left)), Some(Value::Int(right)), InfixOp::Div | InfixOp::Rem)
56
        => {
        // ...
// make sure the rax and rdx registers are free
65
         let spilled_rax = self.spill_int_if_used(IntRegister::Rax);
66
         let spilled_rdx = self.spill_int_if_used(IntRegister::Rdx);
67
68
69
         // move lhs result into rax
         // analyzer guarantees `left` and `right` to be 8 bytes in size
70
         self.function_body
71
             .push(Instruction::Mov(IntRegister::Rax.into(), left.into()));
72
73
         // sign-extend lhs to 128 bits (required for IDIV)
74
         self.function_body.push(Instruction::Cqo);
75
76
77
         // get source operand
78
         let source = match right {/* ... */};
91
         // divide
92
         self.function_body.push(Instruction::Idiv(source));
93
94
         // move result into result register
95
         self.function_body.push(Instruction::Mov(
96
97
             left.into(),
             // use either `%rax` or `%rdx` register, depending on operator
98
             IntValue::Register(match op {
99
                 InfixOp::Div => IntRegister::Rax,
100
101
                 InfixOp::Rem => IntRegister::Rdx,
102
                 _ => unreachable!("this arm only matches with `/` or `%`"),
             }),
103
         ));
104
         // reload spilled registers
110
         self.reload_int_if_used(spilled_rax);
111
         self.reload_int_if_used(spilled_rdx);
112
113
         Some(Value::Int(IntValue::Register(left)))
114
    }
115
```

**Listing 5.36** – Compilation of integer division in x64.

How this is achieved in the compiler is visible in Listing 5.37. The shown code snippet is inside the 'condition' method which is used during compilation of if-expressions and conditional loops. After pushing the 'ucomisd' instruction, two additional checks are performed. Firstly, if the condition is '==', then add a conditional jump to the else-block for when the result is unordered. Secondly, if the condition is '!=', then add a conditional jump to the else-block for when the result is *not* unordered. An expression like 'NaN != NaN' is therefore considered to be true.

#### **Pointers**

As for pointers, the x64 compiler operates based on the same principles as the RISC-V one. However, the exact implementation differs slightly as x64 provides a specialized 'lea' (load effective address) instruction for calculating a pointer address at runtime. For referencing a variable stored at an offset of '-24' from the frame pointer, the RISC-V compiler produces the instruction 'addi a0, fp, -24' in order to manually subtract 24 from the address stored in 'fp'. The x64 compiler instead emits 'lea %rax, qword ptr [%rbp-24]'. Here, the same syntax for pointers can be used as for accesses to the values behind pointers. In fact, in

```
(Value::Float(lhs), Value::Float(rhs)) => {
927
         self.function_body.push(Instruction::Ucomisd(lhs, rhs));
944
945
         if cond == Condition::Equal {
946
             // if floats should be equal but result is unordered, jump to end
947
             self.function_body.push(Instruction::JCond(
948
949
                 Condition::Parity,
                 Rc::clone(&false_symbol),
950
             ));
951
         } else if cond == Condition::NotEqual {
952
             // if floats should not be equal and result is not unordered, jump to end
953
             self.function_body.push(Instruction::JCond()
954
                 Condition::NotParity,
955
956
                 Rc::clone(&false_symbol),
             ));
957
958
         }
959
    }
```

**Listing 5.37** – Compilation of float comparisons in x64.

order to load the value instead of the address into '%rax', one only has to substitute 'lea' with 'mov'.

#### 5.4. Conclusion: RISC vs. CISC Architectures

During implementation of the RISC-V and x64 compilers, it has become apparent how each type of architecture affects the project's complexity. One metric for the complexity is line count: In Git commit '93ee068', the RISC-V compiler consisted of 2234 lines of code while its x64 counterpart required 2751, meaning that the latter required roughly 500 lines more than RISC-V. Even though the observable difference in line count might be reasonable, this metric might differ drastically when considering more advanced compilers. Apart from objective metrics, implementation of the x64 compiler has presented itself as slightly more demanding compared to the RISC-V version. For instance, the previously explained issues of integer division and float comparison presented just an example of how the architecture complicated development.

Furthermore, documentation and learning resources also showed a significant discrepancy. While there are many resources on the RISC-V architecture available, complete guides for implementing programs in its assembly were sparse. For x64, online resources covering assembly were slightly more prominent due to its larger popularity. However, easy to comprehend instruction references were only provided by third-party individuals, as the official Intel reference lacks structure to the point where it has to been helpful. **TODO:** Is this correct? For RISC-V, the official ISA specification is held up-to-date and is easily accessible, making it very helpful during the design of a compiler. Overall, it seemed like the RISC-V teams put a lot of effort into its documentation, making the architecture available for everyone.

Due to the discussed factors, mainly the latter one, RISC-V seems like a very promising architecture which will probably gain a lot more popularity in the future. Simultaneously, CISC architectures, like x64, are becoming increasingly obsolete. Furthermore, significant licensing fees introduced by large corporate entities like Intel lead to the architecture being unsuitable for third-party chip manufactures. Since RISC-V is an open source architecture, no licensing fees are required, thus making the architecture available to anyone.

It is to be mentioned that CISC architectures, like x64, itself cannot be considered bad.

For instance, since the introduction of x64, it has served in millions of devices. However, since it dates back to the time when assembly programs were still written by hand, its features are just not up-to-date in a world where code is mostly generated by compilers. Since CISC architectures provide a lot more, increasingly complex instructions, handwritten assembly programs will run faster at the expense of programs generated by a compiler. The differences in documentation and reference material can also be explained when considering the age. Because RISC-V intents to attract cooperations to use their architecture, good documentation is vital.

To summarize the paragraphs from above, during creation of the rush project and this paper, RISC-V has presented itself as the more attractive architecture, both in objective metrics and subjective experiences. Therefore, if the reader intents to develop a compiler on their own, we strongly recommend choosing a RISC over a CISC architecture as the compiler's target. Here, RISC-V could be considered as a modern and relevant RISC architecture.

TODO: By only considering the compiler's struct fields, it has become apparent that this implementation provides several abstractions over the bare strings in which assembly is normally formatted. Due to this, implementation of the actual compiler is a lot more structured and reliable. During development of the compiler, this approach has often proven itself to be optimal.

# 6. Final Thoughts and Conclusions

Chapter 2, explained how the syntactical and semantic analysis play a vital role before program execution can start. It has been explained how the *parser* is used for analyzing the program's syntax in order to construct an *abstract syntax tree*, and how the *semantic analyzer* validates the program's semantic rules.

Next, in Chapter 3, we have explained how a tree walking interpreter can be used to interpret the AST directly. Furthermore, a virtual machine has been presented as a way to leverage some advantages of interpreters while achieving greater runtime speed. Here, the tree walking-interpreter has presented itself as the easier solution, both in planning and implementation. Instead, the virtual machine's implementation was slightly more demanding as it also requires a small compiler.

Chapter 4 provided an overview of compilation to high-level targets. As examples for high-level targets, Webassembly and LLVM have been used. Although the former is one specific architecture, it is very portable and provides many high-level constructs by itself. This, combined with the fact that it is stack-based rather than register-based, resulted in a rather simple compiler. The more difficult aspects are mostly caused by our decision to target Webassembly's binary format directly, which could have been avoided by targeting the text format instead.

Leveraging LLVM has proven itself to make implementation of a high-performance, multi-target compiler both feasible and easy. Due to helpful literature and guides on LLVM being rather sparse, we have used the implementation of some other Rust LLVM compilers in order to learn about its principles and concepts. To summarize, writing a compiler targeting these high-level targets presented a relatively easily accomplishable task. If we were to construct a compiler for a more complicated language, LLVM (or a similar solution) would present an attractive solution considering its current relevance in the software industry.

In Chapter 5, low-level programming concepts and compilation to low-level targets have been covered. During the research and software implementation phase of this project, this chapter has proven itself to be the most demanding by far. Reasons for this are that writing a compiler targeting a low-level architecture requires detailed knowledge about the target machine, as assembly is used to represent programs. Programming in assembly allows for many subtle bugs which are hard to track down later. As a result of this, we have both spent many hours trying to locate subtle bugs which were rooted in the assembly program. Moreover, creating a compiler that emits efficient code has proven to be very difficult. Therefore, our compilers only focus on the minimal principles without regarding optimization techniques. Like initially expected, implementation of the x86-64 compiler has proven to be more demanding than the implementation of the RISC-V compiler.

To summarize, this paper has presented two entirely different means of program execution: interpreters and compilers. In order to demonstrate the differences using practical examples, we have implemented our own programming language called *rush*. In summary, two interpreters, one transpiler, and four compilers were developed. Additionally, we have created additional tooling like a language server, a web playground, and a command line interface for the backends. Production-ready programming languages often implement a lot more tooling, including dependency managers, build systems, intricate analyzers, and formatters. Since this paper is primarily about methods of program execution, a reader interested in this tooling might browse the rush GitHub organization at 'https://github.com/rush-rs'.

# **Acknowledgements**

We would like to express our sincere gratitude towards *Sonja Sokolović* for her invaluable supervision and support during the creation of this paper. Our gratitude extends to our school, the CFG Wuppertal, which allowed us to pursue this research project as part of our final exams. During the creation of this paper, both of us gained lots of invaluable knowledge about compilers, interpreters, and low-level programming. Even though the past months were very stressful, this project was very enjoyable. Additionally, we would like to thank our classmate *Fatima* for inspiring the rush logo.

# **List of Figures**

| 1.1. | Different steps of compilation                                          | 2 |
|------|-------------------------------------------------------------------------|---|
|      | Different steps of compilation. (altered)                               | 2 |
| 2.1. | Abstract syntax tree for '1+2**3'                                       | 3 |
| 2.2. | Abstract syntax tree for '1+2**3' using Pratt parsing                   | ) |
| 2.3. | Token precedences for the input '(1+2*3)/4**5'                          | 2 |
| 2.4. | How semantic analysis affects the abstract syntax tree                  | L |
| 3.1. | Call stack at the point of processing the 'return' statement            | ó |
| 3.2. | Linear memory of the rush VM                                            | 3 |
| 3.3. | Example call stack of the rush VM                                       | ) |
| 3.4. | AST and VM instructions of the recursive rush program in Listing 3.9 33 | } |
| 4.1. | Abstract syntax tree for '1 + 2 < 4'                                    | ó |
| 4.2. | Steps of compilation when using LLVM                                    | 7 |
| 4.3. | The linking process                                                     | ; |
| 5.1. | Level of abstraction provided by assembly                               | ) |
| 5.2. | Relationship between registers, memory, and the CPU                     | ) |
| 5.3. | Examples of memory alignment                                            | 2 |
| 5.4. | Stack layout of the RISC-V architecture                                 | 3 |
| 5.5. | Simplified integer register pool of the RISC-V rush compiler            | 1 |
| 5.6. | Stack layout of the x64 architecture                                    | 7 |
| 5.7. | Structure of if-expressions in assembly                                 | Ĺ |

# **List of Tables**

| 1.3. | Lines of code of the project's components in commit '93ee068' | 4  |
|------|---------------------------------------------------------------|----|
| 1.1. | Most important features of the rush programming language      | 5  |
| 1.2. | Data types in the rush programming language                   | 5  |
| 2.1. | Advancing window of a lexer                                   | 7  |
| 2.2. | Mapping from EBNF grammar to Rust type definitions            | 9  |
| 5.1. | Common registers of the RISC-V architecture                   | 65 |
| 5.2. | x64 Integer Registers                                         | 86 |

# **List of Listings**

| 1.1.        | Generating Fibonacci numbers using rush                              |
|-------------|----------------------------------------------------------------------|
| 2.1.        | Grammar for basic arithmetic in EBNF notation                        |
| 2.2.        | The rush 'Lexer' struct definition                                   |
| 2.3.        | Simplified 'Token' struct definition                                 |
| 2.4.        | Example language a traditional LL(1) parser cannot parse             |
| 2.5.        | Token precedences in rush                                            |
| 2.6.        | Pratt-Parser: Implementation for expressions                         |
| 2.7.        | Pratt-Parser: Implementation for grouped expressions                 |
| 2.8.        | Pratt-Parser: Implementation for infix-expressions                   |
| 2.9.        | A rush program which adds two integers                               |
|             | Fields of the 'Analyzer' struct                                      |
|             | Output when compiling an invalid rush program                        |
|             | Analyzer: Validation of the 'main' function's signature              |
|             | Analyzer: The 'let_stmt' method                                      |
|             | Analysis of expressions during semantic analysis                     |
|             | Obtaining the type of expressions                                    |
|             | Validation of argument type compatibility in the analyzer            |
|             | Method for determining whether an expression is constant             |
|             | Redundant 'while' loop inside a rush program                         |
|             | Loop optimization in the analyzer                                    |
|             |                                                                      |
| 3.1.        | Tree-walking interpreter: Type definitions                           |
| 3.2.        | Tree-Walking Interpreter: 'Value' and 'InterruptKind' Definitions 23 |
| 3.3.        | Tree-Walking Interpreter: Beginning of Execution                     |
| 3.4.        | Tree-Walking Interpreter: Calling of Functions                       |
| 3.5.        | Example rush program                                                 |
| 3.6.        | Struct definition of the VM                                          |
| 3.7.        | Minimal pointer example in rush                                      |
| 3.8.        | VM Instructions for the minimal Pointer Example                      |
| 3.9.        | A recursive rush program                                             |
|             | Struct definition of a 'CallFrame'                                   |
|             | VM instructions matching the AST in 3.4                              |
|             | The 'run' method of the rush VM                                      |
| 3.13.       | Parts of the 'run_instruction' method of the rush VM                 |
| 11          | Circula page de instructions for a fictional analytecture            |
| 4.1.        | Simple pseudo-instructions for a fictional architecture              |
| 4.2.        | Compilation of infix-expressions targeting the VM                    |
| 4.3.        | Compilation of expressions targeting the VM                          |
| 4.4.        | A rush program containing a loop                                     |
| 4.5.        | Compiler output for the rush program in Listing 4.4                  |
| 4.6.        | Implementation of loops in the rush VM compiler                      |
| 4.7.<br>4.8 | Compilation of 'break' statements in the rush VM compiler            |
| 4 8         | Authore vvenassembly module in text representation 41                |

| 4.9.  | Simple WebAssembly module in binary representation              | • | <br> | • | 41       |
|-------|-----------------------------------------------------------------|---|------|---|----------|
| 4.10. | Definition of instruction opcodes                               |   | <br> |   | 43       |
| 4.11. | The 'Compiler' struct definition of the WebAssembly compiler    |   | <br> |   | 43       |
|       | Entry point of the Wasm compiler                                |   |      |   | 44       |
|       | Compilation of logical operators for Wasm                       |   |      |   | 45       |
|       | Example rush program                                            |   |      |   | 45       |
|       | Commented compiler output of the rush program in Listing 4.14.  |   |      |   | 46       |
|       |                                                                 |   |      |   |          |
|       | LLVM IR representation of the program in Listing 1.1            |   |      |   | 49       |
|       | Parts of the struct definition of the rush LLVM 'Compiler'      |   |      |   | 51       |
|       | Simple rush program containing two functions                    |   |      |   | 51       |
|       | LLVM IR generated from the program in 4.18                      |   |      |   | 51       |
| 4.20. | Compilation of the 'main' function using LLVM                   |   | <br> |   | 52       |
| 4.21. | Compilation of call-expressions using LLVM                      |   | <br> |   | 53       |
| 4.22. | Compilation of expressions Using LLVM                           |   | <br> |   | 53       |
|       | Compilation of integer infix-expressions using LLVM             |   |      |   | 54       |
|       | Compilation of let-statements using LLVM                        |   |      |   | 55       |
|       | Invoking LD to link an LLVM output.                             |   |      |   | 56       |
|       | A rush program containing a block-expression                    |   |      |   | 57       |
|       |                                                                 |   |      |   | 57<br>57 |
| 4.21. | C transpiler output of rush program in Listing 4.26             | ٠ | <br> | • | 37       |
| 5.1.  | Example assembly program for explaining register allocation     |   |      |   | 60       |
| 5.2.  | Alignment of values on the stack                                |   |      |   | 62       |
| 5.3.  | A rush program trying to alter the variable behind an argument. |   |      |   | 63       |
|       | - • • •                                                         |   |      |   | 63       |
| 5.4.  | A rush program altering the variable behind an argument         |   |      |   |          |
| 5.5.  | The assembly implementation of the 'exit' subroutine            |   |      |   | 68       |
| 5.6.  | Example rush program containing two functions                   |   |      |   | 68       |
| 5.7.  | Compiler output from the rush program in Listing 5.6            |   |      |   | 69       |
| 5.8.  | Example rush program containing a pointer                       |   |      |   | 70       |
| 5.9.  | RISC-V assembly output generated from Listing 5.8               |   | <br> |   | 70       |
| 5.10. | Fields of the RISC-V 'Compiler' struct                          |   | <br> |   | 71       |
| 5.11. | The 'Instruction' enum in the RISC-V compiler                   |   | <br> |   | 71       |
|       | The 'IntRegister' enumeration in the RISC-V compiler            |   |      |   | 72       |
|       | A rush program calculating the sum of two integers              |   |      |   | 73       |
|       | Assembly output of the rush program in Listing 5.13             |   |      |   | 73       |
|       | The 'get_int_reg' method of the RISC-V rush compiler            |   |      |   | 73       |
|       | <del>-</del>                                                    |   |      |   |          |
|       | Parts of the 'expression' method in the RISC-V rush compiler    |   |      |   | 74       |
|       | Parts of the 'infix_expr' method in the RISC-V rush compiler    |   |      |   | 75       |
|       | Parts of the 'infix_helper' method of the RISC-V rush compiler. |   |      |   | 76       |
|       | The 'define_main_fn' method of the RISC-V rush compiler         |   |      |   | 77       |
|       | The 'epilogue' method of the RISC-V rush compiler               |   |      |   | 78       |
| 5.21. | The 'save_expr_on_stack' method in the RISC-V rush compiler     |   | <br> |   | 78       |
| 5.22. | The 'call_expr' method in the RISC-V rush compiler              |   | <br> |   | 79       |
| 5.22. | The 'call_expr' method of the RISC-V rush compiler. (cont.)     |   | <br> |   | 80       |
|       | The 'call_expr' method of the RISC-V rush compiler. (cont.)     |   |      |   | 81       |
|       | A rush program containing a while-loop                          |   |      |   | 82       |
|       | Compiler output of the rush program in Listing 5.23             |   |      |   | 82       |
|       | Example rush program                                            |   |      |   | 84       |
|       |                                                                 |   |      |   | 85       |
|       | Compiler output from the rush program in Listing 5.25           |   |      |   |          |
|       | The 'Instruction' definition in the rush x64 compiler           |   |      |   | 87       |
|       | The 'IntValue' definition in the rush x64 compiler              |   |      |   | 88       |
| 5.29. | Overview of the x64 'Compiler' struct definition                |   | <br> |   | 88       |

| 5.30. Stack space reservation for values                            | 89 |
|---------------------------------------------------------------------|----|
| 5.31. Register allocation in the rush x64 compiler                  | 90 |
| 5.32. Another example rush program with two functions               | 90 |
| 5.33. Trimmed compiler output for the rush program in Listing 5.32  | 90 |
| 5.34. A rush example function containing if-expressions and a loop  | 91 |
| 5.35. Trimmed compiler output for the rush function in Listing 5.34 | 92 |
| 5.36. Compilation of integer division in x64                        | 94 |
| 5.37. Compilation of float comparisons in x64                       | 95 |

# **Bibliography**

- [Bac+60] J. W. Backus et al. "Report on the algorithmic language ALGOL 60". In: 3.5 (May 1960). Ed. by Peter Naur, pp. 299–314. DOI: 10.1145/367236.367262.
- [Pra73] Vaughan R. Pratt. "Top down Operator Precedence". In: Proceedings of the 1st Annual ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages. POPL '73. Boston, Massachusetts: Association for Computing Machinery, 1973, pp. 41–51. ISBN: 9781450373494. DOI: 10.1145/512927.512931.
- [Wir77] Niklaus Wirth. "What can we do about the unnecessary diversity of notation for syntactic definitions?" In: Communications of the ACM 20.11 (Nov. 1977), pp. 822–823. DOI: 10.1145/359863.359883.
- [KR88] Brian W. Kernighan and Dennis M. Ritchie. *The C Programming Language*. 2nd ed. Murray Hill, New Jersey: Prentice Hall, 1988.
- [Tor91] Linus Torvalds. Linux. 1991. URL: https://github.com/torvalds/linux.
- [Wal98] John Waldron. Introduction to RISC Assembly Language Programming. Pearson Education, 1998. ISBN: 0-201-39 82 8-1.
- [Lev00] John R. Levine. *Linkers and Loaders*. San Francisco, CA: Morgan Kaufmann, 2000.
- [Lat02] Chris Lattner. "LLVM: An Infrastructure for Multi-Stage Optimization". MA thesis. Urbana, IL: Computer Science Dept., University of Illinois at Urbana-Champaign, Dec. 2002.
- [Dan05a] Sivarama P Dandamudi. *Guide to RISC processors*. Ottawa, Canada: Springer International Publishing, Feb. 2005. ISBN: 0-387-21017-2.
- [Dan05b] Sivarama P. Dandamudi. Introduction to Assembly Language Programming: For Pentium and RISC Processors. 2nd ed. Springer International Publishing, 2005. ISBN: 0-387-20636-1.
- [Wir05] Niklaus Wirth. Compiler Construction. Zürich, 2005. ISBN: 0-201-40353-6.
- [TN07] Allen B. Tucker and Robert E. Noonan. *Programming Languages: Principles and Paradigms*. 2nd ed. McGraw-Hill Education, Nov. 2007. ISBN: 978-007-125439-7.
- [Mak09] Ronald Mak. Writing Compilers and Interpreters. 3rd ed. Indianapolis, IN: Wiley Publishing, 2009. ISBN: 978-0-470-17707-5.
- [Fan10] Dominic Fandrey. Clang/LLVM Maturity Report. Karlsruhe, Germany, June 2010.
- [Hol12] Nils M. Holm. Practical Compiler Construction. Raleigh, NC: Lulu Press, 2012.
- [Lov13] Robert Love. Linux System Programming. 2nd ed. Sebastopol, CA: O'Reilly Media, May 2013. ISBN: 978-1-449-33953-1.
- [CA14] Bruno Cardoso Lopes and Rafael Auler. Getting started with LLVM core libraries. Birmingham, UK: Packt Publishing, Aug. 2014. ISBN: 978-1-78216-692-4.
- [Lin+14] Tim Lindholm et al. The Java Virtual Machine Specification, Java SE 8 edition. Boston, MA: Addison-Wesley Professional, May 2014. ISBN: 978-0-13-390590-8.
- [Kol17] Daniel Kolsoi. Inkwell. 2017. URL: https://github.com/TheDan64/inkwell.

- [Mog17] Torben Ægidius Mogensen. Introduction to Compiler Design. 2nd ed. Copenhagen, Denmark: Springer International Publishing, 2017. ISBN: 978-3-319-66965-6.
- [PH17] David A. Patterson and John L. Hennessy. Computer Organization and Design RISC-V Edition: The Hardware Software Interface. The Morgan Kaufmann Series in Computer Architecture and Design. Morgan Kaufmann, Apr. 2017. ISBN: 978-0-12-812275-4.
- [PW17] David A. Patterson and Andrew Waterman. The RISC-V Reader: An Open Architecture Atlas. Strawberry Canyon LLC, Oct. 2017. ISBN: 978-0-9992491-0-9.
- [RA17] Steven Rodgers and Richard A. Uhlig. "X86: Approaching 40 and Still Going Strong". In: (June 2017).
- [Wat17] Des Watson. A practical approach to compiler construction. en. 1st ed. Undergraduate Topics in Computer Science. Cham, Switzerland: Springer International Publishing, Apr. 2017. ISBN: 978-3-319-52787-1.
- [Zhi17] Igor Zhirkov. Low-Level Programming: C, Assembly, and Program Execution on Intel® 64 Architecture. 1st ed. Saint Petersburg, Russia: APress, June 2017. ISBN: 978-1-4842-2403-8.
- [Kus18] Daniel Kusswurm. Modern X86 Assembly Language Programming. 2nd ed. Geneva, IL, USA: APress, Dec. 2018. ISBN: 978-1-4842-4063-2.
- [Che19] Boris Cherny. *Programming TypeScript*. Sebastopol, CA: O'Reilly Media, May 2019.
- [KN19] Steve Klabnik and Carol Nichols. *The Rust Programming Language*. San Francisco, CA: No Starch Press, 2019. ISBN: 978-1-7185-0044-0.
- [Ser19] Alejandro Serrano Mena. Practical Haskell: A Real World Guide to Programming. 2nd ed. Utrecht, The Netherlands: APress, Apr. 2019. ISBN: 978-1-4842-4479-1.
- [WA19] Andrew Waterman and Krste Asanović. "The RISC-V Instruction Set Manual Volume I: Unprivileged ISA". In: (Dec. 2019). Ed. by Andrew Waterman, Krste Asanović, and Sivive Inc. url: https://riscv.org/technical/specifications/.
- [Led20] Jim Ledin. Modern Computer Architecture and Organization. Birmingham, UK: Packt Publishing, Apr. 2020. ISBN: 978-1-83898-439-7.
- [Hsu21] Min-Yih Hsu. LLVM Techniques, Tips, and Best Practices Clang and Middle-End Libraries. Birmingham, UK: Packt Publishing, Apr. 2021. ISBN: 978-1-83882-495-2.
- [Jef21] Clinton L Jeffery. Build Your Own Programming Language. Birmingham, UK: Packt Publishing, Nov. 2021. ISBN: 978-1-80020-480-5.
- [McN21] Timothy Samual McNamara. *Rust in Action*. In Action. New York, NY: Manning Publications, Aug. 2021. ISBN: 9781617294556.
- [CCW22] Kito Cheng, Jessica Clarke, and Andrew Waterman. RISC-V ABIs Specification. Nov. 2022. URL: https://github.com/riscv-non-isa/riscv-elf-psabi-doc.
- [Lu+22] H.J. Lu et al. System V Application Binary Interface. AMD64 Architecture Processor Supplement. Version 1.0. Dec. 2022. URL: https://gitlab.com/x86-psABIs/x86-64-ABI.
- [Ros22] WebAssembly Core Specification. Version 2.0. W3C, Apr. 19, 2022. URL: https://www.w3.org/TR/wasm-core-2/.
- [Sen22] Kumar N Sendil. *Practical WebAssembly*. Birmingham, UK: Packt Publishing, May 2022. ISBN: 978-1-83882-800-4.

# A. Complete Grammar of rush in EBNF Notation

```
_ grammar.ebnf _
   Program = { Item } ;
2
                        = FunctionDefinition | LetStmt ;
   7
   Parameter
                        = [ 'mut' ] , ident , ':' , Type ;
   Block = '{' , { Statement } , [ Expression ] , '}' ;
9
   Type = { '*' } , ( ident
                       | '(' , ')' );
12
   Statement
                 = LetStmt | ReturnStmt | LoopStmt | WhileStmt | ForStmt
13
                 | BreakStmt | ContinueStmt | ExprStmt ;
14
                 = 'let' , [ 'mut' ] , ident , [ ':' , Type ] , '='
  LetStmt
15
                  , Expression , ';' ;
16
                 = 'return' , [ Expression ] , ';' ;
   ReturnStmt
17
                 = 'loop', Block, [ ';' ];
= 'while', Expression, Block, [ ';' ];
= 'for', ident, '=', Expression, ';', Expression, ';', Expression, Block, [ ';' ];
   LoopStmt
18
19
   WhileStmt
  ForStmt
21
22 BreakStmt
                 = 'break' , ';'
23 ContinueStmt = 'continue'
                 = ExprWithoutBlock , ';'
| ExprWithBlock , [ ';' ] ;
24 ExprStmt
25
26
27 Expression
                      = ExprWithoutBlock | ExprWithBlock;
28 ExprWithBlock
                      = Block | IfExpr ;
                      = 'if' , Expression , Block , [ 'else' , ( IfExpr
29 IfExpr
                                                                  Block ) ];
30
   ExprWithoutBlock = int
31
                      I float
32
                      I bool
33
                      I char
34
                      | ident
35
36
                      | PrefixExpr
                      | InfixExpr
37
38
                      AssignExpr
39
                      CallExpr
                      | CastExpr
40
                      | '(' , Expression , ')' ;
41
42 PrefixExpr
                      = PREFIX_OPERATOR , Expression ;
                      = Expression , INFIX_OPERATOR , Expression ;
43 InfixExpr
  (* The left hand side can only be an `ident` or a `PrefixExpr` with the `*`

→ operator *)

  AssignExpr
                      = Expression , ASSIGN_OPERATOR , Expression ;
  CallExpr
                      = ident , '(' , [ ArgumentList ] , ')' ;
                     = Expression , { ',' , Expression } , [ ',' ] ;
= Expression , 'as' , Type ;
47 ArgumentList
  CastExpr
48
49
   ident = LETTER , { LETTER | DIGIT } ;
```

```
51
52
53
                                         'f' ) ;
54
    char = "'" , ( ASCII_CHAR - '\'
55
                  | '\' , ( ESCAPE_CHAR
56
57
                           | 'x' , 2 * HEX ) ) , "'" ;
58
    bool = 'true' | 'false';
59
60
    comment = '//' , { CHAR } , ? LF ?
61
            | '/*' , { CHAR } , '*/' ;
62
63
                = 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I'
64
    LETTER
                      | 'K' | 'L' | 'M' | 'N' |
                                                        'P'
                  'J'
                                                  '0'
                                                            ' Q '
65
                  'S'
                            | 'U' |
                                                        'Y'
                      1 'T'
                                     ' \ '
                                           ' W '
                                                  ١Х١
                                                              ' Z '
66
                  'b'
                        'c'
                            | 'd'
                                  'e' |
                                           'f' |
                                                  ' q '
                                                      | 'h'
                                                              'i' | 'i'
67
                  'k'
                      | 'l' | 'm' | 'n' | 'o' | 'p' | 'q'
                                                            | 'r' | 's'
68
                      | 'u' | 'v' | 'w' | 'x' | 'y' | 'z'
                  't'
69
                = '0' | '1' | '2' | '3' | '4' | '5' | '6'
70
    DIGIT
                191
71
                = DIGIT | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'a'
72
                | 'b' | 'c' | 'd' | 'e' | 'f' ;
73
74
                = ? any UTF-8 character ? ;
    ASCII_CHAR = ? any ASCII character ? ;
75
    ESCAPE_CHAR = '\' | 'b' | 'n' | 'r' | 't' ;
76
77
                        = '!' | '-' | '&' | '*' ;
    PREFIX_OPERATOR
78
                        = ARITHMETIC_OPERATOR | RELATIONAL_OPERATOR
    INFIX_OPERATOR
79
                         | BITWISE_OPERATOR | LOGICAL_OPERATOR ;
80
   ARITHMETIC_OPERATOR = '+' | '-' | '*' | '/' | '%' | '**' ;

RELATIONAL_OPERATOR = '==' | '!=' | '<' | '>' | '<=' | '>=' ;
81
82
                        = '<<' | '>>' | ' | '&' | '^' ;
83
    BITWISE_OPERATOR
                        = '&&' | '||' ;
= '=' | '+=' | '-=' | '*=' | '/=' | '%='
84
    LOGICAL_OPERATOR
85
    ASSIGN_OPERATOR
                         | '**=' | '<<=' | '>>=' | '|=' | '&=' | '^=' ;
86
```