

# The Conversion of Source Code to Machine Code

Explaining the Basics of Compiler Construction Using a Self-Made Compiler

Silas Groh, Mik Müller

January 29, 2023

CFG Wuppertal

#### **Abstract**

Programming languages are undoubtedly of great importance for various aspects of modern-day life. Even if they remain unnoticed, digital systems running programs written in some sort of programming language are ubiquitous. However, there are numerous ways of implementing a programming language. A language designer could choose an interpreted or a compiled approach for their language's implementation. Both ways of program execution come with their own advantages and disadvantages.

This paper aims to inform the reader about different means of program execution, focussing on compiler construction. However, we will only focus on the basics since implementing a programming language is often a demanding task. In order to include practical examples, we will explain concepts on the basis of our own programming language called *rush*. During the implementation of rush, the focus for this paper has shifted slightly. As the title suggests, we originally planned on only implementing a compiler. However, there are numerous architectures which a compiler could target and settling on just one felt like the reader would miss out on too much. Therefore, we have implemented rush using one interpreter, one virtual machine, and five compilers.

In chapter 1, we will give an introduction to implementing a programming language. Moreover, the rush programming language and its characteristics are presented. In chapter 2, the process of analyzing the program's syntax and semantics is explained. Chapter 3 focuses on how interpreters can be used in order to implement an interpreted programming language. Here, we will differentiate between a tree-walking interpreter and a virtual machine. The latter also serves as the target architecture for one of the five compilers. Chapter 4 illustrates how compilation to high-level targets works. As examples for high-level targets, we will present the compiler targeting the virtual machine, a compiler targeting WebAssembly, and a compiler which uses the popular LLVM framework. Chapter 5 focuses on how compilers targeting low-level architectures can be implemented. For this, we will present a compiler targeting RISC-V assembly and another compiler targeting  $x86\_64$  assembly. Lastly, Chapter 6 presents final thoughts and a conclusion on the topic of implementing a programming language.

<sup>&</sup>lt;sup>1</sup>high-level targets: in this case machine independent

<sup>&</sup>lt;sup>2</sup>low-level targets: specific to one target architecture and operating system

# **Contents**

| Ι. | Introduction |                                                         |  |  |  |  |  |  |  |
|----|--------------|---------------------------------------------------------|--|--|--|--|--|--|--|
|    | 1.1.<br>1.2. | Stages of Compilation                                   |  |  |  |  |  |  |  |
| 2. | Ana          | Analyzing the Source                                    |  |  |  |  |  |  |  |
|    | 2.1.         | Lexical and Syntactical Analysis                        |  |  |  |  |  |  |  |
|    |              | 2.1.1. Formal Syntactical Definition by a Grammar       |  |  |  |  |  |  |  |
|    |              | 2.1.2. Grouping of Characters Into Tokens               |  |  |  |  |  |  |  |
|    |              | 2.1.3. Constructing a Tree                              |  |  |  |  |  |  |  |
|    | 2.2.         | Semantic Analysis                                       |  |  |  |  |  |  |  |
|    | 2.2.         | 2.2.1. Defining the Semantics of a Programming Language |  |  |  |  |  |  |  |
|    |              | 2.2.2. The Semantic Analyzer                            |  |  |  |  |  |  |  |
|    |              | v                                                       |  |  |  |  |  |  |  |
|    |              | 2.2.3. Early Optimizations                              |  |  |  |  |  |  |  |
| 3. | Inte         | preting the Program                                     |  |  |  |  |  |  |  |
|    | 3.1.         | Tree-Walking Interpreters                               |  |  |  |  |  |  |  |
|    |              | 3.1.1. Implementation                                   |  |  |  |  |  |  |  |
|    |              | 3.1.2. Example                                          |  |  |  |  |  |  |  |
|    |              | 3.1.3. Supporting Pointers                              |  |  |  |  |  |  |  |
|    | 3.2.         | Using a Virtual Machine                                 |  |  |  |  |  |  |  |
|    | 0.2.         | 3.2.1. Defining a Virtual Machine                       |  |  |  |  |  |  |  |
|    |              | $\overline{\mathcal{G}}$                                |  |  |  |  |  |  |  |
|    |              |                                                         |  |  |  |  |  |  |  |
|    |              | 3.2.3. Comparing the VM to the Tree-Walking Interpreter |  |  |  |  |  |  |  |
|    |              | 3.2.4. The rush Virtual Machine                         |  |  |  |  |  |  |  |
|    |              | 3.2.5. How the Virtual Machine Executes A rush Program  |  |  |  |  |  |  |  |
|    |              | 3.2.6. Fetch-Decode-Execute Cycle of the VM             |  |  |  |  |  |  |  |
| 4. | Con          | Compiling to High-Level Targets                         |  |  |  |  |  |  |  |
|    |              | How a Compiler Translates the AST                       |  |  |  |  |  |  |  |
|    | 1.1.         | 4.1.1. The Compiler Targeting the rush VM               |  |  |  |  |  |  |  |
|    | 4.2.         | Compilation to WebAssembly                              |  |  |  |  |  |  |  |
|    | 4.2.         | 4.2.1. WebAssembly Modules                              |  |  |  |  |  |  |  |
|    | 19           |                                                         |  |  |  |  |  |  |  |
|    | 4.3.         | Using LLVM for Code Generation                          |  |  |  |  |  |  |  |
|    |              | 4.3.1. The Role of LLVM in a Compiler                   |  |  |  |  |  |  |  |
|    |              | 4.3.2. The LLVM Intermediate Representation             |  |  |  |  |  |  |  |
|    |              | 4.3.3. The rush Compiler Using LLVM                     |  |  |  |  |  |  |  |
|    |              | 4.3.4. Final Code Generation: The Linker                |  |  |  |  |  |  |  |
|    |              | 4.3.5. Conclusions                                      |  |  |  |  |  |  |  |
| 5. | Con          | piling to Low-Level Targets                             |  |  |  |  |  |  |  |
| ٥. |              | Low-Level Programming Concepts                          |  |  |  |  |  |  |  |
|    | J.1.         | 5.1.1. Sections of an ELF File                          |  |  |  |  |  |  |  |
|    |              | 5.1.2. Assemblers and Assembly Language                 |  |  |  |  |  |  |  |
|    |              |                                                         |  |  |  |  |  |  |  |
|    |              | 5.1.3. Registers                                        |  |  |  |  |  |  |  |

| 5.2. RISC-V: Compiling to a Modern RISC Architecture       6         5.2.1. Register Layout       6         5.2.2. Memory Access Through the Stack       6         5.2.3. Calling Convention       6         5.2.4. The Core Library       6         5.2.5. RISC-V Assembly       6         5.2.6. The rush Compiler Targeting RISC-V Assembly       6         5.3. x86_64: A Compiler for a CISC Architecture       7         5.3.1. Register Layout       7         5.3.2. Memory Access Through the Stack       7         5.3.3. Calling Convention       7         5.3.4. X86_64 Assembly       7         5.4. Conclusion       7         5.4.1. RISC vs CISC Architectures       7         6. Final Thoughts and Conclusions       7         List of Tables       7         List of Listings       7 |  | 5.1.4. Using Memory: The Stack and the Heap |                                                  |          |  |  |  |  |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|---------------------------------------------|--------------------------------------------------|----------|--|--|--|--|
| 5.2.1. Register Layout       6         5.2.2. Memory Access Through the Stack       6         5.2.3. Calling Convention       6         5.2.4. The Core Library       6         5.2.5. RISC-V Assembly       6         5.2.6. The rush Compiler Targeting RISC-V Assembly       6         5.3. x86_64: A Compiler for a CISC Architecture       7         5.3.1. Register Layout       7         5.3.2. Memory Access Through the Stack       7         5.3.3. Calling Convention       7         5.3.4. X86_64 Assembly       7         5.4.1. RISC vs CISC Architectures       7         6. Final Thoughts and Conclusions       7         List of Figures       7         List of Listings       7                                                                                                     |  |                                             | 5.1.5. Calling Conventions                       | 62       |  |  |  |  |
| 5.2.2. Memory Access Through the Stack       6         5.2.3. Calling Convention       6         5.2.4. The Core Library       6         5.2.5. RISC-V Assembly       6         5.2.6. The rush Compiler Targeting RISC-V Assembly       6         5.3. x86_64: A Compiler for a CISC Architecture       7         5.3.1. Register Layout       7         5.3.2. Memory Access Through the Stack       7         5.3.3. Calling Convention       7         5.3.4. X86_64 Assembly       7         5.4.1. RISC vs CISC Architectures       7         6. Final Thoughts and Conclusions       7         List of Figures       7         List of Listings       7                                                                                                                                            |  | 5.2.                                        | RISC-V: Compiling to a Modern RISC Architecture  | 64       |  |  |  |  |
| 5.2.3. Calling Convention       6         5.2.4. The Core Library       6         5.2.5. RISC-V Assembly       6         5.2.6. The rush Compiler Targeting RISC-V Assembly       6         5.3. x86_64: A Compiler for a CISC Architecture       7         5.3.1. Register Layout       7         5.3.2. Memory Access Through the Stack       7         5.3.3. Calling Convention       7         5.3.4. X86_64 Assembly       7         5.4.1. RISC vs CISC Architectures       7         6. Final Thoughts and Conclusions       7         List of Figures       7         List of Listings       7                                                                                                                                                                                                   |  |                                             | 5.2.1. Register Layout                           | 64       |  |  |  |  |
| 5.2.4. The Core Library       6         5.2.5. RISC-V Assembly       6         5.2.6. The rush Compiler Targeting RISC-V Assembly       6         5.3. x86_64: A Compiler for a CISC Architecture       7         5.3.1. Register Layout       7         5.3.2. Memory Access Through the Stack       7         5.3.3. Calling Convention       7         5.3.4. X86_64 Assembly       7         5.4. Conclusion       7         5.4.1. RISC vs CISC Architectures       7         6. Final Thoughts and Conclusions       7         List of Tables       7         List of Listings       7                                                                                                                                                                                                              |  |                                             | · · · · · · · · · · · · · · · · · · ·            | 65       |  |  |  |  |
| 5.2.5. RISC-V Assembly       6         5.2.6. The rush Compiler Targeting RISC-V Assembly       6         5.3. x86_64: A Compiler for a CISC Architecture       7         5.3.1. Register Layout       7         5.3.2. Memory Access Through the Stack       7         5.3.3. Calling Convention       7         5.3.4. X86_64 Assembly       7         5.4. Conclusion       7         5.4.1. RISC vs CISC Architectures       7         6. Final Thoughts and Conclusions       7         List of Figures       7         List of Listings       7                                                                                                                                                                                                                                                     |  |                                             |                                                  | 65       |  |  |  |  |
| 5.2.6. The rush Compiler Targeting RISC-V Assembly       6         5.3. x86_64: A Compiler for a CISC Architecture       7         5.3.1. Register Layout       7         5.3.2. Memory Access Through the Stack       7         5.3.3. Calling Convention       7         5.3.4. X86_64 Assembly       7         5.4. Conclusion       7         5.4.1. RISC vs CISC Architectures       7         6. Final Thoughts and Conclusions       7         List of Figures       7         List of Listings       7                                                                                                                                                                                                                                                                                            |  |                                             |                                                  | 66       |  |  |  |  |
| 5.3. x86_64: A Compiler for a CISC Architecture       7         5.3.1. Register Layout       7         5.3.2. Memory Access Through the Stack       7         5.3.3. Calling Convention       7         5.3.4. X86_64 Assembly       7         5.4. Conclusion       7         5.4.1. RISC vs CISC Architectures       7         6. Final Thoughts and Conclusions       7         List of Figures       7         List of Listings       7                                                                                                                                                                                                                                                                                                                                                               |  |                                             | · ·                                              | 66       |  |  |  |  |
| 5.3.1. Register Layout       7         5.3.2. Memory Access Through the Stack       7         5.3.3. Calling Convention       7         5.3.4. X86_64 Assembly       7         5.4. Conclusion       7         5.4.1. RISC vs CISC Architectures       7         6. Final Thoughts and Conclusions       7         List of Figures       7         List of Tables       7         List of Listings       7                                                                                                                                                                                                                                                                                                                                                                                                |  |                                             |                                                  | 69       |  |  |  |  |
| 5.3.2. Memory Access Through the Stack       7         5.3.3. Calling Convention       7         5.3.4. X86_64 Assembly       7         5.4. Conclusion       7         5.4.1. RISC vs CISC Architectures       7         6. Final Thoughts and Conclusions       7         List of Figures       7         List of Listings       7                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |  | 5.3.                                        |                                                  | 70       |  |  |  |  |
| 5.3.3. Calling Convention       7         5.3.4. X86_64 Assembly       7         5.4. Conclusion       7         5.4.1. RISC vs CISC Architectures       7         6. Final Thoughts and Conclusions       7         List of Figures       7         List of Tables       7         List of Listings       7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |  |                                             |                                                  | 70       |  |  |  |  |
| 5.3.4. X86_64 Assembly       7         5.4. Conclusion       7         5.4.1. RISC vs CISC Architectures       7         6. Final Thoughts and Conclusions       7         List of Figures       7         List of Tables       7         List of Listings       7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |  |                                             | · ·                                              | 70       |  |  |  |  |
| 5.4. Conclusion                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |  |                                             |                                                  | 70       |  |  |  |  |
| 5.4.1. RISC vs CISC Architectures                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |  | F 1                                         |                                                  | 70       |  |  |  |  |
| 6. Final Thoughts and Conclusions 7 List of Figures 7 List of Tables 7 List of Listings 7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |  | 5.4.                                        |                                                  | 70<br>70 |  |  |  |  |
| List of Figures 7 List of Tables 7 List of Listings 7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |  |                                             | 5.4.1. RISC VS CISC Architectures                | 70       |  |  |  |  |
| List of Tables 7 List of Listings 7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |  | 6. Fina                                     | l Thoughts and Conclusions                       | 71       |  |  |  |  |
| List of Listings 7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |  | List of                                     | Figures                                          | 73       |  |  |  |  |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |  |                                             |                                                  |          |  |  |  |  |
| Duly 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |  |                                             |                                                  |          |  |  |  |  |
| Bibliography 7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |  | Bibliog                                     | raphy                                            | 77       |  |  |  |  |
| Appendix A. Complete Grammar of rush in EBNF Notation 7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |  | Append                                      | lix A. Complete Grammar of rush in EBNF Notation | 79       |  |  |  |  |

# 1. Introduction

Computer programs are often written in formal, purposefully designed languages. These languages introduce many constraints in order to allow programmers to implement algorithms in a structured and precise manner. Since programming languages should be easy to write for a human whilst being easy to understand by a computer, they are often falsely regarded as mysterious. The fundamental challenge is that a computer is only able to interpret a sequence of CPU instructions instead of a written program. Therefore, the source program has to be translated into such a sequence of instructions before it can be executed by the computer. Because programming languages come with many formal constraints, the translation process can be defined formally as well. Therefore, this translation can be automated and implemented as an algorithm on its own. This translation process is referred to as compilation, and is usually performed by a program called a compiler. However, the output instruction sequence must represent the identical algorithm specified in the source code. It is apparent that compilation requires significant effort and must obey complex rules since it should translate the source program precisely without altering its meaning.

Another common method of program execution is to implement a program which interprets the source code directly. This executor is referred to as an *interpreter*. Although compilers and interpreters share some of their core principles, their major difference is that the interpreter omits translation. The implementation of an interpreter is often significantly easier and smaller since the interpreter only has to understand the source program in order to execute it. In other words, the step of translating the source program into another form can be completely dismissed. However, implementing an interpreter is only rational if it is written using a high-level language like C or Rust. Implementation of an interpreter in a high-level language is often favorable since the host language is able to save the programmer a lot of work. If an interpreter was to be implemented using a low-level language like assembly, there would not be much work done by the host language. Compilers therefore played an essential role in the early days of computing since high-level languages were yet to be developed.

The first compiler was implemented around 1956 and aimed to translate *Fortran* to computer instructions. However, the success of this programming endeavor was not assured until the program was completed. In total, the program involved roughly 18 man-years of work and is thereby regarded as one of the largest programming projects of the time. To this day, new compilers are created and innovations in the field of programming languages can be observed regularly. Therefore, compiler construction can still be considered a fundamental and relevant topic in computer science [Wir05, p. 6].

TODO: Advantages of high-level langs /p. 9|Dandamuri2017

# 1.1. Stages of Compilation

In order to minimize intricacy and to maximize modularity, compilation often involves several individual steps. Here, the output of step s serves as the input for step s + 1. However, partitioning the compiler into as many steps as possible is prone to cause inefficiencies during compilation. Separating the process of compilation into individual steps was the

predominant technique until about 1980. Due to limited memory of the computers at the time, a single-step compiler could not be implemented. Therefore, only the individual steps of the compiler would fit, as each step occupied a considerate amount of machine memory. These types of compilers are called *multi-pass compilers*. However, the output of each step would be serialized and written to disk, ready to be read by the next step. It is obvious that this partitioning leads to a lot of performance overhead, since disk access if often significantly slower than memory access. Nowadays, we can mitigate these performance issues by implementing the compiler as a single program. Therefore, the compiler can avoid slow disk operations by keeping intermediate structures solely in memory.



Figure 1.1 – Different Steps of Compilation

- 1. The lexical analysis (*lexing*) translates sequences of characters of the source program into their corresponding symbols (*tokens*) of the language's vocabulary. Tokens, such as identifiers, operators, and delimiters are recognized by examining each character of the source program in sequential order.
- 2. The syntactical analysis (parsing) transforms the previously generated sequence of tokens into a tree data structure which directly represents the structure of the source program.
- 3. The semantic analysis is responsible for validating that the source program follows the semantic rules of the language. Furthermore, this step often generates a new, similar data structure which contains additional type annotations and early optimizations.
- 4. Code generation traverses the data structure generated by step 3 in order to generate a sequence of target-machine instructions. Due to likely constraints considering the target instruction set, the code generation is often considered to be the most involved step of compilation.

#### [Wir05, pp. 6–7]

Many modern compilers tend to combine steps 1 and 2 into a single step. In this approach, the parser holds a lexer and instructs it to return the next token as the parser demands it. Using this approach, memory usage is minimized since the parser only considers a few tokens instead of a complete sequence. Figure 1.2 shows an altered chart considering that change.

#### **TODO: citation?**



Figure 1.2 – Different Steps of Compilation (altered)

# 1.2. Characteristics of the rush Programming Language

For this paper, we have developed and implemented a simple programming language called rush<sup>1</sup>. The language features a *static type system*, *arithmetic operators*, *logical operators*, *local and global variables*, *pointers*, *if-else expressions*, *loops*, and *functions*. In order to introduce the language, we will now consider the code in listing 1.1.

```
fn main() {
1
2
        exit(fib(10));
3
4
    fn fib(n: int) -> int {
5
        if n < 2 {
6
7
             n
        } else {
8
             fib(n - 2) + fib(n - 1)
9
10
    }
11
```

Listing 1.1 – Generating Fibonacci Numbers Using rush

This rush program can be used to generate numbers included in the Fibonacci sequence. In the code, a function named fib is defined using the fn keyword. This function accepts the parameter n, it denotes the position of the number to be calculated. Since int is specified as the type of the parameter, the function may be called using any integer value as its argument. However, the constraint  $n \in \mathbb{N}$  must be valid in order for this function to return the correct result<sup>2</sup>. In this example, the main function calls the fib function using the natural number 10. In rush, every valid program must contain exactly one main function since program execution will start there. Even though rush has a return statement, the body of the fib function contains no such statement. This is because blocks like the one of the function fib return the result of their last expression. The if-else construct must therefore also be an expression since it represents the last entity in the block, and it is not followed by a semicolon. In this example, if the input parameter n is less than 2, it is returned without modification. Otherwise, the function calls itself recursively in order to calculate the sum of the preceding Fibonacci numbers n-2 and n-1. Therefore, the result value of the entire if-expression is calculated by using one of the two branches. Since the if-else construct is also an expression, there is no need for redundant return statements. In line 2, the exit function is called. However, this function is not defined anywhere. Nevertheless, the code still executes without any errors. This is due to the fact that the exit function is a builtin function as it is used to exit a program using the specified exit code.

In the git commit 7f28ebc, the entire rush project includes 17121 lines of source code<sup>3</sup>. On the first sight, this might seem like a large number for a simple programming language. However, the rush project includes a lexer, a parser, a semantic analyzer, five compilers, one interpreter, and several other tools like a language server for IDE support. In the rush project, most of the previously presented stages of compilation are implemented as their own individual code modules. This way, each component of the programming language can be developed, tested, and maintained separately.

<sup>&</sup>lt;sup>1</sup>Capitalization of the name is intentionally omitted

<sup>&</sup>lt;sup>2</sup>Assuming the function should comply with the original Fibonacci definition

<sup>&</sup>lt;sup>3</sup>Blank lines and comments are not counted

# 2. Analyzing the Source

# 2.1. Lexical and Syntactical Analysis

As previously mentioned, the first step during compilation or program execution is the *lexical* and *syntactical analysis*. Program source text is, without previous processing, just *text* i.e., a sequence of characters. Before the computer can even begin to analyze the semantics and meaning of a program it has to first *parse* the program source text into an appropriate data structure. This is done in two steps that are closely related and often combined, the *lexical analysis* performed by a *lexer* and the *syntactical analysis* performed by a *parser*.

#### 2.1.1. Formal Syntactical Definition by a Grammar

Just like every natural language, most programming languages also conform to a grammar. However, grammars for programming languages most often are of type 2 or 3 in the Chomsky hierarchy, that is *context-free* and *regular* languages, whereas natural languages often are type 1 or 0. **TODO:** citation Additionally, it is not uncommon for parser writers to formally define the grammar using some notation. Popular options include  $BNF^1$  and  $EBNF^2$ , the latter of which we use here. This paper does not fully explain these notations, however Listing 2.1 shows a short example grammar notated using EBNF. For reference, Appendix A contains the full grammar of rush. **TODO:** example for type 1 language '([)]'? **TODO:** explain start symbol? **TODO:** formal definition of grammars G = (N, T, S, P)?

Listing 2.1 – Grammar for Basic Arithmetic in EBNF Notation

One thing worth explaining is the '-' at the end of line 5. It is used to exclude a set of symbols from the preceding rule. As there is no symbol following the dash in this case, the empty word, called  $\epsilon$ , is excluded from the preceding repetition. Thus, the repetition cannot be repeated zero times here, as that would produce the empty word, which is excluded. So the notation '{ ...}-' is used to represent repetitions of one or more times, whereas the same without the trailing dash represents repetitions of zero or more times.

<sup>&</sup>lt;sup>1</sup>Backus-Naur Form, named after the two main inventors [Bac+60]

 $<sup>^2</sup>$ Extended Backus-Naur Form, an extended version of BNF with added support for repetitions and options without relying on recursion, first proposed by Niklaus Wirth in 1977 [Wir77] followed by many slight alterations. The version used in this paper is defined by the ISO/IEC 14977 standard.

#### 2.1.2. Grouping of Characters Into Tokens

Before the syntax of a program is validated it is common to have a lexer group certain sequences of characters into *tokens*. The set of tokens a language has is the union of the set of all terminal symbols used in context-free grammar rules and the set of regular grammar rules. For the language defined in Listing 2.1 these are the five operators '+', '-', '\*, '/', '\*\*, and the 'integer' non-terminal.

```
____ crates/rush-parser/src/lexer.rs ___

pub struct Lexer<'src> {
    input: &'src str,
    reader: Chars<'src>,
    location: Location<'src>,
    curr_char: Option<char>,
    next_char: Option<char>,
}
```

Listing 2.2 - The rush 'Lexer' Struct Definition

The specifics of implementing a lexer are not explored in this paper, however a basic overview is still provided. The base principal of a lexer is to iterate over the characters of the input to produce tokens. Depending on the target language it might however be required to scan the input using an *n*-sized window i.e., observing *n* characters at a time. In the case of rush this *n* is 2, resulting in the Lexer struct not only storing the current character but also the next character as seen in Listing 2.2. For clarity, Table 2.1 shows the values of curr\_char and next\_char during processing of the input '1+2\*\*3'. Here every row in the table represents one point in time displaying the lexer's current state.

| Calls | State       | curr_char | next_char | Output Token |
|-------|-------------|-----------|-----------|--------------|
| 0     | 1 + 2 * * 3 | None      | None      |              |
| 0     | 1 + 2 * * 3 | None      | Some('1') |              |
| 0     | 1 + 2 * * 3 | Some('1') | Some('+') |              |
| 1     | 1 + 2 * * 3 | Some('+') | Some('2') | Int(1)       |
| 2     | 1 + 2 * * 3 | Some('2') | Some('*') | Plus         |
| 3     | 1 + 2 * * 3 | Some('*') | Some('*') | Int(2)       |
| 4     | 1 + 2 * * 3 | Some('*') | Some('3') |              |
| 4     | 1 + 2 * * 3 | Some('3') | None      | Pow          |
| 5     | 1 + 2 * * 3 | None      | None      | Int(3)       |

Table 2.1 – Advancing Window of a Lexer

As explained in Section 1.1, many modern language implementations have the lexer produce tokens on demand. Thus, a lexer requires one public method called something like next\_token reading and returning, as the name suggests, the next token. In Table 2.1 the column on the left displays how many times the next\_token method has been called by the parser. In the first three rows this count is still 0, as this happens during initialization of the lexer in order to fill the curr\_char and next\_char fields with sensible values before the first token in requested. The Pow token, composed of two '\*' symbols, requires the lexer to advance two times before it can be returned, which is represented by the two rows in which the call count is 4. A simplified Token struct definition for the example language from Listing 2.1 is shown in Listing 2.3.

In addition to the current and next character, a lexer also has to keep track of the current position in the source text for it to provide helpful diagnostics with locations to the user. This is done in the location field which is incremented every time the lexer advances to the

next character. While producing a token the lexer can then read this field once at the start and once after having read the token and save the two values in the token's span.

```
struct Token {
1
2
         kind: TokenKind,
3
         span: Span,
    }
4
5
    enum TokenKind {
6
         Int(u64),
7
8
9
                // +
         Plus,
         Minus, //
10
         Star,
11
         Slash, //
12
13
         Pow,
    }
14
15
    struct Span {
16
         start: Location,
17
         end: Location,
18
    }
19
```

Listing 2.3 – Simplified
'Token' Struct
Definition

A special case worth mentioning are comments. As explained later in Section 2.1.3, depending on the parser, comments may be simply ignored and skipped during lexical analysis, or get their own token kind and be treated similar to string literals.

TODO: maybe mention having spans inclusive or exclusive

#### 2.1.3. Constructing a Tree

The parser uses the generated tokens in order to construct a tree representing the program's syntactic structure. Depending on how the parser should be used, this can either be a Concrete Syntax Tree (CST) or an Abstract Syntax Tree (AST). The former still contains information about all input tokens with their respective locations, whilst the latter only stores the abstract program structure with just the relevant information for basic analysis and execution. Therefore, a CST is usually used for development tools like formatters and intricate linters and analyzers where it is important to preserve stylistic choices made by the programmer or to know the exact location of every token. However, an AST is enough for interpretation and compilation as it preserves the semantic meaning of the program. Figure 2.1 shows an AST for the program '1+2\*\*3' in the language notated in Listing 2.1 on page 4. In the case of rush an AST with limited location information is used, because rush's semantic analyzer is still basic enough to work with that, and,

as discussed, execution and compilation requires no CST.

Not every parser is the same and there are a few different strategies for implementing one. These strategies are categorized into top-down parsers and bottom-up parsers. The main difference between them being the kind of tree traversal they perform. Top-down parsers construct the syntax tree in a pre-order manner, meaning a parent node is always processed before its children nodes. Hence, the syntax tree is constructed from top to bottom, starting with the root node. Bottom-up parsers instead perform post-order traversal. That way, all child nodes



Figure 2.1 – Abstract Syntax Tree for '1+2\*\*3'

are processed before their parent node. This results in the tree being constructed from the leafs at the bottom upwards to the root.

Top-down and bottom-up parsers are further categorized into many more subcategories. The two we will focus on here are LL(k) parsers and LR(k) parsers. TODO: 'we will' ok? TODO: citation for naming convention These are named after the direction of reading the tokens, 'L' being from left to right, and the derivation they use. 'L' is the leftmost derivation and 'R' the rightmost derivation. TODO: what are derivations? + citation The parenthesized k represents a natural number with  $k \in \mathbb{N}_0$  describing the number of

tokens for lookahead. Often k is either 1 or 0 for a two or one wide window respectively. This window moves just like previously explained for the lexer, and observes k tokens, not characters, simultaneously. Since in most cases k is 1, it is common to omit specifying it and to just speak of 'LL' and 'LR' parsers.

Alternatively to utilizing lookahead tokens, it is possible to create parsers with backtracking. Using that approach, a parser has to guess which syntax construct follows, if it can't concretely know based on the first token. When it then later detects an unexpected symbol, instead of throwing a syntax error, it cancels and returns to the point of decision-making to try the next possible construct. This of course comes with overhead and usually unclear error messages, which is why this method is rarely used and the superior lookahead method is preferable.

An example for LR parsing is the *shift-reduce* parsing approach, which is partially explained later on page 11. One thing to note however, is that LR parsers are generally very complicated to implement manually. LL parsers on the contrary are usually much simpler to implement, but come with a limitation. By design, they must recognize a node by its first n = k + 1 tokens, where n is the window size. However, due to that restriction, not every context-free language can be parsed by a/an LL parser. An example for that is given in Listing 2.4.

Listing 2.4 – Example Language a Traditional LL(1) Parser Cannot Parse

Most LL parsers are *recursive-descent* parsers, including the rush parser. Implementation of such a recursive-descent parser is rather uncomplicated. Assuming the grammar respects the mentioned limitation, every context-free grammar rule is mapped to one method on a Parser struct. In the example grammar from Listing 2.1 on page 4 again, these are all the capitalized rules highlighted in yellow. Additionally, a matching node type is defined for each context-free rule, holding the relevant information for execution. In Rust the mapping from EBNF grammar notation to type definitions is very simple as displayed in Table 2.2. **TODO:** struct signature example?

**Table 2.2** – Mapping From EBNF Grammar to Rust Type Definitions

```
EBNF
                                       Rust
                                       struct A { b: B, c: C }
A = B , C ;
A = B , [ C ] ;
                                       struct A { b: B, c: Option<C> }
                                       struct A { b: B, c: Vec<C> }
A = B , \{C\};
A = B , { C }-;
                                       struct A { b: B, c: Vec<C> }
A = B \mid C;
                                       enum A \{ B(B), C(C) \}
                                       struct A { b: B, op: Op, c: C }
A = B , ( '+' | '-' ) , C ;
                                       enum Op { Plus, Minus }
                                       struct A { b: B, c: Option<(XOrY, C)> }
A = B , [ ( X | Y ) , C ] ;
                                       enum X0rY { X(X), Y(Y) }
```

#### **Operator Precedence**

As previously discussed, a traditional LL parser cannot parse the language from Listing 2.4. However, when comparing it to Listing 2.1 it might become obvious that the two grammars notate the same language. The first one simply provides additional information about the order of nesting different kinds of expressions, called *precedence*. For example, when parsing the expression '1+2\*3' the '2\*3' part should be nested deeper in the tree for it to be evaluated first. In Listing 2.1 this is achieved by recognizing multiplicative expressions as Terms and having additive expressions be composed of Terms. Listing 2.4 does not indicate the order of evaluation itself, so it must be provided externally.

Additionally, a precedence may be either left or right associative. Consider the input '1\*2\*3' it should be evaluated from left to right, so first '1\*2' and then the result times 3. Now consider '1\*\*2\*\*3'. Here the '2\*\*3' should be evaluated first and afterwards 1 should be raised to the result. That means while most operators are evaluated from left to right, that is, they are left associative, some operators like the power operator are evaluated from right to left and are therefore right associative. In Listing 2.1 left associativity is achieved by allowing simple repetition of the operator for an indefinite amount of times. Right associativity instead uses recursion on the right-hand side of the operator.

For LR parsers the precedence and associativity for each operator is encoded within the parser table. However, there is also a method called *Pratt-Parsing* that allows slightly modified recursive-descent LL parsers to parse such languages, given a map from tokens to precedences and their associativity. Often the grammars without included precedence are preferred, because they usually result in a simpler structure of the resulting syntax tree. This can be seen when comparing Figure 2.1 from earlier to



Figure 2.2 – Abstract Syntax Tree for '1+2\*\*3' Using Pratt-Parsing

Figure 2.2 which shows the resulting AST for the same input using the alternative language representation. Most notably, the rather long sequences of nodes with just a single child, like the path on the left simply resolving to a single Int(1) token, are gone in Figure 2.2.

#### **Pratt-Parsing**

As the rush parser makes use of Pratt-Parsing, most of the following code snippets are taken from there. First a mapping from a token kind to its precedence must be defined. The one for rush is found in Listing 2.5. It shows the prec method implemented on the TokenKind enum. The return type is a tuple of two integers, one for left and one for right precedence. For all but one token kind the left precedence is lower than the right one, resulting in left associativity. The higher the precedences are, the deeper in the tree the resulting expressions will be, and the earlier they are evaluated. All unrelated tokens are simply assigned a precedence of 0 for left and right.

The expression method on the Parser struct is then modified to take a parameter for the current precedence as seen in Listing 2.6. It then first matches on the current token kind to decide which expression to parse and stores the result in the lhs<sup>4</sup> variable. Afterwards it checks whether the left precedence of the now current token is bigger than the prec argument.

<sup>3&#</sup>x27;\*\* is the power operator here, so the input would be written as 123 using mathematical notation

<sup>&</sup>lt;sup>4</sup>Short for 'left-hand side'

```
crates/rush-parser/src/token.rs _
    pub(crate) fn prec(&self) -> (u8, u8) {
172
173
         match self {
             // ...
             TokenKind::Plus | TokenKind::Minus => (19, 20),
194
             TokenKind::Star | TokenKind::Slash | TokenKind::Percent => (21, 22),
195
             TokenKind::As \Rightarrow (23, 24),
196
             TokenKind::Pow => (26, 25), // inverse order for right associativity
197
             TokenKind::LParen => (28, 29), // for calls
198
199
             _ => (0, 0),
200
     }
201
```

**Listing 2.5** – Token Precedences in rush

When called from elsewhere, like in the condition of a while-loop or in a grouped expression, the prec argument has its minimum value of 0 as shown in Listing 2.7. In that case, this check will only fail when the whole expression is over, as every non-operator token is assigned a precedence higher than 0. If it does not fail, the infix\_expr method is called with the matching operator and the lhs. Afterwards lhs is overwritten with the returned value.

```
crates/rush-parser/src/parser.rs
     fn expression(&mut self, prec: u8) -> Result<'src, Expression<'src>> {
551
         let start_loc = self.curr_tok.span.start;
552
553
         let mut lhs = match self.curr_tok.kind {
554
             TokenKind::Int(num) => Expression::Int(self.atom(num)?),
555
             TokenKind::LParen => Expression::Grouped(self.grouped_expr()?),
568
569
             invalid => {
                 return Err(Error::new_boxed(/* ... */));
570
             }
575
         };
576
577
         while self.curr_tok.kind.prec().0 > prec {
578
             lhs = match self.curr_tok.kind {
579
                 TokenKind::Plus => self.infix_expr(start_loc, lhs, InfixOp::Plus)?,
580
                 TokenKind::Star => self.infix_expr(start_loc, lhs, InfixOp::Mul)?,
581
                 TokenKind::Slash => self.infix_expr(start_loc, lhs, InfixOp::Div)?,
583
                 TokenKind::Pow => self.infix_expr(start_loc, lhs, InfixOp::Pow)?,
                 // ...
                 _ => return Ok(lhs),
613
614
             };
615
616
         Ok(lhs)
617
     }
618
```

**Listing 2.6** – Pratt-Parser: Implementation for Expressions

The infix\_expr method in Listing 2.8 simply stores the right TODO: unclear it's the side 'right', not 'correct' precedence of the operator token, advances to the next token, and calls the expression method again for its right-hand side, but this time with the stored right\_prec as the minimum precedence. These simple calls and checks of precedences automatically result in correct grouping and nesting of the expressions.

This might be made clearer by Figure 2.3. It shows the tokens with their respective left and right precedence for the input '(1+2\*3)/4\*\*5', which represents the mathematical expression  $\frac{1+2\cdot3}{4^5}$ . The for this example irrelevant precedences have been grayed out. Starting

```
crates/rush-parser/src/parser.rs -
     fn grouped_expr(&mut self) -> Result<'src, Spanned<'src, Box<Expression<'src>>>> {
733
734
         let start_loc = self.curr_tok.span.start;
735
         // skip the opening parenthesis
736
         self.next()?;
737
         let expr = self.expression(0)?;
738
739
         self.expect_recoverable(
740
             TokenKind::RParen,
             "missing closing parenthesis",
741
             self.curr_tok.span,
742
         )?;
743
         // ...
    }
749
```

Listing 2.7 – Pratt-Parser: Implementation for Grouped Expressions

```
fn infix_expr(/* ... */) -> Result<'src, Expression<'src>> {
    let right_prec = self.curr_tok.kind.prec().1;
    self.next()?;
    let rhs = self.expression(right_prec)?;
    // ...
}
```

**Listing 2.8** – Pratt-Parser: Implementation for Infix Expressions

from the left, the parser first encounters a/an 'LParen' token and therefore decides to parse a 'grouped\_expr'. Inside that method, 'expression' is again called, again with 'prec' being 0. The difference is that now the current token is 'Int(1)'. That token is then parsed as a simple integer expression and stored in 'lhs'. Now, the left precedence of the 'Plus' token, 19, is higher than 'prec's value of 0, so the while-loop is entered. In there, the 'infix\_expr' method is called, which queries the right precedence of the 'Plus' token and calls 'expression' again, this time with a precedence of 20. Skipping to the precedence check, again the current token's left precedence is higher than 'prec', the precedences being 21 for the 'Star' token and 20 for 'prec'. Therefore, 'infix\_expression' is called once again, which call 'expression' again, but now during the next precedence check, the left precedence of the following 'RParen' token, 0, is not higher than the current precedence of 22. That means, the innermost 'expression' call returns with a single 3. This 3 is then used as the right-hand side for the multiplicative infix expression. The same 'RParen' precedence check again fails for the 'expression' call with a precedence of 20. Thus, the '2\*3' part together forms the right-hand side of the addition expression. Once more, 'RParen's left precedence of 0 is not larger than the initial precedence of 0, hence the parser returns to the 'grouped\_expr' call with the entire contents in the parentheses. Here the right parenthesis is skipped, and the method returns to the outermost 'expression' call, assigning the entire grouped expression to 'lhs'.

In order to understand the right associativity of the 'Pow' token and to improve intuition,



Figure 2.3 – Token Precedences for Input '(1+2\*3)/4\*\*5'

the reader is advised to continue going through this procedure for the rest of the example input. A curious reader might additionally want to parse the inputs '1\*2\*3', '1\*\*2\*\*3', and '1+(2+3)' by hand. It should then become clear how Pratt-Parsing achieves parsing with precedences without it being encoded in the Grammar, and with that, the node types.

#### **Parser Generators**

For most intents and purposes it is generally not recommended and necessary to implement parsers, and with that, lexers, manually. **TODO:** citation Instead, there are so-called parser generators that generate parsers based on some specification of the desired syntax and the required precedences. Often parser generators define a domain specific grammar notation for the syntax specification, although some parser generators also accept traditional grammar notations as input. Most parser generators target some variation of table-driven LR parsers. Table-driven parsers use a table, mapping all possible combinations of tokens on the parse stack and lookahead tokens to an action. For shift-reduce parsers the possible actions are shifting, that is reading the next input token and pushing it to the stack, and reducing, which pops some number of elements from the stack and in place pushes a node to it. The input syntax must therefore either be unambiguous or augmented by precedence rules, so that a complete parser table can be generated. **TODO:** mention 'nom'

# 2.2. Semantic Analysis

Before compilation can begin, both the syntax and the semantics of the program have to be validated. The *semantic analysis* is responsible for validating that the structure and logic of the program complies with the rules of the programming language. Often, semantic analysis directly follows the syntax analysis since the parser generates the input for the semantic analysis step.

## 2.2.1. Defining the Semantics of a Programming Language

Often, a programming language is not just defined by its grammar because the grammar cannot specify how programs should behave. Therefore, a programming language's behavior is often defined by a so-called semantic specification. This specification often describes how a program should behave during runtime and what semantic rules discriminate a valid program from an invalid one. Common rules include type checking, context of statements, or integer overflow behavior. Another example of a semantic rule is that a variable has to be declared before it is used. Defining the semantic rules of a programming language is often a demanding task since not all requirements are clear from the beginning. Because the semantic rules of a programming language can not be defined formally, a language designer often writes their specification in a natural language, meaning Chomsky type 0. However, due to the specification being written in a natural language, the specification can sometimes be ambiguous. Therefore, a well-written semantic specification should avoid ambiguity as much as possible. Furthermore, this specification is often written in English due to it being a well-adopted language across several academic fields like computer science. Since those rules define when a program is valid, they have to be checked and enforced before program compilation can start [Wat17, p. 21].

#### 2.2.2. The Semantic Analyzer

Because rush shares its semantic rules across all backends, it would be cumbersome to implement semantic validation in each backend individually. Therefore, it is rational to implement a separate compilation step which is responsible for validating the source program's semantics. Among other checks, the so-called *semantic analyzer*<sup>5</sup> validates types and variable references whilst performing type annotations. The last aspect is of particular importance since all compiler backends rely on type information at compile time. In order to obtain type information, the abstract syntax tree of the source program must be traversed, performing numerous other checks during the process. The analyzer behaves exactly like this. In order to preserve a clear boundary between the individual compilation steps, the parser only validates the program's syntax without performing further validation. Therefore, the analyzer traverses the abstract syntax tree previously generated by the parser.

In order to highlight why type information is often required at compile time, we will consider Listing 2.9. The code in this listing displays a basic rush program calculating the sum of two integers and uses the result as its exit code. In this example, the exit code of the program will be 5.

```
1 fn main() {
2   let two = 2;
3   let three = 3;
4   exit(two + three)
5 }
```

Listing 2.9 – A rush Program Which Adds Two Integers

In this example, the analyzer will first check if the program contains a main function. If this is not the case, the analyzer rejects the program because it violates rush's semantic specification. Furthermore, the analyzer checks that the main function takes no parameters and returns no value. In this example, there is a valid main function which complies with the previously listed constraints. Now, the analyzer traverses the function body of the main function. First, the analyzer examines the statements in the lines two and three. Since let statements are used to declare new variables, the analyzer will add the variables two and three to its current scope. However, unlike an interpreter, the analyzer does not insert the variable's value into its scope. Instead of the concrete values, the analyzer only considers the types of expressions. Therefore, in this example, the analyzer remembers that the variables two and three store integer values. This information will become much more useful when we consider line 4. Here, the analyzer checks that the identifiers two and three refer to valid variables. Just like most other programming languages, rush does not allow the addition of two boolean values for example. Therefore, the analyzer checks that the operands of the + operator have the same type and that this type is valid in additions. Because this validation requires information about types, the analyzer accesses its scope when looking up the identifiers two and three. Since those names were previously associated with the int type, the analyzer is now aware of the operand types and can check their validity. In this case, calculating the sum of two integers is legal and results in another integer value. Since rush's semantic specification states that the exit function requires exactly one integer parameter, the analyzer has to check that it is called correctly. Furthermore, the analyzer validates all function calls and declarations, not just the ones of builtin functions. Since the result of the addition is also an integer, the analyzer accepts this program since both its syntax and semantics are valid.

<sup>&</sup>lt;sup>5</sup>Later referred to as "analyzer"

As indicated previously, most compilers require type information whilst generating target code. For simplicity, we will consider a fictional compiler which can compile both integer and float additions. However, the fictional target machine requires different instructions for addition depending on the type of the operands. For instance, integer addition uses the intadd instruction while float addition uses the floatadd instruction. Here, type ambiguity would cause difficulties. If there was no semantic analysis step, the compilation step would have to implement its own way of determining the types of the operands at compile time. However, determining these types requires a complete tree-traversal of the operand expressions. Due to the recursive design of the abstract syntax tree, implementing this tree-traversal would require a significant amount of source code in the compiler. However, the implementation of this algorithm would be nearly identical across all of rush's compiler backends. Therefore, implementing type determination in each backend individually would enlarge the compiler source code, thus making it harder to understand. Since code duplication is considered inelegant, outsourcing this algorithm into a separate component is likely the best option. As a result of this, the analyzer implements such a tree-traversal algorithm for determining the types of subtrees. Because of the previously mentioned reasons, rush's semantic analyzer also annotates the abstract syntax tree with type information so that it can be utilized by later steps of compilation.

In order to obtain a deeper understanding of how the analyzer works, we will now consider parts of its implementation and how they behave when analyzing the example from above. However, before we can examine how the analyzer's implementation behaves, we should first highlight which attributes play a vital role in the analyzer.

```
pub struct Analyzer<'src> {
12
        functions: HashMap<&'src str, Function<'src>>,
13
        diagnostics: Vec<Diagnostic<'src>>,
14
        scopes: Vec<HashMap<&'src str, Variable<'src>>>,
15
16
        curr_func_name: &'src str,
        /// Specifies the depth of loops, `break` / `continue` legal if > 0.
17
18
        loop_count: usize,
        // ...
    }
26
```

**Listing 2.10** – Attributes of the analyzer struct

Listing 2.10 displays the struct fields of the semantic analyzer. The field functions in line 13 associates a function name to the function's signature. Therefore, if a function is called at a later point in time, the analyzer checks if the function exists and can compare if the arguments match the declared parameters. The next field, diagnostics contains a list of diagnostics. A diagnostic is a struct which represents a message, is intended to be displayed to the user of the compiler. Each diagnostics has a severity, such as warning or error for instance. After the analyzer has finished the tree-traversal, all diagnostics are displayed in a user-friendly manner. The scopes field in line 15 is responsible for managing variables. In rush, blocks using braces ({}) create new scopes. If the analyzer enters such a block, a new scope is pushed onto the scopes stack. Each scope maps a variable identifier to some variable-specific data. For instance, the analyzer keeps track of variable types, whether variables have been used later, if they are mutated, and the location of where they were declared. By saving this much information about each declared variable, the analyzer can produce very helpful and accurate error messages or warnings. Such a message containing various diagnostics is displayed in Listing 2.11, it occurs when another value is assigned to an immutable variable.

```
SemanticError at test.rush:3:5
  2 |
          let number = 5;
  3 |
          number += 5;
          ^^^^^
  4 | }
cannot re-assign to immutable variable `number`
Hint at test.rush:2:9
  1 | fn main() {
  2 |
          let number = 5;
              ~~~~~
  3 |
          number += 5;
variable not declared as `mut`
```

Listing 2.11 – Output When Compiling an Invalid rush Program

Moreover, the field curr\_func\_name saves the name of the current function. Furthermore, the loop\_count field is used to validate the uses of the break and continue statements. Because these statements are only valid inside loop bodies, the value of loop\_count must be > 0 when the analyzer encounters such a statement. This counter is incremented as soon as the analyzer begins traversal of a loop body. After the analyzer has traversed the loop's body, the counter is decremented again. Due to this design, nested loops do not cause issues while the validity of the above statements can be guaranteed.

Now that important attributes have been highlighted, we can now consider the example from Listing 2.9. First, the analyzer traverses and analyzes all functions and their bodies. For every rush function, the analyzer invokes an internal method responsible for validating functions. Among other tasks, this method sets the <code>curr\_func\_name</code> field to the name of the current function and inserts a new entry into the <code>functions</code> hashmap, associating the function's name with its signature. Because a <code>main</code> function is mandatory in every rush program, the analyzer simply checks that the <code>functions</code> hashmap contains an entry for the <code>main</code> function. Naturally, this lookup is performed after all functions have been analyzed since the <code>main</code> function can then exist in the hashmap. The code in listing 2.12 shows how validating the <code>main</code> function's signature works.

```
fn function_definition(
396
397
         &mut self,
398
         node: FunctionDefinition<'src>,
399
     ) -> AnalyzedFunctionDefinition<'src> {
         // set the function name
400
         self.curr_func_name = node.name.inner;
401
402
         if node.name.inner == "main" {
403
             // the main function must have 0 parameters
404
             if !node.params.inner.is_empty() {
405
                 self.error(
406
                  // ...
                  )
418
             }
419
420
             // the main function must return `()`
421
             if let Some(return_type) = node.return_type.inner {
422
                  if return_type != Type::Unit {
423
                      self.error(
424
                         . . .
432
                  }
433
             }
434
         }
435
         let block = self.block(node.block, false);
490
         AnalyzedFunctionDefinition {
535
             used: true, // is modified in Self::analyze()
536
             name: node.name.inner,
537
538
             return_type: node.return_type.inner.unwrap_or(Type::Unit),
539
540
             block.
         }
541
    }
542
```

Listing 2.12 - Analyzer Validating the Signature of the 'main' Function

This code displays the 'function\_definition' method of the analyzer. In this listing, only the code relevant for analyzing the 'main' function is shown. However, this method is used to analyze any function, not just 'main'. The method takes a 'FunctionDefinition' as its input and returns an 'AnalyzedFunctionDefinition'. Therefore, it is responsible for analyzing and annotating the definition of a function. Since a rush file might contain multiple functions, this method is invoked for each function declared.

In line 401, the method updates the current function name. The if-clause in line 403 checks if the method is currently analyzing the main-function. In this case, the code inside the if-clause is executed in order to perform special checks on the main-function. The next if-clause in line 405 checks if the node's params vector contains any items. If the vector contained any items, the main-function's declaration would include parameters. If this was the case, the analyzer would generate an error message stating that the main-function must not take any parameters. However, we have not yet explained how error handling in the analyzer works. When examining the signature of this method, it becomes obvious that it cannot return errors. Instead, the self.error method is invoked in line 406. This method uses information about the error to be generated in order to push a new 'Diagnostic' struct into the diagnostics vector. Therefore, the 'diagnostics' vector will contain any generated errors after analysis has been completed.

In rush, the main-function cannot return a value at runtime. Therefore, the return-type of the function always has to be the unit-type. Thus, the analyzer validates this constraint in the lines 422 and 423. The first if-clause checks if the function contains a manually defined result-type. In rush, every function definition without an explicitly defined result type defaults to the unit type. Therefore, the inner if-clause is only executed if the user has manually specified a result type of their main-function. In this case, the analyzer checks if the manually specified type differs from the required unit-type. If this is the case, the analyzer generates another error in line 424 which describes the issue.

After the signature of the main-function has been validated, the method begins traversal of the function's body. In line 490, the 'self.block' method of the analyzer is invoked using the body (node.block) of the method as its first argument. The second boolean argument specifies that the method should not push another scope onto the stack since this is handled by the current method. The return-value of this method-call is bound to a variable called 'block'. This variable represents the completely analyzed and annotated function body. Lastly, in the lines 535-541, an 'AnalyzedFunctionDefinition' is returned. Here, metadata like the function's analyzed parameters, its return-type, its name, or its body are specified. Unknown variables like 'params' have been defined in the parts of the code which are currently hidden.

During the traversal of the main function's body, the analyzer encounters two let statements in line 2 and 3. For analyzing this type of statement, the let\_stmt method of the analyzer is invoked. The code in Listing 2.13 shows the 'let\_stmt' method of the analyzer.

```
fn let_stmt(&mut self, node: LetStmt<'src>) -> AnalyzedStatement<'src> {
612
         // ...
         let expr = self.expression(node.expr);
617
         if let Some(shadowed) = self.scope_mut().insert(
644
             node.name.inner,
645
             Variable {
646
656
         ) {
657
680
         AnalyzedStatement::Let(AnalyzedLetStmt {
682
             name: node.name.inner,
683
684
             expr,
             mutable: node.mutable,
685
             used: true,
686
         })
687
     }
688
```

**Listing 2.13** – Beginning of the 'let\_stmt' Method

In line 617, the initializing expression of the let-statement is analyzed first in order to obtain information about its result data type. After the subtree of the expression has been traversed and analyzed, its data type is now known. The analyzer now inserts a new entry for the variable's name (e.g. 'two') into its current scope. This insertion is performed in line 644. The contents of the pushed variable struct are hidden but include its type or its span for instance. Since the span includes the location of where the variable was defined, it can later be used in error messages like the one displayed in Listing 2.11. Furthermore, the inserted information inside the struct includes whether the variable was declared as mutable. If a variable is mutable, it can be reassigned to. Because the variable in our example program was

not declared as mutable, the error seen in Listing 2.11 was generated as a result. However, what strikes the eye is that the insertion happens as a condition inside an if-clause. If the insertion returns true, the variable's name was already present in the current scope and its previous associated data has now been overwritten. The process of overwriting variables by redefining them is called *variable shadowing*. Here, the analyzer should display some additional hints or warnings, depending on whether the shadowed variable has been referenced before it was shadowed. If this was not the case, the analyzer will generate a warning message informing the user about an unused and therefore redundant variable. Among the previously mentioned data, the insertion includes the variable's data type which was obtained by prior analysis of the expression.

It is now apparent that the analyzer uses type information of expressions on many occasions just like this one. However, we have not yet explained how type determination and annotation works in the analyzer. In order to get an understanding of how the analyzer determines types of variables, we must consider how expressions are traversed. The code in Listing 2.14 is part of the method responsible for analyzing expressions.

```
fn expression(&mut self, node: Expression<'src>) -> AnalyzedExpression<'src> {
1007
1008
         let res = match node {
1009
              Expression::Int(node) => AnalyzedExpression::Int(node.inner),
1010
              Expression::Float(node) => AnalyzedExpression::Float(node.inner),
              Expression::Bool(node) => AnalyzedExpression::Bool(node.inner),
1011
              Expression::If(node) => self.if_expr(*node),
1032
              Expression::Block(node) => self.block_expr(*node),
1033
              Expression::Grouped(node) => {
1034
                  let expr = self.expression(*node.inner);
1035
1040
1048
         res
1049
```

Listing 2.14 – Analysis of Expressions During Semantic Analysis

The 'node' parameter specifies the expression node generated by the parser, it does not contain any type information since it is yet to be analyzed. It is also apparent that the method returns a value of the type 'AnalyzedExpression', which represents an analyzed and annotated expression. Therefore, this method consumes a non-analyzed expression and transforms it into an analyzed version of itself. For simple types of expressions like integers, floats, or booleans, further analysis is omitted. Since these types of expressions are constant, the method can directly return an analyzed version of the expression. For more complex types of expressions, like if-expressions for instance, this method calls the appropriate method responsible for analyzing this type of expression. This way, implementation of this method stays relatively simple and the maintainability of the codebase increases.

In this function, the recursive tree-traversal algorithm used in the analyzer is clearly visible. For instance, if the current expression is a grouped expression like '(1 + 2)', the code in the lines 1034-1040 is called. In line 1035, the 'expression' method calls itself recursively using the inner expression of the grouped expression as the call argument. Since grouped expressions contain another inner expression, a grouped expression inside another grouped expression is a legal construct in rush. Therefore, it is possible that the expression method calls itself multiple times recursively. Most of the other tree-traversing methods implement a similar recursive behavior as most types of AST nodes may contain themselves at some point.

```
pub fn result_type(&self) -> Type {
130
         match self {
131
             Self::Int(_) => Type::Int(0),
132
133
             Self::Float(_) => Type::Float(0),
             Self::Bool(_) => Type::Bool(0),
134
             Self::If(expr) => expr.result_type,
142
143
             Self::Block(expr) => expr.result_type,
             Self::Grouped(expr) => expr.result_type(),
144
         }
145
    }
146
```

**Listing 2.15** – Obtaining the Type of Expressions

Since we have now explained how tree traversal and analysis works in general, the question of how types are accessed and saved in the annotated syntax tree remains. The code in Listing 2.15 shows how the type of any analyzed expression can be obtained. For constant expressions like Int(\_), the determination of its type is straight-forward. This case is displayed in line 132. Here, the result\_type method returns Type::Int(0). In this implementation, the Type enum saves a count which specifies the amount of pointer indirection. For instance, the rush type \*\*int is represented as Type::Int(2) because there are two levels of pointer indirection. However, if the method is called on a constant integer expression, the resulting level of pointer indirection is zero. Therefore, this method is able to return the types of simple constant expressions with no additional effort. For more complex constructs like if-expressions, the corresponding analyzed AST node saves its result type directly. For instance, during analysis of block expressions, the responsible function checks if the block contains a trailing expression. If this is the case, the result type of the block expression is identical to the one of its trailing expression. For instance, the result-type saved in a field on an individual AST node is accessed in line 142. In line 144, the result type of a grouped expression is obtained by calling the 'result\_type' method recursively. By using the previously described method, the analyzer is able to get type information about each node of the tree, assuming that it has been analyzed previously.

In the case of a semantically malformed program, the analyzer must somehow continue the tree traversal. Otherwise, only one error could be reported at a time since every traversing method could return a potential error which would terminate the tree traversal. To mitigate this issue, the Unknown type was implemented. If the analyzer encounters a type conflict where one of the conflicting types is unknown, it does not report another error since the unknown type could have only been caused by a previous error. Therefore, errors do not cascade, meaning that an undeclared variable will not cause another type error.

**Listing 2.16** – Validation of Call-Arguments in the Analyzer

Below the let-statements in the source program, the exit function is called. Here, the analyzer uses the call\_expr method in order to analyze the validity of this function call.

The code in Listing 2.16 contains the part of this method which is responsible for validating that the arguments are compatible with the function's parameters. For this, the code in the snipped iterates over the provided arguments. In each iteration, the 'self.arg' method of the analyzer is invoked. This method validates that the provided argument matches the provided parameter.

In this example, the argument expression two + three is traversed during this analysis. Since the identifiers on the left- and right hand side have been declared by the two let-statements previously, obtaining their data types merely involves a lookup of the identifier names inside the current scope's hashmap. If an unknown variable was provided, the lookup in the hashmap would yield no value, thus causing an error message to be generated at this point. Because the type of the invalid variable is unknown, the placeholder Unknown type would be used in order to prevent cascading errors.

Because the two variables should be added, the method <code>infix\_expr</code> of the analyzer is called. This method is responsible for analyzing any kind of infix expression like 'n <code>|| m'</code>'. This method validates several constraints. For instance, the operands must both be of the same type. In this example, both operands of the addition are integers. Therefore, the analyzer accepts this infix-expression and is now aware that it yields another integer. After the infix-expression's result type has been determined, it is saved in its own <code>result\_type</code> struct field. Infix-expressions are an example for tree nodes which save their result type as a struct field on their own. Now that the analysis of the argument expression has completed, its compatibility with the declared parameter must be validated.

```
let arg = self.expression(arg);
1814
1815
     match (arg.result_type(), param_type) {
1816
          (Type::Unknown, _) | (_, Type::Unknown) => {}
1817
1818
          (Type::Never, _) => {
              self.warn_unreachable(call_span, arg_span, true);
1819
              *result_type = Type::Never;
1820
          (arg_type, param_type) if arg_type != param_type => self.error(
1822
1827
          _ => {}
1828
     }
1829
```

Listing 2.17 – Validation of Argument Type Compatibility in the Analyzer

Listing 2.17 shows a part of the arg function which is responsible for validating that a function call argument is compatible with the declared parameter. In the above example, this means that the exit function is to be called with exactly one integer argument. This code will produce an error message if the type of the call argument deviates from the one of the declared parameter. In order to validate the compatibility between the provided argument and the declared parameter, the method differentiates between several possible scenarios. In line 1817, the method detects the scenario in which the type of either the argument or the parameter is 'Unknown'. Here, the method should ignore this argument without producing another error.

The next match-arm in line 1818 presents the scenario in which the provided argument has the 'Never' type. In this case, the analyzer should only add a warning that the call-expression is unreachable. Furthermore, the result type of the entire call-expression will also be updated to reflect the never-type. However, this scenario does not cause an error to be generated. Line 1822 displays the final scenario in which the type of the argument differs

from the expected type of the parameter. In this case, the method will generate an error describing the situation. Again, the concrete error message is omitted for better overview.

In the case of the example program, the analyzer did not generate any error messages since the code presents both a syntactically and semantically valid rush program. Therefore, the analyzer accepts this program and returns its syntax-tree with type annotations.

#### 2.2.3. Early Optimizations

Another task of the analyzer can be to perform early optimizations. In compiler design, most of the optimizations are often performed with the target machine in mind. Therefore, the effects of these target-machine dependent optimizations can excel the ones caused by earlier optimizations. However, it is still rational to perform trivial optimizations, such as constant folding and loop conversion inside the analyzer. For instance, the rush expression 2+3 evaluates to 5 during compile time instead of run time. This evaluation of expressions during compile time is referred to as *constant folding*. Constant folding is often used in order to avoid the emission of otherwise redundant arithmetic instructions. As a result of this, the compiled program will run faster since less computation is being performed when the program is executed. In order to make such optimization possible, each expression node in the analyzed AST has a method named constant [Wir05, p. 54].

Listing 2.18 – Method for Determining if an Expression is Constant

This method is responsible for determining whether an expression is constant. The method returns true if its expression is a constant integer, float, boolean or char. Other types of expression, such as a call-expression cannot be constant since such a function call may cause side effects which cannot be determined during compile time. This method is vital for constant folding since both the left- and right-hand side of infix-expressions need to be constant in order to allow compile-time evaluation.

Among other optimizations implemented in the analyzer, loop transformation can also have a positive effect on the program's performance during runtime. The top listing displays part of a rush program which uses a while loop even though a loop would be faster. The other listing displays the same algorithm implemented using the faster loop.

```
while true {
    a += 1
}

loop {
    a += 1
}
```

The loop implementation is more efficient since the condition check is omitted before each iteration. Because the while loop checks that its head condition is true before it starts the

next iteration, the while loop will run slower than the loop in this example. However, this is only the case because the condition of the loop is a constant true. Therefore, using a condition which is always true is redundant and should therefore be omitted. If the analyzer detects such a scenario after a while-loop was analyzed, the output node will be converted into a conventional loop. Detection of this scenario is implemented in line 855 of Listing 2.19.

```
match (never_loops, condition_is_const_true) {
851
         // if the condition is always `false`, return nothing
852
853
         (true, _) => None,
         // if the condition is always `true`, return an `AnalyzedLoopStmt`
854
         (false, true) => Some(AnalyzedStatement::Loop(AnalyzedLoopStmt {
855
             block,
856
             never_terminates,
857
858
         // otherwise, return an `AnalyzedWhileStmt`
859
         (false, false) => Some(AnalyzedStatement::While(AnalyzedWhileStmt {
860
861
             cond,
862
             block,
863
             never_terminates,
864
         })),
865
```

**Listing 2.19** – Loop Transformation in the Analyzer

Another scenario in which a while loop can be restructured occurs if the condition always evaluates to false. This example is displayed in the listing below.

```
while false {
    // this loop will never iterate
}
```

Since the loop in the above listing never iterates, it is completely redundant and can therefore be omitted entirely. This scenario is detected in line 853 of Listing 2.19. This optimization improves runtime efficiency by a small amount since the code performing the very first condition check will not be compiled into the output program. Furthermore, the resulting output code will also be of slightly smaller size since the entire loop compilation can be omitted. Therefore, implementing such trivial optimizations can significantly contribute to a more efficient output program. However, compiler writers often implement significantly more of those early optimizations than the ones presented in the two examples from above.

TODO: Include tree conversion figure which uses constant folding + has annotated types

# 3. Interpreting the Program

After the syntactical and semantic analysis, there are now multiple different ways to continue. The first one shown here is an *interpreter*, which, contrary to a compiler, executes the analyzed program directly. That means, no output to be run by another process is produced and written to disk. Instead, everything is kept in memory and immediately evaluated by the interpreter. **TODO: citation** Therefore, compared to a compiler, the *interpreter* not only has to be installed on the developer's machine, but also on the target machine. This chapter presents two fundamentally different ways of implementing an interpreted programming language.

# 3.1. Tree-Walking Interpreters

A tree-walking interpreter is probably the simplest form of programming language implementation, which is why it is the first one explained here. **TODO:** mention slowness here? It requires no further intermediate steps after the analysis but simply accepts the AST as input. Just like the parser and analyzer, it walks the tree, hence the name, and therefore again requires one method per node type.

The struct definition is also rather small as seen in Listing 3.1. It stores a list of *scopes*, that being maps of variable names to their value at runtime, starting with the global scope at index 0 and ending with the current scope at any point in time. **TODO:** are rush's scoping rules already explained? **TODO:** is runtime / compile time already explained? **TODO:** are globals already mentioned? Why the 'Value' is wrapped inside a/an 'Rc<RefCell<Value>>' here is explained later in Section 3.1.3. Additionally, a map of function names to their tree node is saved. The 'Rc<\_>' here is only necessary to conform to Rust's borrowing rules without unnecessarily cloning.

```
crates/rush-interpreter-tree/src/interpreter.rs
   type ExprResult = Result<Value, InterruptKind>;
9
   type StmtResult = Result<(), InterruptKind>;
   type Scope<'src> = HashMap<&'src str, Rc<RefCell<Value>>>;
10
11
   #[derive(Debug, Default)]
12
   pub struct Interpreter<'src> {
13
        scopes: Vec<Scope<'src>>,
14
        functions: HashMap<&'src str, Rc<AnalyzedFunctionDefinition<'src>>>,
15
   }
16
```

**Listing 3.1** – Tree-Walking Interpreter: Type Definitions

In addition to the struct itself, result types for expressions and statements are defined. Statements either return '()' on success or an 'InterruptKind' on failure. Expressions use the same error type, but return a 'Value' on success, holding the evaluated result. The definitions for both 'Value' and 'InterruptKind' are visible in Listing 3.2. It is clearly visible

<sup>&</sup>lt;sup>1</sup>Rust's 'unit' type, a type containing no data, comparable to 'void' in other languages. Its notation describes an empty tuple.

that the interpreter simply makes use of Rust's types and does not need to worry about the hidden implementation details.

```
_ crates/rush-interpreter-tree/src/value.rs _
    pub enum Value {
6
7
        Int(i64),
        Float(f64),
8
9
        Char(u8),
        Bool(bool),
10
11
        Unit,
        Ptr(Rc<RefCell<Value>>),
12
    }
13
    //
    pub enum InterruptKind {
22
        Return(Value),
23
24
        Break,
25
        Continue,
26
        Error(interpreter::Error),
27
        Exit(i64),
28
    }
```

**Listing 3.2** - Tree-Walking Interpreter: 'Value' and 'InterruptKind' Definitions

'InterruptKind' describes the different ways the interpreter can be interrupted. The first three variants, 'return', 'break', and 'continue', are only partial interrupts, i.e., they do not necessarily stop execution of the whole program. For example, when the interpreter encounters a 'break' statement, it creates a 'InterruptKind::Break' and tracks back until it reached the innermost loop, where the interrupt is then caught and execution is continued after the loop. remaining two variants are for one runtime errors, produced by things like division by zero, and 'Exit(i64)', created by rush's builtin 'exit(int)' function. They cause the interpreter to backtrack all the way back to the AST's root and with that cancel execution, as this is the desired behavior for errors and exit calls.

The way 'Value' is implemented also makes it very easy to support dynamic typ-

ing<sup>2</sup>, since it can already be a value of any type. The rush interpreter only makes use of the analyzer's guarantees of result types, which could not be done with dynamic typing.

#### 3.1.1. Implementation

```
crates/rush-interpreter-tree/src/interpreter.rs
    pub fn run(mut self, tree: AnalyzedProgram<'src>) -> Result<i64, Error> {
23
         for func in tree.functions.into_iter().filter(|f| f.used) {
24
             self.functions.insert(func.name, func.into());
25
26
27
         let mut global_scope = HashMap::new();
28
         for global in tree.globals.iter().filter(|g| g.used) {
29
             global_scope.insert(/* ... */);
30
         }
40
         self.scopes.push(qlobal_scope);
41
         match self.call_func("main", vec![]) {
56
             Err(InterruptKind::Error(msq)) => Err(msq),
57
             Err(InterruptKind::Exit(code)) => Ok(code),
58
             Ok(\underline{\ }) \mid Err(\underline{\ }) \Rightarrow Ok(\underline{\ }),
59
60
    }
61
```

**Listing 3.3** – Tree-Walking Interpreter: Beginning of Execution

Listing 3.3 displays the public 'run' method of the interpreter that serves as its entry point. Since the order of functions in rush does not matter and a function can be called

<sup>&</sup>lt;sup>2</sup>In contrast to static typing, with dynamic typing the types of variables and results of expressions are only known at runtime rather than during compile time.

before its definition in the file, the interpreter must first populate its 'functions' map, before any function code is run. Afterwards the global variables are assigned their initial values. Only then is the code inside the main function called via the 'call\_func' method. After it returns, the entire interpreter either returns a produced runtime error, an exit code caused by a rush call to 'exit', or the success exit code '0' otherwise.

```
crates/rush-interpreter-tree/src/interpreter.rs _
    fn call_func(&mut self, func_name: &'src str, mut args: Vec<Value>) -> ExprResult {
81
        if func_name == "exit" {
82
            return Err(InterruptKind::Exit(args.swap_remove(0).unwrap_int()));
83
84
85
        let func = Rc::clone(&self.functions[func_name]);
87
        let mut scope = HashMap::new();
88
        for (param, arg) in func.params.iter().zip(args) {
89
            scope.insert(param.name, arg.wrapped());
90
91
92
        self.scoped(scope, |self_| match self_.visit_block(&func.block, false) {
93
            Ok(val) => Ok(val),
94
            Err(interrupt) => Ok(interrupt.into_value()?),
95
        })
96
   }
97
```

**Listing 3.4** – Tree-Walking Interpreter: Calling of Functions

The 'call\_func' method, visible in Listing 3.4, first checks whether a built-in functions was called. As the only built-in function in rush is 'exit', a simple equality check is enough here. In case the called function is in fact the exit function, the first call argument is asserted to be an integer via the 'unwrap\_int' method on 'Value', and an exit interrupt kind containing that integer is returned immediately. Otherwise, the previously saved tree node of the function to be executed is received from the 'functions' map. A new variable scope for the function's body is then initialized and filled with the arguments passed to the function. Afterwards, the function's block is evaluated by calling the 'visit\_block' method in a scoped environment. If the block returns an interrupt that is partial, it is turned into the appropriate return value by the 'into\_value' method call. For 'continue' and 'break' this is '()' and for 'InterruptKind::Return(Value)' it is the wrapped value. The other two fatal interrupt kinds are simply passed along when encountered.

## **3.1.2.** Example

```
fn main() {
        exit(plus_two(global));
2
3
4
5
   let mut global = 40;
6
7
   fn plus_two(num: int) -> int {
8
        return num + 2;
        qlobal += 4;
9
   }
10
```

**Listing 3.5** – Example rush Program

To provide a basic overview of a program's execution without too many implementation details, the evaluation of the rush program found in Listing 3.5 is presented in the following.

First, the main function gets called by the interpreter's entry point as discussed before. there, 'visit\_block' is called with the main function's block. The 'visit\_block' method then iterates over its contained statements and evaluates each one in order with the 'visit\_statement' In this case, that is only the call to 'exit'. Since calls in rush are considered expressions and expressions can also be used wherever statements are allowed, the 'visit\_statement' method only forwards to 'visit\_expression', which itself forwards to 'visit\_call\_expr'. The call expression then evaluates each argument expression by calling 'visit\_expression' again. Here, that involves another call expression, this time of the 'plus two' function. It again evaluates its arguments, that being the access of the global variable 'global' here, and then runs the function's block. The call stack at the point of reaching the return statement, is dis-

```
run(/* ... */)
call_func("main", vec![])
visit_block(/* ... */)
visit_statement(/* ... */)
visit_expression(/* ... */)
visit_call_expr(/* ... */)
visit_expression(/* ... */)
visit_expression(/* ... */)
visit_call_expr(/* ... */)
visit_call_expr(/* ... */)
visit_statement(/* ... */)
```

Figure 3.1 – Call Stack at the Point of Processing the Return Statement

played in Figure 3.1. Now the expression 'num + 2' is evaluated by first evaluating both sides of the '+' sign on their own and then adding the results.

Following that, an interrupt of kind 'return' is constructed, holding the sum. By making use of Rust's '?'<sup>3</sup> operator, the interrupt causes early returns in all function calls walking up the call stack to 'call\_func'. Thus, the statement 'global += 4;' is never reached. The 'exit' function now finally has a value for its argument and can be called. However, as described earlier, exit is built-in and does not call 'visit\_block', but instead constructs an 'InterruptKind::Exit(\_)' with the result value, so '42' in this case.

## 3.1.3. Supporting Pointers

Adding support for pointers in a tree-walking interpreter is actually not as straight forward as it is for the following language implementations, which all have a manually managed memory layout. It also depends a lot on the implementing language. In languages with a garbage collector, for example Go or Java, the pointer functionality of that language can just be reused, and the cleanup is already managed. However, unlike many modern languages, Rust does not have a garbage collector. Instead, it makes use of the so-called borrow checker that validates all references at compile time. Using default Rust references for pointers while conforming to Rust's borrowing rules turns out to be quite complicated though. But it is not required to use them. Rust provides an additional method of having pointers to values besides their reference system. The 'Rc<->' type, short for reference counter, stores its contained value on the heap and provides shared access to it, without any one owner. The held value is freed as soon as no more references to it exist. This makes it ideal for implementing pointers in a tree-walking interpreter.

However, as mentioned, a reference counter only provides **shared** access to the contained value. In Rust that implies not being able to mutate the value, as that would again break the borrowing rules. In order to still support mutable variables while having pointers to

<sup>&</sup>lt;sup>3</sup>Can be used after expressions returning a 'Result<\_, \_>' to return early in case they are the 'Err(\_)' variant.

them, we can make use of another type the Rust standard library conveniently provides. A 'RefCell<\_>' implements so-called *interior mutability*<sup>4</sup> by enforcing Rust's borrowing rules at runtime. By now wrapping values inside both a reference counter and a RefCell, it is possible to support mutable variables and pointers.

<sup>&</sup>lt;sup>4</sup>Types in Rust that have interior mutability allow mutation through shared references

# 3.2. Using a Virtual Machine

#### TODO: For more general info: [Wat17, p. 209]

Just like a tree-walking interpreter, a virtual machine presents a way of implementing an interpreter for a programming language. However, the way a virtual machine operates fundamentally differs from a tree-walking interpreter. For rush, we have implemented a virtual machine backend in order to compare it to the previously explained tree-walking interpreter.

#### 3.2.1. Defining a Virtual Machine

Often, one might encounter the term *virtual machine* when talking about emulating an existing type of computer using a software system. This emulation often includes simulating additional devices like the computer's display or its disk. In this context however, a *virtual machine*<sup>5</sup> is a software entity which emulates how a computer interprets instructions. Just like a real computer, a virtual machine executes low-level instructions directly. However, since the VM is unable to traverse the AST, it depends on a compiler generating its input instructions.

Since a physical processor and a virtual machine share some fundamental traits, the architecture of a virtual machine is often a slight deviation from the von Neumann architecture. The von Neumann architecture was first introduced by John Neumann in the year 1945. Von Neumann originally presented a design which allows implementing a computer using relatively few components. Following the von Neumann architecture, a processor would usually contain components like an  $ALU^6$ , a control unit, multiple registers, memory, and basic IO [Led20, p. 172]. The ALU is often designed so that it performs logical and mathematical operations as fast as possible. However, to keep its implementation simple, it lacks the ability to fetch and execute instructions from memory directly. Therefore, the processor contains a control unit which manages the fetch-decode-execute cycle. The fetch-decode-execute cycle is a simplification of the steps a processor performs in order to execute an instruction. The list below explains the individual steps of the fetch-decode-execute cycle.

#### TODO: Figure which shows the cycle

- (Fetch): The processor's control unit loads the next instruction from the adequate memory location. The instruction is then placed into the processor's internal instruction register where it is available for further analysis.
- (Decode): The processor's control unit examines the fetched instruction in order to determine if additional steps must be taken before or after instruction execution. Such steps may involve accessing additional registers or memory locations.
- (Execute): The control unit dispatches the instruction to a specialized component of the processor. The target component is often dependent on the type of instruction since each processor component is optimized with one specific type of instruction in mind. For instance, the control unit may invoke the ALU in order to execute a mathematical instruction.

A computer's processor performs this fetch-decode-execute cycle repeatedly from the moment it is powered on until the point in time when it is powered down again. For relatively simple processors, each cycle is executed in an isolated manner because instructions are executed in a sequential order. This means that the execution of the instruction i is delayed

<sup>&</sup>lt;sup>5</sup>May later be shortened to "VM"

<sup>&</sup>lt;sup>6</sup>Short for "arithmetic logic unit"

until execution of i-1 has completed [Led20, pp. 208-209].

For virtual machines, executing the input instructions in sequential order is often also the simplest solution. Often, a virtual machine executes instructions similarly to the fetch-decode-execute cycle. Although the von Neumann architecture is relatively simple, one does not always have to adopt it when implementing a virtual machine. Since virtual machines are purely abstract constructs, meaning that they are implemented using software, design constraints are usually kept to a minimum. Therefore, a virtual machine can also be implemented with the high-level constructs of the source language in mind. For instance, the VM might feature specialize break, continue, or loop instructions which are not present in modern-day CPUs. Designing the architecture of a virtual machine can sometimes be a challenging task since choosing an adequate set of features may involve a lot of testing iterations. Because neither the compiler nor the VM exist in the beginning, one should carefully plan the implementation of their VM's architecture. From the point where the architecture is clear, implementing the VM should normally be a straight-forward task.

#### 3.2.2. Register-Based and Stack-Based Machines

One of the main decisions to be made when designing a VM is how it implements temporary storage. Physical processors often use registers in order to make larger computations feasible. Registers are a limited set of very fast, low capacity storage units. On modern architectures, like  $x86\_64$ , each general-purpose register is able to hold as much as 64 bits of information. However, there is always only a limited amount of registers available since they are physical components of the computers CPU. Therefore, programs often only utilize registers for storing temporary values, such as intermediate results of a large computation. The main alternative to using registers is a stack-based design. A popular example for a stack-based virtual machine is WebAssembly [Sen22, p. 44]. For more information on WebAssembly, we will present a compiler targeting WebAssembly in Chapter 4. For compiler writers, register allocation is often a demanding task. This problem is described in more detail in Chapter 5. Since register allocation is not required in a compiler targeting stack-based machines, its implementation is often significantly easier compared to a compiler targeting a registerbased machine. Therefore, one might choose to implement a stack-based virtual machine in order to minimize complexity of both the compiler and the interpreter. However, a stackbased design also introduces several issues on its own. For example, register-based machines might regularly outperform stack-based machines. A reason for this is that use of the stack usually requires a lot of push or pop operations which could have otherwise been omitted.

## 3.2.3. Comparing the VM to the Tree-Walking Interpreter

One significant benefit of virtual machines is that they execute programs much faster compared to most tree-walking interpreters. A reason for this speedup is that tree-traversal involves a lot of overhead which is omitted when instructions are interpreted directly. The code in Listing 3.6 displays a recursive function implemented in rush.

```
5 fn rec(n: int) -> int {
6    if n == 0 {
7       0
8    } else {
9       rec(n - 1)
10    }
11 }
```

Listing 3.6 – A Recursive rush Program



Figure 3.2 – Abstract Syntax Tree and VM Instructions of a Recursive rush Program

Figure 3.2 displays a heavily simplified syntax tree and rush VM instructions representing the function displayed in Listing 3.6. The root node of the tree represents the rec function. Since the function only contains a single expression, the if-expression node is the only child of the root node. The if-expression contains a condition, an if-branch, and an else-branch. Since the function should not call itself again if n is equal to 0, the if-branch returns 0. In the else-branch however, the rec function calls itself recursively. When the above program is executed using the tree-walking interpreter, the algorithm traverses the entire tree of the rec function every time it recurses. Since rec is a recursive function, the tree-walking interpreter would have to traverse it n times. In this example, the AST of the program is relatively simple. However, the complexity of the tree grows as the source program evolves. Since loops and recursive functions execute the code in their bodies repeatedly, the tree traversal of the body presents an inefficiency. Here, the inefficiency solely lies in the repeated tree-traversal, not in the repetition introduced by an iterative or recursive algorithm. In order to improve efficiency, an algorithm could traverse the AST once, saving its semantic meaning in the process. Then, the semantic meaning of the previously traversed tree could be interpreted repeatedly without the additional overhead. This behavior is used in the rush VM since it interprets instructions previously generated by a compiler. A compiler targeting the virtual

machine's architecture first traverses the AST and outputs a sequence of instructions. The instructions on the right side of Figure 3.2 represent the program in Listing 3.6. Every time the call instruction in line 23 is executed, the VM only needs to jump to the instruction in line 10 in order to execute the rec function recursively. Since repeated traversal of the syntax tree is omitted, rush programs will run significantly faster using the VM compared to the tree-walking interpreter. Using the VM, executing the rec function using an input of n=1000 took around 160  $\mu$ s. However, executing the identical code using the tree-walking interpreter took around 427  $\mu$ s<sup>7</sup>. Therefore, the rush VM executed the identical code roughly 2.6 times faster than the tree-walking interpreter. However, the initial delay caused by compilation was not considered in this benchmark.

#### 3.2.4. The rush Virtual Machine

The rush virtual machine is a stack-based interpreter implemented using the Rust programming language. The machine's architecture is completely fictional and includes a *stack* for storing short-term data, *linear memory* for storing variables, and a *call stack* for managing function calls. Like most virtual machines, the rush VM uses a fetch-decode-execute cycle in order to interpret its programs.

```
pub struct Vm<const MEM_SIZE: usize> {
16
17
        /// Working memory for temporary values
        stack: Vec<Value>,
18
        /// Linear memory for variables.
19
        mem: [Option<Value>; MEM_SIZE],
        /// The memory pointer points to the last free location in memory.
21
        /// The value is always positive, however using `isize` is beneficial in order
       to avoid casts.
        mem_ptr: isize,
23
        /// Holds information about the current position (like ip / fp).
24
25
        call_stack: Vec<CallFrame>,
   }
26
```

Listing 3.7 – Struct Definition of the Vm

Listing 3.7 displays the struct definition of the rush VM. In line 18, a field called 'stack' is declared. Like explained above, the rush VM uses a stack for storing temporary values. Even though the stack is implemented using a 'Vec', it behaves identical to a LIFO stack data structure. In line 20, the 'mem' field is declared. This field represents the linear memory capable of storing variables. However, each memory cell is implemented to hold an option of a value. Therefore, each memory cell may also hold a None value representing uninitialized memory. In line 23, the 'mem\_ptr' field is declared. It, too, serves an important role in managing the linear memory. The exact responsibilities of the so-called memory pointer will be explained shortly. Lastly, a field named 'call\_stack' is declared. This field also behaves like a stack and is responsible for managing function calls and returns. More information on this field will be provided in later parts of this section.

The instructions in Figure 3.2 can be interpreted by the rush VM. Here, the output program is structured as functions which each contain a list of their instructions. Since function and variable names are replaced by indices, strings can be entirely omitted in the output instructions. This often leads to a decrease in code size and an increase in runtime

<sup>&</sup>lt;sup>7</sup>Average from 10000 iterations. OS: Arch Linux, CPU: Ryzen 5 1500, RAM: 16 GB

speed. For better understanding, we have annotated the individual functions with their human-readable names.

The first block of instructions can be called the 'prelude' since its only task is to call the 'main' function and declare global variables. Global variables need to be initialized at the beginning of a program so that they can be accessed later in the program. If global variables were present in the example, the prelude would contain the instructions used for initializing them. If the prelude was omitted, the main function would instead contain these instructions since it is executed at program start. However, recursion of the 'main' function is legal in rush. Therefore, each time the 'main' function recurses, all global variables would be restored to their initial values. In order to prevent this bug, the rush VM uses a prelude function which is guaranteed to run only once.

Linear memory in the VM is represented as an array which saves the runtime value of a variable in each index. Since an array is used, the memory of the VM is limited. However, a large memory size if often enough to run most of the possible programs. In the rush VM, each storage cell can be accessed using two addressing modes. When using the absolute addressing mode, the exact index of the memory cell is specified. For instance, if the value of variable 'd' of Figure 3.3 was to be retrieved, the VM would need to access the storage cell with the index 4. However, the absolute position of a variable in memory can only be determined at runtime. In a recursive function, each recursion adds more variables to the scope, thus allocating more memory. Here, the exact number of recursions the function performs would have to be known at compile time. Of course, this presents an impossible task, thus making writing a compiler targeting the VM impossible. However, the rush VM also implements a relative addressing mode which can be used without knowledge about the absolute position of the memory cell. For instance, the variable 'd' can be addressed by index 0 using the relative addressing mode. In Figure 3.3, the memory pointer (shortened to "mp") is set to 4. Considering the value of the memory pointer, the absolute address of any relative address can be calculated at runtime. Here, the absolute address a is the sum of the relative index i and the runtime memory pointer m. Therefore, the absolute address of any relative address can be calculated at runtime like this: a = i + m. By also implementing this relative addressing mode, compilers targeting the rush VM can generate code without knowing the runtime behavior of a program.



Figure 3.3 – DRAFT: Linear Memory of the rush VM

In order to get a deeper understanding of the addressing modes, a practical example can be considered. The code in Listing 3.8 displays a rush program in which a pointer to a variable is created. First, the integer variable num is created. In line 3, a pointer variable called to\_num is created by referencing the num variable.

```
1 fn main() {
2    let mut num = 42;
3    let to_tum = #
4 }
```

**Listing 3.8** – Minimal Pointer Example in rush

In the rush VM, absolute addressing is only used for global variables and pointers. Since a pointer specifies the address of another variable, its runtime value will be the absolute address of its target variable. In the VM, the absolute address of a variable is calculated as soon as it is referenced using the & operator. For this purpose, the reltoaddr instruction exists. This instruction calculates the absolute address of its operand and pushes the result onto the stack. Here, the operand is the relative address of the variable to be referenced. Listing 3.9 shows the VM instructions generated from the rush program in Listing 3.8.

```
1 main:
2    setmp 2
3    push 42
4    svari *rel[0]
5    reltoaddr 0
6    svari *rel[-1]
```

**Listing 3.9** – VM Instructions for the minimal Pointer Example

The first instruction setmp (set memory pointer) increases the memory pointer by two. This is because the main function contains two local variables whose space is to be allocated at the start of the function. For instance, one might encounter mp being incremented by 0 since the corresponding function contains no local variables. The next instruction push pushes the value 42 onto the stack. In line 4, the svari (set variable immediate) assigns the top value on stack to the specified relative address. Here, 42 is popped off the stack since it is used by the svari instruction. Next, the instruction stores the previously popped value at the relative address at the relative address 0 specified in the operand. Now, the variable num with an initial value of 42 has been created. Next, the to\_num variable is created by referencing the num variable. In line 5, the reltoaddr (relative to address) instruction is used to calculate the absolute memory address of the num variable. This instruction calculates the absolute address of its operand at runtime using the algorithm described above. Then, the instruction pushes the calculated address onto the stack so that it can be used by following instructions. Here, the relative address 0 is used since the svari instruction has previously saved 42 at this location. Therefore, the value of the variable num is saved at the relative address 0. In line 6, the svari instruction is used again. This time, it is used to save the value of the to\_num variable. Since the absolute address of the referenced variable was previously calculated, it now exists on top of the stack. Now, the instruction saves the absolute address of num at the relative address -1. This is because the compiler targeting the VM assigns variables to higher relative addresses first. The compiler then progresses into lower relative memory as more variables of the function are declared. To summarize the above paragraph, this example uses relative addressing in order to declare local variables of a function. However, absolute addressing is also used when variables are referenced in order to create pointers. Therefore, each addressing mode serves a separate and important purpose.

| $\begin{aligned} & \text{prelude} \\ & fp = 0 \\ & ip = 1 \end{aligned}$ |  |  |
|--------------------------------------------------------------------------|--|--|
|--------------------------------------------------------------------------|--|--|

Figure 3.4 – Call Stack of the rush VM

## 3.2.5. How the Virtual Machine Executes A rush Program

By considering the minimal pointer example from above, we now have a rough idea how the VM might execute instructions. In order to get a better understanding of how the rush VM works exactly, we will explain how it executes the program in Listing 3.6. For this, we should consider the instructions in Figure 3.2 again. The first instruction of the prelude function is 'setmp'. This instruction adjusts the memory pointer by the amount specified in the instruction's operand. In this case however, the memory pointer remains unmodified since the operand of the instruction is 0. Next, the 'call 1' instruction calls the 'main' function. In order to understand how function calls work in this VM, we must consider the call stack of the rush VM. Before the call-instruction, the caller pushes any call-arguments onto the stack so that they can be used as parameters by the callee. Figure 3.4 displays the state of the VMs call stack after the 'call 1' instruction would be executed. During execution of a call-instruction, the VM pushes a new stack frame onto its call stack. Listing 3.10 shows how a call frame is implemented.

```
struct CallFrame {
    /// Specifies the instruction pointer relative to the function
    ip: usize,
    /// Specifies the function pointer
    fp: usize,
}
```

Listing 3.10 - Struct Definition of a CallFrame

In this implementation, each call frame holds two important pieces of information. In line 31 of Listing 3.10, the 'ip' field is declared. It specifies the *instruction pointer* which holds the index of the current instruction. Since the 'call' instruction was interpreted previously, the instruction pointer of the new call frame is set to 0 as execution should continue at the first instruction of the called function. The 'fp' field is declared in line 33. This field specifies the function pointer which holds the index of the current function.

After the function call, 'fp' is set to 1 since the main function is called and instruction should start at the first instruction of the main function. Figure 3.4 shows how the call stack of the VM looks like after the 'call' instruction has been interpreted. Function calls are managed in a stack in order to allow returning from functions. If the VM encounters a 'ret'<sup>8</sup> instruction, it should leave the current function. However, it should also know where to resume its fetch-decode-execute cycle. For this, the VM just simply pops the top element from its call-stack. Now, the top element on the stack contains the call-frame of the caller function. In this call frame, 'ip' still points to the 'call' instruction which was responsible for calling the function. Since 'ip' is incremented automatically after most instructions, the VM resumes instruction execution at the first instruction after the call-instruction. This way, function calls are implemented in a simple but robust manner.

Now that the call-instruction has been interpreted, the VM begins executing the first

<sup>&</sup>lt;sup>8</sup>Short for "return"

instruction of the main-function. Since the main-function only calls the rec function with the argument 1000, there are no new concepts to consider in this function. After the call instruction in line 7, the VM starts executing the instructions of the rec function. At the beginning of the rec function, the memory pointer is incremented by 1. This might seem erroneous since the rec function contains no visible variable declarations in its body. However, this behavior is correct since function parameters count as variable declarations. Since the function takes one parameter, the memory pointer is incremented by one cell. Next, the instruction svari saves the value of the parameter which was previously pushed onto the stack at the relative address 0. In line 12, the relative address of the memory cell containing the value of the parameter is pushed onto the stack. It is then consumed by the gvar instruction in line 13. At this point the top element on the stack contains an address-value referring to the target of the gvar instruction. Therefore, the instruction first pops the top element from the stack. In this case, the value of the popped element is the relative address 0. Then, the instruction retrieves the value of the target variable and pushes it onto the stack.

In line 14, the constant value 0 is pushed onto the stack. Next, the eq instruction pops two elements from the stack in order to test them for equality. Then, the result of the comparison is pushed onto the stack as a boolean value. In this case, the instruction compares if the current value of n is equal to 0. In line 16, the jmpfalse instruction is executed. This instruction jumps to the specified instruction index if the value on top of the stack is false. In this case, if the value on the stack is false, the parameter n was not equal to 0. Now, the VM would jump to the instruction in line 19. Here, the value of the parameter n is pushed onto the stack using the previously explained push and qvar instructions. Now, the top item on the stack is the value of the parameter n. In line 21, the push instruction pushes a constant 1 onto the stack. Next, the sub instruction pops the first two elements from the stack in order to subtract their values from each other. In this case, the instruction subtracts 1 from the value of n and pushes the result onto the stack. Next, the rec calls itself recursively using a previously explained call instruction. Since the call argument is the top element on the stack, the result of the subtraction is used as the argument of the recursive call. Next, the function decrements the memory pointer in order to deallocate used memory using the setmp instruction in line 24. At the end of a function, the memory pointer is always decremented by the amount it was incremented at the beginning of the function. By deallocating the now unused memory, the compiler prevents the code from leaking memory at runtime. Lastly, the ret instruction is used to return from the function. Now we have considered what happens if the value of n was not equal to 0.

However, if the result of the comparison in line 15 is true, meaning that n is equal to 0, the jmpfalse instruction in line 16 does nothing. In this case, the VM continues to the push instruction in line 17. Here, the constant value 0 is pushed onto the stack. Next, the VM interprets the jmp instruction in line 18. Unlike jmpfalse, this instruction performs its jump without any condition. In this case, the instruction jumps to the instruction at index 14 of the current function. The instruction at index 14 is setmp in line 24. Since functions also return values by placing them on top of the stack, the return-value would be 0 in this case. Since we have covered what the instructions in the lines 24 and 25 do, we can summarize that the function returns the value 0 in this case.

## 3.2.6. Fetch-Decode-Execute Cycle of the VM

Now that the semantic meaning of the instructions in Figure 3.2 has been explained, we will explain how the fetch-decode-execute cycle works in the VM. The code in Listing 3.11 displays the 'run' method of the rush VM.

```
pub fn run(&mut self, program: Program) -> Result<i64> {
168
         while self.call_frame().ip < program.0[self.call_frame().fp].len() {</pre>
169
             let instruction = &program.0[self.call_frame().fp][self.call_frame().ip];
170
171
             // if the current instruction exists the VM, terminate execution
172
             if let Some(code) = self.run_instruction(instruction)? {
173
174
                 return Ok(code);
175
             };
176
177
         0k(0)
178
     }
179
```

Listing 3.11 - The run Method of the rush VM

This method manages the entire fetch-decode-execute cycle of the VM. It is immediately apparent that this method looks relatively simple considering that it plays such of a vital role in the VM. Since the fetch-decode-execute cycle executes instructions repeatedly, the main construct in the function is a while-loop beginning in line 169. The condition of the loop checks that the current instruction pointer refers to a legal instruction inside the current function. This way, the VM comes to a halt if it reaches the end of an instruction sequence. In line 179, the next instruction to be interpreted is saved as the variable 'instruction'. This line represents the *fetch* step since the next instruction is fetched from memory and placed in a spot where it can be used by the following steps.

In the body of the loop, the current instruction is executed using the 'self.run\_instruction' method. This method is responsible for executing the previously fetched instruction. If execution of the instruction fails, the method returns a runtime error, such as an *integer-overflow* error. Furthermore, this method may return an optional integer representing the exit code of the program. If the method returns such a code, instruction execution comes to a halt instantly and the VM exists using the retrieved code. However, if the method returns none of these two possible types, the fetch-decode-execute cycle continues. What might seem odd is that in this code, one cannot observe the instruction pointer being incremented. Therefore, one might think that the VM stays in an endless loop without ever progressing to the next instruction. In order to answer the final question of how the current instruction is executed, and the instruction pointer is incremented, we will now examine the code in Listing 3.12.

```
fn run_instruction(&mut self, inst: &Instruction) -> Result<Option<i64>> {
181
182
         match inst {
183
             Instruction::Nop => {}
184
             Instruction::Push(value) => self.push(*value)?,
             Instruction::Jmp(idx) => {
191
192
                 self.call_frame_mut().ip = *idx;
                 return Ok(None);
193
194
             Instruction::Exit => {
261
                 return Ok(Some(self.pop().unwrap_int()));
262
263
             Instruction::Add => {
272
273
                 let rhs = self.pop();
                 let lhs = self.pop();
274
                 self.push(lhs.add(rhs))?;
275
             }
276
             // ...
357
         self.call_frame_mut().ip += 1;
358
         Ok(None)
359
360
```

Listing 3.12 - Parts of the run\_instruction Method of the rush VM

The code in Listing 3.12 displays parts of the run\_instruction method. This method mainly consists of a match-expression which determines which code to run depending on the current instruction. In this example, the implementations of several instructions are visible. In line 183, the 'Nop' instruction is matched. It is apparent that there is no instruction-specific code executed in this case since "nop" stands for "no operation". Therefore, the VM will ignore any 'Nop' instructions it encounters. In line 184, the code for the 'Nop' instruction is displayed. Since this instruction should push the operand onto the stack at runtime, the helper method 'self.push' is invoked. This method pushes the provided parameter onto the 'stack' field of the VM. However, the function first validates that the stack will not exceed its maximum capacity. If the call to push would cause the capacity of the stack to be exceeded, the method returns a runtime error describing the issue. In line 261, the code for executing the 'Exit' instruction can be seen. This instruction can be used to terminate the execution of the program prematurely. It is inserted by the compiler if it encounters a call to the 'exit' function.

In line 191, the code responsible for executing the jumping instruction 'Jmp' can be seen. This instruction only sets the instruction pointer to the target index specified in the instruction operand. Therefore, a jump involves very little overhead and is implemented using very little effort. For some special instructions, such as the jump-instructions, the instruction pointer should not be incremented since it would interfere with the jump. Since the instruction pointer is incremented at the end of the method, the return-statement in line 193 is used to terminate this method prematurely.

In line 272, the code responsible for executing the 'Add' instruction can be observed. This instruction first pops two elements from the stack since they represent the operands of the underlying mathematical computation. Then, the 'add' helper function is called on the left-hand side of the computation. This helper function then performs the actual addition computation. This way, the run\_instruction method stays organized and simple. Although just the code for addition is shown, most of the other infix-expressions are later executed

similarly. It is apparent that the execution of these instructions involves relatively little difficulty. After the instruction has been executed, the instruction pointer is finally incremented in line 358. If no error occurred during the execution of these instructions, nothing is returned since execution should proceed with the next instruction.

This method represents both the *decode* and *execute* step since it first matches (*decode*) and then interprets (*execute*) the current instruction. Surprisingly, the parts of the method shown in this example can all be understood using relatively little effort and show that implementation of a VM is usually not that demanding. Now that we have explained how some important parts of the rush VM work, the question of how its input instructions are generated remains. Therefore, the compiler targeting the rush VM is presented in the following chapter.

As a conclusion, a VM is often a reasonable approach if an interpreted programming language is to be implemented. The main advantages of a VM are increased speed, reduced memory usage at runtime, and less runtime errors due to type-checking performed by its compiler. The main downsides include the need for a compiler targeting the VM, thus making its implementation more demanding compared to a tree-walking interpreter. Furthermore, debugging the VM is often significantly more demanding than debugging a tree-walking interpreter. In the past, commercial software has shown that implementing a programming language to run on a virtual machine can indeed be used successfully and on a larger scale. For instance, the Java Virtual Machine (JMV) is the software component responsible for executing the compiled Java bytecode. Through the JVM, the compiled Java program preserves its platform independence whilst using the benefits introduced by a VM [Lin+14, Chapter 1.2].

# 4. Compiling to High-Level Targets

In the previous chapter, we have learned how an interpreted programming language can be implemented. Another method of implementing a programming language is to create a compiler for this language. However, the question of how such a compiler works exactly still remains.

# 4.1. How a Compiler Translates the AST

Often, a compiler traverses an AST generated by the analyzer in order to translate it to some sort of output. For each AST node, the compiler usually calls a separate function or method which is specialized in translating this specific node type. These individual functions often return some sort of value representing the translated node. Otherwise, each individual function may also insert generated instructions into an internal field of the compiler. Here, the



Figure 4.1 – Abstract Syntax Tree for '1+2 < 4'

newly generated instruction is inserted the output sequence of instructions. In this case, each function often also returns metadata about the previously compiled node. For instance, this data may include the register or memory location of a previously compiled expression, so that other AST nodes can refer to it later. In transpilers, meaning compilers translating one high-level language into another one, each node-specific function often returns a tree node representing code in the output language.

Listing 4.1 displays a simplified syntax tree of the rush expression '1+2 < 4'. The number after each expression represents the order in which most compilers would traverse this tree. Compilation of the expression starts at the root node of the tree. Here, most compilers will begin translation by first compiling the child nodes using post-order traversal. Post-order traversal is frequently used because the compilation of a node often depends on the output of its child nodes. In this example, translation of the root comparison expression depends on the information returned by compiling its left- and right-hand sides. Therefore, the compiler first considers the node 'Expression 3' which represents the add-expression 1+2. However, due to post-order traversal, this node is not actually traversed since the compiler skips straight to its child nodes. Therefore, the left child 'Expression 1' is traversed as the very first node. Next, its sibling node 'Expression 4' is traversed too. Since post-order traversal involves considering a root node after the traversal of its children, 'Expression 3' is traversed after its children have been considered. Here, the compiler considers the operand of the expression in order to generate the appropriate output instruction. Therefore, the instruction responsible for the addition is inserted after the Expression node 3 has been traversed. Since 'Expression 3' and its children are now completely traversed, and its output instruction has been inserted, the compiler now considered the right-hand side of the comparison. Here, the node 'Expression 2' only consists of the constant integer value 4. Now, all child nodes of 'Expression 5'

have been traversed, thus the compiler now considers this node itself. Here, the compiler notices that the expression should check if the left-hand side is less than the right-hand side. Therefore, the compiler inserts an instruction performing this comparison using the results of the left and right child as the operands. Since the compiler must be aware of the operands of the instruction, each method or function involved in the tree-traversal returns an entity describing the location of the runtime value of its previously compiled node. For instance, if the target architecture uses registers, every method translating an expression must return the register containing the value of the expression at runtime. Therefore, such a describing entity can either be a register or a memory location describing the location a value which a node may yield. By returning information like this, a root node will have information about its children after they have been traversed. A root node might rely on the values returned by its children, therefore it is traversed at last, thus creating the demand for post-order traversal.

```
1 r0 = 1

2 r1 = 2

3 r2 = add r0, r1

4 r3 = 4

5 r4 = lt r2, r3
```

**Listing 4.1** – Simple Pseudo-Instructions For a Fictional Architecture

Listing 4.1 displays a sequence of instructions for a fictional architecture. This sequence could have been generated from the previously discussed tree in Figure 4.1. It is apparent that the order of instructions matches the order in which the tree was traversed. The instructions in line 1 and 2 represent the tree-nodes 'Expression 1' and 'Expression 2' respectively. Here, the value 1 is assigned to a register named 'r0' whilst the value 2 is assigned to the register 'r1'. The add instruction in line 3 appears after the instructions in line 1 and 2 since their tree nodes were traversed first. Furthermore, the instruction uses register r0 and r1 as its operands and therefore depends on them containing a value. Therefore, the add instruction can use the registers returned by compiling its child nodes as its operands. Next, the constant integer value 4 is assigned to the register 'r3'. Lastly, the comparison instruction 'lt' is inserted using the result of the addition and the register containing 4 as its operands. Here, it is apparent that the instruction generated by the node which was traversed at last is also inserted at the end.

Therefore, using post-order traversal in order to generate output instructions targeting a register-based architecture is often required. This example illustrates how a simple compiler might operate. However, a similar algorithm is often found in even the most complex compilers.

## 4.1.1. The Compiler Targeting the rush VM

Since the rush VM interprets instructions directly, there must be a compiler targeting its architecture. For this purpose, we have implemented a compiler translating rush source code into instructions which can be understood by the VM. Since the VM's architecture was developed with the features of rush in mind, the compiler sometimes requires surprisingly little effort for translating certain AST-nodes. For instance, the compiler translates infix-expressions, such as 'n + m', into instructions using the infix\_expr method. This method is displayed in Listing 4.2.

```
fn infix_expr(&mut self, node: AnalyzedInfixExpr<'src>) {
522
523
         match node.op {
             InfixOp::Or | InfixOp::And => {
524
536
             op => {
537
                  self.expression(node.lhs);
538
539
                  self.expression(node.rhs);
                  self.insert(Instruction::from(op));
540
541
         }
542
     }
543
```

Listing 4.2 – Compilation of Infix-Expressions Targeting the VM

This method differentiates between the 'Or' / 'And' operators and other possible operators since the former require some extra code to be emitted. However, due to simplicity, the focus only lies on the compilation of the latter, meaning any other type of infix-expression. In line 538, the left-hand side expression is compiled first. In the next line, the right-hand side is compiled too. Finally, in line 540, the matching instruction representing the infix-operator is inserted. The final instruction is generated by a helper function which converts an infix-operator into a matching instruction. It is to be mentioned that most of the other rush compilers required significantly more code in order to implement the translation of infix-expressions. The reason for the simplicity of this implementation is the fact that the rush VM implements one instruction for each infix-operator. Since this VM implements all features of the source language, complex hacks for implementing some operations are not required and thereby simplify the compiler's implementation.

```
fn expression(&mut self, node: AnalyzedExpression<'src>) {
445
446
         match node {
447
             AnalyzedExpression::Int(value) =>
         self.insert(Instruction::Push(Value::Int(value))),
             AnalyzedExpression::Infix(node) => self.infix_expr(*node),
457
             AnalyzedExpression::Call(node) => self.call_expr(*node),
458
             AnalyzedExpression::Grouped(node) => self.expression(*node),
459
         }
460
    }
461
```

**Listing 4.3** – Compilation of Expressions Targeting the VM

The code in Listing 4.3 shows parts of the 'expression' method of the rush VM compiler. When we examine the method's signature, it becomes apparent that it consumes an AnalyzedExpression. However, the method does not return anything which represents the runtime value of the expression. This is possible because the results of the expressions displayed in the snipped are pushed onto the stack directly. In line 447, the code for translating a constant integer expression can be seen. Here, a push instruction using the constant int value as its operand is used. By pushing the values of expressions onto the stack directly, most tree-traversing methods do not need to return values. Due to this, short and elegant code like the one in Listing 4.2 can be implemented. In other compilers, the method responsible for compiling expressions would usually return the register which contains the value of the compiled expression at runtime. Due to the values being saved on the VM's stack, other parts of the compiled program can still use the runtime values of compiled expressions. In line 457, the method for compiling call-expressions would be invoked. Similar to

the 'expression' method in the analyzer, this method also dispatches the traversal of more complex expressions to specialized methods in order to keep this method simple.

Another reason for the compiler's simplicity regarding infix-expressions is that the rush VM includes a special instruction for the mathematical power operation. In rush, the expression 'n \*\* m' can be used to denote following mathematical term:  $n^m$ . Since many real architectures lack such a power instruction, most other rush compilers demanded implementation of special edge-cases in order to make compiling power-expressions feasible. On the opposite, implementing a rush compiler targeting the VM has proven to be less demanding since the VM supports a power-instruction. Furthermore, the VM also includes an 'exit' instruction which terminates the fetch-decode-execute cycle instantly. These examples showed how a carefully chosen target architecture simplifies the implementation of its compiler by a great deal.

However, there is also one aspect of the VM which made implementation of the compiler targeting the VM more demanding than usual. For instance, in most Assembly dialects, labels can be used to allow jumps between blocks of code. However, the VM intentionally does not support the use of such labels. Since the VM would have to look up the exact instruction index of a label at runtime, each jump targeting a label would involve some additional overhead. This overhead is eliminated by the assembler during assembly of a program. Since the assembler performs these lookups during translation, the CPU does not have to deal with label lookups at runtime. Like seen in the previous examples, jumping VM instruction require the exact index of the target instruction as their operands. Therefore, the exact target index to which the instruction should jump must be known. To illustrate this issue, we will consider how loops are implemented in the VM. The rush code in Figure 4.2 presents a program containing a loop. In the loop's body, the variable 'n' is incremented by 1. Next, the 'break' keyword is used to terminate loop execution. Therefore, the total amount of iterations is 1.

```
setmp 1
                                                  push 0
   fn main() {
                                                  svari *rel[0]
1
2
       let mut n = 0;
                                                  push *rel[0]
                                                  clone
3
       loop {
4
            n += 1;
                                                  gvar
5
            break;
                                              7
                                                  push 1
                                                  add
   }
7
                                              9
                                                  svar
                                                  jmp 11
                                              10
                                                  jmp 3
```

Figure 4.2 – Representation of Loops in the VM

The rush VM instructions of the 'main' function are displayed on the right side of Figure 4.2. Here, lines 2 and 3 are responsible for declaring a variable named 'n'. The instructions in the lines 4–9 are used to increment 'n' by 1. A new instruction which we have not covered so far is the 'clone' instruction. This instruction clones the top item on the stack it without prior calls to 'pop'. Therefore, after the instruction has been executed, two identical values exist on the top of the stack. This instruction is only used in assign-expressions in order to duplicate the address value of the assignee variable.

After 'n' was incremented, the instruction in line 10 jumps to the instruction index 11. However, the last valid index is 10, it is represented by the 'jmp 3' instruction in line 11. If this jump is executed, the VM has no next instruction to fetch and therefore stops its fetch-decode-execute cycle. Since this instruction jumps to a position outside the loop, it

represents the break statement in line 5 of the source program. The 'jmp' instruction in line 11 is responsible for the repetition introduced by the loop. This instruction jumps to the first instruction of the loop's body in line 4. Therefore, the instructions inside the loop's body are executed repeatedly. The difficulty presented by this design is that the index of the jump's target instruction must be known before the target instruction is inserted. The code in Listing 4.4 displays a part of the method responsible for compiling loops for the rush VM.

```
320
    fn loop_stmt(&mut self, node: AnalyzedLoopStmt<'src>) {
321
         // save location of the loop head (for continue stmts)
         let loop_head_pos = self.curr_fn().len();
322
         self.loops.push(Loop::default());
323
         self.block(node.block, true);
331
         self.insert(Instruction::Jmp(loop_head_pos));
337
338
         // correct placeholder `break` / `continue` values
339
         let loop_ = self.loops.pop().expect("pushed above");
340
         let pos = self.curr_fn().len();
341
         self.fill_blank_jmps(&loop_.break_jmp_indices, pos);
342
         self.fill_blank_jmps(&loop_.continue_jmp_indices, loop_head_pos);
343
    }
344
```

Listing 4.4 – Implementation of Loops in the rush VM Compiler

In line 331, the loop's body is compiled and instructions generated during this process are inserted into the output sequence. The next statement in line 337 inserts an instruction responsible for jumping back to the start of the loop's body. The variable 'loop\_head\_pos' was previously defined and species the index of the first instruction of the loop's body. Therefore, this inserted instruction performs continuous jumps, thereby introducing the repetition for which the loop is desired.

In line 340, the top loop is popped from the 'loops' stack. In line 322, this loop was previously pushed onto the identical stack. This stack is an internal field used by the compiler in order to save information about loops. The top item on this stack always represents the loop currently traversed by the compiler. Each loop item saves two lists, each containing the indices of jump-instructions whose target index needs to be adjusted. The first list contains the indices of jump-instructions generated by 'break' statements which were encountered during traversal of the loop's body. The second list saves the indices of jump-instructions emitted by 'continue' statements inside the loop's body. For instance, if the compiler encounters a 'break' statement, the code in Listing 4.5 is executed.

```
fn statement(&mut self, node: AnalyzedStatement<'src>) {
253
             AnalyzedStatement::Break => {
268
                 // the jmp instruction is corrected later
269
                 let pos = self.curr_fn().len();
270
                 self.curr_loop_mut().break_jmp_indices.push(pos);
271
                 self.insert(Instruction::Jmp(usize::MAX));
272
273
             // ...
         }
287
    }
288
```

Listing 4.5 - Compilation of 'break' Statements in the rush VM Compiler

Here, the 'pos' variable saves the index of the jump-instruction to be inserted. In line 271, this index is then inserted into the list containing the placeholder indices of the current loop. Lastly, a 'jmp' instruction is inserted containing a placeholder target index. Therefore, at the end of the compilation of a loop's body, there will be a list containing the indices of instructions whose target indices need to be adjusted. In line 342 of Listing 4.4, the 'self.fill\_blank\_jmps' method is used to set the target indices of the specified jump-instructions to 'pos'. We will omit the explanation of this method because it only iterates over the passed list of indices, replacing the target of the jump-instruction at the current index during the process.

As a conclusion, design, and implementation of the compiler targeting the rush VM has presented itself as a reasonable task. Altering the target architecture to mitigate difficulties which occurred during the implementation of the compiler was often extremely helpful. Therefore, compared to the rush compiler targeting RISC-V, implementation of this compiler was significantly simpler. Furthermore, the rush VM uses a stack-based design which made implementing its compiler less demanding as well.

# 4.2. Compilation to WebAssembly

- 1. what is WASM and why
- 2. modules
  - binary and text format (sections, leb128)
  - globals, functions, imports, exports, ...
  - uncommon to target binary format
- 3. WASI
- 4. basic implementation
  - leb128
  - sections
  - self.function\_body
- 5. example program

The first 'external' compilation target presented here is *WebAssembly*, or *WASM* for short. **TODO:** research origins and goals of WASM and explain them here Unlike the name implies, WebAssembly is not only used in web applications. By itself, it is only a specification that can be implemented by runtimes in any context. Most modern browsers include

such a WebAssembly runtime, but there are also standalone ones, for example wasmtime and wasmer.

### 4.2.1. WebAssembly Modules

Every valid WebAssembly file must contain exactly one module. **TODO: confirm** The WebAssembly specification defines two different representations for these modules. First, there is a human-readable text representation, called  $WAT^1$ , closely resembling S-Expressions<sup>2</sup>. This is comparable to assembly languages for CPU architectures and is the typical target for compilers. Secondly, WebAssembly modules can also be represented using its binary format, which is optimized for size and comparable to the final binary files produced by assemblers. Most often these binary modules are constructed from a text module by using a tool such as wat2wasm from the  $WebAssembly\ Binary\ Toolkit\ (WABT)$ . However, the rush WebAssembly compiler instead opts to target the binary format directly, highlighting a few reasons for why most compilers should not do this. Listing 4.6 and Listing 4.7 on page 45 show the same basic WebAssembly module once as WAT and once as a commented hex dump of the same module in its binary representation as produced by https://webassembly.github.io/wabt/demo/wat2wasm/.

# 4.3. Using LLVM for Code Generation

LLVM is a software project intended to simplify the construction of a compiler generating highly-performant output programs. It originally started as a research project by Chris Lattner for his master's thesis at the University of Illinois at Urbana-Champaign [Lat02]. Since then, the project has been widely adopted by the open source community. In 2012, the project was rewarded the ACM Software System Award, a prestigious recognition of significant software which contributed to science. From the point when popularity of the framework grew, it was renamed from Low Level Virtual Machine to the acronym it is known by today. Today, it can be recognized as one of the largest open source projects [CA14, preface]. Among many other projects, the Rust programming language depends on the LLVM compiler in order to generate its target-specific code [McN21, p. 373]. Furthermore, the Clang C / C++ compiler uses LLVM as its code generating backend [Hsu21, preface]. Therefore, production ready compilers for popular programming languages have been implemented using the LLVM framework. Besides open-source projects, many companies have also incorporated LLVM in their commercial software. For instance, since 2005, Apple has started incorporating LLVM into some of its products [Fan10, pp. 11-15]. A recent example of software developed by Apple which uses LLVM is the Swift programming language which is mainly used for developing IOS apps [Hsu21, preface].

# 4.3.1. The Role of LLVM in a Compiler

In a compiler system, LLVM is responsible for generating target-specific code and performing optimizations. The framework is known for performing very effective optimizations during code generation so that the translated program runs faster at runtime and uses less memory. In order to use LLVM, the system provides and API which is usable by earlier steps of compilation. Typically, a compiler frontend only analyzes the source program to create an AST. Next, a separate step of compilation invokes the LLVM framework which carries on

<sup>&</sup>lt;sup>1</sup>WebAssembly Text

<sup>&</sup>lt;sup>2</sup>TODO: what are S-Expressions

```
1 (module
2 (func (export "addTwo") (param i32 i32) (result i32)
3    local.get 0
4    local.get 1
5    i32.add))
```

Listing 4.6 – Simple WebAssembly Module in Text Representation

```
0000000: 0061 736d
                                      ; WASM_BINARY_MAGIC
   0000004: 0100 0000
                                      ; WASM_BINARY_VERSION
   ; section "Type" (1)
                                      ; section code
   0000008: 01
   0000009: 07
                                      ; section size
   000000a: 01
                                       ; num types
7
   ; func type 0
   000000b: 60
                                      ; func
                                      ; num params
   000000c: 02
                                      ; i32
   000000d: 7f
10
                                      ; i32
11
   000000e: 7f
                                      ; num results
   000000f: 01
12
                                      ; i32
   0000010: 7f
13
14 ; section "Function" (3)
                                      ; section code
15 0000011: 03
16 0000012: 02
                                      ; section size
17 0000013: 01
                                      ; num functions
  0000014: 00
                                      ; function 0 signature index
18
   ; section "Export" (7)
19
                                      ; section code
   0000015: 07
20
                                       ; section size
   0000016: 0a
21
                                      ; num exports
   0000017: 01
                                      ; string length
   0000018: 06
23
   0000019: 6164 6454 776f
                              addTwo ; export name
24
                                      ; export kind
   000001f: 00
25
   0000020: 00
                                       ; export func index
27
   ; section "Code" (10)
28
   0000021: 0a
                                      ; section code
                                      ; section size
29
   0000022: 09
   0000023: 01
30
                                      ; num functions
31 ; function body 0
32 0000024: 07
                                     ; func body size
   0000025: 00
                                      ; local decl count
33
34 0000026: 20
                                      ; local.get
                                      ; local index
35 0000027: 00
  0000028: 20
                                      ; local.get
                                      ; local index
   0000029: 01
37
                                      ; i32.add
   000002a: 6a
   000002b: 0b
39
                                       ; end
   ; section "name"
40
   000002c: 00
                                      ; section code
41
                                      ; section size
   000002d: 0a
                                      ; string length
   000002e: 04
   000002f: 6e61 6d65
                                name ; custom section name
                                      ; local name type
45
   0000033: 02
                                      ; subsection size
   0000034: 03
46
47
   0000035: 01
                                       ; num functions
48 0000036: 00
                                       ; function index
   0000037: 00
                                       ; num locals
49
```

Listing 4.7 – Simple WebAssembly Module in Binary Representation

from this point. This component traverses the AST and uses the API of LLVM in order to construct an intermediate representation of the program. This way, the framework will be able to understand the semantic meaning of the program. Next, LLVM compiles the input program to an output which is specific to an arbitrary target architecture. As of today, LLVM features many target architectures so that a compiler designer does not have to worry about portability of the output program [Hsu21, preface]. Listing 4.3 shows how LLVM integrates into the previously discussed steps of compilation. Therefore, the framework represents the back end component of a compiler.



Figure 4.3 – Steps of Compilation When Using LLVM

#### TODO: continue here

The updated Figure 4.3 now includes two new steps: "LLVM IR generation" and "LLVM backend". The first step generates the target-independent input used by the second step to generate target-specific code. This step traverses the AST in order to generate a semantically equivalent program formulated using LLVM's intermediate representation. A compiler writer must only implement the first three steps of this figure as the last step is represented by the framework itself.

### 4.3.2. The LLVM Intermediate Representation

The LLVM intermediate representation (IR) represents the source program in a low-level manner. However, even though this IR is low-level, it is still target-independent. Furthermore, the IR also contains detailed type information which is usually uncommon for low-level representations of a program. Therefore, high-level type information is preserved while the benefits of a low-level representation are introduced. This allows LLVM to perform significant more aggressive optimizations compared to other compiler solutions or frameworks. Therefore, programs compiled using LLVM as the backend will often run significantly faster due to the many aggressive optimizations introduced by the framework.

LLVM provides many APIs for interacting with the IR in memory. This way, it can be created by a frontend without the need for a separate file containing the IR. If the file containing the IR is written and read by the individual parts of the compiler, the same performance issues introduced by multipass compilation would emerge. Therefore, LLVM provides official APIs for the C++ and C programming languages. However, there are many unofficial bindings, such as for Rust, Go, or Python. For instance, a compiler frontend written in Rust can leverage LLVM, although the system is written in C++. Since LLVM must be able to perform complex program analysis before it can optimize a program, its IR introduces many rules and constraints. For instance, a program formulated using the IR should always obey the following hierarchy:

- The top most hierarchical structure is the so-called *module*. It represents the current file being compiled.
- Each module contains several *functions*. Often, each function in the source program is represented using a function in the IR.
- Each function contains several basic blocks. A basic block contains a sequence of

- instructions. Blocks should always be terminated using a jump, return, or unreachable instruction. However, a basic block must never be terminated twice.
- Each basic block contains a sequence of *instructions*. Each instruction holds a semantic meaning and represents a part of the source program. For instance, LLVM provides instructions for integer addition or function calls.

[Hsu21, p. 211-213].

The IR provides a low-level enough representation in order to allow optimizations in the early stages of compilation. However, due to the high-level type information contained in the IR, LLVM is able to perform many aggressive optimizations on the IR during later stages of compilation. This way, the framework can communicate a lot of information to the linker which can then use this information for link-time optimizations. The virtual instruction set of LLVM is therefore designed as a low-level representation with high-level type information. This instruction set describes a virtual architecture which is able to represent an abstraction for most of the common types of processors. Although it is low-level, the IR avoids machine specific constraints like registers or calling conventions. Instead, the virtual architecture provides an infinite set of virtual registers which can hold the value of primitives like integers, booleans, floating-point numbers, and pointers. All registers in the IR use the  $SSA^3$  form in order to allow more optimizations. In order to enforce the correctness of the type information included in the IR, the operands of each instruction all obey LLVM's type rules. Therefore, LLVM only processes a program which contains valid type information [Lat02, p. 14-17].

In order to understand how a program can be formulated using the LLVM IR, we consider the rush program for calculating Fibonacci numbers again. For reference, the rush program used in this example can be found in Listing 1.1 on page 3. The code in Listing 4.8 displays the identical rush program formulated in LLVM IR. The IR was generated by the LLVM targeting rush compiler<sup>4</sup>. This compiler is presented in a later chapter since right now, only its output is of relevance. The IR displayed in this listing shows a module named 'main'.

<sup>&</sup>lt;sup>3</sup>Short for "static single assignment", widely used in optimizing compilers

<sup>&</sup>lt;sup>4</sup>Generated in Git commit 7f28ebc, automatically built with this document

```
; ModuleID = 'main'
1
   source_filename = "main"
2
   target triple = "x86_64-alpine-linux-musl"
3
4
   define internal i64 @fib(i64 %0) {
5
6
   entry:
     %i_lt = icmp slt i64 %0, 2
7
8
      br i1 %i_lt, label %merge, label %else
9
                                                         ; preds = %entry, %else
10
   merge:
     %if_res = phi i64 [ %i_sum3, %else ], [ %0, %entry ]
11
      ret i64 %if_res
12
13
                                                         ; preds = %entry
14
      \%i_sum = add i64 \%0, -2
15
      %ret_fib = call i64 @fib(i64 %i_sum)
16
17
      \%i_sum1 = add i64 \%0, -1
      %ret_fib2 = call i64 @fib(i64 %i_sum1)
18
      %i_sum3 = add i64 %ret_fib2, %ret_fib
19
      br label %merge
20
   }
21
22
   define i32 @main() {
23
   entry:
24
25
      %ret_fib = call i64 @fib(i64 10)
26
      call void @exit(i64 %ret_fib)
      unreachable
27
28
29
   declare void @exit(i64)
```

**Listing 4.8** – LLVM IR Representation of the Program in Listing 1.1

In the lines 5 and 23, two functions are defined using the 'declare' keyword. It is apparent that the functions in the LLVM module represent the functions from the source rush program. The function's name in the IR matches the name in the source file as it increases readability of the program. When examining the signature of the 'fib' function in line 5 of the IR, it becomes apparent that the function returns a runtime value of the type 'i64'. In rush, the 'int' type represents a 64-bit signed number. Therefore, the i64 LLVM type represents the rush int type. Furthermore, we can observe that the function takes a parameter named %0 of the type 'i64'. It represents the 'n' parameter in the rush source program.

In line 6, the start of the "entry" block of the 'fib' function is declared using the block's name followed by a colon. Since LLVM can perform more optimizations on variables if they are declared in the 'entry' block of a function, the rush compiler uses the 'entry' block solely for variable declarations. In line 7, the 'icmp slt' is used in order to compare the runtime value of the parameter '%0' to a constant 2. The boolean result is then saved in a virtual register named '%i\_lt'. Since LLVM's virtual registers may have arbitrary names, the rush compiler uses names which will make reading of the generated IR easier. In line 8, the block is terminated using the 'br' instruction. The instruction will only jump under the condition that the value of %i\_lt is true. Here, the 'merge' and the 'else' labels are used as operands of the branch-instruction. Conditional jumps in LLVM always require an alternative jump to perform if the condition is false at runtime. Due to constraints introduced by its internal optimizations, LLVM only allows jumps to target blocks contained in the same function.

<sup>&</sup>lt;sup>5</sup>Short for "integer compare (signed less than)"

<sup>&</sup>lt;sup>6</sup>Short for "branch"

Therefore, two labels of blocks in the current function are used as the operands of this instruction. As the names 'merge' and 'else' suggest, this branch-instruction presents the essential part of the if-expression in the source program. If the condition was true at runtime, the instruction would jump to the 'merge' block in line 10. What might seem odd is that there is no 'if' block. In fact, the rush compiler has even compiled this block into LLVM IR. However, since that block only jumped to the 'merge' block, LLVM's optimizations removed it entirely.

In line 11, of the 'merge' block, the 'phi' instruction is used. These so called  $\phi$ -nodes are necessary due to the SSA form used in the IR. In short, a phi-node produces a different value depending on the basic block where control came from. Since the if-construct is an expression in rush, LLVM must know if the result of the 'entry' or the 'else' branch is to be used as the result of the entire if-expression. As a solution to this problem, these phi-nodes associate a value to an origin branch. In this example, the phi-node yields the value of the parameter '%0' (n) if control came from the 'entry' block. In the source program, 'n' should be returned without modification if it is less than 2. Therefore, the runtime result of the phi-node is '%0' if it is less than 2 at runtime. Otherwise, if control came from the 'else' block, the phi-node's result is taken from the virtual register '\%i\_sum3'. However, we have not covered where this virtual register is declared. For this, we consider the instructions in the 'else' block, starting in line 15 with the 'add' instruction. In this case, the instruction subtracts 2 from the parameter '%0' and saves the result in '\%i\_sum'. However, an addition instruction using a negative operand is used since LLVM's optimization decided that this instruction is likely beneficial. This is done in order to create the argument value for the first recursive call to 'fib'. Next, the 'call' instruction is used in order to perform the recursive call. Here, the '\%i\_sum' register is used as an argument to the call-instruction. The return value of the function call is saved in the '\%ret\_fib' register. The same behavior is used in order to call 'fib(n - 1)'. However, in that case, 1 is subtracted from the parameter and saved in '\%i\_sum1'.

Next, the 'add' instruction in line 19 is used in order to calculate the sum of the return values of the recursive calls This sum is then saved in the virtual register '\%i\_sum3'. Therefore, this register is used in the phi-node in line 11 so that the result of the recursive calls is used as the result of the if-expression. Next, the 'br' instruction jumps to the 'merge' block. However, this jump happens unconditionally since the instruction does not consider a condition and only has one target label. After the jump to the 'merqe' block, the previously explained  $\phi$ -node is encountered. Finally, the 'ret' instruction in line 12 is used in order to use the result of the if-expression as the return-value of the function. Since the 'main' function does not introduce any new concepts, we will omit detailed explanation of its contents. However, in line 27, the 'unreachable' instruction is used in order to state that it is never executed. This is necessary because LLVM requires that every basic block is terminated at its end. The 'exit' function terminates the program using a system call and therefore terminates the basic block. However, LLVM does not regard call-instructions as diverging and therefore disallows the call to 'exit' as a way to terminate the basic block. Since LLVM does not know that the 'exit' function terminates program execution, an 'unreachable' instruction is inserted to communicate a block termination to LLVM.

It is to be mentioned that the original IR generated by the rush compiler looks slightly different because LLVM has already performed all of its aggressive optimizations on this code. By considering the example from above, it became apparent that the IR represents many source language constructs in a high-level way. For instance, function calls can be used without considering the complex rules introduced by low-level calling conventions. Here, calling and returning from a function can be implemented using very little effort. Furthermore, virtual registers allow the compiler frontend to omit register allocation entirely. Lastly, the

LLVM IR can subjectively be seen as very readable since registers, basic blocks, and functions may contain custom, human-readable labels. Moreover, most instructions have a relatively reasonable name which allows readers to guess what the instruction is doing without them reading any LLVM documentation.

## 4.3.3. The rush Compiler Using LLVM

In order to get acquainted to the LLVM framework practically, we have implemented a rush compiler which uses the framework as its backend. However, the first problem emerged soon since the LLVM project only provides official C / C++ bindings to be used by other programs. Nonetheless, the entire rush project is written in the Rust programming language. Therefore, a third-party Rust wrapper around LLVM is required. We have settled on using the *Inkwell* Rust crate since it exposes a safe rust API for using LLVM for code generation [Kol17].

This compiler uses the annotated AST generated by the semantic analyzer in order to translate it into LLVM IR. Here, each type of AST node is translated using its own individual function. For instance, an expression AST node is translated into IR by the expression method of the compiler. This way, translation of individual AST nodes can be organized in order to increase maintainability. To understand how this rush compiler leverages LLVM in order to translate programs, we should first consider some implementation details. The code in Listing 4.9 displays the top part of the 'Compiler' struct definition.

```
pub struct Compiler<'ctx, 'src> {
    pub(crate) context: &'ctx Context,
    pub(crate) module: Module<'ctx>,
    pub(crate) builder: Builder<'ctx>,
    // ...
}
```

Listing 4.9 - Parts of the Struct Definition of the rush LLVM Compiler

The context field in line 27 represents a container for all LLVM entities including modules. Next, the module field contains the underlying LLVM module. In line 29, the builder field contains a helper struct provided by Inkwell which allows generation of IR solely in memory. All the types of the above fields are provided by the Inkwell crate and are therefore used to interact with the framework. In order to get a deeper understanding of how this compiler works exactly, we will now consider how the program in Figure 4.4 is translated into IR.

The source program on the left side contains the 'foo' and the 'main' functions. These functions are declared in the lines 5 and 14 of the output IR. The 'foo' function takes two parameters ('n' and 'm'). It uses the two parameters and calculates their sum in order to use it as the exit code of the program. In line 7 of the IR, the parameter 'n' and the variable 'm' are added together. What strikes the eye is that the declaration of 'm' cannot be seen in the IR. Instead, the constant value 3 of the variable is used in the addition instruction. Therefore, the program uses less memory since a redundant mutable variable is not saved in memory. This again shows how advanced LLVM optimization is and how it benefits the program. The result of this addition is then used in order to call the 'exit' function. This function call takes place in line 8 of the IR. Therefore, the exit code of the program will be 5. During translation, the compiler first iterates over all declared functions in order to add them to the LLVM module. Listing 4.10 displays parts of the method responsible for translating the 'main' function.

```
define internal i1 @foo(i64 %0) {
                                              6
                                                 entry:
                                              7
                                                   \%i_sum = add i64 \%0, 3
   fn main() {
                                             8
                                                   call void @exit(i64 %i_sum)
1
2
       foo(2)
                                             9
                                                   unreachable
3
   }
                                             10
4
                                             11
   fn foo(n: int) {
                                                 declare void @exit(i64)
5
                                             12
       let mut m = 3;
6
                                             13
7
       exit(n + m)
                                                 define i32 @main() {
                                             14
   }
8
                                             15
                                                 entry:
                                                   %ret_foo = call i1 @foo(i64 2)
                                             16
                                                   ret i32 0
                                             17
                                             18
```

Figure 4.4 – Translation of a Simple rush Program to LLVM IR

```
321
    fn main_fn(&mut self, node: &'src AnalyzedBlock) {
         // ...
         let fn_type = self
334
             .module
335
             .add_function(fn_name, fn_type, Some(Linkage::External));
336
337
         // create basic blocks for the function
338
         let entry_block = self.context.append_basic_block(fn_type, "entry");
339
         let body_block = self.context.append_basic_block(fn_type, "body");
340
341
342
         // set the current function to `main`
         self.curr_fn = Some(Function {
343
             name: "main",
344
             llvm_value: fn_type,
345
346
             entry_block,
         });
347
348
         // compile the body
349
         self.builder.position_at_end(body_block);
350
         self.block(node, true);
351
         self.builder.position_at_end(entry_block);
369
370
         self.builder.build_unconditional_branch(body_block);
    }
371
```

**Listing 4.10** – Compilation of the 'main' Function Using LLVM

In the lines 334–336, the 'main' function is added to the current LLVM module. The definitions of the variables 'fn\_name' and 'fn\_type' are not visible. The first variable specifies the name of the function to be inserted, whilst the latter describes the function's signature. The return type of the function is specified by the fn\_type variable. In most cases, the return-type of the function is an integer since C libraries can then use the function as its 'main' function. In cases where the generated code should not depend on C libraries, 'fn\_name' will be '\_start' and 'fn\_type' will state that the function returns void. In the lines 339 and 340, this method adds the 'entry' and 'body' basic block. Next, the 'entry' and 'body' block are appended to the newly created function. Therefore, the main-function now contains these two basic blocks. In the lines 343–347, the 'curr\_fn' field of the compiler is updated. This field holds information about the current function being compiled. In line 345, the 'llvm\_value' field is of particular importance since all later additions of basic blocks, e.g., during loop compilation require an Inkwell 'FunctionValue'. Therefore, this field of the current function can later be accessed if a basic block should be appended. Furthermore, the

'entry\_block' field in line 346 is used every time a pointer is declared. However, the reason for this behavior is explained later.

Using the Inkwell crate, most instructions generated will be automatically appended to the end of the current basic block. Therefore, the position of the instruction builder is changed to the end of the newly created 'body' block. Since this block contains the beginning of the main-function's body, the 'block' method of the compiler is called in line 351. In this case, this method first creates a new scope, then compiles all the statements which the block contains. Lastly, the method attempts to compile the block's optional expression. If the content of the body of the main-function does not lead to the insertion of more basic blocks, the 'body' block will contain the entire contents of the function after the method call.

In line 2 of the example rush program, the 'main' function calls the foo function using the argument value 2. In order to understand how this compiler translates function calls, we will now consider Listing 4.11.

**Listing 4.11** – Compilation of Call-Expressions Using LLVM

The code in Listing 4.11 displays a small part of the 'call\_expr' method of the rush LLVM compiler. This snipped shows the statement inserting the LLVM 'call' instruction. For this, the 'build\_call' method of the builder is called using the target function, call arguments, and the name of the result register. Since the variable 'func' represents the called function, it was previously declared by looking up the function name in the module. The 'args' variable is of type 'Vec<BasicMetadataValueEnum>' and therefore represents a list of Inkwell values representing the arguments used for the call. This variable was also defined previously by iterating over the node.args vector containing expressions. This vector is contained in the provided AST node representing the call-expression. Each argument expression is then compiled, and its result is placed into the 'args' output vector. However, we cannot understand how results of expressions are handled in this compiler without considering Listing 4.12.

```
fn expression(&mut self, node: &'src AnalyzedExpression) -> BasicValueEnum<'ctx> {
873
874
         match node {
875
             AnalyzedExpression::Int(value) => self
                 .context
876
                  .i64_type()
877
                  .const_int(*value as u64, true)
878
                  .as_basic_value_enum(),
879
             AnalyzedExpression::Infix(node) => self.infix_expr(node),
910
         }
911
    }
912
```

**Listing 4.12** – Compilation of Expressions Using LLVM

The code in Listing 4.12 shows parts of the expression method of this compiler. When consider the method's signature, it becomes apparent that it uses an 'AnalyzedExpression'

in order to generate a 'BasicValueEnum'. The return type of the function is of particular importance. Using Inkwell, most inserted instructions yield a symbolical value at compile time. This value represents a virtual register which will contain a real value at runtime of the program. Therefore, the 'BasicValueEnum' returned by the function represents the virtual register which will hold the result of the expression at runtime. This way, symbolical values can be used at compile time, thus presenting a high-level abstraction for generating the IR. The lines 875-879, show how a constant integer expression is compiled. Here, a constant int value of the 'i64' type is created and transformed into a 'BasicValueEnum' which is then used as the method's return value. For more complex expressions, the 'expression' method invokes other methods which are specialized on this type of expression. For instance, if an infix-expression like '3 \* n' is compiled, this method calls the 'infix\_expr' method in line 910. Here, the current AST node is passed to the specialized function as a call argument.

```
InfixOp::Mul => self.builder.build_int_mul(lhs, rhs, "i_prod"),
InfixOp::Div => self.builder.build_int_signed_div(lhs, rhs, "i_prod"),
InfixOp::Rem => self.builder.build_int_signed_rem(lhs, rhs, "i_rem"),
InfixOp::Pow => self.__rush_internal_pow(lhs, rhs),
```

Listing 4.13 – Compilation of Integer Infix-Expressions Using LLVM

The code in Listing 4.13 shows a part of the 'infix\_helper' method which is responsible for compiling parts of infix-expression. Line 1021 contains the code for inserting the 'mul' multiplication instruction. Here, the variables 'lhs' and 'rhs' are used as arguments for the 'build\_int\_sub' method call. They, too, represent virtual registers which will contain the value of the left- and right-hand side at runtime. Furthermore, the string containing 'i\_prod' specifies the name of the virtual register containing the product of the multiplication performed by the instruction. In this example, compiling basic integer multiplication has proven to be really simple since only one instruction needs to be inserted. This simplicity applies to most infix operations performed on integers. However, compiling mathematical power operations has proven to be more demanding since LLVM does not provide an instruction for performing these operations. Line 1024 is executed if the method needs to compile such an integer power operation. In order to mitigate this issue, the '\_\_rush\_internal\_pow' method is called instead of a method provided by Inkwell. This method first declares the 'core::pow' function in order to call it directly after. This function implements an algorithm for power operations given an integer base and exponent. However, this function is implemented in IR directly by hardcoding the required calls to Inkwell into this function. Therefore, even complex calculations like this one can be implemented even though LLVM does not provide a straight-forward way to accomplish them directly.

In line 6 of the source program, a let-statement is used to declare the mutable variable 'm' with the initial value 3. However, there is never a value assigned to this variable. This variable is only mutable so that the compiler has to use stack memory for it. Non-mutable variables are inlined by the compiler in order to save resources during runtime. In order to understand how the compiler translates let-statements, we will now consider Listing 4.14.

```
fn let_stmt(&mut self, node: &'src AnalyzedLetStmt) {
609
         let rhs = self.expression(&node.expr);
610
611
         // if the variable is mutable, a pointer allocation is required
612
         match node.mutable {
613
             true => {
614
                 // allocate a pointer for the value
615
                 let ptr = self.alloc_ptr(node.name, rhs.get_type());
616
617
                 // store the rhs value in the pointer
618
                 self.builder.build_store(ptr, rhs);
619
620
                 // insert the pointer into the current scope (for later reference)
621
                 let var = Variable::new_mut(ptr, node.expr.result_type());
622
                 self.scope_mut().insert(node.name, var);
623
             }
624
             // ...
         };
629
    }
630
```

Listing 4.14 – Compilation of Let-Statements Using LLVM

The code in Listing 4.14 displays parts of the 'let\_stmt' method of this compiler. This method is responsible for compiling let-statements. In line 610, the initializer expression of the statement is compiled. The 'rhs' variable then specifies the virtual register which contains the result of the expression at runtime.

The code in the block after line 614 is only executed if the variable was declared as mutable. Otherwise, the variable would be constant and therefore require no space in memory. Therefore, in order to present relevant code in this example, the 'm' variable in the source program had to be declared as mutable. In line 616, the 'alloc\_ptr' method is used in order to create a new Inkwell pointer value. The first argument of the call specifies that the name of the pointer should be identical to the name of the variable. The second argument passes the type of the initializer expression to the method. The statement in line 619 is used in order to insert a store instruction. Here, the instruction should store the value of the initializer expression in the newly created pointer. Since pointers present a way to use stack memory, also non-pointer variables in the source program are internally compiled to an IR program using pointers. Finally, in line 622, the newly defined variable is inserted into the current scope of the compiler. Every variable inside the scope saves its Inkwell value and its type since these fields are required when the variable is used later. The code in Listing 4.15 shows the 'alloc\_ptr' method of the compiler.

```
fn alloc_ptr(&mut self, name: &str, llvm_type: BasicTypeEnum<'ctx>) ->
635
        PointerValue<'ctx> {
         // save current insertion point
636
637
         let curr_block = self.curr_block();
638
         // insert at `entry` block
639
         self.builder.position_at_end(self.curr_fn().entry_block);
640
641
         // allocate the pointer
642
         let res = self.builder.build_alloca(llvm_type, name);
643
644
         // jump back to previous insert position
645
         self.builder.position_at_end(curr_block);
646
647
648
         res
    }
649
```

Listing 4.15 – Pointer Allocation in the LLVM Compiler

This method exists in order to create a new Inkwell pointer value. Like hinted previously, pointers are declared in the 'entry' block of each function in order to allow for more aggressive optimizations. In line 640, this method places the builder cursor at the end of the entry-block of the current function. Next, in line 643, a 'alloca' LLVM instruction is inserted. This instruction is responsible for allocating a new pointer which points to stack memory. After the instruction has been inserted, the builder position is reset to where it was before the method was called. Finally, the pointer is returned so that it is usable for other parts of the compiler.

### 4.3.4. Final Code Generation: The Linker

### TODO: MORE ON LINKERS: [PW17, p. 40]

After LLVM has compiler a program, it outputs an *object file* representing the compiled source program. Object files contain the binary machine code output of a compiler or an assembler. In the case of LLVM, they contain the target-specific machine code generated from the intermediate representation. There are many different formats for representing object files, such as *ELF* on Unix-like systems. However, object files are usually still *relocatable*<sup>7</sup> and not directly executable. In order to create an executable program from object files, a *linker* is used.

A linker or *link editor* is a program which takes one or more object files in order to combine them into a single file. Often, the output of the linker is a file which can be executed by the operating system. For instance, a linker might take an object file generated by a compiler in order to create the final executable program. During *linking*, a linker often perform numerous tasks, such as *relocation* or *symbol resolution*. For instance, a linker might also include *library code* in the executable if the object file depends on external functionality provided by that library. A common example for this library code is the functionality provided by a C standard library. In order to combine these modules, an essential part of the liker's actions is presented by relocation and code modification.

During relocation, the linker assigns definitive addresses to numerous parts of the program. Relocation is required, for instance, when the program to be translated consists of multiple modules referencing each other. Here, the order in which the individual parts of the program

<sup>&</sup>lt;sup>7</sup>Load addresses of position-dependent code may still be changed

will be placed in memory is not known. Therefore, any absolute addresses in the program are not determined [Zhi17, p. 74]. However, we will not explain these concepts further since they are not of particular relevance for understanding the purpose of a linker.

```
[Lev00, pp. 1-15]
```

The shell command in Listing 4.16 presents an example liker invocation. In this example, the LLVM compiler has generated an object file named input.o. The flag -dynamic-linker is used in order to tell the linker which dynamic linker should be used. Next, some library files in the directory '/usr/lib/' are included. These files belong to an implementation of the C standard library and are required so that the 'exit' function works properly. Furthermore, the 'input.o' file is specified so that the linker includes it.



Figure 4.5 – How a Linker Works

```
1 ld -dynamic-linker /lib64/ld-linux-x86-64.so.2 \
2  /usr/lib/crt1.o \
3  /usr/lib/crti.o \
4  -lc input.o \
5  /usr/lib/crtn.o -o output
```

**Listing 4.16** – Using LD to link the LLVM output

This way, the shell command would generate an executable program named 'output' from an object file named 'input.o'. Therefore, a linker often presents the final step of translating a source program into an executable which the computer can understand. However, the linker is completely independent of the previous stages of compilation and is therefore not displayed in the figures. Even though the linker program LD is used in this example, the choice of the linker is completely irrelevant as long as the linker supports the program generated by the compiler.

### 4.3.5. Conclusions

As a conclusion, implementing a compiler which leverages LLVM presents a lot of advantages. For instance, the language will be able to support many backend architectures. Most of the demanding work is being done by LLVM, therefore implementing the compiler will proof to be less difficult and error-prone. Moreover, LLVM performs a lot of very effective optimizations which would otherwise have to be implemented by the compiler designer. However, these optimizations often involve a lot of work and are therefore unpractical to implement for simpler languages. Therefore, LLVM presents a robust, production-ready, and scalable backend which is even used in real-world compilers. However, by depending on LLVM, the resulting compiler will often be less portable since cross-compilation still presents an issue if used across programming language boundaries.

Finally, in order to understand how LLVM's optimizations can positively impact application performance at runtime, we will consider the Fibonacci benchmark again. In this benchmark, the 42nd Fibonacci number is calculated using the program displayed in Listing 1.1 on page 3. However, the 10 in line 2 was replaced by a 42. Running a binary compiled

using the rush LLVM compiler took around 1.3 seconds. However, executing the binary generated using the rush  $x86\_64$  compiler took around 2.17 seconds<sup>8</sup>. Therefore, the program compiled using LLVM ran roughly 1.66 times faster.

<sup>&</sup>lt;sup>8</sup>Average from 100 iterations. OS: Arch Linux, CPU: Ryzen 5 1500, RAM: 16 GB

# 5. Compiling to Low-Level Targets

In the previous chapter, we have learned how compilation to high-level targets can present an easy way of implementing a compiler. These compilers generated outputs which were still platform independent and portable. However, compilers can also target a specific computer architecture directly, thus removing another layer of abstraction. The concept of compiling to a specific architecture directly is similar to the VM since its compiler also targets its architecture directly. However, implementing a compiler targeting the architectures presented in this chapter has proven to be a lot more demanding since the VM uses a fictional architecture purposefully developed for this paper. Reasons for this chapter's difficulty mostly include target-specific constraints which were still irrelevant in the previous chapter. This chapter contains the term "low-level" in its name since the presented compilers generate code for specific target-architectures directly.

# 5.1. Low-Level Programming Concepts

Programming using high-level languages does not require knowledge about the target architecture of the program. However, in this chapter, two compilers targeting the low-level assembly language are presented. In order to make sections in which these compilers are explained more approachable, some of the most important low-level programming concepts are explained in this section. We will only explain concepts which play a significant role later on.

### 5.1.1. Sections of an ELF File

Since a program needs to be representable in a low-level manner, special formats are often required. ELF stands for "executable and linkable format" and is often found on unix-like systems like Linux. Programs using the ELF format can be represented in three different types of files. For instance, object files generated by a compiler, like in the previous LLVM section, might use the ELF format. Furthermore, libraries using *shared object files* might also leverage the ELF format. Most executable program use the format in order to represent a structured container for instructions, data, and additional information. This way, the unit is mostly self-contained and can be executed by the operating system easily. Therefore, ELF describes the format of a class of files and not just of an individual type of file [Zhi17, p. 74-76].

Even though a processor only has access to one physical memory unit for both instructions and program data, most assembly programs like to separate these types of memory into their separate components. Therefore, an object file and assembly program is divided into so-called *sections* [Zhi17, p. 19]. In ELF programs, some of the important sections are:

- .text stores the logic of the program represented using CPU instructions
- .rodata stores read-only global data, it is often used for global constants.
- .data stores mutable global data, such as mutable global variables

Even though a typical ELF file also contains other sections, this list only includes entries which are of importance later in later sections of the paper [Zhi17, p. 76].

### 5.1.2. Assemblers and Assembly Language

Assembly language describes a type of low-level programming languages which are directly influenced by the target architecture. Since the assembly code provides a slight abstraction over the computer's hardware, the assembly code must be translated to machine code before it can be executed. This process is performed by a program named the *assembler* and is often called *assembly*. A typical characteristic of assembly code is that it seems relatively cryptic to a human reading it. Furthermore, compared to high-level languages like C, assembly code is relatively low-level since it can be used to interact with the hardware directly.

For instance, the RISC-V instruction 'add a0, a0, 2' would be used in integer addition. This example contains most characteristics of an assembly language program. Like hinted previously, the name of the instruction is a mnemonic. In this case, 'addi' stands for "add immediate". Furthermore, the exact semantic meaning of the instruction is not immediately apparent. At last, the instruction for adding integers differs for most CPU architectures. For instance, an equivalent instruction for the x\_86 architecture could be 'add %rdi, 2'. Therefore, the fact that instructions differ on each target architecture is clearly apparent.

As seen above, the application code in form of instructions is placed in the '.text' section of an ELF binary. In most assembly dialects, the programmer is able to partition the code manually



Figure 5.1 – Level of Abstraction provided by Assembly

using sections. If the assembly code is assembled to an ELF object file, the '.text' section of the assembly should contain all the instructions. Just like parts of the LLVM IR, the '.text' section of an assembly program obeys a hierarchy. This section contains many labels which mark the beginning of a new basic block. However, these basic blocks do not come with all the constraints which are to be followed when using LLVM IR. For instance, a block in assembly might even contain no instructions, does not have to be terminated and could even be terminated twice. Here, terminating instructions mean jump- and return-instructions. Therefore, the constraints in assembly are much weaker than the ones introduced by LLVM. Because an assembly usually omits complex optimizations, struct rules constraining the assembly code can be omitted too.

Since assembly provides significantly less abstraction than high-level languages, the question of how much abstraction is lost emerges. In order to understand how much abstraction is provided by assembly, we should consider Figure 5.1. Here, the highest level of abstraction is provided by high-level programming languages like C or Rust. In the context of this paper, rush presents roughly the same level of abstraction like these languages. The next lower level of abstraction is provided by assembly. Now, the program is no longer independent of its target architecture and is much more demanding to furmulate. However, the next lower level of abstraction below assembly is represented by code in machine language. As of today, one only rarely encounters a programmer writing programs using this level of abstraction. Since the machine language program is represented in binary, it is nearly impossible for a human to write or understand. However, the machine language is also just an abstraction of the computer's hardware and operating system. Therefore, assembly provides enough

abstraction to be comprehensible for a human whilst being a low-level representation of the program. Although assembly provides more abstraction than the two levels at the bottom of the figure, programmers rarely program in assembly directly. Some benefits of using assembly language to formulate a program are increased runtime efficiency and decreased code size whilst having fine-grained control over the hardware and operating system. Therefore, it might be reasonable for a compiler to generate assembly code from the source program. This way, the program is translated into a low-level, target-specific representation which allows the program to be executed on the target machine directly. However, an assembler is still required in order to translate the assembly output of the compiler. Since most assemblers output object files, a linker is required to create the final executable program. Therefore, a compiler targeting the assembly of a specific architecture depends on these two additional steps before the program can be executed [Dan05, p. 5-6].

One might argue that the compiler could output object files directly. However, doing so rarely creates any significant benefits other than the omitted dependence on the assembler. Furthermore, implementing a compiler using this approach often significantly increases the complexity of the compiler since it now has to perform the role of the assembler as well. However, the compiler could also emit binary data directly, thus making its implementation significantly more demanding.

### 5.1.3. Registers

Most processors contain numerous registers in order to hold data, instructions, and state information. In a computer, registers are sometimes used to hold data stored in variables. However, most of the time, registers are used as temporary storage in large computations. Registers are used in the latter case because they are way faster than memory. Therefore, an algorithm using registers instead of memory will usually run faster. However, even though registers are faster than memory, they are not used all the time. This is because CPU registers are a limited resource, meaning every register-based CPU



**Figure 5.2** – Relationship Between Registers, Memory, and the CPU

only has a finite amount of registers available. Registers should therefore be used carefully and only when they are really needed [Wat17, pp. 212-214].

Figure 5.2 shows how the CPU interacts with its registers and the computer's memory. The connection between the set of registers and the CPU is marked as a short and thick arrow because the connection between the CPU and its registers is very short and fast. Although the registers are often physically parts of the CPU, they are displayed as separate entities in this example. For the memory, the arrow is long and thin. This represents the long and therefore slow connection between the CPU and the memory modules. In most modern computers, this connection often spans across several centimeters on the motherboard. Therefore, higher latency during memory access is inevitable [Dan05, pp. 20-21].

Nearly every CPU architecture will include an individual register layout, differing in size, count, and type. Therefore, the exact information about registers always depends on which CPU is used. As a rule of thumb, having more registers in an architecture will almost always improve performance of that architecture. Furthermore, not every register is identical. There might be platforms with some general-purpose, some floating-point, and some special-

purpose registers. For instance, there may be registers which contain information about the CPU's current instruction. Furthermore, common architectures also include special-purpose registers for modifying the machine's memory. Therefore, there is nearly always an appropriate register matching the requirement [Dan05, Chapter 2].

```
4 li a0, 40
5 li a1, 2
6 add a0, a0, a1
```

Listing 5.1 – Example Assembly Program for Explaining Register Allocation

For a compiler, the limited amount of registers presents a big challenge. Some architectures may require that the operands of arithmetic or logical instructions have been explicitly loaded into registers beforehand. In this case, registers would be required in nearly all computations. In order to understand how registers management can present a challenge, Listing 5.1 should be considered. This snippet displays RISC-V assembly instructions which calculate the sum of the two integets 40 and 2. In line 4, the integer value 40 is placed in the register 'a0'. In line 5, another integer, this time 2 is placed in 'a1'.

Next, the 'add' instruction is used to calculate the sum of these two integers. Since the first operand of the instruction

specifies the register in which the result should be placed, the original value of 40 in the register 'a0' would be overwritten by the instruction [PW17, reference]. Through this example, it becomes apparent that other instructions might use registers unexpectedly. Therefore, subtle bugs can be created if an instruction overwrites registers.

In compilers, the process of register allocation is responsible for managing how registers are used. This process attempts to use registers in a way leading to maximized efficiency of the output program. Since accessing registers is often faster, a register allocator will try to use registers as often as possible. Production-ready compilers often try to keep as much of the frequently accessed data in registers. For instance, this frequently accessed data may also include variables which are normally saved in memory. It is apparent that in most programs, the number of variables will certainly exceed the capacity provided by registers. Therefore, register allocation has to detect when no free registers are available anymore. In this case, the compiler has to save data in memory instead of registers. This process of saving excess data in memory instead of registers is called register spilling. Since register spilling introduces a performance penalty, register allocation algorithms often attempt to prevent it as much as possible. Therefore, sophisticated algorithms for register allocation are often mandatory as long as the factor of output code performance is non-trivial.

Apart from just managing the use of registers, most allocation algorithms are responsible for many other register-related tasks. For instance, register allocation should detect when a variable is no longer needed so that its register can be freed. Moreover, register allocation has to ensure that no conflicts between registers are introduced. Such a conflict may be that a register is accidentally overwritten by an instruction in a completely unrelated basic block. It is apparent that an algorithm performing all these tasks can not be implemented in an ad-hoc manner. Instead, this process often requires complex graph algorithms for determining which registers can be used and freed. Therefore, implementing register allocation in a compiler is often a very demanding task [Wat17, pp.212-214]. Since register allocation represents a complex topic, it will not be explained any further.

## 5.1.4. Using Memory: The Stack and the Heap

#### TODO: @RubixDev: write this subsection

- Repetition: When is the stack used instead of registers?
- Insert figure of the stack.

## 5.1.5. Calling Conventions

Programmers often use procedures or functions in order to structure their programs, both to make the program easier to understand and to allow shared logic to be reused. Therefore, procedures allow the programmer to focus on one portion of the tasks at the time. Often, parameters and returned values act as the interface between the procedure and the remaining code. Therefore, procedures present one way of how a high-level programming language might provide abstraction. During a procedure call in a high-level language like rush, many individual steps need to be performed in the program's low-level representation. Since assembly does not support the use of high-level functions, most architectures specify their own calling convention which describes how low-level function calls are to be performed. On most architectures, a low-level procedure call might take place like the following steps:

- The caller places the call arguments in a place where the procedure can access them.
- Control is transferred to the procedure, often using a jump or specialized call instruction. This specialized instruction often saves the return address in a special register so that function returns are easier.
- The first task the called function performs is acquiring local storage resources needed by the procedure. Often, registers are spilled if required. Furthermore, stack memory for the procedure's local variables is often allocated in this step.
- Internal code of the procedure is run by executing the function body's instructions.
- The procedure's result value is placed somewhere so that the caller can access it. Here resources allocated in step 3 are also released again.
- Control is transferred back to the caller using a specialized return-instruction. A specialized instruction is often required since procedures can be called from multiple places in the program. Therefore, the return-instruction jumps back to the instruction which had initiated the function call.

In order to leverage optimal performance during function calls, passed arguments are usually kept in registers. However, in the previous subsection about registers, we have learned that most architectures state that only subset of their registers are to be used as function arguments. Therefore, if a function is called using more than eight arguments, every remaining argument has to be spilled to memory. In this case, each additional parameter of the function is placed inside the stack frame of the called function so that it can access it. [PH17, p. 98].

For instance, a target architecture might provide four registers which can hold function arguments. Now, a function is called using six arguments. Here, register spilling is mandatory since there are two more arguments than registers. Figure 5.3 illustrates how such a call would be managed on this fictional architecture.

Here, the registers 'r0'—'r3' contain the first four argument values. The excess arguments which could not fit into registers are now placed on the stack. Even though argument 5 and 6 are placed on top of the stack in this example, the exact location of spilled call arguments differs for every architecture.

For managing the calling convention in assembly, most functions contain a *prologue* and an *epilogue*. These blocks of the function are responsible for allocating and deallocating resources which the function might use. For instance, the stack- and frame-pointer is often adjusted in the prologue. This way, space on the stack is allocated for variable declarations found in the called function. However, the modification offset always depends on how much memory the called function will require during runtime. Therefore, a compiler would have to keep track of the count of variable declarations made by a function. Apart from allocating

stack space, the prologue is also often used for storing special registers on the stack.



Figure 5.3 – DRAFT: Spilling Registers for Function Calls

The epilogue however is required for deallocating all resources which were previously allocated by the function's prologue. For instance, the stack-related pointers are modified to reflect their state before the function call. Furthermore, any special registers previously saved on the stack are now restored. Obviously, a return-statement should always jump to the function's epilogue instead of terminating the function directly. At the end of most epilogues, a return-instruction jumping back to the caller location is often found. Therefore, the prologue is executed as the first code when calling the function and the epilogue is called as the last code before this function terminates.

It is to be mentioned that implementing a compiler which respects the calling convention of its target architecture is not required. However, following the convention is often reasonable in order to preserve  $ABI^1$  compatibility of the compiled program.

<sup>&</sup>lt;sup>1</sup>Short for "application binary interface", allows calling foreign functions from another program

## 5.2. RISC-V: Compiling to a Modern RISC Architecture

The RISC-V ISA<sup>2</sup> is a new and modern reduced instruction set architecture focusing on simplicity and expandability. The initial version was developed at UC Berkely in the context of another related research project. Since its introduction in 2011, the architecture has been rapidly growing in popularity. Since the beginning, the project has been managed and led by the RISC-V foundation, consisting of many individuals contributing to the project. Today, corporate members of the RISC-V foundation include companies like Google, Microsoft, Samsung, and IBM. Therefore, the general popularity and commercial attraction of the technology is apparent. However, unlike most previous ISAs, the RISC-V architecture is a completely open-source project and is therefore not controlled by a single large corporate entity. This can be regarded as a large competitive advantage over other popular RISC architectures like ARM. In the past, many ISAs have failed due to them being too restrictive with their licensing, thus preventing widespread commercial adoption. However, RISC-V is completely open and free to use, so that many companies like Google can leverage the technology commercially whilst contributing to the project. Unlike most of the previous ISAs, which were developed during the 1970s or 80s, RISC-V is one of the few ISAs which were developed during this decade. Therefore, it seems like RISC-V could be a significant architecture to be used in all sorts of devices in the near future [PW17, preface].

## 5.2.1. Register Layout

Most RISC architectures typically have a large number of registers [Dan05, Chapter 2]. When compared to other popular architectures, the truth of this statement becomes clear. For instance, the  $x86\_32$  architecture has 8 registers. The popular RISC architecture ARM-32 contains twice that amount, meaning 16 registers. However, RISC-V includes 32 registers, meaning drastically more than the previously mentioned architectures. Moreover, these 32 registers only include registers holding integer values. Just for floating-point numbers, the ISA even provides another 32 registers. Like previously explained, using more registers usually leads to increased efficiency of the output program. Therefore, a register allocation algorithm targeting the RISC-V architecture could be more aggressive compared to one targeting x\_86 for instance.

The Table 5.1 shows most of the registers which the RISC-V platform provides. For the registers in this table, their official ABI names have been used in order to make the section easier to read. The 'zero' register is a special integer register on RISC-V. Like its name suggests, it always holds the value of a constant 0.

| Register   | Purpose            |  |
|------------|--------------------|--|
| zero       | Hardwired zero     |  |
| ra         | Return address     |  |
| sp         | Stack pointer      |  |
| t0 — t6    | Temporary          |  |
| fp         | Frame Pointer      |  |
| a0, a1     | Function argument, |  |
|            | return value       |  |
| a2 — a7    | Function argument  |  |
| s1 — s11   | Saved register     |  |
| fa0, fa1   | FP args, return    |  |
|            | value              |  |
| fa2 — fa7  | FP args            |  |
| fs0 — fs11 | FP saved registers |  |
| ft0 — ft11 | FP temporaries     |  |

Unlike other registers, it is read-only, meaning that it can never be overwritten by accident.

The 'ra' register saves the *return address*. For instance, this register is used when the 'call' instruction is executed. First, the address of the instruction after the call-instruction is saved in 'ra'. Next, the CPU jumps to the instruction at the specified location. If a

<sup>&</sup>lt;sup>2</sup>Short for: "instruction set architecture"

return-instructions was to be used later, the value in 'ra' would be used to jump back to where the call was made.

The 'sp' and 'fp' registers are used for managing stack memory. Their purpose is explained in a following subsection. The 't0' — 't6' registers are often used to store temporary values used in larger computations. Next, the register 'a0' and 'a1' can both serve as function call arguments and return values of functions. How functions are called using registers will also be explained in a following subsection. The remaining a-registers 'a2' — 'a7' can only be used as function call arguments. The saved registers 's1' — 's11' are typically preserved across function calls. However, this concept will also be explained in more detail later. All the previously named registers hold integer values. Depending on the exact RISC-V architecture, all registers either hold 32-bit or 64-bit integers.

Just like their integer counterparts, the floating-point registers 'fa0' and 'fa1' are used as function call arguments and as return values. The registers 'fa2' — 'fa7' can only be used as function arguments holding floating-point numbers. Just like the 'sx' registers, the 'fs0' — 'fs11' registers are usually preserved across function calls. Last, the 'ft0' — 'ft11' registers can be used as temporary registers for floating-point numbers. [PW17, pp. 18f, p. 34], [WA19, p. 155].

Now, it has become apparent that RISC-V includes many registers grouped into categories. Every category is meant to be used in the specified manner, however, these groups are merely a suggestion of how each register should be used. More information on the special registers like 'sp', 'fp', and 'ra' is given in the next sections.

TODO: Did I mention how generous the register count in RISC-V is?

### 5.2.2. Memory Access Through the Stack



Figure 5.4 – DRAFT: Stack Memory on the RISC-V architecture

- In the current implementation, 'sp' includes legal memory locations
- 'fp' points to the first address after the last legal one

## 5.2.3. Calling Convention

- Prologue
  - Decrement stack pointer
  - Adjust frame pointer
  - Save return address

- Epilogue
  - Restore stack- and frame pointer
  - Load previously saved return address
  - Return to the caller instruction

An interesting observation is that labels can be called as if they were functions. However, the 'call' instruction is only a pseudoinstruction which jumps to the target label whilst performing additional work during the process. The prologue is responsible for allocating space on the stack by incrementing sp and fp according to the amount of memory required in its function. Here, the values of the registers for the frame pointer and the return address are saved on the stack.

This way, their values can be restored later before returning from this function. In line 15, the frame pointer is adjusted according to the modifications made to the stack pointer. The comment in the next line indicates the end of the prologue. In short, the prologue performs important initialization before any other code of the newly called function is executed.

Just like the name suggests, the epilogue is used to free any resources allocated by the function's prologue. For instance, the frame pointer is restored to its state before the function call. Furthermore, the epilogue restores the value saved in the 'ra' register in order to restore the return address. Since the return address is vital for returning to the next instruction after the function call in the caller function, it is restored here so that this function returns control to the correct caller instruction. Since the return address of the caller function is saved in the prologue, nested function calls will cause no issues. If the prologue would not save the return address on the stack, a nested function call would overwrite the return address of the parent function, therefore creating a bug [PW17, p.33]. Again, this design contains a lot of similarities to the call-design of the rush VM. In the VM, the process of saving and restoring the return address was managed automatically by the VM. However, here, the programmer has to manually pay attention to saving and restoring this important piece of data. Therefore, implementing a call stack is definitely more demanding in RISC-V assembly than in the rush VM.

## 5.2.4. The Core Library

Like seen in the example code in listing xxx

Since the exit-function is defined in a separate file, this scenario presents referencing an external library which is later resolved by the linker.

## 5.2.5. RISC-V Assembly

The Listing 5.2 shows a rush program containing two functions and a global variable. In line 1 of the rush program, the mutable global variable 'm' is defined with the initial value 42. In the main-function, 'm' is incremented by 1 in line 4. Next, in line 5, the 'foo' function is called using 'm' as its call argument. In line 6, a return-statement is used to terminate the main-function explicitly. The body of the 'foo' function only contains a call to the 'exit' function. Therefore, the 'foo' function only exits using the specified parameter 'n' as the exit code. In this case, the exit code of the displayed program will be 43. The code in Listing 5.5 on page 68 shows the output assembly generated from this program by the rush RISC-V compiler<sup>3</sup>. Because the assembler code of the 'foo' function would take up too much space in the assembler program, it is intentionally omitted from this listing. Since the excluded

<sup>&</sup>lt;sup>3</sup>Generated in Git commit 7f28ebc, automatically built with this document

function does not introduce any new concepts anyway, omitting it will not lead to a loss of explained concepts.

```
let mut m = 42;
1
2
3
    fn main() {
4
         m += 1;
5
         foo(m);
6
         return;
    }
7
8
9
    fn foo(n: int) {
10
         exit(n)
11
```

Listing 5.2 – Foo

In line 1, the '.global' assembler directive is used to declare the global symbol '\_start' [PW17, p. 36]. On most architectures, the '\_start' label indicates a program's entry point, therefore marking the first instruction to be executed [Zhi17, p. 19]. In line 5, the '\_start' label is defined by placing a colon after its name. In line 6, the 'call' instruction is used to call the 'main..main' function. This pseudoinstruction<sup>4</sup> will jump to the first instruction of the specified target label whilst saving the address of the next instruction after the 'call' instruction in the register 'ra' [PW17, p. 22]. Since this register holds the address of the next instruction after the function call, similarities to the call-stack of the rush VM can be noticed. What strikes the eye here is that the already familiar 'main' function is prepended by the 'main..' prefix. Since this rush compiler implements name mangling<sup>5</sup>, every function declared in a rush program will contain this prefix. However, unlike high-level function calls in LLVM, this call instruction is used alongside the previously explained low-level calling conventions of RISC-V.

In the next line, the 'li' instruction is used to load the constant integer 0 into the register 'a0' [PW17, reference]. Like explained in the previous section about the RISC-V calling convention, the register 'a0' is used for the first integer call argument. In line 8, the 'exit' function is called, however, one cannot see the definition of this function in the current file. This is because the exit function is provided by the rush RISC-V corelib which was explained previously. Since 0 was previously placed inside the register for the first integer call argument, the 'exit' function is called using 0 as the argument. Therefore, the instructions in the lines 7–8 are responsible for terminating the program using the exit code 0. These two instructions are always inserted at the end of the '\_start' label in order to terminate the program appropriately in case the rush code does not call 'exit' on its own. This is required in order to prevent a segmentation fault which occurs if the program is not terminated properly.

Due to the function call in line 6, we will now shift our focus on the 'main..main' label in line 10. In line 11, the first line of the 'main' function, a comment indicates the beginning of the function's prologue. Just like demanded by the RISC-V calling convention, the rush compiler emits code for a *prologue* and an *epilogue* into each function.

As described in the previous sections about calling conventions, one task of the prologue is allocating stack space. In this prologue, the 'addi' instruction in line 12 subtracts 16 from

<sup>&</sup>lt;sup>4</sup>A macro generating multiple instructions from one pseudoinstruction. Therefore, the actual count of ISA instructions remains low whilst convenience features can be used in assembly [Dan05, p. 68].

<sup>&</sup>lt;sup>5</sup>Compilers often mangle names in order to create a unique name for every function [Lev00, pp. 119-120]

the value stored in the 'sp'. Here, an addition instruction is used even though subtraction is required. In RISC-V, the 'addi' instruction requires one register and one immediate value as its operands. Due to the third operand being an *immediate* value, the trailing 'i' (*immediate*) appears in the instruction's name. Since this immediate value can be negative, an additional instruction for immediate subtraction is redundant [PW17, reference]. This example shows how the RISC-V ISA omits redundant instructions in places where it is feasible. In this case, the stack pointer is decremented by 16 since two 8-byte values are stored on the stack in the lines 13 and 14. Here, the values of the registers for the frame pointer and the return address are saved on the stack.

```
1
    .global _start
2
3
    .section .text
4
5
    _start:
        call main..main
6
7
        li a0, 0
        call exit
8
9
    main..main:
10
        # begin prologue
11
12
        addi sp, sp, -16
        sd fp, 8(sp)
13
        sd ra, 0(sp)
14
        addi fp, sp, 16
15
16
        # end prologue
        # begin body
17
        li a0, 1
18
        ld a1, m
19
20
        add a2, a1, a0
        sd a2, m, t6
21
        ld a0, m
                                             # m
22
        call main..foo
23
                                             #
24
        j epiloque_0
        return
25
        # end body
26
    epiloque_0:
27
28
        ld fp, 8(sp)
        ld ra, 0(sp)
29
        addi sp, sp, 16
31
        ret
53
    .section .data
54
55
    m:
         .dword 0x0000000000000002a
56
```

**Figure 5.5** – Compiler Output from the Rush Program in Listing 5.2

The comment in line 17 indicates the start of the function's body. First, the previously explained 'li' instruction in line 18 places a constant 1 in register 'a0'. Next, the 'ld' instruction in line 19 is used in order to load the value of the global variable 'm' into the register 'a0' [PW17, reference]. Global variables, like 'm' in this example are saved under the '.rodata' section or under the '.data' section if they are mutable. In this example, 'm' is not declared as mutable and therefore saved under the '.rodata' section. The start of the '.rodata' section is represented by the ".section" assembler directive found in line 53. Here, a label called 'm' is defined. In this label, the '.dword' directive is used to define the global initializer value of the variable. In RISC-V, this directive stores 64 bit of information in successive memory doublewords [PW17, p. 39]. The initializer value of the global variable is 42 and is represented as '0x2a' using hexadecimal in the assembly code. Since these data labels require their contents to be specified in hexadecimal, the trailing comment shows the base 10, humanreadable version of the number. Because global variable are not saved on the stack, special instructions like 'ld' are required to interact with global variables stored in the program's data sections.

At this point, the register 'a0' would contain 1 and 'a1' would contain 42. In line 20, the 'add' instruction is used in order to save the sum of 'a0' and 'a1' in the register 'a2'. Now, the value saved in 'a2' would be 43. Next, the 'sd' instruction in line 21 saves

the value of the register 'a2' at the memory location of the global variable 'm', meaning that 'm' is updated to reflect its new value 43. It now becomes apparent that these instructions are responsible for the add-assign expression in line 4 of the rush program. Another interesting observation is that the last operand of the 'sd' instruction specifies the temporary register 't6'. The instruction uses this register for saving temporary data during the process of saving data in 'm' [PW17, reference].

In line 22, the previously explained 'ld' instruction is used in order to load the value of the same variable into the register 'a0'. Then, the 'call' instruction in line 23 is used in order to call the 'foo' function using the value of m as its argument. However, one cannot easily observe how call arguments are passed here. Like explained previously, the first integer argument of a function call must be placed in the register 'a0'. Since 'm' was loaded into 'a0' previously, it will be used as the call argument for 'foo' automatically. Therefore, the 'foo' function is called using 43 as the first argument.

Since the 'foo' instruction only calls the 'exit' function, its explanation will not be beneficial for introducing new concepts. Therefore, we will omit the explanation of the assembler output of the 'foo' function. The final instruction of the main-function's body is the 'j' instruction in line 24. This instruction will cause the CPU to jump to the address of the specified label. In this example, the CPU will jump to the first instruction of the 'epilogue\_0' label [PW17, p. 17]. Therefore, the rush compiler uses the 'call' instruction for jumps caused by function calls and the 'j' instruction for jumps between blocks of the current function.

Like explained previously, every function has a *prologue* and an *epilogue*. Since one of the tasks handled by the epilogue is releasing resources allocated by the prologue, the function's stack pointer is incremented in line 30. Finally, the 'ret' instruction in line 31 is used in order to jump back to the instruction whose address is specified in the 'ra' register [PW17, reference].

#### 5.2.6. The rush Compiler Targeting RISC-V Assembly

#### 5.3. x86\_64: A Compiler for a CISC Architecture

- Present Some example instructions
- Register layout & count
- Calling convention
- INFORMATION: x86: An intel blog stated ca. 3600 Instructions in 2015 [RA17]
- RISC-V Reader: ASCII adjust after addition  $\rightarrow$  aaa  $\rightarrow$  obsolete, collectively occupy 1.6% of opcode space[PW17, p. 4]
- "Modular vs. Incremental ISAs Intel was betting its future on a high-end microprocessor, but that was still years away. To counter Zilog, Intel developed a stop-gap processor and called it the 8086. It was intended to be short-lived and not have any successors, but that's not how things turned out. The high-end processor ended up being late to market, and when it did come out, it was too slow. So the 8086 architecture lived on—it evolved into a 32-bit processor and eventually into a 64-bit one. The names kept changing (80186, 80286, i386, i486, Pentium), but the underlying instruction set remained intact. —Stephen P. Morse, architect of the 8086" [Mor17]. ⇒ only meant as a temporary architecture

#### 5.3.1. Register Layout

For registers: [Kus18, p. 7]

5.3.2. Memory Access Through the Stack

5.3.3. Calling Convention

5.3.4. X86\_64 Assembly

5.4. Conclusion

5.4.1. RISC vs CISC Architectures

TODO: RISC vs. CISC [Dan05, p. 5-6]

• RISC-V was less demanding and better

## 6. Final Thoughts and Conclusions

### Acknowledgements

We would like to express out sincere gratitude towards our supervisor - Sonja Sokolović for her invaluable supervision and support during the creation of this paper. Our gratitude extends to our school, the CFG Wuppertal which allowed us to pursue this paper as an additional research project.

# **List of Figures**

| Abstract Syntax Tree for '1+2**3'                                    | 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
|----------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Token Precedences for Input '(1+2*3)/4**5'                           | 10                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| Call Stack at the Point of Processing the Return Statement           | 25<br>29<br>31<br>33                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| Abstract Syntax Tree for '1+2 < 4' Representation of Loops in the VM | 38<br>41<br>46<br>51<br>56                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|                                                                      | 59<br>60<br>63<br>65<br>68                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|                                                                      | Token Precedences for Input '(1+2*3)/4**5'  Call Stack at the Point of Processing the Return Statement Abstract Syntax Tree and VM Instructions of a Recursive rush Program  DRAFT: Linear Memory of the rush VM  Call Stack of the rush VM  Abstract Syntax Tree for '1+2 < 4'  Representation of Loops in the VM  Steps of Compilation When Using LLVM  Translation of a Simple rush Program to LLVM IR  How a Linker Works  Level of Abstraction provided by Assembly  Relationship Between Registers, Memory, and the CPU |

### **List of Tables**

| 2.1. | Advancing Window of a Lexer                        | 5  |
|------|----------------------------------------------------|----|
| 2.2. | Mapping From EBNF Grammar to Rust Type Definitions | 7  |
| 5.1. | Common Registers in RISC-V [WA19, p .155]          | 64 |

# **List of Listings**

| 1.1.  | Generating Fibonacci Numbers Using rush                   | 3               |
|-------|-----------------------------------------------------------|-----------------|
| 2.1.  | Grammar for Basic Arithmetic in EBNF Notation             | 4               |
| 2.2.  | The rush 'Lexer' Struct Definition                        | 5               |
| 2.3.  | Simplified 'Token' Struct Definition                      | 6               |
| 2.4.  | Example Language a Traditional LL(1) Parser Cannot Parse  | 7               |
| 2.5.  | Token Precedences in rush                                 | 9               |
| 2.6.  | Pratt-Parser: Implementation for Expressions              | 9               |
| 2.7.  |                                                           | 10              |
| 2.8.  |                                                           | 10              |
| 2.9.  | A rush Program Which Adds Two Integers                    | 12              |
| 2.10. | Attributes of the analyzer struct                         | 13              |
| 2.11. | Output When Compiling an Invalid rush Program             | 14              |
| 2.12. | Analyzer Validating the Signature of the 'main' Function  | 15              |
| 2.13. | Beginning of the 'let_stmt' Method                        | 16              |
| 2.14. | Analysis of Expressions During Semantic Analysis          | 17              |
| 2.15. | Obtaining the Type of Expressions                         | 18              |
| 2.16. | Validation of Call-Arguments in the Analyzer              | 18              |
| 2.17. | Validation of Argument Type Compatibility in the Analyzer | 19              |
| 2.18. | Method for Determining if an Expression is Constant       | 20              |
| 2.19. | Loop Transformation in the Analyzer                       | 21              |
| 3.1.  | Tree-Walking Interpreter: Type Definitions                | 22              |
| 3.2.  | 0 1 11                                                    | $\frac{22}{23}$ |
| 3.3.  | 0 1                                                       | $\frac{23}{23}$ |
| 3.4.  |                                                           | $\frac{25}{24}$ |
| 3.5.  | ŭ -                                                       | $\frac{24}{24}$ |
| 3.6.  |                                                           | $\frac{24}{29}$ |
| 3.7.  |                                                           | $\frac{25}{30}$ |
| 3.8.  |                                                           | 32              |
| 3.9.  | 1                                                         | $\frac{32}{32}$ |
|       | 1                                                         | 33              |
|       |                                                           | 35              |
|       |                                                           | 36              |
| 0.12  |                                                           | -               |
| 4.1.  | 1                                                         | 39              |
| 4.2.  |                                                           | 40              |
| 4.3.  | Compilation of Expressions Targeting the VM               | 40              |
| 4.4.  |                                                           | 42              |
| 4.5.  | 1                                                         | 43              |
| 4.6.  |                                                           | 45              |
| 4.7.  | ı v ı                                                     | 45              |
| 4.8.  |                                                           | 48              |
| 1 Q   | Parts of the Struct Definition of the rush LLVM Compiler  | 50              |

| 4.10. | Compilation of the 'main' Function Using LLVM               |  |  |   |   | 51 |
|-------|-------------------------------------------------------------|--|--|---|---|----|
| 4.11. | Compilation of Call-Expressions Using LLVM                  |  |  |   |   | 52 |
| 4.12. | Compilation of Expressions Using LLVM                       |  |  |   |   | 52 |
| 4.13. | Compilation of Integer Infix-Expressions Using LLVM         |  |  |   |   | 53 |
| 4.14. | Compilation of Let-Statements Using LLVM                    |  |  |   |   | 54 |
| 4.15. | Pointer Allocation in the LLVM Compiler                     |  |  |   |   | 55 |
| 4.16. | Using LD to link the LLVM output                            |  |  | • | • | 56 |
| 5.1.  | Example Assembly Program for Explaining Register Allocation |  |  |   |   | 61 |
| 5.2.  | Foo                                                         |  |  |   |   | 67 |

### **Bibliography**

- [Bac+60] J. W. Backus et al. "Report on the algorithmic language ALGOL 60". In: 3.5 (May 1960). Ed. by Peter Naur, pp. 299–314. DOI: 10.1145/367236.367262.
- [Wir77] Niklaus Wirth. "What can we do about the unnecessary diversity of notation for syntactic definitions?" In: Communications of the ACM 20.11 (Nov. 1977), pp. 822–823. DOI: 10.1145/359863.359883.
- [Lev00] John R. Levine. *Linkers and Loaders*. San Francisco, CA: Morgan Kaufmann, 2000.
- [Lat02] Chris Lattner. "LLVM: An Infrastructure for Multi-Stage Optimization". MA thesis. Urbana, IL: Computer Science Dept., University of Illinois at Urbana-Champaign, Dec. 2002.
- [Dan05] Sivarama P. Dandamudi. Introduction to Assembly Language Programming: For Pentium and RISC Processors. 2nd ed. Springer International Publishing, 2005. ISBN: 0-387-20636-1.
- [Wir05] Niklaus Wirth. Compiler Construction. Zürich, 2005. ISBN: 0-201-40353-6.
- [Fan10] Dominic Fandrey. Clang/LLVM Maturity Report. Karlsruhe, Germany, June 2010.
- [CA14] Bruno Cardoso Lopes and Rafael Auler. Getting started with LLVM core libraries. Birmingham, England: Packt Publishing, Aug. 2014. ISBN: 978-1-78216-692-4.
- [Lin+14] Tim Lindholm et al. The Java Virtual Machine Specification, Java SE 8 edition. Boston, MA: Addison-Wesley Professional, May 2014. ISBN: 978-0-13-390590-8.
- [Kol17] Daniel Kolsoi. Inkwell. 2017. URL: https://github.com/TheDan64/inkwell.
- [Mor17] Stephen P. Morse. "The Intel 8086 Chip and the Future of Microprocessor Design". In: (Apr. 2017), pp. 8–9. DOI: 10.1109/MC.2017.120.
- [PH17] David A. Patterson and John L. Hennessy. Computer Organization and Design RISC-V Edition: The Hardware Software Interface. The Morgan Kaufmann Series in Computer Architecture and Design. Morgan Kaufmann, Apr. 2017. ISBN: 978-0-12-812275-4.
- [PW17] David A. Patterson and Andrew Waterman. *The RISC-V Reader: An Open Architecture Atlas.* Strawberry Canyon LLC, Oct. 2017. ISBN: 978-0-9992491-0-9.
- [RA17] Steven Rodgers and Richard A. Uhlig. "X86: Approaching 40 and Still Going Strong". In: (June 2017).
- [Wat17] Des Watson. A practical approach to compiler construction. en. 1st ed. Undergraduate Topics in Computer Science. Cham, Switzerland: Springer International Publishing, Apr. 2017. ISBN: 978-3-319-52787-1.
- [Zhi17] Igor Zhirkov. Low-Level Programming: C, Assembly, and Program Execution on Intel® 64 Architecture. 1st ed. Saint Petersburg, Russia: APress, June 2017. ISBN: 978-1-4842-2403-8.
- [Kus18] Daniel Kusswurm. Modern X86 Assembly Language Programming. 2nd ed. Geneva, IL, USA: APress, Dec. 2018. ISBN: 978-1-4842-4063-2.

- [WA19] Andrew Waterman and Krste Asanović. "The RISC-V Instruction Set Manual Volume I: Unprivileged ISA". In: (Dec. 2019). Ed. by Andrew Waterman, Krste Asanović, and Sivive Inc. URL: https://riscv.org/technical/specifications/.
- [Led20] Jim Ledin. Modern Computer Architecture and Organization. Birmingham, England: Packt Publishing, Apr. 2020. ISBN: 978-1-83898-439-7.
- [Hsu21] Min-Yih Hsu. LLVM Techniques, Tips, and Best Practices Clang and Middle-End Libraries. Birmingham, England: Packt Publishing, Apr. 2021. ISBN: 978-1-83882-495-2.
- [McN21] Timothy Samual McNamara. *Rust in Action*. In Action. New York, NY: Manning Publications, Aug. 2021. ISBN: 9781617294556.
- [Sen22] Kumar N Sendil. *Practical WebAssembly*. Birmingham, England: Packt Publishing, May 2022. ISBN: 978-1-83882-800-4.

# A. Complete Grammar of rush in EBNF Notation

```
Program = { Item } ;
1
2
                        = FunctionDefinition | LetStmt;
3
   FunctionDefinition = 'fn' , ident , '(' , [ ParameterList ] , ')'
                         , [ '->' , Type ] , Block ;
                        = Parameter , { ',' , Parameter } , [ ',' ] ; 
= [ 'mut' ] , ident , ':' , Type ;
   ParameterList
7
   Parameter
    Block = '{' , { Statement } , [ Expression ] , '}' ;
    Type = \{ '*' \}, ( ident
10
                        | '(' , ')' );
11
12
                  = LetStmt | ReturnStmt | LoopStmt | WhileStmt | ForStmt
   Statement
13
14
                  | BreakStmt | ContinueStmt | ExprStmt ;
   LetStmt
                  = 'let' , [ 'mut' ] , ident , [ ':' , Type ] , '='
15
                  , Expression , ';';
16
                  = 'return' , [ Expression ] , ';' ;
   ReturnStmt
17
                  = 'loop' , Block , [ ';' ] ;
   LoopStmt
18
                 = 'while' , Expression , Block , [ ';' ] ;
19 WhileStmt
                  = 'for' , ident , '=' , Expression , ';' , Expression
20 ForStmt
                  , ';' , Expression , Block , [ ';' ] ;
22 BreakStmt = 'break' , ';' ;
23 ContinueStmt = 'continue' , ';'
                  = ExprWithoutBlock ,
24 ExprStmt
                  | ExprWithBlock , [ ';' ] ;
25
26
                      = ExprWithoutBlock | ExprWithBlock;
27
   Expression
                      = Block | IfExpr ;
   ExprWithBlock
                      = 'if' , Expression , Block , [ 'else' , ( IfExpr
   IfExpr
30
                                                                    Block ) ] ;
31
   ExprWithoutBlock = int
32
                      | float
33
                      I bool
                      char
34
                       ident
35
                      | PrefixExpr
36
37
                      InfixExpr
                      | AssignExpr
                      CallExpr
39
40
                      | CastExpr
                      | '(' , Expression , ')';
41
42 PrefixExpr
                      = PREFIX_OPERATOR , Expression ;
                      = Expression , INFIX_OPERATOR , Expression ;
43 InfixExpr
   (* The left hand side can only be an `ident` or a `PrefixExpr` with the `*`
    → operator *)
   AssignExpr
                      = Expression , ASSIGN_OPERATOR , Expression ;
                     = ident , '(' , [ ArgumentList ] , ')' ;
= Expression , { ',' , Expression } , [ ',' ] ;
= Expression , 'as' , Type ;
   CallExpr
   ArgumentList
48
   CastExpr
49
  ident = LETTER , { LETTER | DIGIT } ;
```

```
51
52
53
54
55
                 | '\' , ( ESCAPE_CHAR
56
                         | """
57
                         | 'x' , 2 * HEX ) ) , "'" ;
58
   bool = 'true' | 'false';
59
60
   61
62
63
               = 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I'
64
   LETTER
                 'J' | 'K' | 'L' | 'M' | 'N' |
                                              '0' | 'P' | 'Q'
                                                             | 'R'
65
                          | 'U' | 'V' |
                                        'W'
                                                    'Y' |
                                                         'Z'
66
                 'S' | 'T'
                                              ' X '
                                                             | 'a'
                 'b'
                     | 'c'
                          | 'd' | 'e' |
                                        'f' |
                                              ' q '
                                                   'h'
                                                         'i'
                                                             | 'j'
67
                 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q'
                                                       | 'r'
68
                't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z'
69
               = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8'
70
   DIGIT
               191
71
               = DIGIT | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'a'
72
               | 'b' | 'c' | 'd' | 'e' | 'f' ;
73
74
               = ? any UTF-8 character ? ;
   ASCII_CHAR = ? any ASCII character ? ;
75
   ESCAPE_CHAR = '\' | 'b' | 'n' | 'r' | 't' ;
76
77
                      = '!' | '-' | '&' | '*' ;
   PREFIX_OPERATOR
78
                      = ARITHMETIC_OPERATOR | RELATIONAL_OPERATOR
   INFIX_OPERATOR
79
80
                       BITWISE_OPERATOR | LOGICAL_OPERATOR ;
   ARITHMETIC_OPERATOR = '+' | '-' | '*' | '/' | '%' | '**' ;
81
   RELATIONAL_OPERATOR = '==' | '!=' | '<' | '>' | '<=' | '>=' ;
82
                      = '<<' | '>>' | '|' | '&' | '^' ;
   BITWISE_OPERATOR
83
                      = '&&' | '||' ;
= '=' | '+=' | '-=' | '*=' | '/=' | '%='
84
   LOGICAL_OPERATOR
85
   ASSIGN_OPERATOR
                       | '**=' | '<<=' | '>>=' | '|=' | '&=' | '^=' ;
86
```