# Chapters 1 & 2: Introduction

## Definitions

**Translator** – A program that takes as input a program written in one programming language (the _source language_) and produces as output an _equivalent program_ written in another language (the _target language_).

Three programs are involved:
- One is fixed: the translator
- Two vary from run to run: the source program and the target program

<h3><center><i>Translation Process</i></center></h3>
<img src="./res/01-02/1-1.png" width="700px" alt="Translation Process"/>

**Compiler** – A translator in which the source language is a high-level language such as Fortran, Pascal, C or Java and the target language is a low-level machine language or assembly language.

**Why compilers?**
* Users want to program in easy high-level languages.
* A computer can only understand/execute programs written in its own machine language.
* A compiler serves as an interface between these two by translating high-level source code into low-level machine code.

## Program Compilation Environment

<img src="res/01-02/1-2.png" width="500px" alt="Source Program to Compiler to Target Program"/>

## Basic Modules of a Compiler

<img src="./res/01-02/1-3.png" width="600px" alt="Basic Modules of a Compiler"/>

### Some Fundamental Concepts

* _Analysis_ versus _synthesis_ 
 - The **analysis** part (the first **three modules in sequence**) breaks the source program into pieces such as subprograms, blocks and statements and represents their relations in intermediate code.
 - The **synthesis part** (**other three in sequence**) constructs the target program.
 - The two modules on the sides (symbol-table manager and error handler) support both analysis and synthesis.
* _Single-pass_ versus _multi-pass compiling_
 - A **single-pass compiler** runs these modules **strictly in sequence**.
   - Fast
   - Needs a lot of memory space because all intermediate compilation information must be kept in the main memory
 - A **multi-pass compiler** runs these modules back and forth, **working only on a part of the source program at a time**. 
   - Less memory space
   - Slow
* _Front-end_ versus _back-end_ 
 - The **front-end** (**the first four modules in sequence**, together with part of the two supporting modules) depends primarily on the source language and is independent of the target machine.
 - The **back-end** **depends on the intermediate language and the target machine**. 
 - **Note that the same front-end can be reused for many different target machines**; only the back-end has to be written for each machine. So, it is better to include as many machine-independent features as possible in the front-end.

### Shared Modules

* **Symbol-table manager** (or **bookkeeper**) – This module does the bookkeeping job; in particular, it records all user-defined identifiers and their known attributes (such as types and scopes) in a symbol table.
* **Error-handler** – This module prints appropriate error messages and corrects errors when possible. Errors are detected mostly by the first three modules and compilation must proceed in the presence of errors (error recovery) in order to find other errors in the source program.
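
As a rough illustration of the bookkeeping job, the sketch below keeps one record per lexeme and lets attributes be filled in later. The class name `SymbolTable` and its methods are made up for this example; they are not taken from any particular compiler.

```python
class SymbolTable:
    """Toy symbol table: maps each lexeme to one record of attributes."""

    def __init__(self):
        self._entries = {}   # lexeme -> attribute dict
        self._order = []     # lexemes in order of first appearance

    def insert(self, lexeme, kind, **attrs):
        """Record a lexeme once; later calls only add new attributes."""
        if lexeme not in self._entries:
            self._entries[lexeme] = {"kind": kind}
            self._order.append(lexeme)
        self._entries[lexeme].update(attrs)
        return self._order.index(lexeme)   # position acts as the "pointer"

    def lookup(self, lexeme):
        """Return the attribute record, or None for an unknown lexeme."""
        return self._entries.get(lexeme)

    def rows(self):
        """Return (lexeme, kind) pairs in order of first appearance."""
        return [(name, self._entries[name]["kind"]) for name in self._order]


table = SymbolTable()
table.insert("x", "id")
table.insert("y", "id", type="real")
table.insert("x", "id", type="real")   # same entry, attribute filled in later
print(table.rows())       # [('x', 'id'), ('y', 'id')]
print(table.lookup("x"))  # {'kind': 'id', 'type': 'real'}
```

Returning the entry's position from `insert` is one simple way to give tokens a reference to their symbol-table entry; a real compiler would more likely hand out a pointer or a handle.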

### Lexical Analyzer (Scanner)

This module takes the source program as a sequence of characters and groups certain characters that logically belong together into single entities called tokens.

**NOTE:** Some literature distinguishes between _lexemes_ (the actual text) and _tokens_ (the lexeme together with its type). For instance, a variable $x$ would be a lexeme associated with a token of the form $\langle x, id \rangle$.

<img src="./res/01-02/1-4.png" width="700px" alt="A scanner transforms a sequence of characters into a sequence of tokens"/>
<h4><center><i>A scanner transforms a sequence of characters into a sequence of tokens</i></center></h4>

#### Example: Tokens
* Reserved keywords such as `BEGIN`, `DO`, `WHILE`, `IF`, ...
* User-defined identifiers such as $x$, $y$, $myfile$, $x.myfile$, ...
* Constants such as $23$, $-13.6$, $25\times10^{-5}$, ...
* Special symbols such as `#`, `(`, `.`, `,`, `+`, `*`, ...

**NOTE:** All user-defined identifiers and constants must be passed to the bookkeeper.

#### Example: Pascal program
```
x := sqrt(y); while I <= j do whilei = 1022 1x; stop
```

<h4><center><i>Lexical Output</i></center></h4>

| Token (Lexeme) | Type           |
|----------------|----------------|
| x              | id             |
| :=             | keyword        |
| sqrt           | keyword        |
| (              | special symbol |
| y              | id             |
| )              | special symbol |
| ;              | special symbol |
| while          | keyword        |
| I              | id             |
| <=             | keyword        |
| j              | id             |
| do             | keyword        |
| whilei         | id             |
| =              | special symbol |
| 1022           | const          |
| 1x             | ERROR          |
| ;              | special symbol |
| stop           | id             |

<h4><center><i>Symbol Table</i></center></h4>

| Token (Lexeme) | Type  | ... | ... | ... |
|----------------|-------|-----|-----|-----|
| x              | id    |     |     |     |
| y              | id    |     |     |     |
| I              | id    |     |     |     |
| j              | id    |     |     |     |
| whilei         | id    |     |     |     |
| 1022           | const |     |     |     |
| stop           | id    |     |     |     |
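
The two tables above can be reproduced with a small regular-expression scanner. This is only a sketch: the token classes mirror the Lexical Output table (including its choice to list `:=`, `<=` and `sqrt` as keywords), and the function name `scan` and the pattern names are assumptions made for the example.

```python
import re

# Classes mirror the Lexical Output table above, which lists ":=", "<=" and
# "sqrt" as keywords; adjust this set for a different convention.
KEYWORDS = {"while", "do", "sqrt", ":=", "<="}
TOKEN_RE = re.compile(r"""
    (?P<error>\d+[A-Za-z]\w*)     # digits running into letters, e.g. 1x
  | (?P<const>\d+)
  | (?P<word>[A-Za-z]\w*)
  | (?P<symbol>:=|<=|[();=])
  | (?P<space>\s+)
""", re.VERBOSE)


def scan(source):
    """Return (tokens, symbol_table) for a line of the toy language."""
    tokens, symbols = [], []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r}")
        pos = match.end()
        lexeme, group = match.group(), match.lastgroup
        if group == "space":
            continue
        if group == "error":
            kind = "ERROR"
        elif lexeme.lower() in KEYWORDS:
            kind = "keyword"
        elif group == "word":
            kind = "id"
        elif group == "const":
            kind = "const"
        else:
            kind = "special symbol"
        tokens.append((lexeme, kind))
        # Identifiers and constants are passed on to the bookkeeper once.
        if kind in ("id", "const") and (lexeme, kind) not in symbols:
            symbols.append((lexeme, kind))
    return tokens, symbols


tokens, symbols = scan("x := sqrt(y); while I <= j do whilei = 1022 1x; stop")
for lexeme, kind in tokens:
    print(f"{lexeme:8} {kind}")
```

Because the `word` pattern is greedy, `whilei` is read as one identifier rather than the keyword `while` followed by `i`, which matches the table.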

#### Example: FORTRAN program
```
IF (5.EQ.MAX) GOTO100 ELSE GO TO 100
```

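**NOTE:** Observe that we need a "lookahead" before saying 5 alone is a token: after reading `5.` the scanner cannot yet tell whether the dot belongs to a real constant (as in `5.0`) or begins the operator `.EQ.`.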

<h4><center><i>Lexical Output</i></center></h4>

| Token (Lexeme) | Type           |
|----------------|----------------|
| IF             | keyword        |
| (              | special symbol |
| 5              | constant       |
| .              | special symbol |
| EQ             |                |
| .              | special symbol |
| MAX            | id             |
| )              | special symbol |
| GOTO100        | id             |
| ELSE           | keyword        |
| GO             | keyword        |
| TO             | keyword        |
| 100            | constant       |

<h4><center><i>Symbol Table</i></center></h4>

| Token (Lexeme) | Type  | ... | ... | ... |
|----------------|-------|-----|-----|-----|
| 5              | const |     |     |     |
| MAX            | id    |     |     |     |
| GOTO100        | id    |     |     |     |
| 100            | const |     |     |     |
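
The lookahead mentioned in the note can be sketched as follows. The helper `scan_number` is hypothetical and deliberately incomplete (it ignores exponents such as `5.E2`); it only shows how peeking past the dot decides where the constant ends.

```python
# FORTRAN operators written between dots, e.g. .EQ. or .AND.
DOT_OPERATORS = ("EQ", "NE", "LT", "LE", "GT", "GE", "AND", "OR", "NOT")


def scan_number(text, pos):
    """Return (lexeme, next_pos) for the numeric constant starting at pos."""
    end = pos
    while end < len(text) and text[end].isdigit():
        end += 1
    if end < len(text) and text[end] == ".":
        # Lookahead: does the dot start an operator such as .EQ. ?
        rest = text[end + 1:]
        if any(rest.startswith(op + ".") for op in DOT_OPERATORS):
            return text[pos:end], end      # integer; the dot is not ours
        end += 1                           # the dot belongs to the constant
        while end < len(text) and text[end].isdigit():
            end += 1
    return text[pos:end], end


print(scan_number("5.EQ.MAX", 0))   # ('5', 1)
print(scan_number("5.25+X", 0))     # ('5.25', 4)
```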

### Syntax Analyzer (Parser)

This module performs complete syntax checking (structural analysis) of the source program, i.e., it determines the syntactic relations among the tokens found by the scanner.

<img src="./res/01-02/1-5.png" width="700px" alt="A parser transforms a sequence of tokens into a parse tree"/>
<h4><center><i>A parser transforms a sequence of tokens into a parse tree</i></center></h4>

Note that the syntax of a programming language is defined by a _context-free grammar_, which is a system $G=(N,\Sigma,P,S)$ with $N$ = nonterminal set, $\Sigma$ = terminal set, $P$ = rules of the form $A\rightarrow\alpha$, where $A\in N$ and $\alpha\in (N\cup \Sigma)^*$, and $S\in N$ is the start symbol.

#### Example: CFG
```
<asmt stmt>    → <identifier> := <expr>
<identifier>   → <letter> | <letter> <alphanumeric>
<letter>       → A | B | … | Z
<alphanumeric> → <letter> | <digit> | <letter> <alphanumeric>
               | <digit> <alphanumeric>
<digit>        → 0 | 1 | … | 9
<expr>         → <identifier> | <constant> | <expr> + <expr> | <expr> * <expr>
<constant>     → <digit> | <digit> <constant>
```
    
This CFG consists of 7 nonterminal symbols, 40 terminal symbols and 49 rules. The start symbol is `<asmt stmt>`, which defines the syntactic structure of all assignment statements.
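
The counts can be checked mechanically by writing the grammar down as data. In this sketch the dictionary layout is an arbitrary choice, and `:=` is entered as the two terminals `:` and `=`, an assumption that is needed to arrive at 40 terminal symbols.

```python
import string

LETTERS = list(string.ascii_uppercase)   # A .. Z
DIGITS = list(string.digits)             # 0 .. 9

# Each nonterminal maps to its list of right-hand sides (tuples of symbols).
# ":=" is written as the two terminals ":" and "=" (assumption, see above).
GRAMMAR = {
    "<asmt stmt>": [("<identifier>", ":", "=", "<expr>")],
    "<identifier>": [("<letter>",), ("<letter>", "<alphanumeric>")],
    "<letter>": [(c,) for c in LETTERS],
    "<alphanumeric>": [
        ("<letter>",), ("<digit>",),
        ("<letter>", "<alphanumeric>"), ("<digit>", "<alphanumeric>"),
    ],
    "<digit>": [(d,) for d in DIGITS],
    "<expr>": [
        ("<identifier>",), ("<constant>",),
        ("<expr>", "+", "<expr>"), ("<expr>", "*", "<expr>"),
    ],
    "<constant>": [("<digit>",), ("<digit>", "<constant>")],
}

nonterminals = set(GRAMMAR)
terminals = {
    symbol
    for alternatives in GRAMMAR.values()
    for rhs in alternatives
    for symbol in rhs
    if symbol not in nonterminals
}
rule_count = sum(len(alternatives) for alternatives in GRAMMAR.values())

print(len(nonterminals), len(terminals), rule_count)   # 7 40 49
```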

#### Example: Parse tree
```
A := B1 + C * 21
```

**NOTE:** The parse tree is constructed on **tokens**, NOT INDIVIDUAL SYMBOLS! The symbols below the parse tree are detected by the scanner.

**NOTE:** The tokens for `A`, `B1`, `C`, and `21` have pointers to their symbol table entries.

<h4><center><i>Parse Tree Example</i></center></h4>
<img src="./res/01-02/1-6.png" width="700px" alt="Parse Tree Example"/>
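
A parser for this statement can be sketched by recursive descent over the token list. The `<expr>` rules as written do not say whether `+` or `*` binds tighter, so the code below assumes the usual convention (`*` before `+`); it also builds a simplified tree with operator nodes only, not the full parse tree of the figure. All function names are invented for the example.

```python
def parse_assignment(tokens):
    """Parse [id, ':=', operand, op, operand, ...] into nested tuples."""
    target, assign, *rest = tokens
    if assign != ":=":
        raise SyntaxError("expected ':='")
    tree, remaining = parse_sum(rest)
    if remaining:
        raise SyntaxError(f"unexpected token {remaining[0]!r}")
    return (":=", target, tree)


def parse_sum(tokens):
    """sum -> product { '+' product }"""
    left, tokens = parse_product(tokens)
    while tokens and tokens[0] == "+":
        right, tokens = parse_product(tokens[1:])
        left = ("+", left, right)
    return left, tokens


def parse_product(tokens):
    """product -> operand { '*' operand }"""
    left, tokens = parse_operand(tokens)
    while tokens and tokens[0] == "*":
        right, tokens = parse_operand(tokens[1:])
        left = ("*", left, right)
    return left, tokens


def parse_operand(tokens):
    """operand -> identifier | constant (already grouped by the scanner)"""
    if not tokens or not tokens[0].isalnum():
        raise SyntaxError("expected an identifier or a constant")
    return tokens[0], tokens[1:]


tree = parse_assignment(["A", ":=", "B1", "+", "C", "*", "21"])
print(tree)   # (':=', 'A', ('+', 'B1', ('*', 'C', '21')))
```

Note that the parser receives `B1` and `21` as single tokens; it never looks at individual characters.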

### Semantic Analyzer

This module analyzes the meaning of the source program, e.g., it performs type checking.

#### Example: Type incompatibility
In `x := 20 * y`, where `x` and `y` are of the real type and `20` is an integer, the semantic analyzer detects the type incompatibility between `20` and `y`; it also performs type conversion (here, integer to real) if necessary.
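
A minimal sketch of this check is shown below. The declared types in `SYMBOL_TYPES` and the conversion name `inttoreal` are assumptions for illustration; the code handles only a single product assigned to a variable.

```python
SYMBOL_TYPES = {"x": "real", "y": "real"}   # hypothetical declared types


def type_of(operand):
    """Type of a declared identifier or of a numeric literal."""
    if operand in SYMBOL_TYPES:
        return SYMBOL_TYPES[operand]
    if operand.isdigit():
        return "integer"
    float(operand)   # raises ValueError for anything that is not a literal
    return "real"


def coerce(operand, wanted):
    """Wrap operand in an integer-to-real conversion when needed."""
    actual = type_of(operand)
    if actual == wanted:
        return operand
    if actual == "integer" and wanted == "real":
        return f"inttoreal({operand})"
    raise TypeError(f"cannot convert {actual} to {wanted}")


def check_assignment(target, left, right):
    """Type-check target := left * right and return the annotated statement."""
    widest = "real" if "real" in (type_of(left), type_of(right)) else "integer"
    expr = f"{coerce(left, widest)} * {coerce(right, widest)}"
    target_type = SYMBOL_TYPES[target]
    if widest == "real" and target_type == "integer":
        raise TypeError("cannot assign a real value to an integer variable")
    if widest == "integer" and target_type == "real":
        expr = f"inttoreal({expr})"
    return f"{target} := {expr}"


print(check_assignment("x", "20", "y"))   # x := inttoreal(20) * y
```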

### Intermediate Code Generator

This module produces a sequence of intermediate symbolic codes, typically three-address codes (statements involving at most three operands), of the form $A := B \odot C$.

#### Example
```
// general examples
A  := B * C
A  := B

// from the previous parse tree: A := B1 + C * 21
R1 := C * 21
R2 := B1 + R1
A  := R2
```
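
The second example can be generated from an expression tree by a post-order walk that introduces one temporary per operator. The tree is written as nested tuples here, which is just one possible representation, and the `R1`, `R2`, ... naming follows the example above.

```python
from itertools import count


def generate(tree):
    """Return the list of three-address statements for an assignment tree."""
    code = []
    temps = count(1)

    def emit(node):
        if isinstance(node, str):          # identifier or constant leaf
            return node
        op, left, right = node
        left_name, right_name = emit(left), emit(right)
        temp = f"R{next(temps)}"           # fresh temporary for this operator
        code.append(f"{temp} := {left_name} {op} {right_name}")
        return temp

    _, target, expr = tree
    result = emit(expr)
    code.append(f"{target} := {result}")
    return code


tree = (":=", "A", ("+", "B1", ("*", "C", "21")))
for line in generate(tree):
    print(line)
# R1 := C * 21
# R2 := B1 + R1
# A := R2
```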

### Code Optimizer

This module transforms the intermediate code into more time/space-efficient code.

* Local optimization (such as peephole optimization) considers only small portions of the source program.
* Global optimization performs optimization considering its effect over the entire source program (by using data/control flow analysis).

#### Example

```
// before optimization (notice that B1 + R1 can be immediately assigned to A)
R1 := C * 21
R2 := B1 + R1
A  := R2      // copy instruction; a statement may have at most 3 operands

// after optimization (possible if R2 is not used elsewhere)
R1 := C * 21
A  := B1 + R1
```
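
This particular rewrite can be sketched as a tiny peephole pass over the list of statements. The function `remove_copies` is hypothetical; it assumes that names starting with `R` are compiler temporaries and that statements have the textual form `target := expr` used above.

```python
def remove_copies(code):
    """Fold 'T := expr' followed by 'V := T' into 'V := expr' when safe."""
    statements = [line.split(" := ") for line in code]
    result = []
    i = 0
    while i < len(statements):
        target, expr = statements[i]
        if target.startswith("R") and i + 1 < len(statements):
            next_target, next_expr = statements[i + 1]
            # The temporary must not be read by any later statement.
            used_later = any(
                target in later_expr.split()
                for _, later_expr in statements[i + 2:]
            )
            if next_expr == target and not used_later:
                result.append(f"{next_target} := {expr}")
                i += 2
                continue
        result.append(f"{target} := {expr}")
        i += 1
    return result


print(remove_copies(["R1 := C * 21", "R2 := B1 + R1", "A := R2"]))
# ['R1 := C * 21', 'A := B1 + R1']
```

The pass looks at only two adjacent statements at a time, which is what makes it a local optimization; the `used_later` test is the "not used elsewhere" condition from the comment above.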

### Code Generator

This module produces the target code in the form of Assembly code or relocatable machine code.

#### Example

##### 3-address code
```
R1 := C * 21
R2 := B1 + R1
A  := R2
```

##### Assembly code
```
LOAD C      ; acc (accumulator) = C
MULT 21     ; acc = C * 21
STOR R1     ; R1 = C * 21
LOAD B1     ; acc = B1
ADD  R1     ; acc = B1 + R1
STOR A      ; A := B1 + C * 21
```
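
For a one-accumulator machine like the one assumed above, each three-address statement can be translated by a fixed pattern: load the first operand, apply the operator, store the result. The sketch below uses the mnemonics from the listing; the function name `to_assembly` is an assumption. Applied to the two-statement form `R1 := C * 21`, `A := B1 + R1`, it yields the six instructions shown above.

```python
OPCODES = {"+": "ADD", "*": "MULT"}


def to_assembly(code):
    """Translate 'target := a op b' or 'target := a' to accumulator code."""
    asm = []
    for line in code:
        target, expr = line.split(" := ")
        parts = expr.split()
        asm.append(f"LOAD {parts[0]}")           # acc = first operand
        if len(parts) == 3:
            _, op, right = parts
            asm.append(f"{OPCODES[op]} {right}")  # acc = acc op second operand
        asm.append(f"STOR {target}")             # target = acc
    return asm


for instruction in to_assembly(["R1 := C * 21", "A := B1 + R1"]):
    print(instruction)
```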

##### Machine code
Assume that instructions have the following function codes:
```
LOAD = 0001
MULT = 0101
ADD  = 1010
STOR = 1101
```
Assume that the symbol table contains the following values:

| Address | Variable | ... |
| ------- | -------- | --- |
| 0011    | C        |     |
| 0100    | B1       |     |
| 0111    | 21       |     |
| 1100    | A        |     |

Assume that machine code for an instruction is of the form `ffff cccc` where 
* `ffff` is the function code
* `cccc` is the variable address/constant

Assume that the instructions
```
STOR R1     ; R1 = C * 21
LOAD B1     ; acc = B1
ADD  R1     ; acc = B1 + R1
```
have been optimized as
```
ADD  B1
```

Then, we obtain the following target code:

| Assembly code | Machine code |
| ------------- | ------------ |
| LOAD C        | 0001 0011    |
| MULT 21       | 0101 0111    |
| ADD  B1       | 1010 0100    |
| STOR A        | 1101 1100    |
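
The table can be reproduced by looking up each mnemonic and each operand, as in the sketch below. The two dictionaries simply restate the function codes and symbol-table addresses assumed above; the function name `assemble` is made up.

```python
FUNCTION_CODES = {"LOAD": "0001", "MULT": "0101", "ADD": "1010", "STOR": "1101"}
ADDRESSES = {"C": "0011", "B1": "0100", "21": "0111", "A": "1100"}


def assemble(lines):
    """Encode each 'MNEMONIC operand' line as 'ffff cccc'."""
    machine = []
    for line in lines:
        mnemonic, operand = line.split()
        machine.append(f"{FUNCTION_CODES[mnemonic]} {ADDRESSES[operand]}")
    return machine


for word in assemble(["LOAD C", "MULT 21", "ADD B1", "STOR A"]):
    print(word)
# 0001 0011
# 0101 0111
# 1010 0100
# 1101 1100
```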

This is called relocatable machine code because its addresses are relative: the actual memory addresses can be determined later, at load time or run time, from these relative distances.

## Compiler-compiler (or compiler-generator)
A compiler-compiler is a software program that takes a description of the main features of the source language as input and produces a module of a compiler.

#### Examples
* Scanner generator, e.g., LEX (regular expression -> scanner)
* Parser generator, e.g., YACC (CFG -> parser)
* Syntax-directed translation engine
* Automatic code generator
* Data-flow engine
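
The idea behind a scanner generator can be illustrated in a few lines: a function takes a list of regular expressions and returns a scanner built from them. This is not LEX's input format or algorithm, only a sketch of the same "regular expression -> scanner" idea using Python's `re` module; `make_scanner` and the rule names are invented for the example.

```python
import re


def make_scanner(spec):
    """Build a scanner function from (token_type, regex) pairs."""
    pattern = re.compile(
        "|".join(f"(?P<{name}>{regex})" for name, regex in spec)
    )

    def scanner(text):
        tokens, pos = [], 0
        while pos < len(text):
            match = pattern.match(text, pos)
            if match is None:
                raise ValueError(f"no rule matches at position {pos}")
            if match.lastgroup != "skip":      # drop whitespace
                tokens.append((match.group(), match.lastgroup))
            pos = match.end()
        return tokens

    return scanner


scan_assignment = make_scanner([
    ("const", r"\d+"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("symbol", r":=|[+*]"),
    ("skip", r"\s+"),
])
print(scan_assignment("A := B1 + C * 21"))
```

Changing the language to be scanned then means changing only the list of rules, not the scanning loop.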