Learning how to write a compiler
`npm run build` or `./build.sh`
There are several steps involved in running source code.
A series of passes over the data makes it easier to handle:
The first step scans through the source code as a string
and returns a series of tokens
identified by their:
1. `type` - defines syntactic usage, such as identifier, keyword, operator, brackets, etc.
2. `data` - typically the string represented by the type, but can be transformed by a preprocessor
For instance, a preprocessor could take several tokens:
{type: "parenthesis", data:"("},
{type: "parenthesis", data:")"},
{type: "operator", data:"="},
{type: "operator", data:">"},
And turn them into
`{type: "arrow-function", data:"()=>"}`,
3. line and char numbers (useful for debugging source)
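Put together, a token might be typed roughly like this. This is only a sketch; the field names `line` and `char` are assumptions based on the list above, not the project's actual type:

```ts
// A rough sketch of the token shape described above (not the project's actual type).
interface Token {
  type: string;  // syntactic usage: "identifier", "keyword", "operator", "parenthesis", ...
  data: string;  // the raw text (or preprocessor-transformed text) for this token
  line: number;  // line number in the source, for debugging
  char: number;  // character position in the source, for debugging
}
```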
This part is still in the works, but it will essentially
be a function that passes over tokens and returns a
modified set.
What modifications this actually entails is up to the preprocessor,
but some examples are:
- source directives
- `.babelrc`-style special language features that are not supported by a parser but can be broken down into lower-level code
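As a rough sketch of that "pass over tokens, return a modified set" idea (reusing the hypothetical `Token` shape above; the function name and matching logic are illustrative, not this project's API), here is a pass that performs the arrow-function merge from the tokenizer example:

```ts
// Hypothetical preprocessor pass: merges the token run "(", ")", "=", ">"
// into a single arrow-function token, as in the example above.
function collapseArrowFunctions(tokens: Token[]): Token[] {
  const out: Token[] = [];
  for (let i = 0; i < tokens.length; i++) {
    const [a, b, c, d] = [tokens[i], tokens[i + 1], tokens[i + 2], tokens[i + 3]];
    if (a && b && c && d && a.data === "(" && b.data === ")" && c.data === "=" && d.data === ">") {
      // Keep the position of the first merged token for debugging purposes.
      out.push({ type: "arrow-function", data: "()=>", line: a.line, char: a.char });
      i += 3; // skip the three tokens we just merged
    } else {
      out.push(a);
    }
  }
  return out;
}
```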
Creates a tree structure, called an Abstract Syntax Tree (AST), from the token array.
This is where the recursive descent part comes into play, and the part I came here to learn about.
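To make the recursive-descent idea concrete, here is a minimal sketch (not this project's parser; the node shapes and grammar are assumptions) that parses tokens for expressions like `1 + 2 * 3` into an AST. Each grammar rule is a function that can call the other rules, and itself, recursively:

```ts
// Minimal recursive-descent sketch over "+" and "*" expressions on number tokens.
type AstNode =
  | { kind: "number"; value: number }
  | { kind: "binary"; op: "+" | "*"; left: AstNode; right: AstNode };

function parseExpression(tokens: Token[]): AstNode {
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];

  // expression := term ("+" term)*
  function expression(): AstNode {
    let node = term();
    while (peek() && peek().data === "+") {
      next();
      node = { kind: "binary", op: "+", left: node, right: term() };
    }
    return node;
  }

  // term := primary ("*" primary)*
  function term(): AstNode {
    let node = primary();
    while (peek() && peek().data === "*") {
      next();
      node = { kind: "binary", op: "*", left: node, right: primary() };
    }
    return node;
  }

  // primary := number token
  function primary(): AstNode {
    return { kind: "number", value: Number(next().data) };
  }

  return expression();
}
```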
I plan on implementing both an interpreter and a code generator.
They will take an abstract syntax tree and either
- run it directly (interpreter), or
- compile it into some lower-level code, typically opcodes or machine code (codegen)
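For the interpreter half, a sketch (using the hypothetical `AstNode` shape from the parser example above) is just a recursive walk over the tree:

```ts
// Hypothetical tree-walking interpreter over the AstNode sketch above.
function evaluate(node: AstNode): number {
  switch (node.kind) {
    case "number":
      return node.value;
    case "binary": {
      const left = evaluate(node.left);
      const right = evaluate(node.right);
      return node.op === "+" ? left + right : left * right;
    }
  }
}
```

A code generator would walk the same tree, but instead of computing values it would emit equivalent lower-level instructions.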
In my process I've decided to take a language-agnostic
approach, even though my end goal is probably
something like typescript/javascript
For instance, the tokenizer actually relies
on a Scanner, which is where language syntax is handled,
and the tokenize function is already implemented for you.
To handle your own language, you'll need to implement
a Scanner subclass.
This is a class meant to be extended.
It provides functionality for scanning text
in a more standard way, which should make debugging easier.
- `addPass` - for adding more syntax handling

Where

```ts
addPass(name: string, pass: ScannerPass): this
```

`name` is the `token.type` used when the pass is successful,
and `pass` is a scanner pass.
Each scanner pass is meant to handle a single type of
language syntax.
Each pass has the signature:

```ts
(data: string, offset: number): ScannerData
```

Where `data` is the source code data,
`offset` is the offset in the source to read from,
and the return value is expected to be a `ScannerData`:
```ts
{
  success: boolean   // needs to be false when the data at offset does not satisfy this pass's type
  readChars: number  // chars that fit this type before we read something we didn't like
  readLines: number  // obsolete, this will be handled by internal code soon
  error?: string     // optional - for when an error is positively identified, not necessarily every time success == false
}
```

Note that `ScannerData` does not actually return the text that was read, only the char count.
This is to standardize the reading process, which should cause far fewer errors
between language implementations.
Basically: don't allow reading of chars that don't fit your specification, and don't count any that don't.
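Putting these pieces together, a pass and a scanner subclass might look something like the following. This is only a sketch based on the shapes described above: the `numberPass` and `MyLanguageScanner` names are made up, and the `Scanner` base class is assumed to be the one provided by this project.

```ts
// Typings matching the ScannerPass / ScannerData descriptions above (assumed, not the project's source).
interface ScannerData {
  success: boolean;
  readChars: number;
  readLines: number; // obsolete per the note above, but still part of the shape
  error?: string;
}
type ScannerPass = (data: string, offset: number) => ScannerData;

// A pass that counts a run of decimal digits starting at offset.
// It only counts chars that fit, and reports success only if at least one char fit.
const numberPass: ScannerPass = (data, offset) => {
  let readChars = 0;
  while (offset + readChars < data.length && /[0-9]/.test(data[offset + readChars])) {
    readChars++;
  }
  return { success: readChars > 0, readChars, readLines: 0 };
};

// Hypothetical subclass registering its own syntax, assuming the project's
// Scanner base class (which provides addPass) is in scope.
class MyLanguageScanner extends Scanner {
  constructor() {
    super();
    this.addPass("number", numberPass); // tokens matched by this pass get type "number"
  }
}
```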