recursive-descent-parser

Learning how to write a compiler

Building

`npm run build` or `./build.sh`

Methods described

There are several steps involved in running source code.
A series of passes over the data makes each step easier to handle:

Tokenize

The first step scans through the source code as a string
and returns a series of tokens (a minimal sketch follows the list below),
each identified by its:

  1. type - defines syntactic usage, such as identifier, keyword, operator, bracket, etc.
  2. data - typically the string represented by the type, but it can be transformed by a preprocessor.
     For instance, a preprocessor could take several tokens:
     `{type: "parenthesis", data:"("}`,
     `{type: "parenthesis", data:")"}`,
     `{type: "operator", data:"="}`,
     `{type: "operator", data:">"}`,
     and turn them into
     `{type: "arrow-function", data:"()=>"}`
  3. line and char numbers (useful for debugging source)
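
For illustration, here is a minimal sketch of what a token and a very naive tokenize pass might look like. The Token shape and the function name are assumptions for this example, not this repository's actual API:

```ts
// Hypothetical token shape for illustration; the fields mirror the list above.
interface Token {
  type: string;   // syntactic usage: "identifier", "operator", "parenthesis", ...
  data: string;   // the raw text (possibly transformed later by a preprocessor)
  line: number;   // line number, useful for debugging source
  char: number;   // column number, useful for debugging source
}

// Naive tokenizer sketch: recognizes only parentheses and the "=" / ">" operators.
function tokenizeSketch(source: string): Token[] {
  const tokens: Token[] = [];
  let line = 1;
  let char = 1;
  for (const c of source) {
    if (c === "\n") { line++; char = 1; continue; }
    if (c === "(" || c === ")") {
      tokens.push({ type: "parenthesis", data: c, line, char });
    } else if (c === "=" || c === ">") {
      tokens.push({ type: "operator", data: c, line, char });
    }
    char++;
  }
  return tokens;
}
```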

Preprocess

This part is still in the works, but it will essentially
be a function that passes over tokens and returns a
modified set.

What those modifications actually entail is up to the preprocessor,
but some examples are (a sketch follows the list):

  • source directives
  • .babelrc
  • special language features the parser doesn't support directly,
    which can be broken down into lower-level constructs
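
As a sketch of the idea, a preprocessor pass could merge the parenthesis and operator tokens from the earlier example into a single arrow-function token. This reuses the hypothetical Token shape from the tokenizer sketch above and is not the repository's actual preprocessor:

```ts
// Sketch of a preprocessor pass: collapse "(", ")", "=", ">" into one synthetic token.
function mergeArrowFunctions(tokens: Token[]): Token[] {
  const out: Token[] = [];
  for (let i = 0; i < tokens.length; i++) {
    const [a, b, c, d] = [tokens[i], tokens[i + 1], tokens[i + 2], tokens[i + 3]];
    if (a.data === "(" && b?.data === ")" && c?.data === "=" && d?.data === ">") {
      // Replace the four tokens with a single arrow-function token.
      out.push({ type: "arrow-function", data: "()=>", line: a.line, char: a.char });
      i += 3; // skip the tokens we just consumed
    } else {
      out.push(a);
    }
  }
  return out;
}
```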

Parser

Creates a tree structure, called an Abstract Syntax Tree (AST),
from the token array.

This is where the recursive descent part comes into play, and the part I came here to learn about.
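
To show the technique in general terms (not this repository's actual parser), here is a minimal recursive descent sketch for arithmetic expressions. Each grammar rule becomes a function that may call the others recursively; the AstNode shape is invented for this example and the tokens are assumed to follow the Token sketch above:

```ts
// Grammar sketched here:
//   expr   -> term  (("+" | "-") term)*
//   term   -> factor (("*" | "/") factor)*
//   factor -> NUMBER | "(" expr ")"
type AstNode =
  | { kind: "number"; value: number }
  | { kind: "binary"; op: string; left: AstNode; right: AstNode };

function parseExpression(tokens: Token[]): AstNode {
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];

  function expr(): AstNode {
    let node = term();
    while (peek()?.data === "+" || peek()?.data === "-") {
      node = { kind: "binary", op: next().data, left: node, right: term() };
    }
    return node;
  }

  function term(): AstNode {
    let node = factor();
    while (peek()?.data === "*" || peek()?.data === "/") {
      node = { kind: "binary", op: next().data, left: node, right: factor() };
    }
    return node;
  }

  function factor(): AstNode {
    const tok = next();
    if (tok.data === "(") {
      const inner = expr(); // recursion: parse the sub-expression
      next();               // consume the closing ")"
      return inner;
    }
    return { kind: "number", value: Number(tok.data) };
  }

  return expr();
}
```

Each rule function only looks at the current token before deciding what to do next, which is what makes this a recursive descent parser.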

Interpreter / Codegen

I plan on implementing both an interpreter and code generator.

They will take an abstract syntax tree and either

  • run it directly (interpreter), or
  • compile it into some lower-level code, typically op codes or machine code (codegen)

A small evaluator sketch follows.
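
As a sketch of the interpreter side, a tree-walking evaluator can recursively evaluate the hypothetical AstNode shape from the parser example above:

```ts
// Tree-walking interpreter sketch for the AstNode shape used in the parser example.
function evaluate(node: AstNode): number {
  if (node.kind === "number") return node.value;
  const left = evaluate(node.left);
  const right = evaluate(node.right);
  switch (node.op) {
    case "+": return left + right;
    case "-": return left - right;
    case "*": return left * right;
    case "/": return left / right;
    default: throw new Error(`unknown operator: ${node.op}`);
  }
}
```

A code generator would walk the same tree but emit instructions (for example, one stack-machine op per node) instead of computing values directly.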

Implementation

In my process I've decided to take a language-agnostic
approach, even though my end goal is probably
something like typescript/javascript

For instance, the tokenize step relies on a Scanner,
which is where language syntax is actually handled,
and the tokenize function is already implemented for you.

To handle your own language, you'll need to implement
a scanner subclass.

Scanner

This is a class meant to be extended.
It provides functionality for scanning text
in a more standard way, which should make debugging easier.

  • addPass - for adding more syntax handling
    `addPass(name: string, pass: ScannerPass): this`
    where name becomes the token.type when the pass is successful,
    and pass is a ScannerPass (described below)

ScannerPass

Each scanner pass is meant to handle a single type of
language syntax.

`(data: string, offset: number): ScannerData`

where data is the source code,
offset is the position in the source to read from,
and the return value is expected to be a ScannerData

ScannerData

{
  success: boolean   //must be false when no data satisfying this pass's type is found at offset
  readChars: number  //count of chars that fit this type before we read something we didn't like
  readLines: number  //obsolete, this will be handled by internal code soon
  error?: string     //optional - for when an error is positively identified, not necessarily every time success == false
}

Note that scanner data does not actually return the text that was read, only the char count.
This is to standardize the reading process, which should cause far fewer errors
between implementations of languages.

Basically: don't read chars that don't fit your specification, and don't count any that don't fit.
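
As an illustration of how a pass might be written (assuming the ScannerPass and ScannerData shapes described above; the identifier rule and the typings are made up for this example):

```ts
// Hypothetical typings matching the ScannerPass / ScannerData shapes described above.
interface ScannerData {
  success: boolean;
  readChars: number;
  readLines: number;
  error?: string;
}
type ScannerPass = (data: string, offset: number) => ScannerData;

// A pass that reads an identifier made of letters and underscores.
// It counts matching chars starting at offset and stops at the first char that doesn't fit.
const identifierPass: ScannerPass = (data, offset) => {
  let readChars = 0;
  while (offset + readChars < data.length && /[A-Za-z_]/.test(data[offset + readChars])) {
    readChars++;
  }
  // success must be false when nothing at offset satisfied this pass.
  return { success: readChars > 0, readChars, readLines: 0 };
};

// Registration would then follow the addPass signature above, e.g. inside a Scanner subclass:
//   this.addPass("identifier", identifierPass);
```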
