#### Writer Note:
This notebook contains both Scheme code and Python code, so 2 different kernels are needed to read and execute the cells correctly. Since the beginning of the lecture uses Scheme code, start with the Calysto Scheme kernel. Later in the notes, you will be prompted to switch to the Python kernel.

# Parsing

Parsing is the process of taking text input (which represents a computer program or some other formal language expression) and turning it into an object that represents the expression, while validating its syntax.

A parser takes text and returns an expression. This happens through an intermediate representation called **tokens**.

<img src = 'token.jpg' width = 600/>

**Lexical analysis** is the process of breaking up text into tokens (e.g. words or individual symbols). Then we do **syntactic analysis** to figure out how these tokens nest into a hierarchical expression.

<img src = 'syntactic.jpg' width = 600/>

#### Lexical Analysis

Let's say we typed the following 3 lines:

In [1]:
( + 1
  (- 23)
  (* 4 5.6))

0.3999999999999986

Lexical analysis breaks up each line into the right pieces to feed into syntactic analysis.

<img src = '1.jpg' width = 700/>

Each piece represents a complete number, a special symbol such as a parenthesis `(`, or a symbol in the language such as `+`.

As we can see, the line,

In [None]:
(+ 1

is broken up into 3 tokens: `(`, `+`, and `1`.

Meanwhile the line below,

In [None]:
  (- 23)

is converted into 4 tokens: `(`, `-`, `23`, and `)`.

Notice that the number `23` is treated as one token rather than split into `2` and `3`. Lexical analysis figured out that `2` and `3` next to each other mean the number `23`.

Also notice that there is whitespace before the open parenthesis `(`. Part of lexical analysis is to figure out what to discard. In this case, the whitespace is ignored.

For the last line,

In [None]:
   (* 4 5.6))

The line is broken down into 6 tokens: `(`, `*`, `4`, `5.6`, `)`, `)`.

Notice that lexical analysis is able to tell that `5.6` is a single token.

<img src = 'lexical.jpg' width = 500/>
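The line-by-line tokenization above can be sketched in a few lines of Python (so it needs the Python kernel to run). This is only a minimal illustration, not the tokenizer used in the course project: `TOKEN_PATTERN`, `to_value`, and `tokenize_line` are hypothetical names, and the sketch ignores comments, strings, and the `.` delimiter.

```python
import re

# Assumed token shape: a single delimiter, or a run of characters
# that are neither whitespace nor delimiters.
TOKEN_PATTERN = re.compile(r"[()']|[^\s()']+")

def to_value(text):
    """Convert numeric text to an int or float; leave anything else as a symbol string."""
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    return text

def tokenize_line(line):
    """Split one line of Scheme source into tokens, discarding whitespace."""
    return [to_value(text) for text in TOKEN_PATTERN.findall(line)]

for line in ["( + 1", "  (- 23)", "  (* 4 5.6))"]:
    print(tokenize_line(line))
# ['(', '+', 1]
# ['(', '-', 23, ')']
# ['(', '*', 4, 5.6, ')', ')']
```

Whitespace never matches the pattern, so it is dropped automatically, and `23` and `5.6` each come out as one token because the second alternative grabs the whole run of characters. A real tokenizer needs more care with numbers; for example, `float` also accepts text such as `nan`, which this sketch would wrongly turn into a number.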

#### Syntactic Analysis

Syntactic analysis processes all the tokens to give us expressions in the language that we're trying to parse. 

With the whitespace already discarded, syntactic analysis figures out the structure of the expression, matches up the parentheses, and creates the nested tree structure, represented as a `Pair` structure.

<img src = 'syntactic1.jpg' width = 500/>

<img src = 'whole.jpg' width = 700/>

## Recursive Syntactic Analysis

Recursive syntactic analysis is a standard problem in computer science that comes up all the time.

We're going to build a particular type of parser for the Scheme expressions that we want to parse, called a **predictive recursive descent parser**. This parser inspects only `k` tokens to decide how to proceed (e.g. what sort of structure is going to be built). It does this for some fixed `k` (meaning: we don't need to look too far ahead to understand what's going on in the program).

Let's try a predictive recursive descent parser on the English language. Can English be parsed via predictive recursive descent? To answer this question, let's analyze the following sentence:

$$ \text{The horse raced past the barn fell}$$

This is a well-formed sentence in English, but there are a few things that are unusual:

1. Think of the word **raced** as a synonym for **ridden**

<img src = 'ridden.jpg' width = 400/>

2. The words **that was** have been omitted from the sentence.

<img src = 'that.jpg' width = 400/>

Now the phrase **The horse that was ridden past the barn** is the subject of the sentence. It was the subject in the original version of the sentence too, but this version is easier to read.

<img src = 'sentence.jpg' width = 400/>

The reason this is a hard sentence to read is that when we read **The horse raced past the barn**, we assumed a structural analysis:

1. **raced** is the verb
2. **The horse** is the subject
3. **Past the barn** is a modifier of the verb telling us where the horse raced.
    * However, it was not the horse that was racing
    * Instead, someone was racing the horse
    
We had to look at the last word, **fell**, to resolve the structural ambiguity.

Thus, English cannot be parsed by a predictive recursive descent parser. Fortunately, Scheme expressions can.

## Syntactic Analysis

Syntactic analysis of Scheme expressions, and of programming languages in general, can use recursive descent parsers. It identifies the hierarchical structure of an expression, which may be nested.

Each call to `scheme_read` consumes the input tokens for exactly one expression.

1. Base case: the token is a symbol or a number.
2. Recursive call: `scheme_read` the sub-expressions and combine them.
    * Whenever we see an open parenthesis `(`, we know that it starts a combination with expressions within it
    * Each of those expressions must itself be read by `scheme_read`.

Thus, if we have a nested expression such as the following,

In [None]:
'(', '+', 1, '(', '-', 23, ')', '(', '*', 4, 5.6, ')', ')'

In [None]:
'('

On the first call of `scheme_read`, it reads the `(`. It notices that we've started a combination, and therefore we'll have a sequence of sub-expressions until we close that parenthesis.

In [None]:
'+'
1

The next 2 tokens are base cases: `+` and `1`.

In [None]:
'(', '-', 23, ')' ; A sub-expression

The next call to `scheme_read` is going to do a bunch of work. It's going to read the whole sub-expression.

In [None]:
'(', '*', 4, 5.6, ')' ; Another sub-expression

And the next call will read the sub-expression as well.

In [None]:
')'

Finally, it'll find the end of the expression that it started in the beginning. This is how `scheme_read` works. 

Let's go back to `scheme_read` and analyze it!

#### Writer Note: Switch to Python kernel

From this point on, switch to the Python kernel so that the following code is read correctly.

#### `scheme_read`

In [None]:
def scheme_read(src):

It takes in `src`, which is a buffer of source tokens.

In [None]:
if src.current() is None:
    raise EOFError

Above, if we have run out of tokens, we raise an **End of File Error** (`EOFError`).

In [None]:
val = src.pop()

^ otherwise, get the first token
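The snippets above rely on only two operations of `src`: `current()` peeks at the next token and `pop()` consumes it. A minimal sketch of an object with that interface could look like the following. `TokenBuffer` is a hypothetical stand-in written for these notes, not the buffer class used in the project, and it assumes all tokens are already in one list.

```python
class TokenBuffer:
    """Hypothetical stand-in for src: a cursor over a list of tokens."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.index = 0  # position of the next unread token

    def current(self):
        """Return the next token without consuming it, or None if none remain."""
        if self.index < len(self.tokens):
            return self.tokens[self.index]
        return None

    def pop(self):
        """Consume and return the next token, or None if none remain."""
        token = self.current()
        if token is not None:
            self.index += 1
        return token

src = TokenBuffer(['(', '-', 23, ')'])
print(src.current())  # (
print(src.pop())      # (
print(src.current())  # -
```

Peeking with `current()` before committing with `pop()` is what makes the parser predictive: it looks at one token to decide what to do next.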

In [None]:
if val == 'nil':
    return nil

^ if it's a `nil`, that's a base case

In [None]:
elif val not in DELIMITERS:  # ( ) ' .
    return val

If `val` is not in `DELIMITERS`, meaning it's none of the following:
* `(`
* `)`
* `'`
* `.`

then just return that value. This is when we get numbers and symbols as a base case.

In [None]:
elif val == "(":
    return read_tail(src)

^ Otherwise, if we just opened a combination with an open parenthesis `(`, then we `return read_tail(src)`, which means we "return the remainder of a list in src, starting before an element or closing parenthesis `)`."

#### `read_tail`

In [None]:
if src.current() is None:
    raise SyntaxError("unexpected end of file")

^ if we run out of text before coming across any `)`, raise a `SyntaxError`. 

In [None]:
if src.current() == ")":
    src.pop()
    return nil

^ if we come across a `)`, then this is a base case, and we're done.

In [None]:
first = scheme_read(src)
rest = read_tail(src)
return Pair(first, rest)

^ otherwise, we find the `first` and `rest`, and return a `Pair` containing the `first` and the `rest`.
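Putting the fragments of `scheme_read` and `read_tail` together gives something like the sketch below. The two functions follow the snippets above; everything around them is an assumption made so the cell runs on its own: `nil`, `Pair`, and `TokenBuffer` are minimal stand-ins, `DELIMITERS` is written as a set, and the final `else` branch (raising on `)`, `'`, or `.`) is a placeholder for the cases the notes leave to the project.

```python
DELIMITERS = {'(', ')', "'", '.'}

class NilType:
    """Stand-in for the empty list."""
    def __repr__(self):
        return 'nil'

nil = NilType()

class Pair:
    """Stand-in for a Scheme pair: a first element and the rest of the list."""
    def __init__(self, first, rest):
        self.first, self.rest = first, rest
    def __repr__(self):
        return 'Pair({!r}, {!r})'.format(self.first, self.rest)

class TokenBuffer:
    """Hypothetical stand-in for src: a cursor over a list of tokens."""
    def __init__(self, tokens):
        self.tokens, self.index = list(tokens), 0
    def current(self):
        return self.tokens[self.index] if self.index < len(self.tokens) else None
    def pop(self):
        token = self.current()
        if token is not None:
            self.index += 1
        return token

def scheme_read(src):
    """Read exactly one expression from src."""
    if src.current() is None:
        raise EOFError
    val = src.pop()
    if val == 'nil':
        return nil
    elif val not in DELIMITERS:   # base case: a number or a symbol
        return val
    elif val == '(':              # start of a combination
        return read_tail(src)
    else:                         # assumption: ) ' . are not handled in this sketch
        raise SyntaxError('unexpected token: {!r}'.format(val))

def read_tail(src):
    """Read the remainder of a list, starting before an element or )."""
    if src.current() is None:
        raise SyntaxError('unexpected end of file')
    if src.current() == ')':
        src.pop()
        return nil
    first = scheme_read(src)
    rest = read_tail(src)
    return Pair(first, rest)

src = TokenBuffer(['(', '+', 1, '(', '-', 23, ')', '(', '*', 4, 5.6, ')', ')'])
expr = scheme_read(src)
print(expr)
print(src.current())  # None: one call consumed the whole expression
```

Note that `scheme_read` and `read_tail` call each other: `scheme_read` hands off to `read_tail` on `(`, and `read_tail` calls `scheme_read` for each element.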

The `first` is a recursive call to `scheme_read`. This means if we have a nested expression like the following,

In [None]:
> ((1 2) 3)
((1 2) 3)
Pair(Pair(1, Pair(2, nil)), Pair(3, nil))

If we find the first `(`

In [None]:
`(`

Then we'll execute the following in `scheme_read`:

In [None]:
elif val == "(":
    return read_tail(src)

That calls `read_tail`, whose recursive case begins with

In [None]:
first = scheme_read(src)

Then it executes the line above, which calls `scheme_read` to read the expression after the first `(`:

In [None]:
(1 2)

...which is read as the following `Pair` and used as the `first` element:

In [None]:
Pair(1, Pair(2, nil))

Then the recursive call to `read_tail` reads the remaining tokens, `3` and `)`, and returns

In [None]:
Pair(3, nil)

...which gives us the `rest` of the list.
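To watch this walkthrough happen, the sketch below adds logging to the same two functions and runs them on `((1 2) 3)`. The `depth` parameter, the `calls` list, and the stand-in classes `NilType`, `Pair`, and `TokenBuffer` are hypothetical additions for illustration only; the parsing logic is the one shown above, with an assumed `else` branch for the delimiters the notes do not cover.

```python
DELIMITERS = {'(', ')', "'", '.'}

class NilType:
    def __repr__(self):
        return 'nil'

nil = NilType()

class Pair:
    def __init__(self, first, rest):
        self.first, self.rest = first, rest
    def __repr__(self):
        return 'Pair({!r}, {!r})'.format(self.first, self.rest)

class TokenBuffer:
    """Hypothetical stand-in for src: a cursor over a list of tokens."""
    def __init__(self, tokens):
        self.tokens, self.index = list(tokens), 0
    def current(self):
        return self.tokens[self.index] if self.index < len(self.tokens) else None
    def pop(self):
        token = self.current()
        if token is not None:
            self.index += 1
        return token

calls = []  # one (depth, tokens consumed, result) entry per scheme_read call

def scheme_read(src, depth=0):
    """scheme_read as above, plus a record of which tokens this call consumed."""
    if src.current() is None:
        raise EOFError
    start = src.index
    val = src.pop()
    if val == 'nil':
        result = nil
    elif val not in DELIMITERS:
        result = val
    elif val == '(':
        result = read_tail(src, depth + 1)
    else:  # assumption: ) ' . are not handled in this sketch
        raise SyntaxError('unexpected token: {!r}'.format(val))
    consumed = src.tokens[start:src.index]
    calls.append((depth, consumed, result))
    print('  ' * depth + 'scheme_read consumed {} -> {!r}'.format(consumed, result))
    return result

def read_tail(src, depth):
    """read_tail as above; depth is only passed along for the log."""
    if src.current() is None:
        raise SyntaxError('unexpected end of file')
    if src.current() == ')':
        src.pop()
        return nil
    first = scheme_read(src, depth)
    rest = read_tail(src, depth)
    return Pair(first, rest)

expr = scheme_read(TokenBuffer(['(', '(', 1, 2, ')', 3, ')']))
```

Each log line is printed when its call returns, so the innermost calls (for `1` and `2`) appear first, then the call that consumed `( 1 2 )`, then the call for `3`, and finally the outermost call, which consumed all seven tokens and returned `Pair(Pair(1, Pair(2, nil)), Pair(3, nil))`.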

We'll extend this small program in our project to handle `'` and `.`