# 605.629: Programming Languages
## Assignment 3

Sabbir Ahmed

1. \[100 pts, top down parser\]

Utilizing cells from the Jupyter notebook provided in lecture modules, write a top-down parser that can parse the following:

```
alist = [1, 3, 4]
sum = 0
for a in alist:
    sum = sum + a
```

Your solution might include only the necessary cells and the output should be at least similar to the following:

```
(= (name alist) ([ [(literal 1), (literal 3), (literal 4)]))
(= (name sum) (literal 0))
(for [(name a)] (name alist))
(= (name sum) (+ (name sum) (name a)))
```

__Answer:__

The following snippet has been extracted from various cells in the lecture modules.

In [1]:
import io
import tokenize as tok
import re


# Update the regular expression pattern to find two character matches with '**'
token_pattern = re.compile(r"\s*(?:(\d+)|(\*\*|.))")


class literal_token:
    def __init__(self, value):
        self.value = value

    def nud(self):
        return self

    def __repr__(self):
        return "(literal %s)" % self.value


class end_token:
    lbp = 0


def tokenize_python(_input_program):
    type_map = {
        tok.NUMBER: "(literal)",
        tok.STRING: "(literal)",
        tok.OP: "(operator)",
        tok.NAME: "(name)",
    }
    for t in tok.generate_tokens(io.StringIO(_input_program).readline):
        try:
            yield type_map[t[0]], t[1]
        except KeyError:  # Handling other token values
            if t[0] == tok.ENDMARKER or t[0] == tok.NEWLINE:
                break
            else:
                raise SyntaxError("Syntax error")
    yield "(end)", "(end)"


def tokenize(_input_program):
    for id, value in tokenize_python(_input_program):
        if id == "(literal)":
            symbol = symbol_table[id]
            s = symbol()
            s.value = value
        else:
            # name or operator
            symbol = symbol_table.get(value)
            if symbol:
                s = symbol()
            elif id == "(name)":
                symbol = symbol_table[id]
                s = symbol()
                s.value = value
            else:
                raise SyntaxError("Unknown operator (%r)" % value)
        yield s


# advance helper function: if an id is given, checks that the current token has that id, then moves on to the next token
def advance(id=None):
    global token
    if id and token.id != id:
        raise SyntaxError("Expected %r" % id)
    token = next()


# helper method to be used as a decorator
def method(s):
    assert issubclass(s, symbol_base)

    def bind(fn):
        setattr(s, fn.__name__, fn)

    return bind


def expression(rbp=0):  # default right binding power set to 0
    global token, next  # the global name next shadows the built-in next()
    t = token
    token = next()  # get the next token from the tokenizer iterator
    left = t.nud()  # recursion
    while rbp < token.lbp:
        t = token
        token = next()
        left = t.led(left)  # recursion
    return left


def parse(_program):
    global token, next  # the built-in next() is shadowed
    next = tokenize(
        _program
    ).__next__  # tokenizer is an iterator, save its next() function as global
    token = next()  # get the next token
    return expression()  # parse the expression


class symbol_base(object):  # new-style class, derived from Python object class
    id = None  # node/token type name
    value = None  # used by literals
    first = second = third = None  # used by tree nodes

    def nud(self):
        raise SyntaxError("Syntax error (%r)." % self.id)

    def led(self, left):
        raise SyntaxError("Unknown operator (%r)." % self.id)

    def __repr__(self):
        if self.id == "(name)" or self.id == "(literal)":
            return "(%s %s)" % (self.id[1:-1], self.value)
        # positions of the operands and operator, left to right
        out = [self.id, self.first, self.second, self.third]
        # filter(None, ...) drops the unused (falsy) slots; str is then applied to each remaining item
        out = map(str, filter(None, out))
        return "(" + " ".join(out) + ")"  # join concatenates the list with the ' '


symbol_table = {}


def symbol(id, bp=0):
    try:
        s = symbol_table[id]
    except KeyError:

        class s(symbol_base):
            pass

        s.__name__ = "symbol-" + id  # for debugging purposes
        s.id = id
        s.lbp = bp
        symbol_table[id] = s
    else:
        s.lbp = max(bp, s.lbp)
    return s


def infix(id, bp):
    def led(self, left):
        self.first = left
        self.second = expression(bp)  # recursion
        return self

    symbol(id, bp).led = led

The following cell defines the operator symbols and tokens required for the assignment.

In [2]:
symbol('(literal)').nud = lambda self: self
symbol('(name)').nud = lambda self: self
symbol('(end)');

infix('+', 110)  # Python addition operator
infix('=', 60)  # Python assignment operator
infix('in', 60)  # Python in operator
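
The binding powers chosen above decide how operators nest: `+` (110) binds tighter than `=` (60), so `sum = sum + a` groups the addition first. The following is a minimal stand-alone sketch of that loop, not the lecture code itself; the `LBP` table, `tokens`, and `parse_expr` are hypothetical names introduced for illustration, and operands are plain strings rather than symbol objects.

```python
import re

# Hypothetical stand-alone table mirroring the binding powers used above.
LBP = {"=": 60, "+": 110, "(end)": 0}


def tokens(text):
    # Split the input into words/integers and single-character operators.
    for item in re.findall(r"\w+|\S", text):
        yield item
    yield "(end)"


def parse_expr(text):
    stream = tokens(text)
    current = next(stream)

    def expression(rbp=0):
        nonlocal current
        left = current  # an operand acts as its own nud
        current = next(stream)
        # Keep absorbing operators that bind tighter than the caller's rbp.
        while rbp < LBP[current]:
            op = current
            current = next(stream)
            left = "(%s %s %s)" % (op, left, expression(LBP[op]))
        return left

    return expression()


result = parse_expr("sum = sum + a")
```

Because the right operand of `=` is parsed with `rbp=60`, the inner call keeps going when it meets `+` (110 > 60), which is why the addition ends up nested inside the assignment.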

The following cell implements the symbols required to parse a `list` in Python.

In [3]:
symbol(',')  # Python container delimiter
symbol(']')  # right bracket to enclose Python lists

# left bracket to enclose Python lists
@method(symbol('['))
def nud(self):
    self.first = []
    if token.id != ']':
        while 1:
            if token.id == ']':
                break
            self.first.append(expression())
            if token.id != ',':
                break
            advance(',')
    advance(']')
    return self

The following cell implements the symbols required to parse a `for`-loop statement in Python.

In [4]:
symbol(':')

@method(symbol('for'))
def nud(self):
    self.first = []
    if token.id != 'in':
        argument_list(self.first)
    advance('in')
    self.second = expression()
    advance(':')
    return self

def argument_list(list):
    while 1:
        if token.id != '(name)':
            raise SyntaxError('Expected an argument name.')
        list.append(token)
        advance()
        if token.id != ',':  # loop targets are separated by commas
            break
        advance(',')

In [5]:
display(parse("alist = [1, 3, 4]"))
display(parse("sum = 0"))
display(parse("for a in alist:"))
display(parse("sum = sum + a"))

(= (name alist) ([ [(literal 1), (literal 3), (literal 4)]))

(= (name sum) (literal 0))

(for [(name a)] (name alist))

(= (name sum) (+ (name sum) (name a)))
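
Each `parse` call above receives a single line because `tokenize_python` stops at the first `NEWLINE` token, and the indentation of the loop body was removed by hand. A minimal sketch of how the original program could be split into such lines is shown below; `split_statements` is a hypothetical helper, and it assumes one statement per line with no multi-line constructs.

```python
def split_statements(program):
    """Return the non-blank lines of `program` with indentation stripped."""
    return [line.strip() for line in program.splitlines() if line.strip()]


program = """alist = [1, 3, 4]
sum = 0
for a in alist:
    sum = sum + a
"""
statements = split_statements(program)
```

Each entry of `statements` could then be passed to `parse` in turn. Note that stripping the indentation discards the block structure, so the parser cannot tell that the last line belongs to the loop body.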

-----------

2. \[50 pts bonus\]

Comment on how you would modify your parser program to make "+=" work, and comment on adding functionality which checks the Python indentation syntax.

__Answer:__

To make the `+=` (addition assignment) operator work, we first have to add its symbol to the parser. This can be done the same way the other tokens were added, with the following cell:

In [6]:
infix('+=', 60)  # Python addition assignment operator

In [7]:
display(parse("sum += a"))

(+= (name sum) (name a))

Adding the operator to the symbol table only lets the parser accept the token without raising a syntax error; it produces a `+=` node but gives it no meaning. For the statement to evaluate, `sum += a` has to be treated as `sum = sum + a`: the addition expression is evaluated first with its 2 operands (`sum` and `a`), and its result is then passed to the assignment expression, whose 2 operands are the target `sum` and the value `sum + a`.
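
The rewrite of `+=` into an assignment and an addition can be sketched on a simplified tree. This is only an illustration under stated assumptions: nodes are plain tuples of the form `(operator, left, right)` rather than the parser's symbol objects, and `desugar` is a hypothetical helper name.

```python
def desugar(node):
    """Rewrite ('+=', target, value) into ('=', target, ('+', target, value))."""
    if not isinstance(node, tuple):
        return node  # names and literals are left unchanged
    op, *operands = node
    operands = [desugar(child) for child in operands]
    if op == "+=":
        target, value = operands
        return ("=", target, ("+", target, value))
    return (op, *operands)


tree = desugar(("+=", "sum", "a"))
```

In the parser itself, a similar rewrite could presumably be done in a custom `led` for `+=` that builds the `+` node and then the `=` node, instead of reusing the generic `infix` helper.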

Adding an indentation check to the parser would require additional constraints on the token patterns, and the parser would first have to support scoping (blocks). An indented line is a syntax error unless it belongs to a block opened by a preceding, less-indented line. Indentation may only increase after a statement that opens a block: `for`/`while` loops, `if`/`else` conditionals, function/class definitions, context managers, exception handling statements, etc.
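
The rule that indentation may only increase after a block-opening statement can be sketched with a stack of indentation widths. This is a simplified illustration, not a full implementation: `check_indentation` is a hypothetical helper, it assumes spaces only, and it treats any line ending in `:` as opening a block while ignoring comments, strings, and line continuations.

```python
def check_indentation(program):
    """Raise IndentationError if a line is indented where no block was opened."""
    levels = [0]          # stack of currently open indentation widths
    expect_block = False  # True right after a line ending with ':'
    for number, line in enumerate(program.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines do not affect indentation
        width = len(line) - len(line.lstrip(" "))
        if expect_block:
            if width <= levels[-1]:
                raise IndentationError("line %d: expected an indented block" % number)
            levels.append(width)
        elif width > levels[-1]:
            raise IndentationError("line %d: unexpected indent" % number)
        else:
            # Close blocks until the width matches an enclosing level.
            while levels[-1] > width:
                levels.pop()
            if levels[-1] != width:
                raise IndentationError("line %d: inconsistent dedent" % number)
        expect_block = line.rstrip().endswith(":")
    if expect_block:
        raise IndentationError("expected an indented block at end of input")
```

Calling `check_indentation` on the assignment's four-line program returns without raising, while an indented `sum = sum + a` that does not follow a `for` line raises `IndentationError`.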

The modified pattern would have to add positive lookbehind assertions to the regular expression so that the leading whitespace of each line is captured, and accept indentation widths of 2 or 4 spaces. The parser would also have to ensure that the number of spaces per indentation level is consistent throughout the script.
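
As an alternative to extending the regular expression, the standard `tokenize` module that `tokenize_python` already relies on reports indentation changes as `INDENT` and `DEDENT` tokens. Those token types are not in `type_map`, so at present they fall through to the `raise SyntaxError` branch. The sketch below only lists the tokens; `indentation_events` is a hypothetical helper, and wiring the tokens into the symbol table is left open.

```python
import io
import tokenize as tok


def indentation_events(program):
    """Return the INDENT/DEDENT token names that `tokenize` reports, in order."""
    events = []
    for t in tok.generate_tokens(io.StringIO(program).readline):
        if t.type in (tok.INDENT, tok.DEDENT):
            events.append(tok.tok_name[t.type])
    return events


events = indentation_events("for a in alist:\n    sum = sum + a\n")
```

A parser could treat `INDENT` as the start of a block and `DEDENT` as its end, which would also require `tokenize_python` to stop breaking out of its loop at the first `NEWLINE` token.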

-----------