# Let the Syntax Guide You

<!--
\index{syntax directed language processing}
-->

Now that we understand what parsing entails, we can build our first interpreters.
For a certain class of languages we can do our processing as soon as we recognize syntactic structures; that is,
we can do our processing right in the embedded actions of the grammar.
This is called *syntax directed language processing* and is best illustrated with an example.

In [2]:
import sys
sys.path.insert(0,"code")

## An Interpreter for Exp1 using a Recursive Descent Parser

<!--
\index{interpreter}
\index{interpretation}
\index{interpretation!algebraic terms}
\index{interpretation!assignment statement}
\index{interpretation!syntax-directed}
\index{syntax-directed interpretation}
-->

According to our classification of language processors in Chapter 1 an interpreter reads a program and executes the program
directly (see Chapter 1 Figure 6).
We accomplish this by interpreting the syntactic structures as soon as we parse them.
This is called *syntax-directed interpretation* where we execute the semantic rules of the language as soon as we recognize
the corresponding syntactic structures.

What exactly do we mean by interpretation?
In order to get a better idea of what interpretation is we turn to a language that you are very familiar with: algebra.
Consider the algebraic expression,
```
x = 3
```
We interpret this expression by first interpreting the symbol `3` as the mathematical value three, we then interpret the symbol
`x` as a variable, and because the variable appears to the left of the symbol `=`  we assign the value three to the variable `x`.
Now consider the term,
```
y = 2 + x
```
In order to interpret this term we first figure out what value is assigned to the variable `x`, we then interpret the symbol `2` as the mathematical
value two, and finally we compute the value of the right-hand side by interpreting the `+` symbol as addition, which yields
the value five (assuming that `x` has the value three from the previous example).  To complete the interpretation of this algebraic term we again interpret
the `=` as assignment, here of the value five to the variable `y`.

<!--
\index{interpretation!syntax directed}
\index{syntax-directed interpretation}
-->

One thing you probably noticed at this point is that the interpretation of algebraic terms is *bottom-up*, that is, it starts with the operands
that are immediately computable, such as constant symbols or variables,  and works its way up to the top-level operator which in this case is the assignment operator.

>This approach to interpretation is called syntax-directed interpretation because the interpretation is guided by the syntactic structure of the terms.

Now recall the syntax of our Exp1 language:
```
prog : stmt_list 

stmt_list : stmt_list stmt
          | stmt

stmt : PRINT exp ';'
     | STORE var exp ';'

exp : '+' exp exp
    | '-' exp exp
    | '(' exp ')'
    | var
    | num
	
var : NAME 
num : NUMBER
```
It is the language of prefix expressions and has two statements: one to print the value of an expression to the terminal and the other to store the value of an expression in a variable.
In order to see what syntax-directed interpretation looks like for our Exp1 language let us start with the parse tree for the program,
```
store y + 2 x ;
```
Figure 1 shows the parse tree for this program.  It is clear from the structure of the tree that in order to compute a value to store into variable `y` we would have to interpret the tree starting at the right side leaves and then keep interpreting the operators and computing the values along the tree branches in the direction of the red arrows.  One way to visualize syntax directed interpretation is that values percolate from the tree leaves up to the root. In our case, once interpretation reaches the root of the parse tree the value computed thus far is stored in the variable `y`.

***
<center>
<img src="figures/chap03/1/figure/Slide1.jpg" alt="">
Fig 1. Interpreting the parse tree for the program `store y + 2 x ;`.
</center>

***
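The percolation of values shown in Figure 1 can be sketched directly in Python. The example below is only an illustration: it hand-codes a simplified version of the tree as nested tuples and evaluates it from the leaves up. The tuple layout and the names `tree`, `evaluate`, and `run` are assumptions made here and are not part of the Exp1 code.

```python
# Simplified tree for `store y + 2 x ;` written as nested tuples.
# The layout (kind, children...) is an assumption made for this sketch.
tree = ('store', 'y', ('+', ('num', 2), ('var', 'x')))

def evaluate(node, table):
    "Compute the value of an expression node from the leaves up."
    kind = node[0]
    if kind == 'num':
        return node[1]
    if kind == 'var':
        return table.get(node[1], 0)   # unbound variables default to 0
    if kind == '+':
        return evaluate(node[1], table) + evaluate(node[2], table)
    if kind == '-':
        return evaluate(node[1], table) - evaluate(node[2], table)
    raise ValueError('unknown node {}'.format(kind))

def run(stmt, table):
    "Interpret a store statement: the value reaching the root is stored."
    _, name, exp_node = stmt
    table[name] = evaluate(exp_node, table)

table = {'x': 3}
run(tree, table)
print(table)
```

With `x` preloaded as three, the value five reaches the root and is stored under `y`.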

Now it turns out that we can achieve the same interpretation behavior that we showed above in a parser without having to construct an explicit parse
tree.
Consider the non-terminal `exp` defined in
the Exp1 grammar as,
```
exp : '+' exp exp
    | '-' exp exp
    | '(' exp ')'
    | var
    | num

```
Consider a hand-built recursive descent parser for this non-terminal.
In order to enable syntax directed interpretation all we have to do is 
allow return values from the parsing functions.

Consider,

In [3]:
def exp():
    tok = token_stream.pointer()
    
    if tok.type == '+':
        token_stream.next()
        return exp() + exp()
    
    elif tok.type == '-':
        token_stream.next()
        return exp() - exp()
    
    elif tok.type == '(':
        token_stream.next() # match '('
        val = exp()
        token_stream.next() # match ')'
        return val
    
    elif tok.type == 'NAME':
        return var()
    
    elif tok.type == 'NUMBER':
        return num()
    
    else:
        raise SyntaxError('unexpected symbol {} while parsing'.format(tok.value))

Giving the parsing function `exp` a return value means that parsing an
expression actually computes an integer value.
If we look back at the interpretation of the parse tree in Figure 1 then we see that this
is exactly what is happening: any time we see the non-terminal `exp` 
in the tree we can observe that either an integer value is being computed or propagated.

<!--
\index{rvalue}
-->

Now, if we take a closer look at the parsing function itself we see that the exact same behavior  that we observed on the parse tree is encoded here. For the tokens `+` and `-` we see that the function `exp()` calls itself recursively and then given the returned values performs the appropriate arithmetic operation in order to compute its own return value, that is, at this point we take the two values that propagated up from the subexpressions, add or subtract them as appropriate, and return the newly computed value. Something very similar happens with the token `(`; here we simply return the value of the parenthesized expression.

When we encounter variables in the expression we use the name of the variable to look up its associated value in our symbol table and then return that value.
With constants we simply return the integer value of that constant token.

The function `exp()` is a recursive function that recurses down to the
termination cases `var()` and `num()` and then
backs out of the recursion while returning integer values.
In this way we see computed values percolating from the bottom up to the top where they can then be used.

Here are the parsing functions for variables and numbers.

In [4]:
def var():
    tok = token_stream.pointer()
    
    if tok.type == 'NAME':
        token_stream.next()
        return symbol_table.get(tok.value, 0) # return 0 if not found
    
    else:
        raise SyntaxError('unexpected symbol {} while parsing'.format(tok.value))

In [5]:
def num():
    tok = token_stream.pointer()
    
    if tok.type == 'NUMBER':
        token_stream.next()
        return tok.value
    
    else:
        raise SyntaxError('unexpected symbol {} while parsing'.format(tok.value))

Our symbol table is a dictionary that associates names with values,

In [6]:
symbol_table = dict()

In order to run those functions we need to set up our lexical analysis and token stream.  For our lexical analysis we use the lexer for Exp1 from Chapter 2, `exp1_lex.lexer`.  The class `TokenStream` converts a character stream into a token stream using the given lexical analyzer.

In [7]:
from exp1_lex import lexer
from grammar_stuff import TokenStream
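The modules `exp1_lex` and `grammar_stuff` are not listed in this section. To make the interface that the parsing functions rely on concrete, here is a minimal stand-in that offers the same three operations: `pointer()`, `next()`, and `end_of_file()`. It is a sketch under assumptions. The class name `MiniTokenStream`, the token patterns, and the `EOF` marker token are hypothetical, and the actual implementation may differ.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for the lexer/TokenStream pair used in this chapter.
Token = namedtuple('Token', ['type', 'value'])

TOKEN_SPEC = [
    ('NUMBER', r'\d+'),
    ('NAME',   r'[a-zA-Z_][a-zA-Z0-9_]*'),
    ('PUNCT',  r'[+\-();]'),
    ('SKIP',   r'\s+'),
]
MASTER = re.compile('|'.join('(?P<{}>{})'.format(n, p) for n, p in TOKEN_SPEC))
KEYWORDS = {'print': 'PRINT', 'store': 'STORE'}

class MiniTokenStream:
    def __init__(self, text):
        self.tokens = []
        pos = 0
        while pos < len(text):
            m = MASTER.match(text, pos)
            if m is None:
                raise SyntaxError('illegal character {!r}'.format(text[pos]))
            pos = m.end()
            kind, lexeme = m.lastgroup, m.group()
            if kind == 'SKIP':
                continue                                   # drop white space
            if kind == 'NUMBER':
                self.tokens.append(Token('NUMBER', int(lexeme)))
            elif kind == 'NAME':
                self.tokens.append(Token(KEYWORDS.get(lexeme, 'NAME'), lexeme))
            else:
                self.tokens.append(Token(lexeme, lexeme))  # type is the character
        self.tokens.append(Token('EOF', None))             # assumed end marker
        self.ix = 0

    def pointer(self):
        "Return the current token without consuming it."
        return self.tokens[self.ix]

    def next(self):
        "Advance to the next token (never past the end marker)."
        if not self.end_of_file():
            self.ix += 1

    def end_of_file(self):
        return self.pointer().type == 'EOF'

ts = MiniTokenStream("+ 1 x0")
while not ts.end_of_file():
    tok = ts.pointer()
    print("Token: {} {}".format(tok.type, tok.value))
    ts.next()
```

Note that this stand-in converts `NUMBER` lexemes to integers in the lexer, which is what allows a function like `num()` to return `tok.value` directly.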

In [8]:
input_stream = "+ 1 x0"

In [9]:
token_stream = TokenStream(lexer, input_stream)

In [10]:
while not token_stream.end_of_file():
    tok = token_stream.pointer()
    print("Token: {} {}".format(tok.type, tok.value))
    token_stream.next()

Token: + +
Token: NUMBER 1
Token: NAME x0


Our token stream works nicely. Now, let's put this to use for parsing and evaluating Exp1 expressions.

In [11]:
input_stream = "+ 1 2"

In [12]:
token_stream = TokenStream(lexer, input_stream)

In [13]:
print(exp())

3


Yes! Given an input stream `"+ 1 2"` our top-level `exp()` call returns the value `3`, as we would expect.

Let's try this on something a bit more complicated,

In [14]:
input_stream = "(- (+ 1 2) 1)"

In [15]:
token_stream = TokenStream(lexer, input_stream)

In [16]:
print(exp())

2


In order to get back to our example above, recall the grammar snippet for 
`stmt` in Exp1,
```
stmt : PRINT exp ';'
     | STORE var exp ';'
```
The corresponding parsing function looks like this,

In [17]:
def stmt():
    tok = token_stream.pointer()
    
    if tok.type == 'PRINT':
        token_stream.next() # match PRINT
        print("> {}".format(exp()))
        token_stream.next() # match ;
        return None
    
    elif tok.type == 'STORE':
        token_stream.next() # match STORE
        name = lvar()
        val = exp()
        symbol_table[name] = val
        token_stream.next() # match ;
        return None
    
    else:
        raise SyntaxError('unexpected symbol {} while parsing'.format(tok.value))

The first thing to notice is that in Exp1 statements themselves do not compute any values and
therefore the corresponding parsing function does not return any values.
Looking at the function itself we see that in the case of a `PRINT` statement we compute the value of the expression
while parsing it and then write that value to the output.
For the `STORE` statement we parse the lvalue variable with the function `lvar()`,
which gives us a name.
We then parse the expression, which returns an integer value, and it is this value that we store in the symbol table under the variable name.
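One detail worth noting is that the parsing function advances past `;` (and `exp()` past `)`) without checking that the expected token is actually there. A checked `match` helper is one common way to tighten this up. The sketch below is an assumption-laden illustration: `ListStream` is a hypothetical stand-in that offers the same `pointer()`/`next()` interface as the token stream, and `match` is not part of the code in this chapter.

```python
from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])

class ListStream:
    "Hypothetical token stream over a ready-made list of tokens."
    def __init__(self, tokens):
        self.tokens = tokens + [Token('EOF', None)]
        self.ix = 0
    def pointer(self):
        return self.tokens[self.ix]
    def next(self):
        if self.ix < len(self.tokens) - 1:
            self.ix += 1

def match(stream, expected):
    "Consume the current token if it has the expected type, else complain."
    tok = stream.pointer()
    if tok.type != expected:
        raise SyntaxError("expected {} but found {}".format(expected, tok.type))
    stream.next()
    return tok

# The statement `print x` with the terminating ';' missing.
stream = ListStream([Token('PRINT', 'print'), Token('NAME', 'x')])
match(stream, 'PRINT')
match(stream, 'NAME')
try:
    match(stream, ';')
except SyntaxError as err:
    print("syntax error:", err)
```

With such a helper, a line like `token_stream.next() # match ;` could become a call that reports a missing semicolon instead of silently skipping whatever token happens to be there.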

In [18]:
def lvar():
    tok = token_stream.pointer()
    
    if tok.type == 'NAME':
        token_stream.next()
        return tok.value # return var name
    
    else:
        raise SyntaxError('unexpected symbol {} while parsing'.format(tok.value))
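The functions `lvar()` and `var()` accept the same token, `NAME`, but return different things: a name to write to versus a value to read. The following sketch isolates that lvalue/rvalue distinction with a plain dictionary. The helper names `lvalue` and `rvalue` are hypothetical and chosen only for this illustration.

```python
symbol_table = {'x': 3}

def lvalue(name):
    "Target of a store: we need the name itself, the place to write to."
    return name

def rvalue(name):
    "Inside an expression: we need the value currently bound to the name."
    return symbol_table.get(name, 0)

# store x + x 1 ;   the first x is an lvalue, the second x an rvalue
target = lvalue('x')
symbol_table[target] = rvalue('x') + 1
print(symbol_table)
```

The same variable `x` is first treated as a location and then as a value, which is exactly the split between `lvar()` and `var()`.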

Back to our example, we want to interpret the statement `store y + 2 x ;`.  We need to set up our input and token streams appropriately,

In [19]:
input_stream = "store y + 2 x ;"
token_stream = TokenStream(lexer, input_stream)

In [20]:
symbol_table = dict()

In [21]:
stmt()

In [22]:
print(symbol_table)

{'y': 2}


The contents of the symbol table are exactly what we expected, given that the default value for the variable `x` is zero since nothing had been assigned to it.  In order to change that we preload the symbol table with a value for `x`.

In [23]:
symbol_table = {'x':3}

In [24]:
input_stream = "store y + 2 x ;"
token_stream = TokenStream(lexer, input_stream)

In [25]:
stmt()

In [26]:
print(symbol_table)

{'x': 3, 'y': 5}
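The first run above relied on `var()` returning zero for a variable that was never stored. That is a design choice, and a stricter interpreter could report an error instead. The sketch below contrasts the two policies. `lookup_default` mirrors the `symbol_table.get(name, 0)` call used in `var()`, while `lookup_strict` is a hypothetical alternative that is not part of the interpreter in this chapter.

```python
def lookup_default(table, name):
    "Policy used in var(): unbound variables silently evaluate to 0."
    return table.get(name, 0)

def lookup_strict(table, name):
    "Hypothetical stricter policy: using an unbound variable is an error."
    if name not in table:
        raise NameError("variable '{}' used before it was stored".format(name))
    return table[name]

table = {}
print(2 + lookup_default(table, 'x'))

try:
    lookup_strict(table, 'x')
except NameError as err:
    print("error:", err)
```

The default policy keeps the interpreter simple, at the price of hiding misspelled variable names.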


<!--
\index{rvalue}
\index{lvalue}
\index{symbol table}
-->

<!-- This paragraph makes no sense here...
Just as in the program from the previous chapter where we tried to find all the variable references were not variable definitions in an Exp0 program
we have to be careful with the interpretation of Exp1 programs and distinguish lvalues and rvalues.
If a variable appears as an lvalue (that is it appears as the first argument to the STORE statement) then we assign a value to it and
if a variable appears as an rvalue (that is it appears in the expression of the STORE statement) then we just look up the corresponding 
value for the variable.
Value updates and lookups are usually accomplished with the help of a symbol table. 
Exp1 is simple enough that a simple dictionary like table as a way to associate variable names with values suffices.
-->

The following video shows an animation of the syntax directed interpretation of our Exp1 program:

<!-- videos/chap02/q7/figure.mov -->

<a href="http://www.youtube.com/watch?feature=player_embedded&v=jmE_9zOfp1g" target="_blank">
<img style='border:1px solid #000000' src="movie.jpg" width="120" height="90" />
</a>

We can initialize `x` through a `store` statement and we can print out the value of `y` with an Exp1 `print` statement.  Adding statement lists to our parser will allow us to do that.  Recall the grammar snippet that specifies statement lists,
```
stmt_list : stmt_list stmt
          | stmt

```
Now we are facing a problem: this grammar snippet is not LL(1) because the lookahead sets for both rules are identical (the first rule is also left-recursive).  That means we cannot directly convert the grammar snippet into
a recursive descent parser function.
However, we can rewrite these rules borrowing some notation from regular expressions,
```
stmt_list : stmt+
```
meaning that a statement list consists of one or more statements.  This allows us to construct a parser function for `stmt_list`,

In [27]:
def stmt_list():
    while True:
        stmt()
        if token_stream.end_of_file():
            break
    return None

In [28]:
symbol_table = dict()

In [29]:
input_stream = \
'''
store x 3; 
store y + 2 x; 
print y;
'''

token_stream = TokenStream(lexer, input_stream)

In [30]:
stmt_list()

> 5


We now have a fully functioning interpreter for our Exp1 language.  The interpreter is syntax directed because the values are computed and passed along as we parse the source program.  To create a more polished implementation of the interpreter we can add a top-level driver function,

In [36]:
def exp1_rinterp(input_stream = None):
    'A driver for our recursive descent Exp1 interpreter.'
    
    global token_stream
    global symbol_table
    
    if not input_stream:
        input_stream = input("exp1 > ")
    
    token_stream = TokenStream(lexer, input_stream)
    symbol_table = dict()
    
    stmt_list()

In [37]:
exp1_rinterp("store x 1; store y 2; print + x y;")

> 3


## An Interpreter for Exp1 using an ANTLR-Generated Parser

%%%%%%%% figure  %%%%%%%%%
\myfigureA
{chap02:exp1interp-gram}
{\input{figures/chap02/22/exp1interp-gram.tex}}
{ANTLR specification for the Exp1 interpreter.}

Of course we don't want to build parsers by hand; instead we want to use a tool like ANTLR to generate our parsers.
ANTLR provides the idea of an attribute that we can associate with a non-terminal and this allows
us to encode the same parser behavior as in the hand-built parser.
Figure~\ref{chap02:exp1interp-gram} shows the ANTLR specification for our Exp1 interpreter.
For space reasons we did not include the supporting Java code.  If you do look at the Java code (see the QR code below) you
will notice that the supporting Java code is split into two different parts: one part is the `@header` part for the declarations of libraries to include
in your parser and the other part is the `@members` part that allows you to add data and function members to the
parser class as we have seen before.

%%%% qr code %%%%
\qrcode
{Scan the QR code or use the URL in order to see full ANTLR specification for the Exp1 interpreter.}
{qrcodes/chap02/q8/qrcode.png}
{\bookurl/b/2/q8/exp1Interp.g}


If we take a closer look at the specification file and ignore the actions for a minute we
find a number of major differences between this specification and our original Exp1 specification in Figure~\ref{chap02:exp1-gram}.
The first one is that structural tokens such as `'print'` have been put directly into the grammar specification itself.
Due to the tight integration of the parser generator and the lexer in ANTLR, ANTLR can generate lexical rules automatically for any structural tokens appearing in the grammar specification, making ANTLR specifications nice and compact and easily readable.
The second difference is that some non-terminals now have a return value.  These are precisely the values that we are interested
in during our interpretation of Exp1 programs and these are the same values that we saw as return values in our hand-built parser.
And finally we have two additional lexical rules: one for comments and one for white space.  A closer look at the actions
associated with these lexical rules shows that there is a special directive for the parser to ignore both comments and
white space: `$channel=HIDDEN`.

We begin by taking a look at the rules for the non-terminal `exp`,
```
exp returns [Integer value]
    :   '+' e1=exp e2=exp   { $value = $e1.value + $e2.value; }
    |   '-' e1=exp e2=exp   { $value = $e1.value - $e2.value; }
    |   '(' e=exp ')'       { $value = $e.value; }
    |   var                 { $value = lookup($var.name); }
    |   INTVAL              { $value = new Integer($INTVAL.text); }
    ;
```
The first rule specifies the addition operation.  Notice that the original rule has two `exp`
non-terminals on the right.
This introduces an ambiguity if we are trying to access the return values of each of these non-terminals.
In order to get rid of this ambiguity we give names to each of the non-terminals.
In our case we call the left `exp` non-terminal `e1` and the right `exp` non-terminal `e2`.
Recall that we declared a return value named `value` for every non-terminal `exp`.
In the rule actions we can access these return values with a special notation.
We can access the return value of our first expression `e1` with the notation,
```
$e1.value
```
Similarly for expression `e2`.
We also have a special notation to set the return value of the current `exp` non-terminal,
```
$value
```
With this the action for the first rule simply adds the two return values for expressions `e1` and `e2` and makes
the resulting value the return value of the current expression non-terminal,
```
{ $value = $e1.value + $e2.value; }
```
You should convince yourself that the remaining rules encode exactly the same behavior as the corresponding hand-built parsing function
above.

The next important group of rules in our grammar are the rules for the non-terminal `stmt`,
```
stmt    :   'print' exp ';'         { print($exp.value); }
        |   'store' var exp ';'     { update($var.name,$exp.value); }
        ;
```
Since statements do not return any values there is no need to declare a return value for this non-terminal.
The first rule specifies the PRINT statement and as expected the action associated with this rule takes the value
computed by the non-terminal `exp` and prints it.
The second rule specifies the STORE statement and again as expected the action updates the symbol table with the name-value
pair which consists of the variable name and the value computed by the expression.
You should compare this rule set with the hand-built parsing function for statements above.

Notice that tokens have a built-in return value, namely the text of the string that the lexer returned as part of the
token.
For example, in the case of the token `INTVAL` we can access that text with the notation,
```
$INTVAL.text
```

All we need to do in order to complete our interpreter is to write a driver program similar to the program
appearing in Figure~\ref{chap02:exp0count-driver} but with the names adjusted for our new program accordingly.  
At this point you should download the code from the book website and experiment: \bookurl/source

## An Example: A Pretty Printer for Exp1

%%%%%%%% figure  %%%%%%%%%
\myfigureA
{chap02:exp1pp-gram}
{\input{figures/chap02/23/exp1pp-gram.tex}}
{ANTLR specification for the Exp1 pretty printer.}

<!--
\index{syntax directed translation}
\index{translation!syntax directed}
-->

Syntax directed language processing does not only apply to interpretation.  We can also use syntax directed techniques to build
simple translators.
A pretty printer for our Exp1 language is a good example to study.
As you might know, pretty printers are programs that read the source of a program written in some programming language
and then generate code in the same language but formatted nicely so that the program is easy for humans to read.
This is a great example of the simple translator shown in Figure~\ref{chap01:simple-translator}, except that in our case it is not necessary
to construct an IR because we will use syntax directed translation.
Our pretty printer accomplishes two things: first, it puts each statement on its own line; second, it rewrites the expressions into
Lisp-like syntax.
In Lisp, each operation is embedded in a pair of parentheses.
For example, to add two numbers in Lisp we write the following expression,
```
(+ 2 3)
```
This means we also get rid of unnecessary parentheses.  For example, the expression `((+ (2) (3)))` will be rewritten as above.
 
Figure~\ref{chap02:exp1pp-gram} shows the ANTLR specification of our pretty printer.
You will notice the usual prologue in the specification.
Here we declare a parser member function `emit` that allows us to write strings to the terminal output.
Skipping down to the lexical rules we see that nothing has changed from the specification of the syntax directed interpreter with the exception that
now we have a token `VAR` instead of `NAME`.

Now, if you look at the grammar rule section of the specification and ignore the actions for a minute
then you will notice that we have rewritten the grammar slightly.
The rewritten grammar still generates the same language, and this form makes it easier to generate code.
Also notice that none of the non-terminals have return values.
This is because we are dealing with a simple translator, a translator that does not perform any semantic analysis but simply does a mapping
of the syntax.

The first rule of the grammar section is,
```
prog    :   ( stmt ';' { emit(";\n"); } )+
        ;
```
This is the rule that states that programs consist of one or more statements.
We have changed this rule slightly compared to the same rule in the syntax directed interpreter by inserting the semicolon token and the action that emits code.
In this form the rule states that every time we recognize a statement followed by a semicolon in the input stream we print out a semicolon
followed by a newline character to the output.
This is classic syntax directed language processing: the actions are dictated by the syntactic structures recognized in the input stream.

The next group of rules specifies what statements look like,
```
stmt    :   'print' { emit("print "); } exp
        |   'store' VAR { emit("store " + $VAR.text + " "); } exp
        ;
```
In the first rule we emit the keyword `print` as soon as we recognize the token `'print'` in the input stream.
We then continue to process the input stream with the non-terminal `exp`.
The second rule states that as soon as we recognize the tokens `'store'` and `VAR` we emit the keyword `store` and the
variable name and then continue processing with the non-terminal `exp`.

The last group of rules in the grammar rule section specifies expressions,
```
exp     :   '+' { emit("(+ "); } exp { emit(" "); } exp { emit(")"); }
        |   '-' { emit("(- "); } exp { emit(" "); } exp { emit(")"); }
        |   '(' exp ')'
        |   VAR     { emit($VAR.text); }
        |   INTVAL  { emit($INTVAL.text); }
        ;
```
The first rule specifies the addition operation.
Here we emit output as soon as we recognize the token `'+'`.
Recall that we want to rewrite the output in Lisp format.
Therefore, instead of just emitting the plus sign we emit `"(+ "`, that is, an open parenthesis followed by the plus sign and a space
character.
We then continue processing with the first `exp` non-terminal.
Once we have recognized the syntactic structure of the corresponding expression we emit a space character and then continue processing
with the second `exp` non-terminal.
Once we have recognized the syntactic structure of this second expression we emit the closing parenthesis.
The second rule works identically except that we are dealing with subtraction.
Given these two rules it is easy to see that the emitted code will have a Lisp-like format in that addition and subtraction operations will always
be surrounded by parentheses.
The third rule is interesting.
Here we recognize the syntactic structure of parenthesized expressions but we don't emit any code for the parentheses.
In essence we are deleting parentheses from the input program.
These parentheses from the input program are superfluous because every non-trivial expression is already parenthesized in the output
using the first two rules and therefore we do not emit them into the output.
In the last two rules above we emit the strings of the recognized tokens `VAR` and `INTVAL`, respectively.

To gain deeper insight into how syntax directed translation works it is perhaps best to envision the grammar of the pretty
printer as a recursive descent parser.
In that case the rule set for expressions from above could be viewed as the parsing function for expressions as follows:
```
function exp() returns void
begin
   switch inputToken()
   case PLUS:
      emit("(+ ")
      exp()
      emit(" ")
      exp()
      emit(")")
      return
   case MINUS:
      emit("(- ")
      exp()
      emit(" ")
      exp()
      emit(")")
      return
   case POPEN:
      exp()
      matchToken(PCLOSE)
      return
   case VAR:
      Token var = inputToken()
      emit(var.getString())
      return
   case INTVAL:
      Token value = inputToken()
      emit(value.getString())
      return
   default:
      syntaxError()
   end switch
end
```
Now it is easy to see that during syntax directed translation parsing functions and code generation functions are interleaved.
It is also easy to see that code is typically generated as soon as the relevant piece of syntax is recognized.
See if you can work through this example using the pretty printer grammar,
```
store x 1 ; print + (x) (2) ;
```
You should obtain the following output program,
```
store x 1;
print (+ x 2);
```
In order to complete our pretty printer we have to provide a driver program similar to the one in Figure~\ref{chap02:exp0count-driver}.
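The interleaving of parsing and emitting can also be tried out directly in Python. The following is a minimal sketch of the pretty printer as a recursive descent translator, written under simplifying assumptions: tokens are obtained by splitting the input on white space (so parentheses must be surrounded by blanks), variable names and numbers are recognized with `str.isalnum`, and the names `pretty`, `advance`, and `expect` are made up for this illustration. It mirrors the grammar actions but is not derived from ANTLR's generated code.

```python
def pretty(source):
    "Translate an Exp1 program (tokens separated by blanks) to the pretty form."
    tokens = source.split() + ['EOF']
    pos = 0
    out = []
    emit = out.append                  # emit(text) collects output fragments

    def advance():
        "Return the current token and move on (never past the end marker)."
        nonlocal pos
        tok = tokens[pos]
        if tok != 'EOF':
            pos += 1
        return tok

    def expect(text):
        tok = advance()
        if tok != text:
            raise SyntaxError('expected {} but found {}'.format(text, tok))

    def exp():
        tok = advance()
        if tok in ('+', '-'):
            emit('(' + tok + ' ')
            exp()
            emit(' ')
            exp()
            emit(')')
        elif tok == '(':
            exp()                      # parentheses are recognized, not emitted
            expect(')')
        elif tok != 'EOF' and tok.isalnum():
            emit(tok)                  # VAR or INTVAL: copy the token text
        else:
            raise SyntaxError('unexpected symbol {}'.format(tok))

    def stmt():
        tok = advance()
        if tok == 'print':
            emit('print ')
        elif tok == 'store':
            emit('store ' + advance() + ' ')   # no VAR check, for brevity
        else:
            raise SyntaxError('unexpected symbol {}'.format(tok))
        exp()

    while True:                        # prog : ( stmt ';' { emit(";\n"); } )+
        stmt()
        expect(';')
        emit(';\n')
        if tokens[pos] == 'EOF':
            break
    return ''.join(out)

print(pretty("store x 1 ; print + ( x ) ( 2 ) ;"), end='')
```

Run on the example program (with blanks around the parentheses), the sketch produces the two-line output program shown above.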

<!-- TODO: look at all the recursive descent parsing code and fix the stream index - inputToken vs nextToken etc -->

In [None]:
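# Assumption: input_token() returns the current token without consuming it
# and match_token(kind) consumes a token of the given kind.
from exp1lex import PLUS, MINUS, POPEN, PCLOSE, VAR, INTVAL
from exp1lex import input_token, match_token

def exp():
   tk = input_token()
   if tk.name == PLUS:
      match_token(PLUS)      # consume '+' before parsing the operands
      return exp() + exp()
   elif tk.name == MINUS:
      match_token(MINUS)     # consume '-'
      return exp() - exp()
   elif tk.name == POPEN:
      match_token(POPEN)     # consume '('
      val = exp()
      match_token(PCLOSE)    # consume ')'
      return val
   elif tk.name == VAR:
      match_token(VAR)
      return symbol_table.get(tk.value, 0)   # look up the variable, 0 if unbound
   elif tk.name == INTVAL:
      match_token(INTVAL)
      return int(tk.value)
   else:
      raise SyntaxError('unexpected symbol {} while parsing'.format(tk.value))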




# Exercises

17. (project) Use the code for the Exp1 language from above and extend the language with multiplication and integer division.  Demonstrate that your interpreter works by running it on some telling examples.

18. (project) Rewrite the grammar in Figure~\ref{chap02:exp1-gram} in such a way that it supports infix expressions and then construct a syntax directed interpreter for it.

19. (project) Rewrite the grammar in Figure~\ref{chap02:exp1-gram} in such a way that it
    1. supports the infix operations `*` and `/`, multiplication and division, respectively, as well as addition and subtraction, and
    2. properly encodes the associativity and precedence of all the operators,

    and then construct a syntax directed interpreter for it.

