# Chapter 4-A Top-Down Parsing

The syntax analyzer performs the major syntax checking (the inherently recursive part) of the source program.

#### Example

$L = \{a^nb^n\ |\ n \ge 1\}$, i.e., the language of all nested parentheses (or begin-ends), is not regular but is context-free.

For example, `()`, `(())`, `begin begin end end`.

The syntax of a programming language is defined by a context-free grammar but there are language features that cannot be described by a context-free grammar. They must be handled by the semantic analyzer.

#### Example

$L = \{wcw\ |\ w \in \{a, b\}^*\}$, where $c$ is a separator symbol, i.e., the language abstracting the requirement that identifiers be declared before their use, is not context-free but is context-sensitive. Such features are checked by the semantic analyzer.

#### Example

$
\begin{eqnarray}
E &\rightarrow& E + E \\
&|& E * E \\
&|& [id] \\
&|& [const]
\end{eqnarray}
$

<h3><center><i>Example Parse Tree</i></center></h3>

<img src="./res/04/4_1.png" width="300px" alt="Example Parse Tree"/>

A leftmost derivation:

$
\begin{eqnarray}
E &\Rightarrow_{lm}& E * E \\
&\Rightarrow_{lm}& [id] * E \\
&\Rightarrow_{lm}& [id] * E + E \\
&\Rightarrow_{lm}& [id] * [id] + E \\
&\Rightarrow_{lm}& [id] * [id] + [id]
\end{eqnarray}
$

A rightmost derivation:

$
\begin{eqnarray}
E &\Rightarrow_{rm}& E * E \\
&\Rightarrow_{rm}& E * E + E \\
&\Rightarrow_{rm}& E * E + [id] \\
&\Rightarrow_{rm}& E * [id] + [id] \\
&\Rightarrow_{rm}& [id] * [id] + [id]
\end{eqnarray}
$

## Ambiguity

A context-free grammar is **ambiguous** if there are two or more parse trees (or leftmost/rightmost derivations) for some sentence. Otherwise, it is **unambiguous**.

#### Example
The example grammar given above is ambiguous, because a single sentence such as $[id] * [id] + [id]$ has two different parse trees.
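
To see the ambiguity concretely, the sentence $[id] * [id] + [id]$ has a second leftmost derivation that starts with $E \rightarrow E + E$ instead of $E \rightarrow E * E$. The following worked sketch uses only the rules of the example grammar:

$
\begin{eqnarray}
E &\Rightarrow_{lm}& E + E \\
&\Rightarrow_{lm}& E * E + E \\
&\Rightarrow_{lm}& [id] * E + E \\
&\Rightarrow_{lm}& [id] * [id] + E \\
&\Rightarrow_{lm}& [id] * [id] + [id]
\end{eqnarray}
$

This derivation groups the sentence as $([id] * [id]) + [id]$, whereas the leftmost derivation that begins with $E * E$ groups it as $[id] * ([id] + [id])$. Two distinct leftmost derivations correspond to two distinct parse trees.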

#### Note
An ambiguous CFG should not be used to define the syntax of a programming language.
* The ambiguity problem for context-free grammars is undecidable in general.
* Rules causing ambiguity can sometimes be rewritten as unambiguous rules by imposing additional constraints.

Also, there is no general algorithm that transforms an ambiguous CFG into an equivalent unambiguous CFG; some context-free languages are inherently ambiguous.

#### Disambiguating rules
Assume the operator precedence $* > +$ and that both operators are left-associative.

$
\begin{eqnarray}
E &\rightarrow& E + T\ |\ T \\
T &\rightarrow& T * F\ |\ F \\
F &\rightarrow& [id]\ |\  [const]
\end{eqnarray}
$

#### Example
$[id] * [id] + [id]$

<h3><center><i>Parse Tree Using Unambiguous Grammar</i></center></h3>

<img src="./res/04/4_2.png" width="300px" alt="Parse Tree Using Unambiguous Grammar"/>
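
As a check, the leftmost derivation of this sentence in the unambiguous grammar can be worked out step by step. Because the grammar is unambiguous, it is the only leftmost derivation of the sentence:

$
\begin{eqnarray}
E &\Rightarrow_{lm}& E + T \\
&\Rightarrow_{lm}& T + T \\
&\Rightarrow_{lm}& T * F + T \\
&\Rightarrow_{lm}& F * F + T \\
&\Rightarrow_{lm}& [id] * F + T \\
&\Rightarrow_{lm}& [id] * [id] + T \\
&\Rightarrow_{lm}& [id] * [id] + F \\
&\Rightarrow_{lm}& [id] * [id] + [id]
\end{eqnarray}
$

The product is derived entirely from the first $T$, so $*$ binds tighter than $+$.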

### Dangling-else
```
<if-stmt> → if <expr> then <stmt> else <stmt> | if <expr> then <stmt>
```

#### Disambiguating rules
Assume that each else matches the nearest preceding unmatched then.
```
<if-stmt>        → <matched-stmt> | <unmatched-stmt>
<matched-stmt>   → if <expr> then <matched-stmt> else <matched-stmt> | <other-stmt>
<unmatched-stmt> → if <expr> then <matched-stmt> else <unmatched-stmt>
                 | if <expr> then <stmt>
```

## Deterministic parsing methods:

* Universal parsing algorithms such as Cocke-Younger-Kasami or Earley’s algorithm can parse any context-free language but are impractical in that they run in $O(n^3)$ time.
* Top-down LL and bottom-up LR parsing algorithms can parse large subclasses of context-free languages and run in $O(n)$ time.

## Syntax error detection and recovery:

* The majority of errors are syntactic in nature, e.g., unbalanced parentheses in an arithmetic expression.
* Most syntax errors are simple ones that can be detected easily by the parser.
* Again, heuristics must be used to recover from syntax errors, and full recovery is neither possible nor cost-effective. Panic mode recovery skips several subsequent tokens, e.g., up to a sentence-ending one such as ; or end, and then continues.
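
Panic mode recovery can be sketched in a few lines of Python. This is only an illustration: the function name `panic_mode_skip`, the set `SYNC_TOKENS`, and the representation of the input as a list of token strings are assumptions, not part of any particular parser.

```python
# Hypothetical synchronizing tokens: sentence-ending symbols.
SYNC_TOKENS = {";", "end"}

def panic_mode_skip(tokens, pos, sync=SYNC_TOKENS):
    """Skip tokens from index pos up to and including the next
    synchronizing token; return the index where parsing resumes."""
    while pos < len(tokens) and tokens[pos] not in sync:
        pos += 1
    # Step past the synchronizing token itself, if one was found.
    return min(pos + 1, len(tokens))

tokens = ["x", "=", "(", "1", "+", ";", "y", "=", "2", ";"]
resume = panic_mode_skip(tokens, 4)  # an error was detected at the "+"
```

Here parsing resumes at the token `y`, the start of the next statement, and everything between the error and the `;` is ignored.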

## Cocke-Younger-Kasami (CYK) parsing
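Given a CFG $G = (N, \Sigma, P, S)$ in Chomsky Normal Form and an input string $w$, test whether $w \in L(G)$. Let $w = a_1 a_2 \cdots a_n$ and let $T$ be an $n \times n$ table such that $T[i, j]$, for $1 \le i \le j \le n$, is the set of all nonterminals generating $a_{i} a_{i+1} \cdots a_{j}$. Then, $w \in L(G)$ if and only if $S \in T[1, n]$. $T$ can be constructed by _dynamic programming_: $T[i, i]$ contains every $A$ with a rule $A \rightarrow a_i$, and $T[i, j]$ for $i < j$ contains every $A$ with a rule $A \rightarrow BC$ such that $B \in T[i, k]$ and $C \in T[k+1, j]$ for some $i \le k < j$.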

#### Example
Parse $w=abba$ with the following CFG:

$
\begin{eqnarray}
S &\rightarrow& AB\ |\ SA\ |\ BB \\
A &\rightarrow& AA\ |\ BA\ |\ a \\
B &\rightarrow& BB\ |\ b \\
\end{eqnarray}
$
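
The table for this example can be filled in mechanically. The Python sketch below is one possible implementation, assuming that the third rule of the grammar defines $B$ (i.e., $B \rightarrow BB\ |\ b$) and that every symbol is a single character. The names `cyk_table` and `GRAMMAR` are illustrative.

```python
from itertools import product

def cyk_table(rules, w):
    """Build the CYK table for a CNF grammar and the string w.

    rules maps each nonterminal to a list of right-hand sides; a RHS is
    either one terminal ("a") or two nonterminals ("AB").
    T[(i, j)] is the set of nonterminals deriving a_i ... a_j (1-indexed).
    """
    n = len(w)
    T = {}
    # Substrings of length 1: apply the terminal rules A -> a.
    for i in range(1, n + 1):
        T[(i, i)] = {A for A, rhss in rules.items() if w[i - 1] in rhss}
    # Longer substrings: try every split point k and every rule A -> BC.
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            cell = set()
            for k in range(i, j):
                for B, C in product(T[(i, k)], T[(k + 1, j)]):
                    cell |= {A for A, rhss in rules.items() if B + C in rhss}
            T[(i, j)] = cell
    return T

GRAMMAR = {
    "S": ["AB", "SA", "BB"],
    "A": ["AA", "BA", "a"],
    "B": ["BB", "b"],  # assumed: the rule for B
}

table = cyk_table(GRAMMAR, "abba")
accepted = "S" in table[(1, 4)]
```

For $w = abba$ the top cell `table[(1, 4)]` contains $S$, so $w \in L(G)$. The three nested loops over `length`, `i`, and `k` give the $O(n^3)$ running time.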

## Top-down parsing
Given a CFG $G = (N, \Sigma, P, S)$ and an input string $w$, construct a parse tree for $w$ in $G$ top-down, i.e., start with the start symbol $S$ and expand nonterminals in order to generate $w$.

#### Observation

The main difficulty of top-down parsing lies in choosing the correct right-hand side (RHS) of a rule when a nonterminal has multiple RHSs.

#### Note

A PDA is the recognition device for CFGs (CFG = PDA). Namely, a program structure is defined by a CFG, but whether an input is valid according to the rules of the CFG is tested by a PDA. Our top-down parser will be a PDA simulating a certain type of derivation of the CFG, namely a leftmost derivation.

## Pushdown automaton (PDA)

A finite automaton with an additional stack of unbounded size, defined formally as $M = (Q, \Sigma, \Gamma, \delta, q_0, Z_0, F)$, where $Q$ is the state set, $\Sigma$ is the input alphabet, $\Gamma$ is the stack alphabet, $\delta: Q \times (\Sigma \cup \{\varepsilon\}) \times \Gamma \rightarrow 2^{Q \times \Gamma^*}$ is the transition function, $q_0$ is the initial state, $Z_0 \in \Gamma$ is the stack bottom marker, and $F \subseteq Q$ is the set of accepting states.

### Configuration
A triple $(q, x{\uparrow}y, \gamma)$ indicating the PDA’s situation that it is in state $q$ after consuming the prefix $x$ of the input string $xy$, it is scanning the first symbol of $y$ on the input tape, and the stack content is $\gamma$, where the first symbol of $\gamma$ is the stack top symbol.

### Accepting computation
A sequence of configurations $(q_0, {\uparrow}w, Z_0) \vdash \cdots \vdash (q, w{\uparrow}, \gamma)$, where $q \in F$ (the last one in this sequence is an accepting configuration), i.e., a PDA accepts at the end of the input tape if it is in an accepting state. Assume w.l.o.g. that the PDA accepts with the empty stack (so, $\gamma = \varepsilon$).

### Deterministic PDA (DPDA)
A PDA with at most one possible next move at any point of the computation.

#### Example

DPDA for $L = \{a^n b^n\ |\ n \ge 1\}$

1. Push a’s into the stack.
2. Pop a’s while reading b’s.
3. Accept if the stack is empty at the end of the input tape.

$M = (\{q_0,q_1\}, \{a, b\}, \{a, Z_0\}, \delta, q_0, Z_0, \{q_1\})$, where

$
\delta(q_0,a,Z_0) = \{(q_0, aZ_0)\} \\
\delta(q_0,a,a) = \{(q_0, aa)\} \\
\delta(q_0,b,a) = \{(q_1, \varepsilon)\} \\
\delta(q_1,b,a) = \{(q_1, \varepsilon)\} \\
\delta(q_1,\varepsilon,Z_0) = \{(q_1, \varepsilon)\} \text{ // empty the stack to accept}
$

#### An accepting computation:

$
\begin{eqnarray}
(q_0, {\uparrow}aabb, Z_0) &\vdash& (q_0, a{\uparrow}abb, aZ_0) \\
&\vdash& (q_0, aa{\uparrow}bb, aaZ_0) \\
&\vdash& (q_1, aab{\uparrow}b, aZ_0) \\
&\vdash& (q_1, aabb{\uparrow}, Z_0) \\
&\vdash& (q_1, aabb{\uparrow}, \varepsilon)
\end{eqnarray}
$
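
The behaviour of this DPDA can be checked with a short simulation. The Python sketch below uses the moves that the accepting computation above follows (in particular, reading the first $b$ with $a$ on the stack top switches to $q_1$ and pops). The names `DELTA` and `dpda_accepts`, and the encoding of the stack as a tuple with the top symbol first, are assumptions of this sketch.

```python
# (state, input symbol or "" for an epsilon-move, stack top)
#   -> (next state, symbols that replace the stack top, top first)
DELTA = {
    ("q0", "a", "Z0"): ("q0", ("a", "Z0")),
    ("q0", "a", "a"): ("q0", ("a", "a")),
    ("q0", "b", "a"): ("q1", ()),
    ("q1", "b", "a"): ("q1", ()),
    ("q1", "", "Z0"): ("q1", ()),  # empty the stack to accept
}

def dpda_accepts(w):
    """Run the DPDA on w; accept in q1 with empty stack at end of input."""
    state, stack, pos = "q0", ("Z0",), 0
    while stack:
        top = stack[0]
        # At most one of the two kinds of move applies in a DPDA.
        if pos < len(w) and (state, w[pos], top) in DELTA:
            state, push = DELTA[(state, w[pos], top)]
            pos += 1
        elif (state, "", top) in DELTA:
            state, push = DELTA[(state, "", top)]
        else:
            return False  # no move is defined: the machine is stuck
        stack = push + stack[1:]
    return pos == len(w) and state == "q1"

results = {w: dpda_accepts(w) for w in ["ab", "aabb", "aab", "abab", ""]}
```

Only `ab` and `aabb` are accepted. On `aab` the machine gets stuck with an $a$ left on the stack, and on `abab` it empties the stack before the input is consumed.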

#### Note
DPDA ⊊ PDA, i.e., there is a language defined by a PDA but not by any DPDA.

#### Example
$L = \{ww^R\ |\ w \in \{a, b\}^*\} \in$ PDA $-$ DPDA.

1. Similar to the above construction.
2. But the mid-point must be guessed.

$M = (\{q_0,q_1\}, \{a, b\}, \{a, b, Z_0\}, \delta, q_0, Z_0, \{q_1\})$, where

$
\delta(q_0,X,Z_0) = \{(q_0, XZ_0)\}\ \forall X\in\{a,b\} \\
\delta(q_0,X,X) = \{(q_0, XX), (q_1,\varepsilon)\}\ \forall X\in\{a,b\} \text{ // multiple choices} \\
\delta(q_0,X,Y) = \{(q_0, XY)\}\ \forall X, Y\in\{a,b\} \text{ if } X\ne Y \\
\delta(q_1,X,X) = \{(q_1, \varepsilon)\}\ \forall X\in\{a,b\} \\
\delta(q_1,\varepsilon,Z_0) = \{(q_1, \varepsilon)\}
$
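
Since the mid-point must be guessed, a simulation has to explore every choice. The following Python sketch does this with a worklist of configurations, assuming the stack alphabet contains both $a$ and $b$. The name `pda_accepts_wwr` and the tuple encoding of the stack (top first) are illustrative.

```python
def pda_accepts_wwr(w):
    """Search all computations of the PDA for ww^R on input w."""
    # A configuration is (state, input position, stack with top first).
    pending = [("q0", 0, ("Z0",))]
    while pending:
        state, pos, stack = pending.pop()
        if not stack:
            if pos == len(w) and state == "q1":
                return True  # empty stack at the end of the input
            continue
        top = stack[0]
        if state == "q1" and top == "Z0":
            pending.append(("q1", pos, stack[1:]))  # delta(q1, eps, Z0)
        if pos == len(w) or w[pos] not in "ab":
            continue
        x = w[pos]
        if state == "q0":
            # Every q0 move may push x (first half of the string) ...
            pending.append(("q0", pos + 1, (x,) + stack))
            if top == x:
                # ... or guess that the mid-point was just passed.
                pending.append(("q1", pos + 1, stack[1:]))
        elif top == x:
            pending.append(("q1", pos + 1, stack[1:]))  # match second half
    return False

results = {w: pda_accepts_wwr(w) for w in ["aa", "abba", "abab", "aba"]}
```

The search terminates because every move except the final $\varepsilon$-move consumes an input symbol. Note that, with the transitions exactly as listed, there is no move on empty input from $q_0$, so this sketch rejects the empty string.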

## CFG to PDA

Given a CFG $G = (N, \Sigma, P, S)$, we construct a PDA $M$ such that $L(M) = L(G)$. For any input $w \in \Sigma^*$, $M$ will simulate a leftmost derivation of $w$ in $G$ and accept $w$ if and only if $w$ can be generated by $G$.

1. Push $S$ onto the stack. $M$ will check whether the stack content $S$ can derive the unscanned portion $w$ of the input tape by using the rules of $G$.
2. If $M$ has successfully simulated a leftmost derivation of $G$, such as $S \Rightarrow_{lm}^* xA\gamma$, where $x \in \Sigma^*$ and $A \in N$ (thus, $A$ is the leftmost nonterminal in the current sentential form), then the corresponding configuration of $M$ is $(q, x{\uparrow}y, A{\gamma}Z_0)$. $M$ needs to verify that the current stack content $A\gamma$ can derive the unscanned portion $y$ of the input tape.
3. It is sufficient to understand how $M$ simulates a single step of $G$ that expands a nonterminal to one of its RHSs, say by the rule $A \rightarrow \alpha$. Replace the stack top symbol $A$ by $\alpha$.
  - If a terminal symbol is exposed on the stack top as a result, then consume an identical input symbol and pop it. Repeat this action.
  - If a nonterminal symbol is exposed on the stack top, then the one-step simulation is done, so go back to (3).
4. Accept if the stack is empty at the end of the input tape, since the stack content $\varepsilon$ trivially derives the unscanned portion $\varepsilon$ of the input tape.

Thus,

$M = (\{q_0,q_1\}, \Sigma, N \cup \Sigma \cup \{Z_0\}, \delta, q_0, Z_0, \{q_1\})$, where

$
\delta(q_0,\varepsilon,Z_0) = \{(q_1, SZ_0)\} \\
\delta(q_1,\varepsilon,A) = \{(q_1, \alpha_j)\ |\ j=1,2,\dots,n\} \text{ if } A\rightarrow\alpha_1\ |\ \alpha_2\ |\ \cdots\ |\ \alpha_n \text{ are the } A \text{-rules of } G \\
\delta(q_1,a,a) = \{(q_1, \varepsilon)\}\ \forall a\in\Sigma \\
\delta(q_1,\varepsilon,Z_0) = \{(q_1, \varepsilon)\}
$
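
The construction is mechanical enough to sketch in Python. The code below builds $\delta$ from a grammar and then searches the computations of the resulting PDA. The names `cfg_to_pda`, `pda_accepts`, and `RULES` are illustrative. The pruning step assumes a grammar without $\varepsilon$-rules, so that every stack symbol above $Z_0$ must still produce at least one input token; this is an addition of the sketch, not part of the construction, and it is what keeps the search finite for left-recursive rules.

```python
def cfg_to_pda(rules, start):
    """Return delta for the PDA M built from the grammar (rules, start)."""
    terminals = {s for rhss in rules.values() for rhs in rhss for s in rhs}
    terminals -= set(rules)
    delta = {
        ("q0", "", "Z0"): {("q1", (start, "Z0"))},  # push S
        ("q1", "", "Z0"): {("q1", ())},             # empty the stack
    }
    for A, rhss in rules.items():                   # expand A -> alpha_j
        delta[("q1", "", A)] = {("q1", tuple(rhs)) for rhs in rhss}
    for a in terminals:                             # match a terminal
        delta[("q1", a, a)] = {("q1", ())}
    return delta

def pda_accepts(delta, tokens):
    """Search all computations; accept in q1 with empty stack at the end."""
    pending, seen = [("q0", 0, ("Z0",))], set()
    while pending:
        config = pending.pop()
        if config in seen:
            continue
        seen.add(config)
        state, pos, stack = config
        if not stack:
            if pos == len(tokens) and state == "q1":
                return True
            continue
        # Pruning (assumes no epsilon-rules): the symbols above Z0
        # cannot outnumber the remaining input tokens.
        if len(stack) - 1 > len(tokens) - pos:
            continue
        top = stack[0]
        for next_state, push in delta.get((state, "", top), ()):
            pending.append((next_state, pos, push + stack[1:]))
        if pos < len(tokens):
            for next_state, push in delta.get((state, tokens[pos], top), ()):
                pending.append((next_state, pos + 1, push + stack[1:]))
    return False

# The unambiguous expression grammar; each RHS is a tuple of symbols.
RULES = {
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("id",), ("const",)],
}
delta = cfg_to_pda(RULES, "E")
ok = pda_accepts(delta, ["id", "*", "id", "+", "id"])
bad = pda_accepts(delta, ["id", "+"])
```

The search has to try every RHS of every nonterminal, which is exactly the nondeterminism of $\delta(q_1, \varepsilon, A)$: choosing the right RHS without search is the main difficulty of top-down parsing.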