In [2]:
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="whitegrid", font_scale=1.5, rc={'figure.figsize':(12, 6)})

# for custom notebook formatting.
from IPython.display import HTML, display
display(HTML('<style>.prompt{width: 0px; min-width: 0px; visibility: collapse}</style>'))
HTML(open('../custom.css').read())


<br>

## Natural Language Processing
### :::: Context-free grammars ::::

<br>

<br><br><br><br><br><br>


This week we'll take a break from neural networks to learn about parsing.

## Part-of-speech

| POS  | Description | Examples |
|------|-------------|----------|
| **Noun** | Names of things | boy, cat, truth |
| **Verb** | Action or state | become, hit |
| **Pronoun** | Reference for a noun | I, you, me, them |
| **Adverb** | Modifies V, Adj, Adv | quickly, very |
| **Adjective** | Modifies a noun | happy, smart |
| **Conjunction** | Connects words | and, but, since |
| **Preposition** | Relates to a noun | to, of, from |
| **Interjection** | Outcry | Ah, Ha |

## Constituent

- A group of words behaving as a single unit or phrase

- Name them based on the **head** word in the constituent

> **Noun phrases**: "the big house" or "a beautiful day"  
> **Adjective phrases**: "very useful"  
> **Prepositional phrases**: "on the hill"  
> **Verb phrases**: "saw the dog"  

<br><br>

## Language is recursive

- A sentence has many parts, many of which have subparts, many of which have subparts, ...

> I saw the dog with one eye on the hill with the tree by the lake...

We need a way to compactly represent this recursion.

$\Rightarrow$ **Context-Free Grammars (CFGs)**

<br><br>

## Context-free Grammar, Informally

- Set of **rules** or **productions**
  - Define how constituents can be grouped
- **Lexicon**: list of words and symbols

### Example: CFG for Noun Phrases

> NP $\rightarrow$ Det Nominal  
> NP $\rightarrow$ ProperNoun  
> Nominal $\rightarrow$ Noun | Noun Nominal

Rules can be part of a hierarchy:

> Det $\rightarrow$ a  
> Det $\rightarrow$ the  
> Noun $\rightarrow$ flight  

- **Terminal** symbols: words in the language (e.g., "a", "flight")
- **Nonterminal** symbols: clusters or generalizations of terminals (e.g., Noun, Nominal, NP)



<br><br>

## Derivation
- A sequence of rule expansions to generate a given string.
- This sequence is most commonly shown as a **parse tree**

![figs/parse.png](figs/parse.png)

**Derivation**  
1. NP $\rightarrow$ Det Nominal
2. Det $\rightarrow$ a
3. Nominal $\rightarrow$ Noun
4. Noun $\rightarrow$ flight
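
The four steps of this derivation can be replayed mechanically. The sketch below rewrites the leftmost occurrence of a rule's left-hand side at each step; `apply_rule` is a hypothetical helper, and the symbol names are spelled as in the noun-phrase grammar.

```python
def apply_rule(form, lhs, rhs):
    """Rewrite the leftmost occurrence of `lhs` in the sentential form with `rhs`."""
    i = form.index(lhs)  # raises ValueError if the rule does not apply
    return form[:i] + list(rhs) + form[i + 1:]


# The rule applications, in order.
steps = [
    ("NP", ["Det", "Nominal"]),
    ("Det", ["a"]),
    ("Nominal", ["Noun"]),
    ("Noun", ["flight"]),
]

form = ["NP"]
history = [form]
for lhs, rhs in steps:
    form = apply_rule(form, lhs, rhs)
    history.append(form)

# Print every sentential form, from the start symbol down to the words.
for sentential_form in history:
    print(" ".join(sentential_form))
```

Each printed line is one sentential form; the last one contains only terminals.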


<br><br>

## CFGs, Formally

A context-free grammar is a four-tuple:

1. A set of non-terminal symbols (or 'variables') $N$
2. A set of terminal symbols $\Sigma$ (disjoint from $N$)
3. A set of productions $P$ of the form $A \rightarrow \alpha$, where
 - $A \in N$ is a non-terminal
 - $\alpha$ is a string of symbols from the infinite set $(\Sigma \cup N)^*$
4. A start symbol $S$


A string $\alpha_1$ **derives** a string $\alpha_m$ if $\alpha_1$ can be rewritten as $\alpha_m$ by a series of rule applications from $P$.

$$\alpha_1 \Rightarrow \alpha_2,  \alpha_2 \Rightarrow \alpha_3,  \ldots,  \alpha_{m-1} \Rightarrow \alpha_m$$  

Denoted: $\alpha_1 \Rightarrow^* \alpha_m$

If $A \rightarrow \beta$ is a production in $P$, and $\alpha$ and $\gamma$ are strings in $(\Sigma \cup N)^*$,
- $\alpha A \gamma$ **directly derives** $\alpha \beta \gamma$
- denoted: $\alpha A \gamma \Rightarrow \alpha \beta \gamma$

> a $Noun$ ride $\Rightarrow$ a train ride (by applying $Noun \rightarrow train$)

<br><br>

## Context-free language

Given a CFG $G$, we can define the formal language of strings accepted/generated by $G$ as:

$$L_G = \{\, w \mid w \in \Sigma^* \text{ and } S \Rightarrow^* w \,\}$$
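
To make the definition concrete, the sketch below enumerates the members of $L_G$ up to a length bound by searching over sentential forms. It uses a subset of the noun-phrase rules with NP as the start symbol; the function name `generate` and the dictionary encoding are illustrative, and the pruning assumes no rule has an empty right-hand side.

```python
from collections import deque

GRAMMAR = {
    "NP": [("Det", "Nominal")],
    "Nominal": [("Noun",), ("Noun", "Nominal")],
    "Det": [("a",), ("the",)],
    "Noun": [("flight",)],
}


def generate(grammar, start, max_len):
    """Return every terminal string of at most `max_len` words derivable from `start`.

    Assumes no empty right-hand sides, so sentential forms never shrink
    and any form longer than `max_len` can be discarded.
    """
    results = set()
    seen = {(start,)}
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        # Position of the leftmost nonterminal, if any.
        idx = next((i for i, s in enumerate(form) if s in grammar), None)
        if idx is None:  # only terminals left: this string is in the language
            results.add(" ".join(form))
            continue
        for rhs in grammar[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return results


print(sorted(generate(GRAMMAR, "NP", 3)))
```

Because `Nominal` is recursive, the full language is infinite; the length bound is what keeps the enumeration finite.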

<br><br>

## Example: CFG for airline reservation system

![figs/lexicon.png](figs/lexicon.png)

![figs/grammar.png](figs/grammar.png)

![figs/flight.png](figs/flight.png)

<br>

### Why do we need CFGs?







1. Convenience / Compactness  

2. Expressivity

 - Regular expressions / finite state machines cannot represent languages like $a^n b^n$
   - e.g., $aaaabbbb$ - need to remember how many $a$'s seen  
   - Proof by the [Pumping Lemma](https://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages) for regular languages (see CMPS 3250 Theory of Computation)
   - E.g., the language of properly matched parentheses
     - $(1 + (2-(6/3)))$
 - Does this really appear in English?
 
<br><br><br>


> The cat likes fish.  
> The cat the dog chased likes fish.  
> The cat the dog the rat bit chased likes fish.  
> The cat the dog the rat the elephant admired bit chased likes fish.  
> ...  

(the Noun)$^n$ (Verb)$^{n-1}$ likes fish


Also:

> If $S_1$ then $S_2$  
> Either $S_3$ or $S_4$

> If either the man who said $S_5$ is arriving today or the man who said $S_5$ is arriving tomorrow, then the man who said $S_6$ is arriving the day after.

Letting:
- if $\rightarrow a$ 
- then $\rightarrow a$
- either $\rightarrow b$
- or $\rightarrow b$

Sentence above becomes $abba$. Can embed further to get $a^nb^nb^na^n$.

Of course, there's a practical human memory limit on this recursion.
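
The center-embedding pattern (the Noun)$^n$ (Verb)$^{n-1}$ likes fish can be sketched in code. The generator and the checker below are hypothetical helpers; the checker has to count nouns and then match that count against the verbs, which is the kind of bookkeeping a finite-state machine cannot do for unbounded $n$.

```python
NOUNS = {"cat", "dog", "rat", "elephant"}
VERBS = {"chased", "bit", "admired"}


def center_embedded(nouns, verbs):
    """Build 'The N1 the N2 ... Vk ... V1 likes fish.' (verbs[i] relates nouns[i+1] to nouns[i])."""
    if len(verbs) != len(nouns) - 1:
        raise ValueError("need exactly one fewer verb than nouns")
    parts = [f"the {noun}" for noun in nouns]
    parts += list(reversed(verbs))  # innermost clause closes first
    parts.append("likes fish")
    sentence = " ".join(parts)
    return sentence[0].upper() + sentence[1:] + "."


def embedding_depth(sentence):
    """Return n if the sentence matches (the Noun)^n (Verb)^(n-1) likes fish, else None."""
    tokens = sentence.rstrip(".").lower().split()
    i = n_nouns = 0
    while i + 1 < len(tokens) and tokens[i] == "the" and tokens[i + 1] in NOUNS:
        n_nouns += 1
        i += 2
    n_verbs = 0
    while i < len(tokens) and tokens[i] in VERBS:
        n_verbs += 1
        i += 1
    matches = (n_nouns >= 1 and n_verbs == n_nouns - 1
               and tokens[i:] == ["likes", "fish"])
    return n_nouns if matches else None


print(center_embedded(["cat", "dog", "rat"], ["chased", "bit"]))
```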

<br><br><br><br><br>

## Syntactic Parsing

Given a grammar $G$ with start symbol $S$ and a string of words $w$,  
find a valid derivation $S \Rightarrow^* w$.

<br><br><br><br>

### Parsing as search

Similar to how the Viterbi algorithm is an efficient way to search over all possible state sequences, we need a way to efficiently search over all possible parse trees.

- for now, we want to find all valid parses
- next time, we'll assign scores to each valid parse and rank them

In [25]:
## Recursive descent parsing demo.
import nltk
nltk.app.rdparser()

## Top-down Parsing

- Start at root node $S$
- Expand rules until we reach words at the leaf nodes
- If fail to match a word, backtrack and try a different rule.


![figs/grammar2.png](figs/grammar2.png)

![figs/topdown.png](figs/topdown.png)

Which of the above will lead to a valid parse of the sentence "Book that flight"?

![figs/flight2.png](figs/flight2.png)

## Top-down parsing

- Ideally, we'd like to explore all trees in parallel, but this takes too much memory.
- Instead, we use depth-first search.
- Choose most recently expanded state for next expansion.
- Left-most unexpanded node is expanded first.
- If we reach an invalid parse, return to the most recent unexplored state.

> Does this flight include a meal?

![figs/topdown3.png](figs/topdown3.png)

In [26]:
# Defining grammars in NLTK.
grammar = nltk.CFG.fromstring("""
  S -> NP VP | Aux NP VP | VP
  NP -> Det Nominal
  Nominal -> Noun | Noun Nominal
  NP -> ProperNoun
  VP -> Verb
  VP -> Verb NP
  Det -> 'that' | 'this' | 'a'
  Noun -> 'book' | 'flight' | 'meal' | 'money'
  Verb -> 'include'| 'book' | 'prefer'
  Aux -> 'does'
  Prep -> 'from' | 'to' | 'on'
  ProperNoun -> 'Houston' | 'TWA'
  """)

grammar

<Grammar with 25 productions>

In [29]:
parser = nltk.RecursiveDescentParser(grammar)
sent = 'book that flight'.split()
[str(p) for p in parser.parse(sent)][0]

'(S (VP (Verb book) (NP (Det that) (Nominal (Noun flight)))))'

In [30]:
parser = nltk.RecursiveDescentParser(grammar, trace=2)
[str(p) for p in parser.parse(sent)]

Parsing 'book that flight'
    [ * S ]
  E [ * NP VP ]
  E [ * Det Nominal VP ]
  E [ * 'that' Nominal VP ]
  E [ * 'this' Nominal VP ]
  E [ * 'a' Nominal VP ]
  E [ * ProperNoun VP ]
  E [ * 'Houston' VP ]
  E [ * 'TWA' VP ]
  E [ * Aux NP VP ]
  E [ * 'does' NP VP ]
  E [ * VP ]
  E [ * Verb ]
  E [ * 'include' ]
  E [ * 'book' ]
  M [ 'book' ]
  E [ * 'prefer' ]
  E [ * Verb NP ]
  E [ * 'include' NP ]
  E [ * 'book' NP ]
  M [ 'book' * NP ]
  E [ 'book' * Det Nominal ]
  E [ 'book' * 'that' Nominal ]
  M [ 'book' 'that' * Nominal ]
  E [ 'book' 'that' * Noun ]
  E [ 'book' 'that' * 'book' ]
  E [ 'book' 'that' * 'flight' ]
  M [ 'book' 'that' 'flight' ]
  + [ 'book' 'that' 'flight' ]
  E [ 'book' 'that' * 'meal' ]
  E [ 'book' 'that' * 'money' ]
  E [ 'book' 'that' * Noun Nominal ]
  E [ 'book' 'that' * 'book' Nominal ]
  E [ 'book' 'that' * 'flight' Nominal ]
  M [ 'book' 'that' 'flight' * Nominal ]
  E [ 'book' 'that' 'flight' * Noun ]
  E [ 'book' 'that' 'flight' * 'book' ]
  E

['(S (VP (Verb book) (NP (Det that) (Nominal (Noun flight)))))']
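
To see what the trace is doing, here is a minimal from-scratch sketch of depth-first top-down parsing with backtracking. It is not NLTK's implementation: the dictionary encoding and the names `parse`, `expand`, and `parse_sentence` are illustrative, and the unused `Prep` rules are omitted. Python generators provide the backtracking, since each alternative is tried only when the previous one is exhausted.

```python
GRAMMAR = {
    "S": [["NP", "VP"], ["Aux", "NP", "VP"], ["VP"]],
    "NP": [["Det", "Nominal"], ["ProperNoun"]],
    "Nominal": [["Noun"], ["Noun", "Nominal"]],
    "VP": [["Verb"], ["Verb", "NP"]],
    "Det": [["that"], ["this"], ["a"]],
    "Noun": [["book"], ["flight"], ["meal"], ["money"]],
    "Verb": [["include"], ["book"], ["prefer"]],
    "Aux": [["does"]],
    "ProperNoun": [["Houston"], ["TWA"]],
}


def parse(symbol, words, pos):
    """Yield (tree, next_pos) for each way `symbol` derives words[pos:next_pos]."""
    if symbol not in GRAMMAR:  # terminal: must match the next input word
        if pos < len(words) and words[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:  # try each expansion in order
        for children, end in expand(rhs, words, pos):
            yield (symbol, *children), end


def expand(rhs, words, pos):
    """Yield (children, next_pos) for each way the symbol sequence `rhs` matches."""
    if not rhs:
        yield [], pos
        return
    for tree, mid in parse(rhs[0], words, pos):  # leftmost symbol first
        for rest, end in expand(rhs[1:], words, mid):
            yield [tree] + rest, end


def parse_sentence(sentence):
    """Return all trees rooted at S that cover the whole sentence."""
    words = sentence.split()
    return [tree for tree, end in parse("S", words, 0) if end == len(words)]


for tree in parse_sentence("book that flight"):
    print(tree)
```

Like any naive top-down parser, this sketch would recurse without end on a left-recursive rule.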

In [19]:
# What happens if we add Left Recursion?
# Add this rule below to get infinite recursion:
# Nominal -> Nominal PP

grammar_lr = nltk.CFG.fromstring("""
  S -> NP VP | Aux NP VP | VP
  NP -> Det Nominal
  Nominal -> Noun | Noun Nominal
  NP -> ProperNoun
  VP -> Verb
  VP -> Verb NP
  Det -> 'that' | 'this' | 'a'
  Noun -> 'book' | 'flight' | 'meal' | 'money'
  Verb -> 'book' | 'include' | 'prefer'
  Aux -> 'does'
  Prep -> 'from' | 'to' | 'on'
  ProperNoun -> 'Houston' | 'TWA'
  Nominal -> Nominal PP
  """)

parser = nltk.RecursiveDescentParser(grammar_lr)
# [str(p) for p in parser.parse(sent)]
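
Rather than running the parser and waiting for it to loop, left recursion can be detected from the grammar itself. The sketch below is a hypothetical helper (not part of NLTK) that reports every nonterminal able to reappear as the leftmost symbol of its own expansion, directly or indirectly. It assumes no rule has an empty right-hand side.

```python
def left_recursive(grammar):
    """Return the nonterminals that can derive a form starting with themselves.

    Assumes no empty right-hand sides, so only the first symbol of each
    right-hand side can end up leftmost.
    """
    # first[A]: nonterminals that can be leftmost after expanding A once.
    first = {lhs: {rhs[0] for rhs in alternatives if rhs[0] in grammar}
             for lhs, alternatives in grammar.items()}
    result = set()
    for lhs in grammar:
        reachable, stack = set(), list(first[lhs])
        while stack:  # follow leftmost symbols transitively
            symbol = stack.pop()
            if symbol not in reachable:
                reachable.add(symbol)
                stack.extend(first[symbol])
        if lhs in reachable:
            result.add(lhs)
    return result


# Direct left recursion, as in the rule added above.
direct = {
    "Nominal": [["Noun"], ["Noun", "Nominal"], ["Nominal", "PP"]],
    "Noun": [["flight"]],
}
# Indirect left recursion through a possessive determiner.
indirect = {
    "NP": [["Det", "Nominal"]],
    "Det": [["NP", "'s"], ["the"]],
    "Nominal": [["Noun"]],
    "Noun": [["flight"]],
}

print(left_recursive(direct))
print(left_recursive(indirect))
```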





## Problems with top-down parsing

- **Problem 1:** Left recursive rules lead to infinite recursion
  > Nominal $\rightarrow$ Nominal PP  
  > NP $\rightarrow$ NP PP
  
  - Could be indirect left recursion:
  > NP $\rightarrow$ Det Nominal  
  > Det $\rightarrow$ NP 's
  
![figs/recursion.png](figs/recursion.png)
<br><br>
- **Problem 2:** Ambiguity leads to many valid trees.
![figs/elephant.png](figs/elephant.png)

![figs/ambiguous.png](figs/ambiguous.png)

> Teller Stuns Man with Stolen Check

> Yoko Ono will talk about her husband John Lennon who was killed in an interview with Barbara Walters. 

> Tuna Biting Off Washington Coast 

>  Killer Sentenced to Die for Second Time in 10 Years 

> Hospitals are Sued by 7 Foot Doctors

<br><br>
- **Problem 3:** Repeated work
  - Many subtrees are repeated due to backtracking
  - E.g., "in my pajamas" above
  - Can take exponential time in sentence length
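
As a rough illustration of how quickly ambiguity can grow, assume an idealized grammar that licenses every binary bracketing of a phrase (real grammars are more constrained, so this is an upper-bound style sketch rather than a count for any grammar above). The number of binary trees over $n$ leaves follows the Catalan numbers. The function name is hypothetical; `functools.lru_cache` ensures each subproblem is computed only once, which is exactly the repeated work a backtracking parser redoes.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_binary_trees(n_leaves):
    """Number of distinct binary bracketings of a sequence of n_leaves words."""
    if n_leaves == 1:
        return 1
    # Split the span into a left part of k words and a right part of the rest.
    return sum(count_binary_trees(k) * count_binary_trees(n_leaves - k)
               for k in range(1, n_leaves))


for n in (2, 4, 6, 10, 20):
    print(n, count_binary_trees(n))
```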

<br><br>

## Bottom-up parsing

- Start with input words, and build trees from words up
- Valid parse if end at root symbol $S$

Example: Recall this grammar

![figs/grammar2.png](figs/grammar2.png)


To start
- Lookup each word in lexicon
- Build partial trees for all valid parts of speech for each word.

![figs/bu1.png](figs/bu1.png)

- Proceed by searching for rules whose right-hand side fits
  - In contrast to top-down parsing, which expands each rule from its left-hand side

![figs/bu2.png](figs/bu2.png)

![figs/bu3.png](figs/bu3.png)

![figs/bu4.png](figs/bu4.png)


<br><br>

## Shift-reduce parsing
- A common bottom-up implementation
- To start, put entire sentence in an "input buffer"
- Two operations:
  1. **Shift:** push the next input symbol from buffer onto a stack
  2. **Reduce:** If some rule's RHS is on top of the stack:
    - Pop the RHS off the stack
    - Replace it with the nonterminal on the LHS of the rule

Decision points:
- Sometimes, either a shift or a reduce operation is possible at the same time.
- Multiple rules may match RHS

- Pick one and remember other options for backtracking

![figs/sr.png](figs/sr.png)

In [31]:
sr_parser = nltk.ShiftReduceParser(grammar, trace=2)
# See warning: this implementation will always pick one rule if they are ambiguous
sent = 'book that flight'.split()
[str(p) for p in sr_parser.parse(sent)]

# This implementation fails!

Parsing 'book that flight'
    [ * book that flight]
  S [ 'book' * that flight]
  R [ Noun * that flight]
  R [ Nominal * that flight]
  S [ Nominal 'that' * flight]
  R [ Nominal Det * flight]
  S [ Nominal Det 'flight' * ]
  R [ Nominal Det Noun * ]
  R [ Nominal Det Nominal * ]
  R [ Nominal NP * ]


[]

In [21]:
# Removed these two rules (the sources of ambiguity) so that this implementation works:
#   VP -> Verb
#   Noun -> 'book'

grammar_sr = nltk.CFG.fromstring("""
  S -> NP VP | Aux NP VP | VP
  NP -> Det Nominal
  Nominal -> Noun | Noun Nominal
  NP -> ProperNoun
  VP -> Verb NP
  Det -> 'that' | 'this' | 'a'
  Noun -> 'flight' | 'meal' | 'money'
  Verb -> 'include'| 'book' | 'prefer'
  Aux -> 'does'
  Prep -> 'from' | 'to' | 'on'
  ProperNoun -> 'Houston' | 'TWA'
  """)

sr_parser2 = nltk.ShiftReduceParser(grammar_sr, trace=2)
[str(p) for p in sr_parser2.parse(sent)]

Parsing 'book that flight'
    [ * book that flight]
  S [ 'book' * that flight]
  R [ Verb * that flight]
  S [ Verb 'that' * flight]
  R [ Verb Det * flight]
  S [ Verb Det 'flight' * ]
  R [ Verb Det Noun * ]
  R [ Verb Det Nominal * ]
  R [ Verb NP * ]
  R [ VP * ]
  R [ S * ]


['(S (VP (Verb book) (NP (Det that) (Nominal (Noun flight)))))']
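
NLTK's `ShiftReduceParser` does not backtrack, which is why the ambiguous grammar failed above. The sketch below is a from-scratch shift-reduce search that remembers the other options and backtracks, so it can use the original grammar unchanged (the unused `Prep` rules are omitted). The rule encoding and the function name are illustrative. `Noun -> book` is deliberately listed before `Verb -> book` so that the first reduce is the wrong one.

```python
RULES = [
    ("S", ("NP", "VP")), ("S", ("Aux", "NP", "VP")), ("S", ("VP",)),
    ("NP", ("Det", "Nominal")), ("NP", ("ProperNoun",)),
    ("Nominal", ("Noun",)), ("Nominal", ("Noun", "Nominal")),
    ("VP", ("Verb",)), ("VP", ("Verb", "NP")),
    ("Det", ("that",)), ("Det", ("this",)), ("Det", ("a",)),
    ("Noun", ("book",)), ("Noun", ("flight",)),
    ("Noun", ("meal",)), ("Noun", ("money",)),
    ("Verb", ("include",)), ("Verb", ("book",)), ("Verb", ("prefer",)),
    ("Aux", ("does",)),
    ("ProperNoun", ("Houston",)), ("ProperNoun", ("TWA",)),
]


def shift_reduce(stack, buffer):
    """Return a list of actions that reduces the input to S, or None if none exists."""
    if stack == ("S",) and not buffer:
        return []
    # Try every reduce whose right-hand side is on top of the stack.
    for lhs, rhs in RULES:
        if stack[-len(rhs):] == rhs:
            rest = shift_reduce(stack[:-len(rhs)] + (lhs,), buffer)
            if rest is not None:
                return [f"reduce {lhs} -> {' '.join(rhs)}"] + rest
    # If no reduce led to a parse, try shifting the next word.
    if buffer:
        rest = shift_reduce(stack + (buffer[0],), buffer[1:])
        if rest is not None:
            return [f"shift {buffer[0]}"] + rest
    return None  # dead end: the caller backtracks


actions = shift_reduce((), tuple("book that flight".split()))
for action in actions:
    print(action)
```

The search is exponential in the worst case; it only shows that backtracking removes the need to edit the grammar.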

<br><br>
    
### Problems with bottom-up and top-down parsing so far

- **Left recursive** rules (e.g., "Nom -> Nom Noun") led to infinite recursion in top-down parsing
- **Ambiguity** may lead to exponentially many trees
- **Duplication** of work by regenerating subtrees many times $\rightarrow$ exponential work


<br><br><br><br>

*Solution:* **Dynamic programming**

<br><br>

## Earley Parser

- Dynamic programming based top-down parser
- Avoids repeated work by solving each subproblem once and storing the result
<br><br>

- In a single left-to-right pass, it fills an array called a **chart** with $N+1$ entries ($N$ is the number of words in the input)

- For each word position, the chart contains a list of states storing the partial parses up to that position
- By end of sentence, chart encodes all possible valid parses


<br><br><br><br><br>


## States in Earley Parser
- Each state contains
  - A subtree for a single grammar rule
  - Information on the progress made in completing this subtree
  - Position of the tree w.r.t. input
  
  
**Dot notation**: Indicates progress made in completing the subtree

E.g., "Book that flight"

![../l03/figs/grammar.png](figs/grammar.png)

![../l03/figs/flight2.png](figs/flight2.png)

E.g., three states:

![figs/dot.png](figs/dot.png)

1. In [0,0], the first 0 indicates that the constituent of this state begins at 0 (the start of the input). The second 0 indicates where the dot is (also at the beginning).
2. The NP starts at position 1, a *Det* has already been parsed, and a *Nominal* is expected next.
3. A *VP* that spans the entire input has been parsed.

Can represent the same with a directed-acyclic graph:

![figs/dag.png](figs/dag.png)
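
One way to hold such a state in code is a small immutable record. The class below is an illustrative sketch, not NLTK's data structure; the three example states are reconstructed from the descriptions in the list above, so the exact rules are an assumption, and `*` stands in for the dot as in NLTK's traces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    lhs: str      # left-hand side of the grammar rule
    rhs: tuple    # right-hand side symbols
    dot: int      # how many right-hand side symbols have been recognized
    start: int    # input position where the constituent begins
    end: int      # input position of the dot

    def next_symbol(self):
        """Symbol immediately to the right of the dot, or None if complete."""
        return self.rhs[self.dot] if self.dot < len(self.rhs) else None

    def is_complete(self):
        return self.dot == len(self.rhs)

    def __str__(self):
        symbols = list(self.rhs)
        symbols.insert(self.dot, "*")  # '*' marks the dot position
        return f"{self.lhs} -> {' '.join(symbols)}, [{self.start}, {self.end}]"


states = [
    State("S", ("VP",), 0, 0, 0),
    State("NP", ("Det", "Nominal"), 1, 1, 2),
    State("VP", ("Verb", "NP"), 2, 0, 3),
]
for state in states:
    print(state)
```

Making the record frozen lets states be compared and stored in sets, which is convenient for avoiding duplicates in a chart.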



## Earley operators
- Algorithm proceeds in left-to-right fashion through states in chart.
- Can apply one of three operators:
  1. Predictor
  2. Scanner
  3. Completer

## Predictor

- Create new states based on top-down view
- Applied to any state that has a non-terminal immediately to the right of the dot, provided that non-terminal is not a part-of-speech category
- Results in one new state for each alternative expansion
- Added to the same chart entry as the generating state

> Apply Predictor to $S \rightarrow \cdot VP, [0, 0]$

> Add new states $VP \rightarrow \cdot Verb, [0,0]$ and $VP \rightarrow \cdot Verb$ $NP, [0, 0]$ to first chart entry.
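
A minimal sketch of the Predictor, assuming states are plain tuples `(lhs, rhs, dot, start, end)` and the grammar is a dictionary (both encodings are illustrative choices):

```python
GRAMMAR = {
    "S": [("VP",)],
    "VP": [("Verb",), ("Verb", "NP")],
    "NP": [("Det", "Nominal")],
}


def predictor(state, grammar):
    """Return one new state per expansion of the nonterminal after the dot."""
    lhs, rhs, dot, start, end = state
    expected = rhs[dot]
    # New states begin and end where the generating state's dot is.
    return [(expected, alternative, 0, end, end) for alternative in grammar[expected]]


new_states = predictor(("S", ("VP",), 0, 0, 0), GRAMMAR)
for state in new_states:
    print(state)
```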

## Scanner

- Applied to states with a POS category to right of dot.
- Creates a new state that moves the dot past the POS category
- Only parts of speech that are valid for the current input word are added to the chart.

> Apply Scanner to $VP \rightarrow \cdot Verb$ $NP, [0, 0]$

> Add new state to **second** chart entry  
> $VP \rightarrow Verb \cdot NP, [0, 1]$
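
A minimal sketch of the Scanner in the same spirit, with states as tuples `(lhs, rhs, dot, start, end)` and a hypothetical `LEXICON` mapping each word to its possible parts of speech:

```python
LEXICON = {
    "book": {"Noun", "Verb"},
    "that": {"Det"},
    "flight": {"Noun"},
}


def scanner(state, words, lexicon):
    """Advance the dot over a POS category if the next word can have that category."""
    lhs, rhs, dot, start, end = state
    category = rhs[dot]
    if end < len(words) and category in lexicon.get(words[end], set()):
        # The new state belongs in the next chart entry (position end + 1).
        return (lhs, rhs, dot + 1, start, end + 1)
    return None  # the word cannot be this part of speech


words = "book that flight".split()
print(scanner(("VP", ("Verb", "NP"), 0, 0, 0), words, LEXICON))
print(scanner(("NP", ("Det", "Nominal"), 0, 0, 0), words, LEXICON))
```

The second call returns `None` because "book" cannot be a determiner, so no state is added.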


## Completer

- Applied to a state that has a dot that has reached the right end of a rule
- Represents a successful parse for a constituent for a span of the input
- Finds and advances all previously created states that were looking for this constituent at this position.
- Create new states by copying older state, advancing dot, and adding to the current chart entry


> Apply Completer to $NP \rightarrow Det \; Nominal \cdot, [1,3]$  
> Completer looks for states ending at 1 that expect an $NP$  
> Creates new state $VP \rightarrow Verb \; NP \cdot, [0, 3]$ from $VP \rightarrow Verb \cdot NP, [0,1]$
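
A minimal sketch of the Completer, again with states as tuples `(lhs, rhs, dot, start, end)` and the chart as a list of lists indexed by input position (illustrative encodings):

```python
def completer(completed, chart):
    """Advance every state that was waiting for the constituent just completed."""
    lhs, _rhs, _dot, start, end = completed
    new_states = []
    # Waiting states live in the chart entry where the completed constituent began.
    for w_lhs, w_rhs, w_dot, w_start, _w_end in chart[start]:
        if w_dot < len(w_rhs) and w_rhs[w_dot] == lhs:
            # Copy the older state, move the dot, and extend it to the current position.
            new_states.append((w_lhs, w_rhs, w_dot + 1, w_start, end))
    return new_states


# Chart entry 1 holds a VP that is still waiting for its NP.
chart = [[], [("VP", ("Verb", "NP"), 1, 0, 1)], [], []]
print(completer(("NP", ("Det", "Nominal"), 2, 1, 3), chart))
```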

## Full example

![figs/chart.png](figs/chart.png)

Observations:

- Add a dummy state to start
- When $VP \rightarrow \cdot Verb$ $NP, [0,0]$ is processed, Scanner does not add a duplicate state.
- We don't include $Noun \rightarrow book$

<br><br>
- Chart[1] processed after all entries in Chart[0] are processed.
- Chart[1] created when Scanner applied to $VP \rightarrow \cdot Verb, [0, 0]$.
- Chart[2] created when Scanner applied to $NP \rightarrow \cdot Det$ $Nominal, [1,1]$
- Success found in Chart[3]: $S \rightarrow VP \cdot, [0, 3]$
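
Putting the three operators together gives a compact recognizer. This is a from-scratch sketch under simplifying assumptions, not NLTK's chart parser: states are tuples `(lhs, rhs, dot, start, end)`, the dummy start state is named `GAMMA`, words are looked up in a hypothetical `LEXICON`, and no back-pointers are stored, so it only answers whether a parse exists.

```python
GRAMMAR = {
    "S": [("NP", "VP"), ("Aux", "NP", "VP"), ("VP",)],
    "NP": [("Det", "Nominal"), ("ProperNoun",)],
    "Nominal": [("Noun",), ("Noun", "Nominal")],
    "VP": [("Verb",), ("Verb", "NP")],
}
LEXICON = {
    "that": {"Det"}, "this": {"Det"}, "a": {"Det"},
    "book": {"Noun", "Verb"}, "flight": {"Noun"}, "meal": {"Noun"},
    "money": {"Noun"}, "include": {"Verb"}, "prefer": {"Verb"},
    "does": {"Aux"}, "Houston": {"ProperNoun"}, "TWA": {"ProperNoun"},
}
POS = {"Det", "Noun", "Verb", "Aux", "ProperNoun"}


def earley(words):
    """Fill and return the chart: one list of states per input position."""
    chart = [[] for _ in range(len(words) + 1)]

    def add(state, i):
        if state not in chart[i]:  # never store a duplicate state
            chart[i].append(state)

    add(("GAMMA", ("S",), 0, 0, 0), 0)  # dummy start state
    for i in range(len(words) + 1):
        for state in chart[i]:  # chart[i] may grow while we iterate over it
            lhs, rhs, dot, start, _end = state
            if dot == len(rhs):  # Completer
                for w_lhs, w_rhs, w_dot, w_start, _ in list(chart[start]):
                    if w_dot < len(w_rhs) and w_rhs[w_dot] == lhs:
                        add((w_lhs, w_rhs, w_dot + 1, w_start, i), i)
            elif rhs[dot] in POS:  # Scanner
                if i < len(words) and rhs[dot] in LEXICON.get(words[i], ()):
                    add((lhs, rhs, dot + 1, start, i + 1), i + 1)
            else:  # Predictor
                for alternative in GRAMMAR[rhs[dot]]:
                    add((rhs[dot], alternative, 0, i, i), i)
    return chart


def recognizes(sentence):
    """True if a completed S spanning the whole input is in the last chart entry."""
    chart = earley(sentence.split())
    return any(lhs == "S" and dot == len(rhs) and start == 0
               for lhs, rhs, dot, start, _ in chart[-1])


print(recognizes("book that flight"))
```

A plain list is used for each chart entry so that states appended during processing are still visited by the loop.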


## To retrieve parse from chart
- Change Completer to add a pointer to older state when a new state is created
- To retrieve parse, recursively follow pointers, starting at $S$

![figs/chart2.png](figs/chart2.png)

![figs/parse.png](figs/parse2.png)

#### Sources

- https://www.cs.colorado.edu/~martin/SLP/
- https://people.cs.umass.edu/~mccallum/courses/inlp2007/lect5-cfg.pdf
