In [1]:
# Please do not change this cell because some hidden tests might depend on it.
import os

# Otter grader does not handle ! commands well, so we define and use our
# own function to execute shell commands.
def shell(commands, warn=True):
    """Executes the string `commands` as a sequence of shell commands.
     
       Prints the result to stdout and returns the exit status. 
       Provides a printed warning on non-zero exit status unless `warn` 
       flag is unset.
    """
    file = os.popen(commands)
    print (file.read().rstrip('\n'))
    exit_status = file.close()
    if warn and exit_status != None:
        print(f"Completed with errors. Exit status: {exit_status}\n")
    return exit_status

shell("""
ls requirements.txt >/dev/null 2>&1
if [ ! $? = 0 ]; then
 rm -rf .tmp
 git clone https://github.com/cs187-2021/lab3-3.git .tmp
 mv .tmp/tests ./
 mv .tmp/requirements.txt ./
 rm -rf .tmp
fi
pip install -q -r requirements.txt
""")




In [2]:
# Initialize Otter
import otter
grader = otter.Notebook()

$$
\renewcommand{\vect}[1]{\mathbf{#1}}
\renewcommand{\cnt}[1]{\sharp(#1)}
\renewcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\renewcommand{\softmax}{\operatorname{softmax}}
\renewcommand{\Prob}{\Pr}
\renewcommand{\given}{\,|\,}
$$

# CS187
## Lab 3-3 - Probabilistic context-free grammars

In previous labs, you have practiced constituency parsing using context-free grammars with the CKY parsing algorithm. In this lab you will extend this framework to a probabilistic one, probabilistic context-free grammars (PCFG).

New bits of Python used for the first time in the _solution set_ for this lab, and which you may therefore find useful:

* [`math.prod`](https://docs.python.org/3/library/math.html#math.prod)
* [`nltk.tree.Tree.productions`](https://www.nltk.org/api/nltk.html?highlight=production#nltk.tree.Tree.productions)

# Preparations {-}

In [3]:
import copy
import math
import nltk
import operator
import pandas as pd

from collections import Counter, defaultdict
from pprint import pprint

# Syntactic ambiguity

Let's start with the following simplified grammar for arithmetic word expressions from the last lab:

In [4]:
arithmetic_grammar = nltk.CFG.fromstring("""
    S -> NUM | S OP S
    OP -> ADD | MULT

    NUM -> 'zero' | 'one' | 'two' | 'three' | 'four' | 'five'
    NUM -> 'six' | 'seven' | 'eight' | 'nine' | 'ten' 
    
    ADD -> 'plus'
    MULT -> 'times'
""")

As a running example throughout this lab, we'll use the example phrase "two plus three times four".

In [5]:
example = "two plus three times four"

We can use the given CFG to parse this example phrase and print the possible parse trees.

In [6]:
parser = nltk.parse.BottomUpChartParser(arithmetic_grammar)
parses = list(parser.parse(example.split()))

for i, tree in enumerate(parses):
  print(f"Parse {i+1}:\n")
  tree.pretty_print()

Parse 1:

           S             
      _____|__________    
     S           |    |  
  ___|_____      |    |   
 S   OP    S     OP   S  
 |   |     |     |    |   
NUM ADD   NUM   MULT NUM 
 |   |     |     |    |   
two plus three times four

Parse 2:

           S             
  _________|_____         
 |   |           S       
 |   |      _____|____    
 S   OP    S     OP   S  
 |   |     |     |    |   
NUM ADD   NUM   MULT NUM 
 |   |     |     |    |   
two plus three times four



Each parse tree represents a structured arithmetic expression (the _abstract syntax_ of the concrete expression,  for those of you with CS51 backgrounds). Manually calculate the value of the resulting equation for each of the parse trees.

<!--
BEGIN QUESTION
name: parsed_equation_result
-->

In [11]:
#TODO
result_tree1 = 20
result_tree2 = 14

In [12]:
grader.check("parsed_equation_result")

We got two different parse trees for this simple expression. The occurrence of different structural interpretations of the same text is called _structural ambiguity_ or _syntactic ambiguity_. Since natural language is oftentimes ambiguous, this is a very real concern.

In this particular case, the two syntactic structures corresponded to two different semantic values. As an exercise, try to construct an ambiguous expression (name it `pseudo_ambiguous`) such that all of its parse trees correspond to the same value, thereby demonstrating that not all structural ambiguity leads to semantic ambiguity.

<!--
BEGIN QUESTION
name: redundant_parses
-->

In [13]:
# TODO - construct an ambiguous expression such that all of its parse
# trees correspond to the same value. `pseudo_ambiguous` should be
# a string.
pseudo_ambiguous = "one plus one times one"

In [14]:
grader.check("redundant_parses")
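
The difference between structural and semantic ambiguity can also be checked mechanically. The sketch below is a plain-Python illustration, independent of NLTK: the names `WORD_VALUES`, `OPS`, and `all_values` are hypothetical helpers, not part of the lab. It enumerates every binary bracketing of a word expression and collects the value of each one.

```python
# Hypothetical helpers for illustration only; not part of the lab's code.
WORD_VALUES = {word: value for value, word in enumerate(
    "zero one two three four five six seven eight nine ten".split())}
OPS = {"plus": lambda a, b: a + b, "times": lambda a, b: a * b}

def all_values(tokens):
    """Return one value per binary bracketing of the token list."""
    if len(tokens) == 1:
        return [WORD_VALUES[tokens[0]]]
    values = []
    # Operators sit at odd positions; try each one as the top-level split.
    for i in range(1, len(tokens), 2):
        op = OPS[tokens[i]]
        for left in all_values(tokens[:i]):
            for right in all_values(tokens[i + 1:]):
                values.append(op(left, right))
    return values

ambiguous_values = all_values("two plus three times four".split())
pseudo_values = all_values("one plus one times one".split())
print(ambiguous_values)  # two bracketings, two different values
print(pseudo_values)     # two bracketings, one shared value
```

Each list has one entry per bracketing, so a list with repeated entries signals structural ambiguity without semantic ambiguity.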

One approach to dealing with the issue of syntactic ambiguity is by defining a scoring system to score the possible parses and choosing the highest scoring tree. We will see how this can be done by taking a probabilistic approach to CFG.

# Probabilistic context-free grammars

To assign probabilities to strings, we will use a probabilistic context-free grammar (PCFG), a CFG in which each rule is augmented with a probability. A PCFG rule will be notated
$$A \to \beta\ [p]$$
where $A$ is a nonterminal, $\beta$ is a sequence of terminals and nonterminals, and $p$ is a probability associated with the rule.

We'll write $\Prob(\beta \given A)$ for the probability associated with the rule $A \to \beta$.

To constitute a valid probability distribution we require that for every nonterminal $A$
$$\sum_{A \to \beta \in \cal{P}} \Prob(\beta \given A) = 1$$
where $\cal{P}$ is the set of CFG productions of the grammar. That is, the probabilities associated with all rules with the same left-hand side must sum to one.
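
As a rough illustration of this constraint, the following sketch checks it on a small rule table. The table layout (a dictionary keyed by `(lhs, rhs)` tuples) and the names `lhs_totals` and `is_normalized` are assumptions made for the example; NLTK represents grammars differently.

```python
import math
from collections import defaultdict

# Hypothetical rule table: (lhs, rhs) -> probability.
rules = {
    ("S", ("NUM",)): 0.5,
    ("S", ("S", "OP", "S")): 0.5,
    ("OP", ("ADD",)): 0.5,
    ("OP", ("MULT",)): 0.5,
    ("ADD", ("plus",)): 1.0,
    ("MULT", ("times",)): 1.0,
}

def lhs_totals(rule_table):
    """Sum the rule probabilities separately for each left-hand side."""
    totals = defaultdict(float)
    for (lhs, _rhs), prob in rule_table.items():
        totals[lhs] += prob
    return dict(totals)

def is_normalized(rule_table, tol=1e-9):
    """True if every left-hand side's probabilities sum to 1 (within tol)."""
    return all(math.isclose(total, 1.0, abs_tol=tol)
               for total in lhs_totals(rule_table).values())

totals = lhs_totals(rules)
normalized = is_normalized(rules)
print(totals, normalized)
```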

Define `probabilistic_arithmetic_grammar` to be a probabilistic version of `arithmetic_grammar` above, where the nonterminal probability distributions are **as uniform across the productions as possible**.

> You'll use the NLTK `nltk.PCFG.fromstring` function, which allows you to add the probabilities in brackets after each right-hand side, just as we've been doing above. For example, to give `NUM -> 'zero'` a probability of 0.5, use `NUM -> 'zero' [0.5]`.
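
A uniform distribution over $n$ alternatives assigns each one probability $1/n$, which often has to be rounded. The sketch below (the helper name `uniform_prob` is hypothetical) rounds $1/n$ to three significant figures and shows how far the rounded values fall short of summing to one.

```python
def uniform_prob(n_alternatives, sig_figs=3):
    """Return 1/n rounded to the given number of significant figures."""
    return float(f"{1 / n_alternatives:.{sig_figs}g}")

for n in (2, 3, 9, 11):
    p = uniform_prob(n)
    # n * p is the total mass after rounding; it can be slightly below 1.
    print(f"n={n:2d}  p={p}  total={n * p:.4f}")
```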

<!--
BEGIN QUESTION
name: uniform_probabilities
-->

In [19]:
# TODO - define `probabilistic_arithmetic_grammar`. Round to
#        *3* significant figures if not divisible.

probabilistic_arithmetic_grammar = nltk.PCFG.fromstring("""
    S -> NUM [0.5]| S OP S [0.5]
    OP -> ADD [0.5]| MULT [0.5]

    NUM -> 'zero' [0.0909] | 'one' [0.0909]| 'two' [0.0909]| 'three' [0.0909]| 'four' [0.0909]| 'five' [0.0909]
    NUM -> 'six' [0.0909]| 'seven' [0.0909]| 'eight' [0.0909]| 'nine' [0.0909]| 'ten' [0.0909]
    
    ADD -> 'plus' [1]
    MULT -> 'times' [1]
""")

In [20]:
grader.check("uniform_probabilities")

We can use the [nltk.CFG.productions()](https://www.nltk.org/api/nltk.html?highlight=production#nltk.grammar.CFG.productions) method to get a list of the PCFG's productions:

In [21]:
probabilistic_arithmetic_grammar.productions()

[S -> NUM [0.5],
 S -> S OP S [0.5],
 OP -> ADD [0.5],
 OP -> MULT [0.5],
 NUM -> 'zero' [0.0909],
 NUM -> 'one' [0.0909],
 NUM -> 'two' [0.0909],
 NUM -> 'three' [0.0909],
 NUM -> 'four' [0.0909],
 NUM -> 'five' [0.0909],
 NUM -> 'six' [0.0909],
 NUM -> 'seven' [0.0909],
 NUM -> 'eight' [0.0909],
 NUM -> 'nine' [0.0909],
 NUM -> 'ten' [0.0909],
 ADD -> 'plus' [1.0],
 MULT -> 'times' [1.0]]

Each of the productions in the list is an instance of the [ProbabilisticProduction](https://www.nltk.org/api/nltk.html?highlight=production#nltk.grammar.ProbabilisticProduction) class. Each such instance is defined by three parameters: its left-hand side (`lhs`), right-hand side (`rhs`), and rule probability (`prob`). These can be accessed separately through the methods of the same names:

In [22]:
## Extract the second rule
pprod_example = probabilistic_arithmetic_grammar.productions()[1]

## Display its various components
print(f'For the production "{pprod_example}":\n' 
      f'left hand side of the rule is {pprod_example.lhs()}\n'
      f'right hand side of the rule is {pprod_example.rhs()}\n'
      f'probability of the rule is {pprod_example.prob()}')

For the production "S -> S OP S [0.5]":
left hand side of the rule is S
right hand side of the rule is (S, OP, S)
probability of the rule is 0.5


For non-probabilistic grammars, the class of productions is [Production](https://www.nltk.org/api/nltk.html?highlight=production#nltk.grammar.Production), which doesn't have a probability attribute and is only defined by its lhs and rhs attributes:

In [23]:
print(f'PCFG production: {probabilistic_arithmetic_grammar.productions()[1]} \n'
      f'      vs.\n'
      f'CFG production:  {arithmetic_grammar.productions()[1]}') 

PCFG production: S -> S OP S [0.5] 
      vs.
CFG production:  S -> S OP S


# Parse tree probabilities

To use a PCFG to select among parse trees, we need to be able to calculate the probability of a parse tree as specified by the PCFG. We take the probability of a parse tree to be the product, over every constituent in the tree, of the probability of the rule used to expand that constituent.

You'll use the PCFG `probabilistic_arithmetic_grammar` to calculate the probability of each of the parse trees in `parses`, the list of trees that were parsed from the `example` sentence. 

To do that, you'll need to get all the productions used in a parse tree (using the [productions](https://www.nltk.org/api/nltk.html?highlight=production#nltk.tree.Tree.productions) method), find their probabilities, and multiply them together.

First, we will create a dictionary from the PCFG, so that we can easily access the rule probabilities. Write a function which accepts a PCFG and returns a dictionary whose keys are the CFG (not PCFG) productions and values are the associated probabilities. 

> To construct a CFG production from a PCFG production, you can use `nltk.grammar.Production(production.lhs(), production.rhs())`.

<!--
BEGIN QUESTION
name: pcfg_to_dict
-->

In [34]:
#TODO - returns a dictionary whose keys are `nltk.grammar.Production` objects
#       and whose values are the associated probabilities
def pcfg_to_dict(pcfg):
    pcfgDict = {}
    # Key each entry by the plain CFG production (lhs and rhs only)
    for prod in pcfg.productions():
        pcfgDict[nltk.grammar.Production(prod.lhs(), prod.rhs())] = prod.prob()
    return pcfgDict

In [38]:
# `productions()` returns a list, so pick out a single production first
production = probabilistic_arithmetic_grammar.productions()[0]
print(production)
cfgProduction = nltk.grammar.Production(production.lhs(), production.rhs())
print(cfgProduction)

S -> NUM [0.5]
S -> NUM

In [37]:
grader.check("pcfg_to_dict")

We can use the function you wrote to convert `probabilistic_arithmetic_grammar` to a dictionary and inspect it to make sure it's working.

In [39]:
pprint(pcfg_to_dict(probabilistic_arithmetic_grammar))

{ADD -> 'plus': 1.0,
 MULT -> 'times': 1.0,
 NUM -> 'eight': 0.0909,
 NUM -> 'five': 0.0909,
 NUM -> 'four': 0.0909,
 NUM -> 'nine': 0.0909,
 NUM -> 'one': 0.0909,
 NUM -> 'seven': 0.0909,
 NUM -> 'six': 0.0909,
 NUM -> 'ten': 0.0909,
 NUM -> 'three': 0.0909,
 NUM -> 'two': 0.0909,
 NUM -> 'zero': 0.0909,
 OP -> ADD: 0.5,
 OP -> MULT: 0.5,
 S -> NUM: 0.5,
 S -> S OP S: 0.5}


Now for the payoff: Write a function that takes a parse tree and a PCFG and returns the probability of the parse tree according to the PCFG. The `pcfg_to_dict` function you just wrote is likely to come in handy.

> Note that we are asking for the probability (not the log probability). We **don't work in log space** in this lab for simplicity, but for parse trees of longer sentences (which you'll see in the project) you might have to work in the log space to avoid underflows.
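
To see why log space matters for longer derivations, here is a small sketch. The list of 400 identical rule probabilities is a made-up stand-in for a long derivation, not data from this lab.

```python
import math

# Hypothetical long derivation: 400 rule applications of probability 0.0909.
rule_probabilities = [0.0909] * 400

# Multiplying directly underflows: the true value is far below the
# smallest positive float, so the product collapses to 0.0.
direct = math.prod(rule_probabilities)

# Summing logs keeps the same information as a moderate negative number.
log_prob = sum(math.log(p) for p in rule_probabilities)

print(direct)
print(log_prob)
```

Because the logarithm is monotonic, comparing log probabilities picks out the same best parse as comparing the probabilities themselves.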

<!--
BEGIN QUESTION
name: parsed_trees_probs
-->

In [44]:
# TODO: returns the probability of the parse tree.
# `tree.productions()` might be useful for getting the
#  productions of a parse tree
def parse_probability(tree, pcfg):
    pcfgDict = pcfg_to_dict(pcfg)
    prob = 1
    for key in tree.productions():
        prob *= pcfgDict[key]
    return prob

In [45]:
grader.check("parsed_trees_probs")

We'll use it to calculate and print out the probability of each parse tree.

In [46]:
for i, tree in enumerate(parses):
    print(f'Probability of parse tree {i+1} is '
          f'{parse_probability(tree, probabilistic_arithmetic_grammar):1.2e}')
    tree.pretty_print()

Probability of parse tree 1 is 5.87e-06
           S             
      _____|__________    
     S           |    |  
  ___|_____      |    |   
 S   OP    S     OP   S  
 |   |     |     |    |   
NUM ADD   NUM   MULT NUM 
 |   |     |     |    |   
two plus three times four

Probability of parse tree 2 is 5.87e-06
           S             
  _________|_____         
 |   |           S       
 |   |      _____|____    
 S   OP    S     OP   S  
 |   |     |     |    |   
NUM ADD   NUM   MULT NUM 
 |   |     |     |    |   
two plus three times four



<!-- BEGIN QUESTION -->

**Question:** Which of the trees is the most probable parse? Explain why. If the two have the same probability, explain why that is the case instead, and describe how you might adjust the rule probabilities if possible so that they have different probabilities.

<!--
BEGIN QUESTION
name: open_response_ambiguity
manual: true
-->

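The two trees have the same probability because they are built from the same multiset of productions: each uses `S -> S OP S` twice, `S -> NUM` three times, `OP -> ADD` and `OP -> MULT` once each, and the same five lexical rules. The product of the rule probabilities is therefore identical for both trees. For the same reason, no reassignment of the rule probabilities in this grammar can give the two trees different probabilities; the grammar itself would have to change (for example, by splitting the `S` nonterminal) before the parses could be told apart.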

<!-- END QUESTION -->
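
As a hand check of the value printed for both trees, the product can be written out rule by rule. The factors below are, in order, `S -> S OP S` (twice), `S -> NUM` (three times), `OP -> ADD`, `OP -> MULT`, the three `NUM` rules, and the two operator word rules, using the uniform probabilities defined earlier:

\begin{align}
\Prob(\text{tree})
  &= 0.5^2 \cdot 0.5^3 \cdot 0.5 \cdot 0.5 \cdot 0.0909^3 \cdot 1 \cdot 1 \\
  &= 0.5^7 \cdot 0.0909^3 \\
  &\approx 0.0078125 \cdot 0.000751 \\
  &\approx 5.87 \times 10^{-6}
\end{align}

The order in which the rules are applied differs between the two trees, but multiplication is commutative, so the result does not.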



# Lexicalizing the grammar

In order to allow parse probabilities to be more sensitive to contexts, it turns out to be useful to _lexicalize_ the grammar -- splitting (some of the) nonterminals based on what particular words they dominate. There are many techniques for performing this lexicalization. For this grammar, we'll split the `S` nonterminal based on the main operator that it dominates (if any). We'll thus have nonterminals `S_ADD`, `S_MULT`, and `S_NUM`. Thus, instead of a rule `S -> S OP S`, we'll have rules like:

```
S_ADD -> S_NUM ADD S_NUM
S_ADD -> S_NUM ADD S_ADD
S_ADD -> S_NUM ADD S_MULT
S_ADD -> S_ADD ADD S_NUM
``` 
and so forth. By splitting the nonterminals (and hence the productions) in this way, we can assign different probabilities to cases where, for instance, the primary operator on the left is a number, or addition, or multiplication.

Here is the lexicalized grammar:

In [57]:
lexicalized_arithmetic_grammar = nltk.CFG.fromstring( 
    """
    S -> S_NUM | S_ADD | S_MULT

    S_NUM -> NUM

    S_ADD -> S_NUM ADD S_NUM
    S_ADD -> S_NUM ADD S_ADD
    S_ADD -> S_NUM ADD S_MULT
    S_ADD -> S_ADD ADD S_NUM
    S_ADD -> S_ADD ADD S_ADD
    S_ADD -> S_ADD ADD S_MULT
    S_ADD -> S_MULT ADD S_NUM
    S_ADD -> S_MULT ADD S_ADD
    S_ADD -> S_MULT ADD S_MULT

    S_MULT -> S_NUM MULT S_NUM
    S_MULT -> S_NUM MULT S_ADD
    S_MULT -> S_NUM MULT S_MULT
    S_MULT -> S_ADD MULT S_NUM
    S_MULT -> S_ADD MULT S_ADD
    S_MULT -> S_ADD MULT S_MULT
    S_MULT -> S_MULT MULT S_NUM
    S_MULT -> S_MULT MULT S_ADD
    S_MULT -> S_MULT MULT S_MULT

    NUM -> 'zero'   | 'one'    | 'two'
    NUM -> 'three'  | 'four'   | 'five'
    NUM -> 'six'    | 'seven'  | 'eight'
    NUM -> 'nine'   | 'ten'

    ADD -> 'plus'
    MULT -> 'times'
    """ 
)
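
The eighteen ternary rules in this grammar follow a regular pattern: each operator nonterminal combines every possible left child with every possible right child. A short sketch of how such a rule list could be generated rather than typed by hand (the names `CHILD_LABELS`, `OPERATORS`, and `lexicalized_rules` are hypothetical):

```python
from itertools import product

# Hypothetical names; the labels themselves come from the grammar above.
CHILD_LABELS = ["S_NUM", "S_ADD", "S_MULT"]
OPERATORS = {"S_ADD": "ADD", "S_MULT": "MULT"}

def lexicalized_rules():
    """Build every 'PARENT -> LEFT OP RIGHT' rule as a string."""
    rules = []
    for parent, op in OPERATORS.items():
        for left, right in product(CHILD_LABELS, repeat=2):
            rules.append(f"{parent} -> {left} {op} {right}")
    return rules

rules = lexicalized_rules()
print(len(rules))
print("\n".join(rules[:4]))
```

With nine rules per operator nonterminal, a uniform distribution gives each of them probability $1/9 \approx 0.111$.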

<!-- BEGIN QUESTION -->

Use this grammar to parse the example phrase ("two plus three times four") defined as `example` above.

<!--
BEGIN QUESTION
name: lexicalized_parse
manual: true
-->

In [58]:
# TODO - parse `example` using the lexicalized grammar. `lexicalized_parses`
#        should be a list of parses.
parser = nltk.parse.BottomUpChartParser(lexicalized_arithmetic_grammar)
lexicalized_parses = list(parser.parse(example.split()))

In [59]:
grader.check("lexicalized_parse")

<!-- END QUESTION -->



Examine the trees, and make sure that you understand why they look the way they do. Notice that because of the lexicalization, the highest `S_` node corresponds to the highest operator in the parse -- `S_MULT` when `MULT` is the highest operator and `S_ADD` when `ADD` is the highest operator.

In [60]:
for i, tree in enumerate(lexicalized_parses):
  print(f"Possible parse {i+1}:\n")
  tree.pretty_print()

Possible parse 1:

              S               
              |                
            S_MULT            
         _____|____________    
      S_ADD          |     |  
   _____|_____       |     |   
S_NUM   |   S_NUM    |   S_NUM
  |     |     |      |     |   
 NUM   ADD   NUM    MULT  NUM 
  |     |     |      |     |   
 two   plus three  times  four

Possible parse 2:

             S               
             |                
           S_ADD             
   __________|_____           
  |    |         S_MULT      
  |    |      _____|______    
S_NUM  |   S_NUM   |    S_NUM
  |    |     |     |      |   
 NUM  ADD   NUM   MULT   NUM 
  |    |     |     |      |   
 two  plus three times   four



We can augment this grammar with probabilities as well.

Again, do so making the probabilities as uniform as possible.
<!--
BEGIN QUESTION
name: uniform_lexicalized_probabilities
-->

In [61]:
# TODO - define `probabilistic_lexicalized_arithmetic_grammar`.
#        Round to *3* significant figures if not divisible.
probabilistic_lexicalized_arithmetic_grammar = nltk.PCFG.fromstring( 
    """
    S -> S_NUM [0.333]| S_ADD [0.333]| S_MULT [0.333]

    S_NUM -> NUM [1]

    S_ADD -> S_NUM ADD S_NUM [0.111]
    S_ADD -> S_NUM ADD S_ADD [0.111]
    S_ADD -> S_NUM ADD S_MULT [0.111]
    S_ADD -> S_ADD ADD S_NUM [0.111]
    S_ADD -> S_ADD ADD S_ADD [0.111]
    S_ADD -> S_ADD ADD S_MULT [0.111]
    S_ADD -> S_MULT ADD S_NUM [0.111]
    S_ADD -> S_MULT ADD S_ADD [0.111]
    S_ADD -> S_MULT ADD S_MULT [0.111]

    S_MULT -> S_NUM MULT S_NUM [0.111]
    S_MULT -> S_NUM MULT S_ADD [0.111]
    S_MULT -> S_NUM MULT S_MULT [0.111]
    S_MULT -> S_ADD MULT S_NUM [0.111]
    S_MULT -> S_ADD MULT S_ADD [0.111]
    S_MULT -> S_ADD MULT S_MULT [0.111]
    S_MULT -> S_MULT MULT S_NUM [0.111]
    S_MULT -> S_MULT MULT S_ADD [0.111]
    S_MULT -> S_MULT MULT S_MULT [0.111]

    NUM -> 'zero' [0.0909]  | 'one' [0.0909]   | 'two' [0.0909]
    NUM -> 'three' [0.0909] | 'four' [0.0909]  | 'five' [0.0909]
    NUM -> 'six'  [0.0909]  | 'seven' [0.0909] | 'eight' [0.0909]
    NUM -> 'nine' [0.0909]  | 'ten' [0.0909]

    ADD -> 'plus' [1]
    MULT -> 'times' [1]
    """ 
)

In [62]:
grader.check("uniform_lexicalized_probabilities")

In [64]:
pcfg_to_dict(probabilistic_lexicalized_arithmetic_grammar)

{S -> S_NUM: 0.333,
 S -> S_ADD: 0.333,
 S -> S_MULT: 0.333,
 S_NUM -> NUM: 1.0,
 S_ADD -> S_NUM ADD S_NUM: 0.111,
 S_ADD -> S_NUM ADD S_ADD: 0.111,
 S_ADD -> S_NUM ADD S_MULT: 0.111,
 S_ADD -> S_ADD ADD S_NUM: 0.111,
 S_ADD -> S_ADD ADD S_ADD: 0.111,
 S_ADD -> S_ADD ADD S_MULT: 0.111,
 S_ADD -> S_MULT ADD S_NUM: 0.111,
 S_ADD -> S_MULT ADD S_ADD: 0.111,
 S_ADD -> S_MULT ADD S_MULT: 0.111,
 S_MULT -> S_NUM MULT S_NUM: 0.111,
 S_MULT -> S_NUM MULT S_ADD: 0.111,
 S_MULT -> S_NUM MULT S_MULT: 0.111,
 S_MULT -> S_ADD MULT S_NUM: 0.111,
 S_MULT -> S_ADD MULT S_ADD: 0.111,
 S_MULT -> S_ADD MULT S_MULT: 0.111,
 S_MULT -> S_MULT MULT S_NUM: 0.111,
 S_MULT -> S_MULT MULT S_ADD: 0.111,
 S_MULT -> S_MULT MULT S_MULT: 0.111,
 NUM -> 'zero': 0.0909,
 NUM -> 'one': 0.0909,
 NUM -> 'two': 0.0909,
 NUM -> 'three': 0.0909,
 NUM -> 'four': 0.0909,
 NUM -> 'five': 0.0909,
 NUM -> 'six': 0.0909,
 NUM -> 'seven': 0.0909,
 NUM -> 'eight': 0.0909,
 NUM -> 'nine': 0.0909,
 NUM -> 'ten': 0.0909,
 ADD -> 'plus'

Using this PCFG, we can calculate the probabilities associated with the two parses of the example phrase.

In [63]:
for i, tree in enumerate(lexicalized_parses):
    print(f'Probability of parsed tree {i+1} is '
          f'{parse_probability(tree, probabilistic_lexicalized_arithmetic_grammar):1.2e}')
    tree.pretty_print()

Probability of parsed tree 1 is 3.08e-06
              S               
              |                
            S_MULT            
         _____|____________    
      S_ADD          |     |  
   _____|_____       |     |   
S_NUM   |   S_NUM    |   S_NUM
  |     |     |      |     |   
 NUM   ADD   NUM    MULT  NUM 
  |     |     |      |     |   
 two   plus three  times  four

Probability of parsed tree 2 is 3.08e-06
             S               
             |                
           S_ADD             
   __________|_____           
  |    |         S_MULT      
  |    |      _____|______    
S_NUM  |   S_NUM   |    S_NUM
  |    |     |     |      |   
 NUM  ADD   NUM   MULT   NUM 
  |    |     |     |      |   
 two  plus three times   four



Make sure that you understand why the parse probabilities are the way they are. Call over a staff member for a quick check.

# Estimating rule probabilities from a corpus

In the previous section, you received a CFG augmented with rule probabilities that were arbitrarily stipulated. But where should rule probabilities come from? One way to generate rule probabilities is to learn them from a training corpus.

In this section you will use a toy corpus of sentences parsed according to the lexicalized grammar to generate maximum likelihood estimates of rule probabilities by counting the number of occurrences of a rule used in the corpus.

In [66]:
## The raw corpus, before splitting into separate phrases
corpus_raw = """
    # seven
    (S (S_NUM (NUM seven)))
    # one plus two
    (S (S_ADD (S_NUM (NUM one)) (ADD plus) (S_NUM (NUM two))))
    # two times three
    (S (S_MULT (S_NUM (NUM two)) (MULT times) (S_NUM (NUM three))))
    # two plus six times one
    (S (S_ADD (S_NUM (NUM two)) (ADD plus) (S_MULT (S_NUM (NUM six)) (MULT times) (S_NUM (NUM one)))))
    # eight plus three plus seven
    (S (S_ADD (S_ADD (S_NUM (NUM eight)) (ADD plus) (S_NUM (NUM three))) (ADD plus) (S_NUM (NUM seven))))
    # two plus three times four
    (S (S_ADD (S_NUM (NUM two)) (ADD plus) (S_MULT (S_NUM (NUM three)) (MULT times) (S_NUM (NUM four)))))
    # eight times four times two
    (S (S_MULT (S_MULT (S_NUM (NUM eight)) (MULT times) (S_NUM (NUM four))) (MULT times) (S_NUM (NUM two))))
    # five times two plus one
    (S (S_ADD (S_MULT (S_NUM (NUM five)) (MULT times) (S_NUM (NUM two))) (ADD plus) (S_NUM (NUM one))))
    # five plus one times four
    (S (S_ADD (S_NUM (NUM five)) (ADD plus) (S_MULT (S_NUM (NUM one)) (MULT times) (S_NUM (NUM four)))))
    # two times three plus four
    (S (S_ADD (S_MULT (S_NUM (NUM two)) (MULT times) (S_NUM (NUM three))) (ADD plus) (S_NUM (NUM four))))
    # ten plus two times three
    (S (S_ADD (S_NUM (NUM ten)) (ADD plus) (S_MULT (S_NUM (NUM two)) (MULT times) (S_NUM (NUM three)))))
    # four times three plus two times one
    (S (S_ADD (S_MULT (S_NUM (NUM four)) (MULT times) (S_NUM (NUM three))) (ADD plus) (S_MULT (S_NUM (NUM two)) (MULT times) (S_NUM (NUM one)))))
    # four plus three times two plus one
    (S (S_ADD (S_ADD (S_NUM (NUM four)) (ADD plus) (S_MULT (S_NUM (NUM three)) (MULT times) (S_NUM (NUM two)))) (ADD plus) (S_NUM (NUM one))))
"""

def corpus_from_string(raw):
  """Return a corpus as a list of sentences.
  
  The `raw` corpus is split at newlines, trimmed of whitespace, 
  and comment lines and blank lines are eliminated.
  """
  return list(filter(lambda x: x != '' and x[0] != '#', 
                     map(lambda sent: sent.strip(),
                         raw.split('\n'))))

## The processed corpus we'll use
corpus = corpus_from_string(corpus_raw)
print(corpus)

['(S (S_NUM (NUM seven)))', '(S (S_ADD (S_NUM (NUM one)) (ADD plus) (S_NUM (NUM two))))', '(S (S_MULT (S_NUM (NUM two)) (MULT times) (S_NUM (NUM three))))', '(S (S_ADD (S_NUM (NUM two)) (ADD plus) (S_MULT (S_NUM (NUM six)) (MULT times) (S_NUM (NUM one)))))', '(S (S_ADD (S_ADD (S_NUM (NUM eight)) (ADD plus) (S_NUM (NUM three))) (ADD plus) (S_NUM (NUM seven))))', '(S (S_ADD (S_NUM (NUM two)) (ADD plus) (S_MULT (S_NUM (NUM three)) (MULT times) (S_NUM (NUM four)))))', '(S (S_MULT (S_MULT (S_NUM (NUM eight)) (MULT times) (S_NUM (NUM four))) (MULT times) (S_NUM (NUM two))))', '(S (S_ADD (S_MULT (S_NUM (NUM five)) (MULT times) (S_NUM (NUM two))) (ADD plus) (S_NUM (NUM one))))', '(S (S_ADD (S_NUM (NUM five)) (ADD plus) (S_MULT (S_NUM (NUM one)) (MULT times) (S_NUM (NUM four)))))', '(S (S_ADD (S_MULT (S_NUM (NUM two)) (MULT times) (S_NUM (NUM three))) (ADD plus) (S_NUM (NUM four))))', '(S (S_ADD (S_NUM (NUM ten)) (ADD plus) (S_MULT (S_NUM (NUM two)) (MULT times) (S_NUM (NUM three)))))', '(S (S_

Recall that for the rule probabilities to define a valid probability distribution, the following needs to hold for every nonterminal $A$
$$\sum_{A \to \beta \in \cal{P}} \Prob(\beta \given A) = 1$$
where $\cal{P}$ is the set of productions.

In order to get an estimate for each production probability, we can count the number of occurrences of the production, normalizing by the number of occurrences of all productions with the same left-hand side.

\begin{align}
\Prob(\beta \given A) 
  &= \frac{\cnt{A \to \beta}}{\sum_{\beta'} \cnt{A \to \beta'}} \\
  &= \frac{\cnt{A \to \beta}}{\cnt{A}}
\end{align}
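
As a small worked instance of this formula, consider only the rules that expand `S`. In the toy corpus above, 1 tree uses `S -> S_NUM`, 10 use `S -> S_ADD`, and 2 use `S -> S_MULT`. The sketch below uses plain tuples rather than NLTK productions, and the variable names are chosen for illustration only.

```python
from collections import Counter

# Counts of the three S rules in the toy corpus, keyed by (lhs, rhs).
rule_counts = Counter({
    ("S", "S_NUM"): 1,
    ("S", "S_ADD"): 10,
    ("S", "S_MULT"): 2,
})

# Denominator: total count of rules sharing each left-hand side.
lhs_counts = Counter()
for (lhs, _rhs), count in rule_counts.items():
    lhs_counts[lhs] += count

# Maximum likelihood estimate: rule count divided by lhs count.
mle = {rule: count / lhs_counts[rule[0]]
       for rule, count in rule_counts.items()}

for rule, prob in mle.items():
    print(rule, round(prob, 6))
```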

We will define three functions: 

1. `rule_counter` - accepts a list of sentences and returns a dictionary of rule counts (where the key is the NLTK CFG production (defined by the lhs and rhs) and the value is the number of rule occurrences)
2. `lhs_counter` - accepts a list of sentences and returns a dictionary of lhs counts (where the key is the lhs nonterminal and the value is the count of that nonterminal's occurrences as a lhs)
3. `rule_probs` - accepts a list of sentences and returns a dictionary of rule probabilities (where the key is the production and the value is the rule probability).

Implement these functions as specified above.

<!--
BEGIN QUESTION
name: probs_from_corpus
-->

In [87]:
#TODO 
def rule_counter(sentence_list):
    d = defaultdict(int)
    for sentence in sentence_list:
        for rule in nltk.Tree.fromstring(sentence).productions():
            d[rule] += 1
    return d       

#TODO
def lhs_counter(sentence_list):
    d = defaultdict(int)
    for sentence in sentence_list:
        for rule in nltk.Tree.fromstring(sentence).productions():
            d[rule.lhs()] += 1
    
    return d

#TODO
def rule_probs(sentence_list):
    ruleDict = rule_counter(sentence_list)
    lhsDict = lhs_counter(sentence_list)
    # Debugging output: the raw rule counts and lhs counts
    print(ruleDict)
    print(lhsDict)
    # MLE: count of each rule divided by the count of its left-hand side
    d = {rule: count / lhsDict[rule.lhs()] for rule, count in ruleDict.items()}
    print(d)
    return d

In [88]:
grader.check("probs_from_corpus")

Now we can use the `rule_probs` function you wrote to get the rule probabilities from our corpus:

In [89]:
probs_from_corpus = rule_probs(corpus)
pprint(probs_from_corpus)

defaultdict(<class 'int'>, {S -> S_NUM: 1, S_NUM -> NUM: 37, NUM -> 'seven': 2, S -> S_ADD: 10, S_ADD -> S_NUM ADD S_NUM: 2, NUM -> 'one': 6, ADD -> 'plus': 12, NUM -> 'two': 10, S -> S_MULT: 2, S_MULT -> S_NUM MULT S_NUM: 11, MULT -> 'times': 12, NUM -> 'three': 7, S_ADD -> S_NUM ADD S_MULT: 5, NUM -> 'six': 1, S_ADD -> S_ADD ADD S_NUM: 2, NUM -> 'eight': 2, NUM -> 'four': 6, S_MULT -> S_MULT MULT S_NUM: 1, S_ADD -> S_MULT ADD S_NUM: 2, NUM -> 'five': 2, NUM -> 'ten': 1, S_ADD -> S_MULT ADD S_MULT: 1})
defaultdict(<class 'int'>, {S: 13, S_NUM: 37, NUM: 37, S_ADD: 12, ADD: 12, S_MULT: 12, MULT: 12})
{S -> S_NUM: 0.07692307692307693, S_NUM -> NUM: 1.0, NUM -> 'seven': 0.05405405405405406, S -> S_ADD: 0.7692307692307693, S_ADD -> S_NUM ADD S_NUM: 0.16666666666666666, NUM -> 'one': 0.16216216216216217, ADD -> 'plus': 1.0, NUM -> 'two': 0.2702702702702703, S -> S_MULT: 0.15384615384615385, S_MULT -> S_NUM MULT S_NUM: 0.9166666666666666, MULT -> 'times': 1.0, NUM -> 'three': 0.1891891891891

Observe that the probabilities of the two rules `S_ADD -> S_NUM ADD S_MULT` and `S_MULT -> S_ADD MULT S_NUM` are now different from each other. (They were both the same in the previous grammar, since you made the probabilities as uniform as possible.)

NLTK allows us to infer a probabilistic grammar from a parsed corpus like this one using [`nltk.induce_pcfg`](https://www.nltk.org/api/nltk.grammar.html#nltk.grammar.induce_pcfg). Let's do that.

In [90]:
def flatten(l):
    return sum(l, [])
    
def pcfg_from_trees(trees):
    return nltk.induce_pcfg(nltk.Nonterminal('S'), 
                            flatten([nltk.Tree.fromstring(tree).productions() 
                                     for tree in trees]))

induced_pcfg = pcfg_from_trees(corpus)

print(induced_pcfg)

Grammar with 22 productions (start state = S)
    S -> S_NUM [0.0769231]
    S_NUM -> NUM [1.0]
    NUM -> 'seven' [0.0540541]
    S -> S_ADD [0.769231]
    S_ADD -> S_NUM ADD S_NUM [0.166667]
    NUM -> 'one' [0.162162]
    ADD -> 'plus' [1.0]
    NUM -> 'two' [0.27027]
    S -> S_MULT [0.153846]
    S_MULT -> S_NUM MULT S_NUM [0.916667]
    MULT -> 'times' [1.0]
    NUM -> 'three' [0.189189]
    S_ADD -> S_NUM ADD S_MULT [0.416667]
    NUM -> 'six' [0.027027]
    S_ADD -> S_ADD ADD S_NUM [0.166667]
    NUM -> 'eight' [0.0540541]
    NUM -> 'four' [0.162162]
    S_MULT -> S_MULT MULT S_NUM [0.0833333]
    S_ADD -> S_MULT ADD S_NUM [0.166667]
    NUM -> 'five' [0.0540541]
    NUM -> 'ten' [0.027027]
    S_ADD -> S_MULT ADD S_MULT [0.0833333]


We'll use NLTK's implementation of the probabilistic CKY algorithm ([`nltk.ViterbiParser`](https://www.nltk.org/api/nltk.parse.viterbi.html#nltk.parse.viterbi.ViterbiParser)) to generate the best parse for some strings according to this induced PCFG. (You'll implement this yourself in lab 3-4.)

In [91]:
induced_parser = nltk.ViterbiParser(induced_pcfg)

Use this parser to parse the `example` phrase "two plus three times four" from above. Which parse does it return? Do you understand why?

> Be careful. The parser returns a Python generator of the parses, not a list. You can't use the generator twice, so you should save the `induced_grammar_parses` as a list constructed from the generator object to pass all of the tests.
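
The generator behavior mentioned in this note is a general Python property. A minimal sketch, using a hypothetical stand-in function `fake_parses` instead of a real parser:

```python
def fake_parses():
    """Hypothetical stand-in for a parser call that yields parse trees."""
    yield "(S (S_NUM (NUM seven)))"

gen = fake_parses()
first_pass = list(gen)    # consumes the generator
second_pass = list(gen)   # the generator is now exhausted

# Saving the results in a list up front allows repeated use.
saved = list(fake_parses())

print(first_pass, second_pass, saved)
```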

<!--
BEGIN QUESTION
name: induced_grammar_parses
-->

In [92]:
# TODO - parse `example` using `induced_parser`
induced_grammar_parses = list(induced_parser.parse(example.split()))

In [93]:
grader.check("induced_grammar_parses")

In [94]:
for i, tree in enumerate(induced_grammar_parses):
    print(f'Probability of parse tree {i+1} is '
          f'{parse_probability(tree, induced_pcfg):1.2e}')
    tree.pretty_print()

Probability of parse tree 1 is 2.44e-03
             S               
             |                
           S_ADD             
   __________|_____           
  |    |         S_MULT      
  |    |      _____|______    
S_NUM  |   S_NUM   |    S_NUM
  |    |     |     |      |   
 NUM  ADD   NUM   MULT   NUM 
  |    |     |     |      |   
 two  plus three times   four
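
The probability printed above can be reproduced by hand from the corpus counts, which also suggests why only this parse is returned. The sketch below uses exact fractions; the counts are read off the rule and lhs counts printed earlier, and the variable names are illustrative.

```python
from fractions import Fraction
from math import prod

# Parse with S_ADD on top. Rules with probability 1 (S_NUM -> NUM,
# ADD -> 'plus', MULT -> 'times') are omitted from the product.
p_add_on_top = prod([
    Fraction(10, 13),  # S -> S_ADD
    Fraction(5, 12),   # S_ADD -> S_NUM ADD S_MULT
    Fraction(11, 12),  # S_MULT -> S_NUM MULT S_NUM
    Fraction(10, 37),  # NUM -> 'two'
    Fraction(7, 37),   # NUM -> 'three'
    Fraction(6, 37),   # NUM -> 'four'
])

# Parse with S_MULT on top. It needs S_MULT -> S_ADD MULT S_NUM, which
# never occurs in the corpus, so its estimated probability is 0.
p_mult_on_top = prod([
    Fraction(2, 13),   # S -> S_MULT
    Fraction(0, 12),   # S_MULT -> S_ADD MULT S_NUM (unseen)
    Fraction(2, 12),   # S_ADD -> S_NUM ADD S_NUM
    Fraction(10, 37), Fraction(7, 37), Fraction(6, 37),
])

print(f"{float(p_add_on_top):1.2e}")
print(float(p_mult_on_top))
```

Since the induced grammar contains no `S_MULT -> S_ADD MULT S_NUM` production at all, the alternative bracketing is not merely less probable; it cannot be derived.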



How many parses are there for the expression "three plus nine plus two" according to the induced PCFG? Set the variable in the next cell accordingly.

<!--
BEGIN QUESTION
name: parse_count_2
-->

In [95]:
example = "three plus nine plus two"
induced_grammar_parses = list(induced_parser.parse(example.split()))
for i, tree in enumerate(induced_grammar_parses):
    print(f'Probability of parse tree {i+1} is '
          f'{parse_probability(tree, induced_pcfg):1.2e}')
    tree.pretty_print()

ValueError: Grammar does not cover some of the input words: "'nine'".

In [96]:
# TODO 
example2_parse_count = 0

In [97]:
grader.check("parse_count_2")

<!-- BEGIN QUESTION -->

**Question:** You undoubtedly obtained a number of parses for this second example that didn't seem appropriate. With a _single word_, what technique that you've learned would be appropriate to solve this problem?

<!--
BEGIN QUESTION
name: open_response_debrief
manual: true
-->

smoothing

<!-- END QUESTION -->
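
As one possible illustration of smoothing, the sketch below applies add-one (Laplace) smoothing to the `NUM` word counts from the toy corpus. The choice of add-one smoothing, the vocabulary list, and the helper name `add_one` are assumptions for this example; the lab does not prescribe a particular method.

```python
from collections import Counter

# All number words the original grammar allows (assumed vocabulary).
VOCAB = "zero one two three four five six seven eight nine ten".split()

# NUM word counts observed in the toy corpus (37 tokens in total).
num_counts = Counter({"seven": 2, "one": 6, "two": 10, "three": 7,
                      "six": 1, "eight": 2, "four": 6, "five": 2, "ten": 1})

def add_one(counts, vocab):
    """Add-one smoothing: pretend every vocabulary word was seen once more."""
    total = sum(counts.values()) + len(vocab)
    return {word: (counts[word] + 1) / total for word in vocab}

smoothed = add_one(num_counts, VOCAB)
print(smoothed["nine"], smoothed["two"])
```

Under this scheme the unseen words "zero" and "nine" receive a small nonzero probability, so a grammar built from the smoothed estimates would cover them.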

<!-- BEGIN QUESTION -->

**Question:** The example that we provided of an ambiguity in arithmetic expressions is admittedly quite artificial. Can you think of other (more natural) examples, in natural language or elsewhere, where this phenomenon might occur?

<!--
BEGIN QUESTION
name: open_response_other_examples
manual: true
-->

Ambiguity in natural language also occurs when a sentence mentions several subjects but fewer actions than subjects, so that it is unclear which subject performs which action.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

# Lab debrief – for consensus submission only

**Question:** We're interested in any thoughts your group has about this lab so that we can improve this lab for later years, and to inform later labs for this year. Please list any issues that arose or comments you have to improve the lab. Useful things to comment on include the following: 

* Was the lab too long or too short?
* Were the readings appropriate for the lab? 
* Was it clear (at least after you completed the lab) what the points of the exercises were? 
* Are there additions or changes you think would make the lab better?

<!--
BEGIN QUESTION
name: open_response_debrief
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



# End of Lab 3-3 {-}

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()