# Approach
## 1 Different parsing techniques

In this assignment, we found that the texts are possibly derived from the script of *Monty Python and the Holy Grail*. A reasonable way to use this extra text source is to parse it with a state-of-the-art English parser to produce parse trees, and then traverse these trees to generate grammar files.

Parsing English on the Penn Treebank dataset has been well studied. Researchers from Stanford have proposed a method to train a highly accurate parser on the Penn Treebank dataset[1]. They have packaged their software as a jar and distribute it on their website. Their parser package contains two parsers: a traditional statistical PCFG parser and an RNN-based PCFG parser. Both parsers are trained on the Penn Treebank dataset.

## 2 Generate S1.gr & Vocab.gr

### 2.1 Generate trees

* Use the provided `devset.trees`.

* Use a state-of-the-art parser to parse existing materials.

### 2.2 Generate grammar files from grammar trees


1. Given grammar trees (e.g. `devset.trees`), split them into separate parenthesis-delimited strings, one per tree.

    * In the given `devset.trees`, parentheses that appear as tokens are not escaped (e.g. `(( ()`, which stands for the rule `( -> (`). This would make later processing fail, so we identify such cases in the grammar trees and replace '(' and ')' with '-LRB-' and '-RRB-', as in the output of the Stanford parser.
    
2. Construct the tree with `nltk.Tree.fromstring()`.

3. Convert the tree to CNF with `nltk.Tree.chomsky_normal_form()`.

4. Traverse the tree by depth-first search, using a dictionary to count how often each symbol expands to each sequence of symbols.

    * The grammar trees contain rules like `. -> .` and `, -> ,`, where '.' and ',' are both nonterminals and terminals. To address this, we rename punctuation nonterminals to names like 'PERIOD' and 'COMMA'. Because we previously replaced '(' and ')' with '-LRB-' and '-RRB-', we also get rules like `-LRB- -> -LRB-`, so we map the terminals '-LRB-' and '-RRB-' back to '(' and ')'.
    
    * In short, we make sure that every nonterminal consists of uppercase letters and that no symbol is both a nonterminal and a terminal.
    
5. Write the rules, weighted by frequency, to the grammar files `S1.gr` and `Vocab.gr`.
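
The traversal and counting in steps 4 and 5 can be sketched roughly as follows. This is a minimal, dependency-free illustration rather than our actual script: it uses a small hand-written reader in place of `nltk.Tree`, skips the CNF conversion of step 3, and assumes the parentheses have already been escaped as in step 1. The names `read_tree`, `count_rules`, and `to_lines`, as well as the tab-separated `weight lhs rhs` line layout, are assumptions made for this example.

```python
from collections import Counter

# Nonterminal renames mentioned in the text; extend as needed.
PUNCT_NAMES = {".": "PERIOD", ",": "COMMA"}
# Escaped bracket terminals are mapped back to real brackets.
UNESCAPE = {"-LRB-": "(", "-RRB-": ")"}


def read_tree(text):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())
            else:
                children.append(tokens[pos])   # a terminal word
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return parse()


def count_rules(tree, s1, vocab):
    """Depth-first traversal that counts rules into two Counters.

    Assumes each node has either only words or only subtrees as children.
    """
    label, children = tree
    lhs = PUNCT_NAMES.get(label, label)
    if all(isinstance(c, str) for c in children):
        # Preterminal: a "tag -> word" rule belongs in the vocabulary counts.
        words = " ".join(UNESCAPE.get(c, c) for c in children)
        vocab[(lhs, words)] += 1
        return
    rhs = " ".join(PUNCT_NAMES.get(c[0], c[0]) for c in children)
    s1[(lhs, rhs)] += 1
    for child in children:
        count_rules(child, s1, vocab)


def to_lines(counts):
    """Format counts as 'weight<TAB>lhs<TAB>rhs' lines (assumed layout)."""
    return ["{}\t{}\t{}".format(n, lhs, rhs) for (lhs, rhs), n in counts.items()]
```

Calling `count_rules(read_tree(line), s1, vocab)` for every tree and then writing `to_lines(s1)` and `to_lines(vocab)` to disk would give files in the spirit of `S1.gr` and `Vocab.gr`.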




### 2.3 Add other unseen words with tags

After generating `S1.gr` and `Vocab.gr` from the grammar trees, we found that not every word in `allowed_words.txt` appears in `Vocab.gr`. We therefore need to assign meaningful tags to those unseen words.

`nltk` can assign a POS tag to any single word with

```python
nltk.tag.pos_tag([word])
```

So we simply iterate over all the words in `allowed_words.txt`; if a word doesn't exist in `Vocab.gr`, we assign it a tag and append it to the end of the file.
The code that finds unseen words, assigns tags, and puts them into `Vocab.gr` is in `vocab_generator.py`.

Note that one word may have multiple POS tags in different contexts, but we didn't find a way to list all possible tags. Since only a small percentage of words are unseen, we think it is acceptable not to include every tag of a word.
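
The loop described above might look roughly like the sketch below. It is not the code in `vocab_generator.py`; the function name `add_unseen_words`, the `weight tag word` line layout, and the default weight of 1 for new entries are assumptions made for illustration. The tagger is passed in as a callable so the logic can be tried without downloading any tagger model.

```python
def add_unseen_words(vocab_lines, allowed_words, tag_word):
    """Return vocab_lines plus one new line per allowed word not yet listed.

    vocab_lines: strings laid out as 'weight tag word' (assumed layout).
    tag_word: callable that maps a single word to a POS tag.
    """
    seen = set()
    for line in vocab_lines:
        fields = line.split()
        if len(fields) >= 3:
            seen.add(fields[2])

    result = list(vocab_lines)
    for word in allowed_words:
        if word not in seen:
            # Assumed default weight of 1 for words never seen in the trees.
            result.append("1\t{}\t{}".format(tag_word(word), word))
            seen.add(word)
    return result
```

With NLTK, `tag_word` could be `lambda w: nltk.tag.pos_tag([w])[0][1]`, since `pos_tag` returns a list of `(word, tag)` pairs.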

## 3 Generate S2.gr

### 3.1 Unigram

As mentioned above, `Vocab.gr` maps tags to words like this:

|Weight|Tag|Word|
|---|---|---|
|1|NNP|Whoa|
|18|NNP|Arthur|
|1|NNP|Uther|
|6|EX|there|

Our first approach to generating `S2.gr` is a simple unigram model. We want S2 to accept sentences that consist of any words in any order, where the occurrence of a word is completely independent of the other words in the same sentence. In other words, the probability of a tag does not depend on the tag before it. Without taking the relationship between tags into account, we simply count how often each tag occurs, i.e. the number of its entries in `Vocab.gr`. The `Vocab.gr` above therefore produces the following `S2.gr`:

|Weight|Left|Right|Comment|
|---|---|---|---|
|1|S2|_Word|# S2 consists of any number of Words|
|1|_Word|Word _Word||
|1|_Word|Word||
|3|Word|NNP|# There are 3 NNP entries in Vocab.gr|
|1|Word|EX|# There is 1 EX entry in Vocab.gr|
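
A minimal sketch of how such unigram rules could be derived from the vocabulary entries is shown below. The function name `unigram_s2` and the representation of each entry as a `(weight, tag, word)` tuple are assumptions for this example.

```python
from collections import Counter


def unigram_s2(vocab_entries):
    """Build unigram S2 rules as (weight, left, right) tuples.

    vocab_entries: iterable of (weight, tag, word) tuples.
    The weight of 'Word -> tag' is the number of entries carrying that tag.
    """
    tag_counts = Counter(tag for _weight, tag, _word in vocab_entries)
    rules = [
        (1, "S2", "_Word"),
        (1, "_Word", "Word _Word"),
        (1, "_Word", "Word"),
    ]
    rules += [(count, "Word", tag) for tag, count in tag_counts.items()]
    return rules
```

Applied to the four vocabulary entries in the table above, this yields the weights 3 for `NNP` and 1 for `EX`.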


### 3.2 Bigram

In the bigram approach, unlike the unigram approach, we consider the relationship between tags. We first initialize every possible rule with a weight of 1. We then count the number of times a tag begins a sentence, ends a sentence, and follows another tag, and add these counts to the corresponding weights.

Say we only have this one sentence:

`(INTJ (NNP Whoa) (ADVP (EX there)))`

S2.gr will look like:

|Weight|Left|Right|Comment|
|---|---|---|---|
|2|S2|_NNP|# Each rule starts with an initial value of 1. NNP is the beginning of 1 sentence so here the weight is 1+1=2|
|1|S2|_EX||
|1|_NNP|NNP|# A sentence ends with NNP|
|1|_NNP|NNP _NNP|# NNP is followed by another NNP|
|2|_NNP|NNP _EX||
|2|_EX|EX|# A sentence ends with EX|
|1|_EX|EX _EX||
|1|_EX|EX _NNP||
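
The counting behind this table can be sketched as follows. The function name `bigram_s2` and the input format (one list of tags per sentence) are assumptions for this example; the initial weight of 1 per rule follows the description above.

```python
from collections import Counter


def bigram_s2(tag_sequences, tags):
    """Count bigram S2 rules, keyed by (left, right).

    tag_sequences: one list of POS tags per training sentence.
    tags: all tags that should appear in the grammar.
    """
    counts = Counter()
    # Every possible rule starts with a weight of 1.
    for first in tags:
        counts[("S2", "_" + first)] = 1
        counts[("_" + first, first)] = 1
        for second in tags:
            counts[("_" + first, first + " _" + second)] = 1

    for seq in tag_sequences:
        if not seq:
            continue
        counts[("S2", "_" + seq[0])] += 1            # sentence start
        for a, b in zip(seq, seq[1:]):
            counts[("_" + a, a + " _" + b)] += 1     # a followed by b
        counts[("_" + seq[-1], seq[-1])] += 1        # sentence end
    return counts
```

For the single example sentence, the tag sequence is `["NNP", "EX"]`, and `bigram_s2([["NNP", "EX"]], ["NNP", "EX"])` gives the same weights as the table.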

# Experiments

## 1 Compare grammars from different training sets

Based on the methods described above, we want to compare performance across different training sets.
We prepared the following training sets:

* `devset.txt` only
* `devset.txt` + `more_sentences.txt`
* `devset.txt` + `example_sentences.txt`
* `devset.txt` + `example_sentences.txt` + `more_sentences.txt`

*Note: where do the additional sentences come from?
We realized that the devset sentences all come from Monty Python and the Holy Grail, which is available in `nltk.book`. We take all the sentences from the book and reformat them to match the style of `devset.txt` (e.g. transform “couldn ’ t” to “could n’t”). We then keep only the sentences that consist entirely of allowed words.*
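
The reformatting and filtering in the note could be sketched like this. It is an illustration under assumptions, not our actual preprocessing script: the function names are made up, sentences are assumed to be whitespace-tokenized strings, and only the contraction pattern mentioned in the note is handled.

```python
import re


def normalize_contractions(sentence):
    """Rewrite e.g. "couldn ' t" as "could n't", keeping the apostrophe used."""
    return re.sub(r"(\w)n (['’]) t\b", r"\1 n\2t", sentence)


def keep_allowed(sentences, allowed_words):
    """Keep only sentences whose tokens are all in allowed_words."""
    allowed = set(allowed_words)
    kept = []
    for sentence in sentences:
        tokens = normalize_contractions(sentence).split()
        if tokens and all(tok in allowed for tok in tokens):
            kept.append(" ".join(tokens))
    return kept
```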

| Training data \ Test data                   | example sentences | more example sentences |
| ------------------------------------------- | ----------------- | ---------------------- |
| devset                                      | -8.92096          | -9.57106               |
| devset + example sentences                  | -7.87721          | -8.83074               |
| devset + more sentences                     | -9.03043          | -9.67019               |
| devset + more sentences + example sentences | -8.31589          | -9.04387               |


Since our test data will come from the same distribution as `devset.txt` and `example_sentences.txt`, we need to make full use of both in training.
So we chose `devset.txt` and `example_sentences.txt` as the training data for our final grammars.



```python
%load_ext autoreload
%autoreload 2
from pcfg_parse_gen import Pcfg, PcfgGenerator, CkyParse
import nltk
```

```python
def test_grammar(s1, s2, vocab, test):
    # Build a PCFG from the three grammar files and parse every sentence in `test`.
    parse_gram = Pcfg([s1, s2, vocab])
    parser = CkyParse(parse_gram, beamsize=0.0001, verbose=0)
    ce, trees = parser.parse_file(test)
```

### On `text/example_sentences.txt`

```python
test_grammar("grammars/devset_s1.gr", "grammars/devset_s2.gr", "grammars/devset_vocab.gr", "text/example_sentences.txt")
```

```
#-cross entropy (bits/word): -8.66895
```

```python
test_grammar("grammars/devset_rnn_s1.gr", "grammars/devset_rnn_s2.gr", "grammars/devset_rnn_vocab.gr", "text/example_sentences.txt")
```

```
#No parses found for: they migrate precisely because they know they will grow .
#-cross entropy (bits/word): -13.819
```

```python
test_grammar("grammars/devset_stanford_s1.gr", "grammars/devset_stanford_s2.gr", "grammars/devset_stanford_vocab.gr", "text/example_sentences.txt")
```

```
#-cross entropy (bits/word): -9.15619
```


### On `text/more_examples.txt`

```python
test_grammar("grammars/devset_s1.gr", "grammars/devset_s2.gr", "grammars/devset_vocab.gr", "text/more_examples.txt")
```

```
#-cross entropy (bits/word): -9.43403
```

```python
test_grammar("grammars/devset_rnn_s1.gr", "grammars/devset_rnn_s2.gr", "grammars/devset_rnn_vocab.gr", "text/more_examples.txt")
```

```
#No parses found for: they migrate precisely because they know they will grow .
#-cross entropy (bits/word): -11.2527
```

```python
test_grammar("grammars/devset_stanford_s1.gr", "grammars/devset_stanford_s2.gr", "grammars/devset_stanford_vocab.gr", "text/more_examples.txt")
```

```
#-cross entropy (bits/word): -9.70726
```



## 2 Compare different parsers on devset

Since we also found other parsers, we want to know which one works best. Here are our test results:

| Parser \ Test data    | example sentences | more example sentences |
| --------------------- | ----------------- | ---------------------- |
| provided devset.trees | -8.92096          | -9.57106               |
| Stanford parser       | -9.15619          | -9.70726               |
| RNN parser            | -13.819           | -11.2527               |


We found that the `devset.trees` provided in the repo yields the best cross-entropy. Thus we use it as our source of grammar trees.


### On `text/example_sentences.txt`

```python
test_grammar("grammars/devset_s1.gr", "grammars/devset_s2.gr", "grammars/devset_vocab.gr", "text/example_sentences.txt")
```

```
#-cross entropy (bits/word): -8.66895
```

```python
test_grammar("grammars/devset_and_examplesentences_s1.gr", "grammars/devset_and_examplesentences_s2.gr", "grammars/devset_and_examplesentences_vocab.gr", "text/example_sentences.txt")
```

```
#-cross entropy (bits/word): -7.87721
```

```python
test_grammar("grammars/moresentences_devset_s1.gr", "grammars/moresentences_devset_s2.gr", "grammars/moresentences_devset_vocab.gr", "text/example_sentences.txt")
```

```
#-cross entropy (bits/word): -9.03043
```

```python
test_grammar("grammars/moresentences_devset_examplesentences_s1.gr", "grammars/moresentences_devset_examplesentences_s2.gr", "grammars/moresentences_devset_examplesentences_vocab.gr", "text/example_sentences.txt")
```

```
#-cross entropy (bits/word): -8.31589
```


### On `text/more_examples.txt`

```python
test_grammar("grammars/devset_s1.gr", "grammars/devset_s2.gr", "grammars/devset_vocab.gr", "text/more_examples.txt")
```

```
#-cross entropy (bits/word): -9.43403
```

```python
test_grammar("grammars/devset_and_examplesentences_s1.gr", "grammars/devset_and_examplesentences_s2.gr", "grammars/devset_and_examplesentences_vocab.gr", "text/more_examples.txt")
```

```
#-cross entropy (bits/word): -8.83074
```

```python
test_grammar("grammars/moresentences_devset_s1.gr", "grammars/moresentences_devset_s2.gr", "grammars/moresentences_devset_vocab.gr", "text/more_examples.txt")
```

```
#-cross entropy (bits/word): -9.67019
```

```python
test_grammar("grammars/moresentences_devset_examplesentences_s1.gr", "grammars/moresentences_devset_examplesentences_s2.gr", "grammars/moresentences_devset_examplesentences_vocab.gr", "text/more_examples.txt")
```

```
#-cross entropy (bits/word): -9.04387
```



## 3 Compare different methods to generate S2

We evaluate S2 by parsing with only `S2.gr` and `Vocab.gr`.

Beamsize: 0.0001

|S2.gr|Trained on|Entropy on example sentences|Entropy on devset|
|-----|-----|-----|-----|
|s2_unigram.gr|devset| -11.1758|-11.2044|
|S2_bigram.gr|devset|-33.2766|-13.6859|

Beamsize: 0.000001

|S2.gr|Trained on|Entropy on example sentences|Entropy on devset|
|-----|-----|-----|-----|
|s2_unigram.gr|devset| ?| ?|
|S2_bigram.gr|devset|?|?|

We found that the bigram grammar works better when the beam size is small enough. This is expected, since we record how often one tag follows another, which improves the entropy.

However, when the beam size is 0.0001, some sentences fail to parse because of pruning inside the parser. For those failed sentences, the entropy is very poor.

To make sure every sentence parses successfully, we chose the unigram grammar.
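
To illustrate why a few failed parses can dominate the score, here is a small sketch of a bits-per-word average. It assumes the reported value is the total log2 probability divided by the total number of words, and that a failed sentence is charged some large penalty; the penalty of -1000 bits is a made-up number, and how `pcfg_parse_gen` actually scores failures is not shown here.

```python
def bits_per_word(sentence_log2_probs, sentence_lengths):
    """Total log2 probability divided by the total word count."""
    return sum(sentence_log2_probs) / sum(sentence_lengths)


# Nine 10-word sentences that each parse with log2 probability -100.
parsed_only = bits_per_word([-100.0] * 9, [10] * 9)

# The same nine sentences plus one 10-word sentence that failed to parse
# and was charged a hypothetical penalty of -1000 bits.
with_failure = bits_per_word([-100.0] * 9 + [-1000.0], [10] * 10)

print(parsed_only, with_failure)
```

In this toy setting, a single failure out of ten sentences moves the average from -10 to -19 bits per word.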

# Results

According to our experiments, we chose the grammar trained on `devset` and `more_sentences`. Although the test result is slightly worse than when using only `devset`, we think this may be because the test set is small, and a grammar trained on more data should generalize better to unseen data.

```python
from pcfg_parse_gen import Pcfg, PcfgGenerator, CkyParse
import nltk

parse_gram = Pcfg(["grammars/devset_more_s1.gr", "grammars/devset_more_s2.gr", "grammars/devset_more_vocab.gr"])
parser = CkyParse(parse_gram, beamsize=0.0001, verbose=0)
ce, trees = parser.parse_file('text/example_sentences.txt')
print("-cross entropy: {}".format(ce))
```

```
-cross entropy: -9.03042669531999
#-cross entropy (bits/word): -9.03043
```




# References
[1] Danqi Chen and Christopher D. Manning. 2014. *A Fast and Accurate Dependency Parser using Neural Networks.* Proceedings of EMNLP 2014.

