# Context-Free Grammars

NLTK has a module for building context-free grammars from a string representation of the rules. Let's start by creating our toy grammar from the lecture and generating the four possible sentences that can be derived from it.

In [None]:
import nltk
from nltk.parse.generate import generate
toy_grammar = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> VBG NP 
  VBG -> "ate"
  NP -> DT NN
  DT -> "the"
  NN -> "rat" | "cheese" 
  """)

# generate() yields each derivable sentence as a list of tokens
for sentence in generate(toy_grammar):
    print(sentence)
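
To make it clearer where the four sentences come from, here is a minimal pure-Python sketch of exhaustive generation. It is not how NLTK's `generate` is implemented; the dictionary `TOY_RULES` and the helper `expand` are hypothetical names introduced only for illustration, and the approach assumes a finite (non-recursive) grammar like our toy one.

```python
from itertools import product

# Toy grammar as a dict: nonterminal -> list of right-hand sides.
# Any symbol that is not a key of the dict is treated as a terminal (a word).
TOY_RULES = {
    "S": [["NP", "VP"]],
    "VP": [["VBG", "NP"]],
    "VBG": [["ate"]],
    "NP": [["DT", "NN"]],
    "DT": [["the"]],
    "NN": [["rat"], ["cheese"]],
}

def expand(symbol, rules):
    """Yield every word sequence derivable from `symbol` (finite grammars only)."""
    if symbol not in rules:
        # A terminal derives only itself.
        yield [symbol]
        return
    for rhs in rules[symbol]:
        # Expand each symbol on the right-hand side, then combine the
        # alternatives in left-to-right order.
        expansions = [list(expand(sym, rules)) for sym in rhs]
        for combo in product(*expansions):
            yield [word for part in combo for word in part]

sentences = [" ".join(words) for words in expand("S", TOY_RULES)]
for s in sentences:
    print(s)
```

The only choice point in the grammar is `NN`, which is reached twice in every derivation, so there are 2 x 2 = 4 sentences.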

Let's examine our grammar a bit. We can iterate through the rules (productions), look at the symbols on their left-hand and right-hand sides, and see whether they're terminals or non-terminals. We'll use this to collect the sets of terminals and non-terminals.

In [None]:
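terminals = set()
nonterminals = set()
for production in toy_grammar.productions():
    print(production)
    print(production.lhs())
    print(production.rhs())
    # The LHS is always a non-terminal; without this line the start symbol S,
    # which never appears on a right-hand side, would be missed.
    nonterminals.add(production.lhs())
    for symbol in production.rhs():
        if nltk.grammar.is_terminal(symbol):
            terminals.add(symbol)
        else:
            nonterminals.add(symbol)

print("terminals")
print(terminals)
print("nonterminals")
print(nonterminals)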

Now, let's use Earley parsing to parse a sentence with our grammar. NLTK has a built-in version of the algorithm that you can use. We can set trace=2 to show the steps that the parser is taking.

In [None]:
parser = nltk.parse.EarleyChartParser(toy_grammar, trace=2)
for parse in parser.parse("the rat ate the cheese".split()):
    print(parse)

These are exactly the same steps as we saw in lecture, except that this version predicts lexical rules such as NN -> 'cheese' in the step before scanning them. You should look through these steps again to make sure you understand them.
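
If it helps to see the three Earley operations side by side, here is a minimal recognizer sketch in plain Python. It is not NLTK's implementation: `GRAMMAR` and `earley_recognize` are hypothetical names, the grammar is assumed to contain no empty productions, and the function only answers yes or no rather than building trees. Like the trace above, it predicts lexical rules before scanning them.

```python
# Grammar: nonterminal -> list of right-hand sides; anything else is a word.
GRAMMAR = {
    "S": [("NP", "VP")],
    "VP": [("VBG", "NP")],
    "VBG": [("ate",)],
    "NP": [("DT", "NN")],
    "DT": [("the",)],
    "NN": [("rat",), ("cheese",)],
}

def earley_recognize(tokens, grammar, start="S"):
    """Return True if `tokens` can be derived from `start`.

    A state is (lhs, rhs, dot, origin): rhs[:dot] has been recognised,
    beginning at input position `origin`.
    """
    n = len(tokens)
    chart = [[] for _ in range(n + 1)]  # chart[i] holds states ending at i

    def add(i, state):
        if state not in chart[i]:
            chart[i].append(state)

    for rhs in grammar[start]:
        add(0, (start, rhs, 0, 0))

    for i in range(n + 1):
        for state in chart[i]:  # chart[i] may grow while we iterate over it
            lhs, rhs, dot, origin = state
            if dot < len(rhs):
                nxt = rhs[dot]
                if nxt in grammar:
                    # Predictor: expect every expansion of the next nonterminal.
                    for expansion in grammar[nxt]:
                        add(i, (nxt, expansion, 0, i))
                elif i < n and tokens[i] == nxt:
                    # Scanner: the next symbol is a word that matches the input.
                    add(i + 1, (lhs, rhs, dot + 1, origin))
            else:
                # Completer: advance every state that was waiting for `lhs`.
                for w_lhs, w_rhs, w_dot, w_origin in chart[origin]:
                    if w_dot < len(w_rhs) and w_rhs[w_dot] == lhs:
                        add(i, (w_lhs, w_rhs, w_dot + 1, w_origin))

    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[n])

print(earley_recognize("the rat ate the cheese".split(), GRAMMAR))
print(earley_recognize("the rat ate".split(), GRAMMAR))
```

The first call succeeds because a completed `S` state spanning the whole input ends up in the last chart column; the second fails because the `VP` rule is still waiting for its object `NP` when the input runs out.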

Let's try to go beyond our toy grammar. Building one by hand would be painful, but fortunately NLTK contains a portion of the human-annotated Penn Treebank, which we can use for grammar induction. First, though, let's take a look at some trees from the Penn Treebank.

In [None]:
for tree in nltk.corpus.treebank.parsed_sents()[0:5]:
    print(tree)

One problem with the default Penn Treebank annotation is that some of its nonterminals include sentence-level grammatical role labels which don't make sense for building a CFG (for instance, NP-SBJ for an NP that is the subject of the sentence). Let's write a function to remove them. To do this, we'll have to traverse the trees and change the labels on the nodes. This is fairly easy, since iterating over an NLTK tree with a for loop iterates over its children.

In [None]:
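def remove_grammatical_roles(tree):
    # Leaves are plain strings (the words), so there is nothing to relabel.
    if not isinstance(tree, nltk.Tree):
        return
    label = tree.label()
    # Strip role suffixes (NP-SBJ -> NP), but keep labels such as -NONE-,
    # which start with a hyphen and would otherwise become empty strings.
    if "-" in label and not label.startswith("-"):
        tree.set_label(label.split("-")[0])
    for child in tree:
        remove_grammatical_roles(child)

for tree in nltk.corpus.treebank.parsed_sents()[0:5]:
    remove_grammatical_roles(tree)
    print(tree)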
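
One detail worth checking when splitting labels on hyphens: the Penn Treebank also uses labels that begin with a hyphen, such as -NONE- for empty elements and -LRB-/-RRB- for brackets. The sketch below shows what a plain split does to them and one possible way to guard against it. The helper name `strip_role` is hypothetical and works on label strings only, not on trees.

```python
def strip_role(label):
    """Return the bare category of a treebank label, e.g. 'NP-SBJ' -> 'NP'."""
    if label.startswith("-"):
        # -NONE-, -LRB-, -RRB-: the hyphens are part of the category name.
        return label
    return label.split("-")[0]

# A plain split turns hyphen-initial labels into the empty string.
print(repr("-NONE-".split("-")[0]))

for label in ["NP-SBJ", "NP-SBJ-1", "VP", "-NONE-"]:
    print(label, "->", strip_role(label))
```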

Now let's build a new grammar. NLTK trees have a handy method (productions()) which gives you all the CFG rules used in the tree. We can collect these rules across all the trees in the corpus sample to build a new CFG.

In [None]:
from nltk.grammar import CFG,Nonterminal

productions = set()

for tree in nltk.corpus.treebank.parsed_sents():
    remove_grammatical_roles(tree)
    for production in tree.productions():
        productions.add(production)
treebank_grammar = CFG(Nonterminal('S'), list(productions))
print(len(treebank_grammar.productions()))

Whereas our old grammar had 7 rules, this one has over 17 thousand. Many of them, however, are lexical rules which produce terminal nodes. Let's see how many of them are non-lexical by counting only those whose RHS does not start with a terminal.

In [None]:
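nonlex_count = 0
for production in treebank_grammar.productions():
    # Lexical rules have a single terminal on the RHS, so checking the
    # first RHS symbol is enough to tell the two kinds apart.
    if not nltk.grammar.is_terminal(production.rhs()[0]):
        nonlex_count += 1
print(nonlex_count)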

That's still a lot of rules. Let's see if it can parse our original sentence.

In [None]:
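parser = nltk.parse.EarleyChartParser(treebank_grammar, trace=0)
try:
    for parse in parser.parse("the rat ate the cheese".split()):
        print(parse)
except ValueError as err:  # NLTK 3 raises this when input words are not covered by the grammar
    print(err)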

No: because the grammar was built from Wall Street Journal texts, it doesn't know about rats, cheese, or even eating. These words are out of vocabulary (OOV). Unfortunately, even when the vocabulary of a sentence is covered, the grammar is completely unusable in this context. To see why, try running the cell below, though note that it might crash your IPython notebook session. We have switched to a bottom-up parser, which works a bit better in this situation.

In [None]:
parser = nltk.parse.BottomUpChartParser(treebank_grammar, trace=0)
for parse in parser.parse("revenue increased last quarter".split()):
    print(parse)

In short, way, way, way too many possible parses, almost all of which are absolute junk. Toy grammars make parsing look easy, but parsing with a large grammar involves huge amounts of ambiguity: you need ways to filter out the junk to find the correct parse. One simple thing you might try if you'd like to play around some more is filtering your grammar based on frequency, so you're only using the core rules of the grammar. Can you bring down the range of possibilities to a reasonable set without eliminating the correct parse?
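
As a starting point for the frequency-filtering idea, here is a minimal sketch. The helper `frequent_rules`, the threshold, and the example rules are all hypothetical; the rules are written as plain (lhs, rhs) tuples so the snippet runs without the corpus. In the notebook you would instead collect every item from tree.productions() into a list (not a set, so that duplicates are kept and can be counted) and pass that list in.

```python
from collections import Counter

def frequent_rules(rule_occurrences, min_count=2):
    """Keep the distinct rules that occur at least `min_count` times."""
    counts = Counter(rule_occurrences)
    return [rule for rule, count in counts.items() if count >= min_count]

# Hypothetical rule occurrences, written as (lhs, rhs) tuples for illustration.
observed = [
    ("S", ("NP", "VP")),
    ("S", ("NP", "VP")),
    ("NP", ("DT", "NN")),
    ("NP", ("DT", "NN")),
    ("NP", ("DT", "NN")),
    ("NP", ("NP", ",", "NP", ",", "NP")),  # a rare, flat rule seen only once
]

core = frequent_rules(observed, min_count=2)
print(core)
```

The right threshold is something to experiment with. You may also want to treat lexical rules separately, since many legitimate words occur only once in a corpus of this size.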