# FLIP(01):  Advanced Data Science
**(Module 03: Natural Language Processing)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, but NOT allowed to change or distribute this package.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au)

---


# Session 07 - Analyzing Sentence Structure

# Some Grammatical Dilemmas

## Linguistic Data and Unlimited Possibilities

In this chapter, we will adopt the formal framework of “generative grammar,” in which a “language” is considered to be nothing more than an enormous collection of all grammatical sentences, and a grammar is a formal notation that can be used for “generating” the members of this set. Grammars use recursive productions of the form S → S and S, as we will explore in the Context-Free Grammar section below. In Chapter 10 we will extend this to automatically build up the meaning of a sentence out of the meanings of its parts.
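
To make the idea of "generating" concrete, here is a minimal sketch in plain Python (no NLTK). It enumerates the strings that the recursive production S → S and S yields from a pair of base sentences. The base sentences and the helper name `sentences` are assumptions chosen for illustration only.

```python
from itertools import product

# Hypothetical base sentences, standing in for the non-recursive expansions of S.
BASE = ["Mary walked", "Bob ate"]

def sentences(max_conjuncts):
    """List the strings built from 1..max_conjuncts base sentences joined by 'and'.

    Each extra conjunct corresponds to one more use of S -> S and S.
    """
    result = []
    for n in range(1, max_conjuncts + 1):
        for combo in product(BASE, repeat=n):
            result.append(" and ".join(combo))
    return result

# The set keeps growing as more applications of the recursive production are allowed.
for n in (1, 2, 3):
    print(n, len(sentences(n)))
```

Because the production can be applied any number of times, no finite bound on `max_conjuncts` exhausts the language, which is the sense in which the collection of grammatical sentences is unlimited.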

## Ubiquitous Ambiguity

Let’s take a closer look at the ambiguity in the phrase: I shot an elephant in my pajamas. First we need to define a simple grammar:

In [None]:
import nltk
# Build a CFG from its string representation (NLTK 3 API)
groucho_grammar = nltk.CFG.fromstring("""
                                S -> NP VP
                                PP -> P NP
                                NP -> Det N | Det N PP | 'I'
                                VP -> V NP | VP PP
                                Det -> 'an' | 'my'
                                N -> 'elephant' | 'pajamas'
                                V -> 'shot'
                                P -> 'in'
                                """)

In [None]:
sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']

In [None]:
parser = nltk.ChartParser(groucho_grammar)

In [None]:
# parse() returns an iterator over every parse licensed by the grammar
trees = list(parser.parse(sent))

In [None]:
for tree in trees:
    print(tree)

# Context-Free Grammar

## A Simple Grammar

Let’s start off by looking at a simple context-free grammar (CFG). By convention, the left-hand side of the first production is the start symbol of the grammar, typically S, and all well-formed trees must have this symbol as their root label. In NLTK, context-free grammars are defined in the nltk.grammar module.

In [None]:
grammar1 = nltk.CFG.fromstring("""
                            S -> NP VP
                            VP -> V NP | V NP PP
                            PP -> P NP
                            V -> "saw" | "ate" | "walked"
                            NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
                            Det -> "a" | "an" | "the" | "my"
                            N -> "man" | "dog" | "cat" | "telescope" | "park"
                            P -> "in" | "on" | "by" | "with"
                            """)

In [None]:
sent = "Mary saw Bob".split()

In [None]:
rd_parser = nltk.RecursiveDescentParser(grammar1)

In [None]:
for tree in rd_parser.parse(sent):
    print(tree)
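
The well-formedness condition can be checked without a parser. The following sketch, in plain Python, re-encodes the productions of grammar1 as a dictionary and represents trees as nested tuples. Both encodings, and the helper names `licensed` and `well_formed`, are assumptions made for illustration and are not part of NLTK.

```python
# grammar1 re-encoded by hand: left-hand side -> set of right-hand sides.
GRAMMAR1 = {
    "S": {("NP", "VP")},
    "VP": {("V", "NP"), ("V", "NP", "PP")},
    "PP": {("P", "NP")},
    "V": {("saw",), ("ate",), ("walked",)},
    "NP": {("John",), ("Mary",), ("Bob",), ("Det", "N"), ("Det", "N", "PP")},
    "Det": {("a",), ("an",), ("the",), ("my",)},
    "N": {("man",), ("dog",), ("cat",), ("telescope",), ("park",)},
    "P": {("in",), ("on",), ("by",), ("with",)},
}
START = "S"  # left-hand side of the first production

def licensed(tree):
    """True if every local tree (parent plus children) matches a production."""
    if isinstance(tree, str):  # a word at the fringe needs no further check
        return True
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    return rhs in GRAMMAR1.get(label, set()) and all(licensed(c) for c in children)

def well_formed(tree):
    """A well-formed tree is rooted in the start symbol and fully licensed."""
    return tree[0] == START and licensed(tree)

good = ("S", ("NP", "Mary"), ("VP", ("V", "saw"), ("NP", "Bob")))
fragment = ("VP", ("V", "saw"), ("NP", "Bob"))  # licensed, but not rooted in S
print(well_formed(good), well_formed(fragment))
```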

## Writing Your Own Grammars

If you are interested in experimenting with writing CFGs, you will find it helpful to create and edit your grammar in a text file, say, mygrammar.cfg. You can then load it into NLTK and parse with it as follows:

In [None]:
grammar1 = nltk.data.load('file:mygrammar.cfg')

In [None]:
sent = "Mary saw Bob".split()

In [None]:
rd_parser = nltk.RecursiveDescentParser(grammar1)

In [None]:
for tree in rd_parser.parse(sent):
    print(tree)

## Recursion in Syntactic Structure

The production Nom -> Adj Nom (where Nom is the category of nominals) involves direct recursion on the category Nom, whereas indirect recursion on S arises from the combination of two productions, namely S -> NP VP and VP -> V S.
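
The distinction between direct and indirect recursion can be computed from the productions themselves. The sketch below is plain Python and uses a hand-written dictionary holding only the phrase-level productions of the grammar defined in this section. The dictionary encoding and the helper names are assumptions for illustration.

```python
# Phrase-level productions only; lexical categories (Det, N, Adj, ...) are omitted.
GRAMMAR2 = {
    "S": [("NP", "VP")],
    "NP": [("Det", "Nom"), ("PropN",)],
    "Nom": [("Adj", "Nom"), ("N",)],
    "VP": [("V", "Adj"), ("V", "NP"), ("V", "S"), ("V", "NP", "PP")],
    "PP": [("P", "NP")],
}

def directly_recursive(grammar):
    """Categories that appear on the right-hand side of one of their own productions."""
    return {lhs for lhs, rhss in grammar.items() if any(lhs in rhs for rhs in rhss)}

def reachable(grammar, symbol):
    """Categories reachable from `symbol` by applying one or more productions."""
    seen, stack = set(), [symbol]
    while stack:
        current = stack.pop()
        for rhs in grammar.get(current, []):
            for child in rhs:
                if child in grammar and child not in seen:
                    seen.add(child)
                    stack.append(child)
    return seen

def recursive(grammar):
    """Categories that can reach themselves, directly or through other categories."""
    return {nt for nt in grammar if nt in reachable(grammar, nt)}

direct = directly_recursive(GRAMMAR2)
indirect = recursive(GRAMMAR2) - direct
print("direct:", sorted(direct))
print("indirect:", sorted(indirect))
```

Under this encoding, Nom is reported as directly recursive, while S and VP are recursive only through each other.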

In [None]:
grammar2 = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det Nom | PropN
    Nom -> Adj Nom | N
    VP -> V Adj | V NP | V S | V NP PP
    PP -> P NP
    PropN -> 'Buster' | 'Chatterer' | 'Joe'
    Det -> 'the' | 'a'
    N -> 'bear' | 'squirrel' | 'tree' | 'fish' | 'log'
    Adj -> 'angry' | 'frightened' | 'little' | 'tall'
    V -> 'chased' | 'saw' | 'said' | 'thought' | 'was' | 'put'
    P -> 'on'
    """)

# Parsing with Context-Free Grammar

A parser processes input sentences according to the productions of a grammar, and builds one or more constituent structures that conform to the grammar. A grammar is a declarative specification of well-formedness—it is actually just a string, not a program.
A parser is a procedural interpretation of the grammar. It searches through the space of trees licensed by a grammar to find one that has the required sentence along its fringe.

## Recursive Descent Parsing

The simplest kind of parser interprets a grammar as a specification of how to break a high-level goal into several lower-level subgoals. The top-level goal is to find an S. The S → NP VP production permits the parser to replace this goal with two subgoals: find an NP, then find a VP. Each of these subgoals can be replaced in turn by sub-subgoals, using productions that have NP and VP on their left-hand side. Eventually, this expansion process leads to subgoals such as: find the word telescope.

In [None]:
rd_parser = nltk.RecursiveDescentParser(grammar1)

In [None]:
sent = 'Mary saw a dog'.split()

In [None]:
for t in rd_parser.parse(sent):
    print(t)
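
The goal/subgoal expansion described above can be sketched as a small recognizer in plain Python. This is not how `nltk.RecursiveDescentParser` is implemented. The dictionary encoding of grammar1 and the function names are assumptions for illustration, and the sketch only answers yes or no rather than building trees.

```python
# grammar1 re-encoded by hand: each category maps to its possible expansions.
GRAMMAR = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP"), ("V", "NP", "PP")],
    "PP": [("P", "NP")],
    "V": [("saw",), ("ate",), ("walked",)],
    "NP": [("John",), ("Mary",), ("Bob",), ("Det", "N"), ("Det", "N", "PP")],
    "Det": [("a",), ("an",), ("the",), ("my",)],
    "N": [("man",), ("dog",), ("cat",), ("telescope",), ("park",)],
    "P": [("in",), ("on",), ("by",), ("with",)],
}

def expand(symbol, tokens, pos):
    """Yield every position where `symbol` can end if it starts at `pos`."""
    if symbol not in GRAMMAR:  # subgoal "find the word ...": match the next token
        if pos < len(tokens) and tokens[pos] == symbol:
            yield pos + 1
        return
    for rhs in GRAMMAR[symbol]:  # replace the goal by the subgoals of each production
        yield from expand_sequence(rhs, tokens, pos)

def expand_sequence(symbols, tokens, pos):
    """Satisfy a sequence of subgoals from left to right."""
    if not symbols:
        yield pos
        return
    for mid in expand(symbols[0], tokens, pos):
        yield from expand_sequence(symbols[1:], tokens, mid)

def recognize(tokens, start="S"):
    """True if some expansion of `start` covers the whole token list."""
    return any(end == len(tokens) for end in expand(start, tokens, 0))

print(recognize("Mary saw a dog".split()))
print(recognize("saw Mary a dog".split()))
```

Like any naive top-down strategy, this sketch would loop forever on a left-recursive production such as NP -> NP PP; grammar1 contains none.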

## Shift-Reduce Parsing

A simple kind of bottom-up parser is the shift-reduce parser. In common with all bottom-up parsers, a shift-reduce parser tries to find sequences of words and phrases that correspond to the right-hand side of a grammar production, and replace them with the left-hand side, until the whole sentence is reduced to an S.

In [None]:
sr_parse = nltk.ShiftReduceParser(grammar1)

In [None]:
sent = 'Mary saw a dog'.split()

In [None]:
# parse() yields the trees it finds; a shift-reduce parser produces at most one
for tree in sr_parse.parse(sent):
    print(tree)
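
The shift and reduce steps can be illustrated with a greedy sketch in plain Python. It reduces whenever the top of the stack matches the right-hand side of a production and shifts otherwise. The list encoding of grammar1, the rule "take the first matching production", and the function name are assumptions for illustration; NLTK's own parser may resolve conflicts differently.

```python
# Phrase-level productions of grammar1, followed by its lexical productions.
PRODUCTIONS = [
    ("S", ("NP", "VP")),
    ("VP", ("V", "NP")),
    ("VP", ("V", "NP", "PP")),
    ("PP", ("P", "NP")),
    ("NP", ("Det", "N")),
    ("NP", ("Det", "N", "PP")),
]
LEXICON = {
    "V": ["saw", "ate", "walked"],
    "NP": ["John", "Mary", "Bob"],
    "Det": ["a", "an", "the", "my"],
    "N": ["man", "dog", "cat", "telescope", "park"],
    "P": ["in", "on", "by", "with"],
}
PRODUCTIONS += [(cat, (word,)) for cat, words in LEXICON.items() for word in words]

def shift_reduce(tokens):
    """Return the final stack after greedy shift-reduce processing."""
    stack, queue = [], list(tokens)
    while True:
        for lhs, rhs in PRODUCTIONS:
            if tuple(stack[-len(rhs):]) == rhs:  # reduce: RHS found on top of stack
                stack[-len(rhs):] = [lhs]
                break
        else:
            if not queue:                        # nothing to reduce, nothing to shift
                return stack
            stack.append(queue.pop(0))           # shift the next word

print(shift_reduce("Mary saw a dog".split()))
print(shift_reduce("Mary saw a dog in the park".split()))
```

The second call ends with more than a single S on the stack even though grammar1 licenses the sentence through VP -> V NP PP. The greedy strategy reduced "Mary saw a dog" to S too early and cannot backtrack, which illustrates why a shift-reduce parser may fail to find an existing parse.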

## Well-Formed Substring Tables

The simple parsers discussed in the previous sections suffer from limitations in both completeness and efficiency. In order to remedy these, we will apply the algorithm design technique of dynamic programming to the parsing problem.

In [None]:
text = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']

In [None]:
def init_wfst(tokens, grammar):
    numtokens = len(tokens)
    wfst = [[None for i in range(numtokens+1)] for j in range(numtokens+1)]
    for i in range(numtokens):
        productions = grammar.productions(rhs=tokens[i])
        wfst[i][i+1] = productions[0].lhs()
    return wfst

In [None]:
def complete_wfst(wfst, tokens, grammar, trace=False):
    index = dict((p.rhs(), p.lhs()) for p in grammar.productions())
    numtokens = len(tokens)
    for span in range(2, numtokens+1):
        for start in range(numtokens+1-span):
            end = start + span
            for mid in range(start+1, end):
                nt1, nt2 = wfst[start][mid], wfst[mid][end]
                if nt1 and nt2 and (nt1,nt2) in index:
                    wfst[start][end] = index[(nt1,nt2)]
                    if trace:
                        print("[%s] %3s [%s] %3s [%s] ==> [%s] %3s [%s]" % \
                        (start, nt1, mid, nt2, end, start, index[(nt1,nt2)], end))
    return wfst

In [None]:
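def display(wfst, tokens):
    # Header row: the end positions 1..n
    print('\nWFST ' + ' '.join(("%-4d" % i) for i in range(1, len(wfst))))
    for i in range(len(wfst)-1):
        # One row per start position; '.' marks an empty cell
        print("%d   " % i, end=" ")
        for j in range(1, len(wfst)):
            print("%-4s" % (wfst[i][j] or '.'), end=" ")
        print()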

In [None]:
tokens = "I shot an elephant in my pajamas".split()

In [None]:
wfst0 = init_wfst(tokens, groucho_grammar)

In [None]:
display(wfst0, tokens)

In [None]:
wfst1 = complete_wfst(wfst0, tokens, groucho_grammar)

In [None]:
display(wfst1, tokens)

In [None]:
wfst1 = complete_wfst(wfst0, tokens, groucho_grammar, trace=True)

# Dependencies and Dependency Grammar

Phrase structure grammar is concerned with how words and sequences of words combine
to form constituents. A distinct and complementary approach, dependency
grammar, focuses instead on how words relate to other words. Dependency is a binary
asymmetric relation that holds between a head and its dependents. The head of a
sentence is usually taken to be the tensed verb, and every other word is either dependent
on the sentence head or connects to it through a path of dependencies.

In [None]:
groucho_dep_grammar = nltk.DependencyGrammar.fromstring("""
    'shot' -> 'I' | 'elephant' | 'in'
    'elephant' -> 'an' | 'in'
    'in' -> 'pajamas'
    'pajamas' -> 'my'
    """)

In [None]:
print(groucho_dep_grammar)

In [None]:
pdp = nltk.ProjectiveDependencyParser(groucho_dep_grammar)

In [None]:
sent = 'I shot an elephant in my pajamas'.split()

In [None]:
trees = pdp.parse(sent)

In [None]:
for tree in trees:
    print(tree)
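
The claim that every word connects to the sentence head through a path of dependencies can be checked on one analysis of the example sentence. The sketch below is plain Python. The `HEADS` mapping (one of the two possible attachments of "in") and the helper `path_to_root` are assumptions for illustration, and using words as dictionary keys only works because no word is repeated in this sentence.

```python
# One analysis: each word maps to its head; the sentence head maps to None.
HEADS = {
    "I": "shot",
    "shot": None,        # tensed verb: head of the sentence
    "an": "elephant",
    "elephant": "shot",
    "in": "shot",        # attaching "in" to "elephant" gives the other reading
    "my": "pajamas",
    "pajamas": "in",
}

def path_to_root(word, heads):
    """Follow head links from `word` up to the sentence head."""
    path = [word]
    while heads[path[-1]] is not None:
        head = heads[path[-1]]
        if head in path:  # a cycle would mean this is not a valid dependency tree
            raise ValueError("cycle through %r" % head)
        path.append(head)
    return path

for word in "I shot an elephant in my pajamas".split():
    print(" -> ".join(path_to_root(word, HEADS)))
```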

# Grammar Development

Parsing builds trees over sentences, according to a phrase structure grammar. Now, all
the examples we gave earlier only involved toy grammars containing a handful of productions.

## Treebanks and Grammars

The corpus module defines the treebank corpus reader, which contains a 10% sample
of the Penn Treebank Corpus.

In [None]:
from nltk.corpus import treebank

In [None]:
t = treebank.parsed_sents('wsj_0001.mrg')[0]

In [None]:
print(t)

In [None]:
def filter(tree):
    # Select VP subtrees that have a sentential (S) child
    child_nodes = [child.label() for child in tree
        if isinstance(child, nltk.Tree)]
    return (tree.label() == 'VP') and ('S' in child_nodes)

In [None]:
from nltk.corpus import treebank

In [None]:
[subtree for tree in treebank.parsed_sents()
         for subtree in tree.subtrees(filter)]

In [None]:
entries = nltk.corpus.ppattach.attachments('training')

In [None]:
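from collections import defaultdict
# key -> attachment ('N' or 'V') -> set of verbs seen with that attachment
table = defaultdict(lambda: defaultdict(set))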

In [None]:
for entry in entries:
    key = entry.noun1 + '-' + entry.prep + '-' + entry.noun2
    table[key][entry.attachment].add(entry.verb)

In [None]:
for key in sorted(table):
    if len(table[key]) > 1:
        print(key, 'N:', sorted(table[key]['N']), 'V:', sorted(table[key]['V']))

In [None]:
nltk.corpus.sinica_treebank.parsed_sents()[3450].draw()

## Pernicious Ambiguity

Unfortunately, as the coverage of the grammar increases and the length of the input
sentences grows, the number of parse trees grows rapidly. In fact, it grows at an astronomical
rate.

In [None]:
grammar = nltk.CFG.fromstring("""
                        S -> NP V NP
                        NP -> NP Sbar
                        Sbar -> NP V
                        NP -> 'fish'
                        V -> 'fish'
                        """)

In [None]:
tokens = ["fish"] * 5

In [None]:
cp = nltk.ChartParser(grammar)

In [None]:
for tree in cp.parse(tokens):
    print(tree)
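
To see how quickly the number of analyses grows, the parses can be counted without building them. The following plain-Python sketch re-encodes the fish grammar as a dictionary and counts trees per span with memoization. The encoding and the function name `count_parses` are assumptions for illustration, and the approach relies on the grammar having no empty productions and no unary cycles.

```python
from functools import lru_cache

# The fish grammar, re-encoded by hand.
GRAMMAR = {
    "S": [("NP", "V", "NP")],
    "NP": [("NP", "Sbar"), ("fish",)],
    "Sbar": [("NP", "V")],
    "V": [("fish",)],
}

def count_parses(tokens, start="S"):
    """Count the trees rooted in `start` that cover all of `tokens`."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        """Number of trees rooted in `symbol` spanning tokens[i:j]."""
        if symbol not in GRAMMAR:  # a word: must match exactly one token
            return int(j == i + 1 and tokens[i] == symbol)
        return sum(count_seq(rhs, i, j) for rhs in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def count_seq(symbols, i, j):
        """Ways for `symbols` to jointly span tokens[i:j], one or more tokens each."""
        if len(symbols) == 1:
            return count(symbols[0], i, j)
        # The first symbol covers tokens[i:k]; keep one token for each remaining symbol.
        return sum(count(symbols[0], i, k) * count_seq(symbols[1:], k, j)
                   for k in range(i + 1, j - len(symbols) + 2))

    return count(start, 0, len(tokens))

for n in (3, 5, 7, 9, 11):
    print(n, count_parses(["fish"] * n))
```

For sentence lengths 3, 5, 7, 9 and 11 the counts are 1, 2, 5, 14 and 42.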

## Weighted Grammar

As we have just seen, dealing with ambiguity is a key challenge in developing broad-coverage
parsers. Chart parsers improve the efficiency of computing multiple parses of
the same sentences, but they are still overwhelmed by the sheer number of possible
parses. Weighted grammars and probabilistic parsing algorithms have provided an effective
solution to these problems.

In [None]:
def give(t):
    # VP headed by give/gave whose second child is an NP and third a PP-DTV or NP
    return t.label() == 'VP' and len(t) > 2 and t[1].label() == 'NP'\
            and (t[2].label() == 'PP-DTV' or t[2].label() == 'NP')\
            and ('give' in t[0].leaves() or 'gave' in t[0].leaves())

In [None]:
def sent(t):
    return ' '.join(token for token in t.leaves() if token[0] not in '*-0')

In [None]:
def print_node(t, width):
    # Show the verb and its two complements, truncated to `width` characters
    output = "%s %s: %s / %s: %s" %\
        (sent(t[0]), t[1].label(), sent(t[1]), t[2].label(), sent(t[2]))
    if len(output) > width:
        output = output[:width] + "..."
    print(output)

In [None]:
for tree in nltk.corpus.treebank.parsed_sents():
    for t in tree.subtrees(give):
        print_node(t, 72)

In [None]:
grammar = nltk.PCFG.fromstring("""
                        S -> NP VP [1.0]
                        VP -> TV NP [0.4]
                        VP -> IV [0.3]
                        VP -> DatV NP NP [0.3]
                        TV -> 'saw' [1.0]
                        IV -> 'ate' [1.0]
                        DatV -> 'gave' [1.0]
                        NP -> 'telescopes' [0.8]
                        NP -> 'Jack' [0.2]
                        """)

In [None]:
print(grammar)

In [None]:
viterbi_parser = nltk.ViterbiParser(grammar)

In [None]:
# parse() yields the most probable tree for the sentence
for tree in viterbi_parser.parse(['Jack', 'saw', 'telescopes']):
    print(tree)
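
The probability a weighted grammar assigns to a parse is the product of the probabilities of the productions used in it. The sketch below recomputes this by hand in plain Python for the example sentence. The dictionary encoding of the grammar, the nested-tuple tree, and the helper names are assumptions for illustration and do not reflect how `nltk.ViterbiParser` works internally.

```python
from collections import defaultdict
from math import prod

# The weighted grammar above, re-encoded by hand: (lhs, rhs) -> probability.
PCFG = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("TV", "NP")): 0.4,
    ("VP", ("IV",)): 0.3,
    ("VP", ("DatV", "NP", "NP")): 0.3,
    ("TV", ("saw",)): 1.0,
    ("IV", ("ate",)): 1.0,
    ("DatV", ("gave",)): 1.0,
    ("NP", ("telescopes",)): 0.8,
    ("NP", ("Jack",)): 0.2,
}

def lhs_totals(pcfg):
    """Sum the probabilities of all productions sharing a left-hand side."""
    totals = defaultdict(float)
    for (lhs, _), p in pcfg.items():
        totals[lhs] += p
    return dict(totals)

def tree_probability(tree):
    """Multiply the probabilities of every production used in `tree`."""
    if isinstance(tree, str):  # words contribute no factor of their own
        return 1.0
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    return PCFG[(label, rhs)] * prod(tree_probability(c) for c in children)

tree = ("S", ("NP", "Jack"), ("VP", ("TV", "saw"), ("NP", "telescopes")))
print(lhs_totals(PCFG))        # each total should be 1, up to rounding
print(tree_probability(tree))  # 1.0 * 0.2 * 0.4 * 1.0 * 0.8, about 0.064
```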