# Exercise 2: Sensible PP attachment

In this exercise, we will learn about **POS tagging** and **dependency parsing** and study the well-known **PP attachment problem**.

## Introduction and POS tagging

First, let's take a look at spaCy's Part-of-Speech (POS) tagging and dependency parsing abilities. Here's how we load a sentence into a spaCy document object and view its dependency parse:

In [40]:
import spacy

# matplotlib theme
from jupyterthemes import jtplot
jtplot.style()

from spacy import displacy
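# Note: the 'en' shortcut exists only in spaCy 2.x; on spaCy 3+ load a full
# package name such as 'en_core_web_sm' instead.
nlp = spacy.load('en')
test_doc = nlp('I write code.')
displacy.render(test_doc, jupyter=True)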

[displaCy dependency diagram for "I write code." with POS tags PRON, VERB, NOUN, PUNCT]


spaCy also tokenizes the sentence for you. You can view tokens and their POS tags as follows:

In [5]:
print([(token, token.pos_) for token in test_doc])

[(I, 'PRON'), (write, 'VERB'), (code, 'NOUN'), (., 'PUNCT')]
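
Each token also records its dependency label and its head, which is the information displaCy draws as arcs. A minimal sketch of a text-only view, assuming only the standard spaCy token attributes `text`, `pos_`, `dep_` and `head`; the helper name `token_table` is ours and not part of spaCy:

```python
def token_table(doc):
    """Return one (text, POS tag, dependency label, head text) row per token."""
    return [(tok.text, tok.pos_, tok.dep_, tok.head.text) for tok in doc]
```

Calling `token_table(test_doc)` gives one row per token. In spaCy the root of a sentence is its own head, so the root row repeats the token's text in the last column.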


Now let's try applying this to a real dataset. NLTK includes an API for accessing many free open textual corpora, including the Project Gutenberg collection of public domain books. We'll load an array of the sentences of Jane Austen's 1811 novel *Sense and Sensibility* for our tests:

In [6]:
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg
sentences = gutenberg.sents('austen-sense.txt')

[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/jeremybensoussan/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


**Questions:**
  1. How many sentences are in the novel? How many unique tokens?
  2. How many unique verbs are in the novel? What are the five most common verbs?



In [45]:
from collections import Counter, defaultdict

# pos_dict maps each POS tag to a Counter of the token texts seen with that tag
pos_dict = defaultdict(Counter)
unique_tokens = set()
for sentence in sentences:
    # NLTK yields each sentence as a list of words; rejoin it so spaCy can tokenize and tag it
    for tok in nlp(' '.join(sentence)):
        pos_dict[tok.pos_][tok.text] += 1
        unique_tokens.add(tok.text)

In [59]:
print('There are {} sentences in the novel'.format(len(sentences)))
print('There are {} unique tokens'.format(len(unique_tokens)))
print('There are {} unique verbs'.format(len(pos_dict['VERB'])))
print('The 5 most common verbs are: {}'.format(sorted(pos_dict['VERB'].items(), key=lambda kv: kv[1], reverse=True)[:5]))

There are 4999 sentences in the novel
There are 6784 unique tokens
There are 2429 unique verbs
The 5 most common verbs are: [('was', 1846), ('be', 1305), ('had', 969), ('have', 807), ('is', 725)]
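
Note that these counts are per surface form, so 'was', 'be' and 'is' are counted as three different verbs. A minimal sketch of an alternative that groups inflected forms together, assuming tokens expose spaCy's `lemma_` attribute; the helper names `verb_lemma_counts` and `merge_counts` are hypothetical:

```python
from collections import Counter

def verb_lemma_counts(doc):
    """Count the verbs of one parsed sentence by lowercased lemma."""
    return Counter(tok.lemma_.lower() for tok in doc if tok.pos_ == 'VERB')

def merge_counts(docs):
    """Sum the per-sentence verb counts over an iterable of parsed sentences."""
    total = Counter()
    for doc in docs:
        total.update(verb_lemma_counts(doc))
    return total
```

With the corpus above, `merge_counts(nlp(' '.join(s)) for s in sentences).most_common(5)` would list the most frequent verb lemmas. The exact lemmas depend on the model in use.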


## Dependency parsing and PP attachment

As we saw above, spaCy also generates dependency parses that we can plot. These represent the grammatical relations that connect the different words and phrases in a sentence.

For the next task, we will consider how verbs and prepositional phrases can be related in sentences. (A *prepositional phrase* or *PP* is a phrase like "in the house", "on the table", or "with my friend", which is headed by a preposition like "in", "on", or "with".)

**Questions:**
  3. What is the difference between the prepositional phrases in the sentences in (A) and those in (B)? Plot their dependency parses with displacy.render and look for a difference in structure.

(A)
  * I eat an apple in my house.
  * We listen to music at the theater.
  * John visited Brazil with his friend.
  
(B)
  * I see a fly in my soup.
  * She knows the man at the store.
  * I photographed a man with a bowtie.

In [67]:
displacy_options = {'compact': True, 'bg': '#006699',
           'color': 'white', 'font': 'Source Sans Pro'}
# (A) Dependency parses
displacy.render(nlp('I eat an apple in my house.'), options=displacy_options, jupyter = True)
displacy.render(nlp('We listen to music at the theater.'), options=displacy_options, jupyter = True)
displacy.render(nlp('John visited Brazil with his friend.'), options=displacy_options, jupyter = True)

In [68]:
displacy_options = {'compact': True, 'bg': '#cc6600',
           'color': 'white', 'font': 'Source Sans Pro'}
# (B) Dependency parses
displacy.render(nlp('I see a fly in my soup.'), options=displacy_options, jupyter = True)
displacy.render(nlp('She knows the man at the store.'), options=displacy_options, jupyter = True)
displacy.render(nlp('I photographed a man with a bowtie.'), options=displacy_options, jupyter = True)

The prepositional phrases in the three (A) examples attach to the verb: they describe where or with whom the action takes place, and therefore also locate the subject performing it.
This can be checked by asking a question about the prepositional phrase:
- who is in my house? -> I am
- who is at the theater? -> we are
- who is with his friend? -> John is

In contrast, the prepositional phrases in group (B) attach to the direct object rather than to the verb; they describe the object itself:
- what is in my soup? -> the fly
- who is at the store? -> the man
- who has a bowtie? -> the man

Kudos to spaCy for correctly making such a subtle distinction!
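
The attachment can also be read off programmatically instead of from the diagrams. A minimal sketch, assuming the parser labels prepositional modifiers with the `prep` dependency; `pp_attachments` is a hypothetical helper name:

```python
def pp_attachments(doc):
    """Return (preposition, head text, head POS) for every token labelled 'prep'."""
    return [(tok.text, tok.head.text, tok.head.pos_)
            for tok in doc if tok.dep_ == 'prep']
```

For example, `pp_attachments(nlp('I see a fly in my soup.'))` shows which word each preposition hangs from. A head tagged `VERB` corresponds to case (A), while a noun head corresponds to case (B).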

As you can imagine, it is not simple for the parser to decide where the prepositional phrase should be attached -- this is the **PP attachment problem**. Let's evaluate spaCy's default behavior towards PP attachment on our *Sense and Sensibility* corpus:

**Questions:**
  4. Make an array of all tuples (verb, preposition) for prepositional phrases attached to the verb (like (A) above). Hint: for a spaCy token object *token*, you can get its children with *token.children* and each child's relation to it with *child.dep_*. What are the first five (verb, preposition) pairs in this case?
  5. Do the same where the prepositional phrase is attached to the verb's object (case (B)). What are the five most common (verb, preposition) pairs in this case?

**Bonus:** Look at a few random sentences from the corpus that are parsed as (A) or (B). Do you agree with the given parse? Why or why not?

In [81]:
# List all (verb, preposition) tuples where the PP is attached to the verb (case (A))
verb_prep_list = []
for sentence in sentences:
    for token in nlp(' '.join(sentence)):
        if token.pos_ == 'VERB':
            for child in token.children:
                if child.dep_ == 'prep':
                    verb_prep_list.append((token.text, child.text))

In [84]:
from collections import Counter
Counter(verb_prep_list).most_common(5)

[(('was', 'in'), 124),
 (('be', 'in'), 68),
 (('was', 'at'), 45),
 (('was', 'for'), 40),
 (('is', 'in'), 37)]
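
For case (B), the same traversal can go one level deeper: from the verb to its direct object, then to the object's prepositional children. A minimal sketch, assuming the parser labels direct objects `dobj` and prepositions `prep`; the helper name `object_attached_pairs` is hypothetical:

```python
def object_attached_pairs(doc):
    """Collect (verb, preposition) pairs where the PP hangs off the verb's direct object."""
    pairs = []
    for token in doc:
        if token.pos_ != 'VERB':
            continue
        for child in token.children:
            # Only follow the direct object of the verb
            if child.dep_ != 'dobj':
                continue
            for grandchild in child.children:
                # A preposition below the object means the PP modifies the object
                if grandchild.dep_ == 'prep':
                    pairs.append((token.text, grandchild.text))
    return pairs
```

Collecting these pairs over every sentence of the corpus and passing the resulting list to `Counter(...).most_common(5)` would give the five most common pairs for this case.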