# Exercise 2: Sensible PP attachment

In this exercise, we will learn about **POS tagging** and **dependency parsing** and study the well-known **PP attachment problem**.

## Introduction and POS tagging

First, let's take a look at spaCy's Part-of-Speech (POS) tagging and dependency parsing abilities. Here's how we load a sentence into a spaCy document object and view its dependency parse:

In [1]:
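import spacy
from spacy import displacy
# load the small English pipeline (tagger, parser, ...)
nlp = spacy.load('en_core_web_sm')
test_doc = nlp('I write code.')
# draw the dependency parse inline in the notebook
displacy.render(test_doc, jupyter=True)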

spaCy also tokenizes the sentence for you. You can view tokens and their POS tags as follows:

In [2]:
print([(token, token.pos_) for token in test_doc])

[(I, 'PRON'), (write, 'VERB'), (code, 'NOUN'), (., 'PUNCT')]


Now let's try applying this to a real dataset. NLTK includes an API for accessing many free open textual corpora, including the Project Gutenberg collection of public domain books. We'll load an array of the sentences of Jane Austen's 1811 novel *Sense and Sensibility* for our tests:

In [3]:
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg
sentences = gutenberg.sents('austen-sense.txt')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
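
`gutenberg.sents` provides each sentence as a list of token strings rather than as raw text, which is why the parsing cell below rebuilds a string with `' '.join(sentence)` before handing it to spaCy. The following minimal sketch uses a hand-written token list (assumed here to match the first parsed sentence shown later in this notebook) to illustrate the side effect: punctuation ends up separated from the preceding word by a space.

```python
# A pre-tokenized sentence in the shape NLTK returns: a list of token strings.
# The list is written by hand for illustration, not read from the corpus.
tokens = ["The", "family", "of", "Dashwood", "had", "long",
          "been", "settled", "in", "Sussex", "."]

# Rejoining with single spaces gives the text that is passed to nlp(...).
text = " ".join(tokens)
print(text)  # The family of Dashwood had long been settled in Sussex .
```

This is why the sentences printed in the later outputs contain sequences such as `Mr . Henry Dashwood`.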


In [4]:
# for answering questions 1-2:
from nltk import FreqDist
from tqdm import tqdm
# number of sentences and number of unique (case-sensitive) tokens
print(len(sentences), len({token for sentence in sentences for token in sentence}))
# parse every sentence; the corpus is pre-tokenized, so rejoin the tokens with spaces
docs = [nlp(' '.join(sentence)) for sentence in tqdm(sentences, desc='Parsing sentences')]
verb_tokens = [token for doc in docs for token in doc if token.pos_ == 'VERB']
# most common verbs -- counting inflections separately
verbs = FreqDist([token.text.lower() for token in verb_tokens])
print(verbs.most_common(5))
# most common verbs -- lemmatized
verb_lemmas = FreqDist([token.lemma_ for token in verb_tokens])
print(verb_lemmas.most_common(5))

4999 6828
Parsing sentences: 100%|██████████| 4999/4999 [01:52<00:00, 44.59it/s]


[('was', 1861), ('be', 1305), ('had', 994), ('have', 819), ('is', 757)]
[('be', 5426), ('have', 2084), ('do', 729), ('say', 609), ('could', 578)]


**Questions:**
  1. How many sentences are in the novel? **4999** How many unique tokens? **6828**
  2. How many unique verbs are in the novel? What are the five most common verbs? **was, be, had, have, is** (counting inflected forms separately) or **be, have, do, say, could** (counting lemmas)
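
The first half of question 2 (the number of unique verbs) comes down to counting distinct keys. `FreqDist` is a subclass of `collections.Counter`, so `len(verbs)` and `len(verb_lemmas)` should give the number of distinct inflected forms and distinct lemmas respectively. The sketch below shows the same idea with a plain `Counter`; the `tagged_tokens` list is made-up sample data standing in for spaCy tokens, and the names `form_counts` and `lemma_counts` are hypothetical.

```python
from collections import Counter

# Hypothetical stand-ins for spaCy tokens: (text, lemma, coarse POS tag).
tagged_tokens = [
    ("She", "she", "PRON"), ("said", "say", "VERB"),
    ("he", "he", "PRON"), ("knew", "know", "VERB"),
    ("and", "and", "CCONJ"), ("Said", "say", "VERB"),
    ("it", "it", "PRON"), ("says", "say", "VERB"),
]

# Count lowercased surface forms and lemmas of verbs only.
form_counts = Counter(text.lower() for text, lemma, pos in tagged_tokens if pos == "VERB")
lemma_counts = Counter(lemma for text, lemma, pos in tagged_tokens if pos == "VERB")

# The number of distinct keys is the number of unique verbs under each definition.
print(len(form_counts), len(lemma_counts))  # 3 2
print(lemma_counts.most_common(1))          # [('say', 3)]
```

Lemmatizing merges "said" and "says" into one entry, which is why the two counts differ.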



## Dependency parsing and PP attachment

As we saw above, spaCy also generates dependency parses that we can plot. These represent the grammatical relations that connect the different words and phrases in a sentence.

For the next task, we will consider how verbs and prepositional phrases can be related in sentences. (A *prepositional phrase* or *PP* is a phrase like "in the house", "on the table", or "with my friend", which is headed by a preposition like "in", "on", or "with".)

**Questions:**
  3. What is the difference between the prepositional phrases in the sentences in (A) and those in (B)? Plot their dependency parses with displacy.render and look for a difference in structure.

(A)
  * I eat an apple in my house.
  * We listen to music at the theater.
  * John visited Brazil with his friend.
  
(B)
  * I see a fly in my soup.
  * She knows the man at the store.
  * I photographed a man with a bowtie.

In [5]:
# answer to 3: render each parse and compare where the preposition attaches
examples = [
    'I eat an apple in my house.',            # (A)
    'We listen to music at the theater.',     # (A)
    'John visited Brazil with his friend.',   # (A)
    'I see a fly in my soup.',                # (B)
    'She knows the man at the store.',        # (B)
    'I photographed a man with a bowtie.',    # (B)
]
for sentence in examples:
    displacy.render(nlp(sentence), jupyter=True, options={'compact': True})

As you can imagine, it is not simple for the parser to decide where the prepositional phrase should be attached -- this is the **PP attachment problem**. Let's evaluate spaCy's default behavior towards PP attachment on our *Sense and Sensibility* corpus:
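
The contrast between the two attachment sites can be made concrete without running a parser. The sketch below is a toy, library-free model: the `Tok` tuple, the `pp_attachments` helper, and both parses are hand-written for illustration (they are not spaCy output), borrowing label names such as `prep` and `dobj` from the code in this notebook.

```python
from typing import List, NamedTuple, Tuple

class Tok(NamedTuple):
    text: str
    pos: str   # coarse part-of-speech tag
    head: int  # index of the syntactic head (the root points to itself)
    dep: str   # relation label linking this token to its head

def pp_attachments(sentence: List[Tok]) -> List[Tuple[str, str, str]]:
    """Return (head_text, preposition, attachment_site) for each preposition."""
    results = []
    for tok in sentence:
        if tok.dep != "prep":
            continue
        head = sentence[tok.head]
        if head.pos == "VERB":
            site = "verb"
        elif head.pos in ("NOUN", "PROPN"):
            site = "noun"
        else:
            site = "other"
        results.append((head.text, tok.text, site))
    return results

# (A) "I eat an apple in my house": "in" is annotated as a child of the verb "eat".
sent_a = [
    Tok("I", "PRON", 1, "nsubj"), Tok("eat", "VERB", 1, "ROOT"),
    Tok("an", "DET", 3, "det"), Tok("apple", "NOUN", 1, "dobj"),
    Tok("in", "ADP", 1, "prep"), Tok("my", "PRON", 6, "poss"),
    Tok("house", "NOUN", 4, "pobj"),
]

# (B) "I see a fly in my soup": "in" is annotated as a child of the noun "fly".
sent_b = [
    Tok("I", "PRON", 1, "nsubj"), Tok("see", "VERB", 1, "ROOT"),
    Tok("a", "DET", 3, "det"), Tok("fly", "NOUN", 1, "dobj"),
    Tok("in", "ADP", 3, "prep"), Tok("my", "PRON", 6, "poss"),
    Tok("soup", "NOUN", 4, "pobj"),
]

print(pp_attachments(sent_a))  # [('eat', 'in', 'verb')]
print(pp_attachments(sent_b))  # [('fly', 'in', 'noun')]
```

The two sentences have the same surface pattern (verb, object, PP); only the head index of the preposition differs. With real spaCy tokens, the equivalent check would presumably use `token.dep_` and `token.head.pos_`.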

**Questions:**
  4. Make an array of all tuples (verb, preposition) for prepositional phrases attached to the verb (like (A) above). Hint: for a spaCy token object *token*, you can get its children with *token.children* and each child's relation to it with *child.dep_*. What are the first five (verb, preposition) pairs in this case? **(settle, in), (be, at), (be, in), (live, for), (live, in)**
  5. Do the same where the prepositional phrase is attached to the verb's object (case (B)). What are the first five (verb, preposition) pairs in this case? **(engage, of), (have, in), (receive, of), (give, of), (have, in)**

**Bonus:** Look at a few random sentences from the corpus that are parsed as (A) or (B). Do you agree with the given parse? Why or why not?
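
To rank pairs by frequency rather than by order of appearance, the extracted tuples can be fed to a counter. This minimal sketch uses the five pairs given in the answer to question 5 as sample data; the variable names are hypothetical, and applying the same step to the whole novel would mean collecting the (verb, preposition) tuples from every parsed sentence first.

```python
from collections import Counter

# The five (verb, preposition) pairs from the answer to question 5,
# in order of appearance.
pairs = [
    ("engage", "of"), ("have", "in"), ("receive", "of"),
    ("give", "of"), ("have", "in"),
]

# Count identical pairs and ask for the most frequent one.
pair_counts = Counter(pairs)
print(pair_counts.most_common(1))  # [(('have', 'in'), 2)]

# Counting only the preposition shows which ones attach to objects most often.
prep_counts = Counter(prep for _, prep in pairs)
print(prep_counts.most_common())   # [('of', 3), ('in', 2)]
```

Because `FreqDist` behaves like a `Counter`, it could be used here in the same way.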

In [6]:
# answers to 4 and 5
# case (A): a preposition ('prep') that is a direct child of a verb
[(token.lemma_, child.lemma_, doc.text)
 for doc in docs[:20]
 for token in doc
 if token.pos_ == 'VERB'
 for child in token.children
 if child.dep_ == 'prep'
][:5]

[('settle', 'in', 'The family of Dashwood had long been settled in Sussex .'),
 ('be',
  'at',
  'Their estate was large , and their residence was at Norland Park , in the centre of their property , where , for many generations , they had lived in so respectable a manner as to engage the general good opinion of their surrounding acquaintance .'),
 ('be',
  'in',
  'Their estate was large , and their residence was at Norland Park , in the centre of their property , where , for many generations , they had lived in so respectable a manner as to engage the general good opinion of their surrounding acquaintance .'),
 ('live',
  'for',
  'Their estate was large , and their residence was at Norland Park , in the centre of their property , where , for many generations , they had lived in so respectable a manner as to engage the general good opinion of their surrounding acquaintance .'),
 ('live',
  'in',
  'Their estate was large , and their residence was at Norland Park , in the centre of the

In [7]:
# case (B): a preposition ('prep') that is a child of the verb's direct object ('dobj')
[(token.lemma_, child2.lemma_, doc.text)
 for doc in docs[:20]
 for token in doc
 if token.pos_ == 'VERB'
 for child in token.children
 if child.dep_ == 'dobj'
 for child2 in child.children
 if child2.dep_ == 'prep'
][:5]

[('engage',
  'of',
  'Their estate was large , and their residence was at Norland Park , in the centre of their property , where , for many generations , they had lived in so respectable a manner as to engage the general good opinion of their surrounding acquaintance .'),
 ('have',
  'in',
  'The late owner of this estate was a single man , who lived to a very advanced age , and who for many years of his life , had a constant companion and housekeeper in his sister .'),
 ('receive',
  'of',
  'But her death , which happened ten years before his own , produced a great alteration in his home ; for to supply her loss , he invited and received into his house the family of his nephew Mr . Henry Dashwood , the legal inheritor of the Norland estate , and the person to whom he intended to bequeath it .'),
 ('give',
  'of',
  'The constant attention of Mr . and Mrs . Henry Dashwood to his wishes , which proceeded not merely from interest , but from goodness of heart , gave him every degree of 