In this notebook, you'll explore part-of-speech (POS) tagging with the Penn Treebank tagset and evaluate how well spaCy's POS tagger performs.

In [None]:
import spacy, glob, os

In [None]:
# only the tagger is needed here, so skip the named entity recognizer and the parser
# (each component name must be its own list item)
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

In [None]:
def get_spacy_tags(text):
    doc=nlp(text)
    for word in doc:
        print(word.text, word.tag_)

get_spacy_tags("Open the pod bay doors Hal")

In [None]:
def read_docs(inputDir, maxDocs=100):
    """ Read in movie documents (all ending in .txt) from an input folder
    and process with spacy """
    
    docs=[]
    for idx, filename in enumerate(glob.glob(os.path.join(inputDir, '*.txt'))):
        # stop once maxDocs documents have been read
        if idx >= maxDocs:
            break
        with open(filename) as file:
            docs.append((filename, nlp(file.read())))
    return docs

In [None]:
# directory with 2000 movie summaries from Wikipedia
inputDir="../data/movie_summaries/"
docs=read_docs(inputDir, maxDocs=100)

Here are the 45 tags used by the Penn Treebank:

|tag|meaning|
|---|---|
|CC|Coordinating conjunction|
|CD|Cardinal number|
|DT|Determiner|
|EX|Existential there|
|FW|Foreign word|
|IN|Preposition or subordinating conjunction|
|JJ|Adjective|
|JJR|Adjective, comparative|
|JJS|Adjective, superlative|
|LS|List item marker|
|MD|Modal|
|NN|Noun, singular or mass|
|NNS|Noun, plural|
|NNP|Proper noun, singular|
|NNPS|Proper noun, plural|
|PDT|Predeterminer|
|POS|Possessive ending|
|PRP|Personal pronoun|
|PRP\$|Possessive pronoun|
|RB|Adverb|
|RBR|Adverb, comparative|
|RBS|Adverb, superlative|
|RP|Particle|
|SYM|Symbol|
|TO|to|
|UH|Interjection|
|VB|Verb, base form|
|VBD|Verb, past tense|
|VBG|Verb, gerund or present participle|
|VBN|Verb, past participle|
|VBP|Verb, non-3rd person singular present|
|VBZ|Verb, 3rd person singular present|
|WDT|Wh-determiner|
|WP|Wh-pronoun|
|WP\$|Possessive wh-pronoun|
|WRB|Wh-adverb|
|.|period|
|,|comma|
|:|colon|
|(|left separator|
|)|right separator|
|\$|dollar sign|
|#|pound sign|
|\`\`|open double quotes|
|''|close double quotes|

Explore these tags below by searching the (automatically tagged) movie summary corpus for tokens that carry each tag and viewing them in context.
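
Before searching tag by tag, it may help to see which tags are most frequent in the tagged corpus. The sketch below is one way to do that. `count_tags` is a hypothetical helper (not part of spaCy), and the toy tokens are stand-ins so the block runs without a model; it assumes the same list of `(filename, doc)` pairs that `read_docs` returns.

```python
from collections import Counter
from types import SimpleNamespace


def count_tags(docs):
    """Count how often each tag occurs across a list of (filename, doc) pairs."""
    counts = Counter()
    for _, doc in docs:
        # each token only needs a .tag_ attribute, as spaCy tokens have
        counts.update(token.tag_ for token in doc)
    return counts


# Stand-in tokens invented for illustration; with the real corpus,
# pass the `docs` list built by read_docs instead.
toy_doc = [
    SimpleNamespace(text=text, tag_=tag)
    for text, tag in [("the", "DT"), ("dog", "NN"), ("and", "CC"),
                      ("the", "DT"), ("cat", "NN")]
]
toy_docs = [("toy.txt", toy_doc)]

# print tags from most to least frequent
for tag, n in count_tags(toy_docs).most_common():
    print(tag, n)
```

Rare tags (for example `FW` or `LS`) may not appear at all in a small sample, in which case searching for them will print nothing.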

In [None]:
def find_examples(docs, tag, num_examples=10, window=5):
    """ Print up to num_examples tokens tagged with `tag`, each shown with
    `window` tokens of context on either side """

    count=0
    for _, doc in docs:
        # skip the first and last `window` tokens so every match has full context
        for idx, token in enumerate(doc[window:-window]):
            if token.tag_ == tag:
                # idx indexes the slice, so the matched token is doc[idx+window]
                left=' '.join(context.text for context in doc[idx:idx+window])
                right=' '.join(context.text for context in doc[idx+window+1:idx+window+window+1])
                # highlight the matched token in red using ANSI escape codes
                print(left, "\033[91m%s\033[0m" % token.text, right)
                # for Windows users - you may want to use the following print statement
                # to highlight the middle token in each sentence using #s
                # print(left, "#%s#" % token.text, right)
                count+=1
                if count >= num_examples:
                    return

In [None]:
find_examples(docs, "CC", num_examples=10, window=5)

Q1: What's the difference between the following?

* PRP and PRP\$
* NN and NNP
* JJ and JJR
* VBZ and VB

Q2: Use the `find_examples` function to help understand the usage of each part-of-speech tag; after doing so, manually tag the following four sentences (if you're doing this in class, you can work with a partner!)

1. "Open the pod bay doors, Hal"

2. "Frankly, my dear, I don't give a damn"

3. "May the Force be with you"

4. "One morning I shot an elephant in my pajamas. How he got in my pajamas, I don't know"

Q3: After tagging the sentences above by hand, run them through the spaCy tagger. What is spaCy's accuracy on these sentences?
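
One common way to score Q3 is token-level accuracy: the number of tokens where spaCy's tag matches your hand tag, divided by the total number of tokens. The sketch below assumes that definition. `tag_accuracy` is a hypothetical helper, and the tag lists are invented for illustration; they are not the answers for any of the four sentences.

```python
def tag_accuracy(gold_tags, predicted_tags):
    """Return the fraction of positions where the two tag sequences agree."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must have the same length")
    if not gold_tags:
        raise ValueError("tag sequences must not be empty")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


# Hypothetical hand tags for a made-up sentence ("the dog barks")
gold = ["DT", "NN", "VBZ"]
# Hypothetical tagger output with one error, invented for illustration
predicted = ["DT", "NN", "NNS"]

print(tag_accuracy(gold, predicted))
```

To compare against spaCy, collect its tags with `[token.tag_ for token in nlp(text)]`. The two sequences only line up if your hand tagging uses the same tokenization as spaCy, which splits "don't" into "do" and "n't" and treats punctuation marks as separate tokens.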