In this notebook, you'll explore part-of-speech (POS) tagging using the Penn Treebank tagset, along with the performance of POS tagging in spaCy.

In [1]:
import spacy, glob, os

In [2]:
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])
nlp.remove_pipe('ner')
nlp.remove_pipe('parser')

('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1666434c0>)

In [3]:
def get_spacy_tags(text):
    doc=nlp(text)
    for word in doc:
        print(word.text, word.tag_)

get_spacy_tags("Open the pod bay doors Hal")

Open VB
the DT
pod NNP
bay NNP
doors NNS
Hal NNP


In [4]:
def read_docs(inputDir, maxDocs=100):
    """ Read in movie documents (all ending in .txt) from an input folder
    and process with spacy """
    
    docs=[]
    for idx, filename in enumerate(glob.glob(os.path.join(inputDir, '*.txt'))):
        # stop once maxDocs documents have been read
        if idx >= maxDocs:
            break
        with open(filename, encoding='utf-8') as file:
            docs.append((filename, nlp(file.read())))
    return docs
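
The document cap in `read_docs` can also be written without index bookkeeping. The following is a minimal sketch, assuming a hypothetical helper `list_txt_files` (not part of the notebook) that only selects the file paths. Sorting is an added assumption to make the selection reproducible, since `glob.glob` returns files in arbitrary order.

```python
import glob
import os
from itertools import islice


def list_txt_files(input_dir, max_docs=100):
    """Return at most max_docs paths ending in .txt, in sorted order.

    Hypothetical helper for illustration; read_docs would then open
    and tag each returned path.
    """
    paths = sorted(glob.glob(os.path.join(input_dir, "*.txt")))
    # islice stops after max_docs items, so at most max_docs paths
    # are kept no matter how many files the folder holds.
    return list(islice(paths, max_docs))
```

Selecting the paths first keeps the cap separate from the (slow) tagging step, which makes the limit easy to check on its own.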

In [5]:
# directory with 2000 movie summaries from Wikipedia
inputDir="../data/movie_summaries/"
docs=read_docs(inputDir, maxDocs=100)

Here are the 45 tags used by the Penn Treebank:

|tag|meaning|
|---|---|
|CC|Coordinating conjunction|
|CD|Cardinal number|
|DT|Determiner|
|EX|Existential there|
|FW|Foreign word|
|IN|Preposition or subordinating conjunction|
|JJ|Adjective|
|JJR|Adjective, comparative|
|JJS|Adjective, superlative|
|LS|List item marker|
|MD|Modal|
|NN|Noun, singular or mass|
|NNS|Noun, plural|
|NNP|Proper noun, singular|
|NNPS|Proper noun, plural|
|PDT|Predeterminer|
|POS|Possessive ending|
|PRP|Personal pronoun|
|PRP\$|Possessive pronoun|
|RB|Adverb|
|RBR|Adverb, comparative|
|RBS|Adverb, superlative|
|RP|Particle|
|SYM|Symbol|
|TO|to|
|UH|Interjection|
|VB|Verb, base form|
|VBD|Verb, past tense|
|VBG|Verb, gerund or present participle|
|VBN|Verb, past participle|
|VBP|Verb, non-3rd person singular present|
|VBZ|Verb, 3rd person singular present|
|WDT|Wh-determiner|
|WP|Wh-pronoun|
|WP\$|Possessive wh-pronoun|
|WRB|Wh-adverb|
|.|period|
|,|comma|
|:|colon|
|(|left separator|
|)|right separator|
|\$|dollar sign|
|#|pound sign|
|\`\`|open double quotes|
|''|close double quotes|

Explore these tags below by searching the (automatically tagged) movie summary corpus for tokens that carry each one.

In [6]:
def find_examples(docs, tag, num_examples=10, window=5):
    count=0
    for _, doc in docs:
        for idx, token in enumerate(doc[window:-window]):
            if token.tag_ == tag:
                left = ' '.join(context.text for context in doc[idx:idx+window])
                right = ' '.join(context.text for context in doc[idx+window+1:idx+window+window+1])
                # highlight the middle token in red using ANSI escape codes
                print(left, "\033[91m%s\033[0m" % token.text, right)
                # for windows users - you may want to use the following print statement
                # to highlight the middle token in each sentence using #s
                # print(left, "#%s#" % token.text, right)
                count+=1
                if count >= num_examples:
                    return
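
The index arithmetic in `find_examples` is easier to check on a plain list. The following is a small sketch, assuming a hypothetical `context_windows` helper that works on `(word, tag)` pairs rather than spaCy tokens. Because the loop runs over `tagged[window:-window]`, position `idx` in the slice corresponds to position `idx + window` in the full list, so every match has `window` tokens available on each side.

```python
def context_windows(tagged, tag, window=2):
    """Yield (left, word, right) for each token carrying `tag`.

    `tagged` is a list of (word, tag) pairs; a hypothetical stand-in
    for a spaCy Doc, used only to illustrate the slicing.
    """
    for idx, (word, t) in enumerate(tagged[window:-window]):
        if t == tag:
            center = idx + window  # position in the full list
            left = [w for w, _ in tagged[center - window:center]]
            right = [w for w, _ in tagged[center + 1:center + window + 1]]
            yield left, word, right


# Tags taken from the spaCy output for this sentence shown earlier.
sentence = [("Open", "VB"), ("the", "DT"), ("pod", "NNP"),
            ("bay", "NNP"), ("doors", "NNS"), ("Hal", "NNP")]

matches = list(context_windows(sentence, "NNP", window=2))
for left, word, right in matches:
    print(" ".join(left), "#%s#" % word, " ".join(right))
```

Tokens closer than `window` positions to either end of the text are never reported; here `Hal` is tagged `NNP` but is skipped because it is the last token.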

In [7]:
find_examples(docs, "CC", num_examples=10, window=5)

intelligent , he is immature and lacks respect for classmates and
and lacks respect for classmates and adults alike . Frightened about
than an hour later , and eventually decide the crime is
unimportant because nothing was taken and the burglar escaped completely unharmed
, treat Furious with disrespect and contempt . The following day
half - brother Ricky , and Chris . Doughboy and Ricky
, and Chris . Doughboy and Ricky live with their mother
. While Ricky is naïve and trusting , Doughboy is aggressive
trusting , Doughboy is aggressive and street - smart . He
Ricky 's stolen football , but is beaten up . The


What's the difference between the following?

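* PRP and PRP\$: PRP is the tag for personal pronouns (`he`, `him`, `they`); PRP\$ is the tag for possessive pronouns (`his`, `her`, `their`).
* NN and NNP: NN is a singular or mass common noun (`teacher`, `school`); NNP is a singular proper noun (`Reva`, `Inglewood`).
* JJ and JJR: JJ is the base form of an adjective; JJR is the comparative form.
* VBZ and VB: VB is the base form of a verb; VBZ is the 3rd person singular present form.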
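
Each pair above shares a prefix, so a fine-grained tag can be collapsed to a coarser class by prefix matching. The sketch below assumes a hypothetical `coarse_class` helper and a hand-picked prefix list; it is an illustration only, not part of the Penn Treebank definition or the spaCy API.

```python
# Prefixes are checked in order; hand-picked for illustration only.
COARSE_PREFIXES = [
    ("NN", "noun"),        # NN, NNS, NNP, NNPS
    ("VB", "verb"),        # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "adjective"),   # JJ, JJR, JJS
    ("RB", "adverb"),      # RB, RBR, RBS
    ("PRP", "pronoun"),    # personal and possessive pronouns
]


def coarse_class(tag):
    """Map a Penn Treebank tag to a coarse class, or 'other'."""
    for prefix, label in COARSE_PREFIXES:
        if tag.startswith(prefix):
            return label
    return "other"


for tag in ["PRP", "PRP$", "NN", "NNP", "JJ", "JJR", "VB", "VBZ", "DT"]:
    print(tag, "->", coarse_class(tag))
```

The suffix carries the distinction the list above describes: the coarse class says what kind of word it is, and the remaining characters say which form.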

In [8]:
find_examples(docs, "PRP", num_examples=10, window=5)

Tre is rather intelligent , he is immature and lacks respect
her child , Reva sends him to live in the Crenshaw
Furious Styles , from whom she hopes Tre will learn life
of Tre 's arrival , he hears his father firing at
and street - smart . He soon gets into a fight
The ball is returned to him later by a Crips gang
a fishing trip , where they talk , and he asks
where they talk , and he asks him about sexual nature
talk , and he asks him about sexual nature and discusses
the responsibility of fatherhood to him . The pair return to


In [9]:
find_examples(docs, "PRP$", num_examples=10, window=5)

Tre Styles   lives with his single mother Reva Devereaux  
a fight at school , his teacher calls Reva . The
Frightened about the future of her child , Reva sends him
South Central Los Angeles with his 27 - year - old
's arrival , he hears his father firing at a burglar
" Doughboy " Baker , his maternal half - brother Ricky
Doughboy and Ricky live with their mother across the street from
, lives at home with his mother Brenda , girlfriend Shanice
, girlfriend Shanice , and his newborn son . After the
walks home with leftovers for his father . As he walks


In [10]:
find_examples(docs, "NN", num_examples=10, window=5)

In 1984 , ten - year - old Tre Styles  
  lives with his single mother Reva Devereaux   in Inglewood
Tre gets involved in a fight at school , his teacher
involved in a fight at school , his teacher calls Reva
fight at school , his teacher calls Reva . The teacher
teacher calls Reva . The teacher informs Reva that although Tre
he is immature and lacks respect for classmates and adults alike
alike . Frightened about the future of her child , Reva
about the future of her child , Reva sends him to
to live in the Crenshaw neighborhood of South Central Los Angeles


In [11]:
find_examples(docs, "NNP", num_examples=10, window=5)

ten - year - old Tre Styles   lives with his
lives with his single mother Reva Devereaux   in Inglewood ,
with his single mother Reva Devereaux   in Inglewood , California
mother Reva Devereaux   in Inglewood , California . After Tre
Devereaux   in Inglewood , California . After Tre gets involved
Inglewood , California . After Tre gets involved in a fight
school , his teacher calls Reva . The teacher informs Reva
Reva . The teacher informs Reva that although Tre is rather
teacher informs Reva that although Tre is rather intelligent , he
future of her child , Reva sends him to live in


Q2: Use the `find_examples` function to help understand the usage of each part-of-speech tag; after doing so, manually tag the following four sentences (if you're doing this in class, you can work with a partner!)

1. "Open the pod bay doors, Hal"

Open VB <br>
the DT <br>
pod NN <br>
bay NN <br>
doors NNS <br>
, comma <br>
Hal NNP <br>

2. "Frankly, my dear, I don't give a damn"

Frankly RB <br>
, comma <br>
my PRP$ <br>
dear NN <br>
, comma <br>
I PRP <br>
do VBP <br>
n't RB <br>
give VBP <br>
a DT <br>
damn NN <br>

3. "May the Force be with you"

May MD <br>
the DT <br>
Force NNP <br>
be VBP <br>
with IN <br>
you PRP <br>

4. "One morning I shot an elephant in my pajamas. How he got in my pajamas, I don't know"

One CD <br>
morning NN <br>
I PRP <br>
shot VBD <br>
an DT <br>
elephant NN <br>
in IN <br>
my PRP$ <br>
pajamas NNS <br>
. period <br>

How WRB <br>
he PRP <br>
got VBD <br>
in IN <br>
my PRP$ <br>
pajamas NNS <br>
, comma <br>
I PRP <br>
do VBP <br>
n't RB <br>
know VBP <br>

Q3: After tagging the sentences above by hand, run them through the spaCy tagger; what's spaCy's accuracy on these sentences?

Overall accuracy is quite high: spaCy agrees with our hand tags on 40 of the 45 tokens (about 89%). The disagreements are `pod` and `bay` in the first sentence and the verbs `give`, `be`, and `know`, which we tagged `VBP` and spaCy tagged `VB`. I'm a bit surprised that the Penn Treebank tagset has no dedicated tag for auxiliary verbs; a verb like `do` is tagged the same way as any other verb, with one of the `VB*` tags. The closest alternative, in my opinion, might be `MD` (modal). The spaCy tagger also misses the noun-noun compound in `pod bay doors`, which it tags as `NNP NNP NNS`: both `pod` and `bay` come out as singular proper nouns instead of common nouns. One thing my classmate and I debated is verbs in the imperative mood, such as `Open` in the first sentence. We think it should be tagged `VB` (verb, base form) rather than `VBP` (verb, non-3rd person singular present). One place where we do agree with the spaCy tagger is `Force` in `May the Force be with you`, which is better tagged `NNP` (proper noun) than `NN`.
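
The comparison can be made concrete by counting token-level agreement. The following is a minimal sketch, assuming the hand tags and the spaCy tags are transcribed from this notebook as space-separated strings, with punctuation written as the tags `,` and `.` in both. The `agreement` helper is hypothetical.

```python
def agreement(gold, predicted):
    """Return (matches, total) for two equal-length tag sequences."""
    if len(gold) != len(predicted):
        raise ValueError("tag sequences differ in length")
    matches = sum(g == p for g, p in zip(gold, predicted))
    return matches, len(gold)


# One string per sentence, tags in token order.
hand = [
    "VB DT NN NN NNS , NNP",
    "RB , PRP$ NN , PRP VBP RB VBP DT NN",
    "MD DT NNP VBP IN PRP",
    "CD NN PRP VBD DT NN IN PRP$ NNS . WRB PRP VBD IN PRP$ NNS , PRP VBP RB VBP",
]
spacy_tags = [
    "VB DT NNP NNP NNS , NNP",
    "RB , PRP$ NN , PRP VBP RB VB DT NN",
    "MD DT NNP VB IN PRP",
    "CD NN PRP VBD DT NN IN PRP$ NNS . WRB PRP VBD IN PRP$ NNS , PRP VBP RB VB",
]

gold = " ".join(hand).split()
pred = " ".join(spacy_tags).split()
matches, total = agreement(gold, pred)
print("%d/%d = %.1f%%" % (matches, total, 100 * matches / total))
```

This measures agreement with the hand annotation, so the figure is only as reliable as the hand tags themselves.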

In [12]:
sent = ["Open the pod bay doors, Hal",
        "Frankly, my dear, I don't give a damn",
        "May the Force be with you",
        "One morning I shot an elephant in my pajamas. How he got in my pajamas, I don't know"]

for s in sent:
    get_spacy_tags(s)
    print("\n")

Open VB
the DT
pod NNP
bay NNP
doors NNS
, ,
Hal NNP


Frankly RB
, ,
my PRP$
dear NN
, ,
I PRP
do VBP
n't RB
give VB
a DT
damn NN


May MD
the DT
Force NNP
be VB
with IN
you PRP


One CD
morning NN
I PRP
shot VBD
an DT
elephant NN
in IN
my PRP$
pajamas NNS
. .
How WRB
he PRP
got VBD
in IN
my PRP$
pajamas NNS
, ,
I PRP
do VBP
n't RB
know VB


