# A Brief Introduction to POS Tagging

Identifying the part of speech associated with a particular word is complex, even for humans. Let's talk through how a computer would go about doing this. We have to get really basic. 

In Python there are a few base level units of data. There are more, but to begin let's just look at two:

* Integers
* Strings

In [73]:
# an integer is a whole number that you can do number-like things to.
4 + 4

8

In [74]:
our_int = 4
our_int + 15

19

In contrast, a string is something we can do word-like things to:

In [75]:
"four".upper()

'FOUR'

In [76]:
our_string = "four"

our_string + our_string
# what happened here?

'fourfour'

Numbers can do numerical things, and strings (word-type bits) can do word-type things. You can, of course, go way deeper into Python's data types, but here is one more example of something you can do with strings (word-type data):

In [77]:
# we can print each letter out - a string is made up of its constituent pieces.
for letter in our_string:
    print(letter)

f
o
u
r


In [78]:
# is 4 equal to four?
our_int == our_string

False

Python is very literal: the number four is not equal to the word four. Similarly, we can see that a word is not equal to a list of its individual letters (we could combine those letters and get a different result).

In [79]:
['f','o','u','r'] == 'four'

False
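
To make the parenthetical above concrete, here is a minimal sketch of "combining those letters" and of converting between integers and strings. It only uses built-in Python; the variable names are made up for illustration.

```python
letters = ['f', 'o', 'u', 'r']

# A list of letters is a different kind of thing from a string...
print(letters == 'four')

# ...but joining the letters with an empty separator rebuilds the word.
rebuilt = ''.join(letters)
print(rebuilt == 'four')

# Likewise, an integer and a string only compare equal after one is converted.
as_text = str(4)
as_number = int('4')
print(as_text == '4', as_number == 4)
```

The comparison only succeeds once both sides are the same type of data, which is the same literalness we saw with `our_int == our_string`.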

Given how literal Python is about these basic elements, how can it do more complicated things, like recognize the part of speech of a word? Or of every word in a poem? We don't have to work from scratch: people build on the work of others. We can test out a basic part-of-speech tagger, but in order to do so we have to feed it a series of words (a list) rather than a single word.

In [80]:
import nltk
nltk.pos_tag(["A", "sentence", "made", "of", "words"])

[('A', 'DT'),
 ('sentence', 'NN'),
 ('made', 'VBN'),
 ('of', 'IN'),
 ('words', 'NNS')]
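
The two-letter codes are Penn Treebank-style tags. As a rough reading aid, the sketch below pairs each word with a plain-English gloss. The glossary is written by hand for just the five tags that appear above; it is not something NLTK provides in this form, and `describe` is a hypothetical helper.

```python
# A hand-written glossary covering only the tags in the output above.
TAG_GLOSSARY = {
    'DT': 'determiner',
    'NN': 'noun, singular or mass',
    'NNS': 'noun, plural',
    'VBN': 'verb, past participle',
    'IN': 'preposition or subordinating conjunction',
}

# The tagger output from the previous cell, copied in by hand.
tagged = [('A', 'DT'), ('sentence', 'NN'), ('made', 'VBN'),
          ('of', 'IN'), ('words', 'NNS')]


def describe(tagged_words, glossary):
    """Pair each word with a plain-English gloss of its tag."""
    return [(word, glossary.get(tag, 'unknown tag')) for word, tag in tagged_words]


for word, gloss in describe(tagged, TAG_GLOSSARY):
    print(f'{word}: {gloss}')
```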

But how did that work? Time for…

## Pause for a Brief interlude on Nicholas Sparks

Given these basic building blocks, let's take a poem and try to work out how we would tag it, converting it into a readout of parts of speech. We'll be about as hand-wave-y as you can possibly be here, only gesturing at what the code does on a macro level. Let's run it on "Belief" by Josephine Miles: https://www.poetryfoundation.org/poems/46817/belief

In [81]:
# Take a poem (exists at belief.txt) and read it in

filename = "belief.txt"

with open(filename, 'r') as filein:
    text = filein.read()

# by default this is going to be the whole text as one long string (line breaks
# are represented by a \n character)

print(text)

Mother said to call her if the H-bomb exploded
And I said I would, and it about did
When Louis my brother robbed a service station
And lay cursing on the oily cement in handcuffs.

But by that time it was too late to tell Mother,
She was too sick to worry the life out of her
Over why why. Causation is sequence
And everything is one thing after another.

Besides, my other brother, Eddie, had got to be President,
And you can't ask too much of one family.
The chances were as good for a good future
As bad for a bad one.

Therefore it was surprising that, as we kept the newspapers from Mother,
She died feeling responsible for a disaster unverified,
Murmuring, in her sleep as it seemed, the ancient slogan
Noblesse oblige.



In [82]:
# tag the poem!

import nltk
nltk.pos_tag(text)

[('M', 'NNP'),
 ('o', 'MD'),
 ('t', 'VB'),
 ('h', 'NN'),
 ('e', 'NN'),
 ('r', 'NN'),
 (' ', 'NNP'),
 ('s', 'VBZ'),
 ('a', 'DT'),
 ('i', 'JJ'),
 ('d', 'NN'),
 (' ', 'NNP'),
 ('t', 'NN'),
 ('o', 'NN'),
 (' ', 'NNP'),
 ('c', 'VBZ'),
 ('a', 'DT'),
 ('l', 'NN'),
 ('l', 'NN'),
 (' ', 'NNP'),
 ('h', 'NN'),
 ('e', 'NN'),
 ('r', 'NN'),
 (' ', 'NN'),
 ('i', 'NN'),
 ('f', 'VBP'),
 (' ', 'JJ'),
 ('t', 'NN'),
 ('h', 'NN'),
 ('e', 'NN'),
 (' ', 'NNP'),
 ('H', 'NNP'),
 ('-', ':'),
 ('b', 'NN'),
 ('o', 'NN'),
 ('m', 'NN'),
 ('b', 'NN'),
 (' ', 'NNP'),
 ('e', 'NN'),
 ('x', 'NNP'),
 ('p', 'NN'),
 ('l', 'NN'),
 ('o', 'NN'),
 ('d', 'NN'),
 ('e', 'NN'),
 ('d', 'NN'),
 ('\n', 'VBZ'),
 ('A', 'DT'),
 ('n', 'JJ'),
 ('d', 'NN'),
 (' ', 'NN'),
 ('I', 'PRP'),
 (' ', 'VBP'),
 ('s', 'PDT'),
 ('a', 'DT'),
 ('i', 'JJ'),
 ('d', 'NN'),
 (' ', 'NN'),
 ('I', 'PRP'),
 (' ', 'VBP'),
 ('w', 'JJ'),
 ('o', 'NN'),
 ('u', 'JJ'),
 ('l', 'NN'),
 ('d', 'NN'),
 (',', ','),
 (' ', 'VB'),
 ('a', 'DT'),
 ('n', 'JJ'),
 ('d', 'NN'),
 ('

Oops, that didn't work. Remember that the POS tagger we're using requires a list of words, and our file was read in as one long string. When Python steps through a string it goes character by character; it doesn't know what a "word" is. So we have to break that poem into words.
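
A quick sketch of the difference, using only built-in string methods on one line of the poem. Here `str.split()` is a crude stand-in for a real tokenizer; it is included only to show why a dedicated tokenizer is worth using.

```python
line = "And I said I would, and it about did"

# Stepping through a string yields characters, which is what the tagger saw.
first_items = list(line)[:5]
print(first_items)

# Splitting on whitespace yields word-like pieces instead...
rough_words = line.split()
print(rough_words)

# ...but punctuation stays glued to its neighbor: 'would,' is a single item.
print('would,' in rough_words)
```

A tokenizer such as `nltk.word_tokenize` goes one step further and separates the comma into its own token.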

In [93]:
# break a text into a series of words

words = nltk.word_tokenize(text)
    
words

['Mother',
 'said',
 'to',
 'call',
 'her',
 'if',
 'the',
 'H-bomb',
 'exploded',
 'And',
 'I',
 'said',
 'I',
 'would',
 ',',
 'and',
 'it',
 'about',
 'did',
 'When',
 'Louis',
 'my',
 'brother',
 'robbed',
 'a',
 'service',
 'station',
 'And',
 'lay',
 'cursing',
 'on',
 'the',
 'oily',
 'cement',
 'in',
 'handcuffs',
 '.',
 'But',
 'by',
 'that',
 'time',
 'it',
 'was',
 'too',
 'late',
 'to',
 'tell',
 'Mother',
 ',',
 'She',
 'was',
 'too',
 'sick',
 'to',
 'worry',
 'the',
 'life',
 'out',
 'of',
 'her',
 'Over',
 'why',
 'why',
 '.',
 'Causation',
 'is',
 'sequence',
 'And',
 'everything',
 'is',
 'one',
 'thing',
 'after',
 'another',
 '.',
 'Besides',
 ',',
 'my',
 'other',
 'brother',
 ',',
 'Eddie',
 ',',
 'had',
 'got',
 'to',
 'be',
 'President',
 ',',
 'And',
 'you',
 'ca',
 "n't",
 'ask',
 'too',
 'much',
 'of',
 'one',
 'family',
 '.',
 'The',
 'chances',
 'were',
 'as',
 'good',
 'for',
 'a',
 'good',
 'future',
 'As',
 'bad',
 'for',
 'a',
 'bad',
 'one',
 '.',
 'Th

In [92]:
tag_pairs = nltk.pos_tag(words)

just_tags = []
for tag_pair in tag_pairs:
    just_tags.append(tag_pair[1])
    
' '.join(just_tags)

'NNP VBD TO VB PRP IN DT NNP VBD CC PRP VBD PRP MD , CC PRP RB VBD WRB NNP PRP$ NN VBD DT NN NN CC VBD VBG IN DT JJ NN IN NNS . CC IN DT NN PRP VBD RB JJ TO VB NNP , PRP VBD RB JJ TO VB DT NN IN IN PRP$ NNP WRB WRB . NN VBZ JJ CC NN VBZ CD NN IN DT . IN , PRP$ JJ NN , NNP , VBD VBN TO VB NNP , CC PRP MD RB VB RB JJ IN CD NN . DT NNS VBD RB JJ IN DT JJ NN IN JJ IN DT JJ CD . VB PRP VBD VBG RB , IN PRP VBD DT NNS IN NNP , PRP VBD VBG JJ IN DT NN JJ , NNP , IN PRP$ NN IN PRP VBD , DT NN JJ NNP NN .'

But that is a little unhelpful, because it takes the tags and combines them into one long line of text. This is poetry, and we want to respect the lines.
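
As a small sketch of what "respecting the lines" means in practice, the built-in `str.splitlines()` breaks a string at its line breaks and keeps blank lines as empty strings, so stanza breaks survive. The excerpt is typed in by hand and the variable names are illustrative.

```python
excerpt = (
    "Over why why. Causation is sequence\n"
    "And everything is one thing after another.\n"
    "\n"
    "Besides, my other brother, Eddie, had got to be President,\n"
)

# splitlines() breaks at line boundaries and drops the newline characters.
lines = excerpt.splitlines()

for number, line in enumerate(lines, start=1):
    # An empty string marks a blank line, i.e. a stanza break.
    label = '(stanza break)' if line == '' else line
    print(number, label)
```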

In [84]:
with open(filename, 'r') as filein:
    lines = filein.readlines()

tokenized_lines = []
for line in lines:
    words = nltk.word_tokenize(line)
    tokenized_lines.append(words)

tokenized_lines

[['Mother', 'said', 'to', 'call', 'her', 'if', 'the', 'H-bomb', 'exploded'],
 ['And', 'I', 'said', 'I', 'would', ',', 'and', 'it', 'about', 'did'],
 ['When', 'Louis', 'my', 'brother', 'robbed', 'a', 'service', 'station'],
 ['And',
  'lay',
  'cursing',
  'on',
  'the',
  'oily',
  'cement',
  'in',
  'handcuffs',
  '.'],
 [],
 ['But',
  'by',
  'that',
  'time',
  'it',
  'was',
  'too',
  'late',
  'to',
  'tell',
  'Mother',
  ','],
 ['She',
  'was',
  'too',
  'sick',
  'to',
  'worry',
  'the',
  'life',
  'out',
  'of',
  'her'],
 ['Over', 'why', 'why', '.', 'Causation', 'is', 'sequence'],
 ['And', 'everything', 'is', 'one', 'thing', 'after', 'another', '.'],
 [],
 ['Besides',
  ',',
  'my',
  'other',
  'brother',
  ',',
  'Eddie',
  ',',
  'had',
  'got',
  'to',
  'be',
  'President',
  ','],
 ['And', 'you', 'ca', "n't", 'ask', 'too', 'much', 'of', 'one', 'family', '.'],
 ['The', 'chances', 'were', 'as', 'good', 'for', 'a', 'good', 'future'],
 ['As', 'bad', 'for', 'a', 'bad', '

In [85]:
# now that we have the lines as a series of words (or tokens) - let's go through and tag them

tagged_lines = []
for line in tokenized_lines:
    tagged_lines.append(nltk.pos_tag(line))

just_tags_for_lines = []

for line in tagged_lines:
    this_line = []
    for tag_pair in line:
        this_line.append(tag_pair[1])
    just_tags_for_lines.append(this_line)

just_tags_for_lines

[['NNP', 'VBD', 'TO', 'VB', 'PRP', 'IN', 'DT', 'NNP', 'VBD'],
 ['CC', 'PRP', 'VBD', 'PRP', 'MD', ',', 'CC', 'PRP', 'RB', 'VBD'],
 ['WRB', 'NNP', 'PRP$', 'NN', 'VBD', 'DT', 'NN', 'NN'],
 ['CC', 'VB', 'VBG', 'IN', 'DT', 'JJ', 'NN', 'IN', 'NNS', '.'],
 [],
 ['CC', 'IN', 'DT', 'NN', 'PRP', 'VBD', 'RB', 'JJ', 'TO', 'VB', 'NNP', ','],
 ['PRP', 'VBD', 'RB', 'JJ', 'TO', 'VB', 'DT', 'NN', 'IN', 'IN', 'PRP$'],
 ['IN', 'WRB', 'WRB', '.', 'NN', 'VBZ', 'NN'],
 ['CC', 'NN', 'VBZ', 'CD', 'NN', 'IN', 'DT', '.'],
 [],
 ['IN',
  ',',
  'PRP$',
  'JJ',
  'NN',
  ',',
  'NNP',
  ',',
  'VBD',
  'VBN',
  'TO',
  'VB',
  'NNP',
  ','],
 ['CC', 'PRP', 'MD', 'RB', 'VB', 'RB', 'JJ', 'IN', 'CD', 'NN', '.'],
 ['DT', 'NNS', 'VBD', 'RB', 'JJ', 'IN', 'DT', 'JJ', 'NN'],
 ['IN', 'JJ', 'IN', 'DT', 'JJ', 'CD', '.'],
 [],
 ['IN',
  'PRP',
  'VBD',
  'VBG',
  'RB',
  ',',
  'IN',
  'PRP',
  'VBD',
  'DT',
  'NNS',
  'IN',
  'NNP',
  ','],
 ['PRP', 'VBD', 'VBG', 'JJ', 'IN', 'DT', 'NN', 'JJ', ','],
 ['VBG', ',', 'IN', 'PRP

In [86]:
# but that is kind of gross to read, so let's put things back together as a poem,
# without the brackets, commas, etc. that python requires to run

transformed_poem = []
for line in just_tags_for_lines:
    transformed_poem.append(' '.join(line))
    
    
for line in transformed_poem:
    print(line)

NNP VBD TO VB PRP IN DT NNP VBD
CC PRP VBD PRP MD , CC PRP RB VBD
WRB NNP PRP$ NN VBD DT NN NN
CC VB VBG IN DT JJ NN IN NNS .

CC IN DT NN PRP VBD RB JJ TO VB NNP ,
PRP VBD RB JJ TO VB DT NN IN IN PRP$
IN WRB WRB . NN VBZ NN
CC NN VBZ CD NN IN DT .

IN , PRP$ JJ NN , NNP , VBD VBN TO VB NNP ,
CC PRP MD RB VB RB JJ IN CD NN .
DT NNS VBD RB JJ IN DT JJ NN
IN JJ IN DT JJ CD .

IN PRP VBD VBG RB , IN PRP VBD DT NNS IN NNP ,
PRP VBD VBG JJ IN DT NN JJ ,
VBG , IN PRP$ NN IN PRP VBD , DT NN NN
NNP NN .


In [87]:
#  let's make that a function for ease of use:
def nltk_pos_transform(filename):
    """Given a filename, take a poem and transform it into its POS tags"""
    with open(filename, 'r') as filein:
        lines = filein.readlines()

    tokenized_lines = []
    for line in lines:
        words = nltk.word_tokenize(line)
        tokenized_lines.append(words)

    tagged_lines = []
    for line in tokenized_lines:
        tagged_lines.append(nltk.pos_tag(line))

    just_tags_for_lines = []

    for line in tagged_lines:
        this_line = []
        for tag_pair in line:
            this_line.append(tag_pair[1])
        just_tags_for_lines.append(this_line)

    # reconstituting them now
    transformed_poem = []
    for line in just_tags_for_lines:
        transformed_poem.append(' '.join(line))


    for line in transformed_poem:
        print(line)

nltk_pos_transform('belief.txt')


NNP VBD TO VB PRP IN DT NNP VBD
CC PRP VBD PRP MD , CC PRP RB VBD
WRB NNP PRP$ NN VBD DT NN NN
CC VB VBG IN DT JJ NN IN NNS .

CC IN DT NN PRP VBD RB JJ TO VB NNP ,
PRP VBD RB JJ TO VB DT NN IN IN PRP$
IN WRB WRB . NN VBZ NN
CC NN VBZ CD NN IN DT .

IN , PRP$ JJ NN , NNP , VBD VBN TO VB NNP ,
CC PRP MD RB VB RB JJ IN CD NN .
DT NNS VBD RB JJ IN DT JJ NN
IN JJ IN DT JJ CD .

IN PRP VBD VBG RB , IN PRP VBD DT NNS IN NNP ,
PRP VBD VBG JJ IN DT NN JJ ,
VBG , IN PRP$ NN IN PRP VBD , DT NN NN
NNP NN .
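
The function above spells out every step with its own loop. For comparison, here is a more compact sketch of the same pipeline that takes the tokenizer and tagger as arguments. `tags_by_line`, `toy_tokenize` and `toy_tag` are hypothetical names; the toy functions exist only so the sketch runs without NLTK, and they are not real taggers.

```python
def tags_by_line(lines, tokenize, tag):
    """Return one space-joined string of tags per input line.

    `tokenize` turns a line into a list of tokens; `tag` turns a list of
    tokens into (token, tag) pairs, the shape that nltk.pos_tag returns.
    """
    return [' '.join(t for _, t in tag(tokenize(line))) for line in lines]


# Toy stand-ins so the sketch is self-contained.
def toy_tokenize(line):
    return line.split()


def toy_tag(tokens):
    # Capitalized tokens get 'NNP', everything else 'NN'. Purely illustrative.
    return [(tok, 'NNP' if tok[:1].isupper() else 'NN') for tok in tokens]


poem_lines = ['Noblesse oblige', '', 'Mother said']
result = tags_by_line(poem_lines, toy_tokenize, toy_tag)
for line in result:
    print(line)
```

With NLTK installed, passing `nltk.word_tokenize` and `nltk.pos_tag` in place of the toy functions should give the same kind of output as `nltk_pos_transform`.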


In [88]:
import spacy
import en_core_web_sm

# let's do the same thing with spacy

def spacy_pos_transform(filename):
    """Given a filename, take a poem and transform it into its POS tags"""
    nlp = en_core_web_sm.load()
    with open(filename, 'r') as filein:
        lines = filein.readlines()

    spacy_lines = []
    for line in lines:
        this_line = []
        doc = nlp(line)
        for token in doc:
            this_line.append(token.tag_) 
        spacy_lines.append(this_line)
    # reconstituting them now
    transformed_poem = []
    for line in spacy_lines:
        transformed_poem.append(' '.join(line))


    for line in transformed_poem:
        print(line)

spacy_pos_transform('belief.txt')


NN VBD TO VB PRP IN DT NN HYPH NN VBD _SP
CC PRP VBD PRP MD , CC PRP IN VBD _SP
WRB NNP PRP$ NN VBD DT NN NN _SP
CC VB VBG IN DT JJ NN IN NNS . _SP
_SP
CC IN DT NN PRP VBD RB JJ TO VB NNP , _SP
PRP VBD RB JJ TO VB DT NN IN IN PRP _SP
IN WRB WRB . NN VBZ NN _SP
CC NN VBZ CD NN IN DT . _SP
_SP
RB , PRP$ JJ NN , NNP , VBD VBN TO VB NNP , _SP
CC PRP MD RB VB RB JJ IN CD NN . _SP
DT NNS VBD RB JJ IN DT JJ NN _SP
RB JJ IN DT JJ NN . _SP
_SP
RB PRP VBD JJ IN , IN PRP VBD DT NNS IN NNP , _SP
PRP VBD VBG JJ IN DT NN JJ , _SP
VBG , IN PRP$ NN IN PRP VBD , DT JJ NN _SP
NNP NN . _SP
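
Notice the `_SP` at the end of every line: that is the tag spaCy gives to whitespace tokens, here the newline that `readlines()` keeps at the end of each line. A minimal sketch for filtering it out of a line of tags with plain Python (`drop_whitespace_tags` is a hypothetical helper, not part of spaCy):

```python
spacy_line = 'CC VB VBG IN DT JJ NN IN NNS . _SP'


def drop_whitespace_tags(tag_line, whitespace_tag='_SP'):
    """Remove whitespace tags from a space-joined line of tags."""
    return ' '.join(tag for tag in tag_line.split() if tag != whitespace_tag)


cleaned = drop_whitespace_tags(spacy_line)
print(cleaned)

# A line that was only a newline becomes an empty string.
print(repr(drop_whitespace_tags('_SP')))
```

Another option is to strip the newline from each line before handing it to spaCy, so the whitespace token never appears in the first place.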


Let's compare the two against each other:

In [89]:
print('NLTK transform results:')
nltk_pos_transform('belief.txt')
print('=========')
print('Spacy transform results:')
spacy_pos_transform('belief.txt')

NLTK transform results:
NNP VBD TO VB PRP IN DT NNP VBD
CC PRP VBD PRP MD , CC PRP RB VBD
WRB NNP PRP$ NN VBD DT NN NN
CC VB VBG IN DT JJ NN IN NNS .

CC IN DT NN PRP VBD RB JJ TO VB NNP ,
PRP VBD RB JJ TO VB DT NN IN IN PRP$
IN WRB WRB . NN VBZ NN
CC NN VBZ CD NN IN DT .

IN , PRP$ JJ NN , NNP , VBD VBN TO VB NNP ,
CC PRP MD RB VB RB JJ IN CD NN .
DT NNS VBD RB JJ IN DT JJ NN
IN JJ IN DT JJ CD .

IN PRP VBD VBG RB , IN PRP VBD DT NNS IN NNP ,
PRP VBD VBG JJ IN DT NN JJ ,
VBG , IN PRP$ NN IN PRP VBD , DT NN NN
NNP NN .
=========
Spacy transform results:
NN VBD TO VB PRP IN DT NN HYPH NN VBD _SP
CC PRP VBD PRP MD , CC PRP IN VBD _SP
WRB NNP PRP$ NN VBD DT NN NN _SP
CC VB VBG IN DT JJ NN IN NNS . _SP
_SP
CC IN DT NN PRP VBD RB JJ TO VB NNP , _SP
PRP VBD RB JJ TO VB DT NN IN IN PRP _SP
IN WRB WRB . NN VBZ NN _SP
CC NN VBZ CD NN IN DT . _SP
_SP
RB , PRP$ JJ NN , NNP , VBD VBN TO VB NNP , _SP
CC PRP MD RB VB RB JJ IN CD NN . _SP
DT NNS VBD RB JJ IN DT JJ NN _SP
RB JJ IN DT JJ NN . _SP
_SP
RB PRP VBD 

These two outputs can be difficult to compare by eye. Let's make a function that compares them and gives a 1 if the tags are the same or a 0 if they are different. And since you might want to try your own text, let's change things slightly. Rather than read a poem from a file, the following code block will take a long pasted string, so you could paste in your own poem from the web if you'd like!
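
Before running this on the whole poem, here is the core idea sketched on a single line. The two tag lists are the second line of the poem as tagged by NLTK and by spaCy; `compare_tags` is a hypothetical helper, not the function used in the cell below.

```python
def compare_tags(tags_a, tags_b):
    """Return 1 where two tag lists agree position by position, else 0."""
    return [1 if a == b else 0 for a, b in zip(tags_a, tags_b)]


nltk_tags = 'CC PRP VBD PRP MD , CC PRP RB VBD'.split()
spacy_tags = 'CC PRP VBD PRP MD , CC PRP IN VBD'.split()

flags = compare_tags(nltk_tags, spacy_tags)
print(flags)

# Share of positions where the two taggers agree.
agreement = sum(flags) / len(flags)
print(agreement)
```

One caveat of a position-by-position comparison: it assumes both tools split the line into the same tokens. In the first line of the poem NLTK keeps "H-bomb" as one token while spaCy splits it into three, so everything after that point is compared against the wrong neighbor.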



In [90]:
our_text = """
Mother said to call her if the H-bomb exploded
And I said I would, and it about did
When Louis my brother robbed a service station
And lay cursing on the oily cement in handcuffs.

But by that time it was too late to tell Mother,
She was too sick to worry the life out of her
Over why why. Causation is sequence
And everything is one thing after another.

Besides, my other brother, Eddie, had got to be President,
And you can't ask too much of one family.
The chances were as good for a good future
As bad for a bad one.

Therefore it was surprising that, as we kept the newspapers from Mother,
She died feeling responsible for a disaster unverified,
Murmuring, in her sleep as it seemed, the ancient slogan
Noblesse oblige.
"""

def spacy_pos_transform(text):
    """Given a string pasted in, take a poem and transform it into its POS tags"""
    nlp = en_core_web_sm.load()
    lines = text.split('\n')
    spacy_lines = []
    for line in lines:
        this_line = []
        doc = nlp(line)
        for token in doc:
            this_line.append(token.tag_) 
        spacy_lines.append(this_line)
    # reconstituting them now
    transformed_poem = []
    for line in spacy_lines:
        transformed_poem.append(' '.join(line))

    return transformed_poem

def nltk_pos_transform(text):
    """Given a string pasted in, take a poem and transform it into its POS tags"""
    lines = text.split('\n')
    tokenized_lines = []
    for line in lines:
        words = nltk.word_tokenize(line)
        tokenized_lines.append(words)
    tagged_lines = []
    for line in tokenized_lines:
        tagged_lines.append(nltk.pos_tag(line))
    just_tags_for_lines = []

    for line in tagged_lines:
        this_line = []
        for tag_pair in line:
            this_line.append(tag_pair[1])
        just_tags_for_lines.append(this_line)
    # reconstituting them now
    transformed_poem = []
    for line in just_tags_for_lines:
        transformed_poem.append(' '.join(line))
    return transformed_poem

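def binary_poem(spacy_text, nltk_text):
    """Compare two tagged poems position by position: 1 where the tags match, 0 where they differ"""
    comparison = []
    for line_number, line in enumerate(spacy_text):
        this_line = []
        spacy_line = nltk.word_tokenize(line)
        nltk_line = nltk.word_tokenize(nltk_text[line_number])
        # note: [:-1] leaves the last token of each spaCy line out of the comparison
        for num, word in enumerate(spacy_line[:-1]):
            try:
                if word == nltk_line[num]:
                    this_line.append(1)
                else:
                    this_line.append(0)
            except IndexError:
                # the NLTK line ran out of tokens, so there is nothing to compare against
                pass
        comparison.append(this_line)
    return comparison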

spacy_text = spacy_pos_transform(our_text)
nltk_text = nltk_pos_transform(our_text)
# store the result under a new name so the binary_poem function is not overwritten
comparison = binary_poem(spacy_text, nltk_text)

print('NLTK transform results:')
for line in nltk_text:
    print(line)
print('=========')
print('Spacy transform results:')
for line in spacy_text:
    print(line)
print('Comparison - zero is where they do not tag the same')
for line in comparison:
    line = [str(item) for item in line]
    print(' '.join(line))

NLTK transform results:

NNP VBD TO VB PRP IN DT NNP VBD
CC PRP VBD PRP MD , CC PRP RB VBD
WRB NNP PRP$ NN VBD DT NN NN
CC VB VBG IN DT JJ NN IN NNS .

CC IN DT NN PRP VBD RB JJ TO VB NNP ,
PRP VBD RB JJ TO VB DT NN IN IN PRP$
IN WRB WRB . NN VBZ NN
CC NN VBZ CD NN IN DT .

IN , PRP$ JJ NN , NNP , VBD VBN TO VB NNP ,
CC PRP MD RB VB RB JJ IN CD NN .
DT NNS VBD RB JJ IN DT JJ NN
IN JJ IN DT JJ CD .

IN PRP VBD VBG RB , IN PRP VBD DT NNS IN NNP ,
PRP VBD VBG JJ IN DT NN JJ ,
VBG , IN PRP$ NN IN PRP VBD , DT NN NN
NNP NN .

=========
Spacy transform results:

NN VBD TO VB PRP IN DT NN HYPH NN VBD
CC PRP VBD PRP MD , CC PRP IN VBD
WRB NNP PRP$ NN VBD DT NN NN
CC VB VBG IN DT JJ NN IN NNS .

CC IN DT NN PRP VBD RB JJ TO VB NNP ,
PRP VBD RB JJ TO VB DT NN IN IN PRP
IN WRB WRB . NN VBZ NN
CC NN VBZ CD NN IN DT .

RB , PRP$ JJ NN , NNP , VBD VBN TO VB NNP ,
CC PRP MD RB VB RB JJ IN CD NN .
DT NNS VBD RB JJ IN DT JJ NN
RB JJ IN DT JJ NN .

RB PRP VBD JJ IN , IN PRP VBD DT NNS IN NNP ,
PRP VBD VBG JJ IN D

If you're interested in digging deeper into how each of these tools decides on a part of speech:

* NLTK's default tagger is trained on a Wall Street Journal corpus: https://stackoverflow.com/questions/32016545/how-does-nltk-pos-tag-work/41384824#:~:text=This%20basically%20means%20that%20it,not%20the%20guess%20was%20correct. Under the hood it uses an averaged perceptron, which averages the weights it learns over the course of training.
* More information on POS tagging systems (this link covers the Universal Dependencies tag set, a coarser alternative to the Penn Treebank-style tags shown above): https://universaldependencies.org/docs/u/pos/
* spaCy uses OntoNotes Release 5.0, the final release of the OntoNotes project, a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania and the University of Southern California's Information Sciences Institute. The goal of the project was to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).



Discussion Questions:

* How could you imagine context playing a role here?
* What are some other literary applications for POS tagging?
* What about applications to supervised learning problems?
* What other kinds of research questions are available here?
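
As one possible starting point for these questions, here is a small sketch that counts how often each tag appears in the first stanza. The tag lines are copied by hand from the NLTK output above, and tag frequency is only one of many things you could measure.

```python
from collections import Counter

# The first stanza's NLTK tags, copied from the output above.
tag_lines = [
    'NNP VBD TO VB PRP IN DT NNP VBD',
    'CC PRP VBD PRP MD , CC PRP RB VBD',
    'WRB NNP PRP$ NN VBD DT NN NN',
    'CC VB VBG IN DT JJ NN IN NNS .',
]

# Flatten the stanza into individual tags and count them.
tag_counts = Counter(tag for line in tag_lines for tag in line.split())

for tag, count in tag_counts.most_common(3):
    print(tag, count)
```

Running the same count on different poems, or on the spaCy tags instead, would let you compare how verb-heavy or noun-heavy each text looks to each tagger.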

