# Pos, Stem, and Ngrams: Oh My!
<img src="images/churchill2.jpg" height="200" width="200" align="left">
<center>
<h3>In This Worksheet</h3> We will parse a new text using the methods described up to this point and introduce some new ways to characterize words and sentences in texts.
<h3>The Data</h3> <strong>Speech, Blood Toil Tears and Sweat</strong><br><i>Sir Winston Churchill, May 13th 1940</i><br>
This is Churchill's first speech as the prime minister of Great Britain.<br>
https://www.youtube.com/watch?v=8TlkN-dcDCk
</center>

First, let's do some imports and then load the Churchill data set we created in the last portion, using pandas's `read_csv` function.

In [6]:
import nltk
import pandas as pd
from nltk.corpus import stopwords
import string

In [7]:
fp = 'data/parsed_churchill_blood.csv'
speech = pd.read_csv(fp, index_col=0)
speech.head()

Unnamed: 0,sent_id,token,is_stop,is_punct
0,0,on,True,False
1,0,friday,False,False
2,0,evening,False,False
3,0,last,False,False
4,0,i,True,False


## Ngrams
We will cover ngrams first because they are the easiest to visualize with our existing data.  An ngram is a sequence of n tokens that occur consecutively in a text; the example below prints the bigrams (n = 2) of a short token list.

In [11]:
ex_tokens = ['there', 'is', 'a', 'dog', 'in', 'my', 'purse']
for ngram in nltk.ngrams(ex_tokens, 2):
    print(ngram)

('there', 'is')
('is', 'a')
('a', 'dog')
('dog', 'in')
('in', 'my')
('my', 'purse')
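
To see what `nltk.ngrams` is doing, the same sliding window can be sketched in plain Python. The helper name `sliding_ngrams` is hypothetical and not part of NLTK; it zips the token list against shifted copies of itself.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest slice, so incomplete windows at the end are dropped
    return list(zip(*(tokens[i:] for i in range(n))))

ex_tokens = ['there', 'is', 'a', 'dog', 'in', 'my', 'purse']
bigrams = sliding_ngrams(ex_tokens, 2)
trigrams = sliding_ngrams(ex_tokens, 3)

print(bigrams[:2])
print(len(bigrams), len(trigrams))
```

A list of k tokens yields k - n + 1 ngrams, so the seven tokens here give six bigrams and five trigrams.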


We have seen ngram-related concepts at work previously when we looked at context and collocations.

In [12]:
# build an nltk.Text from the DataFrame's token column
text = nltk.Text(speech.token.tolist())

Remember that when we generated the context of a word, we got back a FreqDist of all of the word pairs it was surrounded by.

In [13]:
contexts = nltk.ContextIndex(text.tokens)
contexts._word_to_contexts['germany']

FreqDist({('with', 'to'): 1})

We can do this manually as well using ngrams.  Let's look for the word 'germany' and count its contexts: the (previous, next) word pairs taken from every trigram that has 'germany' in the middle.

In [14]:
from collections import Counter

context_dict = Counter()
for gram3 in nltk.ngrams(text.tokens, 3):
    if gram3[1] == 'germany':
        context_dict.update([(gram3[0],gram3[2])])
context_dict

Counter({('with', 'to'): 1})
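
The trigram loop above can be wrapped into a small reusable function. This is a sketch on a made-up token list rather than the speech, and the name `word_contexts` is hypothetical.

```python
from collections import Counter

def word_contexts(tokens, target):
    """Count (left, right) neighbor pairs around each occurrence of target."""
    contexts = Counter()
    # walk the trigrams; target must sit in the middle position
    for left, middle, right in zip(tokens, tokens[1:], tokens[2:]):
        if middle == target:
            contexts[(left, right)] += 1
    return contexts

# made-up tokens for illustration
toy_tokens = ['war', 'with', 'germany', 'to', 'end',
              'war', 'with', 'germany', 'to', 'win']
germany_contexts = word_contexts(toy_tokens, 'germany')
print(germany_contexts)
```

Note that an occurrence at the very first or last position has no complete trigram around it, so this sketch does not count it.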

Collocations are pairs of words that commonly occur in the same window.  NLTK's BigramCollocationFinder counts word pairs that co-occur within a configurable window, so as a manual example we will consider only the 4grams that contain both words of a collocation.

In [15]:
text.collocations()

british empire; wage war


In [17]:
manual_coll_list = [('british', 'empire'), ('wage', 'war')]
coll_dict = {}
for gram4 in nltk.ngrams(text.tokens, 4):
    for coll in manual_coll_list:
        term1,term2 = coll
        if term1 in gram4 and term2 in gram4:
            coll_dict[coll] = coll_dict.get(coll,[]) + [gram4]
            
print(coll_dict[('british','empire')])

[('for', 'the', 'british', 'empire'), ('the', 'british', 'empire', ','), ('british', 'empire', ',', 'no'), ('that', 'the', 'british', 'empire'), ('the', 'british', 'empire', 'has'), ('british', 'empire', 'has', 'stood')]
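
A similar windowed co-occurrence count is available from NLTK's `BigramCollocationFinder` through the `window_size` argument of `from_words`. The sketch below uses a made-up token list, not the speech, to show how widening the window picks up non-adjacent pairs.

```python
from nltk.collocations import BigramCollocationFinder

# made-up tokens for illustration
toy_tokens = ['the', 'british', 'empire', 'and', 'the', 'british', 'empire']

# window_size=2 counts only adjacent pairs
adjacent = BigramCollocationFinder.from_words(toy_tokens, window_size=2)
# window_size=3 also pairs each word with the word two positions ahead
windowed = BigramCollocationFinder.from_words(toy_tokens, window_size=3)

pair = ('the', 'empire')
print(adjacent.ngram_fd[pair], windowed.ngram_fd[pair])
print(adjacent.ngram_fd[('british', 'empire')])
```

The pair ('the', 'empire') is never adjacent, so it only shows up once the window is widened.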


## Stemming
Stemming is a crude way of shortening a word so that its various inflected forms do not prevent us from identifying them as the same word.

For example, two words appear in our speech, 'attacked' and 'attack'.  In most cases, we want these to be understood as the same token, 'attack'.  NLTK's PorterStemmer can help us do this.

In [18]:
stemmer = nltk.PorterStemmer()
print(stemmer.stem('attacked'))
print(stemmer.stem('attack'))

attack
attack
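
As a short sketch of why this matters for counting, the snippet below groups a few inflected forms by their Porter stem. The word list is made up for illustration.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ['attack', 'attacked', 'attacks', 'attacking', 'war', 'wars']

# map each stem to the surface forms that reduce to it
forms_by_stem = defaultdict(list)
for word in words:
    forms_by_stem[stemmer.stem(word)].append(word)

for stem, forms in forms_by_stem.items():
    print(stem, forms)
```

Counting by stem instead of by token would treat all four forms of 'attack' as one item.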


We will see that in certain cases, though, the stemmer will fail to do what we want.  For example, 'is' and 'are' are both forms of 'be', but the stemmer leaves them unchanged.

In [19]:
print(stemmer.stem('is'))
print(stemmer.stem('are'))

is
are


## Lemmatizing
NLTK's default lemmatizer is the WordNet lemmatizer.  It uses WordNet's morphy function to find the lemma of a word, given the word's part of speech (the default is noun).  The available parts of speech are:

|POS|Representation|
|---|---|
|ADJ|'a'|
|ADJ_SAT|'s'|
|ADV|'r'|
|NOUN|'n'|
|VERB|'v'|

In [20]:
lemmater = nltk.WordNetLemmatizer()
print(lemmater.lemmatize('is', pos='v'))
print(lemmater.lemmatize('are', pos='v'))

be
be


A few more examples:

In [21]:
print(lemmater.lemmatize('halves'))
print(lemmater.lemmatize('foci'))
print(lemmater.lemmatize('polarizing', pos='v'))
print(lemmater.lemmatize('wandering', pos='v'))

half
focus
polarize
wander


The lemmatizer does poorly on certain words, though: if WordNet does not find a lemma for a word, it returns the word unchanged, which can be less useful than our stemmer.

In [22]:
print(lemmater.lemmatize('merrily', pos='r'))
print(lemmater.lemmatize('additional', 'a'))

merrily
additional


One way around this is to use the WordNet synset, which is the associated set of 'cognitive synonyms' for the word. 

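Here we use 'merrily.r.01' to grab the first synset for 'merrily' with part of speech 'r'.  WordNet sense numbers start at 1, not 0.

In [23]:
from nltk.corpus import wordnet as wn

# name format is lemma.pos.sense_number, with sense numbers starting at 01
synset = wn.synset('merrily.r.01')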
print(synset)

Synset('happily.r.01')


We then grab all of its lemmas associated with the synset.

In [24]:
lemmas = synset.lemmas()
print(lemmas)

[Lemma('happily.r.01.happily'), Lemma('happily.r.01.merrily'), Lemma('happily.r.01.mirthfully'), Lemma('happily.r.01.gayly'), Lemma('happily.r.01.blithely'), Lemma('happily.r.01.jubilantly')]


Once you get to a lemma, you gain access to a whole new set of attributes, including antonyms, pertainyms, and derivationally related forms.  The point here is that the WordNet module is extremely powerful but also very complex, and standardizing a way to interact with it just to acquire the best lemma for each word is not simple.  This is why you may just want to use a stemmer, which is what we will do in this case.

In [79]:
speech['stem'] = speech.token.apply(stemmer.stem)

## POS Tagging
We had to identify the POS in order to properly lemmatize certain words with the WordNet lemmatizer.  But how do we get parts of speech in an automated fashion?  One option is to write your own model, but NLTK also offers a pre-trained, built-in POS tagger.  We have already discussed some of the features that these models take into account, so let's see how the POS tagger works.

In [80]:
from nltk import pos_tag

sample_tokens = speech[speech.sent_id==20].token.tolist()
print(pos_tag(sample_tokens))

[('i', 'NNS'), ('say', 'VBP'), ('to', 'TO'), ('the', 'DT'), ('house', 'NN'), ('as', 'IN'), ('i', 'NN'), ('said', 'VBD'), ('to', 'TO'), ('ministers', 'NNS'), ('who', 'WP'), ('have', 'VBP'), ('joined', 'VBN'), ('this', 'DT'), ('government', 'NN'), (',', ','), ('i', 'NN'), ('have', 'VBP'), ('nothing', 'NN'), ('to', 'TO'), ('offer', 'VB'), ('but', 'CC'), ('blood', 'NN'), (',', ','), ('toil', 'NN'), (',', ','), ('tears', 'NNS'), (',', ','), ('and', 'CC'), ('sweat', 'NN'), ('.', '.')]


In [81]:
for token,tag in pos_tag(sample_tokens):
    print('|'+token+'|'+tag+'||')

|i|NNS||
|say|VBP||
|to|TO||
|the|DT||
|house|NN||
|as|IN||
|i|NN||
|said|VBD||
|to|TO||
|ministers|NNS||
|who|WP||
|have|VBP||
|joined|VBN||
|this|DT||
|government|NN||
|,|,||
|i|NN||
|have|VBP||
|nothing|NN||
|to|TO||
|offer|VB||
|but|CC||
|blood|NN||
|,|,||
|toil|NN||
|,|,||
|tears|NNS||
|,|,||
|and|CC||
|sweat|NN||
|.|.||


### What do all of these tags mean?
The meaning of each tag depends on the tagset.  NLTK's default tagger uses the Penn Treebank tagset:

|Tag|Description|Example|
|---|---|---|
|CC|conjunction, coordinating|and, or, but|
|CD|cardinal number|five, three, 13%|
|DT|determiner|the, a, these |
|EX|existential there|there were six boys |
|FW|foreign word|mais |
|IN|conjunction, subordinating or preposition|of, on, before, unless |
|JJ|adjective|nice, easy|
|JJR|adjective, comparative|nicer, easier|
|JJS|adjective, superlative|nicest, easiest |
|LS|list item marker| |
|MD|verb, modal auxiliary|may, should |
|NN|noun, singular or mass|tiger, chair, laughter |
|NNS|noun, plural|tigers, chairs, insects |
|NNP|noun, proper singular|Germany, God, Alice |
|NNPS|noun, proper plural|we met two Christmases ago |
|PDT|predeterminer|both his children |
|POS|possessive ending|'s|
|PRP|pronoun, personal|me, you, it |
|PRP\$|pronoun, possessive|my, your, our |
|RB|adverb|extremely, loudly, hard  |
|RBR|adverb, comparative|better |
|RBS|adverb, superlative|best |
|RP|adverb, particle|about, off, up |
|SYM|symbol|None|
|TO|infinitival to|what to do? |
|UH|interjection|oh, oops, gosh |
|VB|verb, base form|think |
|VBZ|verb, 3rd person singular present|she thinks |
|VBP|verb, non-3rd person singular present|I think |
|VBD|verb, past tense|they thought |
|VBN|verb, past participle|a sunken ship |
|VBG|verb, gerund or present participle|thinking is fun |
|WDT|wh-determiner|which, whatever, whichever |
|WP|wh-pronoun, personal|what, who, whom |
|WP\$|wh-pronoun, possessive|whose, whosever |
|WRB|wh-adverb|where, when |
|.|punctuation mark, sentence closer|.;?* |
|,|punctuation mark, comma|, |
|:|punctuation mark, colon|: |
|(|contextual separator, left paren|( |
|)|contextual separator, right paren|) |

So, we can see that the parts of speech for the first tokens of our sentence are:

|Token|POS|Interpretation|
|--|--|--|
|i|NNS|noun, plural|
|say|VBP|verb, non-3rd person singular present|
|to|TO|infinitival to|
|the|DT|determiner|
|house|NN|noun, singular or mass|
|as|IN|conjunction, subordinating or preposition|
|i|NN|noun, singular or mass|
|said|VBD|verb, past tense|
|to|TO|infinitival to|
|ministers|NNS|noun, plural|
|who|WP|wh-pronoun, personal|
|have|VBP|verb, non-3rd person singular present|
|joined|VBN|verb, past participle|
|this|DT|determiner|
|government|NN|noun, singular or mass|

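Notice that "i" receives two different tags (NNS and NN) at different places in the sentence.  This indicates that the tagger considers the surrounding context (syntax) when it assigns a part of speech.  Neither tag is the PRP we would expect for a personal pronoun, most likely because our tokens were lowercased and the tagger is used to seeing the pronoun written as "I".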
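
The interpretation table above was filled in by hand, but the same lookup can be sketched with a dictionary. Only a handful of tags from the tag table are included here, and the name `tag_descriptions` is hypothetical.

```python
# partial tag -> description lookup, copied from the tag table
tag_descriptions = {
    'NNS': 'noun, plural',
    'VBP': 'verb, non-3rd person singular present',
    'TO': 'infinitival to',
    'DT': 'determiner',
    'NN': 'noun, singular or mass',
}

# first five (token, tag) pairs from the tagged sentence
tagged = [('i', 'NNS'), ('say', 'VBP'), ('to', 'TO'), ('the', 'DT'), ('house', 'NN')]

# tags missing from the lookup get a placeholder instead of raising KeyError
rows = [(token, tag, tag_descriptions.get(tag, 'unknown tag')) for token, tag in tagged]
for row in rows:
    print('|' + '|'.join(row) + '|')
```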

We can get the part of speech for a single token by writing a little function!

In [82]:
def get_pos(token):
    return pos_tag([token])[0][1]

get_pos('i')

'NN'

However, we are much better off tagging multiple tokens at once so that more context is given.

In [83]:
tokens = speech.token.tolist()
pos_tags = pos_tag(tokens)
just_tags = [x[1] for x in pos_tags]

And we can create a new feature column for our tokens called 'pos' that stores the POS for each token.

In [84]:
# assign the tags by position (one tag per row) rather than relying on index alignment
speech['pos'] = just_tags
speech.head()

Unnamed: 0,sent_id,token,is_stop,is_punct,pos,generic_pos,stem
0,0,on,True,False,IN,Misc,on
1,0,friday,False,False,JJ,Adjective,friday
2,0,evening,False,False,VBG,Verb,even
3,0,last,False,False,JJ,Adjective,last
4,0,i,True,False,JJ,Adjective,i
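
With Penn Treebank tags available, one common follow-up is to translate them into the single-letter codes that the WordNet lemmatizer expects. The helper below is only a sketch: `penn_to_wordnet` is a hypothetical name, and falling back to 'n' for every other tag is an assumption that mirrors the lemmatizer's noun default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('a', 'v', 'r' or 'n')."""
    if tag.startswith('J'):
        return 'a'   # JJ, JJR, JJS -> adjective
    if tag.startswith('V'):
        return 'v'   # VB, VBD, VBG, VBN, VBP, VBZ -> verb
    if tag.startswith('R'):
        return 'r'   # RB, RBR, RBS (and RP) -> adverb
    return 'n'       # everything else falls back to noun (assumption)

wordnet_codes = [penn_to_wordnet(tag) for tag in ['VBP', 'JJ', 'NNS', 'RB', 'DT']]
print(wordnet_codes)
```

The returned letter can then be passed as the `pos` argument of `lemmatize`, for example `lemmater.lemmatize(token, pos=penn_to_wordnet(tag))`.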


We can get an idea of the distributions of our parts of speech by using value_counts.

In [85]:
speech.pos.value_counts()

NN      136
IN       92
DT       82
JJ       43
VB       40
.        36
,        28
NNS      28
CC       24
VBP      23
VBN      23
TO       21
PRP      19
MD       17
VBZ      16
RB       14
PRP$     13
CD        6
VBD       6
VBG       5
WP        4
:         3
''        3
JJS       3
POS       2
RBS       2
WDT       2
JJR       2
``        1
RP        1
PDT       1
WRB       1
EX        1
Name: pos, dtype: int64

If I look at just the noun-type parts of speech, I can find out the key focuses of the speech.

In [86]:
noun_pos = ['NN', 'NNS', 'NNP', 'NNPS']
speech[speech.pos.isin(noun_pos)].token.value_counts().head(10)

i                 12
house              6
war                5
victory            4
survival           4
nation             3
today              3
government         3
administration     3
task               3
Name: token, dtype: int64
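
The most frequent "noun" in this list is the token 'i', which is really a pronoun. Since the DataFrame already carries an `is_stop` flag, one option is to exclude stop words before counting. The sketch below uses a small made-up frame with the same column names.

```python
import pandas as pd

# made-up rows that mimic the speech DataFrame's columns
toy = pd.DataFrame({
    'token':   ['i', 'house', 'war', 'i', 'house', 'offer'],
    'is_stop': [True, False, False, True, False, False],
    'pos':     ['NN', 'NN', 'NN', 'NN', 'NN', 'VB'],
})

noun_pos = ['NN', 'NNS', 'NNP', 'NNPS']
# keep noun-tagged rows that are not flagged as stop words
content_nouns = toy[toy.pos.isin(noun_pos) & ~toy.is_stop]
noun_counts = content_nouns.token.value_counts()
print(noun_counts)
```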

Since these generic groups of parts of speech are consistent regardless of text, let's make a dictionary to map each specific pos to a generic one based on the set: {'Pronoun', 'Adverb', 'Foreign', 'Determiner', 'Existential', 'Misc', 'Verb', 'Adjective', 'Noun', '.'}

In [94]:
pos_map = {
    'CC':'Misc',
    'CD':'Adjective',
    'DT':'Determiner',
    'EX':'Existential',
    'FW':'Foreign',
    'IN':'Misc',
    'JJ':'Adjective',
    'JJR':'Adjective',
    'JJS':'Adjective',
    'MD':'Verb',
    'NN':'Noun',
    'NNS':'Noun',
    'NNP':'Noun',
    'NNPS':'Noun',
    'PDT':'Determiner',
    'POS':'Misc',
    'PRP':'Pronoun',
    'PRP$':'Pronoun',
    'RB':'Adverb',
    'RBR':'Adverb',
    'RBS':'Adverb',
    'RP':'Adverb',
    'SYM':'Misc',
    'TO':'Misc',
    'UH':'Misc',
    'VB':'Verb',
    'VBZ':'Verb',
    'VBP':'Verb',
    'VBD':'Verb',
    'VBN':'Verb',
    'VBG':'Verb',
    'WDT':'Determiner',
    'WP':'Pronoun',
    'WP$':'Pronoun',
    'WRB':'Adverb',
    '.':'.',
    ',':'.',
    ':':'.',
    '(':'Misc',
    ')':'Misc',
    "''":'Misc',
    "``":'Misc',
    '$':'Misc',
}

def map_generic_pos(tag):
    # fall back to 'Misc' for tags missing from pos_map (e.g. 'LS')
    return pos_map.get(tag, 'Misc')

print(set(pos_map.values()))

{'Pronoun', 'Adverb', 'Foreign', 'Determiner', 'Existential', 'Misc', 'Verb', 'Adjective', 'Noun', '.'}


In [88]:
speech['generic_pos'] = speech.pos.apply(map_generic_pos)
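
An alternative sketch passes the dictionary straight to `Series.map` and fills unmapped tags with a default, so a tag that is absent from the dictionary does not raise a `KeyError`. The shortened dictionary and the 'Misc' default are assumptions for illustration.

```python
import pandas as pd

# shortened mapping for illustration; 'LS' is deliberately left out
small_map = {'NN': 'Noun', 'NNS': 'Noun', 'VB': 'Verb', 'JJ': 'Adjective'}
tags = pd.Series(['NN', 'VB', 'JJ', 'NNS', 'LS'])

# map() yields NaN for tags absent from the dictionary; fillna supplies a default
generic = tags.map(small_map).fillna('Misc')
print(generic.tolist())
```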

Now we can repeat what was done previously, without having to manually enter the grouping we want.

In [91]:
speech[speech.generic_pos=='Noun'].token.value_counts().head(20)

i                 12
house              6
war                5
victory            4
survival           4
nation             3
today              3
government         3
administration     3
task               3
ministers          3
part               2
resolution         2
colleagues         2
empire             2
parliament         2
air                2
strength           2
wage               2
appointment        2
Name: token, dtype: int64

In [68]:
speech[speech.generic_pos=='Adjective'].token.value_counts().head(10)

many         4
one          3
other        3
i            2
united       2
new          2
necessary    2
british      2
grievous     1
monstrous    1
Name: token, dtype: int64

Let's see if our answers change when we look at the stems.

In [93]:
speech[speech.generic_pos=='Adjective'].stem.value_counts().head(10)

mani         4
other        3
one          3
british      2
new          2
i            2
necessari    2
unit         2
victori      2
1            1
Name: stem, dtype: int64

And finally, we can get an idea of how often the speech uses each generic pos type.

In [72]:
speech.generic_pos.value_counts()/len(speech)

Noun           0.234957
Misc           0.204871
Verb           0.186246
Determiner     0.121777
.              0.095989
Adjective      0.077364
Pronoun        0.051576
Adverb         0.025788
Existential    0.001433
Name: generic_pos, dtype: float64
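
pandas can also compute these proportions directly with `value_counts(normalize=True)`. A minimal sketch on a made-up Series:

```python
import pandas as pd

# made-up generic POS labels for illustration
generic_pos = pd.Series(['Noun', 'Verb', 'Noun', 'Misc'])

# normalize=True divides each count by the number of non-null values
shares = generic_pos.value_counts(normalize=True)
print(shares)
```

The two approaches agree as long as the column has no missing values, because `value_counts` drops NaN by default while `len` counts every row.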