# Everything You Always Wanted to Know About NLP but Were Afraid to Ask

See the accompanying slideshow [here](https://docs.google.com/presentation/d/1rYZEd7-8sZGBzg75OOPvSkIfd1FHq_d4elptiZXzJj8/edit?usp=sharing).

### Contact Info

Steven:          
* Github: srbutler
* email: [srbutler@gmail.com](mailto:srbutler@gmail.com)

Max:         
* Github: maxwell-schwartz 
* Twitter: @DeathAndMaxes
* email: [maxwell.schwartz11@gmail.com](mailto:maxwell.schwartz11@gmail.com)

## Morphology: words and what they're made of

These groups are taken from NLTK's implementation of the Snowball stemmer, which is (obviously) significantly more complex than this, and the test words are taken from NLTK's documentation:

In [2]:
# always be prepared from the get-go
from __future__ import division
from __future__ import print_function

In [3]:
sgroup1 = ("'s'", "'s", "'")
sgroup2 = ("sses", "ied", "ies", "us", "ss", "s")
sgroup3 = ("eedly", "ingly", "edly", "eed", "ing", "ed")
sgroup4 = ('ization', 'ational', 'fulness', 'ousness', 'iveness', 'tional', 'biliti', 'lessli', 'entli', 'ation', 'alism', 'aliti', 'ousli',  'iviti', 'fulli', 'enci', 'anci', 'abli', 'izer', 'ator', 'alli', 'bli', 'ogi', 'li')
sgroup5 = ('ational', 'tional', 'alize', 'icate', 'iciti', 'ative', 'ical', 'ness', 'ize', 'ful')
sgroup6 = ('ement', 'ance', 'ence', 'able', 'ible', 'ment', 'ant', 'ent', 'ism', 'ate', 'iti', 'ous', 'ive', 'ion', 'al', 'er', 'ic', 'ology')

groups = [sgroup1, sgroup2, sgroup3, sgroup4, sgroup5, sgroup6]
groups.reverse()

In [4]:
def stem_word(word):
    
    word_out = word
    
    for group in groups:
        
        for suffix in group:
            
            if word.endswith(suffix) or (word+"e").endswith(suffix):
                
                offset = -len(suffix)
                return word[:offset]
            
    return word

In [18]:
words = ['caresses', 'flies', 'dies', 'mules', 'denied', 'died', 
         'agreed', 'owned', 'humbled', 'sized', 'meeting', 'stating', 
         'seizing', 'itemization', 'sensational', 'traditional', 
         'reference', 'colonizer', 'plotted', 'guessed', 'chilling']

stems = [(word + ": " + stem_word(word)) for word in words]
print('\n'.join(stems))

caresses: care
flies: fl
dies: d
mules: mule
denied: deni
died: di
agreed: agr
owned: own
humbled: humbl
sized: siz
meeting: meet
stating: stat
seizing: seiz
itemization: itemizat
sensational: sensation
traditional: tradition
reference: refer
colonizer: coloniz
plotted: plott
guessed: guess
chilling: chill


Of course, using stemmers that take into account more rules and nuance are going to give better results. An algorithm developed by [Martin Porter](https://en.wikipedia.org/wiki/Martin_Porter) is more or less the *de facto* standard/baseline for English stemmers these days, and is of course built into NLTK:

In [17]:
from nltk.stem.porter import *

stemmer = PorterStemmer()
words = ['caresses', 'flies', 'dies', 'mules', 'denied', 'died', 
         'agreed', 'owned', 'humbled', 'sized', 'meeting', 'stating', 
         'seizing', 'itemization', 'sensational', 'traditional', 
         'reference', 'colonizer', 'plotted', 'guessed', 'chilling']

stems = [(word + ": " + stemmer.stem(word)) for word in words]
print('\n'.join(stems))

caresses: caress
flies: fli
dies: die
mules: mule
denied: deni
died: die
agreed: agre
owned: own
humbled: humbl
sized: size
meeting: meet
stating: state
seizing: seiz
itemization: item
sensational: sensat
traditional: tradit
reference: refer
colonizer: colon
plotted: plot
guessed: guess
chilling: chill


There are many other approaches you can take to this, of course. One is to stop thinking about language-specific rules altogether and letting an algorithm find the patterns for you. The method presented below, an implementation of an old algorithm called Morfessor, is an **unsupervised machine learning** approach. That is, it an approach that takes a corpus and makes predictions about that data (in this case, where morpheme boundaries are) without being told anything explicit about the provided data.

This approach loads a wordlist, goes through a training process, and then returns a model that can be used to segment new input based on what it learned from the initial wordlist.

In [7]:
## ensure it's installed first: pip install morfessor
import morfessor

def train_model(input_file, output_file=None):

    ## setup IO and model objects
    morf_io = morfessor.MorfessorIO()
    morf_model = morfessor.BaselineModel()

    ## build a corpus from input file
    train_data = morf_io.read_corpus_file(input_file)

    ## load data into model
    morf_model.load_data(train_data)

    ## train the model in batch form (online training also available)
    morf_model.train_batch()

    ## optionally pickle model
    if output_file is not None:
        morf_io.write_binary_model_file(output_file, morf_model)

    return morf_model

In [8]:
## train a model on the Turkish dataset and save it to a file
## be patient! it will be slow.
# model_english = train_model("data/wordlist.eng", "output/trainedmodel.eng")

## otherwise, use the trained model if it's already saved
morf_io = morfessor.MorfessorIO()
model_english = morf_io.read_binary_model_file("data/trainedmodel.eng")

In [16]:
words = ['caresses', 'flies', 'dies', 'mules', 'denied', 'died', 
         'agreed', 'owned', 'humbled', 'sized', 'meeting', 'stating', 
         'seizing', 'itemization', 'sensational', 'traditional', 
         'reference', 'colonizer', 'plotted', 'guessed', 'chilling']

for word in words:
    
#     root = max(model_english.segment(word), key=len)
    root = model_english.segment(word)[0]
    print(word, ":", root)

caresses : caress
flies : flies
dies : dies
mules : mule
denied : deni
died : died
agreed : agree
owned : own
humbled : humble
sized : s
meeting : meet
stating : stating
seizing : s
itemization : item
sensational : sensation
traditional : traditional
reference : reference
colonizer : colon
plotted : plot
guessed : guess
chilling : chilling


## Finding Word Breaks

In [22]:
from collections import Counter
from nltk.corpus import brown

Get simple probabilities from a corpus (Brown), stored in an easy-to-use collection (a frequency distribution).

In [23]:
def get_corpus_probabilities():

    # get the total counts for each word in the corpus
    counts = Counter(brown.words())

    total = len(counts)

    # overwrite the count of each word with its probability
    for key in counts.keys():
        counts[key] = counts[key] / total

    return counts

probabilities = get_corpus_probabilities()

We can then use a simple implementation of a [Viterbi algorithm](https://en.wikipedia.org/wiki/Viterbi_algorithm) to efficiently search through the string and predict the most likely places for word breaks (spaces) to be inserted.

Algorithms of this type ([dynamic programming](https://en.wikipedia.org/wiki/Dynamic_programming)) are used for a huge number of tasks in NLP, especially in those that involve **decoding** an unclear input (a sentence without word breaks, an audio speech signal, etc.).

In [36]:
def viterbi_search(sent_str, fdist_obj):

    sent_len = len(sent_str) + 1

    # make a list to store the probability that a break happens after each letter in the sentence
    probs = [0.0] * sent_len
    
    # the beginning of the sentence is a break, by definition
    probs[0] = 1.0
    
    # have a list to store your potential start and end points for words, which are 
    # the complements of the break points
    ranges = []

    # outer loop is word ending index
    for i in range(1, sent_len):

        # inner loop is word beginning index
        for j in range(0, i):
            word = sent_str[j:i]
            word_prob = fdist_obj[word]

            # probs[j] is the current start letter's probability
            test_prob = word_prob * probs[j]

            if test_prob > probs[i]:
                probs[i] = test_prob
                ranges.append((j, i-1))

    # now, pick only the word start/end pairs that don't overlap
    ranges.reverse()
    filtered = [ranges[0]]

    for i in range(1, len(ranges)):

        # prevents some slicing issues
        if ranges[i] is None:
            pass

        elif ranges[i][1] == filtered[-1][0] - 1:
            filtered.append(ranges[i])
    
    print(ranges)
    print(filtered)
    return filtered

In [37]:
def print_result(sentence, ranges):
    words = []

    for r in ranges:
        word = sentence[r[0]:r[1]+1]
        words.append(word)

    words.reverse()

    print(" ".join(words))

In [38]:
# test_sentence = "howlongofasentencecanweputherewithoutsomethingbreakingeverythingidon'tevenknowcanweextenditevenfurther"
test_sentence = "thissentence"

word_ranges = viterbi_search(test_sentence, probabilities)
print_result(test_sentence, word_ranges)

[(4, 11), (8, 9), (7, 9), (8, 8), (4, 7), (6, 6), (5, 6), (4, 5), (1, 4), (0, 3), (2, 2), (1, 1), (0, 0)]
[(4, 11), (0, 3)]
this sentence


With the probabilities already loaded, it's fast! But it breaks on words that it doesn't recognize, or that are **out of vocabulary** (OOV):

In [27]:
test_sentence = "thecurrentpresidentoftheunitedstatesisTeddyRoosevelt"
word_ranges = viterbi_search(test_sentence, probabilities)
print_result(test_sentence, word_ranges)

[1.0, 8.919492659257541e-05, 3.182293971941967e-09, 1.1187362862800363, 0.0, 1.9957120186239655e-05, 1.9957120186239655e-05, 3.5601477400217023e-10, 3.5601477400217023e-10, 4.445659628619426e-14, 0.0020156691388102054, 1.0427672730523566e-06, 1.8601910074608996e-11, 1.8601910074608996e-11, 0.0, 0.0, 0.0, 7.191498434843839e-08, 8.98023244981124e-12, 4.494686521777399e-06, 8.018064687331465e-11, 2.8929177391891924e-06, 2.5803358538533923e-10, 9.206114682745749e-15, 3.2364120480541556e-06, 1.1546861401980683e-10, 1.1546861401980683e-10, 6.179528730745857e-15, 5.42702485893092e-09, 5.773430700990341e-10, 1.0969518331881647e-09, 0.0, 0.0, 0.0, 0.0, 1.0645268160164861e-11, 3.0918242082831767e-12, 1.6546501997697933e-16, 5.521567716631801e-13, 7.09193991111707e-16, 3.795390358626257e-20, 6.894941580252708e-17, 0.0, 1.9699833086436307e-17, 1.3002726228626993e-20, 2.3195544229314793e-25, 4.1378497296171386e-30, 7.0285006641227e-22, 3.7614396047537504e-26, 0.0, 6.269066007922917e-26, 5.591688823

<h2>N-Grams</h2>

In [39]:
def myTokenizer(text):
    '''Breaks up text into individual words and punction marks.'''
    
    puncs = ['.', ',', ';', '"']
    
    # We will split on spaces, so we want punctuation separated.
    for punc in puncs:
        text = text.replace(punc, ' ' + punc)
        
    split_txt = text.split()
    
    return split_txt

In [40]:
no_prob = "This sentence, even with punctuation, should work just fine."

In [41]:
print(myTokenizer(no_prob))

['This', 'sentence', ',', 'even', 'with', 'punctuation', ',', 'should', 'work', 'just', 'fine', '.']


In [42]:
prob = "This'll have problems because of the contractions; it isn't gonna work as well."

In [43]:
print(myTokenizer(prob))

["This'll", 'have', 'problems', 'because', 'of', 'the', 'contractions', ';', 'it', "isn't", 'gonna', 'work', 'as', 'well', '.']


In [44]:
from nltk import word_tokenize
from nltk.util import ngrams

In [45]:
tokens1 = word_tokenize(no_prob)
tokens2 = word_tokenize(prob)

In [46]:
print(tokens1)

['This', 'sentence', ',', 'even', 'with', 'punctuation', ',', 'should', 'work', 'just', 'fine', '.']


In [47]:
print(tokens2)

['This', "'ll", 'have', 'problems', 'because', 'of', 'the', 'contractions', ';', 'it', 'is', "n't", 'gon', 'na', 'work', 'as', 'well', '.']


In [48]:
bigrams = ngrams(tokens2, 2)
bigrams

<generator object ngrams at 0x115fffbf8>

In [49]:
for b in bigrams:
    print(b)

('This', "'ll")
("'ll", 'have')
('have', 'problems')
('problems', 'because')
('because', 'of')
('of', 'the')
('the', 'contractions')
('contractions', ';')
(';', 'it')
('it', 'is')
('is', "n't")
("n't", 'gon')
('gon', 'na')
('na', 'work')
('work', 'as')
('as', 'well')
('well', '.')


In [50]:
trigrams = ngrams(tokens2, 3)
for t in trigrams:
    print(t)

('This', "'ll", 'have')
("'ll", 'have', 'problems')
('have', 'problems', 'because')
('problems', 'because', 'of')
('because', 'of', 'the')
('of', 'the', 'contractions')
('the', 'contractions', ';')
('contractions', ';', 'it')
(';', 'it', 'is')
('it', 'is', "n't")
('is', "n't", 'gon')
("n't", 'gon', 'na')
('gon', 'na', 'work')
('na', 'work', 'as')
('work', 'as', 'well')
('as', 'well', '.')


In [51]:
eightgrams = ngrams(tokens2, 8)
for e in eightgrams:
    print(e)

('This', "'ll", 'have', 'problems', 'because', 'of', 'the', 'contractions')
("'ll", 'have', 'problems', 'because', 'of', 'the', 'contractions', ';')
('have', 'problems', 'because', 'of', 'the', 'contractions', ';', 'it')
('problems', 'because', 'of', 'the', 'contractions', ';', 'it', 'is')
('because', 'of', 'the', 'contractions', ';', 'it', 'is', "n't")
('of', 'the', 'contractions', ';', 'it', 'is', "n't", 'gon')
('the', 'contractions', ';', 'it', 'is', "n't", 'gon', 'na')
('contractions', ';', 'it', 'is', "n't", 'gon', 'na', 'work')
(';', 'it', 'is', "n't", 'gon', 'na', 'work', 'as')
('it', 'is', "n't", 'gon', 'na', 'work', 'as', 'well')
('is', "n't", 'gon', 'na', 'work', 'as', 'well', '.')


<h2>Let's Play With a Corpus</h2>

In [52]:
from nltk.corpus import inaugural, brown
import random

In [53]:
def addS(text):
    '''Add open and closed sentence tags to every sentence in a corpus.'''
    
    for i in range(len(text)):
        text[i].insert(0, '<s>')
        text[i].append('</s>')
        
    return text

In [54]:
def nextWord(seed, BGs):
    '''Find all words that follow a chosen word. Randomly choose one.'''
    
    choices = [words[1] for words in BGs if words[0] == seed]
    random.shuffle(choices)
    
    return choices[0]

In [55]:
def tgNextWord(seed1, seed2, TGs):
    '''Find all words that follow two chosen words. Randomly choose one.'''
    
    choices = [words[2] for words in TGs if words[0] == seed1 and words[1] == seed2]
    random.shuffle(choices)
    
    return choices[0]

In [56]:
type(inaugural.sents())

nltk.corpus.reader.util.ConcatenatedCorpusView

In [57]:
print(inaugural.sents()[0])

['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the', 'House', 'of', 'Representatives', ':']


In [58]:
inaug = addS(list(inaugural.sents()))
rom = addS(list(brown.sents(categories='romance')))
BGList = []
for sent in inaug:
    bgs = list(ngrams(sent, 2))
    # We now have a list of lists. We just want one list.
    for b in bgs:
        BGList.append(b)
for sent in rom:
    bgs = list(ngrams(sent, 2))
    for b in bgs:
        BGList.append(b)

In [59]:
BGList[0:3]

[('<s>', 'Fellow'), ('Fellow', '-'), ('-', 'Citizens')]

In [60]:
# Now we can generate sentences.
# 1) Start with the open sentence tag as a seed.
# 2) Randomly pick a word from the corpus that can follow the seed.
# 3) That word becomes the next seed.
# 4) Repeat until close sentence tag.
seed = '<s>'
while seed != '</s>':
    word = nextWord(seed, BGList)
    print(word, end=' ')
    seed = word

There is said , our lives . </s> 

In [61]:
# Same as with the bigrams above, but with trigrams this time.
# Let's also expand our corpus a bit.
sci = addS(list(brown.sents(categories='science_fiction')))
adv = addS(list(brown.sents(categories='adventure')))
TGList = []
for sent in inaug:
    tgs = list(ngrams(sent, 3))
    # We now have a list of lists. We just want one list.
    for t in tgs:
        TGList.append(t)
for sent in rom:
    tgs = list(ngrams(sent, 3))
    for t in tgs:
        TGList.append(t)
for sent in sci:
    tgs = list(ngrams(sent, 3))
    for t in tgs:
        TGList.append(t)
for sent in adv:
    tgs = list(ngrams(sent, 3))
    for t in tgs:
        TGList.append(t)

In [63]:
# Trigrams should produce slightly more grammatical results.
# For this, we need to generate the first word with bigrams.
# Then we can use "<s> + FirstWord" as our seed.
firstWord = nextWord('<s>', BGList)
seed1 = '<s>'
seed2 = firstWord
print(firstWord, end=' ')
while seed2 != '</s>':
    word = tgNextWord(seed1, seed2, TGList)
    print(word, end=' ')
    seed1, seed2 = seed2, word

When one surveys the world , we can not be precisely such as these can aid the fulfillment of my humble abilities to their exertions in the outset for the cup of good government , whether you are ! ! </s> 

# Parts of Speech

In [64]:
from collections import defaultdict
from math import log

from nltk.tokenize import word_tokenize
from nltk.corpus import brown
import numpy as np

In [65]:
class CorpusDataset(object):

    def __init__(self, corpus_obj, filter_oovs=False):

        # pulls appropriate sentences based on init values
        self.sents = corpus_obj.tagged_sents(tagset="universal")

        if filter_oovs is True:
            self.sents = self.oov_smoother()

        # init ngrams
        self.unigram_tagfd, self.unigram_cfd, self.bigram_cfd = self.set_ngrams()

        # set probabilities
        self.word_condprob = self.set_probabilities(self.unigram_cfd, "word")
        self.tag_condprob = self.set_probabilities(self.bigram_cfd, "tag")

    def oov_smoother(self):
        # filters through self.sents and replaces first instances of any token
        # with u'<unk>', in order to deal with OOV problem

        # set is used in lieu of a list for speed
        tracker = set()
        out_sents = []

        # filters through each layer, using tracker to test for first instance
        for sentence in self.sents:
            sent_new = []

            for word_tag in sentence:

                if word_tag[0] not in tracker:
                    tracker.add(word_tag[0])
                    sent_new.append((u'<UNK>', word_tag[1]))

                else:
                    sent_new.append(word_tag)

            out_sents.append(sent_new)

        return out_sents

    def set_ngrams(self):

        unigram_tag_fd = defaultdict(int)
        unigram_tag_cfd = {}
        bigram_tag_cfd = {}

        for sentence in self.sents:

            # for sentence in sentence_group:
            # print(sentence)

            # add sentence start and end tags (and a corresponding
            # "part of speech" for both of them)
            sentence.insert(0, (u'<s>', u'<s>'))
            sentence.append((u'</s>', u'</s>'))

            unigram_tag_fd[sentence[0][1]] += 1

            # start iterating from index 1 so that you always
            # have a previous tag to look at
            for i in range(1, len(sentence)):

                # drastically improves readability below
                word = sentence[i][0]
                tag = sentence[i][1]
                prev_tag = sentence[i-1][1]

                # sets tag frequencies
                unigram_tag_fd[tag] += 1

                # sets conditional counts for word given tag
                if word not in unigram_tag_cfd:
                    unigram_tag_cfd[word] = defaultdict(int)

                unigram_tag_cfd[word][tag] += 1

                # sets conditional counts for tag given previous tags
                if tag not in bigram_tag_cfd:
                    bigram_tag_cfd[tag] = defaultdict(int)

                bigram_tag_cfd[tag][prev_tag] += 1

        return unigram_tag_fd, unigram_tag_cfd, bigram_tag_cfd

    def set_probabilities(self, cfd, cfd_type):

        if cfd_type == "word":
            for word in cfd:
                for tag in cfd[word]:
                    cfd[word][tag] = log(cfd[word][tag] /
                                         self.unigram_tagfd[tag])

        elif cfd_type == "tag":
            for tag in cfd:
                for prev_tag in cfd[tag]:
                    cfd[tag][prev_tag] = log(cfd[tag][prev_tag] /
                                             self.unigram_tagfd[tag])

        return cfd

    def get_word_cfd(self):
        """Get a conditional frequency distribution for words given a tag."""
        return self.word_condprob

    def get_tag_cfd(self):
        """Get a conditional frequency distribution for a tag given a previous tag."""
        return self.tag_condprob

In [66]:
def get_search_matrices(sentence, tagset, tag_cfd, word_cfd):
    """Builds the state and trace matrices for the viterbi algorithm.

    A state matrix (a numpy array, with axis 0 as word and axis 1 as POS) and
    a backtrace matrix (a list) are constructed for the input sequence and the
    universal tagset. After initialization, probabilities for the first input
    word and each POS are added to the state matrix, and the first POS is added
    to the traces matrix.

    Returns state matrix and traces list for viterbi_search()
    """

    # set a fallback POS for cases where a tag-tag bigram doesn't exist
    fallback_pos = 'NOUN'

    # initialize matrices based on length of input sentence and number of tags
    states = np.zeros((len(sentence), len(tagset)))
    traces = []

    # set first column transition probabilities
    for i in range(len(tagset)):

        # added for more readable code below
        tag = tagset[i]
        first_word = sentence[0]

        # OOV variable for correct fallback assignment
        first_word_is_oov = False

        word_prob = 0.0
        # test for given word
        if sentence[0] in word_cfd:
            word_prob = word_cfd[first_word][tag]

        # test lowercased if previous fails
        elif sentence[0].lower() in word_cfd:
            word_prob = word_cfd[first_word.lower()][tag]

        # set the OOV variable to true so that fallback is assigned
        else:
            first_word_is_oov = True

        # set the tag probability for beginning a sentence
        try:
            tag_prob = tag_cfd[tag]['<s>']
        # if the lookup fails, use the fallback POS tag
        except KeyError:
            try:
                tag_prob = tag_cfd[tag][fallback_pos]
            # if THIS lookup fails, use the default val (see next comment)
            except KeyError:
                tag_prob = -18.0 * (i + 1)

        # this series deals with word/tag probabilities that end up as 0
        # because adding log probabilities doesn't have the multiply-by-zero
        # failsafe that multiplying probabilities does, is used as a floor
        # value; it approximates the lowest probability for a value in the
        # data
        if word_prob == 0.0:
            states[0][i] = -18.0
        elif tag_prob == 0.0:
            states[0][i] = -18.0
        else:
            states[0][i] = word_prob + tag_prob

    # sets backtrace as fallback POS if the word is OOV
    # this was necessary to prevent certain errors, since the most common
    # sentence initial unknown POSs were things like numbers, which throws off
    # the rest of the algorithm
    # if first_word_is_oov is True:
    #     traces.append(fallback_pos)
    # else:
    # np.argmax gives the index for the highest value in an array
    max_prob = np.argmax(states[0])
    best_pos = tagset[max_prob]
    traces.append(best_pos)

    return states, traces

In [75]:
def get_part_of_speech(sentence, tagset, tag_cfd, word_cfd):
    """Returns a POS tag sequence for an input sentence.

    This is similar in implementation to initMatrices() above. For each word
    and each POS, three log probabilities are added (probability of a word
    given a tag, probability of a tag given a previous tag, and the previous
    word's probability with its best tag). The POS with the highest probability
    for a word is then added to the traces list, which is returned. The <UNK>
    tag is used as a standin for OOV words, and NOUN/NN is used as a fallback
    tag for problems with previous tag probability.

    Returns a list of POS tags corresponding to the input string.
    """

    # get the matrices for tracking input
    states, traces = get_search_matrices(sentence, tagset, tag_cfd, word_cfd)

    # use NOUN as the default tag
    fallback_pos = 'NOUN'

    for i in range(1, len(sentence)):
        for j in range(len(tagset)):

            # the current word and POS being checked
            # added for more readable code below
            word = sentence[i]
            tag = tagset[j]

            # test to see if the current word is in the CFD
            # if it is, get it's probability given the current tag
            if word in word_cfd:
                word_prob = word_cfd[word][tag]

            # test the same word but lowercased if the previous fails
            elif word.lower() in word_cfd:
                word_prob = word_cfd[word.lower()][tag]

            # take the probability for an unknown given this tag
            else:
                word_prob = word_cfd['<UNK>'][tag]

            # tag probability given previous tag
            # because traces is being incrementally filled, get its last entry
            try:
                tag_prob = tag_cfd[tag][traces[-1]]
            except KeyError:
                try:
                    tag_prob = tag_cfd[tag][fallback_pos]
                except KeyError:
                    tag_prob = -18.0 * (i + 1)

            # probability of previous word
            prev_word_prob = np.argmax(states[i - 1])

            if word_prob == 0.0:
                states[i][j] = -18.0 * (i + 1)
            elif tag_prob == 0.0:
                states[i][j] = -18.0 * (i + 1)
            else:
                prob = word_prob + tag_prob + prev_word_prob
                states[i][j] = prob

        # gets the index of the part of speech with the highest probability
        max_prob = np.argmax(states[i])

        # pulls the name of that part of speech out using that index
        best_pos = tagset[max_prob]

        # add the best POS to the eventual output
        traces.append(best_pos)

    return traces

In [76]:
universal_tagset = ['ADJ', 'ADP', 'ADV', 'CONJ', 'DET', 'NOUN', 'NUM', 'PRT', 'PRON', 'VERB', '.', 'X']

trained_corpus = CorpusDataset(brown, filter_oovs=True)
word_cfd = trained_corpus.get_word_cfd()
tag_cdf = trained_corpus.get_tag_cfd()

In [78]:
# sentence = word_tokenize("Time flies like an arrow, fruit flies like a banana.")
sentence = word_tokenize("What is your favorite color?")


pos_list = get_part_of_speech(sentence, universal_tagset, tag_cdf, word_cfd)

print(" ".join(pos_list))

DET VERB DET ADJ NOUN .


<h2>Similar Words</h2>

In [99]:
from nltk import Text

corpus = Text(inaugural.words())

In [100]:
# Find words in the same context as a given word.
# Context = Previous-Word, Our-Word, Following-Word
corpus.similar('in')

of and to by for that with from on is as at upon under all but when
among within through


In [101]:
corpus.similar('America')

it freedom peace government all god congress life them justice which
others us war people nations law power this interest


In [102]:
corpus.similar('man')

government peace them nation life congress people war union america it
freedom all which men us god nations duty power


In [103]:
corpus.similar('woman')

discussion opinions who discord sincerity canals death that
maintaining prosperity women received dependent relying government
nothing he riders irresponsibility this


<h2>WordNet</h2>

In [104]:
from nltk.corpus import wordnet as wn

In [105]:
wn.synsets('dog')

[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01'),
 Synset('chase.v.01')]

In [106]:
print(wn.synset('dog.n.01').lemma_names())
print(wn.synset('dog.v.01').lemma_names())

['dog', 'domestic_dog', 'Canis_familiaris']
['chase', 'chase_after', 'trail', 'tail', 'tag', 'give_chase', 'dog', 'go_after', 'track']


In [107]:
print(wn.synset('dog.n.01').definition())
print(wn.synset('dog.v.01').definition())

a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
go after with the intent to catch


In [108]:
print(wn.synset('dog.n.01').examples())
print(wn.synset('dog.v.01').examples())

['the dog barked all night']
['The policeman chased the mugger down the alley', 'the dog chased the rabbit']


In [109]:
wn.synset('dog.n.01').hyponyms()

[Synset('basenji.n.01'),
 Synset('corgi.n.01'),
 Synset('cur.n.01'),
 Synset('dalmatian.n.02'),
 Synset('great_pyrenees.n.01'),
 Synset('griffon.n.02'),
 Synset('hunting_dog.n.01'),
 Synset('lapdog.n.01'),
 Synset('leonberg.n.01'),
 Synset('mexican_hairless.n.01'),
 Synset('newfoundland.n.01'),
 Synset('pooch.n.01'),
 Synset('poodle.n.01'),
 Synset('pug.n.01'),
 Synset('puppy.n.01'),
 Synset('spitz.n.01'),
 Synset('toy_dog.n.01'),
 Synset('working_dog.n.01')]

In [110]:
wn.synset('dog.n.01').hypernyms()

[Synset('canine.n.02'), Synset('domestic_animal.n.01')]

In [111]:
wn.synset('cat.n.01').hypernyms()

[Synset('feline.n.01')]

In [114]:
wn.synset('cat.n.01').hyponyms()

[Synset('domestic_cat.n.01'), Synset('wildcat.n.03')]

In [112]:
wn.synset('feline.n.01').hypernyms()

[Synset('carnivore.n.01')]

In [115]:
wn.synset('domestic_animal.n.01').hyponyms()

[Synset('dog.n.01'),
 Synset('domestic_cat.n.01'),
 Synset('feeder.n.01'),
 Synset('head.n.02'),
 Synset('stocker.n.01'),
 Synset('stray.n.01')]

In [116]:
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
d_cat = wn.synset('domestic_cat.n.01')
truck = wn.synset('truck.n.01')
print(dog.path_similarity(cat))
print(dog.path_similarity(d_cat))
print(dog.path_similarity(truck))

0.2
0.3333333333333333
0.07692307692307693
