Introduction to NLP course (2017-2018).

Notebook 4: Markov Models. Hidden Markov Models. Part-of-Speech Tagging.

by Venelin Kovatchev, University of Barcelona

In [1]:
# Import section

# Import nltk
import nltk
from nltk import bigrams, trigrams

# Import numpy
import numpy as np

# Import codecs
import codecs

# Import taggers
from nltk import DefaultTagger, AffixTagger, UnigramTagger, BigramTagger, TrigramTagger
from nltk import ClassifierBasedPOSTagger

# Import the brown corpus
from nltk.corpus import brown

In the first part of the notebook, we will "train" a very simple 
Bigram Markov Model on a dataset containing speeches by Donald 
Trump. We will use the "model" to generate pseudo text.

The speeches were taken from https://github.com/ryanmcdermott/trump-speeches

The text file is available on the virtual campus and in the GitHub repository.

In [2]:
# Read the corpus (the "with" block closes the file once it has been read)
with codecs.open('speeches.txt', 'r', 'utf8') as corpus_file:
    raw_corpus = corpus_file.read()

# Tokenize the corpus
corpus = nltk.word_tokenize(raw_corpus)

# Generate all bigrams in the corpus
trump_bigr = list(nltk.bigrams(corpus))

# Initialize the dummy markov model stats
markov_stats = {}

# Fill in all the counts
for word_1, word_2 in trump_bigr:
    # Check if there is an entry for the current word
    if word_1 in markov_stats:
        # If it is, append the second one
        markov_stats[word_1].append(word_2)
        
    # If it isn't, create it with the corresponding value
    else:
        markov_stats[word_1] = [word_2]

# Get the first word at random
first_word = np.random.choice(corpus)

# Check if the first word is lowercase, if it is, pick another one
while first_word.islower():
    first_word = np.random.choice(corpus)
    
# Initialize the new speech variable
new_speech = [first_word]

# Generate 10 more words after the first one (11 tokens in total)
for i in range(10):
    # Get the next word by randomly getting a word from the corresponding dictionary
    next_word = np.random.choice(markov_stats[new_speech[-1]])
    # Append the word
    new_speech.append(next_word)
    

    
# Join all the words into a string
speech_string = ' '.join(new_speech)
# Strip any surrounding whitespace
speech_string = speech_string.strip()

# Print the speech
print(u"A sentence that president Trump could have said is: {}".format(speech_string))

A sentence that president Trump could have said is: – six weeks after another after – and your case known
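
The dictionary of successor lists used above encodes the transition probabilities only implicitly: drawing uniformly from a list that contains repeated words is the same as sampling each successor in proportion to its bigram count. The sketch below makes those probabilities explicit. It is only an illustration: the function name `transition_probabilities` and the toy sentence are our own assumptions and are not part of the original notebook or of speeches.txt.

```python
from collections import Counter, defaultdict

def transition_probabilities(tokens):
    """Estimate P(next word | word) from bigram counts (relative frequencies)."""
    counts = defaultdict(Counter)
    # Count how often each word is followed by each other word
    for word_1, word_2 in zip(tokens, tokens[1:]):
        counts[word_1][word_2] += 1
    # Normalize the counts of each word so that they sum to 1
    probs = {}
    for word_1, followers in counts.items():
        total = float(sum(followers.values()))
        probs[word_1] = {word_2: n / total for word_2, n in followers.items()}
    return probs

# Toy corpus (hypothetical, not taken from speeches.txt)
toy_tokens = "we will win and we will build and we are great".split()
toy_probs = transition_probabilities(toy_tokens)

# "we" is followed by "will" twice and by "are" once
print(toy_probs["we"])
```

Note that the last token of the corpus ("great" in the toy example) gets no entry if it never appears earlier, which is also why the generation loop above could fail if it happened to reach such a word.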


In the second part of this notebook we will look at the different taggers built into NLTK.

We start by analyzing the content of the Brown corpus.

In [3]:
# Get a list of all categories of the brown corpus
brown.categories()

# Get the tokenized and tagged version of the "news" category
brown_twords = brown.tagged_words(categories='news')

# Get the sentence segmented, tokenized, and tagged version of the "news" category
brown_tsents = brown.tagged_sents(categories='news')

# Print the first 5 words
print("\nThe first 5 words in the tokenized and tagged version are: {}".format(brown_twords[:5]))
print("\nThe first 2 sentences in the sentence segmented, tokenized and tagged version are {}".format(brown_tsents[:2]))

# Get the set of all the tags in the brown corpus
brown_tags = set([tag for (token,tag) in brown_twords])
print("\nThe set of all original tags in the Brown corpus is: {}".format(brown_tags))

# Get the set of all the tags in the universal tagset
brown_utwords = brown.tagged_words(categories='news',tagset='universal')
universal_tags = set([tag for (token,tag) in brown_utwords])
print("\nThe set of universal tags is: {}".format(universal_tags))


The first 5 words in the tokenized and tagged version are: [(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL')]

The first 2 sentences in the sentence segmented, tokenized and tagged version are [[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')], [(u'The', u'AT'), (u'jury', u'NN'), (u'further', u'RBR'), (u'said', u'VBD'), (u'in', u'IN'), (u'term-end', u'NN'), (u'presentments', u'NNS'), (u'that', u'CS'), (u'the', u'AT'), (u'City', u'NN-TL'), (u'Executive', u'JJ-TL'), (u'Committee', u'NN-TL

We train and evaluate the built-in taggers:

- default
- affix
- unigram
- bigram
- trigram
- naive bayes


In [4]:
# Default tagger

# Get the sentence segmented and tokenized version of "news"
# This is the non-part of speech tagged version that we will be tagging
brown_sents = brown.sents(categories='news')

# Get a list of all tags in the corpus
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]

# Get the most frequent tag in the corpus
most_frequent_tag = nltk.FreqDist(tags).max()

# Configure a default tagger
# The default tagger assigns the same "default" tag to every token in the corpus
# We configure it to annotate with the most frequent tag
default_tagger = nltk.DefaultTagger(most_frequent_tag)

my_sent = "the quick brown fox jumped over the lazy dog".split()
print("The sentence tagged with default tagger: {}".format(default_tagger.tag(my_sent)))

# Tag the corpus using the default tagger
default_sents = default_tagger.tag_sents(brown_sents)
print("The accuracy of the default tagger on the corpus is: {}".format(round(default_tagger.evaluate(brown_tsents),2)))

The sentence tagged with default tagger: [('the', u'NN'), ('quick', u'NN'), ('brown', u'NN'), ('fox', u'NN'), ('jumped', u'NN'), ('over', u'NN'), ('the', u'NN'), ('lazy', u'NN'), ('dog', u'NN')]
The accuracy of the default tagger on the corpus is: 0.13
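
Since the default tagger assigns the same tag to every token, its accuracy is simply the share of tokens whose gold tag is the most frequent one; the 0.13 above is therefore the proportion of 'NN' tokens in the "news" category. A minimal sketch with plain Python and a toy tagged sample (the data below is hypothetical and not taken from the Brown corpus):

```python
from collections import Counter

# Toy gold-standard tagged tokens (hypothetical; not from the Brown corpus)
gold = [("the", "AT"), ("dog", "NN"), ("chased", "VBD"),
        ("the", "AT"), ("cat", "NN"), ("home", "NN")]

# Find the most frequent tag and how many tokens carry it
tag_counts = Counter(tag for _, tag in gold)
most_frequent_tag, count = tag_counts.most_common(1)[0]

# "Tag" every token with the most frequent tag, as a default tagger would
predicted = [(word, most_frequent_tag) for word, _ in gold]

# Accuracy = correctly tagged tokens / all tokens
correct = sum(1 for p, g in zip(predicted, gold) if p == g)
accuracy = correct / float(len(gold))

# The accuracy coincides with the relative frequency of the tag
share = count / float(len(gold))
print(most_frequent_tag, accuracy, share)
```

This is why the default tagger is mostly useful as a baseline or as the last step of a backoff chain, not as a tagger on its own.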


In [5]:
# For the rest of the taggers, we split the corpus into a test set (first 1000 sentences) and a training set (the rest)
test_corpus = brown_tsents[:1000]
train_corpus = brown_tsents[1000:]

# Train the affix tagger
affix_tagger = AffixTagger(train_corpus)

# Tag the corpus with the affix tagger
affix_sents = affix_tagger.tag_sents(brown_sents)

# Print the first sentence and accuracy
print("\nThe first sentence, tagged with affix tagger: {}".format(affix_sents[0]))
print("\nThe accuracy of the affix tagger on the corpus is: {}".format(round(affix_tagger.evaluate(test_corpus),2)))

# Train the unigram tagger
unigram_tagger = UnigramTagger(train_corpus) 

# Tag the corpus with the unigram tagger
uni_sents = unigram_tagger.tag_sents(brown_sents)

# Print the first sentence and accuracy
print("\nThe first sentence, tagged with unigram tagger: {}".format(uni_sents[0]))
print("\nThe accuracy of the unigram tagger on the corpus is: {}".format(round(unigram_tagger.evaluate(test_corpus),2)))

# Train the bigram tagger
bigram_tagger = BigramTagger(train_corpus)

# Tag the corpus with the bigram tagger
bi_sents = bigram_tagger.tag_sents(brown_sents)

# Print the first sentence and accuracy
print("\nThe first sentence, tagged with bigram tagger: {}".format(bi_sents[0]))
print("\nThe accuracy of the bigram tagger on the corpus is: {}".format(round(bigram_tagger.evaluate(test_corpus),2)))

# Train the trigram tagger
trigram_tagger = TrigramTagger(train_corpus)

# Tag the corpus with the trigram tagger
tri_sents = trigram_tagger.tag_sents(brown_sents)


# Print the first sentence and accuracy
print("\nThe first sentence, tagged with trigram tagger: {}".format(tri_sents[0]))
print("\nThe accuracy of the trigram tagger on the corpus is: {}".format(round(trigram_tagger.evaluate(test_corpus),2)))



The first sentence, tagged with affix tagger: [(u'The', None), (u'Fulton', u'NP'), (u'County', u'NN'), (u'Grand', u'NP'), (u'Jury', None), (u'said', None), (u'Friday', u'NR'), (u'an', None), (u'investigation', u'NN'), (u'of', None), (u"Atlanta's", u'NP$'), (u'recent', u'NN'), (u'primary', u'JJ'), (u'election', u'NN'), (u'produced', u'VBN'), (u'``', None), (u'no', None), (u'evidence', u'NN'), (u"''", None), (u'that', None), (u'any', None), (u'irregularities', u'NNS'), (u'took', None), (u'place', u'NN'), (u'.', None)]

The accuracy of the affix tagger on the corpus is: 0.26

The first sentence, tagged with unigram tagger: [(u'The', u'AT'), (u'Fulton', None), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'JJ'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''"

In [6]:
# Train the bigram tagger with backoff
bigram_tagger_backoff = BigramTagger(train_corpus,backoff=unigram_tagger)

# Tag the corpus with the bigram tagger with backoff
bi_sents_bo = bigram_tagger_backoff.tag_sents(brown_sents)


# Print the first sentence and accuracy
print("\nThe first sentence, tagged with bigram tagger with backoff: {}".format(bi_sents_bo[0]))
print("\nThe accuracy of the bigram tagger with backoff on the corpus is: {}".format(round(bigram_tagger_backoff.evaluate(test_corpus),2)))



The first sentence, tagged with bigram tagger with backoff: [(u'The', u'AT'), (u'Fulton', None), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'JJ'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'WPS'), (u'any', u'DTI'), (u'irregularities', None), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')]

The accuracy of the bigram tagger with backoff on the corpus is: 0.83
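
To see why backoff helps, it is useful to spell out the lookup logic. The sketch below is a simplified, pure-Python illustration of the idea and is not NLTK's actual implementation: the function names and the toy training data are our own assumptions. The bigram model looks up the pair (previous tag, word); when that context was never seen in training, the tagger falls back to the most frequent tag of the word alone.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def train_bigram(tagged_sents):
    """Map each (previous tag, word) context to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = None  # None marks the start of a sentence
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def tag_with_backoff(words, bigram_model, unigram_model):
    """Try the bigram context first; back off to the unigram model if unseen."""
    tagged = []
    prev_tag = None
    for word in words:
        tag = bigram_model.get((prev_tag, word))
        if tag is None:
            tag = unigram_model.get(word)  # still None for unknown words
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

# Toy training data (hypothetical; not from the Brown corpus)
toy_train = [
    [("the", "AT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "AT"), ("cat", "NN"), ("runs", "VBZ")],
    [("the", "AT"), ("runs", "NNS"), ("ended", "VBD")],
]
uni = train_unigram(toy_train)
bi = train_bigram(toy_train)

with_backoff = tag_with_backoff("the bird runs".split(), bi, uni)
without_backoff = tag_with_backoff("the bird runs".split(), bi, {})
print(with_backoff)
print(without_backoff)
```

In the toy example, the unknown word "bird" gets None in both cases, but without backoff the None also destroys the context for "runs", so every following token in the sentence tends to stay untagged. With backoff, "runs" is recovered from the unigram model. Words that appear in neither model, presumably like 'Fulton' and 'irregularities' in the output above, remain None.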


In [7]:
# Train a classifier-based tagger (NLTK uses a Naive Bayes classifier by default)
nb_tagger = ClassifierBasedPOSTagger(train=train_corpus)

# Evaluate the tagger
print("\nThe accuracy of the nb tagger on the corpus is: {}".format(round(nb_tagger.evaluate(test_corpus),2)))



The accuracy of the nb tagger on the corpus is: 0.9
