# Python for Linguists

Notebook 8: POS Tagging

Venelin Kovatchev

University of Barcelona 2020

In this notebook we will see how we can do Part of Speech Tagging using NLTK.

We will see different taggers and the importance of the choice of corpus.

We will also learn some more about functions and function variables.

In [None]:
# Import nltk
import nltk

In [None]:
# Download the important packages for today
nltk.download("brown")

In [None]:
# Import all resources needed for today
from nltk import bigrams, trigrams

# Import numpy
import numpy as np

# Import codecs
import codecs

# Import taggers
from nltk import DefaultTagger, AffixTagger, UnigramTagger, BigramTagger, TrigramTagger
from nltk import ClassifierBasedPOSTagger

# Import the brown corpus
from nltk.corpus import brown

In [None]:
# Advanced functions

# Before moving to POS tagging, let's first observe the behavior of a very simple function

var_1 = "Boo"
var_2 = "Hoo"
var_3 = "DooDoo"

def my_simple_function(atr_1, atr_2, atr_3 = "ignore me"):
    
    print("My atr_1 is:" + str(atr_1))
    print("My atr_2 is:" + str(atr_2))
    print("My atr_3 is:" + str(atr_3))



In [None]:
# Let's see how the function works
my_simple_function(var_1, var_2, var_3)

In [None]:
# What if we change the order?
# As you can see, the function assigns arguments to parameters based on their position
my_simple_function(var_3, var_1, var_2)

In [None]:
# What if we don't add a third variable?
my_simple_function(var_1, var_2)

In [None]:
# What if we also don't add a second variable?
# atr_2 has no default value, so this call raises a TypeError
my_simple_function(var_1)

In [None]:
# What if we explicitly tell the function what is what?
my_simple_function(atr_3 = var_1, atr_1 = var_2, atr_2 = var_3)
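
The calls above can be summarised in one small sketch. The helper `describe` is a hypothetical variant of `my_simple_function` that returns the bound values instead of printing them, so that the bindings can be inspected directly. It also shows a mixed call (one positional argument followed by keyword arguments), which the cells above do not cover.

```python
def describe(atr_1, atr_2, atr_3="ignore me"):
    """Hypothetical helper: return the value bound to each parameter."""
    return {"atr_1": atr_1, "atr_2": atr_2, "atr_3": atr_3}


# Positional call: values are bound left to right
positional = describe("Boo", "Hoo", "DooDoo")

# Omitting the third argument falls back to its default value
with_default = describe("Boo", "Hoo")

# Keyword call: the order no longer matters
by_keyword = describe(atr_3="Boo", atr_1="Hoo", atr_2="DooDoo")

# Mixed call: positional arguments first, keyword arguments after
mixed = describe("Boo", atr_3="DooDoo", atr_2="Hoo")

# Omitting a parameter without a default raises a TypeError
try:
    describe("Boo")
    missing_error = None
except TypeError as error:
    missing_error = str(error)

print(positional)
print(with_default)
print(by_keyword)
print(mixed)
print(missing_error)
```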

### Part of Speech tagging in Python

In this class we will see different ways to automatically tag a corpus with parts of speech.

We will focus on n-gram-based taggers.

In [None]:
# Get a list of all categories of the brown corpus
brown.categories()

In [None]:
# Get the tokenized and tagged version of the "news" category
brown_twords = brown.tagged_words(categories='news')

# Get the sentence segmented, tokenized, and tagged version of the "news" category
brown_tsents = brown.tagged_sents(categories='news')

In [None]:
# Print the first 5 words
print("\nThe first 5 words in the tokenized and tagged version are:")
print(brown_twords[:5])

In [None]:
# Print the first two sentences
print("\nThe first 2 sentences in the sentence segmented, tokenized and tagged version are:")
print(brown_tsents[:2])

In [None]:
# Get the set of all the tags in the brown corpus
brown_tags = set([tag for (token,tag) in brown_twords])
print("\nThe set of all original tags in the brown corpus is:")
print(brown_tags)

In [None]:
# Get the set of all the tags in the universal tagset
brown_utwords = brown.tagged_words(categories='news',tagset='universal')
universal_tags = set([tag for (token,tag) in brown_utwords])
print("\nThe set of universal tags is:")
print(universal_tags)
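
To get a feeling for what a coarser tagset does, here is a minimal sketch that collapses a few fine-grained Brown tags into universal tags. The dictionary `BROWN_TO_UNIVERSAL` is a hand-written, partial mapping made up for this illustration; it is not the mapping file that NLTK uses, and the helper `to_universal` is hypothetical.

```python
# Hand-written, partial mapping (illustration only, not NLTK's own mapping)
BROWN_TO_UNIVERSAL = {
    "AT": "DET",    # article
    "NN": "NOUN",   # singular noun
    "NNS": "NOUN",  # plural noun
    "JJ": "ADJ",    # adjective
    "VBD": "VERB",  # past tense verb
    "IN": "ADP",    # preposition
}


def to_universal(tagged_words, mapping, unknown="X"):
    """Replace each fine-grained tag with its coarse tag ('X' if unmapped)."""
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_words]


toy_tagged = [("the", "AT"), ("quick", "JJ"), ("fox", "NN"),
              ("jumped", "VBD"), ("over", "IN"), ("dogs", "NNS")]
toy_universal = to_universal(toy_tagged, BROWN_TO_UNIVERSAL)

# Compare the number of distinct tags before and after the mapping
fine_tags = {tag for _, tag in toy_tagged}
coarse_tags = {tag for _, tag in toy_universal}

print(toy_universal)
print(len(fine_tags), len(coarse_tags))
```

Singular and plural nouns end up with the same tag, so the coarse tagset has fewer categories for a tagger to choose between.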

We will train and evaluate several different taggers:

- default
- affix
- unigram
- bigram
- trigram

In [None]:
# Default tagger

# Get the sentence segmented and tokenized version of "news"
# This is the non-part of speech tagged version that we will be tagging
brown_sents = brown.sents(categories='news')

# Get a list of all tags in the corpus
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]

# Get the most frequent tag in the corpus
most_frequent_tag = nltk.FreqDist(tags).max()

# Print the most frequent tag:
print(most_frequent_tag)
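
The same idea can be sketched without NLTK, using `collections.Counter` from the standard library. The list `toy_tagged_words` is made-up data for illustration, not taken from the Brown corpus.

```python
from collections import Counter

# Made-up tagged words (not from the Brown corpus)
toy_tagged_words = [("the", "AT"), ("dog", "NN"), ("food", "NN"),
                    ("pleased", "VBD"), ("the", "AT"), ("cat", "NN")]

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in toy_tagged_words)

# most_common(1) returns a list with the single most frequent (tag, count) pair
toy_most_frequent_tag, toy_count = tag_counts.most_common(1)[0]

print(tag_counts)
print(toy_most_frequent_tag, toy_count)
```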

In [None]:
# Configure a default tagger
# The default tagger assigns the same "default" tag to every token in the corpus
# We configure it to annotate with the most frequent tag
default_tagger = nltk.DefaultTagger(most_frequent_tag)

my_sent = "the quick brown fox jumped over the lazy dog".split()
print("The sentence tagged with default tagger:")
print(default_tagger.tag(my_sent))

In [None]:
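# Evaluate the default tagger on the corpus:
# (recent NLTK releases deprecate evaluate() in favour of accuracy(); evaluate() is kept here as in the original course material)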
print("The accuracy of the default tagger on the corpus is:")
print(round(default_tagger.evaluate(brown_tsents),2))
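
The accuracy reported above is a token-level score: the share of tokens whose predicted tag equals the gold tag. A minimal sketch of that computation, assuming this token-level definition, is shown below. The names `tagging_accuracy`, `always_nn` and `toy_gold` are hypothetical and the data is made up.

```python
def tagging_accuracy(tag_sentence, gold_sents):
    """Share of tokens whose predicted tag matches the gold tag."""
    correct = 0
    total = 0
    for gold_sent in gold_sents:
        # Strip the gold tags, then let the tagger predict new ones
        words = [word for word, _ in gold_sent]
        predicted_sent = tag_sentence(words)
        for (_, predicted_tag), (_, gold_tag) in zip(predicted_sent, gold_sent):
            if predicted_tag == gold_tag:
                correct += 1
            total += 1
    return correct / total if total else 0.0


def always_nn(words):
    """Toy default tagger: assign 'NN' to every word."""
    return [(word, "NN") for word in words]


# Made-up gold standard: 5 tokens, 2 of them are tagged NN
toy_gold = [
    [("the", "AT"), ("dog", "NN"), ("barked", "VBD")],
    [("the", "AT"), ("cat", "NN")],
]

toy_accuracy = tagging_accuracy(always_nn, toy_gold)
print(round(toy_accuracy, 2))
```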


In [None]:
# For the rest of the taggers, we will split the corpus into a test set (the first 1000 sentences) and a training set (the rest)
test_corpus = brown_tsents[:1000]
train_corpus = brown_tsents[1000:]


In [None]:
# Train the affix tagger
affix_tagger = AffixTagger(train_corpus)

# Tag the corpus with the affix tagger
affix_sents = affix_tagger.tag_sents(brown_sents)

# Print the first sentence and accuracy
print("\nThe first sentence, tagged with affix tagger:")
print(affix_sents[0])
print("\nThe accuracy of the affix tagger on the corpus is:")
print(round(affix_tagger.evaluate(test_corpus),2))
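
Conceptually, an affix tagger learns the most frequent tag for each word ending and ignores words that are too short. The pure-Python sketch below illustrates this under stated assumptions: a three-character suffix and a minimum stem of two characters, which we assume mirror NLTK's default settings. The function names and the training data are made up, and this is not NLTK's implementation.

```python
from collections import Counter, defaultdict


def train_suffix_model(tagged_sents, suffix_length=3, min_stem_length=2):
    """Map each suffix to the tag it occurs with most often."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            # Skip words that are too short to have a stem plus a suffix
            if len(word) >= suffix_length + min_stem_length:
                counts[word[-suffix_length:]][tag] += 1
    return {suffix: c.most_common(1)[0][0] for suffix, c in counts.items()}


def tag_by_suffix(words, model, suffix_length=3, min_stem_length=2):
    """Tag each word by its suffix; None if too short or suffix unseen."""
    tagged = []
    for word in words:
        if len(word) >= suffix_length + min_stem_length:
            tagged.append((word, model.get(word[-suffix_length:])))
        else:
            tagged.append((word, None))
    return tagged


# Made-up training data
toy_train = [[("walking", "VBG"), ("talking", "VBG"), ("jumped", "VBD"),
              ("walked", "VBD"), ("king", "NN")]]

suffix_model = train_suffix_model(toy_train)
suffix_tagged = tag_by_suffix(["singing", "asked", "dog"], suffix_model)
print(suffix_tagged)
```

Note that "king" is too short to contribute to the model, so the suffix "ing" is learned only from the two verbs.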

In [None]:
# Train the unigram tagger
unigram_tagger = UnigramTagger(train_corpus) 

# Tag the corpus with the unigram tagger
uni_sents = unigram_tagger.tag_sents(brown_sents)

# Print the first sentence and accuracy
print("\nThe first sentence, tagged with unigram tagger:")
print(uni_sents[0])
print("\nThe accuracy of the unigram tagger on the corpus is:")
print(round(unigram_tagger.evaluate(test_corpus),2))
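
A unigram tagger can be thought of as a lookup table from each word to the tag it received most often in the training data. The sketch below illustrates the idea in pure Python with made-up data. The names are hypothetical and this is not NLTK's implementation.

```python
from collections import Counter, defaultdict


def train_unigram_model(tagged_sents):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


# Made-up training data: "run" is a noun once and a verb twice
toy_train = [
    [("the", "AT"), ("run", "NN"), ("was", "BEDZ"), ("long", "JJ")],
    [("they", "PPSS"), ("run", "VB"), ("fast", "RB")],
    [("we", "PPSS"), ("run", "VB"), ("home", "NR")],
]

unigram_model = train_unigram_model(toy_train)

# Words that never occurred in training get None
unigram_tagged = [(word, unigram_model.get(word)) for word in ["the", "run", "ended"]]
print(unigram_tagged)
```

The tagger always picks "VB" for "run", even after "the", because it never looks at the context. Unseen words such as "ended" receive no tag at all.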

In [None]:
# Train the bigram tagger
bigram_tagger = BigramTagger(train_corpus)

# Tag the corpus with the bigram tagger
bi_sents = bigram_tagger.tag_sents(brown_sents)

# Print the first sentence and accuracy
print("\nThe first sentence, tagged with bigram tagger:")
print(bi_sents[0])
print("\nThe accuracy of the bigram tagger on the corpus is:")
print(round(bigram_tagger.evaluate(test_corpus),2))
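
A bigram tagger conditions on the previous tag as well as the current word, so it can only tag a word if it saw that exact (previous tag, word) pair during training. The pure-Python sketch below, with made-up data and hypothetical names, illustrates why a single unseen word also affects the words that follow it. It is a simplification, not NLTK's implementation.

```python
from collections import Counter, defaultdict

START = "<s>"  # assumed marker for "no previous tag" at the start of a sentence


def train_bigram_model(tagged_sents):
    """Map each (previous tag, word) context to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = START
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return {context: c.most_common(1)[0][0] for context, c in counts.items()}


def tag_with_bigrams(words, model):
    """Tag left to right, using the previously predicted tag as context."""
    tagged = []
    prev_tag = START
    for word in words:
        tag = model.get((prev_tag, word))  # None if the context is unseen
        tagged.append((word, tag))
        prev_tag = tag
    return tagged


# Made-up training data: a single sentence
toy_train = [[("the", "AT"), ("dog", "NN"), ("barked", "VBD"), ("loudly", "RB")]]
bigram_model = train_bigram_model(toy_train)

seen = tag_with_bigrams(["the", "dog", "barked", "loudly"], bigram_model)
unseen = tag_with_bigrams(["the", "cat", "barked", "loudly"], bigram_model)
print(seen)
print(unseen)
```

In the second sentence "cat" is unknown, so it gets None. The contexts for "barked" and "loudly" then start with None, which was never seen in training, so they get None as well.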

In [None]:
# Train the trigram tagger
trigram_tagger = TrigramTagger(train_corpus)

# Tag the corpus with the trigram tagger
tri_sents = trigram_tagger.tag_sents(brown_sents)


# Print the first sentence and accuracy
print("\nThe first sentence, tagged with trigram tagger:")
print(tri_sents[0])
print("\nThe accuracy of the trigram tagger on the corpus is:")
print(round(trigram_tagger.evaluate(test_corpus),2))

In [None]:
# Train the bigram tagger with backoff
bigram_tagger_backoff = BigramTagger(train_corpus,backoff=unigram_tagger)

# Tag the corpus with the bigram tagger with backoff
bi_sents_bo = bigram_tagger_backoff.tag_sents(brown_sents)


# Print the first sentence and accuracy
print("\nThe first sentence, tagged with bigram tagger with backoff:")
print(bi_sents_bo[0])
print("\nThe accuracy of the bigram tagger with backoff on the corpus is:")
print(round(bigram_tagger_backoff.evaluate(test_corpus),2))
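
Backoff means: ask the most specific tagger first, and only if it has no answer, ask a more general one. The sketch below illustrates the lookup order in pure Python. The two models are small hand-written dictionaries, the final default tag is an extra step that the cell above does not use, and `make_backoff_tagger` is a hypothetical helper rather than NLTK's implementation.

```python
START = "<s>"  # assumed marker for "no previous tag" at the start of a sentence

# Hand-written toy models (illustration only)
toy_bigram_model = {(START, "the"): "AT", ("AT", "dog"): "NN", ("NN", "barked"): "VBD"}
toy_unigram_model = {"the": "AT", "dog": "NN", "cat": "NN", "barked": "VBD"}


def make_backoff_tagger(bigram_model, unigram_model, default_tag):
    """Build a tagger that tries bigram, then unigram, then a default tag."""
    def tag_sentence(words):
        tagged = []
        prev_tag = START
        for word in words:
            # 1. Most specific: previous tag plus current word
            tag = bigram_model.get((prev_tag, word))
            # 2. Back off to the current word alone
            if tag is None:
                tag = unigram_model.get(word)
            # 3. Last resort: the default tag
            if tag is None:
                tag = default_tag
            tagged.append((word, tag))
            prev_tag = tag
        return tagged
    return tag_sentence


backoff_tagger = make_backoff_tagger(toy_bigram_model, toy_unigram_model, "NN")
backoff_tagged = backoff_tagger(["the", "cat", "barked", "twice"])
print(backoff_tagged)
```

Here "cat" is resolved by the unigram model, which keeps the previous tag known, so the bigram model can tag "barked" again. Only "twice", which neither model knows, falls through to the default tag.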

In [None]:
# Task 1
# Train and test on the same corpus instead of using different ones
# What is the result?



In [None]:
# Task 2
# Create a sequence of backoff taggers:
# trigram -> bigram -> unigram -> affix -> default
# Evaluate the resulting tagger

In [None]:
# Task 3
# Try cross-domain tagging:
# Train the tagger on "news" and evaluate it on "science_fiction"
# Compare the results with a tagger trained and evaluated on the same domain (science_fiction)

In [None]:
# Task 4
# Importance of tagset
# Rerun tasks 2 and 3 using the "universal" tagset of the corpus
# Compare the performance on the "universal" tagset and the full tagset