This notebook shows how to retrain the NLTK backoff tagger.
- You'll see an example in which some recipe text has some errors in tagging, most likely because the training data did not have many examples of the target sentence structure.  
- Next, you'll see the effect on the tagger's accuracy of adding a few sentences of training data with the missing sentence structure.
- Your assignment is to do something similar on your adopted text.


In [1]:
import nltk, re
from nltk.corpus import brown
from nltk import word_tokenize
import random

Define functions for training and evaluating a backoff tagger.

In [2]:
def create_data_sets(sentences):
    size = int(len(sentences) * 0.9)
    train_sents = sentences[:size]
    test_sents = sentences[size:]
    return train_sents, test_sents

def build_backoff_tagger (train_sents):
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)
    return t2

def train_tagger(already_tagged_sents):
    train_sents, test_sents = create_data_sets(already_tagged_sents)
    ngram_tagger = build_backoff_tagger(train_sents)
    print ("%0.3f pos accuracy on test set" % ngram_tagger.evaluate(test_sents))
    return ngram_tagger
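
To make the backoff idea concrete, here is a minimal pure-Python sketch of how a chain like the one in `build_backoff_tagger` resolves a tag: try the bigram context first, then the single word, then the default `'NN'`. The helpers `train_tables` and `backoff_tag` are hypothetical names for illustration, and the lookup is a simplification, not NLTK's actual implementation.

```python
from collections import Counter, defaultdict

def most_frequent(counts):
    """Map each key to the tag seen most often with it."""
    return {key: tag_counts.most_common(1)[0][0] for key, tag_counts in counts.items()}

def train_tables(tagged_sents):
    """Build a unigram table (word -> tag) and a bigram table ((previous tag, word) -> tag)."""
    unigram_counts = defaultdict(Counter)
    bigram_counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = None  # None marks the start of a sentence
        for word, tag in sent:
            unigram_counts[word][tag] += 1
            bigram_counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return most_frequent(unigram_counts), most_frequent(bigram_counts)

def backoff_tag(words, unigram, bigram, default="NN"):
    """Tag each word with the first answer from: bigram table, unigram table, default tag."""
    tagged = []
    prev_tag = None
    for word in words:
        tag = bigram.get((prev_tag, word)) or unigram.get(word) or default
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

# Toy training data (an assumption for illustration, not drawn from Brown)
toy_train = [[("You", "PPSS"), ("'ll", "MD"), ("learn", "VB"), (".", ".")]]
unigram, bigram = train_tables(toy_train)

print(backoff_tag(["There", "'ll", "be", "rain", "."], unigram, bigram))
# [('There', 'NN'), ("'ll", 'MD'), ('be', 'NN'), ('rain', 'NN'), ('.', '.')]
```

Words the tables have never seen fall all the way through to `'NN'`, which is why unknown tokens so often come out tagged as nouns.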

Make a specialized function for training a tagger on the Brown corpus.

In [3]:
def train_tagger_on_brown():
    brown_tagged_sents = brown.tagged_sents(categories=['adventure', 'belles_lettres', 'editorial', 
                                                        'fiction', 'government', 'hobbies', 'humor', 
                                                        'learned', 'lore', 'mystery', 'religion', 
                                                        'reviews', 'romance','science_fiction'])
    return train_tagger(brown_tagged_sents)

Functions for reading a text file into a string and splitting it into tokenized sentences, so the tagger can operate on it.

In [4]:
def tokenize_text(corpus):
    sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    raw_sents = sent_tokenizer.tokenize(corpus) # Split text into sentences    
    return [nltk.word_tokenize(sent) for sent in raw_sents] # Split each sentence into word tokens

def create_corpus(f):
    with open(f, 'r') as text_file:
        new_corpus = text_file.read()
    return new_corpus

Now train and evaluate an n-gram backoff tagger, using the Brown corpus as both the training and the test set.  (This takes a few moments to complete.)

In [5]:
brown_tagger = train_tagger_on_brown()

0.911 pos accuracy on test set
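
The number printed above is a token-level accuracy: the share of test-set tokens whose predicted tag matches the hand-assigned one. Here is a minimal sketch of that calculation on toy data. The function `token_accuracy` is a hypothetical helper written for illustration, not NLTK's own code.

```python
def token_accuracy(gold_sents, predicted_sents):
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sents, predicted_sents):
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total if total else 0.0

# Toy example: one of four tokens is mistagged
gold = [[("I", "PPSS"), ("'m", "BEM"), ("busy", "JJ"), (".", ".")]]
predicted = [[("I", "PPSS"), ("'m", "NN"), ("busy", "JJ"), (".", ".")]]
print("%0.3f" % token_accuracy(gold, predicted))  # 0.750
```

Because every token counts equally, a systematic error on one rare token barely moves this number, even when it is wrong every time.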


Next, read in a file of recipes and tokenize it.

In [6]:
# cookbook_file = './cookbooks.txt'
# cookbook_sents = tokenize_text(create_corpus(cookbook_file))

novel_file = '../1984.rtf'
content = create_corpus(novel_file)
content = content.replace("\\par", "")  # Remove RTF paragraph control words
content = content.replace("&", "and")
content = content[171:]  # Drop the first 171 characters (presumably the RTF header)
novel_sents = tokenize_text(content)

In the cooking recipe collection, imperative sentences (sentences that begin with a verb) are always mistagged.  The POS tagger marks the initial verb as NN instead of VB.  (There may be other kinds of errors too, but we are only looking at imperative sentences here.) To see the sentences where the errors are occurring, the code below finds sentences that begin with imperatives, tags them with the tagger, and returns them in a list. 

> YC: In the 1984 text, contractions such as "I'm" and "you'll" are mistagged. The POS tagger marks "'m" and "'ll" as NN instead of BEM and MD. Additionally, novel-specific terms such as "Thought Police" are mistagged: "Thought" is tagged VBD instead of NN.  

> The code below defines two functions:
> 1. get_tag_novel: returns the sentences tagged with the supplied tagger. It takes an optional single-token string; if a token is provided, only sentences containing that token are returned.
> 2. select_random_sent: selects and prints n randomly chosen sentences from the list of tagged sentences, for better readability.
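
A likely cause of the contraction errors is a tokenization mismatch: `word_tokenize` splits "I'm" into "I" and "'m", while Brown, as far as I can tell, keeps such contractions as single tokens with compound tags. A token the tagger has never seen falls through to the default `'NN'`. The sketch below lists tokens that never occur in the training data. `unseen_tokens` is a hypothetical helper and the data is a toy stand-in.

```python
from collections import Counter

def unseen_tokens(tagged_train_sents, tokenized_sents):
    """Count tokens in tokenized_sents that never appear in the tagged training sentences."""
    vocabulary = {word for sent in tagged_train_sents for word, _ in sent}
    return Counter(token for sent in tokenized_sents
                   for token in sent if token not in vocabulary)

# Toy stand-ins for the training corpus and the tokenized novel
toy_train = [[("I'm", "PPSS+BEM"), ("busy", "JJ"), (".", ".")],
             [("I", "PPSS"), ("know", "VB"), (".", ".")]]
toy_novel = [["I", "'m", "busy", "."], ["I", "'m", "sorry", "."]]

print(unseen_tokens(toy_train, toy_novel).most_common())
# [("'m", 2), ('sorry', 1)]
```

On the real data, the same idea could be applied to the Brown training sentences and `novel_sents` to find the most frequent tokens worth adding hand-tagged examples for.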

In [7]:
# def get_cookbook_imperatives(sents, tagger):
#     cooking_commands = ["Wash", "Stir", "Moisten", "Drain", "Cook", "Pour", "Chop", 
#                         "Slice", "Season", "Mix", "Fry", "Bake", "Roast", "Wisk"]        
#     return [tagger.tag(sent) for sent in sents if sent[0] in cooking_commands]  

def get_tag_novel(sents, tagger, identifier=None):
    # Tag every sentence, or only the sentences containing the identifier token if one is given
    if identifier is None:
        return [tagger.tag(sent) for sent in sents]
    return [tagger.tag(sent) for sent in sents if identifier in sent]

def select_random_sent(tagged_sents, n):
    num_sents = len(tagged_sents)
    num_sim = min(n, num_sents)  # Never ask for more sentences than are available
    random_idx = random.sample(range(num_sents), num_sim)
    selected_sents = [tagged_sents[i] for i in random_idx]
    for i in range(num_sim):
        print("sentence #:", random_idx[i])
        print("tags: ", selected_sents[i], end = " ")
        print("\n\n")

Let's look at those sentences.

In [8]:
# imperatives = get_cookbook_imperatives(cookbook_sents, brown_tagger)
# imperatives[:5]

print("========== Contractions ==========")
print("Example 1: 'm in I'm incorrectly tagged. Examples are shown below: ")
novel_tags_problematic = get_tag_novel(novel_sents, brown_tagger, "'m")
select_random_sent(novel_tags_problematic, 3)
print()

print("========== Contractions ==========")
print("Example 2: 'll in you'll incorrectly tagged. Examples are shown below: ")
novel_tags_problematic = get_tag_novel(novel_sents, brown_tagger, "'ll")
select_random_sent(novel_tags_problematic, 3)
print()

print("========== Past Tense as Noun ==========")
print("Thought in Thought Police incorrectly tagged. Examples are shown below: ")
novel_tags_problematic = get_tag_novel(novel_sents, brown_tagger, "Thought")
select_random_sent(novel_tags_problematic, 3)
print()

Example 1: 'm in I'm incorrectly tagged. Examples are shown below: 
sentence #: 19
tags:  [('On', 'IN'), ('the', 'AT'), ('whole', 'JJ'), ('I', 'PPSS'), ("'m", 'NN'), ('sorry', 'JJ'), ('I', 'PPSS'), ('did', 'DOD'), ("n't", 'NN'), ('.', '.'), ("'", "'")] 


sentence #: 1
tags:  [('I', 'PPSS'), ("'m", 'NN'), ('too', 'QL'), ('busy', 'JJ'), ('to', 'TO'), ('take', 'VB'), ('them', 'PPO'), ('.', '.')] 


sentence #: 24
tags:  [('In', 'IN'), ('this', 'DT'), ('room', 'NN'), ('I', 'PPSS'), ("'m", 'NN'), ('going', 'VBG'), ('to', 'TO'), ('be', 'BE'), ('a', 'AT'), ('woman', 'NN'), (',', ','), ('not', '*'), ('a', 'AT'), ('Party', 'NN-TL'), ('comrade', 'NN'), ('.', '.'), ("'", "'")] 



Example 2: 'll in you'll incorrectly tagged. Examples are shown below: 
sentence #: 1
tags:  [('but', 'CC'), ('in', 'IN'), ('the', 'AT'), ('final', 'JJ'), ('version', 'NN'), ('of', 'IN'), ('Newspeak', 'NN'), ('there', 'EX'), ("'ll", 'NN'), ('be', 'BE'), ('nothing', 'PN'), ('else', 'RB'), ('.', '.')] 


sentence #: 27
t

Notice that most of the initial words are incorrectly tagged as nouns rather than verbs.  How can we fix this?  One way is to label a few rather generic sentences with the structure we are interested in, add them to the start of the training data, and then retrain the tagger.
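
The position matters because `create_data_sets` takes the first 90% of the list for training and the last 10% for testing. Sentences added at the front are guaranteed to land in the training split, while sentences added at the end would be used only for evaluation. Here is a small sketch with placeholder data. `split_by_position` mirrors `create_data_sets`, and the corpus contents are made up for illustration.

```python
def split_by_position(sentences, train_fraction=0.9):
    """Split a list into a leading training part and a trailing test part."""
    size = int(len(sentences) * train_fraction)
    return sentences[:size], sentences[size:]

hand_tagged = [[("Stir", "VB"), ("the", "AT"), ("mixture", "NN"), (".", ".")]]
corpus = [[("placeholder", "NN")] for _ in range(9)]  # stand-in for the Brown sentences

front_train, front_test = split_by_position(hand_tagged + corpus)
back_train, back_test = split_by_position(corpus + hand_tagged)

print(hand_tagged[0] in front_train)  # True: prepended sentences are trained on
print(hand_tagged[0] in back_train)   # False: appended sentences end up in the test split
```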

In [9]:
# Keeping this for better continuity
def train_tagger_on_brown_augmented_with_cooking_sents():

    # Object pronouns ('it', 'them') take the Brown tag PPO
    cooking_action_sents = [[('Strain', 'VB'), ('it', 'PPO'), ('well', 'RB'), ('.', '.')],
                        [('Mix', 'VB'), ('them', 'PPO'), ('well', 'RB'), ('.', '.')],
                        [('Season', 'VB'), ('them', 'PPO'), ('with', 'IN'), ('pepper', 'NN'), ('.', '.')],
                        [('Wash', 'VB'), ('it', 'PPO'), ('well', 'RB'), ('.', '.')],
                        [('Chop', 'VB'), ('the', 'AT'), ('greens', 'NNS'), ('.', '.')],
                        [('Slice', 'VB'), ('it', 'PPO'), ('well', 'RB'), ('.', '.')],
                        [('Bake', 'VB'), ('the', 'AT'), ('cake', 'NN'), ('.', '.')],
                        [('Pour', 'VB'), ('into', 'IN'), ('a', 'AT'), ('mold', 'NN'), ('.', '.')],
                        [('Stir', 'VB'), ('the', 'AT'), ('mixture', 'NN'), ('.', '.')],
                        [('Moisten', 'VB'), ('the', 'AT'), ('grains', 'NNS'), ('.', '.')],
                        [('Cook', 'VB'), ('the', 'AT'), ('duck', 'NN'), ('.', '.')],
                        [('Drain', 'VB'), ('for', 'IN'), ('one', 'CD'), ('day', 'NN'), ('.', '.')]]


    brown_tagged_sents = brown.tagged_sents(categories=['adventure', 'belles_lettres', 'editorial', 
                                                        'fiction', 'government', 'hobbies', 'humor', 
                                                        'learned', 'lore', 'mystery', 'religion', 
                                                        'reviews', 'romance', 'science_fiction'])
    
    # Prepend the hand-tagged cooking sentences so they fall in the training split
    all_tagged_sents = cooking_action_sents + brown_tagged_sents
    return train_tagger(all_tagged_sents)

> YC: Following the same method as above, retrain the model with example sentences that have the desired structures and tags.

In [10]:
def train_tagger_on_brown_augmented_with_novel_sents():

    contraction_sents = [[('I', 'PPSS'), ("'m", 'BEM'), ('well', 'RB'), ('.', '.')],
                        [('I', 'PPSS'), ("'m", 'BEM'), ('not', '*'), ('very', 'QL'), ('sure', 'JJ'), ('.', '.')],
                        [('I', 'PPSS'), ("'m", 'BEM'), ('interested', 'VBN'), ('.', '.')],
                        [('I', 'PPSS'), ("'m", 'BEM'), ('a', 'AT'), ('person', 'NN'), ('.', '.')],
                        [('You', 'PPSS'), ("'ll", 'MD'), ('learn', 'VB'), ('.', '.')], 
                        [('You', 'PPSS'), ("'ll", 'MD'), ('give', 'VB'), ('me', 'PPO'), ('a', 'AT'), ('chance', 'NN')],
                        [('She', 'PPS'), ("'ll", 'MD'), ('never', 'RB'), ('do', 'VB'), ('that', 'DT'), ('.', '.')]]
    
    thought_as_noun = [[('Who', 'WPS'), ("is", 'BEZ'), ('the', 'AT'), ('Thought', 'NN'), ('Police', 'NN')],
                       [('I', 'PPSS'), ("avoid", 'VB'), ('the', 'AT'), ('Thought', 'NN'), ('Police', 'NN')],
                       [('The', 'AT'), ("Thought", 'NN'), ('Police', 'NN'), ('is', 'BEZ'), ('everywhere', 'RB')]]
                        

    brown_tagged_sents = brown.tagged_sents(categories=['adventure', 'belles_lettres', 'editorial', 
                                                        'fiction', 'government', 'hobbies', 'humor', 
                                                        'learned', 'lore', 'mystery', 'religion', 
                                                        'reviews', 'romance', 'science_fiction'])
    
    # Prepend the hand-tagged novel sentences so they fall in the training split
    all_tagged_sents = contraction_sents + thought_as_noun + brown_tagged_sents
    return train_tagger(all_tagged_sents)

Let's retrain the tagger.

In [11]:
# brown_and_cooking_tagger = train_tagger_on_brown_augmented_with_cooking_sents()

brown_and_novel_tagger = train_tagger_on_brown_augmented_with_novel_sents()

0.911 pos accuracy on test set


How well is this working on the cookbook imperatives now? Is more training data needed to change the behavior of the tagger? 

> YC: The cookbook imperatives are improved. So are the contractions and special nouns for the novel 1984!

In [12]:
# better_imperatives = get_cookbook_imperatives(cookbook_sents, brown_and_cooking_tagger)
# better_imperatives[:5]

print("========== Contractions ==========")
print("Example 1: 'm in I'm incorrectly tagged. Corrected by improved tagger: ")
novel_tags_better = get_tag_novel(novel_sents, brown_and_novel_tagger, "'m")
select_random_sent(novel_tags_better, 3)
print()

print("========== Contractions ==========")
print("Example 2: 'll in you'll incorrectly tagged. Corrected by improved tagger: ")
novel_tags_better = get_tag_novel(novel_sents, brown_and_novel_tagger, "'ll")
select_random_sent(novel_tags_better, 3)
print()

print("========== Past Tense as Noun ==========")
print("Thought in Thought Police incorrectly tagged. Corrected by improved tagger: ")
novel_tags_better = get_tag_novel(novel_sents, brown_and_novel_tagger, "Thought")
select_random_sent(novel_tags_better, 3)

Example 1: 'm in I'm incorrectly tagged. Corrected by improved tagger: 
sentence #: 6
tags:  [("'and", 'NN'), ('was', 'BEDZ'), ('it', 'PPS'), ('usual', 'JJ'), ('-', 'IN'), ('I', 'PPSS-NC'), ("'m", 'BEM'), ('only', 'RB'), ('quoting', 'VBG'), ('what', 'WDT'), ('I', 'PPSS'), ("'ve", 'NN'), ('read', 'VBN'), ('in', 'IN'), ('history', 'NN'), ('books', 'NNS'), ('-', 'IN'), ('was', 'BEDZ'), ('it', 'PPS'), ('usual', 'JJ'), ('for', 'IN'), ('these', 'DTS'), ('people', 'NNS'), ('and', 'CC'), ('their', 'PP$'), ('servants', 'NNS'), ('to', 'TO'), ('push', 'VB'), ('you', 'PPO'), ('off', 'RP'), ('the', 'AT'), ('pavement', 'NN'), ('into', 'IN'), ('the', 'AT'), ('gutter', 'NN'), ('?', '.'), ("'", "'")] 


sentence #: 29
tags:  [("'I", 'NN'), ("'m", 'BEM'), ('only', 'RB'), ('an', 'AT'), ('amateur', 'NN'), ('.', '.')] 


sentence #: 39
tags:  [('``', '``'), ('Thank', 'VB'), ('you', 'PPO'), (',', ','), ("''", "''"), ('I', 'PPSS'), ("'m", 'BEM'), ('going', 'VBG'), ('to', 'TO'), ('say', 'VB'), (',', ','), ('`

It worked quite well.  It would be worth experimenting to see if it would still work if I'd supplied fewer of the cooking verbs.

> YC: I noticed that even though only examples of "she'll" and "you'll" are given, the tagger handles "they'll" and "there'll" equally well.
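
Three random sentences are a small sample, so an observation like this can be checked across the whole text by counting which tags a given token receives. A minimal sketch follows. `tag_distribution` is a hypothetical helper, and the tagged sentences are toy data in the style of the output above.

```python
from collections import Counter

def tag_distribution(tagged_sents, token):
    """Count the tags assigned to every occurrence of token."""
    return Counter(tag for sent in tagged_sents
                   for word, tag in sent if word == token)

# Toy tagged output (an assumption for illustration)
toy_tagged = [
    [("They", "PPSS"), ("'ll", "MD"), ("come", "VB"), (".", ".")],
    [("There", "EX"), ("'ll", "MD"), ("be", "BE"), ("time", "NN"), (".", ".")],
    [("I", "PPSS"), ("'ve", "NN"), ("read", "VBN"), (".", ".")],
]

print(tag_distribution(toy_tagged, "'ll"))  # Counter({'MD': 2})
print(tag_distribution(toy_tagged, "'ve"))  # Counter({'NN': 1})
```

Run on the output of `get_tag_novel` for each tagger, this kind of count would also surface contractions that were not covered by the hand-tagged sentences, such as the "'ve" still tagged NN in the output above.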

## Assignment: ##

Rewrite this notebook to do the following:
- Tag your adopted text with an NLTK backoff tagger
- Identify a common type of error that is amenable to fixing by making a pattern of training data, similar to what we see with the recipe examples.  You'll want to focus on a particular pattern so that making a few tweaks will have an impact on the results of training.
- Show the before and after effects on the output of the tagger.  Ideally you'll see the errors get fixed not just on the specific examples you fixed, but on similar examples with different words.  In the case of recipes, imperative verbs beyond those in the hardcoded list would be fixed because the tagger would recognize the pattern that verbs can occur at the start of the sentence.
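
One way to show the before and after effects is to tag the same sentences with both taggers and list only the tokens whose tags changed. The sketch below assumes both lists cover the same sentences in the same order. `tagging_changes` is a hypothetical helper, and the data is a toy example.

```python
def tagging_changes(before_sents, after_sents):
    """Return (sentence index, token, old tag, new tag) for every tag that differs."""
    changes = []
    for idx, (before, after) in enumerate(zip(before_sents, after_sents)):
        for (token, old_tag), (_, new_tag) in zip(before, after):
            if old_tag != new_tag:
                changes.append((idx, token, old_tag, new_tag))
    return changes

# Toy output from an original and a retrained tagger on the same sentence
before = [[("I", "PPSS"), ("'m", "NN"), ("too", "QL"), ("busy", "JJ"), (".", ".")]]
after = [[("I", "PPSS"), ("'m", "BEM"), ("too", "QL"), ("busy", "JJ"), (".", ".")]]

print(tagging_changes(before, after))
# [(0, "'m", 'NN', 'BEM')]
```

In this notebook, the two inputs could be the results of calling `get_tag_novel` on the same sentences with the original and the retrained tagger.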