# FLIP(01):  Advanced Data Science
**(Module 03: Natural Language Processing)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, but NOT allowed to change or distribute this package.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au)

---


# Session 04 - Categorizing and Tagging Words
# Using a Tagger
A part-of-speech tagger, or POS tagger, processes a sequence of words and attaches a part-of-speech tag to each word.

In [None]:
import nltk

In [None]:
text = nltk.word_tokenize("And now for something completely different")

In [None]:
nltk.pos_tag(text)

In [None]:
text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")

In [None]:
nltk.pos_tag(text)

In [None]:
# small test:
# Many words, like "ski" and "race", can be used as nouns or verbs with no difference in pronunciation. Can you think of others?
# Hint: think of a commonplace object and try to put the word "to" before it to see if it can also be a verb, or think of an action
# and try to put "the" before it to see if it can also be a noun. Now make up a sentence with both uses of this word, and run the POS tagger on this sentence.

In [None]:
text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())

In [None]:
text.similar('woman')

In [None]:
text.similar('bought')

In [None]:
text.similar('over')

In [None]:
text.similar('the')

# Tagged Corpora
## Representing Tagged Tokens

By convention in NLTK, a tagged token is represented using a tuple consisting of the
token and the tag. We can create one of these special tuples from the standard string
representation of a tagged token, using the function str2tuple():

In [None]:
import nltk
tagged_token = nltk.tag.str2tuple('fly/NN')

In [None]:
tagged_token

In [None]:
tagged_token[0]

In [None]:
tagged_token[1]

In [None]:
sent = '''
    The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
    other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
    Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
    said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
    accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
    interest/NN of/IN both/ABX governments/NNS ''/'' ./.
    '''

In [None]:
[nltk.tag.str2tuple(t) for t in sent.split()]
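
To make the string-to-tuple conversion concrete, here is a minimal pure-Python sketch of what such a parser might do. The helper `split_tagged_token` is a hypothetical name and not part of NLTK; it assumes the tag follows the last slash and is normalised to upper case.

```python
def split_tagged_token(s, sep="/"):
    """Split 'word/tag' at the last separator and return (word, TAG)."""
    word, found, tag = s.rpartition(sep)
    if not found:
        # No separator present: treat the whole string as an untagged word.
        return (s, None)
    return (word, tag.upper())

pairs = [split_tagged_token(t) for t in "The/at grand/jj jury/nn ./.".split()]
print(pairs)
```

Splitting at the last separator rather than the first keeps tokens that themselves contain a slash intact.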

## Reading Tagged Corpora
Several of the corpora included with NLTK have been tagged for their part-of-speech. Here’s an example of what you might see if you opened a file from the Brown Corpus with a text editor:

    The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn
    of/in Atlanta’s/np...

Other corpora use a variety of formats for storing part-of-speech tags. NLTK’s corpus readers provide a uniform interface so that you don’t have to be concerned with the different file formats. In contrast with the file extract just shown, the corpus reader for the Brown Corpus represents the data as shown next.

In [None]:
nltk.corpus.brown.tagged_words()

In [None]:
print(nltk.corpus.nps_chat.tagged_words())

In [None]:
nltk.corpus.conll2000.tagged_words()

In [None]:
nltk.corpus.treebank.tagged_words()

In [None]:
nltk.corpus.sinica_treebank.tagged_words()

In [None]:
nltk.corpus.indian.tagged_words()

In [None]:
nltk.corpus.mac_morpho.tagged_words()

In [None]:
nltk.corpus.conll2002.tagged_words()

In [None]:
nltk.corpus.cess_cat.tagged_words()

In [None]:
# Let’s see which of these tags are the most common in the news category of the Brown Corpus:
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.most_common()  # (tag, count) pairs, most frequent first
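
The counting step can also be pictured without NLTK. The sketch below uses `collections.Counter` on a handful of tagged words that are invented for illustration (they are not taken from the Brown Corpus), and computes the share of tokens covered by the two most frequent tags.

```python
from collections import Counter

# Toy tagged words, invented for illustration (not taken from the Brown Corpus).
tagged = [("the", "AT"), ("jury", "NN"), ("said", "VBD"),
          ("the", "AT"), ("report", "NN"), ("was", "BEDZ"),
          ("a", "AT"), ("success", "NN")]

tag_counts = Counter(tag for (word, tag) in tagged)

# most_common(n) lists (tag, count) pairs from most to least frequent.
top = tag_counts.most_common(2)
share = sum(count for (_, count) in top) / sum(tag_counts.values())
print(top, share)
```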

In [None]:
# small test: 
# Plot the frequency distribution just shown using tag_fd.plot(cumulative=True). 
# What percentage of words are tagged using the first five tags of the above list?

## Nouns

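The universal tagset uses the single tag NOUN for both common nouns like book and proper nouns like Scotland.
Let’s inspect some tagged text to see what parts-of-speech occur before a noun, with
the most frequent ones first. To begin with, we construct a list of bigrams whose members are themselves word-tag pairs, such as (('The', 'DET'), ('Fulton', 'NOUN')) and
(('Fulton', 'NOUN'), ('County', 'NOUN')). Then we construct a FreqDist from the tag parts
of the bigrams.

In [None]:
# The default Brown tags have no plain 'N' tag, so load the news words with the universal tagset
brown_news_universal = brown.tagged_words(categories='news', tagset='universal')
word_tag_pairs = nltk.bigrams(brown_news_universal)

In [None]:
# Tags that precede a noun, most frequent first
noun_preceders = nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN')
[tag for (tag, _) in noun_preceders.most_common()]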
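
The bigram filtering can be traced by hand on a small example. This is a sketch on invented data, assuming universal-style tags such as DET and NOUN; `zip(tagged, tagged[1:])` plays the role of the bigram generator.

```python
from collections import Counter

# Toy tagged sentence with universal-style tags (invented for illustration).
tagged = [("The", "DET"), ("jury", "NOUN"), ("praised", "VERB"),
          ("the", "DET"), ("city", "NOUN"), ("council", "NOUN")]

# Pair each token with its successor, then keep the first tag when the second is a noun.
preceding = Counter(a[1] for (a, b) in zip(tagged, tagged[1:]) if b[1] == "NOUN")
print(preceding.most_common())
```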

## Verbs

What are the most common verbs in news text? Let’s sort all the verbs by frequency:

In [None]:
wsj = nltk.corpus.treebank.tagged_words()

In [None]:
word_tag_fd = nltk.FreqDist(wsj)

In [None]:
# most_common() yields ((word, tag), count) pairs in descending order of frequency
[wt[0] + "/" + wt[1] for (wt, _) in word_tag_fd.most_common() if wt[1].startswith('V')]

In [None]:
cfd1 = nltk.ConditionalFreqDist(wsj)

In [None]:
cfd1['yield'].keys()

In [None]:
cfd1['cut'].keys()

In [None]:
cfd2 = nltk.ConditionalFreqDist((tag,word) for (word,tag) in wsj)
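
The two conditional frequency distributions used here (word to tags, and tag to words) can be pictured as nested counters. The following is a rough pure-Python sketch on invented data with Penn-Treebank-style tags; it is not how NLTK implements `ConditionalFreqDist`.

```python
from collections import Counter, defaultdict

# Toy (word, tag) pairs with Penn-Treebank-style tags (invented for illustration).
tagged = [("cut", "VB"), ("cut", "NN"), ("cut", "VBD"), ("cut", "VB"),
          ("yield", "NN"), ("yield", "VB")]

# word -> Counter of tags: the condition is the word.
tags_by_word = defaultdict(Counter)
for word, tag in tagged:
    tags_by_word[word][tag] += 1

# Reversing each pair gives the inverse view: tag -> Counter of words.
words_by_tag = defaultdict(Counter)
for word, tag in tagged:
    words_by_tag[tag][word] += 1

print(sorted(tags_by_word["cut"]))
print(sorted(words_by_tag["VB"]))
```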

In [None]:
# The Treebank sample uses Penn tags: VBN marks past participles, VBD the simple past
cfd2['VBN'].keys()

In [None]:
[w for w in cfd1.conditions() if 'VBD' in cfd1[w] and 'VBN' in cfd1[w]]

In [None]:
idx1 = wsj.index(('kicked', 'VBD'))
wsj[idx1-4:idx1+1]

In [None]:
idx2 = wsj.index(('kicked', 'VBN'))
wsj[idx2-4:idx2+1]

In [None]:
# small test: 
# Given the list of past participles specified by cfd2['VBN'].keys(), try to collect a list of all the word-tag pairs 
# that immediately precede items in that list.

# Mapping Words to Properties Using Python Dictionaries
## Dictionaries in Python

Python provides a dictionary data type that can be used for mapping between arbitrary
types. It is like a conventional dictionary, in that it gives you an efficient way to look things up.
To illustrate, we define pos to be an empty dictionary and then add four entries to it, specifying the part-of-speech of some words. We add entries to a dictionary using the familiar square bracket notation:

In [None]:
pos = {}

In [None]:
pos

In [None]:
pos['colorless'] = 'ADJ'

In [None]:
pos

In [None]:
pos['ideas'] = 'N'

In [None]:
pos['sleep'] = 'V'

In [None]:
pos['furiously'] = 'ADV'

In [None]:
pos

In [None]:
pos['ideas']

In [None]:
pos['colorless']

In [None]:
pos['green']  # raises KeyError: 'green' has not been added to the dictionary

In [None]:
list(pos)

In [None]:
sorted(pos)

In [None]:
[w for w in pos if w.endswith('s')]

In [None]:
for word in sorted(pos):
    print(word + ":", pos[word])

In [None]:
pos.keys()

In [None]:
pos.values()

In [None]:
pos.items()

In [None]:
for key,val in sorted(pos.items()):
    print(key + ":",val)

In [None]:
pos['sleep'] = 'V'
pos['sleep']

In [None]:
pos['sleep'] = 'N'
pos['sleep']

## Defining Dictionaries
We can use the same key-value pair format to create a dictionary. There are a couple
of ways to do this, and we will normally use the first:

In [None]:
pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}
pos = dict(colorless='ADJ', ideas='N', sleep='V', furiously='ADV')

In [None]:
# Dictionary keys must be immutable, so using a list as a key raises TypeError
pos = {['ideas', 'blogs', 'adventures']: 'N'}

## Default Dictionaries
If we try to access a key that is not in a dictionary, we get an error. However, it’s often useful if a dictionary can automatically create an entry for this new key and give it a default value, such as zero or the empty list. Since Python 2.5, a special kind of dictionary called a defaultdict has been available. In order to use it, we have to supply a parameter which can be used to create the default value, e.g., int, float, str, list, dict, tuple.

In [None]:
import nltk
frequency = nltk.defaultdict(int)
frequency['colorless'] = 4
frequency['ideas']

In [None]:
pos = nltk.defaultdict(list)
pos['sleep'] = ['N', 'V']
pos['ideas']

In [None]:
# When we access a non-existent entry, it is automatically added to the dictionary
pos = nltk.defaultdict(lambda: 'N')
pos['colorless'] = 'ADJ'
pos['blog']

In [None]:
pos.items()

In [None]:
f = lambda : 'N'

In [None]:
f()

In [None]:
def g():
    return 'N'

In [None]:
g()

In [None]:
alice = nltk.corpus.gutenberg.words('carroll-alice.txt')
vocab = nltk.FreqDist(alice)
v1000 = [word for (word, _) in vocab.most_common(1000)]  # the 1000 most frequent words
mapping = nltk.defaultdict(lambda: 'UNK')

In [None]:
for v in v1000:
    mapping[v] = v

In [None]:
alice2 = [mapping[v] for v in alice]
alice2[:100]

In [None]:
len(set(alice2))
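
The same replace-rare-words idea can be checked on a sentence small enough to verify by eye. This is a sketch on invented data; the cut-off of two word types is an arbitrary choice for the example.

```python
from collections import Counter, defaultdict

# Toy token list, invented for illustration.
tokens = "the cat sat on the mat and the cat ran".split()

# Keep only the two most frequent word types.
frequent = [word for (word, _) in Counter(tokens).most_common(2)]

# Any word not explicitly stored in the mapping falls back to 'UNK'.
mapping = defaultdict(lambda: "UNK")
for word in frequent:
    mapping[word] = word

replaced = [mapping[word] for word in tokens]
print(replaced)
```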

## Incrementally Updating a Dictionary
We begin by initializing an empty defaultdict, then process each part-of-speech tag in the text. If the tag hasn’t been seen before, it will have a zero count by default. Each time we encounter a tag, we increment its count using the += operator:

In [None]:
# Incrementally updating a dictionary, and sorting by value.
counts = nltk.defaultdict(int)
from nltk.corpus import brown
for (word, tag) in brown.tagged_words(categories='news'):
    counts[tag] += 1

In [None]:
counts['NN']  # the Brown tagset marks singular common nouns as NN

In [None]:
list(counts)

In [None]:
from operator import itemgetter
sorted(counts.items(), key=itemgetter(1), reverse=True)

In [None]:
[t for t, c in sorted(counts.items(), key=itemgetter(1), reverse=True)]

In [None]:
pair = ('NP', 8336)
pair[1]

In [None]:
itemgetter(1)(pair)

In [None]:
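# General idiom: create a defaultdict, then update one entry per item in a sequence
my_dictionary = nltk.defaultdict(int)  # int() supplies the default value 0

In [None]:
sequence = ['N', 'V', 'N', 'ADJ']      # illustrative data
for item in sequence:
    item_key = item                    # derive the key from the item
    my_dictionary[item_key] += 1       # update the entry with information about item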

In [None]:
last_letters = nltk.defaultdict(list)
words = nltk.corpus.words.words('en')
for word in words:
    key = word[-2:]
    last_letters[key].append(word)

In [None]:
last_letters['ly']

In [None]:
last_letters['zy']

In [None]:
anagrams = nltk.defaultdict(list)

In [None]:
for word in words:
    key = ''.join(sorted(word))
    anagrams[key].append(word)

In [None]:
anagrams['aeilnrt']

In [None]:
anagrams = nltk.Index((''.join(sorted(w)), w) for w in words)
anagrams['aeilnrt']

## Complex Keys and Values
We can use default dictionaries with complex keys and values. Let’s study the range of possible tags for a word, given the word itself and the tag of the previous word. We will see how this information can be used by a POS tagger.

In [None]:
pos = nltk.defaultdict(lambda: nltk.defaultdict(int))

In [None]:
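# Load the universal tagset so that the key ('DET', 'right') used below exists
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')

In [None]:
# nltk.bigrams replaces the older nltk.ibigrams, which is not available in NLTK 3
for ((w1, t1), (w2, t2)) in nltk.bigrams(brown_news_tagged):
    pos[(t1, w2)][t2] += 1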

In [None]:
pos[('DET', 'right')]
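
To see what the nested dictionary holds, here is a small sketch on invented data with universal-style tags. The key is the pair (previous tag, current word) and the value counts the tags observed for the current word in that context.

```python
from collections import defaultdict

# Toy tagged words with universal-style tags (invented for illustration).
tagged = [("the", "DET"), ("right", "NOUN"), ("to", "PRT"), ("vote", "VERB"),
          ("the", "DET"), ("right", "ADJ"), ("answer", "NOUN"),
          ("the", "DET"), ("right", "ADJ"), ("side", "NOUN")]

# Key: (previous tag, current word). Value: counts of the current word's tag.
context_tags = defaultdict(lambda: defaultdict(int))
for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
    context_tags[(t1, w2)][t2] += 1

print(dict(context_tags[("DET", "right")]))
```

A tagger could use such a table to pick the most frequent tag for a word given the tag it has just assigned to the previous word.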

## Inverting a Dictionary
Dictionaries support efficient lookup, so long as you want to get the value for any key. If d is a dictionary and k is a key, we type d[k] and immediately obtain the value. Finding a key given a value is slower and more cumbersome:

In [None]:
counts = nltk.defaultdict(int)
for word in nltk.corpus.gutenberg.words('milton-paradise.txt'):
    counts[word] += 1

In [None]:
[key for (key, value) in counts.items() if value == 32]

In [None]:
pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}

In [None]:
pos2 = dict((value, key) for (key, value) in pos.items())

In [None]:
pos2['N']

In [None]:
pos.update({'cats': 'N', 'scratch': 'V', 'peacefully': 'ADV', 'old': 'ADJ'})
pos2 = nltk.defaultdict(list)

In [None]:
for key, value in pos.items():
    pos2[value].append(key)

In [None]:
pos2['ADV']

In [None]:
pos2 = nltk.Index((value, key) for (key, value) in pos.items())
pos2['ADV']

# Automatic Tagging
In the rest of this chapter we will explore various ways to automatically add part-of-speech tags to text. We will see that the tag of a word depends on the word and its context within a sentence. For this reason, we will be working with data at the level of (tagged) sentences rather than words. We’ll begin by loading the data we will be using.

In [None]:
from nltk.corpus import brown

In [None]:
brown_tagged_sents = brown.tagged_sents(categories='news')

In [None]:
brown_sents = brown.sents(categories='news')

## The Default Tagger
The simplest possible tagger assigns the same tag to each token. This may seem to be a rather banal step, but it establishes an important baseline for tagger performance. In order to get the best result, we tag each word with the most likely tag.

In [None]:
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]

In [None]:
import nltk
nltk.FreqDist(tags).max()

In [None]:
# we can create a tagger that tags everything as NN
raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
tokens = nltk.word_tokenize(raw)
default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(tokens)

In [None]:
default_tagger.evaluate(brown_tagged_sents)
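
The baseline can be reproduced in a few lines of plain Python. The sketch below uses invented gold-standard data: it picks the most frequent tag, assigns it to every token, and measures the fraction of tokens tagged correctly.

```python
from collections import Counter

# Toy gold-standard data (invented for illustration).
gold = [("the", "AT"), ("dog", "NN"), ("barked", "VBD"),
        ("at", "IN"), ("the", "AT"), ("cat", "NN"), ("door", "NN")]

# The most frequent tag overall becomes the default for every token.
default_tag = Counter(tag for (_, tag) in gold).most_common(1)[0][0]

predicted = [(word, default_tag) for (word, _) in gold]
accuracy = sum(p == g for p, g in zip(predicted, gold)) / len(gold)
print(default_tag, accuracy)
```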

## The Regular Expression Tagger
The regular expression tagger assigns tags to tokens on the basis of matching patterns. For instance, we might guess that any word ending in ed is the past participle of a verb, and any word ending with ’s is a possessive noun. We can express these as a list of
regular expressions:

In [None]:
patterns = [
    (r'.*ing$', 'VBG'), # gerunds
    (r'.*ed$', 'VBD'), # simple past
    (r'.*es$', 'VBZ'), # 3rd singular present
    (r'.*ould$', 'MD'), # modals
    (r'.*\'s$', 'NN$'), # possessive nouns
    (r'.*s$', 'NNS'), # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers (the decimal point is escaped)
    (r'.*', 'NN')     # nouns (default)
    ]

In [None]:
regexp_tagger = nltk.RegexpTagger(patterns)

In [None]:
regexp_tagger.tag(brown_sents[3])
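
The first-match-wins behaviour of such a pattern list can be sketched with the standard `re` module. The function `regex_tag` is a hypothetical helper for illustration, not NLTK's implementation, and the pattern list is a shortened variant of the one above.

```python
import re

# Ordered (pattern, tag) pairs; the first pattern that matches wins.
patterns = [
    (r".*ing$", "VBG"),
    (r".*ed$", "VBD"),
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),
    (r".*s$", "NNS"),
    (r".*", "NN"),          # catch-all default, so every token gets a tag
]
compiled = [(re.compile(p), tag) for (p, tag) in patterns]

def regex_tag(tokens):
    """Tag each token with the tag of the first matching pattern."""
    tagged = []
    for token in tokens:
        tag = next(t for (rx, t) in compiled if rx.match(token))
        tagged.append((token, tag))
    return tagged

print(regex_tag(["running", "jumped", "3.14", "cats", "table"]))
```

Because order matters, more specific patterns should come before more general ones such as the catch-all.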

In [None]:
# Small test: 
# See if you can come up with patterns to improve the performance of the regular expression tagger just shown. 

## The Lookup Tagger
A lot of high-frequency words do not have the NN tag. Let’s find the hundred most frequent words and store their most likely tag. We can then use this information as the model for a “lookup tagger”.

In [None]:
fd = nltk.FreqDist(brown.words(categories='news'))

In [None]:
sent = brown.sents(categories='news')[3]

In [None]:
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))

In [None]:
most_freq_words = [word for (word, _) in fd.most_common(100)]  # the hundred most frequent words

In [None]:
likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)

In [None]:
baseline_tagger = nltk.UnigramTagger(model=likely_tags)

In [None]:
baseline_tagger.evaluate(brown_tagged_sents)

In [None]:
baseline_tagger.tag(sent)

In [None]:
baseline_tagger = nltk.UnigramTagger(model=likely_tags,
                                     backoff=nltk.DefaultTagger('NN'))
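
A lookup tagger with a default fallback amounts to a dictionary lookup with a default value. The sketch below uses invented training data and a model of only two words; `lookup_tag` is a hypothetical helper, not an NLTK function.

```python
from collections import Counter, defaultdict

# Toy tagged training words (invented for illustration).
train = [("the", "AT"), ("dog", "NN"), ("the", "AT"), ("run", "VB"),
         ("run", "NN"), ("run", "VB"), ("a", "AT")]

tag_counts = defaultdict(Counter)
for word, tag in train:
    tag_counts[word][tag] += 1

# Model: each of the 2 most frequent words mapped to its most likely tag.
word_freq = Counter(word for (word, _) in train)
model = {w: tag_counts[w].most_common(1)[0][0]
         for (w, _) in word_freq.most_common(2)}

def lookup_tag(tokens, default="NN"):
    """Use the model where possible; otherwise back off to the default tag."""
    return [(t, model.get(t, default)) for t in tokens]

print(lookup_tag(["the", "run", "quickly"]))
```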

In [None]:
# Lookup tagger performance with varying model size
def performance(cfd, wordlist):
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))
def display():
    import pylab
    # most_common() orders words from most to least frequent
    words_by_freq = [w for (w, _) in nltk.FreqDist(brown.words(categories='news')).most_common()]
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(15)
    perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes, perfs, '-bo')
    pylab.title('Lookup Tagger Performance with Varying Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()

In [None]:
display()

# N-Gram Tagging
## Unigram Tagging

Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token. For example, it will assign the tag JJ to any occurrence of the word frequent, since frequent is used as an adjective (e.g., a frequent word) more often than it is used as a verb (e.g., I frequent this cafe). A unigram tagger behaves just like a lookup tagger, except there is a more convenient technique for setting it up, called training. In the following code sample, we train a unigram tagger, use it to tag a sentence, and then evaluate:

In [None]:
from nltk.corpus import brown
import nltk

In [None]:
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')

In [None]:
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.tag(brown_sents[2007])

In [None]:
unigram_tagger.evaluate(brown_tagged_sents)

## Separating the Training and Testing Data
Now that we are training a tagger on some data, we must be careful not to test it on the same data, as we did in the previous example. A tagger that simply memorized its training data and made no attempt to construct a general model would get a perfect
score, but would be useless for tagging new text. Instead, we should split the data, training on 90% and testing on the remaining 10%:

In [None]:
size = int(len(brown_tagged_sents) * 0.9)

In [None]:
size

In [None]:
train_sents = brown_tagged_sents[:size]

In [None]:
test_sents = brown_tagged_sents[size:]

In [None]:
unigram_tagger = nltk.UnigramTagger(train_sents)

In [None]:
unigram_tagger.evaluate(test_sents)
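
Why the held-out score is lower can be seen on a toy corpus. This sketch trains a most-likely-tag model on invented sentences and scores it on both the training part and the held-out part; the split ratio of 0.75 is chosen only because the corpus has four sentences.

```python
from collections import Counter, defaultdict

# Toy tagged sentences (invented for illustration).
sents = [
    [("the", "AT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "AT"), ("dog", "NN"), ("sleeps", "VBZ")],
    [("the", "AT"), ("bird", "NN"), ("sings", "VBZ")],
]

size = int(len(sents) * 0.75)            # 3 training sentences, 1 test sentence
train_sents, test_sents = sents[:size], sents[size:]

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for (w, c) in counts.items()}

def accuracy(model, tagged_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    pairs = [(w, t) for sent in tagged_sents for (w, t) in sent]
    return sum(model.get(w) == t for (w, t) in pairs) / len(pairs)

model = train_unigram(train_sents)
print(accuracy(model, train_sents), accuracy(model, test_sents))
```

The model is perfect on the sentences it memorised, but words it never saw during training receive no tag and count as errors on the held-out sentence.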

## General N-Gram Tagging

The NgramTagger class uses a tagged training corpus to determine which part-of-speech
tag is most likely for each context. Here we see a special case of an n-gram tagger,
namely a bigram tagger. First we train it, then use it to tag untagged sentences:

In [None]:
bigram_tagger = nltk.BigramTagger(train_sents)

In [None]:
bigram_tagger.tag(brown_sents[2007])

In [None]:
unseen_sent = brown_sents[4203]

In [None]:
bigram_tagger.tag(unseen_sent)

In [None]:
bigram_tagger.evaluate(test_sents)

## Combining Taggers

One way to address the trade-off between accuracy and coverage is to use the more accurate algorithms when we can, but to fall back on algorithms with wider coverage when necessary. For example, we could combine the results of a bigram tagger, a unigram tagger, and a default tagger, as follows:
        
1. Try tagging the token with the bigram tagger.
2. If the bigram tagger is unable to find a tag for the token, try the unigram tagger.
3. If the unigram tagger is also unable to find a tag, use a default tagger.

Most NLTK taggers permit a backoff tagger to be specified. The backoff tagger may itself have a backoff tagger:

In [None]:
t0 = nltk.DefaultTagger('NN')

In [None]:
t1 = nltk.UnigramTagger(train_sents, backoff=t0)

In [None]:
t2 = nltk.BigramTagger(train_sents, backoff=t1)

In [None]:
t2.evaluate(test_sents)
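
The three-level fallback can be sketched in plain Python to show the order in which the models are consulted. This is an illustration on invented sentences, not NLTK's backoff implementation; `tag_with_backoff` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Toy tagged training sentences (invented for illustration).
train_sents = [
    [("they", "PPSS"), ("refuse", "VB"), ("the", "AT"), ("permit", "NN")],
    [("the", "AT"), ("refuse", "NN"), ("permit", "NN")],
    [("they", "PPSS"), ("permit", "VB"), ("the", "AT"), ("refuse", "NN")],
]

unigram = defaultdict(Counter)   # word -> tag counts
bigram = defaultdict(Counter)    # (previous tag, word) -> tag counts
for sent in train_sents:
    prev = None                  # no previous tag at the start of a sentence
    for word, tag in sent:
        unigram[word][tag] += 1
        bigram[(prev, word)][tag] += 1
        prev = tag

def tag_with_backoff(tokens, default="NN"):
    """Try the bigram context first, then the word alone, then the default."""
    tagged, prev = [], None
    for word in tokens:
        if (prev, word) in bigram:
            tag = bigram[(prev, word)].most_common(1)[0][0]
        elif word in unigram:
            tag = unigram[word].most_common(1)[0][0]
        else:
            tag = default
        tagged.append((word, tag))
        prev = tag
    return tagged

print(tag_with_backoff("they refuse the permit".split()))
print(tag_with_backoff("they permit a refuse".split()))
```

In the second sentence, "a" is unknown and receives the default tag, and "refuse" then falls back to the unigram model because the context (NN, refuse) was never seen in training.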

In [None]:
# Small Test: 
# Extend the preceding example by defining a TrigramTagger called t3, which backs off to t2.

## Storing Taggers
Training a tagger on a large corpus may take a significant time. Instead of training a tagger every time we need one, it is convenient to save a trained tagger in a file for later reuse.

In [None]:
from pickle import dump

In [None]:
output = open('t2.pkl','wb')

In [None]:
dump(t2,output,-1)

In [None]:
output.close()

In [None]:
from pickle import load

In [None]:
# Use a name other than input, which would shadow Python's built-in input() function
infile = open('t2.pkl','rb')

In [None]:
tagger = load(infile)

In [None]:
infile.close()

In [None]:
text = """The board's action shows what free enterprise
            is up against in our complex maze of regulatory laws ."""

In [None]:
tokens = text.split()

In [None]:
tagger.tag(tokens)
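
The save-and-reload cycle can also be written with `with` blocks, which close the file even if an error occurs. In this sketch a plain dictionary stands in for a trained tagger, and the temporary directory and file name `model.pkl` are assumptions made for the example.

```python
import os
import pickle
import tempfile

# A stand-in for a trained tagger: any picklable Python object works the same way.
model = {"the": "AT", "board": "NN", "shows": "VBZ"}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# The with-statement closes the file even if dump() or load() raises.
with open(path, "wb") as outfile:
    pickle.dump(model, outfile, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as infile:
    restored = pickle.load(infile)

print(restored == model)
```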

## Performance Limitations

In [None]:
cfd = nltk.ConditionalFreqDist(
    ((x[1], y[1], z[0]), z[1])
    for sent in brown_tagged_sents
    for x, y, z in nltk.trigrams(sent))

In [None]:
ambiguous_contexts = [c for c in cfd.conditions() if len(cfd[c]) > 1]

In [None]:
sum(cfd[c].N() for c in ambiguous_contexts) / cfd.N()
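
The ratio computed above is the fraction of trigram contexts (two preceding tags plus the current word) in which more than one tag was observed for the current word. A toy version on invented sentences, with tags contrived so that exactly one context is ambiguous, might look like this:

```python
from collections import Counter, defaultdict

# Toy tagged sentences (invented; tags contrived to force one ambiguous context).
sents = [
    [("they", "PPSS"), ("can", "MD"), ("fish", "VB")],
    [("they", "PPSS"), ("can", "MD"), ("fish", "NN")],
    [("we", "PPSS"), ("like", "VB"), ("fish", "NN")],
]

# Context: (tag two back, tag one back, current word) -> counts of the current tag.
context_tags = defaultdict(Counter)
for sent in sents:
    for x, y, z in zip(sent, sent[1:], sent[2:]):
        context_tags[(x[1], y[1], z[0])][z[1]] += 1

total = sum(sum(c.values()) for c in context_tags.values())
ambiguous = sum(sum(c.values()) for c in context_tags.values() if len(c) > 1)
print(ambiguous / total)
```

Whatever tag a trigram tagger chooses in an ambiguous context, it is wrong for some of the occurrences, which puts a ceiling on its accuracy.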

In [None]:
test_tags = [tag for sent in brown.sents(categories='editorial')
             for (word, tag) in t2.tag(sent)]

In [None]:
gold_tags = [tag for (word, tag) in brown.tagged_words(categories='editorial')]

In [None]:
print(nltk.ConfusionMatrix(gold_tags, test_tags))

## Tagging Across Sentence Boundaries
An n-gram tagger uses recent tags to guide the choice of tag for the current word. When tagging the first word of a sentence, a trigram tagger will be using the part-of-speech tag of the previous two tokens, which will normally be the last word of the previous
sentence and the sentence-ending punctuation. However, the lexical category that closed the previous sentence has no bearing on the one that begins the next sentence.
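
The difference between tagging a flat stream of tokens and tagging sentence by sentence shows up in the contexts themselves. The sketch below uses two invented sentences and a hypothetical helper `bigram_contexts` to compare the context seen by the first word of the second sentence in each case.

```python
# Two toy tagged sentences (invented for illustration).
sents = [
    [("it", "PPS"), ("rained", "VBD"), (".", ".")],
    [("we", "PPSS"), ("left", "VBD"), (".", ".")],
]

def bigram_contexts(tagged_tokens):
    """Pair each word with the tag of the token before it (None at the start)."""
    prev_tags = [None] + [tag for (_, tag) in tagged_tokens[:-1]]
    return [(prev, word) for prev, (word, _) in zip(prev_tags, tagged_tokens)]

# Flat stream: the first word of sentence 2 sees the '.' that ended sentence 1.
flat = bigram_contexts([tok for sent in sents for tok in sent])

# Per sentence: every sentence-initial word gets the context None.
per_sentence = [ctx for sent in sents for ctx in bigram_contexts(sent)]

print(flat[3], per_sentence[3])
```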

In [None]:
# N-gram tagging at the sentence level.
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')

In [None]:
size = int(len(brown_tagged_sents) * 0.9)

In [None]:
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

In [None]:
t0 = nltk.DefaultTagger('NN')

In [None]:
t1 = nltk.UnigramTagger(train_sents, backoff=t0)

In [None]:
t2 = nltk.BigramTagger(train_sents, backoff=t1)

In [None]:
t2.evaluate(test_sents)