# FLIP(01):  Advanced Data Science
**(Module 7: Natural Language Processing)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, but NOT allowed to change or distribute this package.

Prepared by and for 
**Student Members** |
2006-2019 [TULIP Lab](http://www.tulip.org.au)

---


# Session C - Categorizing and Tagging Words 


## Contents

1 [Using a Tagger](#Tagger)

2 [Tagged Corpora](#Corpora)

3 [Default Tagger](#Default)

4 [Query Tagger](#Query)

5 [Unigram Tagger](#Unigram)

6 [Bigram Tagger](#Bigram)

7 [Combining Tagger](#Combining)

8 [Summary](#Summary)

<a id = "Tagger"></a>

## <span style="color:#0b486b">1. Using a Tagger</span>

A tagger processes a sequence of words and attaches a part-of-speech tag to each word. Let's look at an example:

In [None]:
import nltk
words = nltk.word_tokenize('And now for something completely different')
print(words)

In [None]:
word_tag = nltk.pos_tag(words)
print(word_tag)

`nltk.word_tokenize(text)`: Tokenizes the given sentence and returns a list of words.

`nltk.pos_tag(words)`: Tags the given word list and returns a list of `(word, tag)` tuples.

In the result, `something` is tagged `NN`, which means noun.
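To make the shape of the tagger output concrete, here is a minimal sketch that works with a list of `(word, tag)` tuples. The list is written by hand for illustration, so treat the tags as an assumption: the tags NLTK actually returns may vary with the installed version and model.

```python
# Hand-written list in the (word, tag) shape; not produced by running NLTK here.
tagged = [
    ("And", "CC"), ("now", "RB"), ("for", "IN"),
    ("something", "NN"), ("completely", "RB"), ("different", "JJ"),
]

# Each element unpacks into a word and its tag.
for word, tag in tagged:
    print(f"{word:12s} {tag}")

# Collect every word that carries a given tag, for example nouns.
nouns = [word for word, tag in tagged if tag == "NN"]
print(nouns)
```

Because the output is an ordinary Python list of tuples, it can be filtered, counted, or grouped with standard list operations.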

<a id = "Corpora"></a>

## <span style="color:#0b486b">2. Tagged Corpora</span>

Many corpora in NLTK are annotated with part-of-speech tags. The Brown corpus we studied earlier is one of them. Note that different corpora may use different tag sets.

In [None]:
from nltk.corpus import brown
words_tag = brown.tagged_words(categories='news')
print(words_tag[:10])

`brown` can be thought of as a `CategorizedTaggedCorpusReader` instance.

`CategorizedTaggedCorpusReader::tagged_words(fileids, categories)`: Accepts file identifiers or category identifiers and returns a list of `(word, tag)` tuples.

`CategorizedTaggedCorpusReader::tagged_sents(fileids, categories)`: Accepts file identifiers or category identifiers and returns a list of tagged sentences, where each sentence is a list of `(word, tag)` tuples.

In [None]:
tagged_sents = brown.tagged_sents(categories='news')
print(tagged_sents)
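The relationship between the two return shapes can be sketched without loading a corpus. The data below is a toy stand-in (an assumption, not real Brown output): flattening a list of tagged sentences yields the flat list of tagged words.

```python
from itertools import chain

# Toy stand-in for the tagged_sents shape: a list of sentences,
# each sentence being a list of (word, tag) tuples.
toy_tagged_sents = [
    [("The", "AT"), ("jury", "NN"), ("said", "VBD")],
    [("It", "PPS"), ("rained", "VBD")],
]

# Flattening the sentences gives the tagged_words shape:
# one flat list of (word, tag) tuples.
toy_tagged_words = list(chain.from_iterable(toy_tagged_sents))
print(toy_tagged_words)

# Sentence boundaries are only available in the nested form.
print(len(toy_tagged_sents), "sentences,", len(toy_tagged_words), "words")
```

The nested form matters later in this session, because taggers are trained and evaluated on lists of tagged sentences rather than flat word lists.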

NLTK also includes a Chinese corpus, `sinica_treebank`, which is written in Traditional Chinese and is also annotated with part-of-speech tags. Let's take a look at it.

In [None]:
from nltk.corpus import sinica_treebank
print(sinica_treebank.fileids())

`sinica_treebank` can be thought of as a `SinicaTreebankCorpusReader` instance.

`SinicaTreebankCorpusReader::words(fileids)`: Accepts file identifiers and returns a list of the words in those files.

`SinicaTreebankCorpusReader::tagged_words(fileids)`: Accepts file identifiers and returns a list of `(word, tag)` tuples.

In [None]:
words = sinica_treebank.words('parsed')
print(words[:40])
words_tag = sinica_treebank.tagged_words('parsed')
print(words_tag[:40])

In [None]:
words_tag = sinica_treebank.tagged_words('parsed')
tag_fd = nltk.FreqDist(tag for (word, tag) in words_tag)
tag_fd.tabulate(5)

<a id = "Default"></a>

## <span style="color:#0b486b">3. Default Tagger</span>

The default tagger assigns the same tag to every word. It is a crude approach, but it provides a useful baseline. Let's look at an example:

In [None]:
import nltk
raw = "You are a good man, but i don't love you!"
tokens = nltk.word_tokenize(raw)

In [None]:
default_tagger = nltk.DefaultTagger('NN')
tagged_words = default_tagger.tag(tokens)
print(tagged_words)

The `DefaultTagger` constructor takes a tag string as its parameter and creates a default tagger object. The result shows that the default tagger marks every word as `NN`.

`DefaultTagger::tag(tokens)`: Tags the given word list and returns a list of `(word, tag)` tuples.

`DefaultTagger::evaluate(tagged_sents)`: Evaluates the tagger against sentences that are already tagged and returns an accuracy between 0 and 1.0.

In [None]:
from nltk.corpus import brown
tagged_sents = brown.tagged_sents(categories='news')
print(default_tagger.evaluate(tagged_sents))

We can see that the default tagger we created reaches an accuracy of only 0.13.
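To show what this evaluation measures, here is a minimal pure-Python sketch of a default tagger and a token-level accuracy function. The names `default_tag` and `accuracy` and the toy gold data are assumptions made for illustration; this is not NLTK's implementation.

```python
def default_tag(tokens, tag="NN"):
    """Assign the same tag to every token (hypothetical helper)."""
    return [(tok, tag) for tok in tokens]


def accuracy(tag_fn, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        predicted = tag_fn(words)
        for (_, pred_tag), (_, gold_tag) in zip(predicted, sent):
            correct += pred_tag == gold_tag
            total += 1
    return correct / total if total else 0.0


# Toy gold-standard sentences, not real corpus data.
gold = [
    [("The", "AT"), ("dog", "NN"), ("barked", "VBD")],
    [("A", "AT"), ("cat", "NN"), ("slept", "VBD"), ("there", "RB")],
]

score = accuracy(default_tag, gold)
print(score)  # 2 of the 7 tokens are nouns
```

The score equals the share of tokens in the gold data that happen to carry the default tag, which is why a default tagger can only ever be a baseline.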

<a id = "Query"></a>

## <span style="color:#0b486b">4. Query Tagger</span>

The default tagger uses the same tag for every word, so its accuracy is very low. A better approach is to look up a specific tag for each of the most frequent words. Let's look at an example:

In [None]:
# Frequency distribution of news texts to find the 100 most commonly used words in news text
fd = nltk.FreqDist(brown.words(categories='news'))
most_common_pairs = fd.most_common(100)
most_common_words = [i[0] for i in most_common_pairs]

In [None]:
most_common_words

In [None]:
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))

In [None]:
cfd

In [None]:
likely_tags = dict((word, cfd[word].max()) for word in most_common_words)

In [None]:
likely_tags

In [None]:
baseline_tagger = nltk.UnigramTagger(model=likely_tags)
tagged_sents = brown.tagged_sents(categories='news')
print(baseline_tagger.evaluate(tagged_sents))

The accuracy of this tagger is 0.45, which is much better than the default tagger. The `UnigramTagger` constructor accepts a word-to-tag dictionary as its `model` and builds a tagger directly from it. Both `UnigramTagger` and `DefaultTagger` inherit from `TaggerI`, which has the `tag` and `evaluate` methods, so `UnigramTagger` has them as well.

Since we only specified tags for 100 words, let's see how this tagger handles words that are not in the lookup table.

In [None]:
raw = "You are a good man, but i don't love you!"
tokens = nltk.word_tokenize(raw)
print(baseline_tagger.tag(tokens))

Many words are tagged `None` because they are not among the 100 most frequent words. We can give those words a default tag instead: use the lookup table first and, if it has no entry for a word, fall back to the default tagger. This process is called backoff.

In [None]:
baseline_tagger2 = nltk.UnigramTagger(model=likely_tags, backoff=nltk.DefaultTagger('NN'))
tagged_sents = brown.tagged_sents(categories='news')
print(baseline_tagger2.evaluate(tagged_sents))

We can see that the accuracy has improved.
If we increase the number of words in the lookup table, the accuracy increases further.

In [None]:
fd = nltk.FreqDist(brown.words(categories='news'))
most_common_pairs = fd.most_common(500)
most_common_words = [i[0] for i in most_common_pairs]

In [None]:
most_common_words

In [None]:
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))

In [None]:
likely_tags = dict((word, cfd[word].max()) for word in most_common_words)

In [None]:
baseline_tagger = nltk.UnigramTagger(model=likely_tags, backoff=nltk.DefaultTagger('NN'))
tagged_sents = brown.tagged_sents(categories='news')
print(baseline_tagger.evaluate(tagged_sents))
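The lookup-then-backoff logic used in this section can be sketched in plain Python. The helpers `build_lookup` and `lookup_tag` and the toy data are hypothetical and serve only as an illustration of the idea, not of how NLTK implements it.

```python
from collections import Counter, defaultdict


def build_lookup(tagged_words, n_words):
    """Map each of the n_words most frequent words to its most frequent tag."""
    word_counts = Counter(word for word, _ in tagged_words)
    tag_counts = defaultdict(Counter)
    for word, tag in tagged_words:
        tag_counts[word][tag] += 1
    return {
        word: tag_counts[word].most_common(1)[0][0]
        for word, _ in word_counts.most_common(n_words)
    }


def lookup_tag(tokens, table, backoff_tag=None):
    """Tag tokens from the table; use backoff_tag when a word is missing."""
    return [(tok, table.get(tok, backoff_tag)) for tok in tokens]


# Toy tagged words, not real corpus data.
toy = [
    ("the", "AT"), ("dog", "NN"), ("the", "AT"), ("run", "VB"),
    ("run", "NN"), ("run", "VB"), ("the", "AT"), ("cat", "NN"),
]

table = build_lookup(toy, n_words=2)
without_backoff = lookup_tag(["the", "cat", "run"], table)
with_backoff = lookup_tag(["the", "cat", "run"], table, backoff_tag="NN")
print(without_backoff)
print(with_backoff)
```

Without a backoff tag, any word outside the table comes back as `None`. With one, every word receives a tag, which is what raises the accuracy.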

<a id = "Unigram"></a>

## <span style="color:#0b486b">5. Unigram Tagger</span>

Unigram tagging is based on a simple statistical algorithm: assign each word its most likely tag. The unigram tagger behaves like the query tagger, but it does not require us to supply a model. We only need to provide training data, which is a list of tagged sentences. The tagger trains on these samples and stores the most likely tag for each word in a dictionary. For example:

In [None]:
import nltk
from nltk.corpus import brown
tagged_sents = brown.tagged_sents(categories='news')

In [None]:
unigram_tagger = nltk.UnigramTagger(train=tagged_sents)
print(unigram_tagger.evaluate(tagged_sents))

This result is much better than our previous query tagger, and the unigram tagger does not require us to count the most likely tags for each word.
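What training a unigram model amounts to can be sketched with a counter per word. The function `train_unigram` and the toy sentences are assumptions for illustration and do not reproduce NLTK's internals.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Learn the most frequent tag for every word seen in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


# Toy training sentences; "saw" occurs as a verb twice and as a noun once.
toy_sents = [
    [("I", "PPSS"), ("saw", "VBD"), ("the", "AT"), ("saw", "NN")],
    [("She", "PPS"), ("saw", "VBD"), ("it", "PPO")],
]

model = train_unigram(toy_sents)
tagged = [(word, model.get(word)) for word in ["the", "saw"]]
print(tagged)
```

In this toy model `saw` is always tagged as a verb, even after `the`, because a unigram model stores one tag per word and ignores context.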

However, it is not good practice to use the same data for both training and testing: if the tagger has overfitted the training data, we would not be able to tell. We therefore split the data set, using 90% as the training set and 10% as the test set.

In [None]:
tagged_sents = brown.tagged_sents(categories='news')
size = int(len(tagged_sents) * 0.9)
train_sets = tagged_sents[:size]
test_sets = tagged_sents[size:]

In [None]:
unigram_tagger = nltk.UnigramTagger(train=train_sets)
print(unigram_tagger.evaluate(train_sets))
print(unigram_tagger.evaluate(test_sets))

We can see that the accuracy of the unigram tagger on the test set is 0.81.
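The 90/10 split, and one reason held-out accuracy is lower than training accuracy, can be sketched as follows. The helper `split_train_test` and the toy sentences are hypothetical.

```python
def split_train_test(sents, train_fraction=0.9):
    """Slice a list of sentences into a leading training part and a held-out rest."""
    size = int(len(sents) * train_fraction)
    return sents[:size], sents[size:]


# Ten toy one-word sentences, each with a distinct word.
toy = [[("w%d" % i, "NN")] for i in range(10)]

train_sents, test_sents = split_train_test(toy)
print(len(train_sents), len(test_sents))

# Words that appear only in the test set were never seen in training,
# so a tagger trained on train_sents has no entry for them.
train_vocab = {word for sent in train_sents for word, _ in sent}
test_vocab = {word for sent in test_sents for word, _ in sent}
unseen = test_vocab - train_vocab
print(unseen)
```

Slicing keeps the two sets disjoint, so the score on the test set reflects how the tagger copes with text it has not memorized.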

<a id = "Bigram"></a>

## <span style="color:#0b486b">6. Bigram Tagger</span>

The unigram tagger assigns each word its single most likely tag, but the same word may take different tags in different contexts. The tag of a word therefore depends not only on the word itself but also on what precedes it. The bigram tagger considers the word itself together with the tag of the previous word.

In [None]:
tagged_sents = brown.tagged_sents(categories='news')
size = int(len(tagged_sents) * 0.9)
train_sets = tagged_sents[:size]
test_sets = tagged_sents[size:]

In [None]:
bigram_tagger = nltk.BigramTagger(train=train_sets)
print(bigram_tagger.evaluate(train_sets))
print(bigram_tagger.evaluate(test_sets))

The bigram tagger examines the word itself and the tag of the previous word. When it meets a context it has not seen in training, such as a new word, it cannot assign a tag, and the word that follows is then left untagged as well because its preceding tag is `None`. This is why the bigram tagger has low accuracy on the test set.
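The way one unknown word leaves the following words untagged can be reproduced with a small pure-Python sketch. The functions `train_bigram` and `bigram_tag`, the `"<s>"` start marker, and the toy sentence are assumptions for illustration, not NLTK's implementation.

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical marker for "no previous tag" at sentence start


def train_bigram(tagged_sents):
    """Learn the most frequent tag for each (previous tag, word) context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = START
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return {ctx: tags.most_common(1)[0][0] for ctx, tags in counts.items()}


def bigram_tag(tokens, model):
    """Tag left to right; an unseen context yields None."""
    tagged, prev_tag = [], START
    for tok in tokens:
        tag = model.get((prev_tag, tok))
        tagged.append((tok, tag))
        prev_tag = tag  # None here makes the next context unseen as well
    return tagged


model = train_bigram([[("the", "AT"), ("dog", "NN"), ("barks", "VBZ")]])

seen = bigram_tag(["the", "dog", "barks"], model)
unseen = bigram_tag(["the", "cat", "barks"], model)
print(seen)
print(unseen)
```

In the second call `barks` is a known word, yet it is tagged `None` because the tag before it is `None`, and the context `(None, "barks")` never occurred in training.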

<a id = "Combining"></a>

## <span style="color:#0b486b">7. Combining Tagger</span>

Earlier we gave the query tagger a backoff tagger (the default tagger). Most NLTK taggers accept a backoff tagger, so we can chain the bigram tagger, the unigram tagger, and the default tagger into a combined tagger. For example, we can combine them as follows:

1. Try tagging the token with the bigram tagger.
2. If the bigram tagger is unable to find a tag for the token, try the unigram tagger.
3. If the unigram tagger is also unable to find a tag, use a default tagger.

In [None]:
import nltk
from nltk.corpus import brown

In [None]:
tagged_sents = brown.tagged_sents(categories='news')
size = int(len(tagged_sents) * 0.9)
train_sets = tagged_sents[:size]
test_sets = tagged_sents[size:]

In [None]:
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train=train_sets, backoff=t0)
t2 = nltk.BigramTagger(train=train_sets, backoff=t1)

In [None]:
print(t2.evaluate(train_sets))
print(t2.evaluate(test_sets))
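The three-step backoff order described above can be sketched in plain Python. The function `tag_with_backoff`, the `"<s>"` start marker, and the hand-written models are assumptions for illustration and do not show how NLTK chains taggers internally.

```python
START = "<s>"  # hypothetical marker for the start of a sentence


def tag_with_backoff(tokens, bigram_model, unigram_model, default_tag="NN"):
    """Try the bigram context first, then the word alone, then a default tag."""
    tagged, prev_tag = [], START
    for tok in tokens:
        tag = bigram_model.get((prev_tag, tok))   # step 1: bigram lookup
        if tag is None:
            tag = unigram_model.get(tok)          # step 2: unigram lookup
        if tag is None:
            tag = default_tag                     # step 3: default tag
        tagged.append((tok, tag))
        prev_tag = tag
    return tagged


# Hand-written toy models, not trained on a corpus.
bigram_model = {
    (START, "the"): "AT",
    ("AT", "dog"): "NN",
    ("NN", "barks"): "VBZ",
}
unigram_model = {"the": "AT", "dog": "NN", "barks": "VBZ", "loudly": "RB"}

result = tag_with_backoff(
    ["the", "zebra", "barks", "loudly"], bigram_model, unigram_model
)
print(result)
```

Here `zebra` is unknown to both models and receives the default tag, after which the bigram model can tag `barks` again because its preceding tag is no longer `None`. The word `loudly` is resolved by the unigram model.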

In [None]:
import nltk
from nltk.tag import UnigramTagger
from nltk.corpus import treebank

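In [None]:
# Train on the first 3000 sentences of the treebank sample
training = treebank.tagged_sents()[:3000]

In [None]:
unitagger = UnigramTagger(training)

In [None]:
treebank.sents()[0]

In [None]:
unitagger.tag(treebank.sents()[0])

In [None]:
# Hold out the remaining sentences so the test data does not overlap the training data
testing = treebank.tagged_sents()[3000:]

In [None]:
unitagger.evaluate(testing)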

In [None]:
sent = [("A","DT"),("wise","JJ"),("small","JJ"),("girl","NN"),("of","IN"),("village","NN"),("became","VBD"),("leader","NN")]
grammar = "NP: {<DT>?<JJ>*<NN><IN>?<NN>*}"
find = nltk.RegexpParser(grammar)
res = find.parse(sent)
print(res)

In [None]:
res.draw()

<a id = "Summary"></a>

## <span style="color:#0b486b">8. Summary</span>

* `nltk.word_tokenize(text)`: Tokenizes the given sentence and returns a list of words.
* `nltk.pos_tag(words)`: Tags the given word list and returns a list of `(word, tag)` tuples.

* `CategorizedTaggedCorpusReader::tagged_words(fileids, categories)`: Accepts file identifiers or category identifiers and returns a list of `(word, tag)` tuples.

* `CategorizedTaggedCorpusReader::tagged_sents(fileids, categories)`: Accepts file identifiers or category identifiers and returns a list of tagged sentences, where each sentence is a list of `(word, tag)` tuples.

* `SinicaTreebankCorpusReader::tagged_words(fileids)`: Accepts file identifiers and returns a list of `(word, tag)` tuples.

* `SinicaTreebankCorpusReader::tagged_sents(fileids)`: Accepts file identifiers and returns a list of tagged sentences, where each sentence is a list of `(word, tag)` tuples.