# <center>6. Learning to Classify Text</center>

# 1.   Supervised Classification

**Classification** is the task of choosing the correct class label for a given input.

In basic classification tasks, **each input** is considered **in isolation** from all other inputs, and the set of **labels is defined in advance**.

- Deciding whether an email is spam or not (binary classification);

- Deciding what the topic of a news article is, from a fixed list of topic areas such as "sports," "technology," and "politics" (multi-class classification);

- Deciding whether a given occurrence of the word *bank* is used to refer to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution (word sense disambiguation).

The basic classification task has a number of **interesting variants**.

- In multi-label classification, each instance may be assigned multiple labels;

- In open-class classification, the set of labels is not defined in advance;

- In sequence classification, a list of inputs is jointly classified.

A classifier is called **supervised** if it is built based on training corpora containing the correct label for each input. 

The framework used by supervised classification is shown below:

<div align=center>
<img src="https://www.nltk.org/images/supervised-classification.png">
<br>
<center><em><strong>Supervised Classification Framework</strong></em></center>
</div>

## 1.1   Gender Identification

In Chapter 2, we saw that male and female names have some **distinctive characteristics**. 

Names ending in *a*, *e* and *i* are likely to be female, while names ending in *k*, *o*, *r*, *s* and *t* are likely to be male. 

Let's build a classifier to model these differences more precisely.

The **first step** in creating a classifier is deciding what **features** of the input are **relevant**, and how to **encode those features**. 

For this example, we'll start by just **looking at the final letter of a given name**.

The following feature extractor function builds a dictionary containing relevant information about a given name:

In [1]:
import nltk

In [2]:
def gender_features(word):
    return {'last_letter': word[-1]}

In [3]:
gender_features('Shrek')

{'last_letter': 'k'}

The returned dictionary, known as **a feature set**, maps from feature names to their values. 

**Feature names** are case-sensitive strings that typically provide a short **human-readable** description of the feature.

**Feature values** are values with **simple types**, such as booleans, numbers, and strings.

Now that we've defined a feature extractor, we need to prepare a list of examples and corresponding class labels.

In [4]:
from nltk.corpus import names

In [5]:
labeled_names = ([(name, 'male') for name in names.words('male.txt')] 
                 + [(name, 'female') for name in names.words('female.txt')])

In [6]:
type(labeled_names)

list

In [7]:
len(labeled_names)

7944

In [8]:
labeled_names[:5]

[('Aamir', 'male'),
 ('Aaron', 'male'),
 ('Abbey', 'male'),
 ('Abbie', 'male'),
 ('Abbot', 'male')]

In [9]:
import random

# Shuffle the labeled data randomly (fixed seed for reproducibility)
random.seed(10)
random.shuffle(labeled_names)

In [10]:
labeled_names[:5]

[('Gabrila', 'female'),
 ('Rosario', 'female'),
 ('Annabella', 'female'),
 ('Mead', 'female'),
 ('Pepe', 'male')]

Next, we use the feature extractor to process the names data, and divide the resulting list of feature sets into a **training set** and a **test set**.

The training set is used to train a new "naive Bayes" classifier.

In [11]:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]

In [12]:
featuresets[:10]

[({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'o'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'd'}, 'female'),
 ({'last_letter': 'e'}, 'male'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'i'}, 'female'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'e'}, 'female')]

In [13]:
train_set, test_set = featuresets[500:], featuresets[:500]

In [14]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [15]:
classifier.classify(gender_features('Neo'))

'male'

In [16]:
classifier.classify(gender_features('Trinity'))

'female'

We can **systematically evaluate the classifier** on a much larger quantity of unseen data.

In [17]:
print(nltk.classify.accuracy(classifier, test_set))

0.77
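Accuracy here is simply the fraction of test inputs whose predicted label matches the correct label. The sketch below spells that out in plain Python; `toy_predict` and the four test items are made up for illustration and stand in for a trained classifier and a real test set.

```python
def toy_predict(features):
    # Hypothetical stand-in for classifier.classify(): a hand-written rule
    return 'female' if features['last_letter'] in 'aei' else 'male'

def accuracy(predict, labeled_featuresets):
    # Fraction of (features, label) pairs for which the prediction is correct
    correct = sum(1 for features, label in labeled_featuresets
                  if predict(features) == label)
    return correct / len(labeled_featuresets)

# Made-up test items in the same (featureset, label) shape used above
toy_test_set = [({'last_letter': 'a'}, 'female'),
                ({'last_letter': 'k'}, 'male'),
                ({'last_letter': 'e'}, 'male'),
                ({'last_letter': 'n'}, 'female')]

toy_accuracy = accuracy(toy_predict, toy_test_set)
print(toy_accuracy)
```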


**Finally**, we can examine the classifier to determine **which features** it found **most effective** for distinguishing the names' genders:

In [18]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     35.6 : 1.0
             last_letter = 'k'              male : female =     32.9 : 1.0
             last_letter = 'f'              male : female =     15.4 : 1.0
             last_letter = 'p'              male : female =     12.6 : 1.0
             last_letter = 'v'              male : female =     11.3 : 1.0


These ratios are known as **likelihood ratios**, and can be useful for **comparing different feature-outcome relationships**.
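As a rough illustration of where such a ratio comes from, the sketch below divides the relative frequency of a final letter among female names by its relative frequency among male names. The eight names are made up for illustration, and the estimate is unsmoothed, so the numbers NLTK reports for the real corpus will differ.

```python
from collections import Counter

# Toy labeled data (made up for illustration; not taken from the names corpus)
toy_names = [('Anna', 'female'), ('Maria', 'female'),
             ('Julia', 'female'), ('Eve', 'female'),
             ('Luca', 'male'), ('Mark', 'male'),
             ('Jack', 'male'), ('Tom', 'male')]

label_counts = Counter(label for _, label in toy_names)
feature_counts = Counter((name[-1].lower(), label) for name, label in toy_names)

def likelihood(letter, label):
    # Unsmoothed estimate of P(last_letter = letter | label)
    return feature_counts[(letter, label)] / label_counts[label]

# How much more likely is a final 'a' among female names than among male names?
ratio_a = likelihood('a', 'female') / likelihood('a', 'male')
print(ratio_a)
```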

**Your Turn**: 

Modify the `gender_features()` function to provide the classifier with features encoding the length of the name, its first letter, and any other features that seem like they might be informative. Retrain the classifier with these new features, and test its accuracy.
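One possible starting point for this exercise is sketched below. The function name `gender_features_extended` and the choice of features are assumptions made for illustration; whether they actually help should be checked by retraining and measuring accuracy.

```python
def gender_features_extended(name):
    """Hypothetical extension of gender_features() with a few extra features."""
    lowered = name.lower()
    return {
        'last_letter': lowered[-1],
        'first_letter': lowered[0],
        'length': len(name),
        # Extra guess at an informative feature: does the name end in a vowel?
        'ends_with_vowel': lowered[-1] in 'aeiou',
    }

example = gender_features_extended('Shrek')
print(example)
```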

When working with **large corpora**, constructing a single list that contains the features of every instance can use up **a large amount of memory**.

In these cases, use the function `nltk.classify.apply_features`, which returns an object that acts **like a list but does not store all the feature sets in memory**.

In [19]:
from nltk.classify import apply_features

In [20]:
train_set = apply_features(gender_features, labeled_names[500:])

In [21]:
test_set = apply_features(gender_features, labeled_names[:500])

In [22]:
type(train_set)

nltk.collections.LazyMap
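To make the idea concrete, here is a minimal sketch of a list-like wrapper that computes each feature set only when it is requested. `LazyFeatureList` is a hypothetical class written for illustration; it is not how NLTK's `LazyMap` is implemented.

```python
class LazyFeatureList:
    """List-like view that applies a feature extractor on access."""

    def __init__(self, feature_func, labeled_tokens):
        self._func = feature_func
        self._items = labeled_tokens

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing returns another lazy view; nothing is computed yet
            return LazyFeatureList(self._func, self._items[index])
        token, label = self._items[index]
        # The feature set is built here, on demand, and is not stored
        return (self._func(token), label)

def last_letter_features(word):
    return {'last_letter': word[-1]}

lazy = LazyFeatureList(last_letter_features,
                       [('Neo', 'male'), ('Trinity', 'female')])
print(len(lazy), lazy[1])
```

The trade-off is that feature sets are recomputed every time they are read, which costs time but keeps memory use close to that of the raw labeled data.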

## 1.2   Choosing The Right Features

**Selecting relevant features** and deciding how to **encode them** for a learning method can have **an enormous impact** on the learning **method's ability** to extract a good model. 

Although it's often possible to **get decent performance** by using **a fairly simple and obvious set of features**, there are usually **significant gains** to be had by using **carefully constructed features based on a thorough understanding of the task** at hand.

Typically, feature extractors are built through a process of **trial-and-error**, guided by **intuitions** about what information is relevant to the problem. 

It's common to start with a **"kitchen sink"** approach, **including all the features** that you can think of, and **then checking** to see which features actually are helpful.

In [23]:
def gender_features2(name):
    features = {}
    # First letter
    features["first_letter"] = name[0].lower()
    # Last letter
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        # Number of occurrences of each letter a-z
        features["count({})".format(letter)] = name.lower().count(letter)
        # Whether the name contains each letter a-z
        features["has({})".format(letter)] = (letter in name.lower())
    return features

In [24]:
gender_features2('John') 

{'first_letter': 'j',
 'last_letter': 'n',
 'count(a)': 0,
 'has(a)': False,
 'count(b)': 0,
 'has(b)': False,
 'count(c)': 0,
 'has(c)': False,
 'count(d)': 0,
 'has(d)': False,
 'count(e)': 0,
 'has(e)': False,
 'count(f)': 0,
 'has(f)': False,
 'count(g)': 0,
 'has(g)': False,
 'count(h)': 1,
 'has(h)': True,
 'count(i)': 0,
 'has(i)': False,
 'count(j)': 1,
 'has(j)': True,
 'count(k)': 0,
 'has(k)': False,
 'count(l)': 0,
 'has(l)': False,
 'count(m)': 0,
 'has(m)': False,
 'count(n)': 1,
 'has(n)': True,
 'count(o)': 1,
 'has(o)': True,
 'count(p)': 0,
 'has(p)': False,
 'count(q)': 0,
 'has(q)': False,
 'count(r)': 0,
 'has(r)': False,
 'count(s)': 0,
 'has(s)': False,
 'count(t)': 0,
 'has(t)': False,
 'count(u)': 0,
 'has(u)': False,
 'count(v)': 0,
 'has(v)': False,
 'count(w)': 0,
 'has(w)': False,
 'count(x)': 0,
 'has(x)': False,
 'count(y)': 0,
 'has(y)': False,
 'count(z)': 0,
 'has(z)': False}

However, there are usually **limits to the number of features** that you should use with a given learning algorithm.

If you provide **too many features**, then the algorithm will have a higher chance of relying on idiosyncrasies of your training data that don't generalize well to new examples.

This problem is known as **overfitting**, and can be especially problematic when working with **small training sets**. 

For example, if we train a naive Bayes classifier using the above feature extractor `gender_features2`, it will overfit the relatively small training set, resulting in a system whose accuracy is **about 1% lower** than the accuracy of a classifier that **only pays attention to the final letter of each name**:

In [25]:
featuresets = [(gender_features2(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.756


Once **an initial set of features** has been chosen, a very productive method for **refining the feature set** is **error analysis**. 

First, we select a **development set**, containing the corpus **data for creating the model**.

This development set is then subdivided into the **training set** and the **dev-test set**.

In [26]:
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]

The training set is used to **train the model**.

The dev-test set is used to perform **error analysis**.

The test set serves in our **final evaluation of the system**.

<div align=center>
<img src="https://www.nltk.org/images/corpus-org.png">
<br>
<center><em><strong>Organization of corpus data for training supervised classifiers</strong></em></center>
</div>

Having divided the corpus into appropriate datasets, we train a model using the training set, and then run it on the dev-test set.

In [27]:
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.763


Using the dev-test set, we can generate a list of the errors that the classifier makes when predicting name genders:

In [28]:
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))

In [29]:
len(errors)

237

We then examine the individual error cases where the model predicted the wrong label, and adjust the feature set accordingly.

In [30]:
for (tag, guess, name) in sorted(errors):
    print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))

correct=female   guess=male     name=Aeriell                       
correct=female   guess=male     name=Alisun                        
correct=female   guess=male     name=Allsun                        
correct=female   guess=male     name=Allyn                         
correct=female   guess=male     name=Amabel                        
correct=female   guess=male     name=Amargo                        
correct=female   guess=male     name=Beilul                        
correct=female   guess=male     name=Bird                          
correct=female   guess=male     name=Blair                         
correct=female   guess=male     name=Britt                         
correct=female   guess=male     name=Cam                           
correct=female   guess=male     name=Caril                         
correct=female   guess=male     name=Carilyn                       
correct=female   guess=male     name=Carin                         
correct=female   guess=male     name=Carleen    
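Reading a long error list by eye can be slow, so one option is to tally the errors by suffix. The sketch below does this for a handful of made-up `(correct, guess, name)` tuples in the same shape as `errors`; the tuples and the name `suffix_errors` are assumptions for illustration.

```python
from collections import Counter

# Hypothetical error tuples in the same (correct, guess, name) shape as `errors`
toy_errors = [('female', 'male', 'Carilyn'),
              ('female', 'male', 'Allyn'),
              ('female', 'male', 'Britt'),
              ('male', 'female', 'Zach')]

# Count how often each (two-letter suffix, correct label) pair is misclassified
suffix_errors = Counter((name[-2:].lower(), tag)
                        for tag, guess, name in toy_errors)

for (suffix, tag), count in suffix_errors.most_common():
    print('suffix={:<4} correct={:<8} errors={}'.format(suffix, tag, count))
```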

Looking through this list of errors makes it clear that **some suffixes that are more than one letter can be indicative of name genders**. 

For example, names ending in *yn* appear to be predominantly female, despite the fact that names ending in *n* tend to be male.

Names ending in *ch* are usually male, even though names that end in *h* tend to be female.

We therefore adjust our feature extractor to **include features for two-letter suffixes**:

In [31]:
def gender_features(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}

Rebuilding the classifier with the new feature extractor, the performance on the dev-test dataset improves slightly (76.3% -> 76.6%).

In [32]:
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.766


This **error analysis** procedure can then be **repeated**, checking for patterns in the errors that are made by the newly improved classifier. 

Each time the error analysis procedure is repeated, we should select **a different dev-test/training split**, to ensure that the classifier does not start to reflect idiosyncrasies in the dev-test set.
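A minimal sketch of such a re-split is shown below, assuming the test set has already been set aside and only the remaining development data is reshuffled. The helper name `resplit_development_set` and its parameters are hypothetical.

```python
import random

def resplit_development_set(development_items, devtest_size, seed):
    """Return (training, dev-test) lists drawn from a fresh shuffle."""
    shuffled = list(development_items)       # copy, so the input is untouched
    random.Random(seed).shuffle(shuffled)    # a different seed gives a different split
    return shuffled[devtest_size:], shuffled[:devtest_size]

# Toy development data; in the notebook this would be labeled_names[500:]
items = list(range(20))
train, devtest = resplit_development_set(items, devtest_size=5, seed=1)
print(len(train), len(devtest))
```

Using a new `seed` for each round of error analysis changes which items land in the dev-test set, while the held-out test set stays the same.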

Once our model development is complete with the help of the dev-test/training sets, we can use the test set to evaluate how well our model will perform on new input values.

## 1.3   Document Classification

For this example, we've chosen the Movie Reviews Corpus, which categorizes each review as positive or negative.

In [33]:
from nltk.corpus import movie_reviews

In [34]:
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

In [35]:
type(documents)

list

In [36]:
print(documents[12])

(['and', 'now', 'the', 'high', '-', 'flying', 'hong', 'kong', 'style', 'of', 'filmmaking', 'has', 'made', 'its', 'way', 'down', 'to', 'the', 'classics', ',', 'and', 'it', 'isn', "'", 't', 'pretty', '.', 'this', 'time', 'out', 'the', 'nod', 'to', 'asia', 'goes', 'by', 'way', 'of', 'france', 'in', 'the', 'excruciating', 'bland', 'and', 'lukewarm', 'production', 'of', 'the', 'musketeer', ',', 'a', 'version', 'of', 'dumas', "'", 's', 'the', 'three', 'musketeers', '.', 'by', 'bringing', 'in', 'popular', 'asian', 'actor', '/', 'stunt', 'coordinator', 'xing', 'xing', 'xiong', '--', 'whose', 'only', 'prior', 'american', 'attempts', 'at', 'stunt', 'choreography', 'have', 'been', 'the', 'laughable', 'van', 'damme', 'vehicle', 'double', 'team', 'and', 'the', 'dennis', 'rodman', 'cinematic', 'joke', 'simon', 'sez', '--', 'our', 'musketeers', 'are', 'thrown', 'into', 'the', 'air', 'to', 'do', 'their', 'fighting', '.', 'the', 'end', 'result', 'is', 'a', 'tepid', 'and', 'dull', 'action', '/', 'advent

In [37]:
type(documents[12])

tuple

In [38]:
len(documents[12])

2

In [39]:
type(documents[12][0])

list

In [40]:
len(documents[12][0])

568

In [41]:
random.shuffle(documents)

Next, we define **a feature extractor** for documents, so the classifier will know **which aspects** of the data it should pay attention to.

To **limit the number of features** that the classifier needs to process, we begin by constructing **a list of the 2000 most frequent words** in the overall corpus.

We can then define a feature extractor that simply **checks whether each of these words is present** in a given document.

In [42]:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

In [43]:
print(word_features[:30])

[',', 'the', '.', 'a', 'and', 'of', 'to', "'", 'is', 'in', 's', '"', 'it', 'that', '-', ')', '(', 'as', 'with', 'for', 'his', 'this', 'film', 'i', 'he', 'but', 'on', 'are', 't', 'by']


In [44]:
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [45]:
print(document_features(movie_reviews.words('pos/cv957_8737.txt')))

{'contains(,)': True, 'contains(the)': True, 'contains(.)': True, 'contains(a)': True, 'contains(and)': True, 'contains(of)': True, 'contains(to)': True, "contains(')": True, 'contains(is)': True, 'contains(in)': True, 'contains(s)': True, 'contains(")': True, 'contains(it)': True, 'contains(that)': True, 'contains(-)': True, 'contains())': True, 'contains(()': True, 'contains(as)': True, 'contains(with)': True, 'contains(for)': True, 'contains(his)': True, 'contains(this)': True, 'contains(film)': False, 'contains(i)': False, 'contains(he)': True, 'contains(but)': True, 'contains(on)': True, 'contains(are)': True, 'contains(t)': False, 'contains(by)': True, 'contains(be)': True, 'contains(one)': True, 'contains(movie)': True, 'contains(an)': True, 'contains(who)': True, 'contains(not)': True, 'contains(you)': True, 'contains(from)': True, 'contains(at)': False, 'contains(was)': False, 'contains(have)': True, 'contains(they)': True, 'contains(has)': True, 'contains(her)': False, 'conta

We can now train and test a classifier for document classification:

In [46]:
featuresets = [(document_features(d), c) for (d,c) in documents]

In [47]:
len(featuresets)

2000

In [48]:
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [49]:
print(nltk.classify.accuracy(classifier, test_set))

0.77


In [50]:
classifier.show_most_informative_features(5)

Most Informative Features
        contains(seagal) = True              neg : pos    =     12.5 : 1.0
   contains(outstanding) = True              pos : neg    =     10.9 : 1.0
         contains(mulan) = True              pos : neg    =      8.9 : 1.0
         contains(damon) = True              pos : neg    =      7.8 : 1.0
   contains(wonderfully) = True              pos : neg    =      6.7 : 1.0


## 1.4   Part-of-Speech Tagging

We can train a classifier to work out which **suffixes are most informative for POS tagging**. 

Let's begin by finding out what **the most common suffixes** are:

In [51]:
from nltk.corpus import brown

In [52]:
suffix_fdist = nltk.FreqDist()

In [53]:
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1

In [54]:
common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]

In [55]:
print(common_suffixes)

['e', ',', '.', 's', 'd', 't', 'he', 'n', 'a', 'of', 'the', 'y', 'r', 'to', 'in', 'f', 'o', 'ed', 'nd', 'is', 'on', 'l', 'g', 'and', 'ng', 'er', 'as', 'ing', 'h', 'at', 'es', 'or', 're', 'it', '``', 'an', "''", 'm', ';', 'i', 'ly', 'ion', 'en', 'al', '?', 'nt', 'be', 'hat', 'st', 'his', 'th', 'll', 'le', 'ce', 'by', 'ts', 'me', 've', "'", 'se', 'ut', 'was', 'for', 'ent', 'ch', 'k', 'w', 'ld', '`', 'rs', 'ted', 'ere', 'her', 'ne', 'ns', 'ith', 'ad', 'ry', ')', '(', 'te', '--', 'ay', 'ty', 'ot', 'p', 'nce', "'s", 'ter', 'om', 'ss', ':', 'we', 'are', 'c', 'ers', 'uld', 'had', 'so', 'ey']


Next, we'll define a feature extractor function which **checks a given word for these common suffixes**.

In [56]:
def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
    return features

Feature extraction functions behave like tinted glasses, highlighting some of the properties (colors) in our data and making it impossible to see other properties.

The classifier will rely exclusively on these highlighted properties when determining how to label inputs.

In this case, the classifier will make its decisions based only on information about which of the common suffixes (if any) a given word has.

Now that we've defined our feature extractor, we can use it to train a new "decision tree" classifier on the tagged Brown Corpus.

In [57]:
tagged_words = brown.tagged_words(categories='news')

In [58]:
featuresets = [(pos_features(n), g) for (n,g) in tagged_words]

In [59]:
size = int(len(featuresets) * 0.1)

In [60]:
size

10055

In [61]:
train_set, test_set = featuresets[size:], featuresets[:size]

In [62]:
# Training is very slow here; consider saving the model to disk once training finishes
# classifier = nltk.DecisionTreeClassifier.train(train_set)

In [63]:
import dill

# with open('./pos_tree.pkl', "wb") as f:
#     dill.dump(classifier, f)

In [64]:
with open('./pos_tree.pkl','rb') as f:
    classifier = dill.load(f)

In [65]:
nltk.classify.accuracy(classifier, test_set)

0.6270512182993535

In [66]:
classifier.classify(pos_features('cats'))

'NNS'

One nice feature of decision tree models is that they are often **fairly easy to interpret** — we can even instruct NLTK to **print them out as pseudocode**:

In [67]:
print(classifier.pseudocode(depth=4))

if endswith(the) == False: 
  if endswith(,) == False: 
    if endswith(s) == False: 
      if endswith(.) == False: return '.'
      if endswith(.) == True: return '.'
    if endswith(s) == True: 
      if endswith(is) == False: return 'PP$'
      if endswith(is) == True: return 'BEZ'
  if endswith(,) == True: return ','
if endswith(the) == True: return 'AT'



The actual classifier **contains further nested if-then statements** below the ones shown here, but the `depth=4` argument just displays the top portion of the decision tree.

## 1.5   Exploiting Context

By **augmenting the feature extraction function**, we could modify this part-of-speech tagger to leverage **a variety of other word-internal features**, such as the length of the word, the number of syllables it contains, or its prefix. 

However, as long as the feature extractor **just looks at the target word**, we **have no way to add features that depend on the context** that the word appears in. 

But **contextual features** often provide **powerful clues** about the correct tag.

For example, when tagging the word "fly", knowing that the previous word is "a" will allow us to determine that it is functioning as a noun, not a verb.

We will revise the pattern that we used to define our feature extractor and pass in a complete (untagged) sentence, along with the index of the target word. 

The following code uses a context-dependent feature extractor to define a part-of-speech tag classifier.

In [68]:
def pos_features(sentence, i):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    
    # Add the word preceding the target word as a feature
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
    return features

In [69]:
pos_features(brown.sents()[0], 8)

{'suffix(1)': 'n', 'suffix(2)': 'on', 'suffix(3)': 'ion', 'prev-word': 'an'}

In [70]:
tagged_sents = brown.tagged_sents(categories='news')

In [71]:
featuresets = []

In [72]:
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append( (pos_features(untagged_sent, i), tag) )

In [73]:
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

0.7891596220785678

It is clear that exploiting contextual features improves the performance of our part-of-speech tagger.

However, it is unable to **learn the generalization** that a word is probably a noun if it follows an adjective, because it **doesn't have access to the previous word's part-of-speech tag**.

In general, simple classifiers always **treat each input as independent from all other inputs**.

There are often cases, such as part-of-speech tagging, where we are interested in solving classification problems that are closely related to one another.

## 1.6   Sequence Classification

In order to capture **the dependencies between related classification tasks**, we can use **joint classifier models**, which choose an appropriate labeling for a collection of related inputs. 

In the case of part-of-speech tagging, a variety of different sequence classifier models can be used to **jointly choose part-of-speech tags for all the words in a given sentence**.

One sequence classification strategy, known as **consecutive classification** or **greedy sequence classification**, works as follows:

- Find the most likely class label for the first input;

- Use that answer to help find the best label for the next input.

The process can then be repeated until all of the inputs have been labeled.
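The loop just described can be sketched in a few lines. Here `toy_classify` is a hand-written rule set standing in for a trained classifier, and the tag names are chosen for illustration only.

```python
def greedy_label(inputs, classify):
    """Label inputs left to right, feeding earlier decisions to later ones."""
    history = []
    for i in range(len(inputs)):
        # Each decision sees the labels chosen so far and is never revised
        history.append(classify(inputs, i, history))
    return history

def toy_classify(words, i, history):
    # Hypothetical stand-in for a trained classifier
    if words[i] in ('a', 'the'):
        return 'DET'
    if history and history[-1] == 'DET':
        return 'NOUN'
    return 'VERB'

labels = greedy_label(['a', 'fly', 'buzzes'], toy_classify)
print(labels)
```

Because "fly" follows a word already labeled `DET`, the earlier decision steers it toward `NOUN`, which is the kind of dependency a context-free classifier cannot use.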

The bigram tagger in Chapter 5 employed this strategy.

First, we must augment our feature extractor function to **take a history argument**, which **provides a list of the tags that we've predicted for the sentence so far**.

In [74]:
def pos_features(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
                 "suffix(2)": sentence[i][-2:],
                 "suffix(3)": sentence[i][-3:]}
    # Use the POS tag predicted for the preceding word as a feature
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-tag"] = history[i-1]
    return features

Having defined a feature extractor, we can proceed to build our sequence classifier.

During training, we use the annotated tags to provide the appropriate history to the feature extractor, but when tagging new sentences, we generate the history list based on the output of the tagger itself.

In [75]:
class ConsecutivePosTagger(nltk.TaggerI):

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = pos_features(untagged_sent, i, history)
                train_set.append( (featureset, tag) )
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return list(zip(sentence, history))

In [76]:
tagged_sents = brown.tagged_sents(categories='news')

In [77]:
size = int(len(tagged_sents) * 0.1)
train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]
tagger = ConsecutivePosTagger(train_sents)
print(tagger.evaluate(test_sents))

0.7980528511821975


## 1.7   Other Methods for Sequence Classification

One shortcoming of this approach is that we commit to every decision that we make.

For example, if we decide to label a word as a noun, but later find evidence that it should have been a verb, there's **no way to go back and fix our mistake**.

One solution to this problem is to adopt a **transformational strategy** instead. 

Transformational joint classifiers work by creating an initial assignment of labels for the inputs, and then iteratively refining that assignment in an attempt to repair inconsistencies between related inputs.

The **Brill tagger**, described in Chapter 5, is a good example of this strategy.

Another solution is to **assign scores to all of the possible sequences of part-of-speech tags**, and to **choose the sequence whose overall score is highest**. Models that take this approach include Hidden Markov Models, Maximum Entropy Markov Models, and Linear-Chain Conditional Random Field Models.
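Scoring every possible sequence explicitly would be too expensive, so models of this kind are typically decoded with dynamic programming. The sketch below is a minimal Viterbi-style search over additive scores; all score tables, the `UNSEEN` penalty, and the tag set are made-up values for illustration, not parameters learned from a corpus.

```python
UNSEEN = -10  # assumed penalty for a word a tag was never seen to emit

def viterbi(observations, tags, start_score, transition_score, emission_score):
    """Return the tag sequence with the highest total (additive) score."""
    # best[i][tag] = (score of the best sequence ending in tag at i, previous tag)
    best = [{tag: (start_score[tag]
                   + emission_score[tag].get(observations[0], UNSEEN), None)
             for tag in tags}]
    for obs in observations[1:]:
        column = {}
        for tag in tags:
            prev = max(tags,
                       key=lambda p: best[-1][p][0] + transition_score[p][tag])
            score = (best[-1][prev][0] + transition_score[prev][tag]
                     + emission_score[tag].get(obs, UNSEEN))
            column[tag] = (score, prev)
        best.append(column)
    # Follow the back-pointers from the best final tag
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path

tags = ['DET', 'NOUN', 'VERB']
start = {'DET': 0, 'NOUN': -2, 'VERB': -2}
trans = {'DET': {'DET': -5, 'NOUN': 0, 'VERB': -4},
         'NOUN': {'DET': -3, 'NOUN': -2, 'VERB': 0},
         'VERB': {'DET': -1, 'NOUN': -2, 'VERB': -3}}
emit = {'DET': {'the': 0},
        'NOUN': {'fly': -1, 'buzzes': -3},
        'VERB': {'fly': -1, 'buzzes': -1}}

best_path = viterbi(['the', 'fly', 'buzzes'], tags, start, trans, emit)
print(best_path)
```

Unlike the greedy tagger, this search keeps the best-scoring path into every tag at every position, so an early choice can still be overturned by later evidence.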

# 2.   Further Examples of Supervised Classification

## 2.1   Sentence Segmentation

Sentence segmentation can be viewed as a classification task for punctuation: whenever we encounter a symbol that could possibly end a sentence, such as a period or a question mark, we have to decide whether it terminates the preceding sentence.

The first step is to obtain some data that **has already been segmented into sentences** and **convert it into a form that is suitable for extracting features**:

In [78]:
sents = nltk.corpus.treebank_raw.sents()
tokens = []
boundaries = set()
offset = 0
for sent in sents:
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset-1)

In [79]:
print(tokens[:40])

['.', 'START', 'Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov', '.', '29', '.', 'Mr', '.', 'Vinken', 'is', 'chairman', 'of', 'Elsevier', 'N', '.', 'V', '.,', 'the', 'Dutch', 'publishing', 'group', '.', '.', 'START', 'Rudolph']


In [80]:
sorted(list(boundaries))

[1,
 20,
 36,
 38,
 64,
 66,
 102,
 134,
 163,
 199,
 211,
 228,
 237,
 258,
 286,
 310,
 344,
 365,
 387,
 406,
 429,
 453,
 489,
 512,
 558,
 617,
 637,
 655,
 671,
 692,
 705,
 739,
 763,
 796,
 811,
 821,
 823,
 846,
 891,
 907,
 934,
 958,
 976,
 1013,
 1048,
 1077,
 1092,
 1117,
 1142,
 1153,
 1192,
 1213,
 1235,
 1273,
 1275,
 1312,
 1332,
 1348,
 1350,
 1378,
 1398,
 1400,
 1424,
 1431,
 1448,
 1463,
 1479,
 1481,
 1505,
 1529,
 1552,
 1570,
 1601,
 1624,
 1626,
 1656,
 1679,
 1702,
 1716,
 1718,
 1751,
 1755,
 1773,
 1791,
 1832,
 1863,
 1892,
 1897,
 1926,
 1943,
 1962,
 1993,
 2009,
 2032,
 2051,
 2065,
 2103,
 2125,
 2148,
 2166,
 2168,
 2196,
 2231,
 2272,
 2299,
 2320,
 2341,
 2368,
 2380,
 2382,
 2411,
 2440,
 2492,
 2513,
 2529,
 2585,
 2599,
 2620,
 2659,
 2712,
 2739,
 2769,
 2782,
 2812,
 2838,
 2861,
 2863,
 2903,
 2946,
 2997,
 3040,
 3096,
 3125,
 3134,
 3148,
 3177,
 3182,
 3183,
 3184,
 3211,
 3245,
 3288,
 3299,
 3326,
 3365,
 3390,
 3392,
 3438,
 3472,
 3474,
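The offset bookkeeping in the loop above is easier to follow on a tiny input. The two sentences below are made up for illustration; the variable names carry a `toy_` prefix so they do not overwrite `tokens` and `boundaries`.

```python
# Two made-up, already segmented sentences
toy_sents = [['Hello', 'world', '.'],
             ['Is', 'it', 'Mr', '.', 'Smith', '?']]

toy_tokens = []
toy_boundaries = set()
offset = 0
for sent in toy_sents:
    toy_tokens.extend(sent)          # merge all sentences into one token list
    offset += len(sent)              # offset now points just past this sentence
    toy_boundaries.add(offset - 1)   # index of the sentence's final token

print(toy_tokens)
print(sorted(toy_boundaries))
```

The period after "Mr" sits at index 6, which is not in `toy_boundaries`, so it becomes a negative training example for the punctuation classifier.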


Next, we need to specify the features of the data that will be used in order to decide whether punctuation indicates a sentence-boundary:

In [81]:
def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prev-word': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}

Based on this feature extractor, we can create a list of labeled featuresets by selecting all the punctuation tokens, and tagging whether they are boundary tokens or not:

In [82]:
featuresets = [(punct_features(tokens, i), (i in boundaries))
               for i in range(1, len(tokens)-1)
               if tokens[i] in '.?!']

Using these featuresets, we can train and evaluate a punctuation classifier:

In [83]:
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

0.936026936026936

To **use this classifier to perform sentence segmentation**, we simply check each punctuation mark to see whether it's labeled as a boundary; and **divide the list of words at the boundary marks**.

In [84]:
def segment_sentences(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        # punct_features() reads both neighbors, so skip the first and last tokens
        if (word in '.?!' and 0 < i < len(words) - 1
                and classifier.classify(punct_features(words, i))):
            sents.append(words[start:i+1])
            start = i+1
    if start < len(words):
        sents.append(words[start:])
    return sents
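To try the segmentation logic without a trained model, the sketch below swaps the classifier for a hand-written rule. `toy_is_boundary`, the abbreviation list, and the sample tokens are all assumptions made for illustration.

```python
ABBREVIATIONS = {'mr', 'mrs', 'dr'}  # made-up list for this sketch

def toy_is_boundary(words, i):
    # Hypothetical stand-in for classifier.classify(punct_features(words, i))
    if i == len(words) - 1:
        return True
    next_capitalized = words[i + 1][0].isupper()
    prev_is_abbrev = i > 0 and words[i - 1].lower() in ABBREVIATIONS
    return next_capitalized and not prev_is_abbrev

def segment(words, is_boundary):
    start, sents = 0, []
    for i, word in enumerate(words):
        if word in ('.', '?', '!') and is_boundary(words, i):
            sents.append(words[start:i + 1])   # close the sentence at this mark
            start = i + 1
    if start < len(words):
        sents.append(words[start:])            # keep any trailing tokens
    return sents

words = ['Mr', '.', 'Smith', 'left', '.', 'He', 'was', 'late', '.']
segmented = segment(words, toy_is_boundary)
print(segmented)
```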

## 2.2   Identifying Dialogue Act Types

When processing dialogue, it can be useful to **think of utterances as a type of action performed by the speaker**.

**Greetings, questions, answers, assertions, and clarifications** can all be thought of as types of speech-based actions.

**Recognizing the dialogue acts underlying the utterances** in a dialogue can be an important first step in **understanding the conversation**.

The NPS Chat Corpus consists of over 10,000 posts from instant messaging sessions.

These posts have all been labeled with one of 15 dialogue act types, such as "Statement," "Emotion," "ynQuestion," and "Continuer."

In [85]:
posts = nltk.corpus.nps_chat.xml_posts()[:10000]

In [86]:
type(posts)

nltk.collections.LazySubsequence

Next, we'll define a simple feature extractor that checks what words the post contains:

In [87]:
def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features
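
A quick way to see what this extractor produces is to run it on a single string. The sketch below assumes whitespace splitting in place of `nltk.word_tokenize` (so it runs without NLTK's tokenizer models); the `tokenize` parameter is an addition made for this illustration only.

```python
def dialogue_act_features(post, tokenize=str.split):
    # tokenize defaults to whitespace splitting; this is an assumption made
    # so the sketch runs without NLTK's tokenizer data.
    features = {}
    for word in tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

features = dialogue_act_features('Hi how are you ?')
print(features)

# Repeated words collapse into one key: the features record presence, not counts.
print(dialogue_act_features('no no NO'))
```

Because every value is `True`, the featureset is a bag of words: word order and word frequency are both discarded.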

Finally, we construct the training and testing data by applying the feature extractor to each post (using `post.get('class')` to get a post's dialogue act type), and create a new classifier:

In [88]:
featuresets = [(dialogue_act_features(post.text), post.get('class'))
               for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.667


## 2.3   Recognizing Textual Entailment

Recognizing textual entailment (RTE) is the task of determining whether a given piece of text T entails another text called the "hypothesis".

**T**: Parviz Davudi was representing Iran at a meeting of the Shanghai Co-operation Organisation (SCO), the fledgling association that binds Russia, China and four former Soviet republics of central Asia together to fight terrorism.

**H**: China is a member of SCO.

**TRUE**

**T**: According to NC Articles of Organization, the members of LLC company are H. Nelson Beavers, III, H. Chester Beavers and Jennie Beavers Stewart.

**H**: Jennie Beavers Stewart is a share-holder of Carolina Analytical Laboratory.

**FALSE**

We can treat RTE as a classification task, in which we try to predict the True/False label for each pair. 

Although it seems likely that successful approaches to this task will involve a combination of parsing, semantics and real world knowledge, many early attempts at RTE achieved reasonably good results with shallow analysis, based on **similarity between the text and hypothesis at the word level**.

If there is an entailment, then all the information expressed by the hypothesis should also be present in the text. 

Conversely, if **there is information found in the hypothesis that is absent from the text**, then there will be no entailment.

In our RTE feature detector below, we let **words serve as proxies for information**, and our features count **the degree of word overlap**, and the degree to which **there are words in the hypothesis but not in the text** (captured by the method `hyp_extra()`).

**Not all words are equally important** — **Named Entity mentions** such as the names of people, organizations and places are likely to be **more significant**, which motivates us to extract distinct information for words and nes (Named Entities). 

In addition, some **high frequency function words** are filtered out as "**stopwords**".
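
The word-overlap idea can be sketched with plain Python sets. This is a simplified illustration, not a reimplementation of NLTK's `RTEFeatureExtractor`: the regex tokenizer, the tiny `STOPWORDS` set, and the helper names are all assumptions, and no distinction is made between ordinary words and named entities.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {'a', 'an', 'and', 'are', 'at', 'is', 'of', 'that', 'the', 'to', 'was'}

def content_words(sentence):
    """Return the set of alphabetic tokens that are not stopwords."""
    return {w for w in re.findall(r'[A-Za-z]+', sentence)
            if w.lower() not in STOPWORDS}

def overlap_features(text, hyp):
    text_words = content_words(text)
    hyp_words = content_words(hyp)
    return {'word_overlap': len(hyp_words & text_words),      # shared words
            'word_hyp_extra': len(hyp_words - text_words)}    # only in hypothesis

text1 = ('Parviz Davudi was representing Iran at a meeting of the Shanghai '
         'Co-operation Organisation (SCO), the fledgling association that binds '
         'Russia, China and four former Soviet republics of central Asia '
         'together to fight terrorism.')
hyp1 = 'China is a member of SCO.'

text2 = ('According to NC Articles of Organization, the members of LLC company '
         'are H. Nelson Beavers, III, H. Chester Beavers and Jennie Beavers Stewart.')
hyp2 = ('Jennie Beavers Stewart is a share-holder of Carolina Analytical '
        'Laboratory.')

feats1 = overlap_features(text1, hyp1)
feats2 = overlap_features(text2, hyp2)
print(feats1)
print(feats2)
```

Under these assumptions the second pair leaves more hypothesis words unaccounted for than the first, which is the kind of signal the overlap features are meant to capture.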

In [89]:
def rte_features(rtepair):
    extractor = nltk.RTEFeatureExtractor(rtepair)
    features = {}
    features['word_overlap'] = len(extractor.overlap('word'))
    features['word_hyp_extra'] = len(extractor.hyp_extra('word'))
    features['ne_overlap'] = len(extractor.overlap('ne'))
    features['ne_hyp_extra'] = len(extractor.hyp_extra('ne'))
    return features

In [90]:
# nltk.download('rte')

In [91]:
rtepair = nltk.corpus.rte.pairs(['rte3_dev.xml'])[33]

In [92]:
help(rtepair)

Help on RTEPair in module nltk.corpus.reader.rte object:

class RTEPair(builtins.object)
 |  RTEPair(pair, challenge=None, id=None, text=None, hyp=None, value=None, task=None, length=None)
 |  
 |  Container for RTE text-hypothesis pairs.
 |  
 |  The entailment relation is signalled by the ``value`` attribute in RTE1, and by
 |  ``entailment`` in RTE2 and RTE3. These both get mapped on to the ``entailment``
 |  attribute of this class.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, pair, challenge=None, id=None, text=None, hyp=None, value=None, task=None, length=None)
 |      :param challenge: version of the RTE challenge (i.e., RTE1, RTE2 or RTE3)
 |      :param id: identifier for the pair
 |      :param text: the text component of the pair
 |      :param hyp: the hypothesis component of the pair
 |      :param value: classification label for the pair
 |      :param task: attribute for the particular NLP task that the data was drawn from
 |      :param length: attribute for t

In [93]:
rtepair.hyp

'China is a member of SCO.'

In [94]:
rtepair.text

'Parviz Davudi was representing Iran at a meeting of the Shanghai Co-operation Organisation (SCO), the fledgling association that binds Russia, China and four former Soviet republics of central Asia together to fight terrorism.'

In [95]:
extractor = nltk.RTEFeatureExtractor(rtepair)

In [96]:
print(extractor.text_words)

{'Co', 'four', 'Russia', 'Organisation', 'republics', 'former', 'Shanghai', 'Parviz', 'that', 'Iran', 'central', 'was', 'meeting', 'operation', 'binds', 'SCO', 'Asia', 'together', 'China', 'association', 'Soviet', 'fight', 'at', 'Davudi', 'representing', 'fledgling', 'terrorism.'}


In [97]:
print(extractor.hyp_words)

{'member', 'SCO.', 'China'}


In [98]:
print(extractor.overlap('word'))

set()


In [99]:
print(extractor.overlap('ne'))

{'China'}


In [100]:
print(extractor.hyp_extra('word'))

{'member'}


These features indicate that all important words in the hypothesis are contained in the text, and thus there is some evidence for labeling this as *True*.

## 2.4   Scaling Up to Large Datasets

Python provides an excellent environment for performing basic text processing and feature extraction. However, it is not able to perform the numerically intensive calculations required by machine learning methods nearly as quickly as lower-level languages such as C.

If you plan to train classifiers with large amounts of training data or a large number of features, we recommend that you explore NLTK's facilities for interfacing with external machine learning packages.

# 3.   Evaluation

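To decide whether a classification model is accurately capturing a pattern, we must evaluate that model. The result of this evaluation is important for deciding **how trustworthy the model is**, and **for what purposes we can use it**.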

Evaluation can also be an effective tool for guiding us in **making future improvements to the model**.

## 3.1   The Test Set

It is very important that the test set be distinct from the training corpus.

When building the test set, there is often a trade-off between the amount of data available for testing and the amount available for training.

For classification tasks that have a small number of well-balanced labels and a diverse test set, a meaningful evaluation can be performed with as few as 100 evaluation instances. 

But if a classification task has a large number of labels, or includes very infrequent labels, then the size of the test set should be chosen to ensure that the least frequent label occurs at least 50 times. 

Additionally, if the test set contains many closely related instances — such as instances drawn from a single document — then the size of the test set should be increased to ensure that this lack of diversity does not skew the evaluation results. 

When large amounts of annotated data are available, **it is common to err on the side of safety by using 10% of the overall data for evaluation**.
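
The rule of thumb about infrequent labels is easy to check before trusting an evaluation. The helper below is a minimal sketch: `rare_labels` is a hypothetical name, the test set is invented, and the threshold of 50 is taken from the guideline above.

```python
from collections import Counter

def rare_labels(test_set, min_count=50):
    """Return {label: count} for labels seen fewer than min_count times.

    test_set is a list of (featureset, label) pairs, the same shape
    used for the classifiers in this chapter.
    """
    counts = Counter(label for _, label in test_set)
    return {label: n for label, n in counts.items() if n < min_count}

# Hypothetical test set: 120 'Statement', 60 'ynQuestion', 8 'Continuer'.
test_set = ([({}, 'Statement')] * 120
            + [({}, 'ynQuestion')] * 60
            + [({}, 'Continuer')] * 8)

too_rare = rare_labels(test_set)
print(too_rare)
```

Any label reported here has too few test instances for its per-label results to be reliable, which suggests enlarging the test set.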

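The POS tagging task below illustrates a further consideration: even when 10% of a large annotated corpus is held out for evaluation, how that 10% is chosen affects how trustworthy the result is.

If the 10% test set is drawn from the same genre (such as news) as the training data, the examples in the development set and the test set will be very similar, so the test results cannot accurately measure how well the model generalizes: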

In [101]:
import random
from nltk.corpus import brown
tagged_sents = list(brown.tagged_sents(categories='news'))
random.shuffle(tagged_sents)
size = int(len(tagged_sents) * 0.1)
train_set, test_set = tagged_sents[size:], tagged_sents[:size]

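Because the code above uses `random.shuffle`, the test set and the training set contain sentences from the same documents. This makes the two sets even more similar and further undermines the test results as a measure of generalization.

For an annotated corpus from a single genre, we can instead draw the training data and the test data from different documents: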

In [102]:
file_ids = brown.fileids(categories='news')
size = int(len(file_ids) * 0.1)
train_set = brown.tagged_sents(file_ids[size:])
test_set = brown.tagged_sents(file_ids[:size])
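
The difference between the two splitting strategies can be seen on synthetic data. The sketch below uses a hypothetical corpus of `(doc_id, sentence_index)` pairs instead of the Brown corpus, and counts how many documents contribute sentences to both the training set and the test set under each strategy.

```python
import random

# Hypothetical corpus: 10 documents with 20 sentences each,
# stored as (doc_id, sentence_index) pairs.
corpus = [(doc, s) for doc in range(10) for s in range(20)]

def docs(items):
    """Return the set of document ids that appear in items."""
    return {doc for doc, _ in items}

# Sentence-level split: shuffle all sentences, then hold out 10%.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
shuffled = corpus[:]
rng.shuffle(shuffled)
size = int(len(shuffled) * 0.1)
sent_train, sent_test = shuffled[size:], shuffled[:size]

# Document-level split: hold out 10% of the documents.
doc_ids = sorted(docs(corpus))
n_test_docs = int(len(doc_ids) * 0.1)
test_docs = set(doc_ids[:n_test_docs])
doc_train = [item for item in corpus if item[0] not in test_docs]
doc_test = [item for item in corpus if item[0] in test_docs]

shared_sentence_split = docs(sent_train) & docs(sent_test)
shared_document_split = docs(doc_train) & docs(doc_test)
print('documents shared after sentence-level split:', len(shared_sentence_split))
print('documents shared after document-level split:', len(shared_document_split))
```

With the document-level split no document appears on both sides, whereas shuffling sentences spreads each test document across the training set as well.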

For a more stringent evaluation, we can draw the training set and the test set from different genres:

In [103]:
train_set = brown.tagged_sents(categories='news')
test_set = brown.tagged_sents(categories='fiction')

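If a classifier built with this split performs well on the test set, we can be confident that it generalizes well beyond the data it was trained on.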

## 3.2   Accuracy