# 6. Learning to Classify Text

## Part 1.   Supervised Classification
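Classification is the task of assigning the correct class label to a given input.<br>
In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance.<br>

A classifier is called "supervised" if it is built from a training corpus in which each input is annotated with the correct label.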

![Imgur](https://i.imgur.com/SO0thhn.png)

## 1.1   Gender Identification
### Build a classifier that guesses gender from the last letter of a name

In [5]:
# Load the male and female name lists from the NLTK names corpus
from nltk.corpus import names
import random
import nltk

# Build a list of (name, label) tuples from the two name files
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])

# Shuffle the list; otherwise all male names come before all female names
random.shuffle(labeled_names)

# Feature extractor: return the last character of the word
def gender_features(word):
    return {'last_letter': word[-1]}

# featuresets holds (feature dict, gender) tuples, one per name
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]

# Hold out the first 500 items as the test set; train on the rest
train_set, test_set = featuresets[500:], featuresets[:500]

# Train a Naive Bayes classifier on the training set
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [2]:
# Test the classifier
classifier.classify(gender_features('Neo'))

'male'

In [3]:
classifier.classify(gender_features('Trinity'))

'female'

In [6]:
# The data is shuffled randomly, so the accuracy varies from run to run
print(nltk.classify.accuracy(classifier, test_set))

0.776


In [7]:
# Show the most informative features
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     34.2 : 1.0
             last_letter = 'k'              male : female =     31.8 : 1.0
             last_letter = 'f'              male : female =     16.0 : 1.0
             last_letter = 'p'              male : female =     12.6 : 1.0
             last_letter = 'v'              male : female =      9.9 : 1.0


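### Storing the feature sets of every instance in a list can use a lot of memory when the corpus is large
### `nltk.classify.apply_features` avoids this by returning an object that behaves like a list but does not hold all the feature sets in memory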

In [9]:
from nltk.classify import apply_features

train_set = apply_features(gender_features, labeled_names[500:])
test_set = apply_features(gender_features, labeled_names[:500])

print(test_set)

[({'last_letter': 'n'}, 'male'), ({'last_letter': 'i'}, 'female'), ...]
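
The idea behind a lazily computed feature list can be sketched in plain Python. The `LazyFeatureList` class and the `last_letter` helper below are hypothetical names introduced for illustration; this is not NLTK's implementation, only a minimal sketch that keeps the labeled items and computes each feature dict when it is requested.

```python
class LazyFeatureList:
    """List-like view that computes (features, label) pairs on demand.

    Hypothetical illustration only; it supports integer indexing,
    len(), and iteration, but not slicing.
    """

    def __init__(self, feature_func, labeled_items):
        self._feature_func = feature_func
        self._labeled_items = labeled_items

    def __len__(self):
        return len(self._labeled_items)

    def __getitem__(self, index):
        # Features are built here, at access time, and not stored.
        item, label = self._labeled_items[index]
        return (self._feature_func(item), label)


def last_letter(word):
    return {'last_letter': word[-1]}


lazy = LazyFeatureList(last_letter, [('John', 'male'), ('Mary', 'female')])
print(len(lazy), lazy[0])
```

Only the original `(name, label)` pairs stay in memory; the trade-off is that features are recomputed every time an item is read.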


## 1.2   Choosing The Right Features
### Choosing good features for a classifier can improve its accuracy

In [144]:
def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

gender_features2('John') 

{'count(a)': 0,'count(b)': 0,'count(c)': 0,'count(d)': 0,'count(e)': 0,'count(f)': 0,'count(g)': 0,'count(h)': 1,'count(i)': 0,'count(j)': 1,'count(k)': 0,'count(l)': 0,'count(m)': 0,'count(n)': 1,'count(o)': 1,'count(p)': 0,'count(q)': 0,'count(r)': 0,'count(s)': 0,'count(t)': 0,'count(u)': 0,'count(v)': 0,'count(w)': 0,'count(x)': 0,'count(y)': 0,'count(z)': 0,'first_letter': 'j','has(a)': False,'has(b)': False,
 'has(c)': False,'has(d)': False,'has(e)': False,'has(f)': False,'has(g)': False,'has(h)': True,'has(i)': False,'has(j)': True,'has(k)': False,'has(l)': False,'has(m)': False,'has(n)': True,'has(o)': True,'has(p)': False,'has(q)': False,'has(r)': False,'has(s)': False,'has(t)': False,'has(u)': False,'has(v)': False,'has(w)': False,'has(x)': False,'has(y)': False,'has(z)': False,'last_letter': 'n'}

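### However, the number of features should be kept in check: with too many features, the learning algorithm is more likely to rely on idiosyncrasies of the training data that generalize poorly to new examples
### This problem is called overfitting, and it is especially pronounced with small training sets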

In [64]:
featuresets = [(gender_features2(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.762


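### In the example above, gender_features2 extracts many more features, yet its accuracy is lower than that of gender_features, which uses only the last letter of the name

### Once an initial feature set is chosen, error analysis is an effective way to refine it
### First we select a development set containing the corpus data used to build the model; the development set is then split into a training set and a dev-test set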

In [65]:
# train_names trains the model, devtest_names is used for error analysis, test_names is held out for the final evaluation
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]

### The corpus is split into separate sets
![Imgur](https://i.imgur.com/vLxZNXu.png)


In [67]:
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]

# Train the classifier on train_set
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Evaluate the classifier on devtest_set
print(nltk.classify.accuracy(classifier, devtest_set))

0.738


In [71]:
# Collect the misclassified names in errors
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )
#print(errors)

('gender : male', 'guess : female', 'Mortie'), <br>
('gender : male', 'guess : female', 'Bartie'),<br>
('gender : male', 'guess : female', 'Jackie'),<br>
('gender : male', 'guess : female', 'Jeremie'),<br>
('gender : male', 'guess : female', 'Charlie'),<br>
<br>
('gender : female', 'guess : male', 'Dorian'),<br>
('gender : female', 'guess : male', 'Christan'), <br>
('gender : female', 'guess : male', 'Jillian'),<br>
('gender : female', 'guess : male', 'Lilyan')<br>

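### The classifier labels names ending in e as female, but the errors above include many male names ending in ie; likewise, the female names ending in an were guessed as male
### These errors suggest that using the last two letters of a name as a feature may be more accurate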
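
One way to spot such patterns without reading the error list by eye is to tally the errors by suffix. The sketch below copies the nine errors from the listing above into a small list (a real run would use the full `errors` list) and counts each pair of gold label and two-letter suffix.

```python
from collections import Counter

# A few (gold, guess, name) errors copied from the listing above.
errors = [
    ('male', 'female', 'Mortie'),
    ('male', 'female', 'Bartie'),
    ('male', 'female', 'Jackie'),
    ('male', 'female', 'Jeremie'),
    ('male', 'female', 'Charlie'),
    ('female', 'male', 'Dorian'),
    ('female', 'male', 'Christan'),
    ('female', 'male', 'Jillian'),
    ('female', 'male', 'Lilyan'),
]

# Count how often each (gold label, two-letter suffix) pair is misclassified.
suffix_errors = Counter((gold, name[-2:].lower()) for gold, guess, name in errors)

for (gold, suffix), count in suffix_errors.most_common():
    print(gold, suffix, count)
```

Suffixes with high counts are candidates for new features, which is what motivates the two-letter suffix feature.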

In [74]:
# Feature extractor that uses the last one and two characters of a name
def gender_features3(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}

train_set = [(gender_features3(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features3(n), gender) for (n, gender) in devtest_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.763


### After retraining the classifier with the two-letter suffix feature and evaluating on devtest_set again, the accuracy improves from 0.738 to 0.763

## 1.3   Document Classification

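Load all positive and negative reviews from the movie_reviews corpus and store the 2000 most frequent tokens as word_features.<br>
For each review, check which tokens in word_features it contains; the resulting feature sets are the training data for the classifier.<br>
The classifier can then predict pos or neg from the tokens a review does or does not contain.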

In [10]:
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

In [11]:
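# Store the 2000 most frequent tokens as word_features
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# most_common() makes the frequency ordering explicit; the order of list(all_words) depends on the NLTK version
word_features = [word for (word, count) in all_words.most_common(2000)]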

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

# Check which words from word_features a document contains
#print(document_features(movie_reviews.words('pos/cv957_8737.txt')))

In [15]:
# d is the list of words in a document and c is its category (pos or neg); build one feature set per document
featuresets = [(document_features(d), c) for (d,c) in documents]

train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

love = {'contains(i)': True, 'contains(love)': True, 'contains(this)': True, 'contains(movie)': True}
hate = {'contains(i)': True, 'contains(hate)': True, 'contains(this)': True, 'contains(movie)': True}
print(classifier.classify(hate),'\n')
classifier.show_most_informative_features(5)

neg 

Most Informative Features
    contains(schumacher) = True              neg : pos    =      7.4 : 1.0
        contains(wasted) = True              neg : pos    =      7.4 : 1.0
 contains(unimaginative) = True              neg : pos    =      7.0 : 1.0
        contains(turkey) = True              neg : pos    =      6.8 : 1.0
     contains(atrocious) = True              neg : pos    =      6.6 : 1.0


## 1.4   Part-of-Speech Tagging
Train a classifier to work out which word suffixes are most informative for predicting a part-of-speech tag.

In [222]:
# Count the last one, two, and three characters of every token in the Brown corpus to find the most frequent suffixes
from nltk.corpus import brown
suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1

In [223]:
# common_suffixes holds the 100 most frequent suffixes
common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]
print(common_suffixes)

['e', ',', '.', 's', 'd', 't', 'he', 'n', 'a', 'of', 'the', 'y', 'r', 'to', 'in', 'f', 'o', 'ed', 'nd', 'is', 'on', 'l', 'g', 'and', 'ng', 'er', 'as', 'ing', 'h', 'at', 'es', 'or', 're', 'it', '``', 'an', "''", 'm', ';', 'i', 'ly', 'ion', 'en', 'al', '?', 'nt', 'be', 'hat', 'st', 'his', 'th', 'll', 'le', 'ce', 'by', 'ts', 'me', 've', "'", 'se', 'ut', 'was', 'for', 'ent', 'ch', 'k', 'w', 'ld', '`', 'rs', 'ted', 'ere', 'her', 'ne', 'ns', 'ith', 'ad', 'ry', ')', '(', 'te', '--', 'ay', 'ty', 'ot', 'p', 'nce', "'s", 'ter', 'om', 'ss', ':', 'we', 'are', 'c', 'ers', 'uld', 'had', 'so', 'ey']


In [224]:
# Feature extractor: for each suffix in common_suffixes, record whether the word ends with it
def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
    return features

#print(pos_features('suffixe'))

In [227]:
# Tagged words (word, tag pairs) from the news category of the Brown corpus

tagged_words = brown.tagged_words(categories='news')

# One feature set per word: which of the common suffixes it ends with
featuresets = [(pos_features(n), g) for (n,g) in tagged_words]

# Hold out the first 10% of the feature sets as the test set
size = int(len(featuresets) * 0.1)

train_set, test_set = featuresets[size:], featuresets[:size]

# Train a decision tree classifier
classifier = nltk.DecisionTreeClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

classifier.classify(pos_features('cats'))

0.6270512182993535


'NNS'

#### A nice property of decision trees is that they are easy to interpret; NLTK can even print one as pseudocode

In [226]:
print(classifier.pseudocode(depth=4))

if endswith(the) == False: 
  if endswith(,) == False: 
    if endswith(s) == False: 
      if endswith(.) == False: return '.'
      if endswith(.) == True: return '.'
    if endswith(s) == True: 
      if endswith(is) == False: return 'PP$'
      if endswith(is) == True: return 'BEZ'
  if endswith(,) == True: return ','
if endswith(the) == True: return 'AT'



## 1.5   Exploiting Context
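A single token on its own provides only limited features, whereas its context often gives strong clues about the correct tag.<br>
For example, when fly follows a, we know it is a noun rather than a verb.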

In [231]:
# When i is 0, the token is at the start of the sentence
# Features: the last 1, 2, and 3 characters of token i, plus the previous word
def pos_features(sentence, i):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        # The previous token
        features["prev-word"] = sentence[i-1]
    return features

print(brown.sents()[0])
pos_features(brown.sents()[0], 8)

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']


{'prev-word': 'an', 'suffix(1)': 'n', 'suffix(2)': 'on', 'suffix(3)': 'ion'}

In [271]:
# Sentences from the news category, with a tag on every token
tagged_sents = brown.tagged_sents(categories='news')

featuresets = []
for tagged_sent in tagged_sents:
    # Strip the tags to recover the plain sentence
    untagged_sent = nltk.tag.untag(tagged_sent)
    #print(tagged_sent)
    #print(untagged_sent)

    # enumerate supplies the index i of each token in the sentence
    for i, (word, tag) in enumerate(tagged_sent):
        #print(word, tag)
        # Pair the features of token i (suffixes and previous word) with its tag
        featuresets.append( (pos_features(untagged_sent, i), tag))

# Hold out the first 10% of featuresets as the test set; train on the rest
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(test_set[1][0])
classifier.classify(test_set[1][0])
#nltk.classify.accuracy(classifier, test_set)

{'suffix(1)': 'n', 'suffix(2)': 'on', 'suffix(3)': 'ton', 'prev-word': 'The'}


'NN'

## 1.6   Sequence Classification
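Add a history parameter to the feature extractor. history holds the tags already assigned to the earlier tokens of the sentence, so the tag of the current token is predicted from the previous token and the previous token's tag.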

In [None]:
def pos_features(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
                 "suffix(2)": sentence[i][-2:],
                 "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        # prev-word and prev-tag hold the previous token and its tag
        features["prev-word"] = sentence[i-1]
        features["prev-tag"] = history[i-1]
    return features

class ConsecutivePosTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            # Strip the tags from the tagged sentence
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = pos_features(untagged_sent, i, history)
                train_set.append( (featureset, tag) )
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        # Return a list of (word, tag) pairs rather than a lazy zip object
        return list(zip(sentence, history))
    
tagged_sents = brown.tagged_sents(categories='news')
size = int(len(tagged_sents) * 0.1)
train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]
tagger = ConsecutivePosTagger(train_sents)
print(tagger.evaluate(test_sents))
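
To see how the history carries information from one decision to the next, here is a minimal sketch in plain Python. The names `toy_features`, `toy_classify`, and `greedy_tag` are hypothetical, and the hand-written rules merely stand in for a trained classifier; only the left-to-right control flow mirrors the `tag` method above.

```python
def toy_features(sentence, i, history):
    """Reduced version of pos_features: the word and the previous tag."""
    prev_tag = "<START>" if i == 0 else history[i - 1]
    return {"word": sentence[i].lower(), "prev-tag": prev_tag}


def toy_classify(features):
    """Hypothetical hand-written rules standing in for a trained classifier."""
    if features["word"] in ("a", "the"):
        return "AT"
    if features["word"] == "fly":
        # The tag chosen for the previous token decides noun versus verb.
        return "NN" if features["prev-tag"] == "AT" else "VB"
    return "NN"


def greedy_tag(sentence):
    history = []
    for i in range(len(sentence)):
        # Each decision sees the tags already assigned to earlier tokens.
        history.append(toy_classify(toy_features(sentence, i, history)))
    return list(zip(sentence, history))


print(greedy_tag(["a", "fly"]))
print(greedy_tag(["birds", "fly"]))
```

Because each tag is fixed as soon as it is chosen, an early mistake can propagate to later tokens; that is the main limitation of this greedy strategy.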

## Part 2.   Further Examples of Supervised Classification
### 2.1   Sentence Segmentation
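Sentence segmentation can be treated as a classification task for punctuation:<BR>
whenever we encounter a symbol that could end a sentence, we decide whether it actually does.<BR><BR>
The example below learns which punctuation marks end a sentence.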

In [2]:
import nltk
sents = nltk.corpus.treebank_raw.sents() # Corpus text already segmented into sentences
tokens = []
boundaries = set()
offset = 0
for sent in sents:
    tokens.extend(sent) # Merge all sentences back into one flat token list
    offset += len(sent) # Running total of tokens added so far
    boundaries.add(offset-1) # Record the index of each sentence-final token

In [3]:
print(sents)

[['.', 'START'], ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov', '.', '29', '.'], ...]


In [6]:
# Features describing a punctuation token
def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(), # Whether the next token starts with an uppercase letter
            'prev-word': tokens[i-1].lower(), # The previous token, lowercased
            'punct': tokens[i], # The punctuation mark itself
            'prev-word-is-one-char': len(tokens[i-1]) == 1} # Whether the previous token is a single character

In [7]:
# featuresets: one (features, is-boundary) pair for every candidate punctuation token
featuresets = [(punct_features(tokens, i), (i in boundaries))
               for i in range(1, len(tokens)-1)
               if tokens[i] in '.?!']

In [None]:
featuresets 

With feature sets for the punctuation tokens, we can train a model and obtain a classifier.

In [28]:
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set) # Train the model
nltk.classify.accuracy(classifier, test_set) # Evaluate accuracy

0.936026936026936

With a model that predicts whether a punctuation mark ends a sentence,<BR>
we can write a function that segments text into sentences.

In [9]:
def segment_sentences(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        if word in '.?!' and classifier.classify(punct_features(words, i)) == True:
            sents.append(words[start:i+1])
            start = i+1
    if start < len(words):
        sents.append(words[start:])
    return sents

In [53]:
words = ['how', 'are', 'you','?','I','am','fine',',','thx','.',' ']
segment_sentences(words)

[['how', 'are', 'you', '?'], ['I', 'am', 'fine', ',', 'thx', '.', ' ']]
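
Note that `punct_features` reads `tokens[i+1]`, so calling it on the final token of the input raises an `IndexError`; the example above avoids this by ending the input with a `' '` token. A minimal sketch of a boundary-tolerant variant follows. The name `safe_punct_features` and the `'<START>'` placeholder are assumptions introduced here, and a classifier would have to be trained with this same extractor for the features to match.

```python
def safe_punct_features(tokens, i):
    """Like punct_features, but tolerant of the first and last positions."""
    has_next = i + 1 < len(tokens)
    has_prev = i > 0
    return {
        # With no following token, report "not capitalized".
        'next-word-capitalized': has_next and tokens[i + 1][0].isupper(),
        # With no preceding token, use a placeholder value.
        'prev-word': tokens[i - 1].lower() if has_prev else '<START>',
        'punct': tokens[i],
        'prev-word-is-one-char': has_prev and len(tokens[i - 1]) == 1,
    }


words = ['how', 'are', 'you', '?', 'I', 'am', 'fine', '.']
print(safe_punct_features(words, 3))
print(safe_punct_features(words, 7))  # last token: no IndexError
```

Treating the end of the input as "not capitalized" is a design choice made for this sketch; another reasonable option is to always treat the final punctuation mark as a sentence boundary without consulting the classifier.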

### 2.2   Identifying Dialogue Act Types
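Recognizing the dialogue acts underlying the utterances in a dialogue is an important first step in understanding a conversation.<BR><BR>
The NPS Chat corpus consists of more than 10,000 messages.<BR>
Every message has been labeled with one of 15 dialogue act types,<BR>
such as "Statement", "Emotion", "ynQuestion", and "Continuer".<BR><BR>
The code below builds a classifier for them.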

In [96]:
# xml_posts() returns the posts as XML elements
posts = nltk.corpus.nps_chat.xml_posts()[:10000]

<img src="https://i.imgur.com/6DlJ0Gx.jpg" />

In [60]:
# Feature extractor for a post
def dialogue_act_features(post):
    features = {}
    # nltk.word_tokenize() splits the string into tokens
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

In [89]:
# featuresets pairs the features of each post with its dialogue act type (the label)
featuresets = [(dialogue_act_features(post.text), post.get('class'))
               for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.668


### 2.3   Recognizing Textual Entailment
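Recognizing textual entailment (RTE) is the task of deciding whether a given text T entails another text, called the hypothesis.<BR>
An example is shown below:<BR><BR>
    T is the original text and H is the hypothesis.<BR>
    True means that H can be inferred from T; False means it cannot.
<img src="https://i.imgur.com/vscLDy3.jpg"><BR>
RTE can be treated as a classification task: for each Text/Hypothesis pair, predict whether the relation is True or False.<br>
In the ideal case, if all the information in the hypothesis also appears in the text, we expect entailment.<BR>
Conversely, if the hypothesis contains information that does not appear in the text, we expect no entailment.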
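
The overlap intuition can be sketched with plain Python sets. The function `overlap_features` and the toy sentences are hypothetical, and splitting on whitespace is a simplifying assumption; NLTK's `RTEFeatureExtractor` does more processing than this sketch.

```python
def overlap_features(text, hypothesis):
    """Count hypothesis words that do and do not appear in the text."""
    text_words = set(text.lower().split())
    hyp_words = set(hypothesis.lower().split())
    return {
        # Hypothesis words that the text also contains
        'word_overlap': len(hyp_words & text_words),
        # Hypothesis words that are missing from the text
        'word_hyp_extra': len(hyp_words - text_words),
    }


text = "the cat sat on the mat"
print(overlap_features(text, "the cat sat"))
print(overlap_features(text, "the dog sat"))
```

A hypothesis with no extra words is a plausible entailment candidate, while every extra word is evidence against entailment. Word overlap alone is a weak signal, which is why it is only used as a feature for a classifier.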

In [73]:
def rte_features(rtepair):
    extractor = nltk.RTEFeatureExtractor(rtepair)
    features = {}
    features['word_overlap'] = len(extractor.overlap('word'))
    features['word_hyp_extra'] = len(extractor.hyp_extra('word'))
    features['ne_overlap'] = len(extractor.overlap('ne'))
    features['ne_hyp_extra'] = len(extractor.hyp_extra('ne'))
    return features

In [87]:
rtepair = nltk.corpus.rte.pairs(['rte3_dev.xml'])[33]
extractor = nltk.RTEFeatureExtractor(rtepair)
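'''
Not every word is equally important.
The NLTK authors treat named entities (people, organizations, places) as likely to carry more weight,
so the extractor separates the two kinds of words:
'ne' selects the named entities
'word' selects the remaining words
'''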
print("text =>",extractor.text_words,"\n")
print("hyp =>",extractor.hyp_words,"\n")
print("overlap_word =>",extractor.overlap('word'),"\n")
print("overlap_ne =>",extractor.overlap('ne'),"\n")
print("hyp_extra_word =>",extractor.hyp_extra('word'),"\n")
print("hyp_extra_ne =>",extractor.hyp_extra('ne'))

text => {'together', 'meeting', 'republics', 'Davudi', 'that', 'Asia', 'four', 'Russia', 'Parviz', 'at', 'SCO', 'former', 'China', 'fledgling', 'Organisation', 'fight', 'binds', 'Soviet', 'was', 'Shanghai', 'Co', 'Iran', 'terrorism', 'association', 'central', 'representing', 'operation'} 

hyp => {'SCO', 'China', 'fledgling', 'fight', 'member', 'Russia'} 

overlap_word => {'fight', 'fledgling'} 

overlap_ne => {'Russia', 'China', 'SCO'} 

hyp_extra_word => {'member'} 

hyp_extra_ne => set()


<img src="https://i.imgur.com/SmGB1O1.jpg"/>

## Part 3.   Evaluation
### 3.1   The Test Set

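The test set must consist of different data from the training set, but in the same format;<BR>
otherwise the evaluation results are not reliable.<BR>
<BR>
The size of the test set also matters.<BR>
For a classification task with a small number of well-balanced labels and a diverse test set, about 100 test instances are enough.<BR>
If the task has many labels or includes labels that occur rarely, the test set must be large enough that every label occurs at least 50 times for the evaluation to be meaningful.<BR>
When a very large amount of data is available, using 10% of it as the test set is a safe choice.<BR><BR>
Another consideration: if the test set is too similar to the development set, the evaluation results may not generalize to other data sets.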

In [19]:
# Training set and test set drawn from the same documents (less reliable)
import random
from nltk.corpus import brown
tagged_sents = list(brown.tagged_sents(categories='news'))
size = int(len(tagged_sents) * 0.1)
train_set, test_set = tagged_sents[size:], tagged_sents[:size]

In [15]:
# Training set and test set drawn from different documents (better)
file_ids = brown.fileids(categories='news') # file_ids lists the files in the news category
size = int(len(file_ids) * 0.1)
train_set = brown.tagged_sents(file_ids[size:])
test_set = brown.tagged_sents(file_ids[:size])

With a test set in hand, we can measure how accurate the model is.
NLTK provides nltk.classify.accuracy() to compute a classifier's accuracy on a test set.

### 3.2 Precision and Recall
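Accuracy is not always a reliable measure, and it can even be misleading, for example in search and information retrieval.<BR>
Because most documents are irrelevant to any given query, a model that labels every document "irrelevant" achieves an accuracy close to 100%.<BR>
Tasks like this therefore call for other evaluation measures.<BR>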
<img src="https://i.imgur.com/DsrpBx8.jpg" width="600" align="center"/><BR>

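The outcomes fall into four categories:<BR>
1. True positives: relevant items that were correctly retrieved<BR>
2. True negatives: irrelevant items that were correctly not retrieved<BR>
3. False positives: irrelevant items that were incorrectly retrieved<BR>
4. False negatives: relevant items that were incorrectly not retrieved<BR>

Given these four numbers, we can define the following measures:<BR>
<ul>
    <li>Precision: the proportion of retrieved items that are relevant, TP/(TP+FP).</li>
    <li>Recall: the proportion of relevant items that were retrieved, TP/(TP+FN).</li>
    <li>F-Measure: the harmonic mean of precision and recall, (2 × Precision × Recall) / (Precision + Recall).</li>
</ul>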
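
These definitions can be checked on a small made-up example. The sketch below uses plain Python sets; the collection of 100 documents, the 5 relevant ones, and the function names are all hypothetical. It also shows why accuracy is misleading here: retrieving nothing scores 95% accuracy but zero recall.

```python
def accuracy(all_items, relevant, retrieved):
    """Fraction of items whose relevant/irrelevant status is predicted correctly."""
    true_pos = len(relevant & retrieved)
    true_neg = len(all_items - relevant - retrieved)
    return (true_pos + true_neg) / len(all_items)


def precision_recall_f(relevant, retrieved):
    """Return (precision, recall, F-measure) for a set of retrieved items."""
    true_pos = len(relevant & retrieved)
    # len(retrieved) is TP + FP; len(relevant) is TP + FN.
    precision = true_pos / len(retrieved) if retrieved else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical collection: 100 documents, of which 5 are relevant.
all_docs = set(range(100))
relevant = {0, 1, 2, 3, 4}

# Retrieving nothing is the same as labeling every document "irrelevant".
print(accuracy(all_docs, relevant, set()))
print(precision_recall_f(relevant, set()))

# A system that retrieves four documents, three of them relevant.
print(precision_recall_f(relevant, {0, 1, 2, 10}))
```

Returning 0.0 when a denominator is zero is a convention chosen for this sketch; other tools may report the value as undefined.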

### 3.3   Confusion Matrices
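A confusion matrix is a table that shows, for each label, how often it was predicted correctly and how often it was confused with another label.<BR>
Cells on the diagonal are correct predictions; off-diagonal cells are errors.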

In [12]:
from nltk.corpus import brown

# Flatten tagged sentences into a single list of tags
def tag_list(tagged_sents):
    return [tag for sent in tagged_sents for (word, tag) in sent]

# Strip the gold tags and re-tag every sentence with the given tagger
def apply_tagger(tagger, corpus):
    return [tagger.tag(nltk.tag.untag(sent)) for sent in corpus]

# Train a bigram tagger that backs off to a unigram tagger, then to a default 'NN' tagger
brown_tagged_sents = brown.tagged_sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)

# Compare the gold tags of the editorial category with the tagger's output
gold = tag_list(brown.tagged_sents(categories='editorial'))
test = tag_list(apply_tagger(t2, brown.tagged_sents(categories='editorial')))
cm = nltk.ConfusionMatrix(gold, test)
print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))

    |                                         N                      |
    |      N      I      A      J             N             V      N |
    |      N      N      T      J      .      S      ,      B      P |
----+----------------------------------------------------------------+
 NN | <11.9%>  0.0%      .   0.2%      .   0.0%      .   0.2%   0.0% |
 IN |   0.0%  <9.0%>     .      .      .   0.0%      .      .      . |
 AT |      .      .  <8.6%>     .      .      .      .      .      . |
 JJ |   1.6%      .      .  <4.0%>     .      .      .   0.0%   0.0% |
  . |      .      .      .      .  <4.8%>     .      .      .      . |
NNS |   1.5%      .      .      .      .  <3.3%>     .      .   0.0% |
  , |      .      .      .      .      .      .  <4.4%>     .      . |
 VB |   0.9%      .      .   0.0%      .      .      .  <2.4%>     . |
 NP |   1.0%      .      .   0.0%      .      .      .      .  <1.9%>|
----+----------------------------------------------------------------+
(row =
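
The counts behind such a table are simply tallies of (gold tag, predicted tag) pairs. The sketch below rebuilds them in plain Python on two short made-up tag lists; it illustrates the idea and is not how `nltk.ConfusionMatrix` is implemented.

```python
from collections import Counter

# Hypothetical gold-standard tags and tagger output for six tokens.
gold = ['NN', 'NN', 'VB', 'JJ', 'NN', 'VB']
test = ['NN', 'JJ', 'VB', 'NN', 'NN', 'NN']

# Count each (gold tag, predicted tag) pair; matching pairs form the diagonal.
confusion = Counter(zip(gold, test))

# Accuracy is the share of tokens that fall on the diagonal.
correct = sum(count for (g, t), count in confusion.items() if g == t)

print(confusion[('NN', 'NN')], confusion[('NN', 'JJ')], correct / len(gold))
```

Off-diagonal cells with large counts, such as `('JJ', 'NN')` in the matrix above, point to the tag pairs the tagger confuses most often.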

### 3.4   Cross-Validation


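To evaluate a model, we must set aside some data as a test set.<BR>
If the test set is too small, the evaluation is not very accurate.<BR>
If it is too large, the training set becomes small and the model suffers.<BR>
One solution is cross-validation: run several evaluations on different test sets and combine the scores.<BR>

The original corpus is divided into N subsets called folds.<BR>
For each fold, we train a model on all the data except that fold and then evaluate it on the fold.<BR>
Even if an individual fold is too small to give a reliable score on its own, the combined score is based on a large amount of data and is therefore reliable.<BR>

Another advantage of cross-validation is that it shows how much performance varies across training sets.<BR>
If the scores of all N runs are very similar, we can be fairly confident that the result is accurate.<BR>
If the scores vary widely, the evaluation should be treated with caution.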
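
The procedure can be sketched in plain Python. Everything in the block below is hypothetical: the ten labels are toy data, `k_fold_indices` is a helper written for this sketch, and a majority-label guesser stands in for a real classifier so that the example runs without a corpus.

```python
from collections import Counter


def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    fold_size = n_items // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder.
        end = n_items if fold == k - 1 else start + fold_size
        test_idx = list(range(start, end))
        train_idx = list(range(0, start)) + list(range(end, n_items))
        yield train_idx, test_idx


def majority_label(labels):
    """Stand-in 'model': always predict the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]


labels = ['pos', 'neg', 'pos', 'pos', 'neg', 'pos', 'pos', 'neg', 'pos', 'pos']

scores = []
for train_idx, test_idx in k_fold_indices(len(labels), 5):
    guess = majority_label([labels[i] for i in train_idx])
    hits = sum(labels[i] == guess for i in test_idx)
    scores.append(hits / len(test_idx))

print(scores, sum(scores) / len(scores))
```

The spread of the per-fold scores is the variation discussed above, and their mean is the combined estimate. With real data, shuffle the items before splitting so that each fold is representative.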