# 6. Learning to Classify Text

## Part 1.   Supervised Classification
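### Classification is the task of assigning the correct class label to a given input
### In basic classification tasks, each input is considered in isolation from the others, and the set of labels is defined in advance

### A classifier is called "supervised" if it is built from a training corpus in which every input is annotated with its correct label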

![Imgur](https://i.imgur.com/SO0thhn.png)
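
As a rough illustration of the supervised workflow, the sketch below learns from labeled inputs and then labels a new input. It uses only the standard library; the names `train_lookup` and `classify` and the toy data are assumptions made for this example and are not part of NLTK.

```python
from collections import Counter, defaultdict

def train_lookup(labeled_inputs):
    """Learn, for each input value, the label it was most often given."""
    counts = defaultdict(Counter)
    for value, label in labeled_inputs:
        counts[value][label] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def classify(model, value, default=None):
    """Return the learned label, or `default` for an input never seen in training."""
    return model.get(value, default)

# Toy training data (invented): last letters paired with a label.
training = [('a', 'female'), ('a', 'female'), ('k', 'male'), ('a', 'male')]
model = train_lookup(training)
print(classify(model, 'a'))  # female
print(classify(model, 'k'))  # male
```

A real classifier such as Naive Bayes generalizes beyond exact matches, but the two phases (train on labeled data, then predict) are the same.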

## 1.1   Gender Identification
### Build a classifier that predicts gender from the last letter of a name

In [62]:
# Load the files of male and female names from the NLTK names corpus
from nltk.corpus import names
import random
import nltk

# Build a list of (name, label) tuples from the male and female name files
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +[(name, 'female') for name in names.words('female.txt')])

# Shuffle the list; otherwise all male names come before all female names
random.shuffle(labeled_names)

# Return a feature dict holding the last letter of the given word
def gender_features(word):
    return {'last_letter': word[-1]}

# featuresets is a list of (feature dict, gender) tuples
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]

# Use the first 500 feature sets as the test set and the rest as the training set
train_set, test_set = featuresets[500:], featuresets[:500]

# Train a new Naive Bayes classifier on the training set with NaiveBayesClassifier.train
classifier = nltk.NaiveBayesClassifier.train(train_set)
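
To show roughly what training and classifying involve, here is a minimal Naive Bayes sketch in plain Python. The function names, the toy data, and the add-one smoothing are assumptions of this sketch; NLTK's own implementation differs in its details.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(featuresets):
    """Count labels and (label, feature name, feature value) combinations."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature name) -> value counts
    values_seen = defaultdict(set)       # feature name -> observed values
    for features, label in featuresets:
        label_counts[label] += 1
        for fname, fval in features.items():
            value_counts[(label, fname)][fval] += 1
            values_seen[fname].add(fval)
    return label_counts, value_counts, values_seen

def classify_naive_bayes(model, features):
    """Pick the label maximizing log P(label) + sum of log P(value | label)."""
    label_counts, value_counts, values_seen = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        for fname, fval in features.items():
            count = value_counts[(label, fname)][fval]
            # Add-one smoothing (an assumption of this sketch) avoids log(0).
            score += math.log((count + 1) / (n + len(values_seen[fname])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy feature sets (invented for illustration).
toy = [({'last_letter': 'a'}, 'female'), ({'last_letter': 'a'}, 'female'),
       ({'last_letter': 'e'}, 'female'), ({'last_letter': 'k'}, 'male'),
       ({'last_letter': 'o'}, 'male')]
nb_model = train_naive_bayes(toy)
print(classify_naive_bayes(nb_model, {'last_letter': 'a'}))  # female
print(classify_naive_bayes(nb_model, {'last_letter': 'k'}))  # male
```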

In [35]:
# Try the classifier on a name
classifier.classify(gender_features('Neo'))

'male'

In [33]:
classifier.classify(gender_features('Trinity'))

'female'

In [63]:
# The data is shuffled randomly, so the accuracy varies from run to run
print(nltk.classify.accuracy(classifier, test_set))

0.774
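
Accuracy here is the fraction of test items whose predicted label matches the correct label. The sketch below computes it by hand; `accuracy`, `guess_by_last_letter`, and the four test items are hypothetical stand-ins, not the NLTK function or the real test set.

```python
def accuracy(classify, labeled_featuresets):
    """Fraction of items whose predicted label equals the correct label."""
    correct = sum(1 for features, label in labeled_featuresets
                  if classify(features) == label)
    return correct / len(labeled_featuresets)

# Hypothetical rule standing in for a trained classifier.
def guess_by_last_letter(features):
    return 'female' if features['last_letter'] in 'aeiy' else 'male'

toy_test = [({'last_letter': 'a'}, 'female'),
            ({'last_letter': 'n'}, 'male'),
            ({'last_letter': 'e'}, 'male'),
            ({'last_letter': 'y'}, 'female')]
print(accuracy(guess_by_last_letter, toy_test))  # 0.75
```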


In [37]:
# Show the most informative features
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     36.5 : 1.0
             last_letter = 'k'              male : female =     30.5 : 1.0
             last_letter = 'f'              male : female =     17.4 : 1.0
             last_letter = 'v'              male : female =     10.6 : 1.0
             last_letter = 'p'              male : female =     10.6 : 1.0
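
Each ratio in this output compares how likely a feature value is under one label versus the other; for example, a final 'a' is about 36.5 times more common among female names than among male names in the training set. The sketch below computes such a ratio from counts. The counts are made up, and the smoothing constant `gamma` and the number of `bins` are assumptions of the sketch.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b, bins, gamma=0.5):
    """Ratio P(value | label A) / P(value | label B) with additive smoothing."""
    p_a = (count_a + gamma) / (total_a + gamma * bins)
    p_b = (count_b + gamma) / (total_b + gamma * bins)
    return p_a / p_b

# Made-up counts: 'a' ends 1800 of 5000 female names and 30 of 3000 male names,
# with 26 possible last letters.
ratio = likelihood_ratio(1800, 5000, 30, 3000, bins=26)
print(round(ratio, 1))
```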


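### Storing the feature sets of every instance of a large corpus in a single list can use a lot of memory
### In that case we can use nltk.classify.apply_features, which returns an object that behaves like a list but computes each feature set on demand, so the full list is never held in memory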

In [46]:
from nltk.classify import apply_features

train_set = apply_features(gender_features, labeled_names[500:])
test_set = apply_features(gender_features, labeled_names[:500])
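
The idea behind such a lazy list can be sketched in a few lines. The class `LazyFeatures` below is a hypothetical illustration that supports only `len()`, integer indexing, and iteration; it is not how NLTK implements `apply_features`.

```python
class LazyFeatures:
    """List-like view that computes (features, label) pairs only when indexed."""

    def __init__(self, feature_func, labeled_items):
        self._func = feature_func
        self._items = labeled_items

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        # Only the requested item is converted; nothing is cached.
        item, label = self._items[index]
        return (self._func(item), label)

def last_letter(name):
    return {'last_letter': name[-1]}

lazy = LazyFeatures(last_letter, [('Neo', 'male'), ('Trinity', 'female')])
print(len(lazy))  # 2
print(lazy[1])    # ({'last_letter': 'y'}, 'female')
```

The trade-off is that features are recomputed on every access, which costs time but keeps memory use proportional to the raw data only.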

## 1.2   Choosing The Right Features
### Choosing good features for the classifier can raise its accuracy

In [144]:
def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

gender_features2('John') 

{'count(a)': 0,'count(b)': 0,'count(c)': 0,'count(d)': 0,'count(e)': 0,'count(f)': 0,'count(g)': 0,'count(h)': 1,'count(i)': 0,'count(j)': 1,'count(k)': 0,'count(l)': 0,'count(m)': 0,'count(n)': 1,'count(o)': 1,'count(p)': 0,'count(q)': 0,'count(r)': 0,'count(s)': 0,'count(t)': 0,'count(u)': 0,'count(v)': 0,'count(w)': 0,'count(x)': 0,'count(y)': 0,'count(z)': 0,'first_letter': 'j','has(a)': False,'has(b)': False,
 'has(c)': False,'has(d)': False,'has(e)': False,'has(f)': False,'has(g)': False,'has(h)': True,'has(i)': False,'has(j)': True,'has(k)': False,'has(l)': False,'has(m)': False,'has(n)': True,'has(o)': True,'has(p)': False,'has(q)': False,'has(r)': False,'has(s)': False,'has(t)': False,'has(u)': False,'has(v)': False,'has(w)': False,'has(x)': False,'has(y)': False,'has(z)': False,'last_letter': 'n'}

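### However, the number of features should be kept in check: with too many features, the algorithm is more likely to rely on quirks of the training data that do not generalize well to new examples
### This problem is called overfitting, and it is especially pronounced with small training sets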

In [64]:
featuresets = [(gender_features2(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.762


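### In the example above, gender_features2 extracts many more features, yet its accuracy (0.762) is slightly lower than that of gender_features (0.774), which uses only the last letter of the name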
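
An extreme case makes the effect easier to see. In the sketch below, a model that memorizes whole names is perfect on the names it was trained on but falls back to guessing on new ones, while a coarser last-letter feature carries over. The names, the helper functions, and the lookup-table model are all invented for this illustration, and the gap is deliberately exaggerated.

```python
from collections import Counter, defaultdict

def whole_name(name):
    return name.lower()        # one distinct feature value per name

def last_letter(name):
    return name[-1].lower()    # coarse feature shared by many names

def train(extract, labeled_names):
    """Majority label per feature value, plus an overall fallback label."""
    by_value = defaultdict(Counter)
    overall = Counter()
    for name, label in labeled_names:
        by_value[extract(name)][label] += 1
        overall[label] += 1
    table = {value: c.most_common(1)[0][0] for value, c in by_value.items()}
    return table, overall.most_common(1)[0][0]

def accuracy(model, extract, labeled_names):
    table, fallback = model
    hits = sum(table.get(extract(name), fallback) == label
               for name, label in labeled_names)
    return hits / len(labeled_names)

seen = [('Anna', 'female'), ('Maria', 'female'), ('Julia', 'female'),
        ('Mark', 'male'), ('Frank', 'male')]
unseen = [('Sophia', 'female'), ('Derek', 'male'),
          ('Laura', 'female'), ('Jack', 'male')]

for extract in (whole_name, last_letter):
    model = train(extract, seen)
    print(extract.__name__, accuracy(model, extract, seen),
          accuracy(model, extract, unseen))
# whole_name 1.0 0.5
# last_letter 1.0 1.0
```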

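### Once an initial set of features has been chosen, error analysis is a productive way to refine it
### First we select a development set, which contains the corpus data used to create the model; the development set is then subdivided into a training set and a dev-test set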

In [65]:
# train_names is used to train the model; devtest_names is used for error analysis
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]

### The corpus is divided into separate sets
![Imgur](https://i.imgur.com/vLxZNXu.png)
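
The same three-way split can be wrapped in a small helper. This is a sketch under assumptions: `three_way_split` is a hypothetical function, and `range(8000)` merely stands in for a list of labeled names.

```python
import random

def three_way_split(items, n_test, n_devtest, seed=None):
    """Shuffle a copy of `items` and cut it into train, dev-test and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

train_part, devtest_part, test_part = three_way_split(
    range(8000), n_test=500, n_devtest=1000, seed=0)
print(len(train_part), len(devtest_part), len(test_part))  # 6500 1000 500
```

Keeping the test set untouched until the end means the final accuracy is measured on data that never influenced feature design.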


In [67]:
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]

# Train the classifier on train_set
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Evaluate the classifier on devtest_set
print(nltk.classify.accuracy(classifier, devtest_set))

0.738


In [71]:
# Collect the names the classifier labeled incorrectly in errors
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )
#print(errors)

('gender : male', 'guess : female', 'Mortie'), <br>
('gender : male', 'guess : female', 'Bartie'),<br>
('gender : male', 'guess : female', 'Jackie'),<br>
('gender : male', 'guess : female', 'Jeremie'),<br>
('gender : male', 'guess : female', 'Charlie'),<br>
<br>
('gender : female', 'guess : male', 'Dorian'),<br>
('gender : female', 'guess : male', 'Christan'), <br>
('gender : female', 'guess : male', 'Jillian'),<br>
('gender : female', 'guess : male', 'Lilyan')<br>

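### The classifier labels names ending in e as female, yet the errors include several male names ending in ie; likewise it labels names ending in n as male, yet several female names end in an
### These errors suggest that using the last two letters of the name as a feature may be more accurate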
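
One way to spot such patterns is to count the errors by two-letter suffix and correct label. The sketch below does this with `collections.Counter` on the nine errors listed above, written as plain (tag, guess, name) tuples.

```python
from collections import Counter

# (correct label, guess, name) triples copied from the error list above.
errors = [('male', 'female', 'Mortie'), ('male', 'female', 'Bartie'),
          ('male', 'female', 'Jackie'), ('male', 'female', 'Jeremie'),
          ('male', 'female', 'Charlie'), ('female', 'male', 'Dorian'),
          ('female', 'male', 'Christan'), ('female', 'male', 'Jillian'),
          ('female', 'male', 'Lilyan')]

# Group errors by (last two letters, correct label).
suffix_counts = Counter((name[-2:].lower(), tag) for tag, guess, name in errors)
for (suffix, tag), n in suffix_counts.most_common():
    print(suffix, tag, n)
# ie male 5
# an female 4
```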

In [74]:
# Define a feature extractor that uses the last letter and the last two letters of the name
def gender_features3(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}

train_set = [(gender_features3(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features3(n), gender) for (n, gender) in devtest_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.763


### After retraining the classifier with the last two letters as an additional feature, accuracy on devtest_set rose from 0.738 to 0.763

## 1.3   Document Classification

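### Load all positive and negative reviews from the movie_reviews corpus, and store the 2000 most frequent tokens across the reviews as word_features
### For each review, check which tokens in word_features it contains; the resulting feature sets are used to train the classifier
### The trained classifier can then predict whether a review is pos or neg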

In [155]:
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

In [157]:
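# Find the 2000 most frequent tokens and store them in word_features
# (most_common sorts by frequency; iterating over a FreqDist directly does not)
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for (w, _) in all_words.most_common(2000)]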

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

# Check which words in word_features a document contains
#print(document_features(movie_reviews.words('pos/cv957_8737.txt')))
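
The same feature extraction can be tried on a tiny scale without the corpus. In this sketch the three documents, the vocabulary size of 2, and the function `contains_features` are assumptions for illustration; `Counter.most_common` selects the vocabulary by frequency.

```python
from collections import Counter

# Toy labeled documents (invented), each a list of tokens plus a category.
docs = [(['great', 'fun', 'movie'], 'pos'),
        (['dull', 'movie'], 'neg'),
        (['great', 'cast', 'movie'], 'pos')]

# Vocabulary: the most frequent tokens across all documents.
counts = Counter(w.lower() for words, _ in docs for w in words)
vocabulary = [w for w, _ in counts.most_common(2)]

def contains_features(words, vocabulary):
    """Map each vocabulary word to whether it occurs in the document."""
    present = set(words)
    return {'contains({})'.format(w): (w in present) for w in vocabulary}

print(vocabulary)                                  # ['movie', 'great']
print(contains_features(docs[1][0], vocabulary))
# {'contains(movie)': True, 'contains(great)': False}
```

Converting the document to a set first makes each membership test fast, which matters when the vocabulary has thousands of words.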

In [167]:
# d is the list of words in a document and c is its category (pos or neg); build one feature set per document for training
featuresets = [(document_features(d), c) for (d,c) in documents]

train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

#classifier.classify() 
#classifier.show_most_informative_features(5)

## 1.4   Part-of-Speech Tagging