## Feature generation
### 1. TF-IDF
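- Term frequency (tf): the number of times a term occurs in a document, divided by the total number of terms in that document.
- Inverse document frequency (idf): first compute the document frequency df, which is the number of documents that contain the term divided by the total number of documents; idf is log10 of the reciprocal of df.
- tf-idf: tf * idf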
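- Example:
>Suppose a document contains 100 terms in total and the term "cow" occurs 3 times; the term frequency of "cow" in that document is then 3/100 = 0.03. The inverse document frequency (IDF) is computed by dividing the total number of documents in the collection by the number of documents that contain "cow", then taking the base-10 logarithm. So if "cow" occurs in 1,000 documents and the collection holds 10,000,000 documents, its inverse document frequency is lg(10,000,000 / 1,000) = 4. The final tf-idf score is 0.03 * 4 = 0.12.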
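- tf in this assignment: each wncat (category) is treated as one document, so tf is the number of times a term occurs in a given wncat divided by the total number of terms in that wncat.
- df in this assignment: with each wncat treated as a document, df is the number of wncat categories in which the term occurs divided by the total number of wncat categories.
- idf in this assignment: log10 of the reciprocal of df. The tfidf function in In [2] adds 1 to the reciprocal before taking the logarithm, i.e. it computes log10(1/df + 1).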
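
The worked example above can be checked numerically. This is a minimal sketch of the plain definition (without the +1 used later in the notebook); the function name `tf_idf_example` and its parameter names are illustrative and do not appear in the notebook.

```python
import math

def tf_idf_example(term_count, doc_length, docs_with_term, total_docs):
    """Plain TF-IDF: tf = term_count / doc_length, idf = log10(total_docs / docs_with_term)."""
    tf = term_count / doc_length
    idf = math.log10(total_docs / docs_with_term)
    return tf * idf

# "cow" occurs 3 times in a 100-term document and in 1,000 of 10,000,000 documents
score = tf_idf_example(3, 100, 1_000, 10_000_000)
print(round(score, 2))  # 0.12
```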

In [1]:
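import re
from collections import defaultdict, Counter

def words(text):
    # Lowercase the text and split it into word tokens
    return re.findall(r'\w+', text.lower())

# TF[word][wncat]: number of times word occurs in the definitions of category wncat
TF = defaultdict(lambda: defaultdict(int))
# DF[word]: set of categories in which word occurs
DF = defaultdict(set)

# wncat_word_sum[wncat]: total number of word tokens in category wncat
wncat_word_sum = defaultdict(int)

# Each non-empty line holds four tab-separated fields: wnid, wncat, senseDef, target
with open('wn.in.evp.cat.txt', 'r') as f:
    raw_data = [line.strip().split('\t') for line in f if line.strip() != '']

for wnid, wncat, senseDef, target in raw_data:
    for word in words(senseDef):
        TF[word][wncat] += 1
        wncat_word_sum[wncat] += 1
        DF[word].add(wncat)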
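
To make the three counters concrete, here is a small sketch that runs the same counting scheme on a toy corpus. The rows in `toy_rows`, the category names, and the variable names `tf_counts`, `df_cats`, and `cat_totals` are made up for illustration; only the counting logic mirrors the cell above.

```python
import re
from collections import defaultdict

def words(text):
    return re.findall(r'\w+', text.lower())

# Hypothetical toy rows: (category, definition text)
toy_rows = [
    ("animal.n.01", "a cow eats grass"),
    ("animal.n.01", "a dog eats meat"),
    ("plant.n.01", "grass is a plant"),
]

tf_counts = defaultdict(lambda: defaultdict(int))  # word -> category -> count
df_cats = defaultdict(set)                         # word -> categories containing it
cat_totals = defaultdict(int)                      # category -> total word tokens

for cat, text in toy_rows:
    for w in words(text):
        tf_counts[w][cat] += 1
        cat_totals[cat] += 1
        df_cats[w].add(cat)

print(tf_counts["eats"]["animal.n.01"])  # 2
print(cat_totals["animal.n.01"])         # 8
print(sorted(df_cats["grass"]))          # ['animal.n.01', 'plant.n.01']
```

Because all rows of one category are pooled, a word that occurs in several definitions of the same category raises its tf there but still counts only once toward its df.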

- TF-IDF of "forsake" in the get_rid_of.v.01 category

In [2]:
import math
wncat_len = len(wncat_word_sum)

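def tfidf(word, wn_cat):
    # tf: share of the category's word tokens that are this word
    tf = TF[word][wn_cat]/wncat_word_sum[wn_cat]
    # idf: log10 of (number of categories / categories containing the word), with 1 added inside the log
    idf = math.log10(wncat_len/len(DF[word]) + 1)
    return tf*idf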

tfidf("forsake", "get_rid_of.v.01")

0.0267668572785115
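
The idf in `tfidf` differs from the textbook definition by the 1 added inside the logarithm. The sketch below compares the two forms; the function names `idf_plain` and `idf_plus_one` and the category count of 50 are assumptions for illustration only.

```python
import math

def idf_plain(n_cats, n_cats_with_word):
    # Textbook form: log10(N / df_count)
    return math.log10(n_cats / n_cats_with_word)

def idf_plus_one(n_cats, n_cats_with_word):
    # Same form as the notebook's tfidf(): the 1 is added inside the logarithm
    return math.log10(n_cats / n_cats_with_word + 1)

n = 50  # hypothetical number of categories
print(idf_plain(n, n))     # 0.0
print(idf_plus_one(n, n))  # log10(2), about 0.301
```

With the plain form, a word that occurs in every category gets weight zero; with the +1 form it keeps a positive weight, so very common words can still receive sizeable tf-idf values when their tf is high.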

### Data format: ({word: tf-idf, word: tf-idf, ...}, 'wncat')

In [3]:
def feature_generate(line):
    # line is [wnid, wncat, senseDef, target]; map each word of senseDef to its tf-idf in wncat
    word_tfidf = {}
    for word in words(line[2]):
        word_tfidf[word] = tfidf(word, line[1])
    return (word_tfidf, line[1])

[<br/>
'aspirin-n-1', ( word sense ID ) <br/>
'medicine.n.02', ( category ) <br/>
'aspirin acetylsalicylic_acid Bayer Empirin St._Joseph||the acetylated derivative of salicylic acid; used as an analgesic anti-inflammatory drug (trade names Bayer, Empirin, and St. Joseph) usually taken in tablet form; used as an antipyretic; slows clotting of the blood by poisoning platelets||', ( definition and example sentences ) <br/>
"{'aspirin-n-1': 'medicine.n.02'}" ( polysemy: each sense of the word with its category ) <br/>
]

In [4]:
eval("{'aspirin-n-1': 'medicine.n.02'}")

{'aspirin-n-1': 'medicine.n.02'}
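
Since the last field is a plain dictionary literal, it can also be parsed with `ast.literal_eval` from the standard library, which accepts only Python literals and so cannot execute arbitrary code from the data file. This is a sketch of an alternative, not what the notebook does; the `row` string below is a shortened, hand-written example of the four tab-separated fields.

```python
import ast

# One row in the format shown above (definition text shortened for the example)
row = "aspirin-n-1\tmedicine.n.02\taspirin||used as an analgesic||\t{'aspirin-n-1': 'medicine.n.02'}"

wnid, wncat, sense_def, target = row.split('\t')
senses = ast.literal_eval(target)  # parses the dict literal without running code
print(senses)  # {'aspirin-n-1': 'medicine.n.02'}
```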

### Shuffle the data, extract features, and build a lookup dictionary for the filter

In [5]:
import random
random.shuffle(raw_data)

featuresets = []
dict_filter = defaultdict(lambda: None)

for line in raw_data:
    feature_label = feature_generate(line)
    featuresets.append(feature_label)
    dict_filter[str(feature_label)] = (line[0], eval(line[3]))

featuresets[:1]

[({'anywhere': 0.006294748215921136,
   'biggest': 0.006030375237563146,
   'cosmos': 0.00666749506573805,
   'creation': 0.004758748147592026,
   'everything': 0.004315281965155662,
   'evolution': 0.0053939095960928865,
   'existence': 0.007846905087837993,
   'exists': 0.0053939095960928865,
   'in': 0.008709240916855763,
   'macrocosm': 0.007304942511847839,
   'of': 0.01599822512316116,
   'study': 0.0037213643314462064,
   'that': 0.01542966146251423,
   'the': 0.03786717480521369,
   'they': 0.006898331523925011,
   'tree': 0.0038166947785257055,
   'universe': 0.009635609393287801,
   'world': 0.011246831901807492},
  'natural_object.n.01')]

In [6]:
dict_filter

defaultdict(<function __main__.<lambda>>,
            {"({'universe': 0.009635609393287801, 'existence': 0.007846905087837993, 'creation': 0.004758748147592026, 'world': 0.011246831901807492, 'cosmos': 0.00666749506573805, 'macrocosm': 0.007304942511847839, 'everything': 0.004315281965155662, 'that': 0.01542966146251423, 'exists': 0.0053939095960928865, 'anywhere': 0.006294748215921136, 'they': 0.006898331523925011, 'study': 0.0037213643314462064, 'the': 0.03786717480521369, 'evolution': 0.0053939095960928865, 'of': 0.01599822512316116, 'biggest': 0.006030375237563146, 'tree': 0.0038166947785257055, 'in': 0.008709240916855763}, 'natural_object.n.01')": ('existence-n-2',
              {'existence-n-1': 'state.n.02',
               'existence-n-2': 'natural_object.n.01'}),
             "({'juggle': 0.007512316454904619, 'deal': 0.021193944720575838, 'with': 0.006796710083389849, 'simultaneously': 0.003291735531723368, 'she': 0.0069126728466864466, 'had': 0.001368109883122972, 'to': 0.004

## Split 9:1: 90% for training, 10% for testing

In [7]:
split_point = len(featuresets)*9//10
train, test = featuresets[:split_point], featuresets[split_point:]
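
The notebook shuffles with the global random state, so the split (and therefore the accuracy figures) can change from run to run. A minimal sketch of a reproducible 9:1 split is shown here; the helper name `split_train_test` and the `seed` parameter are assumptions, not part of the notebook.

```python
import random

def split_train_test(items, seed=0):
    """Shuffle a copy of items with a fixed seed and split it 9:1."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    split_point = len(shuffled) * 9 // 10
    return shuffled[:split_point], shuffled[split_point:]

train_demo, test_demo = split_train_test(range(20))
print(len(train_demo), len(test_demo))  # 18 2
```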

In [8]:
from nltk.classify import SklearnClassifier 
from sklearn.linear_model import LogisticRegression
sklearn_classifier = SklearnClassifier(LogisticRegression(C=10e5)).train(train)

In [9]:
sklearn_classifier

<SklearnClassifier(LogisticRegression(C=1000000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False))>

### Without filter

In [10]:
import nltk
nltk.classify.accuracy(sklearn_classifier, test)

0.6310344827586207
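
The accuracy reported here is the fraction of test examples whose predicted category equals the gold category. The sketch below spells that out in plain Python; `accuracy`, `toy_predict`, and `toy_test` are hypothetical stand-ins and do not use the trained classifier.

```python
def accuracy(predict, labelled_examples):
    """Fraction of (features, label) pairs for which predict(features) == label."""
    correct = sum(1 for features, label in labelled_examples if predict(features) == label)
    return correct / len(labelled_examples)

# Hypothetical stand-in for a classifier: decides from a single feature
def toy_predict(features):
    return "a" if features["x"] else "b"

toy_test = [({"x": 1}, "a"), ({"x": 0}, "b"), ({"x": 1}, "b"), ({"x": 0}, "b")]
print(accuracy(toy_predict, toy_test))  # 0.75
```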

### With filter

In [11]:
rank = sklearn_classifier.prob_classify(test[0][0])._prob_dict

In [12]:
def get_candidates(test_set, raw_data):
    # Rebuild each row's features and return the row that matches test_set,
    # together with its parsed sense dictionary (None if no row matches)
    for wnid, wncat, senseDef, target in raw_data:
        word_tfidf = {w: tfidf(w, wncat) for w in words(senseDef)}
        if test_set == (word_tfidf, wncat):
            return (word_tfidf, wncat, eval(target))

In [13]:
test[0]

({'advantages': 0.002972520794449253,
  'agree': 0.0024282477034199314,
  'convert': 0.0028870865616447963,
  'convince': 0.0037274949898293843,
  'convinced': 0.002814878708749388,
  'customers': 0.0024583825046658606,
  'finally': 0.020773334854910272,
  'had': 0.0011699188297840766,
  'he': 0.0033070874340455263,
  'his': 0.005493807357991882,
  'make': 0.01216397364962673,
  'of': 0.006874457816791642,
  'or': 0.011301998220827328,
  'product': 0.002120206885832057,
  'realize': 0.0027523517074117218,
  'several': 0.001629465964695885,
  'someone': 0.001463261359135597,
  'something': 0.006027507410228358,
  'the': 0.016065893490391787,
  'truth': 0.0040040423799562515,
  'understand': 0.002063889025054683,
  'validity': 0.00264791793848511,
  'win_over': 0.0037274949898293843},
 'induce.v.02')

In [15]:
corrections = len(test)
hits = 0
for feature, cat in test:
    output = sklearn_classifier.prob_classify(feature)._prob_dict
    candidates = dict_filter[str((feature, cat))]
    output = [ (ped_cat, prob) for ped_cat, prob in output.items() if ped_cat in candidates[1].values() ]
    if not output: continue
    # pick the most probable category among the candidates
    output_class = sorted(output, key=lambda x: x[1], reverse=True)[0][0] # output is [(cat, prob), (cat, prob), ...]
    # print(output_class)
    if output_class == cat:
        hits += 1

In [16]:
print(hits/corrections)

0.7991379310344827
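
The filtering step can be isolated as a small function: keep only the categories that are possible senses of the target word, then take the most probable one. This is a sketch with made-up probabilities; the function name `filtered_prediction` and the values in `probs` are assumptions, while the sense dictionary reuses the "existence" entry shown in the `dict_filter` output.

```python
def filtered_prediction(probabilities, candidate_categories):
    """Return the most probable category among the candidates, or None if none is scored."""
    allowed = {c: p for c, p in probabilities.items() if c in candidate_categories}
    if not allowed:
        return None
    return max(allowed, key=allowed.get)

# Hypothetical classifier output for one test example
probs = {"state.n.02": 0.30, "natural_object.n.01": 0.25, "event.n.01": 0.45}
senses = {"existence-n-1": "state.n.02", "existence-n-2": "natural_object.n.01"}

print(max(probs, key=probs.get))                         # event.n.01 (unfiltered)
print(filtered_prediction(probs, set(senses.values())))  # state.n.02 (filtered)
```

In the loop above, examples for which no candidate survives are skipped with `continue` but still count in the denominator `len(test)`, so they are scored as misses.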
