## Lab05
### Use Lesk algorighmn to classify words into right category

Lesk Algorithm – Disambiguating EVP entries to WordNet <br>
Recall that LA relies on the number of shared words to determine the intended sense of a
given context. It is not uncommon to have a tie with two or more senses having the same
number of shared words with the context. In order to break the tie, people have used
the weight such as idf (inverse document frequency) of word to break a tie and even to
improve the performance. Therefore, in the training phrase (preprocessing), we read the
dataset file and compute the following information:
- Document frequency (df): the number of word sense categories where a certain
word appears in. For instance, the word money appears in 80 categories including
get.v.01. Let D be the total number of categories. Then id f = d f /D.
- Within document term frequency: For each defining/example word, compute the
WordNet sense categories the word appears in.


In [1]:
# TF: 指某個字詞在該檔案中出現的次數/該檔案所有的字數量
# IDF: df取倒數後對其做log10
# - tf-idf: tf * idf
## Step 1 : 計算senseDef裡的TF-IDF
# - tf ：在字詞在各類 wncat 出現的次數/各類 wncat所有的字詞數量
# - df ：在字詞出現在幾個 wncat 中/ wncat類別總數
# - idf：df 倒數取log10
## Step 2: split data into 9:1 and train model

In [2]:
#from nltk.corpus import wordnet as wn
import re
from collections import defaultdict, Counter
from nltk.stem.wordnet import WordNetLemmatizer
lmtzr = WordNetLemmatizer()
import nltk, random

# 所有feature都分割成一個字
def words(text): return re.findall(r'\w+', text.lower())
    
TF = defaultdict(lambda: defaultdict(lambda: 0))
DF = defaultdict(lambda: [])
wncat_count = defaultdict(lambda: 0)

#詞性簡寫
def wnTag(pos): return {'noun': 'n', 'verb': 'v', 'adjective': 'a', 'adverb': 'r'}[pos]

def isHead(head, word, tag):
    try:
        return lmtzr.lemmatize(word, tag) == head  #詞型還原
    except:
        return False
training = [line.strip().split('\t') for line in open(r'C:/Users/asus/Downloads/nlp/Lab5/wn.in.evp.cat.txt', 'r', encoding = 'utf8') if line.strip() != '' ]
for wnid, wncat, senseDef, target in training:
    head, pos = wnid.split('-')[:2] #get單字 詞性
    for word in words(senseDef): #把所有單字切割
        if word != head and not isHead(head, word, pos): #單字或單字原型不存在時
            TF[word][wncat] += 1 #[單字原型][類別answer]+1     #該字詞在每一類出現次數
            DF[word] += [] if wncat in DF[word] else [wncat] #該字詞出現在那些類別
            wncat_count[wncat]+=1

In [3]:
##TFIDF 
# tf ：字詞在各類 wncat 出現的次數/各類 wncat所有的字詞數量
# df ：字詞出現在幾個 wncat中/ wncat類別總數  
# idf：df 倒數取log10
import math
def tfidf(word, wncat):
    tf = TF[word][wncat]/wncat_count[wncat]
    df = (len(DF[word]))+1/len(wncat_count)
    idf = math.log10(1/df)
    return tf*idf
## testing
print("TFIDF of (parking, get_rid_of.v.01) is", tfidf("parking", "get_rid_of.v.01"))
print("TFIDF of (lot, get_rid_of.v.01) is", tfidf("parking", "get_rid_of.v.01"))  

##取得每個字的TF-IDF
#將training每一行的第二個(senseDef)切割, 計算每個字的TFIDF
#{'cucumber': -0.002007382871432638, 'shaped': -0.010034365456100688...}
def feature_format(line):
    feature_dict={}
    for word in words(line[2]): 
        feature_dict.update({word: tfidf(word, line[1])}) #format to dictionary
    return (feature_dict, line[1]) 

## Get feature
feature = [feature_format(line) for line in training]
print("feature[0]: ", feature[0])  #多個字詞會對應到一個分類

TFIDF of (parking, get_rid_of.v.01) is -0.010101166546613746
TFIDF of (lot, get_rid_of.v.01) is -0.010101166546613746
feature[0]:  ({'forsake': -0.004819928076979694, 'old': -0.11958360613925677, 'in': -0.09506174447760563, 'abandon': -0.0, 'car': -0.03928251378163691, 'leave': -0.015720270738149425, 'we': -0.04687165711478301, 'parking': -0.010101166546613746, 'the': -0.230895736715218, 'behind': -0.016071399781723568, 'empty': -0.011577162612865805, 'abandoned': -0.0, 'lot': -0.016182463266751875}, 'get_rid_of.v.01')


In [4]:
import nltk, random
from nltk.probability import DictionaryProbDist as D  
from nltk.classify import SklearnClassifier 
from sklearn.linear_model import LogisticRegression

# split the feature set into 9:1
split_ratio = int(len(feature)*9/10)
train, test = feature[:split_ratio], feature[split_ratio:]

# train with SKLearn
sklearn_classifier = SklearnClassifier(LogisticRegression(C=10e5)).train(train)

In [73]:
print("=====Sklearn accuracy=====")
print(nltk.classify.accuracy(sklearn_classifier, test))  #sklearn accuracy

=====Sklearn accuracy=====
0.5344827586206896


In [71]:
## todo
## compare the answer and the predict result
# 剩下的10% test計算correct的次數
# accuracy= correct/number of test

correct=0
for test_feature, result in test:
    #用 test_feature預測結果 看跟result是否相同
    chk_result = sklearn_classifier.prob_classify(test_feature)._prob_dict
    # 機率最高的最為結果
    chk_result = sorted(chk_result.items(), key= lambda x: -x[1])[0][0]
    if result == chk_result:
        correct= correct+1  # count correct
print(correct)
print("=======Accuracy======")
print(correct/ len(test))

1240
0.5344827586206896


In [None]:
print(sorted(rank.items(), key=lambda d: d[1], reverse=True)[0][0])

test_len = len(test)
correct = 0
for feature, answer in test:
    candidates = sklearn_classifier.prob_classify(feature)._prob_dict
    predict = str(sorted(candidates.items(), key=lambda d: d[1], reverse=True)[0][0])
    if answer == predict:
        correct += 1

print('accurency is',correct/test_len)

import nltk
nltk.classify.accuracy(sklearn_classifier, test)