# Recommendation Using TF-IDF-Weighted Word Embeddings

**Steps** <br>
1. Fit a TF-IDF model
2. Build a dictionary with each word as key and its IDF as value
3. Get the TF-IDF feature names
4. Combine the pretrained word embeddings with the TF-IDF weights
5. Calculate cosine similarity
6. Recommend laws
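
The steps above can be sketched end to end on toy data. This is only a minimal sketch, not the notebook's actual pipeline: the two-dimensional `toy_embeddings`, the `toy_idf` values, and the helper name `weighted_doc_vector` are made up for illustration. It also visits each distinct word once, whereas the notebook's loop visits every occurrence.

```python
import numpy as np

# Hypothetical toy inputs; the notebook uses 400-dimensional vectors and fitted IDF values.
toy_embeddings = {
    "bid": np.array([1.0, 0.0]),
    "vendor": np.array([0.0, 1.0]),
}
toy_idf = {"bid": 1.0, "vendor": 3.0}


def weighted_doc_vector(tokens, embeddings, idf, dim):
    """Average the vectors of the distinct known tokens, weighted by TF * IDF."""
    doc_vect = np.zeros(dim)
    weight_sum = 0.0
    for word in set(tokens):
        if word in embeddings and word in idf:
            weight = idf[word] * tokens.count(word) / len(tokens)
            doc_vect += embeddings[word] * weight
            weight_sum += weight
    # A document with no known words stays at the zero vector.
    return doc_vect / weight_sum if weight_sum else doc_vect


doc = weighted_doc_vector(["bid", "vendor"], toy_embeddings, toy_idf, dim=2)
print(doc)
```

Because "vendor" has the higher IDF in this toy setup, the resulting vector leans toward its embedding.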

### Load ckiptagger & DataFrame

In [1]:
import model_building
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
from ckiptagger import data_utils, construct_dictionary, WS, POS, NER
import datetime
import pickle

path = "./data"
ws = WS(path)

df = pd.read_csv('data_etl_step3_noPuncDict.csv')
# Replace '@' with ' ' in original dataframe
df.token = df.token.apply(lambda text: str(text).replace('@',' '))

### Import Word Dictionary

In [2]:
# dictionary
dict_path = './dictionary'
legal_name_file = dict_path + '/name_of_legal.txt'
word_file = dict_path + '/oth_words.txt'
split_rule_kw_file = dict_path + '/split_rule_words.txt'

with open(legal_name_file, 'r', encoding='big5') as k1, open(word_file, 'r', encoding='big5') as k2:
    k = k1.read().split('\n') + k2.read().split('\n')
    word_to_weight = dict([(_, 1) for _ in k])
word_dict = construct_dictionary(word_to_weight)

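### Read Pretrained Word Embeddings
The word vectors were trained on Chinese Wikipedia; the latest Chinese Wikipedia dump, which serves as the full training text, can be downloaded [here](https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2).<br>
Wikipedia 2014 (vocabulary size: 655K, 400-dimensional word vectors, download size 2.5 GB)<br>
Source: [Yuan Ze University Natural Language Processing Lab](http://nlp.innobic.yzu.edu.tw/demo/word-embedding.html)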
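
Word-vector text files of this kind typically hold one word per line followed by its coefficients, and word2vec-style files often start with a header line giving the vocabulary size and dimension. Whether `wiki.zh.vector` has such a header is an assumption here, and the `sample` data and the helper name `load_vectors` are hypothetical. A minimal sketch of a parser that tolerates an optional header:

```python
import io

# Hypothetical three-line sample: a "vocab_size dim" header, then one word per line.
sample = io.StringIO("2 3\n法律 0.1 0.2 0.3\n採購 0.4 0.5 0.6\n")


def load_vectors(handle):
    """Parse 'word v1 v2 ...' lines, skipping blank lines and an optional header."""
    vectors = {}
    for line in handle:
        values = line.split()
        if not values:
            continue
        if len(values) == 2 and all(v.isdigit() for v in values):
            continue  # looks like a "vocab_size dim" header, not a word entry
        vectors[values[0]] = [float(v) for v in values[1:]]
    return vectors


vectors = load_vectors(sample)
print(len(vectors), len(vectors["法律"]))
```

The header check would misfire on one-dimensional vectors whose word is a number, which is not a concern for 400-dimensional embeddings.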

In [3]:
# https://ithelp.ithome.com.tw/articles/10194633
embeddings = {}
with open('wiki.zh.vector', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings[word] = coefs

### TF-IDF for Tokenized Text in the DataFrame

In [4]:
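# TF-IDF model
# Note: with the default token_pattern, TfidfVectorizer drops single-character tokens
tfidf_ml = TfidfVectorizer()
tfidf_ml.fit(df.token)

# Feature names (get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from 1.0)
feature_names = tfidf_ml.get_feature_names_out()

# TF-IDF dictionary: word -> IDF
dictionary = dict(zip(feature_names, tfidf_ml.idf_))

# Feature names as a set for fast membership tests
tfidf_feature = set(feature_names)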
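
With default settings, the `idf_` values stored in `dictionary` follow scikit-learn's smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing t. A minimal pure-Python sketch on a made-up corpus (the `docs` list and the helper name `smooth_idf` are hypothetical) shows how rarer words receive larger weights:

```python
import math

# Hypothetical tokenized corpus of three documents.
docs = [["bid", "vendor"], ["bid", "contract"], ["bid"]]


def smooth_idf(docs):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1 for each word."""
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs:
        for word in set(tokens):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}


idf = smooth_idf(docs)
print(idf)
```

A word that appears in every document ("bid") ends up with the minimum IDF of 1, while words found in a single document get a higher value.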

### Preprocessing Function for Newly Entered Text
- Remove punctuation
- Remove spaces
- Segment into words
- Flatten into a list
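
The first two steps can be tried in isolation, without loading the ckiptagger model. The sketch below reuses the same character class as the notebook's function; the helper name `strip_punct_and_spaces` and the sample sentence are made up for illustration, and the segmentation step is left out on purpose.

```python
import re

# Keep ASCII letters, digits, and CJK characters in the U+4E00..U+9FA5 range.
NON_WORD = re.compile(r'[^a-zA-Z0-9\u4e00-\u9fa5]')


def strip_punct_and_spaces(text):
    """Replace punctuation with spaces, then remove every space."""
    text = NON_WORD.sub(' ', str(text))
    return re.sub(' +', '', text)


cleaned = strip_punct_and_spaces('開標時，有作「拒絕往來廠商」調查！ OK?')
print(cleaned)
```

Full-width punctuation such as `，` and `「」` falls outside the kept ranges, so it is removed along with ASCII punctuation.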

In [5]:
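def Preprocess(text):
    # Replace everything except ASCII letters, digits, and CJK characters with a space
    rule = re.compile(r'[^a-zA-Z0-9\u4e00-\u9fa5]')
    text = rule.sub(' ', str(text))
    # Remove all spaces
    text = re.sub(' +', '', text)
    # Segment into words with ckiptagger, using the custom dictionary
    sentences = ws([text], sentence_segmentation=True, recommend_dictionary=word_dict)
    # Flatten the per-sentence token lists into a single list
    return [token for sentence in sentences for token in sentence]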

### Calculate TF-IDF Weighted Word Embedding

In [6]:
starttime = datetime.datetime.now()

# TF-IDF weighted Word2Vec
tfidf_text_vect = []  # one TF-IDF weighted vector per document

for text in df.token.apply(lambda text: text.split()):
    text_vect = np.zeros(400)  # embedding dimension
    weight_sum = 0
    for word in text:
        if word in embeddings and word in tfidf_feature:
            vec = embeddings[word]
            # weight = IDF * term frequency within this document
            tf_idf = dictionary[word] * (text.count(word) / len(text))
            text_vect += vec * tf_idf
            weight_sum += tf_idf
    if weight_sum != 0:
        text_vect /= weight_sum
    tfidf_text_vect.append(text_vect)

# calculate running time
endtime = datetime.datetime.now()
print("Model build time: ", endtime - starttime)

Model build time:  0:04:03.024476


### Law Recommendation Function
Given input text, return the laws associated with the ten most similar texts.
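
The lookup boils down to ranking stored document vectors by cosine similarity to the query vector and keeping the best k. A minimal NumPy sketch on toy two-dimensional vectors follows; the helper name `top_k_similar` and the `toy_docs` data are hypothetical, and the zero-vector guard is an added assumption for texts that contain no known words.

```python
import numpy as np


def top_k_similar(query, doc_vectors, k=10):
    """Return indices and cosine similarities of the k most similar documents."""
    docs = np.asarray(doc_vectors, dtype=float)
    query = np.asarray(query, dtype=float)
    norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(query)
    # Zero vectors get similarity 0 instead of a division by zero.
    sims = np.divide(docs @ query, norms, out=np.zeros(len(docs)), where=norms > 0)
    order = np.argsort(sims)[::-1][:k]
    return order, sims[order]


toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
idx, scores = top_k_similar([1.0, 0.0], toy_docs, k=2)
print(idx, scores)
```

Comparing the query against the stored vectors only needs one row of similarities, not the full pairwise matrix.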

In [7]:
def recommend_law(text, tfidf_text_vect=tfidf_text_vect):
    text = Preprocess(text)
    text_vect = np.zeros(400)  # w2v size
    weight_sum = 0
    for word in text:
        if word in embeddings and word in tfidf_feature:
            vec = embeddings[word]
            tf_idf = dictionary[word] * (text.count(word) / len(text))
            text_vect += vec * tf_idf
            weight_sum += tf_idf
    if weight_sum != 0:
        text_vect /= weight_sum
    # Compare the query only against the stored document vectors
    # (avoids building the full pairwise similarity matrix)
    cos_sim = cosine_similarity([text_vect], tfidf_text_vect)[0]
    top_10_idx = np.argsort(cos_sim)[::-1][:10]
    # .copy() so that adding a column does not write into a slice of df
    tmp_top_10_law = df[['法規名稱','條','事實&改進建議']].iloc[top_10_idx].copy()
    tmp_top_10_law['similarity_score'] = [round(score*100, 1) for score in cos_sim[top_10_idx]]
    return tmp_top_10_law

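### Try an Example
Enter some text and the function recommends suitable laws.<br>
(The CE_Comment listed here serves only to check whether the input text and the original text are actually similar.)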

In [8]:
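starttime = datetime.datetime.now()

# Roughly: "a check for debarred vendors was carried out at bid opening"
newtext = '開標時有作拒絕往來廠商調查'
result = recommend_law(newtext)

# calculate running time
endtime = datetime.datetime.now()
print("Search and recommendation time: ", endtime - starttime)

Search and recommendation time:  0:00:03.189176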


In [10]:
result.drop(columns = ['事實&改進建議'])

| | 法規名稱 | 條 | similarity_score |
|---|---|---|---|
| 6345 | 投標廠商資格與特殊或巨額採購認定標準 | 4 | 69.3 |
| 15015 | 政府採購法 | 103 | 69.2 |
| 11178 | 政府採購法 | 103 | 67.7 |
| 13418 | 政府採購法 | 50 | 67.4 |
| 13476 | 政府採購法 | 103 | 67.4 |
| 14369 | 政府採購法 | 50 | 66.2 |
| 13536 | 政府採購法 | 50 | 66.1 |
| 13549 | 政府採購法 | 50 | 65.8 |
| 5791 | 投標廠商資格與特殊或巨額採購認定標準 | 4 | 65.6 |
| 12344 | 政府採購法 | 50 | 65.2 |


### Evaluation of Different Recommendation Systems

In [11]:
from glob import glob
trytext = pd.read_excel('原始意見及定稿意見彙整表_v3.xlsx')
trytext = trytext.drop([trytext.index[1],trytext.index[2],trytext.index[33]])
trytext.reset_index(drop=True, inplace=True)
trytext['條'] = trytext['條'].astype(int).apply(str)

### Import Recommendation Results & Calculate Scores
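Of the 46 cases provided by the procurement section (excluding "administrative negligence" cases), each case earns +1 if the correct legal article is among the recommendations; the total is then expressed as a percentage.<br>
e.g. if 23 of the 46 cases are recommended correctly -> score = 23/46 * 100 = 50 (points)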
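
The scoring rule can be illustrated on a tiny made-up example. This is only a sketch: the helper name `hit_rate` and the `expected` and `recommended` data are hypothetical, and the sketch matches each (law name, article) pair as a unit, which is stricter than checking the law name and the article number separately.

```python
def hit_rate(expected, recommended):
    """Percentage of cases whose expected (law, article) pair was recommended."""
    hits = sum(1 for exp, recs in zip(expected, recommended) if exp in recs)
    return round(hits / len(expected) * 100, 1)


# Hypothetical ground truth for two cases, and the recommendations made for each.
expected = [("政府採購法", "50"), ("政府採購法", "103")]
recommended = [
    {("政府採購法", "50"), ("政府採購法", "101")},  # contains the expected pair
    {("政府採購法", "22")},                         # misses the expected pair
]

score = hit_rate(expected, recommended)
print(score)
```

With one hit out of two cases, the sketch yields a score of 50 points, mirroring the 23/46 example.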

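**Data description:**<br>
cut --> paragraph-segmented data provided by 政雲<br>
d2, d3 --> word-segmented data provided by Bonzo (d2 and d3 are versions produced by two different segmentation methods)

**The results below show that the best-performing approach combines the Yuan Ze University word vectors with the paragraph-segmented data.**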

In [13]:
# Calculate Scores
scores = []
for rec_file in glob("rec*.xlsx"):  # separate name so the DataFrame `df` is not overwritten
    print(rec_file)
    xls = pd.ExcelFile(rec_file)
    i = 0
    score = 0
    shtnames = [name for idx, name in enumerate(xls.sheet_names) if idx not in [1,2,33]]
    for shtname in shtnames: # iterate over the remaining sheets
        sht = xls.parse(shtname) # read a specific sheet into a DataFrame
        sht.iloc[2] = sht.iloc[2].apply(str)
        # Check whether the actual law name and article number both appear in the predictions
        if trytext['法規名稱'][i] in set(sht.iloc[:, 0]) and trytext['條'][i] in set(sht.iloc[:,1]):
            score += 1
        i += 1
    scores.append(score)
    print("Score: ", round(score/46*100,1),"points\n")

rec_元智_cut.xlsx
Score:  56.5 points

rec_元智_d2.xlsx
Score:  54.3 points

rec_元智_d3.xlsx
Score:  54.3 points

