### Recommendation Using TF-IDF weighted Word2Vec
References: [TF-IDF weighted Word2Vec](https://medium.com/analytics-vidhya/featurization-of-text-data-bow-tf-idf-avgw2v-tfidf-weighted-w2v-7a6c62e8b097), 
[Chinese version](https://renxingkai.github.io/2019/04/05/word-tfidf/?fbclid=IwAR3kWsXJq-SLMSqgUMG6y1ZSEUGtr_6MgLVr9USrUb981_3OjFqh7R_kMUs#TF-IDF%E5%8A%A0%E6%9D%83%E5%B9%B3%E5%9D%87%E8%AF%8D%E5%90%91%E9%87%8F) <br><br>
**Steps** <br>
1. Fit a TF-IDF model
2. Build a dictionary with each word as key and its IDF as value
3. Get the TF-IDF feature names
4. Create a Word2Vec model
5. Combine the Word2Vec vectors with the TF-IDF weights
6. Calculate cosine similarity
7. Recommend laws
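
Step 5 can be summarized in one formula. The notation here is introduced for illustration and does not appear in the notebook: $w_1, \dots, w_n$ are the tokens of a text, $\mathbf{v}_{w}$ is the Word2Vec vector of word $w$, and the sums run only over tokens that exist in both the Word2Vec vocabulary and the TF-IDF vocabulary.

$$
\mathbf{d} = \frac{\sum_{i=1}^{n} t_i \, \mathbf{v}_{w_i}}{\sum_{i=1}^{n} t_i},
\qquad
t_i = \mathrm{idf}(w_i) \cdot \frac{\mathrm{count}(w_i)}{n}
$$

Because the sum runs over token positions rather than unique words, a word that occurs several times in a text contributes once per occurrence. If no token is known to both models, the denominator is zero and the text vector is left as the zero vector.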

### Load ckiptagger & DataFrame

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
from ckiptagger import data_utils, construct_dictionary, WS, POS, NER
import datetime

# Directory containing the downloaded ckiptagger model files
path = "./data"
# Word segmenter
ws = WS(path)

df = pd.read_csv('data_ETL3noPuncDict.csv')
# Replace '@' with ' ' in original dataframe
df.token = df.token.apply(lambda text: text.replace('@',' '))

### TF-IDF for Tokenized Text in DataFrame

In [2]:
# TF-IDF Model
tfidf_ml = TfidfVectorizer()
tfidf_ml.fit(df.token)

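# TF-IDF dictionary: word -> IDF value
# (scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2)
dictionary = dict(zip(tfidf_ml.get_feature_names(), list(tfidf_ml.idf_)))

# Feature names (the TF-IDF vocabulary)
tfidf_feature = tfidf_ml.get_feature_names()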
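
To make the contents of `dictionary` concrete: with its default settings, `TfidfVectorizer` computes a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` the number of documents containing the word. The sketch below reproduces that formula in plain Python on a made-up three-document corpus. The helper `smooth_idf` and the toy data are assumptions for illustration, and the sketch splits on whitespace only, whereas the vectorizer's default `token_pattern` also drops tokens shorter than two characters.

```python
import math

def smooth_idf(docs):
    """Return {token: idf} using idf = ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        # Count each token at most once per document
        for token in set(doc.split()):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

# Made-up, already tokenized documents (tokens separated by spaces)
toy_docs = ["採購 廠商 投標", "廠商 資格", "採購 公告"]
toy_idf = smooth_idf(toy_docs)
```

Words that appear in fewer documents receive a larger IDF, so they later get more weight when the word vectors are averaged. Single-character tokens dropped by the default `token_pattern` never enter `tfidf_feature`, so the weighting loops skip them.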

### Preprocessing Function for Newly Entered Text
- Remove punctuation
- Remove spaces
- Segment into words with ckiptagger (sentence segmentation enabled)
- Flatten the result into a single list of tokens

In [12]:
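def Preprocess(text):
    # Replace everything except ASCII letters, digits, and CJK characters with a space
    rule = re.compile(r'[^a-zA-Z0-9\u4e00-\u9fa5]')
    text = rule.sub(' ',str(text))
    # Remove all spaces
    text = re.sub(' +', '',text)
    # Word-segment with ckiptagger; returns one token list per input string
    text = ws([text], sentence_segmentation=True)
    # Flatten into a single list of tokens
    text = [token for sentence in text for token in sentence]
    return text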
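
The two regular-expression steps in `Preprocess` can be tried without loading ckiptagger. The sketch below applies the same two substitutions to a made-up sentence. The helper name `strip_punct_and_spaces` and the sample string are assumptions for illustration.

```python
import re

# Same character class as in Preprocess: keep ASCII letters, digits, and CJK characters
rule = re.compile(r'[^a-zA-Z0-9\u4e00-\u9fa5]')

def strip_punct_and_spaces(text):
    text = rule.sub(' ', str(text))   # punctuation and other symbols -> space
    return re.sub(' +', '', text)     # then delete every run of spaces

# Made-up sample containing full-width punctuation and a space
sample = "依政府採購法第103條，廠商（A公司）不得 參加投標！"
cleaned = strip_punct_and_spaces(sample)
print(cleaned)
```

Because all spaces are deleted, adjacent English words would be merged as well (for example `"foo bar"` becomes `"foobar"`). That is harmless for Chinese input, where the segmenter decides the word boundaries afterwards.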

### Create Word2Vec Model

In [3]:
from gensim.models.word2vec import Word2Vec

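# Train on the whitespace-tokenized corpus (vector size defaults to 100)
w2v_model = Word2Vec(df.token.apply(lambda text: text.split()))
# gensim 3.x API; in gensim >= 4.0, wv.vocab is replaced by wv.key_to_index
w2v_vocab = list(w2v_model.wv.vocab)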
print(w2v_model)

Word2Vec(vocab=6182, size=100, alpha=0.025)
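
The cells below test `word in w2v_vocab` once per token, and `w2v_vocab` is a list, so each test scans up to all 6182 entries. A set gives the same answers with constant-time lookups on average. The sketch below uses a made-up stand-in vocabulary (`vocab_list` is hypothetical, not the real model vocabulary) to show that the two are interchangeable for membership tests.

```python
# Made-up stand-in for w2v_vocab, same size as the model's vocabulary
vocab_list = [f"word{i}" for i in range(6182)]
vocab_set = set(vocab_list)

tokens = ["word10", "missing", "word6181"]

# Identical results; the list scans linearly, the set hashes the token
hits_list = [t in vocab_list for t in tokens]
hits_set = [t in vocab_set for t in tokens]
print(hits_list, hits_set)
```

In the notebook this would amount to wrapping the vocabulary and the TF-IDF feature names in `set(...)` before the loops. How much of the build time this would save has not been measured here.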


### Calculate TF-IDF Weighted Word2Vec

In [4]:
starttime = datetime.datetime.now()

# TF-IDF weighted Word2Vec
tfidf_text_vect = [] # tfidf-w2v is stored in this list
row = 0

for text in df.token.apply(lambda text: text.split()):
    text_vect = np.zeros(100)
    weight_sum = 0
    for word in text:
        if word in w2v_vocab and word in tfidf_feature:
            vec = w2v_model.wv[word]
            tf_idf = dictionary[word]*(text.count(word)/len(text))
            text_vect += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        text_vect /= weight_sum
    tfidf_text_vect.append(text_vect)
    row += 1

# Calculate running time
endtime = datetime.datetime.now()
print("Model build time: ",endtime - starttime)

0:07:40.872726
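
The weighting in the loop above can be checked by hand on a tiny example. The sketch below repeats the same arithmetic with made-up 2-dimensional vectors and IDF values. The function `weighted_doc_vector` and all toy data are assumptions for illustration, with plain dictionaries standing in for the Word2Vec model and the TF-IDF dictionary.

```python
import numpy as np

def weighted_doc_vector(tokens, vectors, idf, dim):
    """TF-IDF weighted average of word vectors for one tokenized text."""
    doc_vect = np.zeros(dim)
    weight_sum = 0.0
    for word in tokens:
        # Skip tokens unknown to either the vector table or the IDF table
        if word in vectors and word in idf:
            weight = idf[word] * (tokens.count(word) / len(tokens))
            doc_vect += vectors[word] * weight
            weight_sum += weight
    if weight_sum != 0:
        doc_vect /= weight_sum
    return doc_vect

# Made-up vectors and IDF values
toy_vectors = {"採購": np.array([1.0, 0.0]), "廠商": np.array([0.0, 1.0])}
toy_idf = {"採購": 1.0, "廠商": 3.0}

doc = ["採購", "廠商", "未知詞"]  # "未知詞" has no vector, so it is skipped
vec = weighted_doc_vector(doc, toy_vectors, toy_idf, dim=2)
print(vec)
```

Here the weights are 1/3 for "採購" and 1 for "廠商", so the result is pulled toward the vector of the rarer word. Note that the unknown token still counts toward `len(tokens)`, as it does in the notebook's loop.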


### Law Recommendation Function
Input text --> returns the laws associated with the ten most similar texts

In [46]:
def recommend_law(text, tfidf_text_vect = tfidf_text_vect):
    # Tokenize the new text, then build its TF-IDF weighted vector as for the corpus
    text = Preprocess(text)
    text_vect = np.zeros(100) # w2v size
    weight_sum = 0
    for word in text:
        if word in w2v_vocab and word in tfidf_feature:
            vec = w2v_model.wv[word]
            tf_idf = dictionary[word]*(text.count(word)/len(text))
            text_vect += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        text_vect /= weight_sum
    # Append the new text's vector after the corpus vectors
    tmp_vect = [*tfidf_text_vect,text_vect]
    new_cos_sim = cosine_similarity(tmp_vect, tmp_vect)
    # The last row holds the new text's similarity to every vector;
    # after sorting, position 0 is the new text itself, so skip it
    last_row = new_cos_sim[new_cos_sim.shape[0]-1]
    top_10_idx = np.argsort(last_row)[::-1][1:11]
    sim_score = last_row[top_10_idx]
    tmp_top_10_law = df[['Ex_Tittle','CE_Item2','CE_Comment']].iloc[top_10_idx].copy()
    tmp_top_10_law['similarity_score'] = [round(score*100,1) for score in sim_score]
    return tmp_top_10_law
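
`recommend_law` computes the similarity of every vector to every other vector, although only the row for the new text is used. A lighter alternative is to compare the query vector against the corpus vectors directly. The sketch below does this with NumPy only. The helper `top_k_similar` and the toy vectors are assumptions for illustration and are not part of the notebook.

```python
import numpy as np

def top_k_similar(query_vect, doc_vects, k=10):
    """Return (indices, scores) of the k documents most cosine-similar to the query."""
    docs = np.asarray(doc_vects, dtype=float)
    query = np.asarray(query_vect, dtype=float)
    norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(query)
    # Zero vectors (texts with no known words) get similarity 0 instead of NaN
    sims = np.divide(docs @ query, norms, out=np.zeros(len(docs)), where=norms != 0)
    order = np.argsort(sims)[::-1][:k]
    return order, sims[order]

# Made-up 2-dimensional document vectors; the last one is an all-zero vector
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
idx, scores = top_k_similar([2.0, 0.0], toy_docs, k=2)
print(idx, scores)
```

With this approach the query is not part of the compared set, so there is no self-match to skip, and the returned indices can be passed straight to `df.iloc`. Calling scikit-learn's `cosine_similarity` with the query as the first argument and the corpus vectors as the second should likewise yield a single row of scores.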

### Try an Example
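Enter some text and the function recommends suitable laws.<br>
(The CE_Comment column is listed here only to check whether the input text and the original texts are actually similar.)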

In [45]:
newtext = '被列為拒絕往來廠商'  # "listed as a debarred (blacklisted) vendor"
recommend_law(newtext)

| | Ex_Tittle | CE_Item2 | CE_Comment | similarity_score |
|---|---|---|---|---|
| 2976 | 政府採購法 | 50 | 本案允許「共同投標」，開標前已查詢拒絕往來廠商名單，惟未於開標後對得標廠商之共同投標成員長泉... | 90.5 |
| 11178 | 政府採購法 | 103 | 依政府採購法第103條第1項，經刊登政府採購公報之拒絕往來廠商自刊登次日起1年或3年不得參加... | 89.4 |
| 3405 | 政府採購法 | 50 | 本案開標前所查詢之拒絕往來廠商名單，係自工商登記網頁查詢，非自採購主管機關工程會之政府電子採... | 88.7 |
| 13476 | 政府採購法 | 103 | 項次16─依政府採購法第103條第1項，經刊登政府採購公報之拒絕往來廠商自刊登次日起1年或3... | 88.7 |
| 2388 | 政府採購法 | 50 | 本案開標前未查詢拒絕往來廠商名單(依投標廠商投標封套廠商名稱，至政府採購資訊公告系統/拒絕往... | 88.3 |
| 13171 | 政府採購法 | 103 | 依政府採購法第103條第1項，經刊登政府採購公報之拒絕往來廠商自刊登次日起1年或3年不得參加... | 88.3 |
| 6345 | 投標廠商資格與特殊或巨額採購認定標準 | 4 | 招標公告「廠商資格摘要」登載「..非拒絕往來戶『或』最進三年內…，」與投標須知第13點(二)... | 88.0 |
| 9896 | 投標廠商資格與特殊或巨額採購認定標準 | 4 | 本採購招標公告附加說明之廠商信用證明載明「非拒絕往來戶『或』最近1年……」及廠商資格審查表五... | 86.4 |
| 7939 | 政府採購法 | 50 | 本案稽核文件查詢拒絕往來廠商名單之日期為101年11月3日，非本案開決標日期(101年10月... | 85.6 |
| 4458 | 政府採購法 | 50 | 本採購案未依規定上網查詢拒絕往來廠商名單，核未符政府採購法第50條之規定，建議爾後於開標前，... | 85.3 |
