# Word Embedding via word2vec
- [NLP study notes](https://hackmd.io/Gw4TgIwEwhaAGYAzOAWKBWYtIYMyxbICGeUATMAKbFRA)
- [Training Chinese word vectors with gensim](http://zake7749.github.io/2016/08/28/word2vec-with-gensim/)
- [TF-IDF keyword extraction](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer)
- [On word embeddings](http://ruder.io/secret-word2vec/)

In [1]:
from gensim.models import word2vec
import time

Using TensorFlow backend.


### Train word2vec model

In [2]:
# Load the training data (one pre-segmented sentence per line, tokens separated by whitespace)
sentences = word2vec.LineSentence('datas/merged-seg.txt')
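
`LineSentence` expects a plain-text file with one sentence per line and tokens separated by whitespace, so the corpus has to be segmented before training. A minimal sketch of that layout follows. It uses a hypothetical two-line sample and a hypothetical helper `read_line_sentences` instead of the real `datas/merged-seg.txt`, and it only mimics the file format, not gensim's implementation.

```python
import os
import tempfile

def read_line_sentences(path):
    """Yield one token list per line, splitting on whitespace.

    Hypothetical helper that mimics the file layout LineSentence reads.
    """
    with open(path, encoding='utf-8') as f:
        for line in f:
            tokens = line.split()
            if tokens:  # skip blank lines
                yield tokens

# Hypothetical sample corpus: already segmented, one sentence per line.
sample = '數學 是 一門 學科\n數學家 研究 純數學\n'
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write(sample)

corpus = list(read_line_sentences(tmp.name))
os.remove(tmp.name)
print(corpus)
```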

In [3]:
# Print the first two tokenized "sentences" (the loop stops after index 1)
for i, s in enumerate(sentences):
    print('')
    print(s)
    print('=' * 100)
    if i > 0:
        break


['歐幾', '裏', '得', '西元前', '三', '世紀', '的', '希臘', '數學家', '現在', '被', '認為', '是', '幾何', '之', '父', '此畫', '為', '拉斐爾', '的', '作品', '雅典', '學院', '數學', '是', '利用', '符號語言', '研究', '數量', '結構', '變化', '以及', '空間', '等概', '唸', '的', '一門', '學科', '從', '某種', '角度看', '屬於', '形式', '科學', '的', '一種', '數學', '透過', '抽象化', '和', '邏輯推理', '的', '使用', '由', '計數', '計算', '數學家', '們', '拓展', '這些', '概念', '對', '數學', '基本', '概', '唸', '的', '完善', '早', '在', '古埃及', '而', '在', '古希臘', '那裡', '有', '更', '為', '嚴謹', '的', '處理', '從', '那時', '開始', '數學', '的', '發展', '便', '持續', '不斷', '地', '小幅', '進展', '世紀', '的', '文藝復興', '時期', '致使', '數學', '的', '加速', '發展', '直至', '今日', '今日', '數學', '使用', '在', '不同', '的', '領域', '中', '包括', '科學', '工程', '醫學', '和', '經濟學', '等', '有時', '亦', '會', '激起', '新', '的', '數學', '發現', '並', '導致', '全新', '學科', '的', '發展', '數學家', '也', '研究', '純數學', '就是', '數學', '本身', '的', '實質性', '內容', '而', '不以', '任何', '實際', '應用', '為', '目標', '雖然', '許多', '研究', '以', '純數學', '開始', '但', '其', '過程', '中', '也', '發現', '許多', '應用', '之', '處', '詞源', '西方', '語言', '中', '數學', '一', '詞源', '自

In [None]:
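# Train the word2vec model

# !! Warning !!
# The code below takes roughly 30 minutes to run.
# Skip this cell if a trained model is already saved; run it only to retrain.
# Note: `size` is the gensim 3.x parameter name (gensim 4 renamed it to `vector_size`).
word2vec_model = word2vec.Word2Vec(sentences, size=250, workers=4)
word2vec_model.total_train_time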

In [None]:
# !! Warning !!
# The code below overwrites any previously saved model.

word2vec_model.save('models/word2vec_250.model.bin')

### Word segmentation with jieba

In [7]:
import jieba
jieba.set_dictionary('datas/dict/dict.txt.big')
jieba.load_userdict('datas/dict/edu_dict.txt')

Building prefix dict from /home/sunset/word_contest/datas/dict/dict.txt.big ...
Loading model from cache /tmp/jieba.u849ecfdca27003d306f39ca004b82b5b.cache
Loading model cost 1.203 seconds.
Prefix dict has been built succesfully.


In [8]:
import jieba.analyse

In [9]:
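# Sample input: a news excerpt about Trump's "fire and fury" remarks on North Korea
# and the Guam government's statement urging residents and tourists to stay calm
sentence = '美國總統川普發表「烈焰怒火」言論，誓言如果北韓不停手，就施以世人前所未見的猛烈攻擊。關島政府急急在翌日發表聲明，呼籲民眾和遊客冷靜、不用恐慌，當地並沒有即時危險。'

In [10]:
jieba.analyse.tfidf(sentence, topK=10)

['冷靜', '總統', '沒有', '危險', '言論', '即時', '川普', '遊客', '美國', '民眾']

In [11]:
jieba.analyse.textrank(sentence, topK=10)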

['遊客', '攻擊', '發表', '言論', '烈焰', '怒火', '世人', '關島', '政府', '總統']
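
The two extractors return different keywords because they score words differently. As a rough illustration of the TF-IDF side only, the sketch below multiplies each token's relative frequency by an inverse-document-frequency weight and keeps the top scorers. The function `tfidf_keywords`, the token list, and every IDF value are hypothetical; jieba loads its own IDF table and applies extra filtering that is not reproduced here.

```python
from collections import Counter

def tfidf_keywords(tokens, idf, default_idf, top_k=3):
    """Rank tokens by (relative term frequency) * (inverse document frequency)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    scores = {
        word: (count / total) * idf.get(word, default_idf)
        for word, count in counts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical tokens and IDF weights, for illustration only.
tokens = ['總統', '發表', '言論', '政府', '發表', '聲明']
idf = {'總統': 5.0, '發表': 1.0, '言論': 6.0, '政府': 3.0, '聲明': 4.0}
keywords = tfidf_keywords(tokens, idf, default_idf=4.0)
print(keywords)
```

In this toy setup the most frequent token, 發表, still ranks last because its assumed IDF is low, which is the behaviour TF-IDF is designed to produce for common words.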

### Test model

In [None]:
word2vec_model = word2vec.Word2Vec.load('models/word2vec_250.model.bin')

In [12]:
# Number of words in the vocabulary
len(word2vec_model.wv.vocab)

614202

In [13]:
# Map each word to its vocabulary index so that new tokenized sentences can be converted to ids
# (`wv.vocab` is the gensim 3.x API)
vocab = {k: v.index for k, v in word2vec_model.wv.vocab.items()}
print('Index of 男朋友:', vocab['男朋友'])

Index of 男朋友: 7472
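
Once the word-to-index mapping exists, converting a tokenized sentence to ids is a dictionary lookup, with some policy for words the model never saw. A minimal sketch, assuming a hypothetical helper `tokens_to_ids` and a hypothetical placeholder id of `-1` for unknown words; only the index of 男朋友 is taken from the output above, the other entries are made up.

```python
def tokens_to_ids(tokens, vocab, unknown_id=-1):
    """Map each token to its vocabulary index; unseen tokens get unknown_id."""
    return [vocab.get(token, unknown_id) for token in tokens]

# Hypothetical miniature vocabulary; only 7472 comes from the notebook output.
toy_vocab = {'男朋友': 7472, '可愛': 120, '的': 0}
ids = tokens_to_ids(['可愛', '的', '男朋友', '皮卡丘'], toy_vocab)
print(ids)
```

Dropping unknown tokens instead of mapping them to a placeholder is an equally reasonable choice; which one fits depends on the downstream model.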


#### Word similarity

In [None]:
# Cosine similarity between 可愛 (cute) and, in order:
# boyfriend, girlfriend, Pikachu, Einstein, Fourier
print(word2vec_model.wv.similarity('男朋友', '可愛'))
print(word2vec_model.wv.similarity('女朋友', '可愛'))
print(word2vec_model.wv.similarity('皮卡丘', '可愛'))
print(word2vec_model.wv.similarity('愛因斯坦', '可愛'))
print(word2vec_model.wv.similarity('傅立葉', '可愛'))
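
`similarity` returns the cosine of the angle between two word vectors: 1 for vectors pointing the same way, 0 for orthogonal ones. The sketch below spells out that formula on hypothetical 3-dimensional vectors (the real model uses 250 dimensions); the names and values are illustrative only.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from the trained model.
pikachu = [1.0, 2.0, 0.0]
cute = [2.0, 4.0, 0.0]      # same direction as pikachu, different length
fourier = [0.0, 0.0, 3.0]   # orthogonal to both

same_direction = cosine_similarity(pikachu, cute)
orthogonal = cosine_similarity(pikachu, fourier)
print(same_direction, orthogonal)
```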

In [None]:
# Analogy query: 國王 (king) - 男 (male) + 女 (female)
word2vec_model.wv.most_similar(positive=['國王', '女'], negative=['男'], topn=10)
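
The analogy query works by vector arithmetic: roughly, the unit-normalized positive vectors are added, the negative ones subtracted, and the remaining vocabulary is ranked by cosine similarity to the result, with the query words themselves excluded. The sketch below reproduces that idea on hypothetical 2-dimensional vectors whose axes loosely stand for "royalty" and "gender". The function `analogy` and all vectors are assumptions made for illustration, not gensim's code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    weighted = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, weight in weighted:
        for i, x in enumerate(unit(vectors[word])):
            target[i] += weight * x
    target = unit(target)
    exclude = set(positive) | set(negative)  # never return the query words
    scored = sorted(
        ((sum(a * b for a, b in zip(target, unit(vec))), word)
         for word, vec in vectors.items() if word not in exclude),
        reverse=True,
    )
    return [word for _, word in scored[:topn]]

# Hypothetical toy vectors: [royalty, gender].
toy_vectors = {
    '國王': [1.0, 1.0],    # king
    '男': [0.0, 1.0],      # male
    '女': [0.0, -1.0],     # female
    '皇后': [1.0, -1.0],   # queen
    '王子': [1.0, 0.5],    # prince
    '蘋果': [-1.0, 0.0],   # apple
}
result = analogy(toy_vectors, positive=['國王', '女'], negative=['男'])
print(result)
```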

In [None]:
word2vec_model.wv.most_similar('皮卡丘', topn=10)

#### Cosine similarity between two sets of words

In [None]:
# Rank every option by its similarity to each question word
questions = [['光電效應'], ['高斯分佈'], ['電'], ['原子彈'], ['可愛'], ['革命']]
options = ('皮卡丘', '愛因斯坦', '小智', '傅立葉', '孫文', '高斯', '女朋友')
for question in questions:
    print('Question', question)
    scores = [(word2vec_model.wv.n_similarity(question, [option]), option)
              for option in options]
    scores.sort(reverse=True)
    for score, option in scores:
        print('%15.4f (%s)' % (score, option))
    print('=' * 30)
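
`n_similarity` compares two word sets by averaging the vectors in each set and taking the cosine similarity of the two means. The sketch below shows that computation on hypothetical 2-dimensional vectors; `mean_vector`, `set_similarity`, and the vector values are illustrative assumptions.

```python
import math

def mean_vector(words, vectors):
    """Component-wise average of the vectors for the given words."""
    dim = len(vectors[words[0]])
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]

def set_similarity(words_a, words_b, vectors):
    """Cosine similarity between the mean vectors of two word sets."""
    u = mean_vector(words_a, vectors)
    v = mean_vector(words_b, vectors)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from the trained model.
toy_vectors = {
    '電': [1.0, 0.0],
    '光': [0.0, 1.0],
    '皮卡丘': [1.0, 1.0],
    '孫文': [-1.0, 1.0],
}
close = set_similarity(['電', '光'], ['皮卡丘'], toy_vectors)
far = set_similarity(['電', '光'], ['孫文'], toy_vectors)
print(close, far)
```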

#### Word Mover's Distance between two documents/sentences
- [A Linear Time Histogram Metric for Improved SIFT Matching](http://www.cs.huji.ac.il/~werman/Papers/ECCV2008.pdf)
- [Fast and Robust Earth Mover’s Distances](http://www.cs.huji.ac.il/~werman/Papers/ICCV2009.pdf)
- [From Word Embeddings To Document Distances](http://proceedings.mlr.press/v37/kusnerb15.html)

In [None]:
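# Segment each sentence with jieba, then print the WMD of every pair.
# A separate name avoids shadowing the training corpus `sentences`.
raw_sentences = [
    '蘋果從樹上掉下來',  # An apple falls from the tree
    '從直升機上跳下來',  # Jumping down from a helicopter
    '蘋果是長在樹上的',  # Apples grow on trees
]
docs = [list(jieba.cut(s)) for s in raw_sentences]
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        s1 = docs[i]
        s2 = docs[j]
        print('%-35s vs. %-35s => %f' % (s1, s2, word2vec_model.wv.wmdistance(s1, s2)))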
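
Word Mover's Distance measures how far the words of one document must "travel" in embedding space to become the words of the other. `wmdistance` solves the full transport problem; the sketch below computes only the relaxed lower bound described in the Kusner et al. paper, where every token simply moves to its nearest counterpart. It is a simplification, not what gensim computes, and the helper names and 2-dimensional vectors are hypothetical.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def one_way_cost(doc_a, doc_b, vectors):
    """Each token in doc_a sends its equal share of mass to its nearest token in doc_b."""
    return sum(
        min(euclidean(vectors[a], vectors[b]) for b in doc_b)
        for a in doc_a
    ) / len(doc_a)

def relaxed_wmd(doc_a, doc_b, vectors):
    """Relaxed WMD: the larger of the two one-way costs (a lower bound on WMD)."""
    return max(one_way_cost(doc_a, doc_b, vectors),
               one_way_cost(doc_b, doc_a, vectors))

# Hypothetical toy vectors and pre-segmented documents.
toy_vectors = {
    '蘋果': [0.0, 0.0], '樹': [0.0, 1.0], '掉': [1.0, 0.0],
    '跳': [1.0, 1.0], '直升機': [3.0, 4.0],
}
apple_falls = ['蘋果', '樹', '掉']
helicopter_jump = ['直升機', '跳']
apple_grows = ['蘋果', '樹']

near = relaxed_wmd(apple_falls, apple_grows, toy_vectors)
far = relaxed_wmd(apple_falls, helicopter_jump, toy_vectors)
print(near, far)
```

With these toy values the two apple sentences end up much closer to each other than either is to the helicopter sentence, which is the kind of ordering the distance is meant to capture.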

#### Find the word that doesn't match

In [None]:
word2vec_model.wv.doesnt_match(['早餐', '午餐', '美食', '電視', '晚餐'])
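
One way to find the odd word out, and roughly what `doesnt_match` does, is to average the unit-normalized vectors of all the words and return the word least similar to that average. A minimal sketch on hypothetical 2-dimensional vectors; `odd_one_out` and the values are illustrative assumptions.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word whose vector is least similar to the group mean."""
    normed = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(normed.values())))
    mean = unit([
        sum(vec[i] for vec in normed.values()) / len(normed)
        for i in range(dim)
    ])
    return min(words, key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))

# Hypothetical toy vectors: three meals cluster together, television does not.
toy_vectors = {
    '早餐': [1.0, 0.1],   # breakfast
    '午餐': [1.0, 0.0],   # lunch
    '晚餐': [1.0, -0.1],  # dinner
    '電視': [0.0, 1.0],   # television
}
odd = odd_one_out(['早餐', '午餐', '電視', '晚餐'], toy_vectors)
print(odd)
```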

#### Predict center word probability

In [None]:
word2vec_model.predict_output_word(list(jieba.cut('警察在深夜攻堅敵人總部')))
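
`predict_output_word` takes the context words and returns candidate center words with probabilities. Conceptually, the context vectors are averaged, scored against an output vector for every vocabulary word, and the scores are normalized with a softmax. The sketch below illustrates that idea only; the function `predict_center`, the separate input/output tables, and all vectors are hypothetical, and details of gensim's implementation (such as its reliance on negative-sampling weights) are not reproduced.

```python
import math

def predict_center(context, input_vectors, output_vectors, topn=2):
    """Softmax over dot products of the mean context vector with output vectors."""
    known = [w for w in context if w in input_vectors]
    if not known:
        return []  # nothing to predict from
    dim = len(input_vectors[known[0]])
    mean = [sum(input_vectors[w][i] for w in known) / len(known)
            for i in range(dim)]
    scores = {
        word: math.exp(sum(a * b for a, b in zip(mean, out)))
        for word, out in output_vectors.items()
    }
    total = sum(scores.values())
    ranked = sorted(((s / total, w) for w, s in scores.items()), reverse=True)
    return [(w, p) for p, w in ranked[:topn]]

# Hypothetical toy weights, not taken from the trained model.
input_vectors = {'警察': [1.0, 0.0], '深夜': [0.0, 1.0]}
output_vectors = {'攻堅': [1.0, 1.0], '早餐': [-1.0, -1.0], '總部': [1.0, 0.0]}

# '未知詞' is not in the toy vocabulary and is ignored.
predictions = predict_center(['警察', '深夜', '未知詞'], input_vectors,
                             output_vectors, topn=3)
print(predictions)
```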

In [None]:
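# 憫惻 and 哀憐 are near-synonyms (both roughly "pity")
word2vec_model.wv.similarity('憫惻', '哀憐')

In [None]:
# 可愛 (cute) vs. 可惡 (hateful)
word2vec_model.wv.similarity('可愛', '可惡')

In [None]:
# 可憎 (detestable) vs. 可愛 (cute)
word2vec_model.wv.similarity('可憎', '可愛')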