# Word Embedding via word2vec
- [NLP study notes](https://hackmd.io/Gw4TgIwEwhaAGYAzOAWKBWYtIYMyxbICGeUATMAKbFRA)
- [Training Chinese word vectors with gensim](http://zake7749.github.io/2016/08/28/word2vec-with-gensim/)
- [TF-IDF keyword extraction](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer)
- [On word embeddings](http://ruder.io/secret-word2vec/)

In [None]:
from gensim.models import word2vec
import time


### Train word2vec model

In [None]:
# Load the pre-segmented training corpus
sentences = word2vec.Text8Corpus('datas/wiki-seg.txt')

In [None]:
# Show the first few tokenized "sentences"
for i, s in enumerate(sentences):
    print('')
    print(s)
    print('=' * 100)
    if i >= 3:
        break

In [None]:
# Train the word2vec model

# !! Warning !!
# The code below takes roughly 30 minutes to run

# Uncomment to retrain
# (gensim < 4.0 API; gensim 4+ renamed `size` to `vector_size`)
# word2vec_model = word2vec.Word2Vec(sentences, size=250, workers=4)
# word2vec_model.total_train_time

In [None]:
# !! Warning !!
# The code below overwrites the previously saved model

# File name matches size=250 above and the model loaded in "Test model"
# word2vec_model.save('models/word2vec_250.model.bin')

### Word segmentation with jieba

In [None]:
import jieba
jieba.set_dictionary('datas/dict/dict.txt.big')
jieba.load_userdict('datas/dict/edu_dict.txt')

In [None]:
import jieba.analyse

In [None]:
sentence = '美國總統川普發表「烈焰怒火」言論，誓言如果北韓不停手，就施以世人前所未見的猛烈攻擊。關島政府急急在翌日發表聲明，呼籲民眾和遊客冷靜、不用恐慌，當地並沒有即時危險。'

In [None]:
jieba.analyse.tfidf(sentence, topK=10)

In [None]:
jieba.analyse.textrank(sentence, topK=10)
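
To make the TF-IDF ranking above more concrete, here is a minimal pure-Python sketch of the idea. It is not jieba's implementation: as far as I understand, jieba reads IDF values from a prebuilt table, whereas this sketch derives them from a tiny made-up corpus. The names `tfidf_keywords`, `toy_corpus`, and `toy_tokens`, and the smoothed IDF formula, are assumptions chosen for illustration.

```python
import math
from collections import Counter

def tfidf_keywords(tokens, corpus, top_k=3):
    """Rank the words in `tokens` by term frequency times a smoothed IDF."""
    n_docs = len(corpus)
    counts = Counter(tokens)
    total = len(tokens)
    scores = {}
    for word, count in counts.items():
        # Document frequency: number of corpus documents containing the word
        df = sum(1 for doc in corpus if word in doc)
        # Smoothed IDF (an assumption; other variants exist)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[word] = (count / total) * idf
    # Highest score first; ties broken alphabetically
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

# Hypothetical, already tokenized toy corpus
toy_corpus = [
    ['the', 'cat', 'sat'],
    ['the', 'dog', 'ran'],
    ['the', 'cat', 'ran'],
]
toy_tokens = ['the', 'cat', 'cat', 'sat']
toy_keywords = tfidf_keywords(toy_tokens, toy_corpus)
print(toy_keywords)
```

Words that are frequent in the document but rare across the corpus get the highest scores, which is why a common word such as `the` ranks last here.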

### Test model

In [None]:
word2vec_model = word2vec.Word2Vec.load('models/word2vec_250.model.bin')

In [None]:
# Number of words in the vocabulary
# (gensim < 4.0 API; gensim 4+ exposes this as len(wv.key_to_index))
len(word2vec_model.wv.vocab)

#### Word similarity

In [None]:
print(word2vec_model.wv.similarity('男朋友', '可愛'))
print(word2vec_model.wv.similarity('女朋友', '可愛'))
print(word2vec_model.wv.similarity('皮卡丘', '可愛'))
print(word2vec_model.wv.similarity('愛因斯坦', '可愛'))
print(word2vec_model.wv.similarity('傅立葉', '可愛'))

In [None]:
word2vec_model.wv.most_similar(positive=['國王', '女'], negative=['男'], topn=10)
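
The analogy query above (king + woman - man) can be sketched without the trained model. The snippet below is a rough approximation of what I believe `most_similar` does: normalize each vector, add the positive ones, subtract the negative ones, and rank the remaining words by cosine similarity. The 2-D vectors in `toy_vectors` are invented for illustration (first axis roughly "royalty", second roughly "gender"), and `unit` and `analogy` are hypothetical helper names.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    used = set(positive) | set(negative)  # query words are excluded from results
    scores = [
        (sum(q * x for q, x in zip(query, unit(vec))), word)
        for word, vec in vectors.items()
        if word not in used
    ]
    scores.sort(reverse=True)
    return [(word, score) for score, word in scores[:topn]]

# Made-up toy vectors, not taken from the trained model
toy_vectors = {
    'king': [1.0, 1.0],
    'queen': [1.0, -1.0],
    'man': [0.0, 1.0],
    'woman': [0.0, -1.0],
    'apple': [-1.0, 0.0],
}
toy_ranking = analogy(toy_vectors, positive=['king', 'woman'], negative=['man'])
print(toy_ranking)
```

With these toy vectors the top-ranked word is `queen`; on a real model the result depends entirely on the training corpus.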

In [None]:
word2vec_model.wv.most_similar('皮卡丘', topn=10)

#### Cosine similarity between two sets of words

In [None]:
questions = [['光電效應'], ['高斯分佈'], ['電'], ['原子彈'], ['可愛'], ['革命']]
options = ('皮卡丘', '愛因斯坦', '小智', '傅立葉', '孫文', '高斯', '女朋友')
for question in questions:
    print('Questions', question)
    scores = []
    for option in options:
        scores.append((word2vec_model.wv.n_similarity(question, [option]), option))
    scores.sort(reverse=True)
    for score, option in scores:
        print('%15.4f (%s)' % (score, option))
    print('=' * 30)
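
As a rough illustration of the set-to-set score used above: to my understanding, `n_similarity` averages the vectors of each word set and then takes the cosine of the two mean vectors. The sketch below does the same on made-up 2-D vectors. `mean_vector`, `cosine`, `set_similarity`, and the `w1`..`w3` tokens are hypothetical names for illustration.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def set_similarity(vectors, words_a, words_b):
    """Cosine similarity between the mean vectors of two word sets."""
    mean_a = mean_vector([vectors[w] for w in words_a])
    mean_b = mean_vector([vectors[w] for w in words_b])
    return cosine(mean_a, mean_b)

# Made-up toy vectors, not taken from the trained model
toy_vectors = {'w1': [1.0, 0.0], 'w2': [0.0, 1.0], 'w3': [1.0, 1.0]}

same_direction = set_similarity(toy_vectors, ['w1', 'w2'], ['w3'])
orthogonal = set_similarity(toy_vectors, ['w1'], ['w2'])
print(same_direction, orthogonal)
```

The mean of `w1` and `w2` points in the same direction as `w3`, so the first score is close to 1, while `w1` and `w2` alone are orthogonal and score 0.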

#### Word Mover's Distance between two documents/sentences
- [A Linear Time Histogram Metric for Improved SIFT Matching](http://www.cs.huji.ac.il/~werman/Papers/ECCV2008.pdf)
- [Fast and Robust Earth Mover’s Distances](http://www.cs.huji.ac.il/~werman/Papers/ICCV2009.pdf)
- [From Word Embeddings To Document Distances](http://proceedings.mlr.press/v37/kusnerb15.html)

In [None]:
sentences = [
    '蘋果從樹上掉下來',
    '從直升機上跳下來',
    '蘋果是長在樹上的',
]
for i in range(len(sentences)):
    sentences[i] = list(jieba.cut(sentences[i]))
for i in range(len(sentences)):
    for j in range(i+1, len(sentences)):
        s1 = sentences[i]
        s2 = sentences[j]
        print('%-35s vs. %-35s => %f' % (s1, s2, word2vec_model.wv.wmdistance(s1, s2)))
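
The exact Word Mover's Distance requires solving an optimal transport problem, which `wmdistance` delegates to a solver. As a lighter illustration, the sketch below computes the relaxed lower bound described in "From Word Embeddings To Document Distances": each word sends all of its weight to the nearest word in the other document, and the larger of the two directional costs is kept. This is not what `wmdistance` returns, only a bound on it. The vectors and the names `one_way_cost` and `relaxed_wmd` are made up for illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def one_way_cost(src, dst, vectors):
    """Each source word moves all of its weight to its nearest target word."""
    weights = Counter(src)
    total = sum(weights.values())
    targets = set(dst)
    return sum(
        (count / total) * min(euclidean(vectors[w], vectors[t]) for t in targets)
        for w, count in weights.items()
    )

def relaxed_wmd(doc_a, doc_b, vectors):
    """Relaxed WMD: the tighter of the two one-directional lower bounds."""
    return max(one_way_cost(doc_a, doc_b, vectors),
               one_way_cost(doc_b, doc_a, vectors))

# Made-up toy vectors, not taken from the trained model
toy_vectors = {
    'apple': [0.0, 0.0],
    'tree': [0.0, 1.0],
    'helicopter': [3.0, 4.0],
    'jump': [3.0, 5.0],
}
toy_doc_a = ['apple', 'tree']
toy_doc_b = ['helicopter', 'jump']
toy_distance = relaxed_wmd(toy_doc_a, toy_doc_b, toy_vectors)
print(toy_distance)
```

A document compared with itself gets a distance of 0, and the distance grows as the words of the two documents move further apart in the embedding space.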

#### Find the word that doesn't match

In [None]:
word2vec_model.wv.doesnt_match(['早餐', '午餐', '美食', '電視', '晚餐'])
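
The odd-one-out query above can be approximated as follows: normalize every word vector, average them, and return the word whose vector is least similar to that mean. I believe this is close to what `doesnt_match` does, but treat it as a sketch. The vectors below are invented, and `unit` and `odd_one_out` are hypothetical helper names.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    # Lowest dot product with the mean direction = worst fit to the group
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Made-up toy vectors, not taken from the trained model
toy_vectors = {
    'breakfast': [1.0, 0.1],
    'lunch': [0.9, 0.2],
    'dinner': [0.95, 0.15],
    'tv': [0.1, 1.0],
}
toy_odd = odd_one_out(['breakfast', 'lunch', 'tv', 'dinner'], toy_vectors)
print(toy_odd)
```

Because three of the four toy vectors point in nearly the same direction, the mean is dominated by the meal words and `tv` is picked as the outlier.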

#### Predict the probability of the center word

In [None]:
word2vec_model.predict_output_word(list(jieba.cut('警察在深夜攻堅敵人總部')))

In [None]:
word2vec_model.wv.similarity('憫惻', '哀憐')

In [None]:
word2vec_model.wv.similarity('可愛', '可惡')

In [None]:
word2vec_model.wv.similarity('可憎', '可愛')