# Word2Vec

Import Python packages

In [26]:
import os
import json
import jieba
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gensim

## References

* [gensim word2vec model](https://radimrehurek.com/gensim/models/word2vec.html)

## Prepare the corpus data

Import corpus dataset

In [105]:
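# '__file__' is quoted on purpose: notebooks do not define __file__, so the path
# resolves against the current working directory and BASE_DIR becomes its parent.
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath('__file__')))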
news_corpus_filepath = os.path.join(BASE_DIR, 'text_summ/datasets/news_corpus.txt')
news_corpus_dataset_filepath = os.path.join(BASE_DIR, 'text_summ/datasets/news_corpus.csv')
news_dataset_filepath = os.path.join(BASE_DIR, 'text_summ/datasets/sqlResult_1558435.csv')

In [106]:
assert os.path.exists(news_corpus_filepath)
assert os.path.exists(news_dataset_filepath)

In [107]:
news_dataset = pd.read_csv(news_dataset_filepath)

In [108]:
news_dataset.head(1)

| | id | author | source | content | feature | title | url |
|---|---|---|---|---|---|---|---|
| 0 | 89617 | | 快科技@http://www.kkj.cn/ | 此外，自本周（6月12日）起，除小米手机6等15款机型外，其余机型已暂停更新发布（含开发版/... | {"type":"科技","site":"cnbeta","commentNum":"37"... | 小米MIUI 9首批机型曝光：共计15款 | http://www.cnbeta.com/articles/tech/623597.htm |


In [109]:
news_corpus_dataset = pd.read_csv(news_corpus_dataset_filepath)

In [110]:
news_corpus_dataset.head()

| | title | content |
|---|---|---|
| 0 | 小米MIUI 9首批机型曝光：共计15款 | 此外，自本周（6月12日）起，除小米手机6等15款机型外，其余机型已暂停更新发布（含开发版/... |
| 1 | 骁龙835在Windows 10上的性能表现有望改善 | 骁龙835作为唯一通过Windows 10桌面平台认证的ARM处理器，高通强调，不会因为只考... |
| 2 | 一加手机5细节曝光：3300mAh、充半小时用1天 | 此前的一加3T搭载的是3400mAh电池，DashCharge快充规格为5V/4A。\n至于... |
| 3 | 葡森林火灾造成至少62人死亡 政府宣布进入紧急状态（组图） | 这是6月18日在葡萄牙中部大佩德罗冈地区拍摄的被森林大火烧毁的汽车。新华社记者张立云摄 |
| 4 | 44岁女子约网友被拒暴雨中裸奔 交警为其披衣相随 | （原标题：44岁女子跑深圳约会网友被拒，暴雨中裸身奔走……）\n@深圳交警微博称：昨日清晨交... |


## Train word vectors

**Gensim examples**

In [50]:
from gensim.models import Word2Vec
from gensim.test.utils import common_texts, get_tmpfile

In [62]:
path = get_tmpfile('word2vec.model')
# gensim < 4.0 API; in gensim >= 4.0 the `size` argument is named `vector_size`.
model = Word2Vec(common_texts, size=64, window=5, min_count=1, workers=4)
model.save(path)

The training is streamed, meaning `sentences` can be any restartable iterable that reads input data from disk on the fly, rather than a list held entirely in memory.
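As a minimal sketch of what streaming input can look like: gensim iterates over the sentences more than once (a vocabulary scan, then each training epoch), so a one-shot generator object would be exhausted after the first pass. The class below, `LineCorpus`, is a hypothetical helper (not part of gensim) that re-opens the file on every pass; the demo file contents are made up for illustration.

```python
import os
import tempfile


class LineCorpus:
    """Restartable iterable that yields one tokenized sentence per line."""

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated
        # more than once without keeping it in memory.
        with open(self.path, encoding=self.encoding) as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens


# Tiny demo file standing in for a large corpus on disk (illustrative data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("human machine interface\n\ngraph minors survey\n")
    demo_path = tmp.name

corpus = LineCorpus(demo_path)
first_pass = list(corpus)
second_pass = list(corpus)  # a plain generator would yield nothing here
os.remove(demo_path)
```

An instance like `corpus` could be passed as the sentences argument of `Word2Vec` in place of `common_texts`.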

In [63]:
model = Word2Vec.load(path)
# Each sentence must be a list of tokens, so the corpus is a list of lists.
model.train([['hello', 'world']], total_examples=1, epochs=1)

(0, 2)

The trained word vectors are stored in a `KeyedVectors` instance at `model.wv`.

In [75]:
# common_texts

In [86]:
# model.wv['computer']

Save and load word vectors

In [95]:
from gensim.models import KeyedVectors

In [77]:
wordvec_path = get_tmpfile('wordvectors.kv')
model.wv.save(wordvec_path)

In [94]:
wordvec_path

'/tmp/wordvectors.kv'

In [90]:
wv = KeyedVectors.load(wordvec_path, mmap='r')
# wv['computer']  # numpy vector of a word

Gensim can load word vectors in the word2vec C format as a `KeyedVectors` instance:

In [98]:
from gensim.test.utils import datapath

In [99]:
wv_from_text = KeyedVectors.load_word2vec_format(datapath('word2vec_pre_kv_c'), binary=False)
wv_from_bin = KeyedVectors.load_word2vec_format(datapath('euclidean_vectors.bin'), binary=True)

It is impossible to continue training vectors loaded from the C format because the hidden weights, vocabulary frequencies, and the binary tree are missing. To continue training, you need the full `Word2Vec` object state, as stored by `save()`, not just the `KeyedVectors`.
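To make the limitation concrete, here is a rough sketch of the text variant of the word2vec C format: a header line with the vocabulary size and dimensionality, then one line per word holding the word and its vector components. Nothing else is stored, which is why training cannot resume from such a file. The helper names `write_word2vec_text` and `read_word2vec_text` and the toy vectors are hypothetical, written only for this illustration; in practice `KeyedVectors.load_word2vec_format` does the parsing.

```python
import io


def write_word2vec_text(vectors, stream):
    """Write {word: [floats]} in the word2vec C text format."""
    dim = len(next(iter(vectors.values())))
    stream.write(f"{len(vectors)} {dim}\n")  # header: vocab size, dimensionality
    for word, vec in vectors.items():
        stream.write(word + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")


def read_word2vec_text(stream):
    """Parse the text format back into {word: [floats]}."""
    vocab_size, dim = map(int, stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        parts = stream.readline().rstrip("\n").split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        assert len(values) == dim, "row length must match the header"
        vectors[word] = values
    return vectors


# Toy vectors, invented for the example.
toy = {"computer": [0.1, -0.2, 0.3], "human": [0.0, 0.5, -0.5]}
buffer = io.StringIO()
write_word2vec_text(toy, buffer)
buffer.seek(0)
restored = read_word2vec_text(buffer)
```

The round trip recovers only words and vectors; word counts, output-layer weights, and the Huffman tree have no place in the file.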

You can perform various word-level NLP tasks with the trained model.
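One such task is finding the nearest neighbours of a word. `KeyedVectors` provides `similarity` and `most_similar` for this, both based on cosine similarity. The sketch below re-implements the idea in plain Python on three invented toy vectors, purely to show the computation; the function names and data are assumptions for illustration, not gensim's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy 3-dimensional vectors, made up for the example.
toy_vectors = {
    "cat": [1.0, 0.9, 0.0],
    "dog": [0.9, 1.0, 0.1],
    "car": [0.0, 0.1, 1.0],
}
neighbours = most_similar("cat", toy_vectors)
```

With a trained model, the equivalent call would be `model.wv.most_similar('computer')`, provided the word is in the vocabulary.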

### Train word2vec

In [113]:
from gensim.models import Word2Vec
from gensim.test.utils import datapath
from gensim.models.word2vec import LineSentence

In [122]:
WORD_VECTOR_FILEPATH = os.path.join(BASE_DIR, 'text_summ/datasets/wordvectors.kv')

In [116]:
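# LineSentence streams the corpus: one sentence per line, tokens separated by whitespace.
# gensim < 4.0 API; in gensim >= 4.0 the `size` argument is named `vector_size`.
word2vec_model = Word2Vec(LineSentence(news_corpus_filepath, max_sentence_length=100),
                          size=128, window=5, min_count=5, workers=8)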
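The notebook does not show how `news_corpus.txt` was produced. `LineSentence` expects one sentence per line with whitespace-separated tokens, so Chinese text has to be segmented first. The sketch below assumes that step and also mimics, as I understand it, how `max_sentence_length` splits long lines into chunks. Both function names are hypothetical, and `tokenize` is a placeholder: for this corpus `jieba.lcut` would be a natural choice, while the demo uses `str.split` to stay dependency-free.

```python
def to_corpus_line(text, tokenize):
    """Join the tokens of one document into a single whitespace-separated line."""
    return " ".join(token for token in tokenize(text) if token.strip())


def line_sentences(lines, max_sentence_length=100):
    """Split each line on whitespace and chunk it to at most max_sentence_length tokens."""
    for line in lines:
        tokens = line.split()
        for start in range(0, len(tokens), max_sentence_length):
            yield tokens[start:start + max_sentence_length]


# Placeholder tokenizer; swap in jieba.lcut for real Chinese text.
line = to_corpus_line("小米 MIUI 9", tokenize=str.split)

# A five-token line with a limit of two tokens yields three chunks.
chunks = list(line_sentences(["a b c d e"], max_sentence_length=2))
```

Under this reading, `max_sentence_length=100` means a long news article on one line is fed to the model as consecutive 100-token pieces instead of being truncated.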

In [123]:
word2vec_model.wv.save(WORD_VECTOR_FILEPATH)

In [124]:
word2vec_model.wv.save?