In [5]:
%%HTML
<style>
    /* style for notebook and presentation */
    a { color: green !important; }
    /* style for presentation only */
    .reveal a { color: green !important; }
    .reveal pasfh { color: red; }
</style> 

## What is Word2Vec?

* An algorithm that can learn the meanings of words without much effort from the developer (Rob)
* 一種算法，可以從顯影劑學習單詞的含義不費力

* "Efficient Estimation of Word Representations in Vector Space" (Mikolov et al., 2013)
* 字表示的向量空間有效估計

### Can show similar words - e.g. 衣服
```
鞋子 0.857314109802
大衣 0.82208108902
穿 0.801305949688
穿戴 0.800143063068
裙子 0.795694112778
外套 0.79101806879
衣物 0.786619484425
```
English: shoes, overcoat, wear, put on, skirt, jacket, clothing

#### Why are clothes and shoes similar?  為什麼衣服和鞋相似？

* Look at the nearby words:
 * He wears **clothes** to work.  In the morning he puts on **clothes**
 * He wears **shoes** to work.  In the morning he puts on **shoes**
* 看看這些句子：
 * 他穿衣服上班。早上，他穿上衣服
 * 他穿鞋上班。早上，他穿上鞋
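
The intuition can be sketched with a toy context count over the example sentences. This only illustrates shared contexts; it is not Word2Vec's training procedure. The function name `context_words` and the window size of 2 are assumptions made for this sketch.

```python
from collections import Counter

sentences = [
    "he wears clothes to work".split(),
    "in the morning he puts on clothes".split(),
    "he wears shoes to work".split(),
    "in the morning he puts on shoes".split(),
]

def context_words(target, sentences, window=2):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for tokens in sentences:
        for pos, token in enumerate(tokens):
            if token != target:
                continue
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            for j in range(lo, hi):
                if j != pos:
                    counts[tokens[j]] += 1
    return counts

clothes_ctx = context_words("clothes", sentences)
shoes_ctx = context_words("shoes", sentences)

# In this tiny corpus both words occur in exactly the same contexts
print(clothes_ctx == shoes_ctx)
print(sorted(clothes_ctx))
```

Words that keep the same company end up with similar vectors, which is why 鞋子 ranks so close to 衣服.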

## Why Word2Vec?  Why not something else?

* In short: Word2Vec is better at this than previous algorithms.  More details: [Don't count, predict!](http://www.marekrei.com/blog/dont-count-predict/)

* Gensim implementation is easy to use.

* Word2Vec is magic (word2vec是魔術)

## Install

In [None]:
pip install -U gensim

In [2]:
import gensim
print(gensim.models.word2vec.FAST_VERSION)

0


In [None]:
# Want FAST_VERSION >= 0.  -1 means your code will run much more slowly

## Download data

All of Chinese Wikipedia:

`wget https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2`

```
$ du -csh zhwiki-latest-pages-articles.xml.bz2 
1.2G	zhwiki-latest-pages-articles.xml.bz2
```

## Next: Convert .xml.bz2 to plain text

In [None]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# process_wiki.py
# Source: http://textminingonline.com/tag/word2vec-python
# 中文: http://www.52nlp.cn/中英文维基百科语料上的word2vec实验

import logging
import os.path
import sys
 
from gensim.corpora import WikiCorpus
 
if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
 
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
 
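    # check and process input arguments
    if len(sys.argv) < 3:
        # the script has no module docstring, so print an explicit usage message
        print("Usage: python %s <input.xml.bz2> <output.txt>" % program)
        sys.exit(1)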
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0
 
    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")
 
    output.close()
    logger.info("Finished: saved " + str(i) + " articles")

## Save and run it

* `python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text`

## Extracted data

In [8]:
%%bash
head -c300 wiki.zh.text 

歐幾里得 西元前三世紀的希臘數學家 現在被認為是幾何之父 此畫為拉斐爾的作品 雅典學院 数学 mathematics 是利用符号语言研究數量 结构 变化以及空间等概念的一門学科 从某种角度看屬於形式科學的一種 數學透過抽象化和邏�

## Segment words with Jieba

In [10]:
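import time
import jieba


jieba.enable_parallel(4)              # 4 worker processes (not supported on Windows)
jieba.set_dictionary('dict.txt.big')  # bigger dictionary, better for Traditional Chinese

path = 'wiki.zh.text'
t1 = time.time()

# One article per line; write the tokens back separated by spaces
with open(path + '.seg', 'wb') as w:
    with open(path, 'rb') as f:
        for line in f:
            words = " ".join(jieba.cut(line))
            w.write(words.encode('utf-8'))

t2 = time.time()
tm_cost = t2 - t1
print("Segmentation took %.1f seconds" % tm_cost)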

## Check the output

In [11]:
%%bash
head -c 500 wiki.zh.text.seg

歐幾里 得   西元前 三世 紀的 希臘 數學家   現在 被 認為 是 幾何 之父   此畫 為拉斐爾 的 作品   雅典 學院   数学   mathematics   是 利用 符号语言 研究 數量   结构   变化 以及 空间 等 概念 的 一門 学科   从 某种 角度看 屬 於 形式 科學 的 一種   數學 透過 抽象化 和 邏輯 推理 的 使用   由計數   計算   數學家們 拓展 這些 概念   對 數學 基本概念 的 完善   早 在 古埃及   而 �

* Good, more spaces!  It is okay if there are some mistakes.  We can still get useful results

* A different word segmenter might give better results.  See [cws_evaluation](https://github.com/ysc/cws_evaluation)
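
To see why a segmenter makes mistakes on unfamiliar words, here is a toy forward-maximum-matching segmenter. It is a much simpler technique than the one Jieba uses, and the function name and the tiny dictionary are hypothetical; treat it as an illustration only.

```python
# -*- coding: utf-8 -*-

def forward_max_match(text, dictionary, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # fall back to a single character when nothing in the dictionary matches
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

toy_dictionary = {u"數學家", u"數學", u"希臘", u"幾何", u"之父"}

known = forward_max_match(u"希臘數學家", toy_dictionary)
unknown = forward_max_match(u"歐幾里得", toy_dictionary)
print(" ".join(known))
print(" ".join(unknown))
```

The name 歐幾里得 (Euclid) is not in the toy dictionary, so it falls apart into single characters. Real segmenters handle unknown words better, but errors of this kind still occur.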

## Training - train_word2vec_model.py

In [None]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# train_word2vec_model.py
 
import logging
import os.path
import sys
import multiprocessing
 
from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
 
if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
 
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

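    # check and process input arguments
    if len(sys.argv) < 3:
        # the script has no module docstring, so print an explicit usage message
        print("Usage: python %s <segmented_corpus> <output_model>" % program)
        sys.exit(1)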
    inp, outp = sys.argv[1:3]
    ## Each line in inp must contain < 10,000 tokens
    model = Word2Vec(LineSentence(inp), size=150, window=7, min_count=20,
            workers=multiprocessing.cpu_count(), iter=5)
 
    # trim unneeded model memory to use (much) less RAM
    model.init_sims(replace=True)
    model.save(outp)

## Parameters
* Word2Vec(..., size, window, min_count, workers, iter)

* size: 150 -- Usually: 100-400 -- Number of dimensions
 * More is better, but takes more time to train

* window: 7 -- Usually: 5-10 -- Number of words of context

* min_count: 20 -- Usually: 5-20 --  The number of times a word must appear in your data to be included

* workers: Number of CPUs you have.
 * The more, the better.  Gensim's Word2Vec only works on one machine

* iter: 5 -- Normal: 10-20 -- Default is 1.
 * Important for improving your results
 * A higher number will increase run time
 * Higher number will increase model quality, up to a point
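
A rough sketch of what `window` and `min_count` control, in plain Python. This is a simplification and not gensim's code: the function name `training_pairs` is hypothetical, and the sketch assumes rare words are dropped before the window is applied.

```python
from collections import Counter

def training_pairs(sentences, window=2, min_count=2):
    """Yield (center, context) pairs after dropping words rarer than min_count."""
    freq = Counter(word for sent in sentences for word in sent)
    for sent in sentences:
        kept = [w for w in sent if freq[w] >= min_count]
        for pos, center in enumerate(kept):
            lo = max(0, pos - window)
            hi = min(len(kept), pos + window + 1)
            for j in range(lo, hi):
                if j != pos:
                    yield center, kept[j]

corpus = [
    "he wears clothes to work".split(),
    "he wears shoes to work".split(),
]
pairs = list(training_pairs(corpus, window=1, min_count=2))
print(len(pairs))
print(pairs[:3])
```

With `min_count=2`, "clothes" and "shoes" (one occurrence each) disappear from the training data entirely. A larger `window` produces more pairs per word, which is one reason training takes longer.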

## Training
* `python train_word2vec_model.py wiki.zh.text.seg wiki.zh.text.model`

* Time to run: ~2 hours
 * 4-core 1.6GHz i5 macbook

## Results

In [15]:
ls -1 wiki.zh.text.model*

wiki.zh.text.model
wiki.zh.text.model.syn0.npy
wiki.zh.text.model.syn1.npy


In [9]:
from gensim.models.word2vec import Word2Vec
m = Word2Vec.load('wiki.zh.text.model')

print(m.most_similar(u'衣服'))

[(u'\u978b\u5b50', 0.8573141098022461), (u'\u5927\u8863', 0.8220810890197754), (u'\u7a7f', 0.8013059496879578), (u'\u7a7f\u6234', 0.8001430630683899), (u'\u88d9\u5b50', 0.79569411277771), (u'\u5916\u5957', 0.7910180687904358), (u'\u8863\u7269', 0.7866194844245911), (u'\u5916\u8863', 0.7854279279708862), (u'\u7a7f\u7740', 0.7846043109893799), (u'\u6234\u4e0a', 0.7763027548789978)]


In [14]:
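def sim(m, p, n=(), i=10):
    # Print the top-i words most similar to the positive words p
    # and least similar to the negative words n
    res = m.most_similar_cosmul(positive=p, negative=list(n), topn=i)
    for word, score in res:
        print('%s %s' % (word, score))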

sim(m,u'衣服')

鞋子 0.928656160831
大衣 0.911039590836
穿 0.900652050972
穿戴 0.900070667267
裙子 0.897846162319
外套 0.895508170128
衣物 0.893308877945
外衣 0.892713069916
穿着 0.892301261425
戴上 0.888150513172
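
The `most_similar_cosmul` scores are higher than the `most_similar` scores for the same words. A likely explanation, based on my reading of gensim's implementation (treat the details as an assumption), is that the multiplicative method shifts each cosine similarity from [-1, 1] into [0, 1], multiplies the positive terms, and divides by the negative terms plus a small epsilon. The helper name `cosmul_score` is hypothetical.

```python
def cosmul_score(pos_cosines, neg_cosines=(), eps=1e-6):
    """Combine cosine similarities multiplicatively (3CosMul-style)."""
    numerator = 1.0
    for c in pos_cosines:
        numerator *= (1.0 + c) / 2.0    # shift cosine into [0, 1]
    denominator = 1.0
    for c in neg_cosines:
        denominator *= (1.0 + c) / 2.0
    return numerator / (denominator + eps)

# Cosine similarity between 衣服 and 鞋子 as reported by most_similar
score = cosmul_score([0.8573141098022461])
print(round(score, 4))
```

The result agrees with the 0.9286... printed for 鞋子. Because negative terms sit in the denominator, an analogy query can produce a score above 1.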


## What else can the results show?
* Paris and France have the same **relationship** as Dublin and Ireland
 * X is the capital of Y

* France - Paris = Ireland - Dublin

* We can frame this as a question to Word2Vec:
  * What is the answer to France - Paris + Dublin?

In [21]:
# France - Paris + Dublin = Ireland
sim(m,[u'法國',u'都柏林'],[u'巴黎'],i=1)

愛爾蘭 1.03317070007


* Think of each word as a variable, for example:
  * France = 10; Paris = 50; Dublin = 200

* 10 - 50 + 200 = 160
 * Word2Vec looks for the word whose value is nearest 160 (Ireland)

* Actually:
 * France = [0.8319, .9318, .0831, .1485, ... ]
 * Each word has a list of numbers called a **word vector** or **word embedding**
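
The same arithmetic on vectors can be sketched with made-up 2-D embeddings. The vectors below are hypothetical and chosen by hand. The sketch uses the simple additive offset method with cosine similarity, whereas the `sim` helper in this notebook calls the multiplicative `most_similar_cosmul`.

```python
import math

# Hypothetical 2-D vectors: axis 0 ~ "is a country", axis 1 ~ "Irish vs. French"
vectors = {
    "France":  [1.0, -1.0],
    "Paris":   [0.0, -1.0],
    "Ireland": [1.0,  1.0],
    "Dublin":  [0.0,  1.0],
    "London":  [0.0,  0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding query words."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for w in positive:
        target = [t + v for t, v in zip(target, vectors[w])]
    for w in negative:
        target = [t - v for t, v in zip(target, vectors[w])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# France - Paris + Dublin
answer = analogy(["France", "Dublin"], ["Paris"], vectors)
print(answer)
```

Here France - Paris + Dublin lands exactly on the toy vector for Ireland. With real embeddings the match is only approximate, so the nearest remaining word is returned.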

## Some more examples

In [24]:
# UK + Taipei - London = Taiwan
sim(m,[u'英國',u'台北'],[u'倫敦'],i=1)

台灣 0.947201609612


In [26]:
# Unreasonable - reasonable + convenient = inconvenient
sim(m,[u'不合理',u'方便'],[u'合理'],i=1)

不便 0.884090006351


In [29]:
## The classic example didn't work:  king - man + woman = queen
## (王 is also a common surname, which may explain the personal names in the results)

sim(m,[u'王',u'女人'],[u'男人'])

倪齐民 0.880974709988
玫 0.869373738766
苗可丽 0.868878781796
琄 0.864685177803
王仲篪 0.861672520638
王琦 0.86028701067
李 0.859799146652
祺 0.859715044498
宏恩 0.858121335506
劉金環 0.857797265053


In [30]:
# Also doesn't work:  Emperor + Woman - Man
sim(m,[u'皇帝',u'女人'],[u'男人'])

登基 0.893643379211
封号 0.890938282013
忽必烈 0.889511108398
尊号 0.888372242451
封其为 0.88481169939
雍正帝 0.883169054985
册封 0.880160093307
康熙皇帝 0.879712939262
诸王 0.878517746925
觐见 0.877376556396


In [50]:
## Chinese Wikipedia apparently contains some English

sim(m,['father','woman'],['man'])

mother 1.06075584888
daughter 1.03906154633
daughters 1.00900793076
faith 1.00623333454
our 1.00440311432
child 1.00097632408
almost 1.00089430809
witness 1.00083208084
himself 0.997282207012
wife 0.992625772953


In [15]:
## One more example.  Might be improved with more data,
##                    or adjusting training parameters.

sim(m,[u'爸爸',u'女人'],[u'男人'])

一家人 0.938520371914
奶奶 0.908418834209
老婆 0.904512822628
老爸 0.901482105255
老公 0.900539815426
丈母娘 0.897194266319
幸福 0.887105762959
娘家 0.886565566063
千金 0.883036851883
老伴 0.882726192474


## So what?  Why do analogies matter?
* Unit testing: Test the quality of this part of a software pipeline.

* Extensions: We can use these results to do other tasks.

* Little work required: All we need is lots of raw text and CPU power to learn word meanings in any language.
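
As a sketch of the unit-testing idea, a list of analogies with known answers can be scored like a test suite. The names `analogy_accuracy` and `fake_predict` and the canned answers are hypothetical. With a trained model, the predictor could wrap a call such as `m.most_similar_cosmul(positive=[b, c], negative=[a], topn=1)`.

```python
def analogy_accuracy(predict, questions):
    """Return the fraction of (a, b, c, expected) analogies answered correctly."""
    if not questions:
        return 0.0
    correct = sum(1 for a, b, c, expected in questions
                  if predict(a, b, c) == expected)
    return correct / float(len(questions))

# Hypothetical stand-in for a trained model: "a is to b as c is to ?"
canned_answers = {
    ("Paris", "France", "Dublin"): "Ireland",
    ("London", "UK", "Taipei"): "Taiwan",
    ("man", "king", "woman"): "emperor",   # deliberately wrong
}

def fake_predict(a, b, c):
    return canned_answers.get((a, b, c))

questions = [
    ("Paris", "France", "Dublin", "Ireland"),
    ("London", "UK", "Taipei", "Taiwan"),
    ("man", "king", "woman", "queen"),
]
accuracy = analogy_accuracy(fake_predict, questions)
print("analogy accuracy: %.2f" % accuracy)
```

Tracking this number across runs shows whether a change to segmentation or training parameters helped or hurt.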

## Applications
* Search engine

* Sentiment analysis, stock market prediction

* News category classification: {sports, politics, technology}?   What section is this text (title, introduction, main, conclusion)?  [Smartnews.com video](http://www.youtube.com/watch?v=7BgpaZltW8s)

* [Guessing gender of blog author, Dato.com](http://blog.dato.com/practical-text-analysis-using-deep-learning)
 * Was this blog written by a man or a woman?  How old is the author?
 * [Notebook found here](https://dato.com/learn/gallery/notebooks/deep_text_learning.html)

* [Gender studies, (Ben Schmidt)](http://bookworm.benschmidt.org/posts/2015-10-30-rejecting-the-gender-binary.html)
 * Discovers words used to describe men vs women
  * cologne->perfume
  * cool->sweet
 * Results for Chinese might be interesting!

## Other implementations
* Google's, the original: https://code.google.com/p/word2vec/
 * [Google News vectors](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing), 2 GB model, pre-trained, on 100 billion words.  Vocab size is 3 million words
* Gensim: https://radimrehurek.com/gensim/models/word2vec.html
 * The fastest, 4x faster than Google's

* Deeplearning4j: http://deeplearning4j.org/word2vec.html
* Spark MLlib: https://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec
 * Gensim creator [says there are problems with it](https://groups.google.com/forum/#!searchin/gensim/spark$20word2vec/gensim/hAU7syBUqdM/2f_VtVoWAwAJ)

* Misc. projects on GitHub with less documentation

## Related work and extensions
* Word2Vec phrases: https://radimrehurek.com/gensim/models/phrases.html#module-gensim.models.phrases
 * Combines frequently co-occurring tokens like **New_York** and **Chicago_Bulls**
 * Then, run Word2Vec as normal
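
A minimal sketch of the phrase-merging idea, using only raw bigram counts. Gensim's `Phrases` uses a score that also accounts for how frequent each word is on its own, so this is a simplification. The function name `merge_frequent_bigrams` is hypothetical.

```python
from collections import Counter

def merge_frequent_bigrams(sentences, min_count=2):
    """Join adjacent token pairs seen at least `min_count` times with an underscore."""
    bigrams = Counter()
    for sent in sentences:
        bigrams.update(zip(sent, sent[1:]))
    merged_sentences = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and bigrams[(sent[i], sent[i + 1])] >= min_count:
                out.append(sent[i] + "_" + sent[i + 1])
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged_sentences.append(out)
    return merged_sentences

corpus = [
    "i flew to New York".split(),
    "New York is large".split(),
    "she left York".split(),
]
merged = merge_frequent_bigrams(corpus)
print(merged[0])
```

The merged sentences can then be passed to Word2Vec in place of the original token lists, so that **New_York** gets its own vector.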

* [Paragraph Vector](https://cs.stanford.edu/~quocle/paragraph_vector.pdf)
 * Same as Word2Vec, but for sentences, paragraphs or documents
 * Gensim implementation is called [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html)
 * Example notebook on [classifying IMDB movie reviews](https://github.com/linanqiu/word2vec-sentiments/blob/master/word2vec-sentiment.ipynb)

* Sense2vec: http://arxiv.org/abs/1511.06388
 * Combine grammar tags with words to get better word vectors
 * For example: take_VERB could be separate from take_NOUN

### Visualizing via Dimensionality Reduction - t-SNE
<img src='tsne.png'>

### Visualizing via Dimensionality Reduction
[3D Word vectors](https://www.a1k0n.net/spotify/artist-viz/)

## Conclusion
* Word2Vec is a good choice for many applications, and often the first step in a Machine Learning NLP pipeline.
* [Wikipedia example in Chinese](http://www.52nlp.cn/中英文维基百科语料上的word2vec实验)
* [Wikipedia example in English](http://textminingonline.com/training-word2vec-model-on-english-wikipedia-by-gensim)
* [Gensim google group](https://groups.google.com/forum/#!forum/gensim) for Q & A

## What other data sources can you think of that have lots of text?

* Anything you can crawl on the web - PTT, Twitter, job postings, blogs, research publications, Project Gutenberg

## The end.  Thank you!