# Word2Vec Skip-gram

* author: simonjisu

## How do we represent the meaning of a word?

Definition: **meaning** (Webster's dictionary)

* the idea that is represented by a word, phrase, etc.
* the idea that a person wants to express by using words, signs, etc.
* the idea that is expressed in a work of writing, art, etc.

Commonest linguistic way of thinking of meaning:
* signifier $\Longleftrightarrow$ signified (idea or thing) = denotation (the explicit meaning)
    - signifier (French: signifiant): if language consists of sounds and the meanings those sounds express, the signifier is the sound
    - signified (French: signifié): under the same view, the signified is the meaning

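Definition of **의미** ("meaning") in a Korean dictionary

1. The sense of spoken or written words.
    * the dictionary meaning of a word
    * the meaning of a sentence
    * The two words are used with the same meaning.
2. The significance of an action or phenomenon.
    * the meaning of life
    * historical meaning
    * a meaningful smile
3. The value of a thing or phenomenon.
    * to live a meaningful life
    * to spend one's leisure time meaningfully
    * a meaningless action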
    


## How do we have usable meaning in a computer?

Common answer: use a taxonomy like WordNet, which has hypernym (is-a) relationships and synonym sets.

English: WordNet

```python
# Requires the WordNet corpus, e.g. via nltk.download('wordnet')
from nltk.corpus import wordnet as wn

# Look up the first noun sense of "panda".
panda = wn.synset('panda.n.01')
# closure() follows a relation transitively, here the hypernym (is-a) links.
hyper = lambda s: s.hypernyms()
list(panda.closure(hyper))
```

Output:

```
[Synset('procyonid.n.01'),
 Synset('carnivore.n.01'),
 Synset('placental.n.01'),
 Synset('mammal.n.01'),
 Synset('vertebrate.n.01'),
 Synset('chordate.n.01'),
 Synset('animal.n.01'),
 Synset('organism.n.01'),
 Synset('living_thing.n.01'),
 Synset('whole.n.02'),
 Synset('object.n.01'),
 Synset('physical_entity.n.01'),
 Synset('entity.n.01')]
```

## Problems with this discrete representation
* Great as a resource but missing nuance, e.g., **synonyms**: the subtle differences between words cannot be encoded
    * adept, expert, good, practiced, proficient, skillful
    * e.g., "I'm good at deep learning" vs. "I'm expert at deep learning"
* Missing new words (impossible to keep up to date): updating it daily would cost too much
* Subjective: judgments differ from person to person
* Requires human labor to create and adapt
* Hard to compute accurate word similarity

The vast majority of rule-based and statistical NLP work regards words as atomic symbols: **one-hot representation**
$$word = [0, 0, 0, 1, 0, 0, 0]$$

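This is bad because:
* Dimensionality: the vector is as long as the vocabulary, so it becomes very long when there are many words.
* Localist representation: a one-hot vector carries no inherent notion of meaning and every word is independent of every other, so similarity cannot be computed. Words that look similar to a human are simply different symbols to the machine.
    * When someone searches for "Seattle motel", only documents containing exactly "Seattle motel" match, even though "Seattle hotel" would also be relevant.
    * The one-hot vectors of any two different words are **orthogonal**: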
$$\begin{aligned} 
motel &= \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \end{bmatrix} \\
hotel &= \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \end{bmatrix} \\
\end{aligned}$$
$$hotel \cdot motel^T = 0$$
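The orthogonality above can be checked with a minimal NumPy sketch. The 10-word vocabulary and the helper `one_hot` are assumptions made for illustration; the indices 5 and 8 simply mirror the positions of the 1s in the two vectors above.

```python
import numpy as np

# Hypothetical 10-word vocabulary; indices mirror the vectors in the text.
vocab_size = 10
motel_index, hotel_index = 5, 8

def one_hot(index, size):
    """Return a one-hot vector with a 1 at `index` and 0 elsewhere."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

motel = one_hot(motel_index, vocab_size)
hotel = one_hot(hotel_index, vocab_size)

# Distinct one-hot vectors never share a non-zero position,
# so their dot product is always 0, however related the words are.
similarity = float(hotel @ motel)
self_similarity = float(motel @ motel)
print(similarity, self_similarity)
```

The only word a one-hot vector is "similar" to is itself, which is why this representation cannot express that motel and hotel are related.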

## Distributional similarity based representations

You can get a lot of value by representing a word by means of its neighbors.

* When a word $w$ appears in a text, its **context** is the set of words that appear nearby (within a fixed-size window).
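A fixed-size context window can be sketched in a few lines of Python. The function name `context_windows`, the example sentence, and the window size of 2 are all assumptions chosen for illustration.

```python
def context_windows(tokens, window_size):
    """Yield (center, context_words) for every position in `tokens`."""
    for t, center in enumerate(tokens):
        left = tokens[max(0, t - window_size):t]       # up to window_size words before
        right = tokens[t + 1:t + 1 + window_size]      # up to window_size words after
        yield center, left + right

tokens = "the quick brown fox jumps".split()
windows = list(context_windows(tokens, window_size=2))
for center, context in windows:
    print(center, context)
```

Words near the start or end of the text simply get a shorter context, which is one common way to handle the boundaries.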

## Word Vectors
Build a dense vector (one with few zero entries) for each word type, chosen so that it is good at predicting other words appearing in its context.

Note that such a word vector does not represent the concept behind the word; it only captures its distributional meaning.

Idea:
* We have a large corpus of text
* Every word in a fixed vocabulary is represented by a vector
* Go through each **position** $t$ in the text, which has a **center word** $c$ and **context ("outside") words** $o$
* Use the similarity of the word vectors for $c$ and $o$ to calculate the probability of $o$ given $c$ (or vice versa)
* Keep adjusting the word vectors to maximize this probability

Summary: over all words in the corpus, maximize the probability of the context words $o$ given the center word $c$.

Embedding Example:

$$\begin{bmatrix} 1&4&7&10 \\ 2&5&8&11 \\ 3&6&9&12 \end{bmatrix} \begin{bmatrix} 1\\0\\0\\0 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$$

Multiplying the embedding matrix by a one-hot vector selects the corresponding column, here the first one.
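The same lookup can be sketched in NumPy. The matrix is the one from the example; the variable names are illustrative.

```python
import numpy as np

# The 3x4 matrix from the example: each column is the 3-dimensional vector of one word.
embedding = np.array([[1, 4, 7, 10],
                      [2, 5, 8, 11],
                      [3, 6, 9, 12]])

one_hot = np.array([1, 0, 0, 0])   # one-hot vector for the first word

by_matmul = embedding @ one_hot    # full matrix-vector product
by_index = embedding[:, 0]         # direct column lookup, no multiplication

print(by_matmul, by_index)
```

Both give the same vector, so in practice an embedding layer can be implemented as an index lookup rather than a multiplication by a one-hot vector.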


## Word2Vec

[Word2vec](https://arxiv.org/pdf/1310.4546.pdf) (Mikolov et al. 2013)

The model defines a single probability distribution over context words for each center word, shared by every position in the window.

Two Algorithms
1. Skip-grams(SG)
    * Predict context words given target (position independent)
2. Continuous Bag of Words (CBOW)
    * Predict target word from bag-of-words context

Two (moderately efficient) training methods (vs Naive Softmax)
1. Hierarchical softmax
2. Negative sampling


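## Objective function
For each position $t = 1, \dots, T$ in the text, predict the context words within a window of size $m$ around the center word $w_t$. The likelihood to maximize is:

$$\max_\theta J'(\theta) = \prod_{t=1}^{T} \prod_{-m \leq j \leq m,\ j \neq 0} P(w_{t+j} | w_t; \theta)$$

Taking the negative log and averaging over positions gives the negative log-likelihood, which we minimize: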

$$\begin{aligned} 
\min J(\theta) &= -\dfrac{1}{T} \sum_{t=1}^T \sum_{-m \leq j \leq m,\ j \neq 0} \log P(w_{t+j} | w_t) \\
P(o|c) &= \dfrac{\exp(u_o^T V_c)}{\sum_{w=1}^V \exp(u_w^T V_c)}
\end{aligned}$$

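Here $u_o$ is the context ("outside") vector of word $o$, $V_c$ is the center vector of word $c$, and the $V$ in the sum is the vocabulary size.

* Why switch to the negative log-likelihood (NLL)? The log turns the product into a sum, which is easier to differentiate and numerically more stable, and minimizing the negative is equivalent to maximizing the likelihood. Further reading:
    * https://ratsgo.github.io/statistics/2017/09/22/information/
    * https://ratsgo.github.io/statistics/2017/09/23/MLE/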
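As a rough sketch of the objective, the following computes the NLL with the naive softmax on a toy corpus. The corpus, the sizes, the random initialization, and the names `center_vecs`, `context_vecs`, `log_prob` and `negative_log_likelihood` are all assumptions for illustration. Window positions that fall outside the text are skipped, which the formula leaves unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)

corpus = "we like to learn word vectors and we like to learn".split()
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}
ids = [word_to_id[w] for w in corpus]

vocab_size, dim, window = len(vocab), 5, 2                      # hypothetical sizes
center_vecs = rng.normal(scale=0.1, size=(vocab_size, dim))     # row c is V_c
context_vecs = rng.normal(scale=0.1, size=(vocab_size, dim))    # row o is u_o

def log_prob(o, c):
    """log P(o | c) with the naive softmax over the whole vocabulary."""
    scores = context_vecs @ center_vecs[c]     # u_w^T V_c for every word w
    scores = scores - scores.max()             # shift for numerical stability
    return scores[o] - np.log(np.exp(scores).sum())

def negative_log_likelihood(ids, window):
    """Average over positions t of -sum_j log P(w_{t+j} | w_t)."""
    total = 0.0
    for t, c in enumerate(ids):
        for j in range(-window, window + 1):
            if j == 0 or not 0 <= t + j < len(ids):
                continue   # skip the center word and positions outside the text
            total += log_prob(ids[t + j], c)
    return -total / len(ids)

loss = negative_log_likelihood(ids, window)
print(loss)
```

Every call to `log_prob` sums over the whole vocabulary, which is why the naive softmax becomes expensive for large vocabularies and why hierarchical softmax and negative sampling exist.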




## Dot product & Softmax

* Dot product: a measure of similarity; the larger $u^Tv$ is, the more similar the two vectors are

$$u^Tv = u\cdot v = \sum_i u_i v_i$$

* Softmax: maps $n$ arbitrary scores $x_1, \dots, x_n$ to a probability distribution

$$softmax(x_i) = \dfrac{\exp(x_i)}{\sum_{j=1}^{n}\exp(x_j)}$$
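A minimal NumPy sketch of the softmax follows. Subtracting the maximum score before exponentiating is a standard stability trick that is not part of the formula above; it leaves the result unchanged because the constant cancels between numerator and denominator.

```python
import numpy as np

def softmax(x):
    """Softmax over a 1-D array of scores."""
    shifted = x - np.max(x)    # avoids overflow in exp; does not change the result
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([1.0, 2.0, 3.0])   # e.g. dot products u_w^T V_c
probs = softmax(scores)
print(probs, probs.sum())
```

The outputs are positive and sum to 1, and the largest score receives the largest probability.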

## Train: Compute all vector gradients
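* Define the set of all parameters in the model as one long vector $\theta \in \Bbb{R}^{2dV}$, where $d$ is the dimension of each word vector and $V$ is the vocabulary size.
* Why $2dV$? Each word has two $d$-dimensional vectors: one used when it is the center word ($c$) and one used when it is a context word ($o$).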
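The parameter count can be made concrete with a small sketch. The sizes `vocab_size = 7` and `dim = 5` are hypothetical, and packing the center vectors before the context vectors is just one possible layout.

```python
import numpy as np

vocab_size, dim = 7, 5    # hypothetical V and d

center_vecs = np.zeros((vocab_size, dim))     # one center vector per word
context_vecs = np.zeros((vocab_size, dim))    # one context vector per word

# Flatten both tables into the single long parameter vector theta.
theta = np.concatenate([center_vecs.ravel(), context_vecs.ravel()])
print(theta.shape)
```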

![L2_Skipgram](./figs/L2_Skipgram.png)

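## Update?

To update the center vector $V_c$ by gradient descent, differentiate the log-probability of one (center, context) pair with respect to $V_c$: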

$$f = \log \dfrac{\exp(u_o^T V_c)}{\sum_{w=1}^V \exp(u_w^T V_c)}$$

$$\begin{aligned} \dfrac{\partial f}{\partial V_c} 
&= \dfrac{\partial }{\partial V_c} \big(\log(\exp(u_o^T V_c)) - \log(\sum_{w=1}^V \exp(u_w^T V_c))\big) \\
&= u_o - \dfrac{1}{\sum_{w=1}^V \exp(u_w^T V_c)}(\sum_{x=1}^V \exp(u_x^T V_c) u_x ) \\
&= u_o - \sum_{x=1}^V \dfrac{\exp(u_x^T V_c)}{\sum_{w=1}^V \exp(u_w^T V_c)} u_x \\ 
&= u_o - \sum_{x=1}^V P(x | c) u_x
\end{aligned}$$

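* $u_o$: the vector of the observed (actual) context word
* $P(x|c)$: the model's probability of context word $x$ given center word $c$
* $\sum_{x=1}^V P(x|c)u_x$: the expected context vector under the model, i.e. every context vector $u_x$ weighted by its predicted probability. The gradient is therefore "observed minus expected".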
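The formula $u_o - \sum_x P(x|c) u_x$ can be checked numerically against finite differences. This is a sketch under assumed sizes and random vectors; `U`, `v_c`, `f` and `grad_f` are illustrative names, and the learning rate of 0.05 is an arbitrary small value.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 6, 4                      # hypothetical V and d
U = rng.normal(size=(vocab_size, dim))      # row x is the context vector u_x
v_c = rng.normal(size=dim)                  # the center vector V_c
o = 2                                       # index of the observed context word

def f(v):
    """log P(o | c) as a function of the center vector."""
    scores = U @ v
    return scores[o] - np.log(np.exp(scores).sum())

def grad_f(v):
    """Analytic gradient: u_o - sum_x P(x | c) u_x."""
    scores = U @ v
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return U[o] - probs @ U                 # observed minus expected context vector

# Central finite differences as an independent check of the formula.
eps = 1e-6
numeric = np.array([
    (f(v_c + eps * e) - f(v_c - eps * e)) / (2 * eps)
    for e in np.eye(dim)
])
analytic = grad_f(v_c)
print(np.max(np.abs(numeric - analytic)))

# One gradient-ascent step on f (equivalently, a descent step on -f).
learning_rate = 0.05
v_new = v_c + learning_rate * analytic
print(f(v_c), f(v_new))
```

A small step along the gradient raises $\log P(o|c)$, moving $V_c$ toward the observed context vector and away from the model's current expectation.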