# Summaries For NLP & Word Embeddings (C5W2 - 1)
**Last Edit at: 2018-03-07 08:33:55**

## Word Representation  
- Word representation here means a featurized representation: each word is described by a vector of feature values.
- High-dimensional data such as word vectors can be visualized with the t-SNE algorithm.
- The dimension of word vectors is usually much smaller than the vocabulary size, typically between 50 and 400. (Exercises)
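
As a rough illustration of why a featurized representation is useful, the sketch below compares one-hot vectors with small feature vectors. The three-word vocabulary, the two features (`is_food`, `is_royal`) and their values are hypothetical, chosen only for this example and not taken from any trained model.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

vocab = ["apple", "orange", "king"]  # hypothetical toy vocabulary

# One-hot vectors: dimension equals the vocabulary size.
one_hot = {
    w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, w in enumerate(vocab)
}

# Featurized vectors: dimension equals the number of features.
# Hypothetical features: (is_food, is_royal).
features = {
    "apple": [0.95, 0.01],
    "orange": [0.97, 0.02],
    "king": [0.02, 0.93],
}

# Any two different one-hot vectors are orthogonal, so they carry no similarity.
print(cosine_similarity(one_hot["apple"], one_hot["orange"]))
# Feature vectors place "apple" closer to "orange" than to "king".
print(cosine_similarity(features["apple"], features["orange"]))
print(cosine_similarity(features["apple"], features["king"]))
```

In a real model the features are learned and usually not this interpretable; the point is only that similar words end up with similar vectors.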

## Word Embeddings
- When doing transfer learning in this area: 1) learn word embeddings from a large text corpus (1-100 billion words), 2) transfer the embeddings to a new task with a smaller training set, 3) (optional) fine-tune the word embeddings.
- Face encodings and word embeddings differ in size and in the features they capture.
- Word embeddings preserve analogy relationships, e.g. Man:Woman as King:Queen.
- The embedding matrix $E$ is shared across all words within one task. Multiplying it by the one-hot vector of word $j$ selects one column: $E \bullet \vec{O_{j}}=\vec{e_{j}}$, where $\vec{e_{j}}$ is the embedding of $word_{j}$. In practice, use a specialized lookup function (an Embedding layer) instead of the matrix product, for efficiency.  
- Training examples are pairs of a context word (c) and a target word (t).
- When learning word embeddings, we create an artificial task of estimating $P(t \mid c)$. It is okay if we do poorly on this artificial prediction task; the more important by-product of this task is that we learn a useful set of word embeddings. (Exercises)
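
The relation $E \bullet \vec{O_{j}}=\vec{e_{j}}$ can be checked on a tiny example. This is a minimal sketch with a hypothetical 2 x 3 embedding matrix (2 features, 3 words); the helper names `one_hot`, `matvec` and `lookup` are made up for illustration and do not come from any library.

```python
# Hypothetical embedding matrix E: one row per feature, one column per word.
E = [
    [0.1, 0.4, 0.7],
    [0.2, 0.5, 0.8],
]

def one_hot(j, size):
    """One-hot vector O_j of length `size` with a 1 at index j."""
    return [1.0 if i == j else 0.0 for i in range(size)]

def matvec(matrix, vector):
    """Matrix-vector product: costs one multiply per matrix entry."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def lookup(matrix, j):
    """Read column j directly: costs one read per feature."""
    return [row[j] for row in matrix]

j = 1
e_by_product = matvec(E, one_hot(j, 3))
e_by_lookup = lookup(E, j)
print(e_by_product, e_by_lookup)
```

Both paths return the same column, but the product touches every entry of the matrix while the lookup reads only one column, which is why an embedding layer is implemented as a lookup.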

## Word2Vec
- Word2Vec:    
    - To predict a target word $t$ from a context word $c$, the model is $O_{c} \to E \to e_{c} \to Softmax \to \hat{y}$, where the softmax is:  
    
        \begin{equation}\label{eq:1}
            P(t\mid c) = \frac{e^{\theta^{T}_{t}e_{c}}}{\sum_{j=1}^{vocab\ size}e^{\theta^{T}_{j}e_{c}}}
        \end{equation}
        
    - The loss is the cross-entropy over the vocabulary, where $y$ is the one-hot vector of the true target word:  
    
        \begin{equation}\label{eq:2}
            L(\hat{y},y) = - \sum_{i=1}^{vocab\ size}y_{i}\log(\hat{y}_{i})
        \end{equation}
        
- The problem with the softmax classifier in Word2Vec is its heavy computational cost: the denominator sums over the whole vocabulary. Two ways to address this: 1) a hierarchical softmax classifier, which arranges the words in a tree (not necessarily balanced; frequent words may sit near the top) and cuts the cost to $O(\log \left | V \right |)$; 2) negative sampling.
- Negative sampling builds each training group from one positive example (a context word with its true target) plus k randomly chosen negative examples (k = 5-20 for SMALL datasets, 2-5 for large ones).
    - The model becomes a binary classifier:  
    
    \begin{equation}\label{eq:3}
        P(y=1 \mid c,t) = \sigma(\theta_{t}^{T}e_{c})
    \end{equation}
    
    - In each iteration, only k+1 binary classifiers (logistic regression units) are updated, instead of the full softmax over the vocabulary.
    - How to sample the negative words?
        - By empirical frequency.<br/> Problem: over-representation of "the, a, of, ......."
        - Uniformly, with probability $\frac{1}{\left | V \right |}$, where $\left | V \right |$ is the vocabulary size.<br/> Problem: this does not represent the distribution of the words.
        - A compromise between the two, where $f(w_{i})$ is the observed frequency of word $w_{i}$:  
        
            \begin{equation}\label{eq:4}
                P(w_{i}) = \frac{f(w_{i})^{\frac{3}{4}}}{\sum_{j=1}^{vocab\ size} f(w_{j})^{\frac{3}{4}}}
            \end{equation}
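
The negative-sampling pieces above can be sketched in a few lines of Python. The word counts and parameter vectors are hypothetical, and excluding the true target from the negatives is a simplification made for this example; this is not the reference Word2Vec implementation.

```python
import math
import random

def noise_distribution(counts, power=0.75):
    """P(w_i) proportional to f(w_i)^(3/4), normalized over the vocabulary."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}

def sample_negatives(dist, k, exclude, rng):
    """Draw k negative words (with replacement), skipping the words in `exclude`."""
    words = [w for w in dist if w not in exclude]
    weights = [dist[w] for w in words]
    return rng.choices(words, weights=weights, k=k)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_positive(theta_t, e_c):
    """P(y=1 | c, t) = sigmoid(theta_t^T e_c)."""
    return sigmoid(sum(a * b for a, b in zip(theta_t, e_c)))

# Hypothetical word counts from a tiny corpus.
counts = {"the": 1000, "orange": 10, "juice": 5, "king": 5}
dist = noise_distribution(counts)
raw_share_the = counts["the"] / sum(counts.values())

rng = random.Random(0)  # fixed seed so repeated runs draw the same sample
negatives = sample_negatives(dist, k=5, exclude={"juice"}, rng=rng)

print(raw_share_the, dist["the"])  # the 3/4 power lowers the share of "the"
print(negatives)
print(p_positive([0.5, -0.2], [1.0, 1.0]))
```

Each training step then evaluates `p_positive` for the one positive pair and the k sampled negatives, so only k+1 logistic units are involved rather than a sum over the whole vocabulary.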
            
## GloVe            
- GloVe: 
    - Replace t,c with i,j. $X_{ij}$ is the number of times i appears in the context of j. Depending on how the context is defined (for example, a symmetric window), $X_{ij}$ may equal $X_{ji}$.
    - Model:   
    
    \begin{equation}\label{eq:5}
        minimize \sum_{i=1}^{vocab\ size}\sum_{j=1}^{vocab\ size}f(X_{ij})(\theta_{i}^{T}e_{j} + b_{i} + b_{j}'-\log(X_{ij}))^2
    \end{equation}
    
    In equation (5), $f(X_{ij})$ is a weighting term. It is 0 when $X_{ij}=0$, so the undefined $\log 0$ never contributes, and it acts as a trade-off that gives neither too much weight to high-frequency words nor too little to low-frequency ones.
    
- A simple way to implement sentiment classification is to average the embeddings $\vec{e}$ of the words in a sentence, pass the average to a softmax unit, and compute the scores. Averaging ignores word order.
- A better-performing solution is to use an RNN: feed the $\vec{e}$ of each word to the RNN in turn and read the result after the last word has been input (typically a many-to-one architecture).
- Debiasing word embeddings. (Bias here means social bias, not the bias in the bias-variance trade-off.)
    - Problem description: word embeddings can reflect gender, ethnicity, age, sexual orientation and other biases of the text used to train the model, e.g. Man:Woman as King:Queen is OK while Father:Doctor as Mother:Homemaker is not.
    - To address this kind of bias:
       - Identify the bias direction.
       - Neutralize: for every word that is not definitional, project it to remove its component along the bias direction.
       - Equalize pairs: adjust definitional pairs so that both words are equally distant from the neutralized words.
       - To do the three steps above, we can use neural networks.
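
The neutralize step can be sketched as an orthogonal projection: subtract from the embedding its component along the bias direction. The vectors below are hypothetical 3-dimensional toy values, and the bias direction `g` is assumed to be already known from the first step.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def neutralize(e, g):
    """Remove from e its component along the bias direction g.

    e_debiased = e - (e . g / g . g) * g
    """
    scale = dot(e, g) / dot(g, g)
    return [ei - scale * gi for ei, gi in zip(e, g)]

g = [1.0, 1.0, 0.0]          # hypothetical bias direction
e_doctor = [0.3, 0.5, -0.2]  # hypothetical embedding of a non-definitional word

e_debiased = neutralize(e_doctor, g)
print(e_debiased)
print(dot(e_debiased, g))  # approximately zero: no component left along g
```

After this projection the word has no component along `g`; the part of the vector orthogonal to the bias direction is left unchanged.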


-----  

<a rel="license" style="text-decoration:none" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">
    <div style="margin-top:0.5em;margin-bottom:1em;">
        <img alt="Creative Commons" style="display:inline;"  src="https://mirrors.creativecommons.org/presskit/icons/cc.svg"/>
        <img alt="Attribution" style="display:inline;margin-top:0;"  src="https://mirrors.creativecommons.org/presskit/icons/by.svg"/>
        <img alt="Non-Commercial" style="display:inline;margin-top:0;"  src="https://mirrors.creativecommons.org/presskit/icons/nc.svg"/>
        <img alt="Non-Commercial" style="display:inline;margin-top:0;"  src="https://mirrors.creativecommons.org/presskit/icons/nc-jp.svg"/>
        <img alt="Share Alike" style="display:inline;margin-top:0;"  src="https://mirrors.creativecommons.org/presskit/icons/sa.svg"/>
     </div>
</a>
<br />
This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

CREATED BY [ArvinSiChuan](mailto:arvinsc@foxmail.com?subject=Summaries\%20for\%20NLP\%20Word Embeddings(C5W2)), 04-Mar-2018.  
Updated at 08-Mar-2018, VERSION SNAPSHOT-1.0.0
