collaborative filtering
text clustering
NER

# NLP and Language Modelling

## Language Modeling
Task Definition: Predict next word

### N-gram model
   - Prototype: P(L = 3, e1 = I, e2 = am, e3 = WS) = P(e1 = I) \* P(e2 = am | e1 = I) \* P(e3 = WS | e1 = I, e2 = am) \* P(e4 = EoS |e1 = I, e2 = am, e3 = WS) 
       - NOTE: from beginning of sentence
   - Advantage of N-gram: probabilities are estimated from counts of short word sequences (n-grams) observed in the corpus, rather than from counts of whole sentences
   - For example: 2-gram (bigram). Calculate probs of "I am", "am WS", "WS EoS" instead of "I am WS EoS"
   - Main problem: **Sparsity**. Some n-grams may never appear in the training set, so the joint probability of any sentence containing them is zero (the same problem as the Prototype)
   - Fix by **smoothing**/**interpolation**: $P(e_t|e_{t-1}) = (1-\alpha)P_{ML}(e_t|e_{t-1}) + \alpha P_{ML}(e_t)$
       - A combination of unigram and bigram to ensure P>0
       - Variation: more grams, context-dependent alpha, etc
   - Fix unknown words by adding a "**unk**" word
       - Replace singletons (words appearing only once in the training corpus) with "unk"
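
The interpolation formula above can be sketched in a few lines of Python. This is a minimal illustration, not a full n-gram toolkit: the toy corpus, the `<s>`/`</s>` boundary tokens, the function names, and `alpha=0.1` are all assumptions made for the example.

```python
from collections import Counter

BOS, EOS = "<s>", "</s>"  # assumed sentence-boundary tokens

def train_bigram(sentences):
    """Count unigrams, bigrams, and context words from tokenized sentences."""
    unigrams, bigrams, contexts = Counter(), Counter(), Counter()
    for sent in sentences:
        tokens = [BOS] + sent + [EOS]
        unigrams.update(tokens[1:])              # words that can be predicted
        contexts.update(tokens[:-1])             # words that can act as e_{t-1}
        bigrams.update(zip(tokens, tokens[1:]))  # (e_{t-1}, e_t) pairs
    return unigrams, bigrams, contexts

def interp_prob(word, prev, unigrams, bigrams, contexts, alpha=0.1):
    """(1 - alpha) * P_ML(word | prev) + alpha * P_ML(word)."""
    p_uni = unigrams[word] / sum(unigrams.values())
    p_bi = bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0
    return (1 - alpha) * p_bi + alpha * p_uni

corpus = [["i", "am", "ws"], ["i", "am", "here"]]  # toy data
uni, bi, ctx = train_bigram(corpus)
p_seen = interp_prob("am", "i", uni, bi, ctx)      # bigram "i am" was observed
p_unseen = interp_prob("here", "i", uni, bi, ctx)  # bigram "i here" was not
```

Because the unigram term is always positive for a word seen in training, `p_unseen` stays above zero even though the bigram count is zero.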


### NN Model with Fixed Window Size      
   - No Sparsity problem (using input embedding M, so possible to treat similar words similarly during prediction)
   - Model size reduced (instead of learning all probs {a, b, c} X {A, B, C}, the neural network learns weights to represent the quadratic combination)
   - Ability to skip a previous word
   - BUT: weights are not shared across input positions, and it is unclear how to choose the window size
    
    
<img src="http://kiyukuta.github.io/_images/nnlm_bengio.png" width="500">

### RNN Model
   - Any sequence length will work
   - Weights shared, model size doesn't increase
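   - BUT: computation is slow (each step depends on the previous hidden state, so the steps cannot be computed in parallel), and in practice it is hard to access information from many steps back
   - Other applications of RNN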
        * One-to-one: tagging each word
        * many-to-one: sentiment analysis
        * Encoder module: example: element-wise max of all hidden states -> as input for further NN model
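
The weight sharing and the element-wise-max encoder mentioned above can be sketched as follows. This is a plain (Elman-style) RNN written with the standard library only; the weight names `W_h`, `W_x`, `b` and all numeric values are assumptions chosen for illustration, not trained parameters.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(h_prev, x, W_h, W_x, b):
    """h_t = tanh(W_h h_{t-1} + W_x x_t + b); the same weights are used at every step."""
    pre = [a + c + d for a, c, d in zip(matvec(W_h, h_prev), matvec(W_x, x), b)]
    return [math.tanh(p) for p in pre]

def encode(inputs, W_h, W_x, b):
    """Run the RNN over a sequence of any length and max-pool the hidden states."""
    h = [0.0] * len(b)
    states = []
    for x in inputs:
        h = rnn_step(h, x, W_h, W_x, b)
        states.append(h)
    pooled = [max(column) for column in zip(*states)]  # element-wise max
    return states, pooled

# Toy 2-dimensional example (assumed values).
W_h = [[0.5, 0.0], [0.0, 0.5]]
W_x = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
inputs = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
states, pooled = encode(inputs, W_h, W_x, b)
```

The loop in `encode` is also where the sequential cost comes from: step *t* cannot start before step *t-1* has produced its hidden state.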
    
<img src="https://cdn-images-1.medium.com/max/1600/1*q1wyldq3Nm5pT266eXdfzA.png" width="400">


### Evaluation of LM
- Given 1) Test dataset, and 2) trained language model P with parameter $\theta$
- Log likelihood $log(E_{test};\theta) = \sum_{E\in E_{test}}{log[P(E;\theta)]}$
- Perplexity: $ ppl(E_{test};\theta) = exp(-log(E_{test};\theta) / len(E_{test}))$
    - ppl = 1: perfect
    - ppl = Vocabulary size: random model
    - ppl = +inf: worst model
    - ppl = some value $v$: need to pick $v$ values on average to get the correct one
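
As a quick check of the perplexity formula, here is a minimal sketch. It assumes the model has already produced a log-probability per test sentence; the function name, the vocabulary size, and the word count are made up for the example.

```python
import math

def perplexity(sentence_log_probs, num_words):
    """exp(-(sum of sentence log-likelihoods) / number of words in the test set)."""
    total_log_likelihood = sum(sentence_log_probs)
    return math.exp(-total_log_likelihood / num_words)

vocab_size = 1000  # assumed vocabulary size
num_words = 50     # assumed length of the test set in words

# A uniform ("random") model assigns 1 / vocab_size to every word.
uniform_log_probs = [num_words * math.log(1.0 / vocab_size)]
ppl_uniform = perplexity(uniform_log_probs, num_words)

# A perfect model assigns probability 1 to every test word.
ppl_perfect = perplexity([0.0], num_words)
```

Up to floating-point error, `ppl_uniform` equals the vocabulary size and `ppl_perfect` equals 1, matching the two reference points in the list above.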

    
### Alternative tasks in NLP - *Conditioned* Language Models
   -  Speech recognition
   -  Machine translation
   -  Generate summary

##  Word Embedding / Word2vec

- Why other options do not work
    * One-hot vectors (vocabulary list too big; no similarity measurement; cannot handle new words)
    * Co-occurrence vectors (a matrix counting, for a given window size, the # of times two words appear together) -> Sparsity
    * Singular Value Decomposition (SVD) of the co-occurrence matrix (too expensive)
    * Use a word's context to represent --> Word embedding
    
    
    
- Key Components
    * Center word *c*, context word *o*
    * Two vectors for each word *w*: $ v_w $ and $ u_w $. $\theta$ contains all *u* and *v* (Input and Output Vector)
    * For example: $ P( w_2|w_1 ) = P(w_2|w_1;  u_{w2}, v_{w1}, \theta )$
    * Loss Function: $ J(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\sum_{j\in window,\ j\neq 0} \log P(w_{t+j}|w_t)$
    * Calculate $u^T v$ for each word, and use softmax to derive probability
      - $ P(o|c) = \frac{exp(u_o^T v_c)}{\sum_{w}exp(u_w^T v_c)} $  
      - the sum over $w$ runs over the entire vocabulary
        
    * After optimization for loss, get two vectors for each word. Combine or Use *u* or Use *v*
    
    
    
- Variation
    * Skip-grams (SG): given center, predict context
    * Continuous Bag of Words (CBOW): given bag of context, predict center
    * Negative sampling (maximize p of actual context + minimize p of random context i.e. noise)
    * GloVe: combine count-based and direct-prediction
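
The softmax over dot products from the Key Components list can be sketched as below. The two-dimensional vectors are invented for illustration (real embeddings are learned and much larger), and `V` / `U` are assumed names for the center-word and context-word tables.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def context_probs(center, V, U):
    """P(o | c) for every word o in the vocabulary, given center word c."""
    v_c = V[center]
    words = list(U)
    scores = [sum(a * b for a, b in zip(U[w], v_c)) for w in words]  # u_w^T v_c
    return dict(zip(words, softmax(scores)))

# Toy vectors (assumed values): V holds center vectors v_w, U holds context vectors u_w.
V = {"paris": [1.0, 0.0], "museum": [0.5, 0.5], "eat": [0.0, 1.0]}
U = {"paris": [0.2, 0.1], "museum": [1.0, 0.0], "eat": [-1.0, 0.5]}
probs = context_probs("paris", V, U)
```

The denominator sums over the whole vocabulary, which is the cost that negative sampling avoids.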

# Classification
## Word Window Classification
- Difference with typical ML: learn both **W** and word vectors **x**
- Task definition: classify a word in its *Context Window*
    * Advantage over classifying a single word in isolation: the context resolves ambiguity
    * Advantage over averaging the vectors in the window: position information is kept
    * Get a vector X with length of 5d where 5 is window size and d is embedding size
    * Predict y based on softmax of WX and minimize cross-entropy error
    
    
- Example: NER (*Named Entity Recognition*)
    * "Museums in Paris are good". Binary task: whether Paris is a *location* or not.
    
    
- What happens for word embedding x:
    * Updated by gradient descent just like the weights W
    * Pushed into an area helpful for the classification task
    * Example: $x_{in}$, the vector for the word *in*, may become a signal that a location follows
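
The window classifier described above (concatenate the window's embeddings into a vector of length 5d, apply softmax to WX, score with cross-entropy) can be sketched as follows. All embeddings, the weight matrix `W`, the `<pad>` token, and the label convention are assumptions for illustration; the sketch shows only the forward pass, not training.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def window_vector(tokens, position, embeddings, half_window=2, pad="<pad>"):
    """Concatenate the embeddings of the window centered on tokens[position]."""
    padded = [pad] * half_window + tokens + [pad] * half_window
    window = padded[position:position + 2 * half_window + 1]
    x = []
    for tok in window:
        x.extend(embeddings[tok])  # keeps position information, unlike averaging
    return x

def predict(W, x):
    """Class probabilities: softmax(W x)."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

def cross_entropy(probs, label):
    return -math.log(probs[label])

# Toy embeddings with d = 2 (assumed values).
embeddings = {
    "<pad>": [0.0, 0.0], "museums": [0.1, 0.3], "in": [0.9, 0.2],
    "paris": [0.4, 0.8], "are": [0.2, 0.1], "good": [0.3, 0.5],
}
tokens = ["museums", "in", "paris", "are", "good"]
x = window_vector(tokens, 2, embeddings)  # window centered on "paris"
W = [[0.0] * 10, [0.1] * 10]              # 2 classes x 5d weights (assumed)
probs = predict(W, x)
loss = cross_entropy(probs, 1)            # label 1 = "location" (assumed convention)
```

In training, the gradient of `loss` would flow into both `W` and the entries of `embeddings`, which is the difference from typical ML noted at the top of this section.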

## CNN


<img src="https://i.stack.imgur.com/a6CJc.png" width="800">


<img src="https://raw.githubusercontent.com/bicepjai/Deep-Survey-Text-Classification/master/images/paper_02_cnn_sent_model.png" width="400">

# Machine Translation

## Problem definition
- Neural Machine Translation (NMT)
- Sequence-to-Sequence (seq2seq) architecture
- Difference from SMT (Statistical MT): calculate P(y|x) directly instead of decomposing it with Bayes' rule
- Advantage: Single NN, less human engineering
- Disadvantage: less interpretable, less control
- Figure (TBA)

## Main Components
- Encoder RNN: encode source sentence, generate hidden state
- Decoder RNN: **Language Model**, generate target sentence using outputs from encoder RNN; predict next word in *y* conditional on input *x*




<img src="https://cdn-images-1.medium.com/max/1585/1*sO-SP58T4brE9EHazHSeGA.png" width="800">


<tr>
    <td> <img src="https://guillaumegenthial.github.io/assets/img2latex/seq2seq_vanilla_encoder.svg" alt="Drawing" style="width: 400px;"/> </td>
    <td> <img src="https://guillaumegenthial.github.io/assets/img2latex/seq2seq_vanilla_decoder.svg" alt="Drawing" style="width: 500px;"/> </td>
    </tr>

   

- Bidirectional Encoder
    * Allows information from future inputs
    * A unidirectional LSTM only allows past information
<img src="https://cdn-images-1.medium.com/max/764/1*6QnPUSv_t9BY9Fv8_aLb-Q.png" width="500">


    


## Beam Search


- Greedy decoding problem
    * Instead of generating argmax each step, use beam search.
    * Keep *k* most probable translations
    * Exactly *k* nodes at each time step *t*
    * *Note*: Length bias. Shorter sentences are preferred because every additional word adds another negative log(P) term to the score. Can add a prior for sentence length to compensate.
    
<img src="./figure/beam.png" width="300">
https://arxiv.org/pdf/1703.01619.pdf
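
A minimal beam search sketch follows. The "model" is a hypothetical lookup table (`TABLE`, `step_fn`) standing in for a decoder, chosen so that greedy decoding (k = 1) misses the best sequence. Two simplifications are assumptions of this sketch rather than part of the notes above: finished hypotheses leave the beam, so fewer than *k* may remain active, and the final choice divides the score by length, which is one common alternative to a length prior.

```python
import math

# Hypothetical toy "model": next-word distribution given the prefix so far.
TABLE = {
    ("<s>",): {"a": 0.6, "b": 0.4},
    ("<s>", "a"): {"x": 0.55, "y": 0.45},
    ("<s>", "b"): {"z": 0.9, "x": 0.1},
}

def step_fn(prefix):
    # Any prefix not in the table ends the sentence with probability 1.
    return TABLE.get(tuple(prefix), {"</s>": 1.0})

def beam_search(step_fn, k, max_len=10, bos="<s>", eos="</s>"):
    beams = [([bos], 0.0)]  # (sequence, accumulated log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for word, prob in step_fn(seq).items():
                candidates.append((seq + [word], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:k]:  # keep the k most probable
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    # Length-normalized selection to reduce the bias toward short outputs.
    return max(finished, key=lambda c: c[1] / len(c[0]))

greedy_seq, greedy_score = beam_search(step_fn, k=1)
beam_seq, beam_score = beam_search(step_fn, k=2)
```

With k = 1 the search commits to "a" (0.6) and ends with total probability 0.33, while k = 2 keeps "b" alive and finds "b z" with probability 0.36.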

## Attention model

### General Advantage

- Focus on certain parts of source (Instead of encoding whole sentence in **one** hidden vector.)
- Provides shortcut / Bypass bottleneck
- Get some interpretable results and learn alignment
- "Attention is a mechanism that forces the model to learn to focus (=to attend) on specific parts of the input sequence when decoding, instead of relying only on the hidden vector of the decoder’s LSTM"

### Luong attention mechanism



1. Get encoder hidden states: $ h_1, ..., h_k, ..., h_N $

1. Get decoder hidden state at time *t*: $ s_t $
    - $s_t = LSTM(s_{t-1}, \hat y_{t-1})$<br/><br/>
    
1. Get attention scores by dot product: 
$ \mathbf e_t = [s^T_t h_1, ..., s^T_t h_N] $
    - Other alignment options available <br/>
    <img src="https://i.stack.imgur.com/tiQkz.png" width="300"> 
    - Penalty available: penalize input tokens that have obtained high attention scores in past decoding steps 
    - $e'_t = e_t\ if\ t = 1\ else\ \frac{exp(e_t)}{\sum_{j=1}^{t-1}{exp(e_j)}} $ for decoder state
    
    
4. Take softmax of $ \mathbf e_t $ and get $ \pmb\alpha_t $ which sum up to one
    - $ \pmb\alpha_t = softmax(\mathbf e_t) $
    - Note: $\pmb\alpha_t$ can be interpreted as attention. For example, when generating word `vas`, the attention for `are` in encoder hidden states should be close to 1, others to 0<br/><br/>
    
    
5. Take the weighted sum of the encoder hidden states $\mathbf h$ with weights $\pmb\alpha$, and get context vector **c**
    - $ c_t = \sum_{k=1}^{N} \alpha_{tk}h_k $<br/><br/>
    
6. Generate *Attentional Hidden Layer*
    - $ \tilde h_t = tanh(W_c[c_t;s_t])$, where $s_t$ is the decoder hidden state from step 2<br/><br/>

7. Make Prediction
    - $ p = softmax(W_s \tilde h_t)$
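
The scoring, softmax, and context-vector steps above can be sketched without any learned parameters. The hidden states below are invented two-dimensional vectors; the attentional layer and the prediction step are left out because they need trained matrices $W_c$ and $W_s$.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend(s_t, encoder_states):
    """Dot-product attention: scores e_t, weights alpha_t, context vector c_t."""
    scores = [dot(s_t, h) for h in encoder_states]  # e_t
    weights = softmax(scores)                       # alpha_t, sums to one
    dim = len(encoder_states[0])
    context = [
        sum(a * h[i] for a, h in zip(weights, encoder_states))
        for i in range(dim)
    ]                                               # c_t = sum_k alpha_tk * h_k
    return weights, context

# Toy encoder states h_1..h_3 and decoder state s_t (assumed values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_t = [0.0, 5.0]
weights, context = attend(s_t, encoder_states)
```

Here `s_t` points along the second dimension, so almost all of the attention mass goes to the second and third encoder states, and the context vector is dominated by them.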


<img src="./figure/attention.png" width="500">



### Bahdanau Attention Mechanism

**Main difference**

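1. Get attention scores from the **previous** decoder state $s_{t-1}$ (written here as a dot product; the original paper uses an additive score $v^T tanh(W s_{t-1} + U h_k)$): 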
    - $ \mathbf e_t = [s^T_{t-1} h_1, ..., s^T_{t-1} h_N] $<br/><br/>

1. Get decoder hidden state at time *t*: $ s_t $
    - $s_t = LSTM(s_{t-1}, \hat y_{t-1}, c_t)$<br/><br/>
    
1. Make Prediction: 
    - $ p = softmax(g(s_t))$

<img src="https://guillaumegenthial.github.io/assets/img2latex/seq2seq_attention_mechanism_new.svg" width="500">



**Comparison of the two mechanisms**

<img src="http://cnyah.com/2017/08/01/attention-variants/attention-mechanisms.png" width="500">

### Pointer Network
- RNN (LSTM): difficult to predict rare or out-of-vocabulary (OoV) words
- Pointer Network: can copy a word directly from the input sentence, so it can output words that are not in the vocabulary

<img src="https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/efbd381493bb9636f489b965a2034d529cd56bcd/1-Figure1-1.png" width="500">

- Part I: Seq2Seq Attention Model
    - See above
    - $p_{vocabulary}(word)$
    
    
- Part II: Pointer Generator
    - After getting $ \pmb\alpha_t = softmax(\mathbf e_t) $
    - $p_{pointer}(word) = \sum_{k:\ x_k = word} \alpha_{tk}$, i.e. the sum of attention weights over all source positions $k$ whose token $x_k$ is that word


- Weighted sum: 
    - $g * p_{vocabulary}(word) + (1-g) * p_{pointer}(word) $
    
    
- Applications:
    - Summarization
    - Question-Answering
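
The weighted sum of the vocabulary and pointer distributions can be sketched as below. The distributions, the source sentence, and the gate value `g` are all invented for the example; in a real model `g` and the attention weights are computed by the network at every decoding step.

```python
def final_distribution(p_vocab, attention, source_tokens, g):
    """g * p_vocabulary(word) + (1 - g) * p_pointer(word) for every candidate word."""
    # p_pointer(word): total attention on source positions holding that word.
    p_pointer = {}
    for token, alpha in zip(source_tokens, attention):
        p_pointer[token] = p_pointer.get(token, 0.0) + alpha
    words = set(p_vocab) | set(p_pointer)
    return {
        w: g * p_vocab.get(w, 0.0) + (1 - g) * p_pointer.get(w, 0.0)
        for w in words
    }

# Toy inputs (assumed values). "germany" and "beat" are not in the vocabulary.
p_vocab = {"the": 0.7, "won": 0.3}
source_tokens = ["germany", "beat", "germany"]
attention = [0.5, 0.1, 0.4]  # alpha_t over the three source positions
p_final = final_distribution(p_vocab, attention, source_tokens, g=0.8)
```

The out-of-vocabulary word "germany" receives probability from the pointer term alone, which is how the model can still generate it.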

### Self/Intra/Inner Attention

- Compute alignment function f among **decoder** hidden states $s_t$
- Apply softmax for all states before current time $t$
- Weighted sum will get current decoder attention output $c^d_t$
- Why self-attention?


### Transformer Network

https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

# Coreference Resolution
## Coreference and Anaphora
- Barack Obama travelled to,..., Obama
- Obama says that he ....

## Mention detection
- Pronouns (I, your, it) - Part-of-Speech (POS) tagger
- Named Entity (people's names, places, etc.) - NER system
- Noun phrase (The dog stuck in the hole) - constituency parser

## Coreference model
- Mention pair
    * For each mention, look at candidate antecedents, and train a **binary** classifier to predict $p(m_i,m_j)$

- Mention rank
    * Apply softmax to all candidate antecedents, and add highest scoring coreference link
    * Each mention is only linked to **one** antecedent
    

- Clustering


- Neural Coref Model
    * Input layer: word embedding and other categorical features (e.g., distance, document characteristic)
<img src="./figure/Coref.png" width="500">


- End-to-end Model
    * No separate mention detection step
    * Apply LSTM and attention
    * Consider every span of text as a candidate mention
    * Final score: $s(i, j) = s_m(i) + s_m(j) + s_a(i, j)$, i.e. are spans $i$ and $j$ each a mention ($s_m$), and do they look coreferent ($s_a$)?
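
The final score and the softmax over candidate antecedents (as in mention ranking) can be sketched as follows. The mention scores $s_m$ and pairwise scores $s_a$ are invented numbers here; in the end-to-end model they come from learned networks over span representations.

```python
import math

def pair_score(i, j, mention_score, antecedent_score):
    """s(i, j) = s_m(i) + s_m(j) + s_a(i, j)."""
    return mention_score[i] + mention_score[j] + antecedent_score[(i, j)]

def antecedent_distribution(i, candidates, mention_score, antecedent_score):
    """Softmax over the candidate antecedents of span i."""
    scores = [pair_score(i, j, mention_score, antecedent_score) for j in candidates]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return {j: e / z for j, e in zip(candidates, exps)}

# Toy scores (assumed values) for the span "he" and two earlier spans.
mention_score = {"Obama": 2.0, "the hole": 1.0, "he": 1.5}
antecedent_score = {("he", "Obama"): 3.0, ("he", "the hole"): -2.0}
dist = antecedent_distribution("he", ["Obama", "the hole"], mention_score, antecedent_score)
best = max(dist, key=dist.get)
```

Only the highest-scoring link is added, so each mention ends up with one antecedent.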
    
<img src="./figure/endtoend.png" width="500">

# RL for NLP

To be added