<a href="https://colab.research.google.com/github/ucaokylong/NLP_learning/blob/main/09_Machine_Translation_Attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center>
<img src='https://github.com/JUSTSUJAY/NLP_One_Shot/blob/main/assets/1/lang-pic.jpg?raw=1' width=600>
</center>
    
# 1. Introduction

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">1.1 NLP series</p>

This is the **ninth in a series of notebooks** covering the **fundamentals of Natural Language Processing (NLP)**. I find that the best way to learn is by teaching others, hence why I am sharing my journey learning this field from scratch. I hope these notebooks can be helpful to you too.

NLP series:

1. [Tokenization](./01_Tokenization.ipynb)
2. [Preprocessing](./02_Pre_Processing.ipynb)
3. [Bag of Words and Similarity](./03_BOW_Similarity.ipynb)
4. [TF-IDF and Document Search](./04_TFIDF_DocSearch.ipynb)
5. [Naive Bayes Text Classification](./05_NaiveBayes_TextClf.ipynb)
6. [LDA Topic Modelling](./06_LDA_TopicModelling.ipynb)
7. [Word Embeddings](./07_Word_Embeddings.ipynb)
8. [Recurrent Neural Networks (RNNs) and Language Modelling](./08_RNNs_LMs.ipynb)
9. Machine Translation and Attention
<a target="_blank" href="https://colab.research.google.com/github/JUSTSUJAY/NLP_One_Shot/blob/main/Notebooks/09_Machine_Translation_Attention.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/JUSTSUJAY/NLP_One_Shot/blob/main/Notebooks/09_Machine_Translation_Attention.ipynb)

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">1.2 Outline</p>

Last time, we covered how Recurrent Neural Networks (RNNs) and their variants like GRU and LSTM can model text as a sequence of words. This unlocked a new set of tasks we could perform.

In this notebook, we're going to look at a particular RNN architecture, namely **sequence-to-sequence**, and how it can be used for **Machine Translation**. We'll then discuss its drawbacks and see how the **Attention mechanism** can mitigate these shortcomings.

# 2. Machine Translation

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">2.1 RNN architectures</p>

Recall that there are **different ways to set up RNNs** to solve different kinds of problems.

<br>
<center>
<img src="https://github.com/JUSTSUJAY/NLP_One_Shot/blob/main/assets/9/rnn-setups.jpg?raw=1" width=600>
</center>
<br>

* **one-to-one** - equivalent to a feed-forward neural network.
* **one-to-many** - used for captioning, decoders or music generation.
* **many-to-one** - used for classification tasks like sentiment analysis.
* **many-to-many** - used for sequence labelling and language modelling tasks.
* **staggered many-to-many** (also known as sequence-to-sequence) - used for translation, summarization and chatbots.

We're going to focus on the last type, i.e. **sequence-to-sequence** (aka seq2seq) in this notebook. This architecture is the most **flexible** because it can take in a sequence of one length and output a sequence of a completely **different** length.

This is especially useful in **translation** because a sentence in one language can use a wildly different number of characters in a different language. For example, `"Please can you bring me a glass of water"` in English is `"请您给我带杯水。"` in Mandarin. The grammar rules in Mandarin often allow for more concision in the way information is expressed so we need a model that can deal with this.

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">2.2 Sequence-to-sequence</p>

The seq2seq architecture can be broken down into 2 submodels - the **encoder** and the **decoder**. The encoder takes in the input sequence and processes it whereas the decoder takes the output of the encoder and uses this to output a new sequence.

The RNN cells can be either plain, GRU or LSTM as this doesn't change the architecture.

<br>
<center>
<img src="https://github.com/JUSTSUJAY/NLP_One_Shot/blob/main/assets/9/seq2seq.png?raw=1" width=600>
</center>
<br>

Observe how the intermediate outputs of the encoder are **not used** at all in this setup. The only thing that gets passed to the decoder is the **final hidden state**, which can be thought of as an **embedding** of the input sequence.

The first step of the decoder takes in the encoder embedding and a beginning of sentence token. It uses this to generate an output that is passed through a **softmax function** to produce a **probability distribution** over the target language vocabulary. The most probable word is selected as the **prediction**, which is then fed in as the **input** for the **next time step** of the decoder and the process repeats.
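To make the decoding loop above concrete, here is a minimal sketch in plain Python. The names `greedy_decode` and `toy_step`, the four-word vocabulary and the logit table are all made up for illustration; `toy_step` stands in for a trained decoder cell and only looks at the previous token.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(step_fn, state, bos_id, eos_id, max_len=20):
    """Pick the most probable token at every step until <eos> or max_len."""
    tokens = []
    prev = bos_id
    for _ in range(max_len):
        logits, state = step_fn(prev, state)
        probs = softmax(logits)
        prev = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if prev == eos_id:
            break
        tokens.append(prev)  # the prediction becomes the next input
    return tokens

# Hypothetical stand-in for a trained decoder: logits depend only on the previous token.
VOCAB = ["<bos>", "<eos>", "le", "chat"]
TOY_LOGITS = {
    0: [0.0, 0.1, 2.0, 0.5],  # after <bos>, "le" scores highest
    2: [0.0, 0.2, 0.1, 3.0],  # after "le", "chat" scores highest
    3: [0.0, 2.5, 0.1, 0.3],  # after "chat", <eos> scores highest
}

def toy_step(prev_token, state):
    # A real decoder would also update `state`; here it is passed through unchanged.
    return TOY_LOGITS[prev_token], state

decoded = greedy_decode(toy_step, state=None, bos_id=0, eos_id=1)
print([VOCAB[t] for t in decoded])  # ['le', 'chat']
```

In a real model, `state` would start out as the encoder's final hidden state and be updated by the RNN cell at every step.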

Because the decoder makes predictions based on the encoder output, this is considered a **conditional language model**. This is trained using a **self-supervised teacher forcing** approach. That is, we get it to **predict the next word** in the sequence given the correct previous words in the translation. **Backpropagation** is used to update the weights of the encoder and decoder simultaneously, i.e. it is treated as a **single system**.
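As a rough illustration of teacher forcing, the sketch below computes the training loss for one reference translation. The decoder inputs are the reference tokens shifted by one position, so the model always sees the correct previous word. The helper names and the `uniform_step` dummy decoder are assumptions for this example; a real setup would use a deep learning framework to backpropagate this loss through both the decoder and the encoder.

```python
import math

def log_softmax(logits):
    """Log-probabilities computed in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def teacher_forcing_loss(step_fn, state, target_ids):
    """Average negative log-likelihood of the reference translation."""
    inputs, labels = target_ids[:-1], target_ids[1:]  # shift by one position
    total = 0.0
    for prev, gold in zip(inputs, labels):
        logits, state = step_fn(prev, state)  # feed the *correct* previous token
        total -= log_softmax(logits)[gold]    # penalise low probability on the gold word
    return total / len(labels)

# Hypothetical untrained decoder: every word in a 4-word vocabulary is equally likely.
def uniform_step(prev_token, state):
    return [0.0, 0.0, 0.0, 0.0], state

# Reference translation as token ids: <bos>=0, two words, <eos>=1.
loss = teacher_forcing_loss(uniform_step, None, [0, 2, 3, 1])
print(loss)  # equals log(4) for a uniform model over 4 words
```

An untrained model that spreads probability evenly over a vocabulary of size V scores log(V) per word, which is a handy sanity check before training starts.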

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">2.3 Challenges</p>

Machine Translation is a very **challenging problem** because different languages can express the same concept in different ways. In particular, some of the difficulties include

* **Word alignment** - a word in one language may be expressed as 2 words in another language and vice versa.
* **Word order structure** - English is a Subject-Verb-Object (SVO) language whereas Japanese for example is a Subject-Object-Verb (SOV) language. So `"Joe ate cake"` would be expressed as `"Joe cake ate"` in Japanese.
* **Adequacy vs fluency** - there are often many ways to express the same information in a language but some of them will sound more fluent and natural to a native speaker than others.
* **Bias** - all humans are biased in some way, therefore very large sets of diverse training data are needed to train accurate and unbiased translation models.

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">2.4 Beam Search</p>

So far we have assumed a **greedy decoding** strategy, i.e. the most **probable word** is chosen at each time step. But this doesn't necessarily produce the best translations. What we really want is the **highest probability sentence**, and the second most probable word at the first time step might, for example, have led to a better scoring sentence overall because the later words fit it better. Greedy search only explores a **narrow** set of all the possible combinations.

<br>
<center>
<img src="https://github.com/JUSTSUJAY/NLP_One_Shot/blob/main/assets/9/beamsearch.png?raw=1" width=600>
</center>
<br>

**Beam search** on the other hand tracks the **top k branches** at each time step according to the ones with the **highest cumulative log probabilities**. For example, in the image above a **beam width** of k=2 is being used.

In practice, we **normalize** the log probabilities by the number of words so beam search doesn't have a bias for shorter translations. The formula then is

<br>

$$
\Large
\text{score}(w_1,...,w_T) = \frac{1}{T} \sum_{i=1}^{T} \log \mathbb{P}(w_i | w_1, ..., w_{i-1}, x)
$$

<br>

This allows us to explore **more combinations** and return better translations than greedy search. But this comes at the cost of being more computationally expensive. The larger k is, the more combinations we consider but the longer it takes at **inference time**, so there is a **trade-off**.
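Here is a small sketch of beam search with the length-normalized score from the formula above. Everything in it is a toy assumption: `toy_log_probs` plays the role of the decoder and only looks at the last token, and the probability table is hand-picked so that greedy decoding and beam search disagree. Finished hypotheses are simply set aside, which is one of several possible design choices.

```python
import math

def beam_search(step_log_probs, bos_id, eos_id, beam_width=2, max_len=10):
    """Return (tokens, length-normalized score) of the best hypothesis found."""
    beams = [([bos_id], 0.0)]  # (tokens so far, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, lp in enumerate(step_log_probs(tokens)):
                candidates.append((tokens + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:  # keep the top k branches
            if tokens[-1] == eos_id:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off by max_len
    # Normalize by the number of generated tokens T (everything after <bos>).
    return max(((t, s / (len(t) - 1)) for t, s in finished), key=lambda c: c[1])

# Hypothetical next-token probabilities over [<bos>, <eos>, A, B], keyed by the last token.
TOY_PROBS = {
    0: [0.01, 0.01, 0.58, 0.40],  # after <bos>: A looks best locally
    2: [0.01, 0.34, 0.33, 0.32],  # after A: the model is unsure
    3: [0.01, 0.97, 0.01, 0.01],  # after B: the model is confident
}

def toy_log_probs(tokens):
    return [math.log(p) for p in TOY_PROBS[tokens[-1]]]

greedy_tokens, greedy_score = beam_search(toy_log_probs, 0, 1, beam_width=1)
best_tokens, best_score = beam_search(toy_log_probs, 0, 1, beam_width=2)
print(greedy_tokens, best_tokens)  # [0, 2, 1] [0, 3, 1]
```

With a beam width of 1 this reduces to greedy search, which commits to A and ends up with a sentence probability of 0.58 x 0.34. With a beam width of 2 the B branch survives the first step and wins with 0.40 x 0.97.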

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">2.5 Evaluation with BLEU</p>

BLEU stands for **Bilingual Evaluation Understudy** and is a way to evaluate how good a translation is. Interestingly, it comes from a [paper](https://aclanthology.org/P02-1040.pdf) from 2002, way before computers became good at translation.  

In essence, it measures **similarity** by counting **n-gram overlap** between the prediction and a reference translation. The formula is given by:  

$$
\text{BLEU} = \min\left(1, \exp\left(1 - \frac{\text{reference\_len}}{\text{candidate\_len}}\right)\right) \left(\prod_{i=1}^{4} \text{precision}_i \right)^{\frac{1}{4}}
$$

where

$$
\text{precision}_i = \frac{\sum_{C \in \text{Candidates}} \sum_{n\text{-}gram \in C} \min\left(n^{i}_{\text{cand\_match}}, n^{i}_{\text{ref}}\right)}{\sum_{C \in \text{Candidates}} \sum_{n\text{-}gram \in C} n^{i}_{\text{cand}}}
$$




The first term in the BLEU formula is a **brevity penalty**, which penalizes candidate translations that are short relative to the reference translation. The second term measures the **precision** of 1-grams, 2-grams, 3-grams and 4-grams by counting how many times they **overlap** in the candidate and reference translations. The min is there to prevent predictions filled with the same word from getting high scores.
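The formula can be sketched in a few lines of Python. This is a simplified sentence-level version with a single reference and no smoothing (the function names are my own), so treat it as an illustration and not a replacement for an established implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count every n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference, without smoothing."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clip each candidate count by how often the n-gram occurs in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or clipped == 0:
            return 0.0  # the geometric mean is zero if any precision is zero
        precisions.append(clipped / total)
    brevity_penalty = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity_penalty * geo_mean

reference = "the cat is on the mat".split()
perfect = bleu(reference, reference)
shorter = bleu("the cat is on mat".split(), reference)
repeated = bleu("the the the the the the".split(), reference, max_n=1)
print(perfect, shorter, repeated)
```

The repeated-word candidate shows why the min matters: `"the"` appears 6 times in the candidate but only twice in the reference, so its unigram precision is clipped to 2/6 instead of 6/6.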

Whilst BLEU is a very useful metric and still widely popular today, it does have a significant drawback. This is that it doesn't take **word meaning** into account. So if two sentences have similar meaning but use different words then BLEU won't recognize this. There are other metrics out there like ROUGE but evaluating translation is still an **active area of research**.

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">2.6 Bottleneck</p>

When sequence-to-sequence models were first being experimented with, they were found to perform impressively for short sequences but really struggle to translate **long sentences**. This is because the model in its current form has a significant **information bottleneck** between the encoder and decoder.

<br>
<center>
<img src="https://github.com/JUSTSUJAY/NLP_One_Shot/blob/main/assets/9/bottleneck.png?raw=1" width=500>
</center>
<br>

All the information from the input sequence has to be **compressed** into a **single fixed length vector**, which is then used to produce a translation via the decoder. As the input sequence gets longer, there is **less space** to carry all of the complex language information that is needed to produce an accurate translation. This is where the **attention mechanism** comes in.

# 3. Attention

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">3.1 Attention mechanism</p>

Attention is a powerful approach used to deal with the information bottleneck problem. The idea is instead of throwing away all the **hidden states** in the encoder, we let the decoder use these to figure out how **important** each input word is at each time step.

<br>
<center>
<img src="https://github.com/JUSTSUJAY/NLP_One_Shot/blob/main/assets/9/attention.jpg?raw=1" width=450>
</center>
<br>

Let's denote the **hidden states** of the encoder by $h_1, h_2, ...$ and the hidden states of the decoder by $s_1, s_2, ...$. We'll also let $s_0$ be the hidden state that gets passed between the **encoder and decoder**.

Attention works by computing some **scoring function** (aka alignment model) between the encoder and decoder hidden states. We'll denote this by $f(h_i, s_j)$. Intuitively, this measures how much **attention** the model should pay to the input word indexed by $i$ at the $j$-th time step in the decoder. These scores are then passed through a softmax to produce a **distribution of attention weights** $\alpha_{1,j}, \alpha_{2,j},...$.

The attention weights are used to calculate a **context vector** as a weighted sum of the encoder hidden states. That is,

<br>

$$
\Large
c_j = \sum_{i} \alpha_{i,j} h_{i}
$$

<br>

At the $j$-th output time step, we **concatenate** the $j$-th context vector with the **embedding** of the $j$-th input token to the decoder. Recall that this decoder input will be determined by the previous time step. The concatenated vector is then fed through the decoder **without any other changes**.
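The softmax, weighted sum and concatenation steps can be sketched with NumPy. The encoder states, scores and token embedding below are made-up toy values, and `context_vector` is a hypothetical helper name.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def context_vector(scores, encoder_states):
    """scores: (T,) values of f(h_i, s_j); encoder_states: (T, d)."""
    alpha = softmax(scores)               # attention weights, sum to 1
    return alpha, alpha @ encoder_states  # c_j = sum_i alpha_ij * h_i

# Three toy encoder hidden states of dimension 2.
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
scores = np.array([2.0, 0.0, 0.0])  # the first input word scores highest

alpha, c = context_vector(scores, H)

# Concatenate the context vector with the (toy) embedding of the decoder input token.
token_embedding = np.array([0.5, -0.5, 0.1])
decoder_input = np.concatenate([c, token_embedding])
print(alpha, c, decoder_input.shape)
```

Because the first score is the largest, the context vector is pulled towards the first encoder state. The decoder cell then receives a vector of size 2 + 3 = 5 instead of just the 3-dimensional embedding.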

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">3.2 Additive Attention</p>

The one thing we haven't discussed yet is what the **scoring function** should be. In the original attention paper, **Additive (aka Bahdanau) Attention** was proposed.

The function is calculated by multiplying the input vectors with matrices and **adding** these together. This is passed through a tanh activation function and then reduced to a scalar. That is,

<br>

$$
\Large
f(h_i, s_j) = v^{T}\tanh(W h_i+U s_j)
$$

<br>

This is simply a **feed-forward neural network**, which gets trained via backpropagation. So by adding more **learnable parameters** we allow the model to learn which hidden encoder states are more important given the current decoder hidden state.
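As a sketch, the additive score can be computed for all encoder states at once with NumPy. The dimensions and the randomly initialized `W`, `U` and `v` are placeholders for parameters that would normally be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
d_enc, d_dec, d_att, T = 4, 3, 5, 6  # toy sizes, chosen arbitrarily

# Stand-ins for the learnable parameters of the scoring network.
W = rng.normal(size=(d_att, d_enc))
U = rng.normal(size=(d_att, d_dec))
v = rng.normal(size=d_att)

def additive_scores(H, s):
    """H: (T, d_enc) encoder states, s: (d_dec,) decoder state -> (T,) scores."""
    # H @ W.T applies W to every h_i; U @ s is broadcast across the T rows.
    return np.tanh(H @ W.T + U @ s) @ v

H = rng.normal(size=(T, d_enc))
s = rng.normal(size=d_dec)
scores = additive_scores(H, s)
print(scores.shape)  # (6,)
```

Note that `W` and `U` project the encoder and decoder states into a shared space of size `d_att`, so the two hidden sizes do not need to match.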

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">3.3 Multiplicative Attention</p>

An alternative to the additive attention scoring function is **Multiplicative (aka Luong) Attention**. This simply **multiplies** the input vectors together via the dot product.

<br>

$$
\Large
f(h_i, s_j) = s_j^{T} h_i
$$

<br>

Some implementations use an **additional matrix** of learnable weights in the multiplication.

<br>

$$
\Large
f(h_i, s_j) = s_j^{T} W h_i
$$

<br>

This is usually preferred to additive attention because it is much **faster** to compute (it reduces to matrix multiplication) and reaches a **similar performance**.

Note that the **practical implementations** of additive and multiplicative attention are slightly different. In the former, the context vector is concatenated before being fed through the decoder RNN cells, whereas in the latter it is concatenated after being fed through the RNN cells. There may also be an **additional feed-forward neural network layer** on these outputs in multiplicative attention.
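Both multiplicative scoring functions are one-liners in NumPy. The function names and toy values below are assumptions for illustration; in the second variant `W` would be a learned matrix.

```python
import numpy as np

def dot_scores(H, s):
    """f(h_i, s_j) = s_j^T h_i for every encoder state; needs matching dimensions."""
    return H @ s

def general_scores(H, s, W):
    """f(h_i, s_j) = s_j^T W h_i, with W of shape (d_dec, d_enc)."""
    return H @ W.T @ s

# Toy encoder states (T=3, d=2) and a toy decoder state.
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
s = np.array([2.0, 3.0])

plain = dot_scores(H, s)
with_identity = general_scores(H, s, np.eye(2))
print(plain)  # [2. 3. 5.]
```

With `W` set to the identity matrix the two variants coincide. The extra matrix is also what allows the encoder and decoder hidden sizes to differ.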

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">3.4 Local Attention</p>

What we've seen so far are examples of **Global Attention**, where attention weights are calculated for all the encoder hidden states. This may become computationally expensive for long sequences so an alternative is **Local Attention** that only computes attention weights on a **neighbourhood** of the encoder hidden state corresponding to the time step of the decoder output. The **window/neighbourhood size and location** can be treated as additional **hyperparameters** for the model.
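A simplified sketch of the windowing idea is shown below, using dot-product scores. The `centre` and `half_width` arguments are hypothetical names for the window location and size, and published variants of local attention add further refinements that are left out here.

```python
import numpy as np

def local_attention(H, s, centre, half_width):
    """Dot-product attention restricted to a window of encoder positions."""
    lo = max(0, centre - half_width)
    hi = min(len(H), centre + half_width + 1)
    window = H[lo:hi]                  # only these states are scored
    scores = window @ s
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()           # softmax inside the window only
    alpha = np.zeros(len(H))
    alpha[lo:hi] = weights             # positions outside the window get weight 0
    return alpha, weights @ window

H = np.eye(5)    # five toy encoder states
s = np.ones(5)   # toy decoder state
alpha, c = local_attention(H, s, centre=2, half_width=1)
print(alpha)
```

Here only positions 1 to 3 receive any weight, so the cost per decoder step depends on the window size instead of the full input length.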

We won't build or train a full attention-based translation model here because the truth is that the NLP world has **moved on** from simple attention. Nowadays the state-of-the-art models are called **transformers** and these rely on a related concept called **self-attention**. We'll explore this in detail in the next notebook.

# 4. Conclusion

In this notebook we explored a particular RNN architecture, namely **sequence-to-sequence** and how it can be used for **machine translation**. We also saw how the **information bottleneck** between the encoder and decoder can be mitigated via the **attention** mechanism, which enables the model to focus on the important words in the input sequence.

Whilst sequence-to-sequence models and attention were **revolutionary** when they were introduced around 2014, they have already become outdated and are rarely used nowadays. The current state-of-the-art models are called **transformers** and these are built on a concept called **self-attention** introduced in 2017. We will cover transformers in the **next and final notebook of this series** and see the truly incredible things they can do.

**References:**
* [NLP demystified](https://www.nlpdemystified.org/)

### Coming up
#### [10. Transformers](./10_Transformers.ipynb)

Thanks for reading!