## 1. Human Translation
**How would you translate the following sentence (taken from the Europarl dataset) into Hindi?**

"It has been of such depth that we can imagine how, for example, Ireland has made such extremely difficult adjustments, not because the International Money Fund says so, or because it has been imposed by anyone from Brussels, but because the Irish authorities consider it to be the best way to adjust its economy as soon as possible and move forward with the same impetus that it had before the crisis."

**Answer 1**:
1. Read the sentence to get an overall idea of what it is trying to convey
2. Write the translation without going back to the source sentence

**Answer 2**:
1. Read the sentence to get an overall idea of what it is trying to convey
2. Write the translation and, while doing so, **revisit different parts of the source sentence** to ensure that:
    1. Our translation covers the entire sentence
    2. Our translation uses the equivalent tense
    
**Tip**: Try to mimic the same process in machine translation

## 2. Machine Translation (Neural)

### 2.1 Input, Output, & Problem Category
**Input**: English sentence

**Output**: Hindi sentence

**Problem Category**: Sequence-to-sequence transformation

### 2.2 Architecture 1: Without Attention

The following figure shows the architecture for neural machine translation without attention:

<figure>
  <br>
  <img src="./assets/images/enc_dec.jpeg" style="width:80%;">
  <img src="./assets/images/enc_dec_unrolled.jpeg" style="width:80%">
  <figcaption style="text-align:center">Architecture for neural machine translation</figcaption>
</figure>


### 2.2.1 Architecture 1 Internals
It follows the encoder-decoder paradigm.

**Encoder**
It consists of the following layers:
1. Embedding layer
2. Recurrent layer

*Input to recurrent layer*:
1. Embedding of the i-th word at the i-th timestep

**Decoder**
It consists of the following layers:
1. Embedding layer
2. Recurrent layer
3. Softmax layer

*Input to recurrent layer*:
1. Hidden state from the last timestep of the encoder -> **input** at every timestep of the decoder
2. Embedding of the word produced at the previous timestep of the decoder -> **input** at the current timestep of the decoder
    1.  At timestep zero, there is no produced word, so we use the `<START>` embedding
3. The above two inputs are concatenated at every timestep of the decoder and then fed to the recurrent layer
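
The encoder and decoder inputs listed above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the vocabulary sizes, dimensions, `START` index, the plain tanh recurrent cell, and the randomly initialised (untrained) weights are all hypothetical choices, not something these notes prescribe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
src_vocab, tgt_vocab, emb_dim, hid_dim = 20, 15, 8, 16
START = 0  # assumed index of the <START> token in the target vocabulary

# Randomly initialised (untrained) parameters.
E_src = rng.normal(size=(src_vocab, emb_dim))  # encoder embedding layer
E_tgt = rng.normal(size=(tgt_vocab, emb_dim))  # decoder embedding layer
W_enc = rng.normal(size=(emb_dim + hid_dim, hid_dim)) * 0.1
W_dec = rng.normal(size=(emb_dim + 2 * hid_dim, hid_dim)) * 0.1
W_out = rng.normal(size=(hid_dim, tgt_vocab)) * 0.1  # softmax layer weights


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


def encode(src_ids):
    """Run the recurrent encoder and return only its last hidden state."""
    h = np.zeros(hid_dim)
    for i in src_ids:  # embedding of the i-th word at the i-th timestep
        h = np.tanh(np.concatenate([E_src[i], h]) @ W_enc)
    return h


def decode(enc_last, max_len=5):
    """Greedy decoding with the same encoder state fed at every timestep."""
    h = np.zeros(hid_dim)
    prev = START  # no produced word yet at timestep zero
    produced = []
    for _ in range(max_len):
        # The two inputs are concatenated before entering the recurrent layer.
        x = np.concatenate([E_tgt[prev], enc_last])
        h = np.tanh(np.concatenate([x, h]) @ W_dec)
        probs = softmax(h @ W_out)
        prev = int(np.argmax(probs))
        produced.append(prev)
    return produced


translation = decode(encode([3, 7, 1, 12]))
```

Because the weights are untrained, the produced word indices are meaningless; the sketch only shows how data flows through the two components.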
    
### 2.2.2 Architecture 1 Downsides
1. Does not work well for long sentences, since it encodes the whole sentence in a fixed-size vector

### 2.3 Architecture 2: With Attention

The following figure shows the architecture for neural machine translation with attention:

<figure>
  <br>
  <img src="./assets/images/enc_dec_with_attn.jpeg" style="width:80%;">
  <img src="./assets/images/enc_dec_with_attn_unrolled.jpeg" style="width:80%">
  <figcaption style="text-align:center">Architecture for neural machine translation with attention</figcaption>
</figure>


### 2.3.1 Architecture 2 Internals
It follows the encoder-decoder paradigm.

**Encoder**
It consists of the following layers:
1. Embedding layer
2. Recurrent layer

*Input to recurrent layer*:
1. Embedding of the i-th word at the i-th timestep

*Change*: 
1. Store the hidden state at every timestep (we will make use of it during decoding) 

**Decoder**
It consists of the following layers:
1. Embedding layer
2. Recurrent layer
3. Softmax layer

*Input to recurrent layer*:
1. Context vector computed from all encoder hidden states (described under *Change* below) -> **input** at every timestep of the decoder (**this replaces the hidden state from the last timestep of the encoder used in Architecture 1**)
2. Embedding of the word produced at the previous timestep of the decoder -> **input** at the current timestep of the decoder
    1.  At timestep zero, there is no produced word, so we use the `<START>` embedding
3. The above two inputs are concatenated at every timestep of the decoder and then fed to the recurrent layer
    
*Change*
1. Compute an alignment score for each source hidden state vector at each decoder timestep
    1. Significance of the alignment score: how much attention to pay to that source hidden state vector at that timestep
    2. How it is calculated: see Section 2.3.2
2. Normalize the alignment scores at each timestep so that they sum to 1
3. Calculate the weighted sum at each timestep
    1. Multiply each source hidden state vector by its normalized alignment score
    2. Add the resulting vectors
4. Feed the resulting vector to the decoder at each timestep
    1. This vector differs at each timestep (t) because the alignment scores depend on the internal state of the decoder at timestep (t-1)
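
The score, normalize, and weighted-sum steps above can be sketched as follows. The dot product used as the scoring function and the softmax used for normalization are assumptions made for this illustration; the names `context_vector`, `enc_states`, and `dec_prev` are hypothetical.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


def context_vector(enc_states, dec_state):
    """Return the attention context vector and the normalized alignment scores.

    enc_states: (T_e, hid) stored encoder hidden states, one per source timestep
    dec_state:  (hid,) decoder hidden state from timestep t-1
    """
    scores = enc_states @ dec_state  # one alignment score per source state (assumed dot product)
    weights = softmax(scores)        # normalized scores, summing to 1
    context = weights @ enc_states   # weighted sum of the source hidden states
    return context, weights


rng = np.random.default_rng(0)
enc_states = rng.normal(size=(4, 6))  # 4 source timesteps, hidden size 6
dec_prev = rng.normal(size=6)
context, weights = context_vector(enc_states, dec_prev)
```

Calling `context_vector` again with a different decoder state gives different weights, which is why the vector fed to the decoder changes from one timestep to the next.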

<figure>
  <br>
  <img src="./assets/images/align_score.jpeg" style="width:80%">
  <figcaption style="text-align:center">How encoder internal state is combined with alignment vectors to create decoder input for each timestep</figcaption>
</figure>
    
**Tip**:
1. When we try to generate *I* at the first timestep, the focus will be on *je*
2. When we try to generate *am* at the second timestep, the focus will be on *suis*
3. When we try to generate *a* at the third timestep, the focus will be on *suis* and *étudiant*
4. When we try to generate *student* at the fourth timestep, the focus will be on *étudiant*
    
### 2.3.2 Computing the alignment scores
1. A vector of alignment scores is called an alignment vector
    1. An alignment vector consists of T<sub>e</sub> elements (where T<sub>e</sub> is the number of encoder timesteps)
    2. We need to calculate T<sub>d</sub> such vectors (where T<sub>d</sub> is the number of decoder timesteps)
2. Design decisions when calculating the alignment scores:
    1. What **input values** to use: the decoder hidden state and the source hidden states (**this may look strange, since the source hidden states are used to compute the alignment scores for those same source hidden states**)
    2. What **computations** to apply to these input values

**Mathematical Formulations**:
    
*Query*: target hidden state / decoder hidden state
    
*Key*: source hidden state / encoder hidden state
    
*Value*: source hidden state (it can be different from key in other problems)
    
*Problem*: Find a function which matches the query to the key (we can assume this function to be a feed-forward neural network and let the model learn the function itself)

<figure>
  <br>
  <img src="./assets/images/ffnn_for_align_score.jpeg" style="width:80%">
  <figcaption style="text-align:center">Function as FFNN</figcaption>
</figure>
    
**Left side FFNN**:
1. Structure:
    1. An arbitrary number of fully connected layers
    2. 1 fully connected softmax layer
2. Drawbacks:
    1. Hardcodes expected position of words in source sentence
    2. Restrictions on source sentence length
    
**Right side FFNN**:
1. Structure:
    1. 1 fully connected layer (with tanh)
    2. 1 fully connected layer (without activation function)
    3. 1 softmax layer
2. Multiple instances of this structure with weight sharing
3. The equations for the FFNN are shown below:

<figure>
  <br>
  <img src="./assets/images/score_eq1.jpeg" style="width:80%">
  <img src="./assets/images/score_eq2.jpeg" style="width:80%">
  <img src="./assets/images/score_eq3.jpeg" style="width:80%">
  <figcaption style="text-align:center">Equations for FFNN</figcaption>
</figure>
    
4. **Tip**: There are simpler scoring functions, as shown below.
    1. Dot product version: combined with the softmax function, it corresponds to the right side FFNN with two modifications: there is no fully connected layer before the softmax layer, and the neurons in the softmax layer use the target hidden state vector as their weights.
    2. General version: combined with the softmax function, it corresponds to the right side FFNN with two modifications: the first layer has a linear activation function and takes only the source hidden state as input, and the neurons in the softmax layer use the target hidden state vector as their weights.
<figure>
  <br>
  <img src="./assets/images/other_scoring.jpeg" style="width:80%">
  <figcaption style="text-align:center">Other scoring functions</figcaption>
</figure>
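
The scoring functions discussed in this section can be sketched as below. The exact formulas live in the figures, so the forms used here (dot product, a bilinear "general" score, and a tanh layer followed by a linear layer for the right side FFNN) are assumptions based on the textual description; all function and parameter names are hypothetical.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


def score_dot(h_t, h_s):
    """Dot product: the target hidden state acts directly as the weights."""
    return h_s @ h_t


def score_general(h_t, h_s, W):
    """General: a linear layer on each source state, then a dot product with h_t."""
    return (h_s @ W.T) @ h_t  # h_t^T W h_s for every source state


def score_concat(h_t, h_s, W, v):
    """Right side FFNN: tanh layer on [h_t; h_s], then a layer without activation."""
    T_e = h_s.shape[0]
    pairs = np.concatenate([np.tile(h_t, (T_e, 1)), h_s], axis=1)  # (T_e, 2d)
    return np.tanh(pairs @ W.T) @ v  # same W and v for every source position


rng = np.random.default_rng(0)
T_e, d, k = 4, 6, 5                       # source length, hidden size, FFNN width
h_s = rng.normal(size=(T_e, d))           # source (encoder) hidden states: keys
h_t = rng.normal(size=d)                  # target (decoder) hidden state: query
W_general = rng.normal(size=(d, d))
W_concat = rng.normal(size=(k, 2 * d))
v = rng.normal(size=k)

alignment_vector = softmax(score_concat(h_t, h_s, W_concat, v))
```

Each function returns T<sub>e</sub> raw scores, and applying `softmax` turns them into one alignment vector. Because the same weights are applied at every source position, none of these functions restricts the source sentence length.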
 
### 2.3.3 Soft vs Hard Attention
**Hard Attention**: A single encoder hidden state is selected to focus on at each decoder timestep

**Soft Attention**: 
1. A mixture (weighted sum) of encoder hidden states is used at each decoder timestep
2. Why soft attention: the function is continuous and differentiable, which enables the use of backpropagation
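
A tiny numeric illustration of the difference, with made-up weights and hidden states. Selecting the hard-attention state by `argmax` is an assumption made for this sketch; sampling a state according to the weights is another possibility.

```python
import numpy as np

# Made-up normalized alignment scores for three encoder hidden states.
weights = np.array([0.1, 0.7, 0.2])
enc_states = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])

# Soft attention: every encoder state contributes in proportion to its weight.
soft_context = weights @ enc_states

# Hard attention: exactly one encoder state is selected (here, the highest-weighted one).
hard_context = enc_states[np.argmax(weights)]
```

The soft version changes smoothly when the weights change slightly, whereas the `argmax` selection jumps from one state to another, which is why it does not work directly with backpropagation.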
    
    
### 2.3.4 Architecture 2 Downsides
1. Does not work well for long sentences, as we now need more memory to store the encoder hidden state vector of every timestep

### 2.4 Architecture 3: Transformer

**Why transformers**:
1. RNNs (with or without attention) do not work well for longer sentences
2. RNNs are inherently serial (this is the main problem the transformer addresses)

### 2.4.1 Self Attention
**Decoder Attention**: A target hidden state focuses on different parts of the **source hidden states**

**Self Attention**: A word position focuses on different parts of the **preceding layer's outputs**

<figure>
  <br>
  <img src="./assets/images/self_attn.jpeg" style="width:80%">
  <figcaption style="text-align:center">Embedding layer -> Self-attention layer -> Fully connected layer</figcaption>
</figure>

*Query*: Previous layer output for the current word

*Key*: Previous layer outputs for all words

*Value*: Previous layer outputs for all words

<figure>
  <br>
  <img src="./assets/images/self_attn_eq.jpeg" style="width:80%">
  <figcaption style="text-align:center">Attention calculation (attention score, normalization & weighted sum)</figcaption>
</figure>

**Tip**: The attention mechanism passes the above inputs through 3 separate single-layer FFNNs with linear activation functions to obtain the query, key, and value.

Why: 
1. The query, key, and value can have different widths than the original input vector, so the output vector can also have a different width (a requirement of sequence-to-sequence transformation)
2. The key can differ from the value (a more general form of attention)

<figure>
  <br>
  <img src="./assets/images/self_attn_proj.jpeg" style="width:80%">
  <figcaption style="text-align:center">Attention mechanism with projection layers that modify the dimensions of the query, key, and value</figcaption>
</figure>
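
Self-attention with the three projection layers can be sketched as follows. The scaling of the scores by the square root of the key width follows the original Transformer formulation and is an assumption here, since the equation in the figure is not reproduced in the text; the sizes and names are hypothetical.

```python
import numpy as np


def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(X, W_q, W_k, W_v):
    """X holds the previous layer's outputs, one row per word position."""
    Q = X @ W_q  # queries: one per word position
    K = X @ W_k  # keys: all word positions
    V = X @ W_v  # values: all word positions
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # attention scores (scaling is an assumption)
    weights = softmax_rows(scores)           # normalization, one row per query
    return weights @ V, weights              # weighted sum of the values


rng = np.random.default_rng(0)
n_words, d_model, d_k, d_v = 5, 8, 4, 6
X = rng.normal(size=(n_words, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_v))

out, weights = self_attention(X, W_q, W_k, W_v)
```

The output has width `d_v` rather than `d_model`, which illustrates how the projections let the attention output differ in width from its input.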

### 2.4.2 Multi Head Attention
1. Multiple attention mechanisms (heads) operate in parallel on each input vector
2. The outputs of these heads are concatenated and fed to a projection layer (so that we get the desired dimension of the output vector)

<figure>
  <br>
  <img src="./assets/images/multi_head_attn.jpeg" style="width:80%">
  <figcaption style="text-align:center">Multi-head attention mechanism</figcaption>
</figure>
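
A minimal multi-head sketch, assuming two heads, scaled dot-product attention inside each head, and an output projection back to the model width. The names `heads` and `W_o` and all sizes are hypothetical.

```python
import numpy as np


def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def attention(Q, K, V):
    weights = softmax_rows(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V


def multi_head_self_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) tuples, one tuple per head."""
    # Each head works on the same input with its own projections.
    outputs = [attention(X @ W_q, X @ W_k, X @ W_v) for W_q, W_k, W_v in heads]
    # Concatenate the head outputs, then project to the desired width.
    return np.concatenate(outputs, axis=-1) @ W_o


rng = np.random.default_rng(0)
n_words, d_model, n_heads = 5, 8, 2
d_head = d_model // n_heads
X = rng.normal(size=(n_words, d_model))
heads = [
    tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
    for _ in range(n_heads)
]
W_o = rng.normal(size=(n_heads * d_head, d_model))

out = multi_head_self_attention(X, heads, W_o)
```

The list comprehension evaluates the heads one after another only for readability; since no head reads another head's output, they could be computed at the same time.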

### 2.4.3 Transformer
The transformer follows the encoder-decoder paradigm.

<figure>
  <br>
  <img src="./assets/images/transformer.jpeg" style="width:80%">
  <figcaption style="text-align:center">Transformer Modules</figcaption>
</figure>

#### 2.4.3.1 Transformer Internals

**Encoder**:
1. Embedding layer
2. 6 identical modules where each module consists of:
    1. Multi-head self-attention
    2. Skip connection & layer normalization
    3. FFNN (2 layers)
    4. Skip connection & layer normalization
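
One encoder module can be sketched as below. To keep it short, a single attention head stands in for multi-head self-attention, the FFNN uses a ReLU between its two layers, and layer normalization has no learned gain or bias; these are simplifying assumptions, and every name and size is hypothetical.

```python
import numpy as np


def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax_rows(Q @ K.T / np.sqrt(K.shape[-1])) @ V


def encoder_module(X, p):
    # 1-2. Self-attention, then skip connection and layer normalization.
    X = layer_norm(X + self_attention(X, p["W_q"], p["W_k"], p["W_v"]))
    # 3-4. Two-layer FFNN, then skip connection and layer normalization.
    hidden = np.maximum(0.0, X @ p["W_1"] + p["b_1"])
    return layer_norm(X + hidden @ p["W_2"] + p["b_2"])


def init_params(rng, d_model, d_ff):
    return {
        "W_q": rng.normal(size=(d_model, d_model)) * 0.1,
        "W_k": rng.normal(size=(d_model, d_model)) * 0.1,
        "W_v": rng.normal(size=(d_model, d_model)) * 0.1,
        "W_1": rng.normal(size=(d_model, d_ff)) * 0.1,
        "b_1": np.zeros(d_ff),
        "W_2": rng.normal(size=(d_ff, d_model)) * 0.1,
        "b_2": np.zeros(d_model),
    }


rng = np.random.default_rng(0)
n_words, d_model, d_ff = 5, 8, 32
X = rng.normal(size=(n_words, d_model))  # stands in for the embedding layer output

# Six modules with the same structure but separate parameters.
encoded = X
for params in [init_params(rng, d_model, d_ff) for _ in range(6)]:
    encoded = encoder_module(encoded, params)
```

The skip connections require each sublayer to return the same width it receives, which is why every module maps `d_model` back to `d_model`.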
    
**Decoder (Autoregressive)**:
1. Multi-head self-attention (with masking to prevent attending to future words)
2. Skip connection & layer normalization
3. Multi-head traditional attention (decoder attention over the encoder outputs)
4. Skip connection & layer normalization
5. FFNN (2 layers)
6. Skip connection & layer normalization
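
The masking in the decoder's self-attention can be sketched as follows. Setting the scores of future positions to negative infinity before the softmax is one common way to implement the mask and is an assumption here; the function name and sizes are hypothetical.

```python
import numpy as np


def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def masked_self_attention(Q, K, V):
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # True above the diagonal: position i must not attend to positions j > i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax_rows(scores)  # masked positions receive weight 0
    return weights @ V, weights


rng = np.random.default_rng(0)
n_words, d = 4, 6
Q = rng.normal(size=(n_words, d))
K = rng.normal(size=(n_words, d))
V = rng.normal(size=(n_words, d))

out, weights = masked_self_attention(Q, K, V)
```

The first word can attend only to itself, so the first row of `weights` puts all its weight on position 0.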

The full transformer architecture is shown below:

<figure>
  <br>
  <img src="./assets/images/full_transformer.jpeg" style="width:80%">
  <figcaption style="text-align:center">Full transformer architecture</figcaption>
</figure>

**Positional Encoding**:

1. How to take word order into consideration: positional encoding
2. What is the dimension of the positional encoding: same as the input embedding

<figure>
  <br>
  <img src="./assets/images/pos_enc.jpeg" style="width:80%">
  <figcaption style="text-align:center">Positional encoding</figcaption>
</figure>

3. How to calculate the positional encoding
    1. Sine & cosine functions of different frequencies
    2. Learned positional encodings
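
The sine and cosine option can be sketched as below, using the formulation from the original Transformer paper (base 10000, sine on even dimensions, cosine on odd dimensions). The function name is hypothetical, and the sketch assumes an even `d_model`.

```python
import numpy as np


def sinusoidal_positional_encoding(n_positions, d_model):
    """Return an (n_positions, d_model) matrix; d_model is assumed to be even."""
    positions = np.arange(n_positions)[:, None]    # (n_positions, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    # Each pair of dimensions uses a different frequency.
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe


pe = sinusoidal_positional_encoding(n_positions=10, d_model=8)

# Same width as the embedding, so the two can be added element-wise.
rng = np.random.default_rng(0)
word_embeddings = rng.normal(size=(10, 8))  # made-up embeddings
encoder_input = word_embeddings + pe
```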

### 2.4.4 How parallelism is achieved
1. There is no dependency between word positions within a single layer
2. Within a particular head, all instances of the embedding, query, key, value, self-attention, and fully connected layers share weights across word positions
3. Heads are independent of each other
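
A small check of the first point: computing attention position by position in a loop gives the same result as a single matrix computation over all positions. The scaled dot-product form and all names are assumptions made for this sketch.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def attention_loop(Q, K, V):
    """Serial version: one query (word position) at a time."""
    d = K.shape[-1]
    rows = []
    for q in Q:
        weights = softmax(K @ q / np.sqrt(d))
        rows.append(weights @ V)
    return np.stack(rows)


def attention_matrix(Q, K, V):
    """Parallel version: all word positions in one matrix product."""
    d = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V


rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))

same = np.allclose(attention_loop(Q, K, V), attention_matrix(Q, K, V))
```

No iteration of the loop reads the result of another iteration, so the iterations can be batched into matrix products. A recurrent layer cannot be rewritten this way because each timestep needs the previous hidden state.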