<a href="https://colab.research.google.com/github/vikramkrishnan9885/MyColab/blob/master/MoreNMT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Architecture

Neural machine translation (NMT) can be seen as an encoder-decoder architecture. The encoder converts a sentence in the source language into a "thought", and the decoder translates that thought into the target language.

## Embedding layer

Word embeddings have a clear benefit over one-hot-encoded representations of words, especially when the vocabulary is large. We use two word embedding layers: $Emb_S$ for the source language and $Emb_T$ for the target language. So, instead of feeding $x_t$ directly into the LSTM, we feed in $Emb(x_t)$. However, to avoid cluttering the notation, we will assume $x_t=Emb(x_t)$.
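
As a rough illustration of the lookup, the sketch below builds two separate embedding tables and maps each word of a sentence to its vector. The toy vocabularies, the embedding size of 4, and the names `make_embedding` and `embed` are assumptions made for this example; in a real model the vectors would be learned during training rather than left at random values.

```python
import random

def make_embedding(vocab, dim, seed=0):
    """Return a dict mapping each word to a randomly initialised vector."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for word in vocab}

# Hypothetical toy vocabularies, one per language.
source_vocab = ["<s>", "</s>", "Ich", "ging", "nach", "Hause"]
target_vocab = ["<s>", "</s>", "I", "went", "home"]

# Two separate tables: one for the source language, one for the target.
emb_s = make_embedding(source_vocab, dim=4, seed=0)
emb_t = make_embedding(target_vocab, dim=4, seed=1)

def embed(sentence, table):
    """Replace every word x_t with its embedding Emb(x_t)."""
    return [table[word] for word in sentence]

embedded = embed(["<s>", "Ich", "ging", "nach", "Hause", "</s>"], emb_s)
print(len(embedded), len(embedded[0]))  # number of words, embedding size
```

Each word costs only 4 numbers here, whereas a one-hot vector would need one entry per vocabulary word.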

## The encoder
As mentioned earlier, the encoder is responsible for generating a thought vector, or context vector, that captures the meaning of the source sentence.

## The context vector
The idea of the context vector (v) is to represent a sentence of the source language concisely. In contrast to the encoder's states, which are initialized with zeros, the context vector becomes the initial state of the decoder LSTM. In other words, the decoder LSTM does not start from a state of zeros, but from the context vector.

## The decoder
The decoder is responsible for decoding the context vector into the desired translation. Our decoder is an LSTM network as well. Though it is possible for the encoder and decoder to share the same set of weights, it is usually better to use two different networks for the encoder and the decoder. This increases the number of parameters in our model, allowing us to learn the translations more effectively.
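
The hand-off between the two networks can be sketched with a small pure-Python LSTM cell. This is an illustrative toy, not the model trained in this notebook: the dimensions, the random inputs, and the names `init_lstm`, `lstm_step` and `encode` are assumptions, and the final `(h, c)` pair of the encoder is treated as the context vector.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_lstm(input_dim, hidden_dim, rng):
    """One weight matrix and bias per gate: input (i), forget (f), output (o), candidate (g)."""
    def matrix():
        return [[rng.uniform(-0.1, 0.1) for _ in range(input_dim + hidden_dim)]
                for _ in range(hidden_dim)]
    return {gate: (matrix(), [0.0] * hidden_dim) for gate in "ifog"}

def lstm_step(params, x, h, c):
    """Advance the LSTM by one time step and return the new (h, c)."""
    xh = x + h  # concatenate the input with the previous hidden state
    def affine(gate):
        W, b = params[gate]
        return [sum(w * v for w, v in zip(row, xh)) + b_j for row, b_j in zip(W, b)]
    i = [sigmoid(z) for z in affine("i")]
    f = [sigmoid(z) for z in affine("f")]
    o = [sigmoid(z) for z in affine("o")]
    g = [math.tanh(z) for z in affine("g")]
    c_new = [f_j * c_j + i_j * g_j for f_j, c_j, i_j, g_j in zip(f, c, i, g)]
    h_new = [o_j * math.tanh(c_j) for o_j, c_j in zip(o, c_new)]
    return h_new, c_new

def encode(params, inputs, hidden_dim):
    """Run the encoder from an all-zero state; its final state is the context vector."""
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    for x in inputs:
        h, c = lstm_step(params, x, h, c)
    return h, c

rng = random.Random(0)
input_dim, hidden_dim = 3, 4
encoder = init_lstm(input_dim, hidden_dim, rng)
decoder = init_lstm(input_dim, hidden_dim, rng)  # separate weights for the decoder

# Stand-in for five embedded source words.
source_inputs = [[rng.uniform(-1, 1) for _ in range(input_dim)] for _ in range(5)]
context_h, context_c = encode(encoder, source_inputs, hidden_dim)

# The decoder's first step starts from the context vector, not from zeros.
first_target_input = [0.5, -0.2, 0.1]
dec_h, dec_c = lstm_step(decoder, first_target_input, context_h, context_c)
```

Keeping `encoder` and `decoder` as two separate parameter sets mirrors the choice described above of not sharing weights between the two networks.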


# Data prep

## Training data

### Problem statement

The training data consists of pairs of source sentences and corresponding translations
to the target language. An example might look like this:
* ( Ich ging nach Hause , I went home)
* ( Sie hat in der Schule gewartet , She was waiting at school)


We have N such pairs in our dataset. To build a reasonably good translator, N needs to be on the order of millions. A training set of that size also implies longer training times.

### Special Tokens
We will introduce two special tokens: \<s\> and \<\/s\>. The \<s\> token represents
the start of a sentence, whereas \<\/s\> represents the end of a sentence. Now, the data would look like this:
* (\<s\> Ich ging nach Hause \<\/s\> , \<s\> I went home \<\/s\>)
* (\<s\> Sie hat in der Schule gewartet \<\/s\> , \<s\> She was waiting at school \<\/s\>)
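
A minimal sketch of this step, assuming the sentence pairs are held as plain Python strings (the helper name `add_special_tokens` is made up for this example):

```python
def add_special_tokens(sentence, start="<s>", end="</s>"):
    """Wrap a sentence in start-of-sentence and end-of-sentence tokens."""
    return f"{start} {sentence} {end}"

pairs = [
    ("Ich ging nach Hause", "I went home"),
    ("Sie hat in der Schule gewartet", "She was waiting at school"),
]

marked = [(add_special_tokens(src), add_special_tokens(tgt)) for src, tgt in pairs]
for src, tgt in marked:
    print(src, ",", tgt)
```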

### Padding
Thereafter, we will pad the sentences with the \<\/s\> tokens such that the source
sentences are of a fixed length L and the target sentences are of a fixed length M.

It should be noted that L and M do not need to be equal. This step results in
the following:
* (\<s\> Ich ging nach Hause \<\/s\> \<\/s\> \<\/s\> , \<s\> I went home \<\/s\> \<\/s\> \<\/s\>)
* (\<s\> Sie hat in der Schule gewartet \<\/s\> , \<s\> She was waiting at school \<\/s\>)
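
The padding step might be sketched as follows. The values L = 8 and M = 7 are chosen only because they reproduce the example above, and the helper name `pad` is an assumption.

```python
def pad(tokens, length, pad_token="</s>"):
    """Append pad_token until the list has `length` tokens (no-op if already long enough)."""
    return tokens + [pad_token] * max(0, length - len(tokens))

L, M = 8, 7  # fixed source and target lengths; they need not be equal

source = "<s> Ich ging nach Hause </s>".split()
target = "<s> I went home </s>".split()

padded_source = pad(source, L)
padded_target = pad(target, M)

print(" ".join(padded_source))  # <s> Ich ging nach Hause </s> </s> </s>
print(" ".join(padded_target))  # <s> I went home </s> </s> </s>
```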

### Truncation

If a source sentence is longer than L, or a target sentence is longer than M, it is truncated to fit. The sentences are then passed through a tokenizer to split them into individual tokens. Only the first pair of sentences is shown here, as the second is processed in the same way:

(['\<s\>' , 'Ich' , 'ging' , 'nach' , 'Hause' , '\<\/s\>' , '\<\/s\>' , '\<\/s\>'] , ['\<s\>' , 'I' , 'went' ,
'home' , '\<\/s\>' , '\<\/s\>' , '\<\/s\>'])
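
One possible way to tokenize and truncate is sketched below. It assumes a plain whitespace tokenizer, and it keeps an end-of-sentence token in the last position after truncating; that last choice is made only for this sketch and is not stated above.

```python
def tokenize(sentence):
    """Split on whitespace; a real tokenizer would also handle punctuation."""
    return sentence.split()

def truncate(tokens, length, end="</s>"):
    """Cut the list down to `length` tokens, keeping `end` as the final token."""
    if len(tokens) <= length:
        return tokens
    return tokens[:length - 1] + [end]

tokens = tokenize("<s> Sie hat in der Schule gewartet </s>")
short = truncate(tokens, 5)
print(short)  # ['<s>', 'Sie', 'hat', 'in', '</s>']
```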

### Batch processing
It should be noted that bringing sentences to a fixed length is not essential, as LSTMs are capable of handling dynamic sequence sizes. However, bringing them to a fixed length helps us to process sentences as batches instead of processing them one by one.
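
To see why a fixed length helps, the sketch below maps tokens to integer IDs and stacks two padded source sentences into one rectangular batch. The helper names `build_vocab` and `to_batch` and the ID assignment (first come, first served) are assumptions for illustration.

```python
def build_vocab(sentences):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tokens in sentences:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_batch(sentences, vocab):
    """Convert token lists to ID lists; all rows must share one length."""
    batch = [[vocab[token] for token in tokens] for tokens in sentences]
    if len({len(row) for row in batch}) != 1:
        raise ValueError("sentences must be padded to the same length first")
    return batch

sources = [
    ["<s>", "Ich", "ging", "nach", "Hause", "</s>", "</s>", "</s>"],
    ["<s>", "Sie", "hat", "in", "der", "Schule", "gewartet", "</s>"],
]

vocab = build_vocab(sources)
batch = to_batch(sources, vocab)
for row in batch:
    print(row)
# [0, 1, 2, 3, 4, 5, 5, 5]
# [0, 6, 7, 8, 9, 10, 11, 5]
```

Because every row has the same length, the batch can be handed to the model as a single 2 x 8 array instead of two separate sequences.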

### Reverse

Next, we perform a special trick on the source sentences. Say we have the sentence ABC in the source language, which we want to translate to $\alpha \beta \gamma \delta$ in the target language. We first reverse the source sentence, so that ABC is read as CBA. This means that in order to translate ABC to $\alpha \beta \gamma \delta$, we feed in CBA. After reversal, the first words of the source sit closest to the first words of the target, which shortens the dependencies the model has to learn at the start of the translation. This improves the performance of our model significantly, especially when the source and target languages share the same sentence structure (for example, subject-verb-object).
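
In code, the trick amounts to reversing the source token list while leaving the target untouched. The text above does not say whether the reversal happens before or after special tokens and padding are added, so this sketch simply works on plain word lists; the name `reverse_source` is an assumption.

```python
def reverse_source(pair):
    """Reverse the source tokens of a (source, target) pair; keep the target as is."""
    source, target = pair
    return source[::-1], target

pair = (["A", "B", "C"], ["alpha", "beta", "gamma", "delta"])
reversed_pair = reverse_source(pair)
print(reversed_pair)  # (['C', 'B', 'A'], ['alpha', 'beta', 'gamma', 'delta'])
```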

## Testing data
At test time, we only have the source sentence, not the target sentence.
We prepare the source data as we did for the training phase. We then obtain
the translated output word by word, feeding the decoder's last predicted word back in as its
next input. The prediction process is triggered by feeding an \<s\> token to the decoder.
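
The word-by-word loop can be sketched as below. The function `toy_predict_next` is a hard-coded stand-in for a trained decoder step, and stopping at the \<\/s\> token or after `max_len` words is an assumption of this sketch rather than something specified above.

```python
def greedy_decode(predict_next, context, max_len=10, start="<s>", end="</s>"):
    """Generate a translation by feeding each predicted word back into the decoder."""
    output = []
    previous_word = start   # decoding is triggered by the start token
    state = context         # the decoder starts from the context vector
    for _ in range(max_len):
        word, state = predict_next(previous_word, state)
        if word == end:
            break
        output.append(word)
        previous_word = word
    return output

def toy_predict_next(previous_word, state):
    """Hypothetical decoder step: a fixed lookup instead of a trained LSTM."""
    table = {"<s>": "I", "I": "went", "went": "home", "home": "</s>"}
    return table[previous_word], state

print(greedy_decode(toy_predict_next, context=None))  # ['I', 'went', 'home']
```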