# Language Translation with Word Level seq2seq DL Models
The objective is to translate a Spanish sentence (a sequence of words) into English using a Neural Machine Translation (NMT) system based on word-level encoder-decoder models.

We will use the __Spanish-English__ sentence pairs from http://www.manythings.org/anki/

Sequence-to-Sequence (seq2seq) models are used for a variety of NLP tasks, such as text summarization, speech recognition, language translation, and text-to-speech. Our aim is to translate Spanish sentences to English.

Here, both the input and the output are sentences. In other words, a sequence of words goes into our model and another sequence of words comes out.
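
As a rough illustration of what "a sequence of words" means for the model, the sketch below splits sentences into words and maps each word to an integer id. The helper names (`build_vocab`, `encode`) and the toy sentences are made up for this example and are not part of the notebook. The Keras tokenizer used later orders ids by word frequency, whereas this toy version simply uses order of first appearance.

```python
def build_vocab(sentences):
    """Assign each distinct word an integer id, starting at 1 (0 is kept for padding)."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

def encode(sentence, vocab):
    """Turn a sentence into a list of word ids, skipping words not in the vocabulary."""
    return [vocab[w] for w in sentence.lower().split() if w in vocab]

# Toy data, purely illustrative
toy_sentences = ["tom is here", "is tom happy"]
toy_vocab = build_vocab(toy_sentences)
encoded = encode("tom is happy", toy_vocab)

print(toy_vocab)
print(encoded)
```

Each sentence becomes a list of integers, which is the form the embedding layer of the encoder expects.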

<img src="https://i.imgur.com/Uk1tCPo.png">

A typical seq2seq model is also known as an encoder-decoder model and has 2 major components:

+ The encoder
+ The decoder

Both parts are essentially separate sequential models, such as RNNs/LSTMs, which are combined into a single network.

<img src="https://i.imgur.com/bT6PAtv.png">

## Download the Dataset

In [None]:
!wget http://www.manythings.org/anki/spa-eng.zip
!unzip spa-eng.zip
!ls -l

## Load the Dataset

In [None]:
import numpy as np

FILE_NAME = './spa.txt'
with open(FILE_NAME, mode='rt', encoding='utf-8') as infile:
  data = infile.read()
  sentences = data.strip().split('\n')
  sentences = [item.split('\t') for item in sentences]
  eng_spa = np.array(sentences)

eng_spa = eng_spa[:50000]
eng_spa

In [None]:
spa = eng_spa[:,1]
eng = eng_spa[:,0]
eng, spa

## Basic Preprocessing

In [None]:
import string

# Remove punctuation and lowercase
# Note: string.punctuation is ASCII-only, so Spanish '¿' and '¡' are left in place
punct_table = str.maketrans('', '', string.punctuation)
eng = np.array([s.translate(punct_table).lower() for s in eng])
spa = np.array([s.translate(punct_table).lower() for s in spa])
eng, spa

## Analyze Sentences

### **Question 1**: Get the length of every sentence, and from it the maximum sentence length, for both __English__ and __Spanish__ (2 points)


In [None]:
# Add your code here
eng_len = <YOUR CODE HERE>
spa_len = <YOUR CODE HERE>

In [None]:
import pandas as pd
%matplotlib inline

length_df = pd.DataFrame({'eng':eng_len, 'spa':spa_len})
length_df.hist(bins=30)

## Transform Text to Sequence of Numbers

### **Question 2**: Use **``tensorflow.keras.preprocessing.text.Tokenizer``** to tokenize the text and build a word-to-number mapping (2 points)


In [None]:
from tensorflow import keras

# Add your code here
eng_tokenizer = <YOUR CODE HERE>
eng_tokenizer.<YOUR CODE HERE>
eng_vocab_size = <YOUR CODE HERE>
eng_length = <YOUR CODE HERE>

In [None]:
print('Max eng text length:', eng_length)
print('English Vocabulary Size:', eng_vocab_size)

In [None]:
spa_tokenizer = keras.preprocessing.text.Tokenizer()
spa_tokenizer.fit_on_texts(spa)
spa_vocab_size = len(spa_tokenizer.word_index) + 1
spa_length = max(spa_len)

In [None]:
print('Max spa text length:', spa_length)
print('Spanish Vocabulary Size:', spa_vocab_size)

## Prepare Train Test Splits

In [None]:
from sklearn.model_selection import train_test_split

train_spa, test_spa, train_eng, test_eng = train_test_split(spa, eng, test_size=0.2, random_state = 42)
train_spa.shape, test_spa.shape, train_eng.shape, test_eng.shape

## Normalize Sequence Lengths
We normalize sentence lengths by fixing a maximum length. Longer sentences/sequences are truncated to that length, while shorter ones are padded up to it.
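
To make the padding and truncation rule concrete, here is a minimal pure-Python sketch on toy id sequences. The helper `pad_or_truncate` is hypothetical and only mimics the idea. It pads and cuts at the end of the sequence, which corresponds to calling Keras `pad_sequences` with `padding='post'` and `truncating='post'`. By default, Keras pads and truncates at the front (`'pre'`).

```python
def pad_or_truncate(seq, max_len, pad_value=0):
    """Return seq cut to max_len items, or right-padded with pad_value up to max_len."""
    trimmed = list(seq[:max_len])
    return trimmed + [pad_value] * (max_len - len(trimmed))

# Toy id sequences of different lengths, purely illustrative
toy_seqs = [[5, 2], [7, 1, 4, 9, 3], [8, 6, 2]]
normalized = [pad_or_truncate(s, max_len=3) for s in toy_seqs]

print(normalized)
```

After this step every sequence has the same length, so the whole set can be stacked into one rectangular array.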

### **Question 3**: Normalize sentences by defining the max length. **Pad** shorter ones and truncate the longer ones. (2 points)

_Hint: use ``pad_sequences``_


In [None]:
# Add your code here
<YOUR CODE HERE>

train_spa_enc.shape, train_eng_enc.shape

In [None]:
# Add your code here
<YOUR CODE HERE>

test_spa_enc.shape, test_eng_enc.shape

## Build the Model

<img src="https://i.imgur.com/3ZVi97s.png">

+ For the encoder, we will use an embedding layer and an LSTM layer
+ For the decoder, we will use another LSTM layer followed by a dense layer
+ `RepeatVector` repeats the encoder's output once per output timestep, so that every step of the decoder LSTM receives it
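
The role of the repeat step can be sketched without any deep learning library. The function `repeat_vector` below is a hypothetical stand-in written for this example: it copies a single feature vector a fixed number of times, which is the same shape change Keras `RepeatVector(n)` performs per sample, from `(features,)` to `(n, features)`. The numbers are toy values, not real encoder outputs.

```python
def repeat_vector(vector, n_steps):
    """Copy one feature vector n_steps times, giving a (n_steps, features) sequence."""
    return [list(vector) for _ in range(n_steps)]

encoder_output = [0.1, -0.4, 0.7]   # toy stand-in for the encoder LSTM's final output
decoder_input = repeat_vector(encoder_output, n_steps=4)

# One copy of the encoder summary per decoder timestep
print(len(decoder_input), len(decoder_input[0]))
```

In the full model, `n_steps` would be the maximum English sentence length, so the decoder produces one word distribution per output position.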

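We take advantage of GPU acceleration through the cuDNN-backed LSTM implementation. In TensorFlow 2, `keras.layers.LSTM` selects the cuDNN kernel automatically when a GPU is available and the layer keeps its default arguments.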

### **Question 4**: Define a **Seq2Seq** model with encoder and decoder components (2 points)

In [None]:
EMBEDDING_DIM = 512
LSTM_UNITS = 512

# Add your code here
<YOUR CODE HERE>

## Train the Model

In [None]:
history = model.fit(train_spa_enc, train_eng_enc.reshape(train_eng_enc.shape[0], train_eng_enc.shape[1], 1), 
          epochs=50, batch_size=256, 
          validation_split = 0.1,
          verbose=1)

In [None]:
model.save('my_nmt_model.h5')

## Visualize Training Progress

In [None]:
import matplotlib.pyplot as plt

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['train','validation'])

## Text Generator
We build a utility function that converts predicted index sequences back into text.

In [None]:
from tqdm import tqdm
pd.set_option('display.max_colwidth', 500)

eng_idx2word = {v:k for k, v in eng_tokenizer.word_index.items()}

def generate_prediction_texts(pred_seqs, idx2word_map):
  pred_texts = []

  # Iterate over the sequences passed in, not over a global variable
  for pred in tqdm(pred_seqs):
    temp = []
    for idx in range(len(pred)):
      w = idx2word_map.get(pred[idx], None)
      # Blank out padding/unknown indices and immediate repeats of the previous word
      if w is None or (idx > 0 and w == idx2word_map.get(pred[idx-1], None)):
        temp.append('')
      else:
        temp.append(w)

    pred_texts.append(' '.join(temp))
  return pred_texts
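
The function above expects sequences of word indices. Assuming the model emits one probability distribution over the vocabulary per timestep (an assumption, since the model definition is left as an exercise), the indices can be obtained by taking the most likely id at each step. The sketch below shows that idea in plain Python on toy numbers. The names `argmax`, `decode`, `toy_idx2word` and `toy_probs` are made up for this example, and unlike the notebook function it drops padding and repeats instead of replacing them with empty strings.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def decode(step_probs, idx2word):
    """Pick the most likely word id per timestep; drop unknown ids (such as padding 0) and immediate repeats."""
    words = []
    prev_id = None
    for probs in step_probs:
        word_id = argmax(probs)
        if word_id != prev_id and word_id in idx2word:
            words.append(idx2word[word_id])
        prev_id = word_id
    return ' '.join(words)

# Toy vocabulary and per-timestep probabilities, purely illustrative
toy_idx2word = {1: 'i', 2: 'am', 3: 'happy'}
toy_probs = [
    [0.1, 0.7, 0.1, 0.1],   # most likely id: 1
    [0.1, 0.1, 0.6, 0.2],   # most likely id: 2
    [0.0, 0.1, 0.7, 0.2],   # most likely id: 2 again (repeat)
    [0.1, 0.1, 0.2, 0.6],   # most likely id: 3
    [0.9, 0.0, 0.1, 0.0],   # most likely id: 0 (padding)
]
decoded = decode(toy_probs, toy_idx2word)
print(decoded)
```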

## Translation using seed from Training Data

In [None]:
train_spa_enc[:1000,:].shape

### **Question 5**: **Evaluate** model predictions using training samples only (3 points)


In [None]:
# Add your code here
<YOUR CODE HERE>

## Translation on Test Data

### **Question 6**: **Evaluate** model performance on test data (3 points)


In [None]:
# Add your code here
<YOUR CODE HERE>