# Shastra 2020

## Translation task (demo)

This is a neural machine translation (NMT) model that translates **French** text into **English**. It is a demo for the end product, which will translate from any language into multiple languages using more advanced machine translation models.

The architecture of the model is a simple encoder and decoder model:

Encoder:
 1. Embedding Layer for French language
 2. Two LSTM layers <br><br>
For demo purposes the model is kept simple; the actual implementation will have more layers and will use some optimization techniques.
The encoder takes French sentences as input and passes the hidden state at the last timestep, together with the outputs at all timesteps, to the decoder, which uses them with attention to produce the output language. A bidirectional LSTM will be used later for better encoding.
 
Decoder:
 1. Embedding Layer for English language
 2. Two LSTM layers 
 3. Attention layer
 4. Fully connected
 5. Softmax layer <br><br>
The decoder takes the last hidden state of the encoder and the encoder outputs at all timesteps. Using an [attention mechanism](https://arxiv.org/abs/1508.04025) and the first input word, it generates the next word, then feeds that word back in, again with attention, to generate each succeeding word until the sentence is translated. This architecture also uses **teacher forcing**: at each step a random number in \[0, 1) is compared with a teacher forcing ratio to decide whether the next decoder input is the decoder's own last output or the target word. Some research papers report that this improves training.
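
The teacher forcing decision can be sketched in a few lines of plain Python. This is only an illustration: the helper name `choose_next_input` and the toy word lists are assumptions made for this example, and only the comparison against the ratio mirrors what the notebook does.

```python
import random

def choose_next_input(predicted_token, target_token, teacher_force_ratio, rng=random):
    """Pick the token that is fed to the decoder at the next step (hypothetical helper)."""
    # Draw a number in [0, 1); if it falls below the ratio, feed the ground-truth token.
    use_teacher = rng.random() < teacher_force_ratio
    return target_token if use_teacher else predicted_token

rng = random.Random(0)                  # seeded so the run is repeatable
predicted = ["the", "cat", "sat"]       # toy decoder predictions
target = ["the", "dog", "sat"]          # toy ground-truth words
fed = [choose_next_input(p, t, 0.5, rng) for p, t in zip(predicted, target)]
print(fed)
```

With a ratio of 1.0 the decoder always sees the target word, and with 0.0 it always sees its own prediction.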

This model is only for testing whether the architecture works. The final model will also include other optimization techniques for better convergence and accuracy (click on any technique to see its resource):
1. [BatchNormalization](https://arxiv.org/abs/1502.03167)
2. [Variational DropOut instead of normal DropOut for LSTM layers](https://becominghuman.ai/learning-note-dropout-in-recurrent-networks-part-1-57a9c19a2307)
3. [DropConnect for LSTM layers (Variational in nature)](https://arxiv.org/abs/1801.06146)
4. [Differential learning rates for different layers](https://arxiv.org/abs/1801.06146)
5. [AWD-LSTMs (ASGD Weight-Dropped LSTMs) (tentative)](https://arxiv.org/abs/1708.02182)
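
As a rough illustration of differential learning rates, the sketch below computes one rate per layer by dividing a starting rate by a constant factor for each successive layer. The function name `layer_learning_rates` is an assumption made for this example; the default factor of 2.6 and the layer ordering follow the `apply_diff_lrs` function defined later in this notebook, and which layers should receive the larger rates remains a design choice.

```python
def layer_learning_rates(num_layers, starting_lr=0.01, dividing_fact=2.6):
    """Return one learning rate per layer: starting_lr / dividing_fact**i (hypothetical helper)."""
    return [starting_lr / dividing_fact ** i for i in range(num_layers)]

rates = layer_learning_rates(4, starting_lr=0.1)
for i, lr in enumerate(rates):
    print(f"layer {i}: lr = {lr:.5f}")
```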

**NOTE**: To prepare the dataset, download and extract the data from **[here](http://www.statmt.org/europarl/v7/fr-en.tgz)**, then use the **text_to_csv.py** file to convert all the data to CSV. After extracting, the path variables in the **text_to_csv.py** file have to be changed accordingly. Other datasets can be downloaded from [here](http://www.statmt.org/europarl/).
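
The contents of **text_to_csv.py** are not shown here, so the following is only a guess at the kind of conversion it performs: pairing line-aligned French and English sentences into two-column CSV rows (French first, English second, no header row, matching the field order used when loading the dataset below). The function name `write_parallel_csv` and the sample sentences are assumptions made for this example.

```python
import csv
import io

def write_parallel_csv(fr_lines, en_lines, out_file):
    """Write aligned (French, English) sentence pairs as CSV rows; returns the row count."""
    writer = csv.writer(out_file)
    rows = 0
    for fr, en in zip(fr_lines, en_lines):
        fr, en = fr.strip(), en.strip()
        if fr and en:                    # skip pairs where either side is empty
            writer.writerow([fr, en])
            rows += 1
    return rows

# Toy input standing in for the two extracted text files
buf = io.StringIO()
n = write_parallel_csv(["Bonjour le monde\n", "\n"], ["Hello world\n", "Empty source\n"], buf)
print(n, repr(buf.getvalue()))
```

With real data, the two line iterables would be the opened French and English files and `out_file` a file opened with `newline=''`.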

In [None]:
# install all the required packages with this 
!pip install -r requirements.txt

In [1]:
import numpy as np
import torch
import torch.nn as nn       # neural Networks module of pytorch for extending
import torch.optim as optim     # Optimizers
import torch.nn.functional as F
from torchsummary import summary
import spacy    # for French and English tokenization
from torchtext.data import Field, BucketIterator, TabularDataset    # For preprocessing and making batches
import dill      # for saving field of the datasets
import random    # for teacher forcing in the decoder part
import matplotlib.pyplot as plt   # for visualizing results

In [None]:
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
BATCH_SIZE = 8
EPOCHS = 5
TEACHER_FORCE_RATIO = 0.5
GRADIENT_CLIP = 0.25
MAX_LEN = 100

In [None]:
# Loading SpaCy tokenizers for French and English
fr_tokenize = spacy.load('fr', disable=['tagger', 'ner', 'parser'])
en_tokenize = spacy.load('en', disable=['tagger', 'ner', 'parser'])
de_tokenize = spacy.load('de', disable=['tagger', 'ner', 'parser'])

fr_field = Field(
    tokenize='spacy', 
    tokenizer_language='fr', 
    pad_first=True,
    lower=True
)
en_field = Field(
    tokenize='spacy', 
    tokenizer_language='en', 
    lower=True, 
    init_token='<sos>', 
    eos_token='<eos>'
)
de_field = Field(
    tokenize='spacy', 
    tokenizer_language='de', 
    lower=True, 
    init_token='<sos>', 
    eos_token='<eos>'
)

The cell below can be used to load previously saved fields, once they have been saved with the saving code given further down.

In [None]:
# with open("fr_field.Field", "rb") as f:
#     fr_field = dill.load(f)

# with open("en_field.Field", "rb") as f:
#     en_field = dill.load(f)

Prepare torchtext datasets for the **train** and **validation** sets using the [**TabularDataset**](https://torchtext.readthedocs.io/en/latest/data.html#tabulardataset) class (this might take some time to load). Before running the code, set the **path** argument to the directory in which the CSV files are stored, and set the **train** and **validation** arguments to the file names of the train and validation sets.

In [None]:
train_set, val_set = TabularDataset.splits(
    path='../data/csv_format/', 
    train='train_set.csv', 
    validation='val_set.csv',
    format='CSV', 
    fields=[('French', fr_field), ('English', en_field)]
)

Build vocabularies for French and English using the fields from the above dataset, keeping only words that occur at least 5 times across the train and validation datasets.

In [None]:
# Extracting vocabularies for French and English
fr_field.build_vocab(train_set, val_set, min_freq=5)
en_field.build_vocab(train_set, val_set, min_freq=5)

print("Example from French vocabulary:\n", list(fr_field.vocab.freqs.keys())[1:100])
print("Examples from English vocabulary:\n", list(en_field.vocab.freqs.keys())[1:100])
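
To make the effect of `min_freq` concrete, here is a small sketch of frequency-thresholded vocabulary building. It is not torchtext's implementation; the function `build_vocab`, its `specials` default and the toy sentences are assumptions made for this example. In the sketch, `freqs` keeps the count of every token, while `itos` holds only the special tokens plus the words that pass the threshold.

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_freq=5, specials=("<unk>", "<pad>")):
    """Return (itos, stoi, freqs), keeping only tokens seen at least min_freq times."""
    freqs = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = list(specials)
    # Most frequent tokens first; ties are broken alphabetically
    for tok, count in sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_freq:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi, freqs

sentences = [["le", "chat"]] * 5 + [["le", "chien"]]
itos, stoi, freqs = build_vocab(sentences, min_freq=5)
print(itos)                      # tokens kept in the vocabulary
print(len(freqs), len(itos))     # distinct tokens seen vs. vocabulary size
```

Words below the threshold, such as "chien" here, would be mapped to `<unk>` at lookup time.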

In [None]:
# Sequence bucketing based on size of English sentences
train_iterator, val_iterator = BucketIterator.splits(
    (train_set, val_set), 
    batch_size=BATCH_SIZE, 
    sort_key=lambda x: len(x.English), 
    shuffle=True,
)

batch = next(iter(train_iterator))
print(batch.French)
print(batch.English)
print(len(train_iterator))
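
The idea behind sequence bucketing is to put sentences of similar length into the same batch so that little padding is needed. The sketch below is a simplified illustration of that idea and not what `BucketIterator` does internally; the function `bucket_batches` and the toy data are assumptions made for this example.

```python
import random

def bucket_batches(examples, batch_size, sort_key=len, seed=0):
    """Group examples of similar length into batches, then shuffle the batch order."""
    ordered = sorted(examples, key=sort_key)
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    random.Random(seed).shuffle(batches)    # shuffle whole batches, not single examples
    return batches

# Toy "sentences" with lengths 1..8 in mixed order
examples = [["w"] * n for n in [5, 1, 8, 2, 7, 3, 6, 4]]
batches = bucket_batches(examples, batch_size=2)
for b in batches:
    print([len(sentence) for sentence in b])
```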

In [2]:
# Encoder for the input language
class Encoder(nn.Module):

    def __init__(self, vocab_sz, embedding_sz, output_sz, batch_sz, max_len, num_lstm_layers=2):
        super().__init__()
        self.batch_sz = batch_sz
        self.max_len = max_len
        self.hidden_sz = output_sz
        self.embedding_layer = nn.Embedding(vocab_sz, embedding_sz)
        self.lstm = nn.LSTM(embedding_sz, output_sz, num_lstm_layers)

    def forward(self, batch):
        embeddings = self.embedding_layer(batch)
        outputs, hidden = self.lstm(embeddings)
        if batch.shape[0]-self.max_len < 0: 
            return (torch.cat(
                (torch.zeros(self.max_len-batch.shape[0], *outputs.shape[1:]).to(DEVICE), outputs), dim=0), 
                hidden)
        else: return outputs[batch.shape[0]-self.max_len:], hidden
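
The encoder above always returns exactly `max_len` time steps, because the decoder's attention layer has a fixed number of outputs. Short sequences are padded with zeros at the front and long ones keep only their last `max_len` steps. The same logic on plain Python lists looks roughly like this (the helper name `fit_to_max_len` is an assumption made for this example):

```python
def fit_to_max_len(outputs, max_len, pad_value=0.0):
    """Left-pad or left-truncate a list of per-timestep vectors to exactly max_len steps."""
    seq_len = len(outputs)
    if seq_len < max_len:
        width = len(outputs[0]) if outputs else 0
        padding = [[pad_value] * width for _ in range(max_len - seq_len)]
        return padding + outputs             # zeros go in front of the real outputs
    return outputs[seq_len - max_len:]       # keep only the last max_len steps

print(fit_to_max_len([[1.0], [2.0]], 4))
print(fit_to_max_len([[1.0], [2.0], [3.0]], 2))
```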

In [3]:
# Decoder with attention
class AttentionDecoder(nn.Module):
    def __init__(self, vocab_sz, embedding_sz, hidden_sz, max_length, num_lstm_layers=2):
        super().__init__()
        self.embedding_layer = nn.Embedding(vocab_sz, embedding_sz)
        self.attention_layer = nn.Linear(hidden_sz*num_lstm_layers+embedding_sz, max_length)
        self.lstm = nn.LSTM(hidden_sz, hidden_sz, num_lstm_layers)
        self.linear = nn.Linear(hidden_sz, vocab_sz)
        self.bn = nn.BatchNorm1d(vocab_sz)
        self.d = nn.Dropout(p=0.4)

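    def forward(self, x, hidden, encoder_outputs):
        embeddings = self.embedding_layer(x)
        # Concatenate the hidden states of all LSTM layers: (batch, hidden_sz*num_lstm_layers)
        hidden_cat = torch.cat(hidden[0].split(1, dim=0), dim=-1).squeeze(0)
        # One attention weight per encoder position: (batch, max_length)
        attention_weights = F.softmax(
            self.attention_layer(torch.cat((embeddings, hidden_cat), 1)), 
            1
        )
        # Weighted sum of the encoder outputs: (batch, 1, hidden_sz)
        attention_output = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs.permute(1, 0, 2))
        # Pass the previous hidden state so the decoder keeps its context across steps
        output, hidden = self.lstm(attention_output.permute(1, 0, 2), hidden)
        output = self.linear(output[0])
        output = self.bn(output)
        return self.d(output), hidden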
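
The core of the attention step in the decoder is a softmax over one score per encoder position, followed by a weighted sum of the encoder outputs. The plain-Python sketch below shows that computation for a single sentence; the function names `softmax` and `attend` and the toy numbers are assumptions made for this example, and the scores would come from the learned attention layer in the real model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, encoder_outputs):
    """Return (weights, context): context is the weighted sum of the encoder output vectors."""
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    context = [sum(w * vec[d] for w, vec in zip(weights, encoder_outputs)) for d in range(dim)]
    return weights, context

# Two encoder positions with equal scores: the context is the plain average
weights, context = attend([0.0, 0.0], [[1.0, 0.0], [3.0, 2.0]])
print(weights, context)
```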

In [None]:
class Seq2seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        
    def forward(self, batch):
        X = batch.French
        Y = batch.English

        # Skip incomplete batches; the caller should ignore a None loss
        if X.shape[1] != BATCH_SIZE:
            return None

        # Forward pass in Encoder part of model
        encoder_outputs, hidden = self.encoder(X.to(DEVICE))

        # Initialize the loss for this batch
        loss = 0.0

        # First input to the decoder, i.e., the "<sos>" token
        dec_input = (Y[0, :]).to(DEVICE)

        # Pass the output of the decoder to itself as the next input, or
        # sometimes the target output instead, based on teacher forcing
        for target in Y[1:, :]:
            output, hidden = self.decoder(dec_input, hidden, encoder_outputs)
            loss += loss_criterion(output, target.to(DEVICE))
            dec_input = output.max(dim=1)[1]
            teacher_force = random.random() < TEACHER_FORCE_RATIO
            dec_input = target.to(DEVICE) if teacher_force else dec_input
        
        return loss

In [None]:
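def apply_diff_lrs(model, opt_type=optim.SGD, starting_lr=0.01, dividing_fact=2.6):
    # One optimizer per trainable layer; each successive layer gets a
    # learning rate divided by `dividing_fact` (differential learning rates)
    trainable_modules = []
    optimizers = []
    i = 0
    for module in model.modules():
        # Skip container modules and layers without trainable parameters
        if isinstance(module, (Seq2seq, Encoder, AttentionDecoder, nn.Dropout)):
            continue
        trainable_modules.append(module)
        if isinstance(module, nn.BatchNorm1d) and optimizers:
            # Train BatchNorm in the same optimizer as the layer before it
            params = list(trainable_modules[-2].parameters()) + list(module.parameters())
            optimizers[-1] = opt_type(params, lr=starting_lr/(dividing_fact**i))
            continue
        optimizers.append(opt_type(module.parameters(), lr=starting_lr/(dividing_fact**i)))
        i += 1

    return optimizers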

In [None]:
# Initialize encoder and decoder for the model
encoder = Encoder(len(fr_field.vocab), 256, 512, BATCH_SIZE, 70).to(DEVICE)
decoder = AttentionDecoder(len(en_field.vocab), 256, 512, 70).to(DEVICE)
model = Seq2seq(encoder, decoder).to(DEVICE)

# The targets are English, so ignore the English padding index in the loss
pad_idx = en_field.vocab.stoi['<pad>']
loss_criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
optimizers = apply_diff_lrs(model, optim.SGD, 0.1)

In [5]:
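# Standalone instances with a fixed vocabulary size of 10000 (no dataset needed).
# Note: this rebinds the names `encoder` and `decoder` used in the cell above.
encoder = Encoder(10000, 256, 512, 8, 70)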
decoder = AttentionDecoder(10000, 256, 512, 70)

The next cell can be used to load models back if you saved any after training with the saving code given below.

In [None]:
# encoder.load_state_dict(torch.load("enc_model.pt"))
# decoder.load_state_dict(torch.load("dec_model.pt"))

# Training

First, **batch_sz** sentences are fed to the **encoder**, and its output is given to the **decoder** as the initial hidden state.

In [None]:
# Train function
def train(model, train_iterator=train_iterator, optimizers=optimizers, EPOCHS=EPOCHS, DEVICE=DEVICE):
    model.train()  # enable dropout and batch-norm statistics updates
    epoch_losses = []
    for epoch in range(EPOCHS):
        iteration_losses = []
        for iter_n, batch in enumerate(train_iterator):
            
            # Setting all the gradients to zero
            for optimizer in optimizers:
                optimizer.zero_grad()
            
            # Forward pass from the sequence to sequence model
            loss = model(batch)
            if loss is None:  # no loss returned (e.g. an incomplete batch), so skip it
                continue

            # Backpropagation step
            loss.backward()

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), GRADIENT_CLIP)

            # Update the weights
            for optimizer in optimizers:
                optimizer.step()

            # Appending the average loss per output token for this iteration
            iteration_losses.append(loss.item()/batch.English.shape[0])
            if (iter_n+1) % 1000 == 0:
                print("Iterations completed", iter_n+1, "Loss:", loss.item())

        epoch_losses.append(sum(iteration_losses)/len(iteration_losses))
        print("Epoch: ", epoch+1, "Average Loss:", sum(iteration_losses)/len(iteration_losses))
        plt.plot(range(len(iteration_losses)), iteration_losses)
        plt.xlabel("Iterations")
        plt.ylabel("Loss")
        plt.show()

    plt.plot(range(EPOCHS), epoch_losses)
    plt.show()
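
The gradient clipping step in the training loop rescales all gradients together whenever their combined norm exceeds `GRADIENT_CLIP`. The sketch below shows the idea on a flat list of numbers. It mirrors the concept behind `clip_grad_norm_` rather than its implementation, and the function name `clip_by_global_norm` and the `eps` default are assumptions made for this example.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale grads so their L2 norm is at most max_norm; returns (grads, original_norm)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = max_norm / (total_norm + eps)
    if scale < 1.0:                      # only shrink, never enlarge
        grads = [g * scale for g in grads]
    return grads, total_norm

clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=0.25)
print(norm, clipped)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its length changes.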

In [None]:
# Run training; train() plots the per-iteration loss for each epoch
# and the per-epoch loss at the end (iteration_losses is local to train())
train(model)

This is the saving code (mentioned above) for saving encoder and decoder models and fields 

In [None]:
# Saving Encoder and Decoder and Fields of the torchtext
with open("enc_model.pt", "wb") as f:
    torch.save(encoder.state_dict(), f)
    
with open("dec_model.pt", "wb") as f:
    torch.save(decoder.state_dict(), f)
    
with open("fr_field.Field", "wb") as f:
    dill.dump(fr_field, f)

with open("en_field.Field", "wb") as f:
    dill.dump(en_field, f)

# Model Evaluation
The model can be evaluated on **val_iterator**

In [None]:
# evaluating model
def evaluate_model(model, val_iterator=val_iterator, DEVICE=DEVICE):
    model.eval()

    epoch_loss = 0

    with torch.no_grad():

        for _, batch in enumerate(val_iterator):
            
            X = batch.French
            Y = batch.English

            encoder_outputs, hidden = model.encoder(X.to(DEVICE))
            dec_input = Y[0,:].to(DEVICE)
            loss= 0.0
            for target in Y[1:, :]:
                output, hidden = model.decoder(dec_input, hidden, encoder_outputs)
                loss += loss_criterion(output, target.to(DEVICE))
                dec_input = output.max(dim=1)[1]
                
            epoch_loss += loss.item()

    print(epoch_loss / len(val_iterator))

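# To get the predicted words for a batch
def predict(model, batch):
    model.eval()
    X = batch.French
    Y = batch.English
    outputs = []
    with torch.no_grad():
        encoder_outputs, hidden = model.encoder(X.to(DEVICE))
        dec_input = Y[0,:].to(DEVICE)
        # Decode as many steps as the target length, feeding back the greedy prediction
        for _ in Y[1:, :]:
            output, hidden = model.decoder(dec_input, hidden, encoder_outputs)
            dec_input = output.max(dim=1)[1]
            # itos is a list mapping indices to words; one word per sentence in the batch
            outputs.append([en_field.vocab.itos[idx] for idx in dec_input.tolist()])
        return outputs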
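
When no target sentence is available, greedy decoding usually runs until an end-of-sentence token is produced or a length limit is reached. The sketch below illustrates that loop with a toy stand-in for the decoder. The function `greedy_decode`, the `step_fn` callback (standing for one decoder step followed by an argmax) and the lookup table are all assumptions made for this example and are not part of the notebook's model.

```python
def greedy_decode(step_fn, sos_token="<sos>", eos_token="<eos>", max_len=100):
    """Repeatedly feed the last predicted token back in until <eos> or max_len steps."""
    tokens = []
    current = sos_token
    for _ in range(max_len):
        current = step_fn(current)       # hypothetical: one decoder step plus argmax
        if current == eos_token:
            break
        tokens.append(current)
    return tokens

# Toy "decoder": a fixed next-token table
toy_table = {"<sos>": "hello", "hello": "world", "world": "<eos>"}
print(greedy_decode(toy_table.get))
```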