# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Using pretrained word embeddings is optional. The starter code below trains Word2Vec embeddings on the Brown corpus with Gensim and loads them. You can compare the performance of a model that uses these embeddings against one that does not.



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works in three steps: the Encoder reads the input sequence, its final hidden and cell states are passed to the Decoder, and the Decoder uses those states to output a series of token predictions.

## Dependencies

- PyTorch
- torchtext
- NumPy
- pandas
- NLTK
- spaCy
- gzip (Python standard library)
- Gensim


Please choose a dataset from the torchtext website. We recommend looking at the SQuAD dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [1]:
import os

In [2]:
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download en_core_web_sm
!pip install -U torchtext==0.10.0

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Defaulting to user installation because normal site-packages is not writeable

In [3]:
# !pip install torch 
# !pip install typing-extensions --upgrade
# !pip install -U torchtext
# !pip install spacy



In [4]:
import torch
from torchtext.utils import download_from_url
import json, os

URLS = {
    'SQuAD1':
        ['https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json',
         'https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json'],
    'SQuAD2':
        ['https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json',
         'https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json']
}


def _create_data_from_json(data_path):
    with open(data_path) as json_file:
        raw_json_data = json.load(json_file)['data']
        for layer1 in raw_json_data:
            for layer2 in layer1['paragraphs']:
                for layer3 in layer2['qas']:
                    processed = {'context': layer2['context'], 'question': layer3['question'],
                                 'answers': [item['text'] for item in layer3['answers']],
                                 'answer_start': [item['answer_start'] for item in layer3['answers']]}
                    if len(processed['answers']) == 0:
                        processed['answers'] = [""]
                        processed['answer_start'] = [-1]
                    yield processed


class RawQuestionAnswerDataset(torch.utils.data.IterableDataset):
    """Defines an abstraction for raw question answer iterable datasets.
    """

    def __init__(self, iterator):
        """Initialize the question-answer dataset from an iterator of examples.
        """
        super(RawQuestionAnswerDataset, self).__init__()
        self._iterator = iterator
        self.has_setup = False
        self.start = 0
        self.num_lines = None

    def setup_iter(self, start=0, num_lines=None):
        self.start = start
        self.num_lines = num_lines
        self.has_setup = True

    def __iter__(self):
        if not self.has_setup:
            self.setup_iter()

        for i, item in enumerate(self._iterator):
            # Stop before yielding once num_lines items have been produced
            if self.num_lines is not None and i >= (self.start + self.num_lines):
                break
            if i >= self.start:
                yield item


def _setup_datasets(dataset_name, root='./data'):
    extracted_files = []
    select_to_index = {'train': 0, 'dev': 1}
    extracted_files = [download_from_url(URLS[dataset_name][select_to_index[key]],
                                         root=root) for key in select_to_index.keys()]
    train_iter = _create_data_from_json(extracted_files[0])
    dev_iter = _create_data_from_json(extracted_files[1])
    return (RawQuestionAnswerDataset(train_iter),
            RawQuestionAnswerDataset(dev_iter))


def SQuAD1(*args, **kwargs):
    """ Defines SQuAD1 datasets.
    Examples:
        >>> train, dev = torchtext.experimental.datasets.raw.SQuAD1()
    """

    return _setup_datasets(*(("SQuAD1",) + args), **kwargs)


def SQuAD2(*args, **kwargs):
    """ Defines SQuAD2 datasets.
    Examples:
        >>> train, dev = torchtext.experimental.datasets.raw.SQuAD2()
    """

    return _setup_datasets(*(("SQuAD2",) + args), **kwargs)


DATASETS = {
    'SQuAD1': SQuAD1,
    'SQuAD2': SQuAD2
}
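
To see what `_create_data_from_json` yields, the sketch below applies the same flattening logic to a tiny hand-written record shaped like a SQuAD file. The sample article, the `flatten_squad` helper name, and the in-memory dict are illustrative assumptions and are not part of the starter code.

```python
# Hypothetical miniature of a SQuAD-style file (invented content, real layout).
sample = {
    "version": "1.1",
    "data": [
        {
            "title": "Example_Article",
            "paragraphs": [
                {
                    "context": "The sky is blue.",
                    "qas": [
                        {"id": "q1", "question": "What color is the sky?",
                         "answers": [{"text": "blue", "answer_start": 11}]},
                        {"id": "q2", "question": "Who painted it?",
                         "answers": []},
                    ],
                }
            ],
        }
    ],
}


def flatten_squad(raw):
    """Yield one flat dict per question, mirroring _create_data_from_json."""
    for article in raw["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                answers = [a["text"] for a in qa["answers"]]
                starts = [a["answer_start"] for a in qa["answers"]]
                if not answers:  # question without an answer
                    answers, starts = [""], [-1]
                yield {"context": paragraph["context"],
                       "question": qa["question"],
                       "answers": answers,
                       "answer_start": starts}


rows = list(flatten_squad(sample))
for row in rows:
    print(row["question"], "->", row["answers"])
```

Each question becomes its own row and carries a copy of the paragraph context, which is why the nested article/paragraph/question structure disappears in the flattened output.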

## Data Downloaded

In [5]:
import pandas as pd
train = pd.read_json('./data/train-v1.1.json') 

In [6]:
train.head()

|   | data | version |
|---|------|---------|
| 0 | {'title': 'University_of_Notre_Dame', 'paragra... | 1.1 |
| 1 | {'title': 'Beyoncé', 'paragraphs': [{'context'... | 1.1 |
| 2 | {'title': 'Montana', 'paragraphs': [{'context'... | 1.1 |
| 3 | {'title': 'Genocide', 'paragraphs': [{'context... | 1.1 |
| 4 | {'title': 'Antibiotics', 'paragraphs': [{'cont... | 1.1 |


In [7]:
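articles = train['data'].values  # one dict per Wikipedia article

In [8]:
ids = []
ans = []
ques = []

# Collect the question/answer pairs from the first paragraph of each article.
for article in articles:
    for qa in article['paragraphs'][0]['qas']:
        ans.append(qa['answers'][0]['text'])  # keep only the first answer
        ques.append(qa['question'])
        ids.append(qa['id'])
        if len(qa['answers']) > 1:
            print(qa)  # flag questions that have several reference answers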
        

In [9]:
train_dataset = pd.DataFrame([ [i,j,k] for i,j,k in zip(ids, ques, ans)], columns = ['ids', 'ques', 'ans'])
train_dataset.head()    

|   | ids | ques | ans |
|---|-----|------|-----|
| 0 | 5733be284776f41900661182 | To whom did the Virgin Mary allegedly appear i... | Saint Bernadette Soubirous |
| 1 | 5733be284776f4190066117f | What is in front of the Notre Dame Main Building? | a copper statue of Christ |
| 2 | 5733be284776f41900661180 | The Basilica of the Sacred heart at Notre Dame... | the Main Building |
| 3 | 5733be284776f41900661181 | What is the Grotto at Notre Dame? | a Marian place of prayer and reflection |
| 4 | 5733be284776f4190066117e | What sits on top of the Main Building at Notre... | a golden statue of the Virgin Mary |


## Generate appropriate Datasets

In [10]:
import gensim
import nltk, os
import numpy as np
import pandas as pd
import gzip
import torch
from nltk.corpus import brown
nltk.download('brown')
nltk.download('punkt')
import spacy

# Output, save, and load brown embeddings
model = gensim.models.Word2Vec(brown.sents())
model.save('brown.embedding')

w2v = gensim.models.Word2Vec.load('brown.embedding')
spacy_en = spacy.load('en_core_web_sm')

 
def loadDF(path):
    '''
    Load a SQuAD JSON file into a pandas DataFrame with columns ids, Ques, Ans.

    Only the first paragraph of each article and the first answer to each
    question are kept.
    '''
    df = pd.read_json(path)
    articles = df['data'].values  # one dict per article
    ids = []
    ans = []
    ques = []

    for article in articles:
        for qa in article['paragraphs'][0]['qas']:
            ans.append(qa['answers'][0]['text'])
            ques.append(qa['question'])
            ids.append(qa['id'])
    train_dataset = pd.DataFrame(list(zip(ids, ques, ans)), columns = ['ids', 'Ques', 'Ans'])

    return train_dataset


def prepare_text(sentence):
    
    '''

    Split a sentence into a list of token strings using the spaCy English
    tokenizer (en_core_web_sm) loaded above.

    '''
    
    return [tok.text for tok in spacy_en.tokenizer(sentence)]



def train_test_split(SRC=None, TRG=None):
    
    '''
    Input: SRC and TRG are currently unused; the questions and responses are
            loaded directly from the SQuAD v1.1 train and dev files

    Output: Training and test datasets for SRC (questions) & TRG (responses),
            each a list of token lists wrapped in <BOS> ... <EOS>

    '''
    bos = ["<BOS>"]
    eos = ["<EOS>"]
    
    # loadDF names its columns 'Ques' and 'Ans'
    train = loadDF('./data/train-v1.1.json')
    SRC_train_dataset = [bos + prepare_text(each) + eos for each in train.Ques.values]
    TRG_train_dataset = [bos + prepare_text(each) + eos for each in train.Ans.values]
    
    test = loadDF('./data/dev-v1.1.json')
    SRC_test_dataset = [bos + prepare_text(each) + eos for each in test.Ques.values]
    TRG_test_dataset = [bos + prepare_text(each) + eos for each in test.Ans.values]
    
    return SRC_train_dataset, SRC_test_dataset, TRG_train_dataset, TRG_test_dataset


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
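
The `<BOS>`/`<EOS>` wrapping done in `train_test_split` can be tried without loading spaCy or the dataset. This sketch swaps in a plain whitespace split for the spaCy tokenizer, an assumption made only to keep the example dependency-free; the `simple_tokenize` and `wrap` helpers are hypothetical.

```python
BOS, EOS = "<BOS>", "<EOS>"


def simple_tokenize(sentence):
    # Stand-in for prepare_text: whitespace split instead of spaCy.
    return sentence.split()


def wrap(sentence):
    # Mark where the sequence starts and ends so the decoder knows
    # when to begin and when to stop generating.
    return [BOS] + simple_tokenize(sentence) + [EOS]


wrapped = wrap("What is the Grotto at Notre Dame?")
print(wrapped)
```

Unlike spaCy, the whitespace split leaves the question mark attached to the last word, which is one reason a real tokenizer is used in the notebook.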


In [11]:
w2v.wv['Hello']

array([ 0.00207365,  0.02666673, -0.00453171, -0.05273971,  0.01784584,
       -0.06991713,  0.00554869,  0.02328627,  0.00014236, -0.0170673 ,
        0.02023535, -0.03714994, -0.02138056,  0.00796599,  0.04087447,
       -0.01110171,  0.0421456 , -0.02751041, -0.00795583, -0.03105834,
        0.04115778,  0.0389748 ,  0.02551365,  0.0205302 , -0.00317072,
       -0.02616424, -0.02052782,  0.00032259,  0.01723288, -0.02231467,
        0.07344069, -0.03654412,  0.01949139, -0.02212733,  0.00249744,
       -0.01764404, -0.00681961,  0.06774941,  0.02764932, -0.04710691,
        0.02022763, -0.03852994, -0.00343246,  0.00658672,  0.01837641,
       -0.0225295 , -0.04328559,  0.03185142,  0.01849885,  0.032887  ,
       -0.02174715, -0.00568397,  0.033268  , -0.01307951, -0.00217068,
        0.03895813,  0.03854025,  0.04190052,  0.00080813,  0.02923417,
       -0.00251668,  0.00868551, -0.00021046,  0.02190729, -0.03805066,
        0.0378219 ,  0.01983622,  0.0223449 , -0.03132726,  0.07
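
The notebook loads the Word2Vec vectors but does not connect them to the model. One common way to do that, sketched here under assumptions, is to build a matrix with one row per vocabulary token, copying the pretrained vector where one exists and falling back to small random values otherwise. The tiny `pretrained` dict stands in for `w2v.wv`, and the 4-dimensional vectors and the `build_embedding_matrix` helper are invented for illustration. In PyTorch, the resulting matrix could be converted to a tensor and passed to `nn.Embedding.from_pretrained`.

```python
import random

random.seed(0)

EMB_DIM = 4  # illustrative only; gensim's Word2Vec defaults to 100 dimensions

# Stand-in for w2v.wv: token -> vector (values invented for the example).
pretrained = {
    "hello": [0.1, 0.2, 0.3, 0.4],
    "world": [0.5, 0.6, 0.7, 0.8],
}

vocab = ["<unk>", "<pad>", "<BOS>", "<EOS>", "hello", "world", "grotto"]


def build_embedding_matrix(vocab, pretrained, dim):
    """Return (matrix, hits): one row per token plus the pretrained-hit count."""
    matrix, hits = [], 0
    for token in vocab:
        if token in pretrained:
            matrix.append(list(pretrained[token]))  # copy the pretrained row
            hits += 1
        else:
            # No pretrained vector: start from small random values.
            matrix.append([random.uniform(-0.08, 0.08) for _ in range(dim)])
    return matrix, hits


matrix, hits = build_embedding_matrix(vocab, pretrained, EMB_DIM)
print(f"{hits} of {len(vocab)} tokens initialised from pretrained vectors")
```

The Brown vectors are case-sensitive (the lookup above uses `'Hello'`), while the fields defined later lowercase their tokens, so a real lookup may need to try more than one casing before falling back to random initialisation.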

In [12]:
#SRC_train_dataset, SRC_test_dataset, TRG_train_dataset, TRG_test_dataset = train_test_split()


In [13]:
# TRG_test_dataset[:5]

# Preparing the dataset to be used by the models

## Part 1: Generate the csv

In [14]:
train = loadDF('./data/train-v1.1.json')
train.iloc[:100000, 1:3].to_csv('train.csv')

In [15]:
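test = loadDF('./data/dev-v1.1.json')
# Write the test split from the dev-set frame loaded above, not from train
test.iloc[:10000, 1:3].to_csv('test.csv')

# NOTE: loadDF keeps only the first paragraph of each article, so train is small.
# If it has 10,000 rows or fewer, this slice is empty and dev.csv holds no validation examples.
train.iloc[10000:20000, 1:3].to_csv('dev.csv')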


## Part 2: Generate the fields / iterators

In [16]:
import torch, spacy, re
from torchtext.legacy.data import Field, TabularDataset, BucketIterator
#from tokenizers import Tokenizer

def tokenise(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]


def get_fields():
    src_q = Field(tokenize = tokenise, init_token='<BOS>', eos_token='<EOS>', lower=True)
    tar_q = Field(tokenize = tokenise, init_token='<BOS>', eos_token='<EOS>', lower=True)

    src_orig = Field(init_token='<BOS>', eos_token='<EOS>', lower=True)
    tar_orig = Field(init_token='<BOS>', eos_token='<EOS>', lower=True)
    return src_q, tar_q, src_orig, tar_orig


src_q, tar_q, src_orig, tar_orig = get_fields()

fields = { 'Ques': ('src', src_q), 'Ans': ('trg', tar_q ) }
orig_fields = { 'Ques': ('src_orig', src_orig), 'Ans': ('trg_orig', tar_orig ) }

In [17]:
#Get Tokenised Field Data
def get_data(train='train.csv', dev='dev.csv', test='test.csv'):
    train, valid, test = TabularDataset.splits(
                path = '',
                train = train,
                validation = dev,
                test = test,
                format = 'csv',
                fields = fields)

    return train, valid, test


#Get Original Data
def get_original_data(train='train.csv', dev='dev.csv', test='test.csv'):
    train, valid, test = TabularDataset.splits(
                path = '',
                train = train,
                validation = dev,
                test = test,
                format = 'csv',
                fields = orig_fields)

    return train, valid, test




# Build the vocabularies and the bucketed iterators

def get_iterators(train_data, valid_data, test_data, batch_size=32):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    # Vocabularies are built from the training split only
    src_q.build_vocab(train_data, min_freq = 2)
    tar_q.build_vocab(train_data, min_freq = 2)
    # Use the datasets passed in as arguments rather than module-level globals
    train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
        (train_data, valid_data, test_data),
        batch_size = batch_size,
        sort_key=lambda x: len(x.src),
        sort_within_batch=False,
        device = device)
    return train_iterator, valid_iterator, test_iterator, src_q, tar_q
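
`build_vocab(..., min_freq = 2)` keeps only tokens seen at least twice in the training split; everything else maps to the unknown token. A rough pure-Python sketch of that idea follows. It approximates what the torchtext `Field` does rather than reproducing its implementation, and the sample sentences and helper names are made up.

```python
from collections import Counter

SPECIALS = ["<unk>", "<pad>", "<BOS>", "<EOS>"]


def build_vocab(token_lists, min_freq=2):
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Special tokens come first, then frequent tokens by descending count
    # (ties broken alphabetically).
    itos = list(SPECIALS)
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


def numericalize(tokens, stoi):
    # Wrap in <BOS>/<EOS> and replace out-of-vocabulary tokens with <unk>.
    unk = stoi["<unk>"]
    ids = [stoi["<BOS>"]]
    ids += [stoi.get(tok, unk) for tok in tokens]
    ids.append(stoi["<EOS>"])
    return ids


corpus = [
    ["what", "is", "the", "grotto"],
    ["what", "is", "in", "front"],
    ["the", "main", "building"],
]
itos, stoi = build_vocab(corpus, min_freq=2)
print(itos)
print(numericalize(["what", "is", "the", "basilica"], stoi))
```

Raising `min_freq` shrinks the vocabulary and the output layer, at the cost of turning more rare words into `<unk>`.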

    

In [18]:
import torchtext
torchtext.__version__

'0.10.0'

# Define the Encoder

In [19]:
import torch
import torch.nn as nn
import torch.optim as optim

# from torchtext.legacy.datasets import Multi30k
# from torchtext.legacy.data import Field, BucketIterator

import spacy
import numpy as np

import random
import math
import time

In [20]:
# class Encoder(nn.Module):
    
#     def __init__(self, input_size, hidden_size):
        
#         super(Encoder, self).__init__()
        
#         # self.embedding provides a vector representation of the inputs to our model
#         self.hid_dim = hid_dim
#         self.n_layers = 1
        
#         # self.lstm, accepts the vectorized input and passes a hidden state
        
    
#     def forward(self, i):
        
#         '''
#         Inputs: i, the src vector
#         Outputs: o, the encoder outputs
#                 h, the hidden state
#                 c, the cell state
#         '''
        
#         return o, h, c
    
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        
        outputs, (hidden, cell) = self.rnn(embedded)
        
        #outputs = [src len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #outputs are always from the top hidden layer
        
        return hidden, cell
    
    


# Define the decoder

In [21]:
# class Decoder(nn.Module):
      
#     def __init__(self, hidden_size, output_size):
        
#         super(Decoder, self).__init__()
        
#         # self.embedding provides a vector representation of the target to our model
        
#         # self.lstm, accepts the embeddings and outputs a hidden state

#         # self.ouput, predicts on the hidden state via a linear output layer     
        
#     def forward(self, i, h):
        
#         '''
#         Inputs: i, the target vector
#         Outputs: o, the prediction
#                 h, the hidden state
#         '''
        
#         return o, h
        

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.fc_out = nn.Linear(hid_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
        
        #input = [batch size]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #n directions in the decoder will always be 1, therefore:
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
                
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        
        #output = [seq len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #seq len and n directions will always be 1 in the decoder, therefore:
        #output = [1, batch size, hid dim]
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        prediction = self.fc_out(output.squeeze(0))
        
        #prediction = [batch size, output dim]
        
        return prediction, hidden, cell        



# Define the Seq2Seq

In [22]:
# class Seq2Seq(nn.Module):
    
#     def __init__(self, encoder_input_size, encoder_hidden_size, decoder_hidden_size, decoder_output_size):
        
#         super(Seq2Seq, self).__init__()
        
    
    
#     def forward(self, src, trg, teacher_forcing_ratio = 0.5):      
        
#         return o

    

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, \
            "Encoder and decoder must have equal number of layers!"
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
        
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        #last hidden state of the encoder is used as the initial hidden state of the decoder
        hidden, cell = self.encoder(src)
        
        #first input to the decoder is the <BOS> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden and previous cell states
            #receive output tensor (predictions) and new hidden and cell states
            output, hidden, cell = self.decoder(input, hidden, cell)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1
        
        return outputs
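
The teacher-forcing decision inside `forward` can be isolated from the network. In this sketch a hypothetical `toy_decoder` that always predicts the same wrong token stands in for the LSTM decoder, so it is easy to see which inputs the loop feeds back at each step. The `decode` helper and the sample target are assumptions for illustration.

```python
import random


def toy_decoder(prev_token):
    # Stand-in for the real decoder: always predicts "?".
    return "?"


def decode(trg, teacher_forcing_ratio, rng):
    """Return the list of tokens fed to the decoder at each step."""
    fed = []
    inp = trg[0]  # first input is the <BOS> token
    for t in range(1, len(trg)):
        fed.append(inp)
        prediction = toy_decoder(inp)
        teacher_force = rng.random() < teacher_forcing_ratio
        # Ground-truth token if teacher forcing, else the model's own guess.
        inp = trg[t] if teacher_force else prediction
    return fed


trg = ["<BOS>", "a", "golden", "statue", "<EOS>"]
always = decode(trg, 1.0, random.Random(0))
never = decode(trg, 0.0, random.Random(0))
print(always)
print(never)
```

With a ratio of 1.0 the decoder only ever sees ground-truth tokens; with 0.0 it has to live with its own mistakes, which is the situation it faces at inference time.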

# Training the model

In [23]:
SEED = 1234
BATCH_SIZE = 32
N_EPOCHS = 10
CURRENT_EPOCH = 0

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

train_data, valid_data, test_data = get_data()
train_iterator, valid_iterator, _, src_tw, trg_en = get_iterators(train_data, valid_data, test_data, BATCH_SIZE)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Unique tokens in source (question) vocabulary: {len(src_tw.vocab)}")
print(f"Unique tokens in target (answer) vocabulary: {len(trg_en.vocab)}")

Unique tokens in source (question) vocabulary: 2054
Unique tokens in target (answer) vocabulary: 1245


### Build Model

In [24]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
INPUT_DIM, OUTPUT_DIM = len(src_tw.vocab), len(trg_en.vocab) 
ENC_EMB_DIM = 100
DEC_EMB_DIM = 100
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.1
DEC_DROPOUT = 0.1

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device).to(device)

def init_weights(module):
    # Initialize every parameter uniformly in [-0.08, 0.08]
    for name, param in module.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)

model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(2054, 100)
    (rnn): LSTM(100, 512, num_layers=2, dropout=0.1)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(1245, 100)
    (rnn): LSTM(100, 512, num_layers=2, dropout=0.1)
    (fc_out): Linear(in_features=512, out_features=1245, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
)

In [25]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 7,686,025 trainable parameters


In [26]:
optimizer = optim.Adam(model.parameters(), lr=1e-6)

PAD_IDX = trg_en.vocab.stoi[trg_en.pad_token] # ignore padding index when calculating loss
criterion = nn.CrossEntropyLoss(ignore_index = PAD_IDX)
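
Setting `ignore_index = PAD_IDX` means padded positions contribute nothing to the loss, and the mean is taken over the remaining positions only. The pure-Python sketch below reproduces that behaviour for a toy batch; the logits, the `PAD` value of 1, and the function name are assumptions chosen for illustration.

```python
import math

PAD = 1  # assumed padding index for this example


def cross_entropy_ignore_pad(logits, targets, pad_idx=PAD):
    total, counted = 0.0, 0
    for row, target in zip(logits, targets):
        if target == pad_idx:
            continue  # padded position: skipped entirely
        # Negative log-softmax of the target class, computed stably.
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[target]
        counted += 1
    # Guard against division by zero when every position is padding.
    return total / counted if counted else 0.0


logits = [
    [2.0, 0.0, 0.0],   # target 0: confident and correct
    [0.0, 0.0, 0.0],   # target 2: uniform guess
    [5.0, 0.0, 0.0],   # target is PAD: ignored
]
targets = [0, 2, PAD]
loss = cross_entropy_ignore_pad(logits, targets)
print(round(loss, 4))
```

Without the mask, short answers padded to the batch length would be rewarded or punished for predictions at positions that carry no information.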


In [27]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.src
        trg = batch.trg
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)
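
`clip_grad_norm_` rescales all gradients together when their combined L2 norm exceeds `clip`, which keeps the update direction but limits its size. Here is a minimal sketch of that rule on plain Python lists, assuming a small epsilon in the denominator as PyTorch uses; the `clip_grad_norm` helper and the sample gradients are hypothetical.

```python
import math


def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Scale a list of gradient vectors so their joint L2 norm is <= max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Never scale up: the factor is capped at 1.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm


grads = [[3.0, 0.0], [0.0, 4.0]]        # joint norm is 5
clipped, norm_before = clip_grad_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, round(norm_after, 4))
```

Gradients whose joint norm is already below the threshold pass through unchanged, so clipping only intervenes on unusually large updates.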

In [28]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg

            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)
            
            epoch_loss += loss.item()
        
    return epoch_loss / max(len(iterator), 1)  # average over batches; guards against an empty iterator

In [29]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [30]:
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 10s
	Train Loss: 7.118 | Train PPL: 1234.481
	 Val. Loss: 0.000 |  Val. PPL:   1.000
Epoch: 02 | Time: 0m 10s
	Train Loss: 7.107 | Train PPL: 1220.643
	 Val. Loss: 0.000 |  Val. PPL:   1.000
Epoch: 03 | Time: 0m 10s
	Train Loss: 7.095 | Train PPL: 1206.502
	 Val. Loss: 0.000 |  Val. PPL:   1.000
Epoch: 04 | Time: 0m 10s
	Train Loss: 7.083 | Train PPL: 1191.674
	 Val. Loss: 0.000 |  Val. PPL:   1.000
Epoch: 05 | Time: 0m 10s
	Train Loss: 7.069 | Train PPL: 1175.377
	 Val. Loss: 0.000 |  Val. PPL:   1.000
Epoch: 06 | Time: 0m 10s
	Train Loss: 7.054 | Train PPL: 1157.190
	 Val. Loss: 0.000 |  Val. PPL:   1.000
Epoch: 07 | Time: 0m 10s
	Train Loss: 7.037 | Train PPL: 1138.410
	 Val. Loss: 0.000 |  Val. PPL:   1.000
Epoch: 08 | Time: 0m 10s
	Train Loss: 7.018 | Train PPL: 1116.375
	 Val. Loss: 0.000 |  Val. PPL:   1.000
Epoch: 09 | Time: 0m 10s
	Train Loss: 6.994 | Train PPL: 1090.586
	 Val. Loss: 0.000 |  Val. PPL:   1.000
Epoch: 10 | Time: 0m 10s
	Train Loss: 6.963 | 
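
Perplexity is the exponential of the average cross-entropy loss, so it can be read as the effective number of equally likely tokens the model is choosing between. A quick check of that relationship, using a hypothetical `perplexity` helper:

```python
import math


def perplexity(avg_loss):
    # PPL = exp(mean negative log-likelihood per token)
    return math.exp(avg_loss)


vocab_size = 1245                       # target vocabulary size reported earlier
uniform_loss = math.log(vocab_size)     # loss of a model guessing uniformly
print(perplexity(uniform_loss))         # recovers the vocabulary size, up to rounding
print(perplexity(0.0))                  # a loss of zero gives a perplexity of 1
```

The training loss of about 7.1 in the log above is close to ln(1245) ≈ 7.13, so the model is still near uniform guessing. A validation loss of exactly 0.000 would mean every token was predicted perfectly; on a real validation set that more likely indicates that no batches were evaluated.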

In [31]:
import torch.nn as nn

# from utils import translate_sentence

BATCH_SIZE = 32
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_data, valid_data, test_data = get_data()
og_train_data, og_valid_data, og_test_data = get_original_data()
_, _, test_iterator, src_tw, trg_en = get_iterators(train_data, valid_data, test_data, BATCH_SIZE)
PAD_IDX = trg_en.vocab.stoi[trg_en.pad_token] # ignore padding index when calculating loss
criterion = nn.CrossEntropyLoss(ignore_index = PAD_IDX)



# example_idx = randrange(len(og_valid_data.examples))
# example = valid_data.examples[example_idx]
# og_example = og_valid_data.examples[example_idx]
# orig_source = ' '.join(og_example.src_orig)
# print('ORIG SOURCE: ', orig_source)
# orig_target = ' '.join(og_example.trg_orig)
# print('ORIG TARGET: ', orig_target)
# preprocessed_source = ' '.join(example.src[::-1])
# print('TOKENIZED SOURCE: ', preprocessed_source)
# preprocessed_target = ' '.join(example.trg)
# refs = example.trg
# print('TOKENIZED TARGET: ', preprocessed_target)
#
# src_tensor = src_tw.process([example.src]).to(device)
# trg_tensor = trg_en.process([example.trg]).to(device)
#
# model.eval()
# with torch.no_grad():
#     outputs = model(src_tensor, trg_tensor, teacher_forcing_ratio=0)
#
# output_idx = outputs[1:].squeeze(1).argmax(1)
# # itos: A list of token strings indexed by their numerical identifiers.
# generation = [trg_en.vocab.itos[idx] for idx in output_idx]
# predicted_translation = []
# for word in generation:
#     if word == '<eos>': break
#     predicted_translation.append(word)
# predicted_translation = ' '.join(predicted_translation)
# print('TRANSLATION: ', ' '.join(predicted_translation))


# write to translation files to use when calculating BLEU
model.eval()
with open("target_translation.txt", "a") as file_ref, \
     open("predicted_translation.txt", "a") as file_pred:  # append mode

    for example in test_data.examples:
        # reference line: the tokenised target answer
        file_ref.write(' '.join(example.trg) + '\n')

        src_tensor = src_tw.process([example.src]).to(device)
        trg_tensor = trg_en.process([example.trg]).to(device)

        with torch.no_grad():
            outputs = model(src_tensor, trg_tensor, teacher_forcing_ratio=0)

        output_idx = outputs[1:].squeeze(1).argmax(1)

        # itos: A list of token strings indexed by their numerical identifiers.
        generation = [trg_en.vocab.itos[idx] for idx in output_idx]
        predicted_translation = []
        for word in generation:
            # stop at the end-of-sequence token ('<EOS>' as configured in get_fields)
            if word == trg_en.eos_token: break
            predicted_translation.append(word)
        file_pred.write(' '.join(predicted_translation) + '\n')
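
Once both files are written, a BLEU score can be computed by comparing them line by line. The function below is a simplified, unsmoothed corpus-level BLEU written from scratch for illustration (one reference per prediction, n-grams up to `max_n`). It is not a scorer prescribed by the project, and an established implementation such as NLTK's `corpus_bleu` is a better choice for reported numbers. The sample sentences are invented.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=2):
    """Corpus-level BLEU with one reference per hypothesis and uniform weights."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            ref_counts, hyp_counts = ngrams(ref, n), ngrams(hyp, n)
            # Clipped counts: an n-gram is credited at most as often as it
            # appears in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: punish hypotheses shorter than their references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


refs = ["a golden statue of the virgin mary".split(),
        "the main building".split()]
hyps = ["a golden statue of mary".split(),
        "the main building".split()]
score = corpus_bleu(refs, hyps)
print(round(score, 3))
```

To score the notebook's output, each line of `target_translation.txt` and `predicted_translation.txt` would be read and split on whitespace to form `references` and `hypotheses`. Because both files are opened in append mode, they should be deleted between runs so that old predictions are not scored again.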
