# Building A Translation Application from Scratch


In [1]:
! pip install sentencepiece


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m25.1 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: sentencepiece
Successfully installed sentencepiece-0.1.97


## 1. Getting dataset (22,000,000 English-French Translations)



In [None]:
# ! pip install kaggle
# ! mkdir ~/.kaggle
# ! cp kaggle.json ~/.kaggle/
# ! chmod 600 ~/.kaggle/kaggle.json
# ! kaggle datasets download dhruvildave/en-fr-translation-dataset
# ! unzip /content/en-fr-translation-dataset.zip

Some notes on this dataset: It's pretty messy and it's so large that it's hard to shuffle it.

## 2. Building a SentencePiece Tokenizer

We could build separate vocabulary-based tokenizers for French and English, but this would be somewhat cumbersome. Let's try to build a shared byte-pair encoding tokenizer. This could be done from scratch, but the SentencePiece library is too convenient.

Some things to consider: When working with subwords, we can make our model smaller and more efficient, but do we sacrifice accuracy? That's a good question. Even so, we can maintain punctuation and case, and better account for 'unknown' words.

In [2]:
import pandas as pd
import random

source_path = '/content/drive/MyDrive/NLP_2023/french_translation/en-fr.csv'

source_path_2 = '/content/drive/MyDrive/NLP_2023/french_translation/english-french-common.csv'

random.seed(42)

dff = pd.read_csv(source_path, chunksize=500000)
df2 = pd.read_csv(source_path_2)

In [3]:
df = next(dff)

In [4]:
df.sample(10)

Unnamed: 0,en,fr
323022,Vinevax Bio-dowel Fungicide Label crops Target...,Vinevax Bio-dowel Fongicide Cultures principal...
443483,Or Both orders of governments must make every ...,Ou Les deux ordres de gouvernement doivent fai...
231454,Sylvain de Tonnancour James Holloway Senior Po...,Sylvain de Tonnancour James Holloway Conseille...
339808,Section 33 quality assurance processes used to...,Les processus d’assurance de la qualité prévus...
255963,◦ New technology – Maximum application of exis...,◦ Nouvelle technologie – Application maximale ...
226577,Find out now how to develop an e-business stra...,Vous apprendrez ici comment développer une str...
322724,Maturity - Indeterminate Growth Lentil has an ...,Maturité et croissance indéterminée La croissa...
261621,"Regardless of their merit or commercial value,...",Indépendamment de leurs qualités ou leur valeu...
72252,Game,Jeu
457138,◦ See Chicken Section Table Eggs and Processed...,◦ Voir Salubrité des Aliments (Poulets) Oeufs ...


In [5]:
df2 = df2.rename(columns={'English words/sentences':'en','French words/sentences':'fr'})
df2.sample(10)

Unnamed: 0,en,fr
163745,Tom certainly had a lot of time to think about...,Tom avait certainement beaucoup de temps pour ...
152876,There's a little bit of water in the glass.,Il y a un peu d'eau dans le verre.
164200,I don't know whether he'll come by train or by...,Je ne sais pas s'il viendra par le train ou en...
117546,Unfortunately he refused to come.,Malheureusement il a refusé de venir.
170088,We're going to invite Tom and Mary to our Hall...,Nous allons inviter Tom et Mary à notre fête d...
24274,"I should come, too.","Je devrais venir, aussi."
72223,Tom's family is in Boston.,La famille de Tom est à Boston.
69267,I went to the post office.,Je suis allé à la poste.
141020,I ran into a friend of mine on the bus.,Je suis tombée sur une amie à moi dans le bus.
7568,He is a writer.,Il est écrivain.


In [6]:
print(len(df2))

175621


In [7]:
df3 = pd.concat([df.sample(175621), df2])
df3.sample(10)

Unnamed: 0,en,fr
97720,I'm sorry if I frightened you.,Je suis désolée si je vous ai effrayés.
177929,Welcome to our Region,De l'innovation solide comme le roc
30984,She got in the taxi.,Elle est montée dans un taxi.
155922,I really don't want to go to Boston with Tom.,Je ne veux vraiment pas aller à Boston avec Tom.
243341,Business Information by Sector Canadian Oil an...,Information d'affaires par secteur Industrie c...
161185,"And, when Canadians do venture outside their h...","En outre, lorsque les Canadiens quittent leur ..."
163247,I read in the newspaper that he had been murde...,J'ai lu dans le journal qu'il avait été assass...
64446,This question isn't easy.,Cette question n'est pas simple.
295991,While participants did not agree that the APF ...,Même si les participants ne croient pas que le...
8238,I wasn't hired.,On ne m'a pas embauché.


In [8]:
print(len(df3))

351242


In [9]:
df3['en'] = df3['en'].astype('str')
df3['fr'] = df3['fr'].astype('str')
sample_text = list(df3['en'] + df3['fr'])

with open("sample_text.txt", "w") as output:
    output.write(str(("\n").join(sample_text)))

In [10]:
from sentencepiece import SentencePieceTrainer, SentencePieceProcessor

In [11]:
input_file = '/content/sample_text.txt'
max_num_words = 10000
model_type = 'bpe'
model_prefix = '/content/drive/MyDrive/NLP_2023/french_translation/sentencepiece'
pad_id = 0
unk_id = 1
bos_id = 2
eos_id = 3

sentencepiece_params = ' '.join([
    '--input={}'.format(input_file),
    '--model_type={}'.format(model_type),
    '--model_prefix={}'.format(model_prefix),
    '--vocab_size={}'.format(max_num_words),
    '--pad_id={}'.format(pad_id),
    '--unk_id={}'.format(unk_id),
    '--bos_id={}'.format(bos_id),
    '--eos_id={}'.format(eos_id)
])
print(sentencepiece_params)
SentencePieceTrainer.train(sentencepiece_params)

--input=/content/sample_text.txt --model_type=bpe --model_prefix=/content/drive/MyDrive/NLP_2023/french_translation/sentencepiece --vocab_size=10000 --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3


In [12]:
sp = SentencePieceProcessor()
sp.bos_token = '<start>'
sp.eos_token = '<end>'
sp.pad_token = '<pad>'
sp.unk_token = '<unk>'
sp.load(f"{model_prefix}.model")
print('Found %s unique tokens.' % sp.get_piece_size())

Found 10000 unique tokens.


## Testing our tokenizer.

In [13]:
original = "Je n'avais aucune idée"
encoded_pieces = sp.encode_as_pieces(original)
print(encoded_pieces)

# or convert it to numeric id for downstream modeling
encoded_ids = sp.encode_as_ids(original)
print(encoded_ids)
#'[3261, 45, 9924, 4030, 4298, 4774]'

['▁Je', '▁n', "'", 'avais', '▁aucune', '▁idée']
[3186, 45, 9925, 3992, 4273, 4756]


In [14]:
decoded_pieces = sp.decode_pieces(encoded_pieces)
print(decoded_pieces)

# we can convert the numeric id back to the original text
decoded_ids = sp.decode_ids(encoded_ids)
print(decoded_ids)

Je n'avais aucune idée
Je n'avais aucune idée


In [15]:
max_len = 50

def encode_english(text, sp):
  byte_pairs = sp.encode_as_ids(text)
  enc_c = [sp.bos_id()] + list(byte_pairs) + [sp.pad_id()] * (max_len - len(list(byte_pairs)))
  return enc_c[:max_len]

sample_text = 'My name is Sean, and je suis un americain'
print(encode_english(sample_text, sp))

[2, 1876, 3016, 138, 2045, 15, 9921, 70, 551, 665, 91, 449, 10, 42, 213, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [16]:
def encode_french(text, sp):
    byte_pairs = sp.encode_as_ids(text)
    enc_c = [sp.bos_id()] + list(byte_pairs) + [sp.eos_id()] + [sp.pad_id()] * (max_len - len(list(byte_pairs)))
    return enc_c[:max_len]

sample_text = 'My name is Sean, and je suis un americain'
print(encode_french(sample_text, sp))

[2, 1876, 3016, 138, 2045, 15, 9921, 70, 551, 665, 91, 449, 10, 42, 213, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


Finalizing our Dataset

In [17]:
df3.tail()

Unnamed: 0,en,fr
175616,"Top-down economics never works, said Obama. ""T...","« L'économie en partant du haut vers le bas, ç..."
175617,A carbon footprint is the amount of carbon dio...,Une empreinte carbone est la somme de pollutio...
175618,Death is something that we're often discourage...,La mort est une chose qu'on nous décourage sou...
175619,Since there are usually multiple websites on a...,Puisqu'il y a de multiples sites web sur chaqu...
175620,If someone who doesn't know your background sa...,Si quelqu'un qui ne connaît pas vos antécédent...


In [23]:
df4 = df3.sample(frac = 1)
df4.head()

Unnamed: 0,en,fr
108377,How long have you had this pain?,Depuis combien de temps as-tu cette douleur ?
173827,Just because something is more expensive doesn...,Ce n'est pas parce que quelque chose est plus ...
399596,Both treatments require tillage for incorporat...,Sécurité La sécurité constitue le facteur le p...
401767,The UV disinfection unit was originally instal...,"Le système a coûté environ 7 000 $, ce qui com..."
61438,I know Tom is colorblind.,Je sais que Tom est daltonien.


In [24]:
df4.to_csv('/content/drive/MyDrive/NLP_2023/french_translation/translation_data.csv')

### 3. Building a preprocessor and dataloader that accesses multiple files.

In [18]:
import math
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader, Dataset, IterableDataset

In [19]:
import csv

class CustomIterableDatasetv1(IterableDataset):

    def __init__(self, filename, length):

        #Store the filename in object's memory
        self.filename = filename
        self.len = length


    def __len__(self):
        return self.len

    def preprocess(self, text, text2):

        text_pp = torch.LongTensor(encode_english(text, sp))
        text_pp2 = torch.LongTensor(encode_french(text2, sp))
        
        return text_pp, text_pp2

    def line_mapper(self, line):
        
        text, text2 = self.preprocess(line[1], line[2])
        
        return text, text2
       
    def __iter__(self):

        #Create an iterator
        file_itr = open(self.filename)
        reader = csv.reader(file_itr)

        #Map each element using the line_mapper
        mapped_itr = map(self.line_mapper, reader)
        
        return mapped_itr

In [20]:
batch_size = 100
source_path = '/content/drive/MyDrive/NLP_2023/french_translation/translation_data.csv'
df = pd.read_csv(source_path)
length = len(df)

### No shuffle because it's an iterative loader
train_loader = torch.utils.data.DataLoader(CustomIterableDatasetv1(source_path, length),
                                           batch_size = batch_size,  
                                           pin_memory=True)

In [21]:
### Function to create masks

def create_masks(english, french_input, french_target):
    
    def subsequent_mask(size):
        mask = torch.triu(torch.ones(size, size)).transpose(0, 1).type(dtype=torch.uint8)
        return mask.unsqueeze(0)
    
    english_mask = english!=0   ## This makes a matrix of true and falses
    english_mask = english_mask.to(device)
    english_mask = english_mask.unsqueeze(1).unsqueeze(1)         # (batch_size, 1, 1, max_words)
  

    french_input_mask = french_input!=0
    french_input_mask = french_input_mask.unsqueeze(1)  # (batch_size, 1, max_words)
    french_input_mask = french_input_mask & subsequent_mask(french_input.size(-1)).type_as(french_input_mask.data) 
    french_input_mask = french_input_mask.unsqueeze(1) # (batch_size, 1, max_words, max_words)
    french_target_mask = french_target!=0              # (batch_size, max_words)
    
    return english_mask, french_input_mask, french_target_mask

# Build Transformer Model

In [22]:
class Embeddings(nn.Module):
    """
    Initializes embeddings
    Adds positional encoding
    """
    def __init__(self, vocab_size, d_model, max_len = 50):
        super(Embeddings, self).__init__()
        self.d_model = d_model  ## This is basically how deep the model is, so the embeddings are going to be 512 numbers for each word.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pe = self.create_positinal_encoding(max_len, self.d_model)
        self.dropout = nn.Dropout(0.1) 
        
    def create_positinal_encoding(self, max_len, d_model):
        pe = torch.zeros(max_len, d_model).to(device)
        for pos in range(max_len):   # for each position of the word
            for i in range(0, d_model, 2):   # for each dimension of the each position
                pe[pos, i] = math.sin(pos / (10000 ** ((2 * i)/d_model)))
                pe[pos, i + 1] = math.cos(pos / (10000 ** ((2 * (i + 1))/d_model)))
        pe = pe.unsqueeze(0)   # include the batch size
        return pe
        
    def forward(self, encoded_words):
        embedding = self.embed(encoded_words) * math.sqrt(self.d_model) ## To add more weight to embeddings
        embedding += self.pe[:, :embedding.size(1)]   # pe will automatically be expanded with the same batch size as encoded_words
        embedding = self.dropout(embedding) ## Isn't this the same dropout?
        return embedding

In [23]:
class MultiHeadAttention(nn.Module):
    '''
    Runs Embeddings through linear layer to create query, keys, and values
    Divides q,k,v by heads
    '''
    def __init__(self, heads, d_model):
        
        super(MultiHeadAttention, self).__init__()
        assert d_model % heads == 0
        self.d_k = d_model // heads
        self.heads = heads
        self.dropout = nn.Dropout(0.1)
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)
        self.concat = nn.Linear(d_model, d_model)
        
    def forward(self, query, key, value, mask):
        """
        query, key, value of shape: (batch_size, max_len, 512)
        mask of shape: (batch_size, 1, 1, max_words)
        """
        # (batch_size, max_len, 512)
        query = self.query(query)
        key = self.key(key)        
        value = self.value(value)   
        
        # (batch_size, max_len, 512) --> (batch_size, max_len, h, d_k) --> (batch_size, h, max_len, d_k)
        query = query.view(query.shape[0], -1, self.heads, self.d_k).permute(0, 2, 1, 3)   
        key = key.view(key.shape[0], -1, self.heads, self.d_k).permute(0, 2, 1, 3)  
        value = value.view(value.shape[0], -1, self.heads, self.d_k).permute(0, 2, 1, 3)  
        
        # (batch_size, h, max_len, d_k) matmul (batch_size, h, d_k, max_len) --> (batch_size, h, max_len, max_len)
        scores = torch.matmul(query, key.permute(0,1,3,2)) / math.sqrt(query.size(-1))
        scores = scores.masked_fill(mask == 0, -1e9)    # (batch_size, h, max_len, max_len)
        weights = F.softmax(scores, dim = -1)           # (batch_size, h, max_len, max_len)
        weights = self.dropout(weights)
        
        # (batch_size, h, max_len, max_len) matmul (batch_size, h, max_len, d_k) --> (batch_size, h, max_len, d_k)
        context = torch.matmul(weights, value)
        # (batch_size, h, max_len, d_k) --> (batch_size, max_len, h, d_k) --> (batch_size, max_len, h * d_k)
        context = context.permute(0,2,1,3).contiguous().view(context.shape[0], -1, self.heads * self.d_k)
        # (batch_size, max_len, h * d_k)
        interacted = self.concat(context)
        return interacted 

In [24]:
class FeedForward(nn.Module):
    '''
    This is a later linear layer, for adding more features.
    '''
    def __init__(self, d_model, middle_dim = 2048):
        super(FeedForward, self).__init__()
        
        self.fc1 = nn.Linear(d_model, middle_dim)
        self.fc2 = nn.Linear(middle_dim, d_model)
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        out = F.relu(self.fc1(x))
        out = self.fc2(self.dropout(out))
        return out

In [25]:
class EncoderLayer(nn.Module):
    '''
    creates encoder layer
    '''
    def __init__(self, d_model, heads):
        super(EncoderLayer, self).__init__()
        self.layernorm = nn.LayerNorm(d_model)
        self.self_multihead = MultiHeadAttention(heads, d_model)
        self.feed_forward = FeedForward(d_model)
        self.dropout = nn.Dropout(0.1)

    def forward(self, embeddings, mask):
        interacted = self.dropout(self.self_multihead(embeddings, embeddings, embeddings, mask))
        interacted = self.layernorm(interacted + embeddings)
        feed_forward_out = self.dropout(self.feed_forward(interacted))
        encoded = self.layernorm(feed_forward_out + interacted)
        return encoded

In [26]:
class DecoderLayer(nn.Module):
    
    def __init__(self, d_model, heads):
        super(DecoderLayer, self).__init__()
        self.layernorm = nn.LayerNorm(d_model)
        self.self_multihead = MultiHeadAttention(heads, d_model)
        self.src_multihead = MultiHeadAttention(heads, d_model)
        self.feed_forward = FeedForward(d_model)
        self.dropout = nn.Dropout(0.1)
        
    def forward(self, embeddings, encoded, src_mask, target_mask):
        query = self.dropout(self.self_multihead(embeddings, embeddings, embeddings, target_mask))
        query = self.layernorm(query + embeddings)
        interacted = self.dropout(self.src_multihead(query, encoded, encoded, src_mask))
        interacted = self.layernorm(interacted + query)
        feed_forward_out = self.dropout(self.feed_forward(interacted))
        decoded = self.layernorm(feed_forward_out + interacted)
        return decoded


In [27]:
class Transformer(nn.Module):
    
    def __init__(self, d_model, heads, num_layers):
        super(Transformer, self).__init__()
        
        self.d_model = d_model
        self.vocab_size = 10000
        self.embed = Embeddings(self.vocab_size, d_model)
        self.encoder = nn.ModuleList([EncoderLayer(d_model, heads) for _ in range(num_layers)])
        self.decoder = nn.ModuleList([DecoderLayer(d_model, heads) for _ in range(num_layers)])
        self.logit = nn.Linear(d_model, self.vocab_size)
        
    def encode(self, src_words, src_mask):
        src_embeddings = self.embed(src_words)
        for layer in self.encoder:
            src_embeddings = layer(src_embeddings, src_mask)
        return src_embeddings
    
    def decode(self, target_words, target_mask, src_embeddings, src_mask):
        tgt_embeddings = self.embed(target_words)
        for layer in self.decoder:
            tgt_embeddings = layer(tgt_embeddings, src_embeddings, src_mask, target_mask)
        return tgt_embeddings
        
    def forward(self, src_words, src_mask, target_words, target_mask):
        encoded = self.encode(src_words, src_mask)
        decoded = self.decode(target_words, target_mask, encoded, src_mask)
        out = F.log_softmax(self.logit(decoded), dim = 2)
        return out

In [28]:
class AdamWarmup:
    
    def __init__(self, model_size, warmup_steps, optimizer):
        
        self.model_size = model_size
        self.warmup_steps = warmup_steps
        self.optimizer = optimizer
        self.current_step = 0
        self.lr = 0
        
    def get_lr(self):
        return self.model_size ** (-0.5) * min(self.current_step ** (-0.5), self.current_step * self.warmup_steps ** (-1.5))
        
    def step(self):
        # Increment the number of steps each time we call the step function
        self.current_step += 1
        lr = self.get_lr()
        for param_group in self.optimizer.param_groups:
            param_group['lr'] = lr
        # update the learning rate
        self.lr = lr
        self.optimizer.step()

In [29]:
class LossWithLS(nn.Module):

    def __init__(self, size, smooth):
        super(LossWithLS, self).__init__()
        self.criterion = nn.KLDivLoss(size_average=False, reduce=False)
        self.confidence = 1.0 - smooth
        self.smooth = smooth
        self.size = size
        
    def forward(self, prediction, target, mask):
        """
        prediction of shape: (batch_size, max_words, vocab_size)
        target and mask of shape: (batch_size, max_words)
        """
        prediction = prediction.view(-1, prediction.size(-1))   # (batch_size * max_words, vocab_size)
        target = target.contiguous().view(-1)   # (batch_size * max_words)
        mask = mask.float()
        mask = mask.view(-1)       # (batch_size * max_words)
        labels = prediction.data.clone()
        labels.fill_(self.smooth / (self.size - 1))
        labels.scatter_(1, target.data.unsqueeze(1), self.confidence)
        loss = self.criterion(prediction, labels)    # (batch_size * max_words, vocab_size)
        loss = (loss.sum(1) * mask).sum() / mask.sum()
        return loss


In [30]:
d_model = 512
heads = 8
num_layers = 3
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
epochs = 8
vocab_size = sp.vocab_size()
    
transformer = Transformer(d_model = d_model, heads = heads, num_layers = num_layers)
transformer = transformer.to(device)
adam_optimizer = torch.optim.Adam(transformer.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9)
transformer_optimizer = AdamWarmup(model_size = d_model, warmup_steps = 4000, optimizer = adam_optimizer)
criterion = LossWithLS(vocab_size, 0.1)



In [31]:
def train(train_loader, transformer, criterion, epoch):
    
    transformer.train()
    sum_loss = 0
    count = 0

    for i, (english, french) in enumerate(train_loader):
        
        samples = english.shape[0]

        # Move to device
        english = english.to(device)
        french = french.to(device)

        # Prepare Target Data
        french_input = french[:, :-1] ## Remove the <end> token
        french_target = french[:, 1:] ## Remove the <start> token

        # Create mask and add dimensions
        english_mask, french_input_mask, french_target_mask = create_masks(english, french_input, french_target)

        # Get the transformer outputs
        out = transformer(english, english_mask, french_input, french_input_mask)

        # Compute the loss
        loss = criterion(out, french_target, french_target_mask)
        
        # Backprop
        transformer_optimizer.optimizer.zero_grad()
        loss.backward()
        transformer_optimizer.step()
        
        sum_loss += loss.item() * samples
        count += samples
        
        if i % 100 == 0:
            print("Epoch [{}][{}/{}]\tLoss: {:.3f}".format(epoch, i, len(train_loader), sum_loss/count))

In [32]:
def evaluate(transformer, english, english_mask, max_len):
    """
    Performs Greedy Decoding with a batch size of 1
    """
    transformer.eval()
    start_token = sp.bos_id()
    end_token = sp.eos_id()
    encoded = transformer.encode(english, english_mask)
    words = torch.LongTensor([[start_token]]).to(device)
    
    for step in range(max_len - 1):
        size = words.shape[1]
        target_mask = torch.triu(torch.ones(size, size)).transpose(0, 1).type(dtype=torch.uint8)
        target_mask = target_mask.to(device).unsqueeze(0).unsqueeze(0)
        decoded = transformer.decode(words, target_mask, encoded, english_mask)
        predictions = transformer.logit(decoded[:, -1])
        _, next_word = torch.max(predictions, dim = 1)
        next_word = next_word.item()
        if next_word == end_token:
            break
        words = torch.cat([words, torch.LongTensor([[next_word]]).to(device)], dim = 1)   # (1,step+2)
        
    # Construct Sentence
    if words.dim() == 2:
        words = words.squeeze(0)
        words = words.tolist()
        
    sen_idx = [w for w in words if w not in {start_token}]
    
    sentence = sp.decode_ids(sen_idx)
    
    return sentence

In [33]:
for epoch in range(epochs):
    
    train(train_loader, transformer, criterion, epoch)
    
    state = {'epoch': epoch, 'transformer': transformer, 'transformer_optimizer': transformer_optimizer}
    torch.save(state, '/content/drive/MyDrive/NLP_2023/french_translation/checkpoint_big_' + str(epoch) + '.pth.tar')

Epoch [0][0/3513]	Loss: 8.091
Epoch [0][100/3513]	Loss: 7.536
Epoch [0][200/3513]	Loss: 7.003
Epoch [0][300/3513]	Loss: 6.596
Epoch [0][400/3513]	Loss: 6.306
Epoch [0][500/3513]	Loss: 6.082
Epoch [0][600/3513]	Loss: 5.893
Epoch [0][700/3513]	Loss: 5.733
Epoch [0][800/3513]	Loss: 5.595
Epoch [0][900/3513]	Loss: 5.470
Epoch [0][1000/3513]	Loss: 5.356
Epoch [0][1100/3513]	Loss: 5.251
Epoch [0][1200/3513]	Loss: 5.155
Epoch [0][1300/3513]	Loss: 5.065
Epoch [0][1400/3513]	Loss: 4.979
Epoch [0][1500/3513]	Loss: 4.902
Epoch [0][1600/3513]	Loss: 4.828
Epoch [0][1700/3513]	Loss: 4.757
Epoch [0][1800/3513]	Loss: 4.692
Epoch [0][1900/3513]	Loss: 4.630
Epoch [0][2000/3513]	Loss: 4.571
Epoch [0][2100/3513]	Loss: 4.516
Epoch [0][2200/3513]	Loss: 4.463
Epoch [0][2300/3513]	Loss: 4.412
Epoch [0][2400/3513]	Loss: 4.365
Epoch [0][2500/3513]	Loss: 4.320
Epoch [0][2600/3513]	Loss: 4.276
Epoch [0][2700/3513]	Loss: 4.235
Epoch [0][2800/3513]	Loss: 4.195
Epoch [0][2900/3513]	Loss: 4.158
Epoch [0][3000/3513]	L

In [34]:
checkpoint = torch.load('/content/drive/MyDrive/NLP_2023/french_translation/checkpoint_big_7.pth.tar')
transformer = checkpoint['transformer']

In [48]:
while(1):
    english = input("English: ") 
    if english == 'quit':
        break
    english = encode_english(english, sp)
    english = torch.LongTensor(english).to(device).unsqueeze(0)
    english_mask = (english!=0).to(device).unsqueeze(1).unsqueeze(1)  
    sentence = evaluate(transformer, english, english_mask, int(40))
    print(sentence+'\n')

English: Hello, how are you today?
Eh bien aujourd'hui, comment êtes-vous ?

English: Excellent! The translation model seems to work!
Le modèle semble traduite !

English: Clearly, it's far from perfect.
Il est loin parfaitement loin d'être parfait.

English: But it sort of works I think.
Mais je pense que ça fonctionne.

English: And that's pretty good for only three hundred thousand data points.
Et trois points de pourcentage, les données de l'ordre de trois mille points de vente.

English: Let's try it with simpler sentences.
Essayons de simplifier les phrases.

English: I am a robot.
Je suis un robot.

English: I am from Japan.
Je viens du Japon.

English: I dance every day.
Je danse tous les jours.

English: And I like to eat pasta.
Et j'aime manger des pâtes.

English: As a basic translator, it's quite effective.
En tant que c'est un traducteur de base efficace.

English: Let's try it with more complex sentences.
Essayons de plus complexes avec ça.

English: Here's a sentence fro

## Conclusion: 

Byte-Pair encoding preserves the punctuation and produces pretty good results for morphologically complex languages like French. 

Because we're using an iterator to serve our data, we're losing the shuffling capabilities of the data loader. This means that shuffling has to occur in preprocessing. 