
# Seq2Seq Translation without Attention

## Preprocessing

### Preparing the Data

In [1]:
!wget https://www.statmt.org/europarl/v7/fr-en.tgz -O data/fr-en.tgz
!tar -xf data/fr-en.tgz -C data

--2023-02-18 17:15:58--  https://www.statmt.org/europarl/v7/fr-en.tgz
Resolving www.statmt.org (www.statmt.org)... 129.215.197.184
Connecting to www.statmt.org (www.statmt.org)|129.215.197.184|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 202718517 (193M) [application/x-gzip]
Saving to: ‘data/fr-en.tgz’

data/fr-en.tgz       18%[==>                 ]  36.59M  4.05MB/s    eta 44s    

KeyboardInterrupt: 

In [None]:
# ! pip install spacy (if not installed)

In [None]:
!python -m spacy download en
!python -m spacy download fr

In [18]:
def load_document(file_name):
    with open(file_name, 'rt', encoding='utf-8') as f:
        text = f.read()
    return text

def to_sentences(document):
    return document.strip().split('\n')

en_file = 'data/europarl-v7.fr-en.en'
fr_file = 'data/europarl-v7.fr-en.fr'

en_document = load_document(en_file)
fr_document = load_document(fr_file)

print(f'English document: {len(to_sentences(en_document))} sentences')
print(f'French document: {len(to_sentences(fr_document))} sentences')

English document: 2007723 sentences
French document: 2007723 sentences


### Cleaning the Data

In [1]:
import string

print(string.printable)
print(string.punctuation)

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ 	

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [2]:
from unicodedata import normalize

print(normalize('NFD', 'brötchen'))

brötchen


In [3]:
import re

def clean_lines(lines):
    if isinstance(lines, list):
        return [clean_lines(line) for line in lines]
    
    is_question = lines.endswith('?')
    remove_punctuation = str.maketrans('', '', string.punctuation)
    lines = normalize('NFD', lines).encode('ascii', 'ignore')
    lines = lines.decode('UTF-8')
    lines = lines.lower()
    lines = lines.translate(remove_punctuation)
    lines = re.sub(rf'[^{re.escape(string.printable)}]', '', lines)
    
    lines = [word for word in lines.split() if word.isalpha()]
    if is_question:
        lines.append('?')
    return ' '.join(lines)

clean_lines('What brötchen means?')

'what brotchen means ?'

In [22]:
# save clean lines to file
import pickle

cleaned_en = clean_lines(to_sentences(en_document))
cleaned_fr = clean_lines(to_sentences(fr_document))

with open('data/cleaned_en.pkl', 'wb') as f:
    pickle.dump(cleaned_en, f)
    print('English cleaned:')
    
with open('data/cleaned_fr.pkl', 'wb') as f:
    pickle.dump(cleaned_fr, f)
    print('French cleaned:')

English cleaned:
French cleaned:


In [23]:
import numpy as np

for i in np.random.randint(0, len(cleaned_en), 10):
    print(f'[English] {cleaned_en[i]}')
    print(f'[French] {cleaned_fr[i]}\n')

[English] on behalf of the inddem group pl mr president the present debate on breaches of human rights is particularly shocking and repulsive as it relates to human trafficking and sexual abuse against children in liberia haiti congo and elsewhere by the staff of humanitarian missions who are supposed to provide help and care to the victims of starvation and armed conflict and ensure their security protection and nourishment
[French] au nom du groupe inddem pl monsieur le president le present debat sur les violations des droits de lhomme est particulierement choquant et repoussant puisquil est lie au trafic detres humains et aux abus sexuels denfants au liberia en haiti au congo et ailleurs perpetres par le personnel de missions humanitaires cense apporter de laide et des soins aux victimes de la faim et des conflits armes et assurer leur securite leur protection et leur alimentation

[English] question no by kirsten jensen
[French] jappelle la question de mme kirsten jensen

[English]

## Training

you can restart the notebook if available memory is not enough

In [5]:
import torch as th
import torch.nn as nn
import torch.nn.functional as F

import numpy as np
from typing import List, Tuple, Union, Optional

In [6]:
print(th.cuda.get_device_name(), th.cuda.get_device_capability())
device = th.device('cuda' if th.cuda.is_available() else 'cpu')

NVIDIA GeForce RTX 3080 (8, 6)


In [7]:
print(th.cuda.memory_summary())

|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------

### load data & split

In [4]:
import random
import numpy as np

def set_seed(seed: int):
    random.seed(seed)
    np.random.seed(seed)
    th.manual_seed(seed)
    th.cuda.manual_seed(seed)
    th.cuda.manual_seed_all(seed)

set_seed(42)

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split

cleaned_en = pd.read_pickle('data/cleaned_en.pkl')
cleaned_fr = pd.read_pickle('data/cleaned_fr.pkl')

en_fr = pd.DataFrame({'en': cleaned_en, 'fr': cleaned_fr})
en_fr

Unnamed: 0,en,fr
0,resumption of the session,reprise de la session
1,i declare resumed the session of the european ...,je declare reprise la session du parlement eur...
2,although as you will have seen the dreaded mil...,comme vous avez pu le constater le grand bogue...
3,you have requested a debate on this subject in...,vous avez souhaite un debat a ce sujet dans le...
4,in the meantime i should like to observe a min...,en attendant je souhaiterais comme un certain ...
...,...,...
2007718,i would also like although they are absent to ...,je me permettrai meme bien quils soient absent...
2007719,i am not going to reopen the millennium or not...,je ne rouvrirai pas le debat sur le millenaire...
2007720,adjournment of the session,interruption de la session
2007721,i declare the session of the european parliame...,je declare interrompue la session du parlement...


In [40]:
# Select 150k samples due to memory limitation

en_fr = en_fr.sample(150000, random_state=42)

In [41]:
train_df, test_df = train_test_split(en_fr, test_size=0.2, random_state=42)
train_df, valid_df = train_test_split(train_df, test_size=0.13, random_state=42)

print(f'Train data: {len(train_df)}\t({len(train_df) / len(en_fr) * 100:.2f}%)')
print(f'Valid data: {len(valid_df)}\t({len(valid_df) / len(en_fr) * 100:.2f}%)')
print(f'Test data: {len(test_df)}\t({len(test_df) / len(en_fr) * 100:.2f}%)')

Train data: 104400	(69.60%)
Valid data: 15600	(10.40%)
Test data: 30000	(20.00%)


### build input & label data

In [None]:
!pip install torchtext

In [8]:
from torchtext.data import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

en_tokenizer = get_tokenizer('spacy', language='en')
fr_tokenizer = get_tokenizer('spacy', language='fr')

2023-02-18 15:43:15.914360: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-18 15:43:16.004503: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-02-18 15:43:16.502728: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-02-18 15:43:16.502783: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] 

In [42]:
en_vocab = build_vocab_from_iterator(map(en_tokenizer, train_df['en']),
                                     specials=['<unk>', '<pad>', '<bos>', '<eos>'],
                                     min_freq=2)
fr_vocab = build_vocab_from_iterator(map(fr_tokenizer, train_df['fr']),
                                     specials=['<unk>', '<pad>', '<bos>', '<eos>'],
                                     min_freq=2)

en_vocab.set_default_index(en_vocab['<unk>'])
fr_vocab.set_default_index(fr_vocab['<unk>'])

print(f'English vocab size: {len(en_vocab)}')
print(f'French vocab size: {len(fr_vocab)}')

English vocab size: 20633
French vocab size: 28871


In [11]:
# encoding
txt = "The terms “scalable” and “large scale” have been used in machine learning circles long before there was big data."
txt = clean_lines(txt)
print(f'<cleand> {txt}')

tokens = en_tokenizer(txt)
encoded = [en_vocab[token] for token in tokens]
print(f'<splited> {tokens}')
print(f'<encoded> {encoded}')

# decoding
decoded = [en_vocab.get_itos()[idx] for idx in encoded]
print(f'<restored> {" ".join(decoded)}')

<cleand> the terms scalable and large scale have been used in machine learning circles long before there was big data
<splited> ['the', 'terms', 'scalable', 'and', 'large', 'scale', 'have', 'been', 'used', 'in', 'machine', 'learning', 'circles', 'long', 'before', 'there', 'was', 'big', 'data']
<encoded> [4, 255, 0, 7, 436, 1664, 22, 44, 337, 8, 5824, 2012, 5494, 308, 213, 43, 51, 1307, 467]
<restored> the terms <unk> and large scale have been used in machine learning circles long before there was big data


In [44]:
from tqdm import tqdm

def data_process(data: pd.DataFrame) -> List[Tuple[th.Tensor, th.Tensor]]:
    tensor_output = []
    for i in tqdm(range(len(data)), total=len(data)):
        # skip too long sentences
        if len(data['en'].iloc[i]) > 500 or len(data['fr'].iloc[i]) > 500:
            continue
        en_tensor = th.tensor([en_vocab[token] for token in en_tokenizer(data['en'].iloc[i])], dtype=th.long)
        fr_tensor = th.tensor([fr_vocab[token] for token in fr_tokenizer(data['fr'].iloc[i])], dtype=th.long)
        tensor_output.append((en_tensor, fr_tensor))
    return tensor_output

train_data = data_process(train_df)
valid_data = data_process(valid_df)
test_data = data_process(test_df)

100%|██████████| 104400/104400 [00:11<00:00, 9394.65it/s]
100%|██████████| 15600/15600 [00:01<00:00, 8948.38it/s]
100%|██████████| 30000/30000 [00:03<00:00, 9069.58it/s]


In [45]:
print(f'Train data: {len(train_data)}')
print(f'Valid data: {len(valid_data)}')
print(f'Test data: {len(test_data)}\n')

for data in [train_data, valid_data, test_data]:
    print(f'Max length:')
    print(f'\ten: {max([len(en) for en, fr in data])}')
    print(f'\tfr: {max([len(fr) for en, fr in data])}')
    print(f'Mean length:')
    print(f'\ten: {np.mean([len(en) for en, fr in data]):.2f}')
    print(f'\tfr: {np.mean([len(fr) for en, fr in data]):.2f}\n')


Train data: 103583
Valid data: 15472
Test data: 29794

Max length:
	en: 97
	fr: 89
Mean length:
	en: 24.29
	fr: 25.32

Max length:
	en: 85
	fr: 90
Mean length:
	en: 24.01
	fr: 24.99

Max length:
	en: 90
	fr: 90
Mean length:
	en: 24.35
	fr: 25.30



In [46]:
import pickle

with open('data/train_data.pkl', 'wb') as f:
    pickle.dump(train_data, f)
    print('Train data saved.')

with open('data/valid_data.pkl', 'wb') as f:
    pickle.dump(valid_data, f)
    print('Valid data saved.')

with open('data/test_data.pkl', 'wb') as f:
    pickle.dump(test_data, f)
    print('Test data saved.')

Train data saved.
Valid data saved.
Test data saved.


In [47]:
with open('data/en_vocab.pkl', 'wb') as f:
    pickle.dump(en_vocab, f)
    print('English vocabulary saved.')

with open('data/fr_vocab.pkl', 'wb') as f:
    pickle.dump(fr_vocab, f)
    print('French vocabulary saved.')

English vocabulary saved.
French vocabulary saved.


### build model

In [10]:
import torch as th
import torch.nn as nn
import torch.nn.functional as F

import numpy as np
from typing import List, Tuple, Union, Optional

import pandas as pd
from tqdm import tqdm

device = th.device('cuda' if th.cuda.is_available() else 'cpu')
# device = th.device('cpu')

en_vocab = pd.read_pickle('data/en_vocab.pkl')
fr_vocab = pd.read_pickle('data/fr_vocab.pkl')

train_data = pd.read_pickle('data/train_data.pkl')
valid_data = pd.read_pickle('data/valid_data.pkl')
test_data = pd.read_pickle('data/test_data.pkl')

In [2]:
len(train_data), len(valid_data), len(test_data)

(103583, 15472, 29794)

In [13]:
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

batch_size = 148
PAD_IDX = en_vocab['<pad>']
SOS_IDX = en_vocab['<bos>']
EOS_IDX = en_vocab['<eos>']

def generate_batch(data_batch):
    en_batch, fr_batch = [], []
    for (en_item, fr_item) in data_batch:
        en_batch.append(th.cat([th.tensor([SOS_IDX]), en_item, th.tensor([EOS_IDX])], dim=0))
        fr_batch.append(th.cat([th.tensor([SOS_IDX]), fr_item, th.tensor([EOS_IDX])], dim=0))
    en_batch = pad_sequence(en_batch, padding_value=PAD_IDX, batch_first=True)
    fr_batch = pad_sequence(fr_batch, padding_value=PAD_IDX, batch_first=True)
    return en_batch, fr_batch

train_iter = DataLoader(train_data, batch_size=batch_size, shuffle=True, collate_fn=generate_batch)
valid_iter = DataLoader(train_data, batch_size=batch_size, shuffle=True, collate_fn=generate_batch)
test_iter = DataLoader(train_data, batch_size=batch_size, shuffle=True, collate_fn=generate_batch)

### Check batch data

In [10]:
iter_ = iter(train_iter)
en_batch, fr_batch = next(iter_)
print(en_batch)
print(en_batch.shape)

tensor([[   2,   29,   42,  ...,    1,    1,    1],
        [   2,   17,   11,  ...,    1,    1,    1],
        [   2,   13,   26,  ...,    1,    1,    1],
        ...,
        [   2,   63,  223,  ...,    1,    1,    1],
        [   2,   17, 1441,  ...,    1,    1,    1],
        [   2,  647,    4,  ...,    1,    1,    1]])
torch.Size([64, 56])


In [11]:
print(' '.join(en_vocab.get_itos()[idx] for idx in en_batch[0]))
print(' '.join(fr_vocab.get_itos()[idx] for idx in fr_batch[0]))

<bos> mr president ladies and gentlemen i am glad to have this first opportunity to speak to the house <eos> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
<bos> monsieur le president mesdames et messieurs les deputes je me rejouis davoir la possibilite de mexprimer pour la premiere fois devant cette haute assemblee <eos> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


In [14]:
class SimpleEncoder(nn.Module):
    def __init__(self, input_dim:int, emb_dim:int, encoder_hid_dim:int, decoder_hid_dim:int, dropout:float):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, encoder_hid_dim, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(encoder_hid_dim * 2, decoder_hid_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src:th.Tensor):
        """
        src: [seq_len, batch_size]
        output: [seq_len, batch_size, 2 * encoder_hid_dim], [batch_size, decoder_hid_dim]
        """
        embedded = self.dropout(self.embedding(src))    # [bs, seq_len, emb_dim]
        # outputs: [seq_len, batch_size, encoder_hid_dim * 2], hidden: [2, batch_size, encoder_hid_dim]
        outputs, hidden = self.rnn(embedded)    # [bs, seq_len, 2 * enc_hid], [2, bs, enc_hid]
        # fc input: concat last forward and backward hidden state
        hidden = th.tanh(self.fc(th.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=-1))) # [bs, dec_hid]
        return outputs, hidden

In [55]:
tmp = SimpleEncoder(len(en_vocab), 128, 256, 256, 0.5)
a, b = tmp(en_batch)
a.shape, b.shape

(torch.Size([64, 68, 512]), torch.Size([64, 256]))

In [15]:
class Attention(nn.Module):
    def __init__(self, encoder_hid_dim: int, decoder_hid_dim: int, attn_dim: int):
        super().__init__()
        self.attn_input_dim = encoder_hid_dim * 2 + decoder_hid_dim
        self.attn = nn.Linear(self.attn_input_dim, attn_dim)

    def forward(self, decoder_hidden: th.Tensor, encoder_outputs: th.Tensor):
        """
        decoder_hidden: [batch_size, decoder_hid_dim]
        encoder_outputs: [batch_size, seq_len, 2 * encoder_hid_dim]
        output: [batch_size, seq_len]

        this is a concat attention
        """
        src_len = encoder_outputs.shape[1]  # 1 for batch_first else 0
        repeated_decoder_hidden = decoder_hidden.unsqueeze(1).repeat(1, src_len, 1) # [bs, seq_len, dec_hid]
        energy = self.attn(th.cat((repeated_decoder_hidden, encoder_outputs), dim=-1))  # [bs, seq_len, attn_dim]
        attention = energy.tanh().sum(dim=-1)   # [bs, seq_len]
        return F.softmax(attention, dim=-1)

In [61]:
tmp = Attention(256, 256, 256)
tmp(b, a).shape

torch.Size([64, 68, 256]) torch.Size([64, 68, 512])
torch.Size([64, 68, 256])
torch.Size([64, 68])


torch.Size([64, 68])

In [16]:
class SimpleDecoder(nn.Module):
    def __init__(self, output_dim:int, emb_dim:int, encoder_hid_dim:int, decoder_hid_dim:int, dropout:float, attention:Attention):
        super().__init__()
        self.attention = attention
        self.output_dim = output_dim
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim + encoder_hid_dim * 2, decoder_hid_dim, batch_first=True)
        self.fc = nn.Linear(attention.attn_input_dim + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def _weighted_encoder_outputs(self, decoder_hidden: th.Tensor, encoder_outputs: th.Tensor):
        """
        decoder_hidden: [batch_size, decoder_hid_dim]
        encoder_outputs: [batch_size, seq_len, 2 * encoder_hid_dim]
        output: [batch_size, 1, 2 * encoder_hid_dim]
        """
        a = self.attention(decoder_hidden, encoder_outputs)  # [bs, seq_len]
        a = a.unsqueeze(1)
        weighted_encoder_outputs = th.bmm(a, encoder_outputs)   # [bs, 1, 2 * enc_hid]
        return weighted_encoder_outputs

    def forward(self,
                decoder_inputs:th.Tensor,
                decoder_hidden:th.Tensor,
                encoder_outputs:th.Tensor):
        """
        decoder_inputs: [seq_len, batch_size]
        hidden: [batch_size, decoder_hid_dim]
        encoder_outputs: [seq_len, batch_size, encoder_hid_dim * 2]
        output: [batch_size, output_dim]
        """
        decoder_inputs = decoder_inputs.unsqueeze(1)  # [bs, 1]
        embedded = self.dropout(self.embedding(decoder_inputs)) # [bs, 1, emb_dim]
        # output: [batch_size, 1, encoder_hid_dim], hidden: [1, batch_size, decoder_hid_dim]
        w_encoder_outputs = self._weighted_encoder_outputs(decoder_hidden, encoder_outputs) # [bs, 1, 2 * enc_hid]
        rnn_input = th.cat((embedded, w_encoder_outputs), dim=-1)   # [bs, 1, emb_dim + 2 * enc_hid]
        output, decoder_hidden = self.rnn(rnn_input, decoder_hidden.unsqueeze(0))   # [bs, 1, dec_hid], [1, bs, dec_hid]
        embedded = embedded.squeeze(1)  # [bs, emb_dim]
        output = output.squeeze(1)  # [bs, dec_hid]
        w_encoder_outputs = w_encoder_outputs.squeeze(1)    # [bs, 2 * enc_hid]
        output = self.fc(th.cat((output, w_encoder_outputs, embedded), dim=-1)) # [bs, output_dim]
        # output: [batch_size, output_dim]
        return output, decoder_hidden.squeeze(0)    # [bs, output_dim], [bs, dec_hid]

In [77]:
attn = Attention(256, 256, 192)
tmp = SimpleDecoder(len(fr_vocab), 128, 256, 256, 0.5, attn)
print(tmp.fc)
print(tmp.output_dim)
x, y = tmp(en_batch[:, 0], b, a)
x.shape, y.shape

Linear(in_features=896, out_features=28871, bias=True)
28871
torch.Size([64, 1, 68]) torch.Size([64, 68, 512])


(torch.Size([64, 28871]), torch.Size([64, 256]))

In [17]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder:nn.Module, decoder:nn.Module, device:th.device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src:th.Tensor, trg:th.Tensor, teacher_forcing_ratio:float=0.5):
        """
        src: [seq_len, batch_size]
        trg: [seq_len, batch_size]
        output: [seq_len, batch_size, output_dim]
        """
        batch_size = src.shape[0]
        max_len = trg.shape[1]
        output_dim = self.decoder.output_dim
        outputs = th.zeros(batch_size, max_len, output_dim).to(self.device)
        encoder_outputs, hidden = self.encoder(src)
        decoder_input = trg[:, 0]    # start with <sos>
        for t in range(1, max_len):
            output, hidden = self.decoder(decoder_input, hidden, encoder_outputs)
            outputs[:, t, :] = output
            do_teacher_force = np.random.random() < teacher_forcing_ratio
            # select one of the ground truth or the predicted word (auto-regressive)
            decoder_input = trg[:, t] if do_teacher_force else output.argmax(-1)
        return outputs

    def translate(self, src: th.Tensor, max_len: int = 100) -> th.Tensor:
        with th.no_grad():
            outputs = th.zeros(1, max_len).to(self.device).long()
            encoder_outputs, hidden = self.encoder(src)
            decoder_input = th.tensor([SOS_IDX] * batch_size).to(self.device)
            for t in range(1, max_len):
                output, hidden = self.decoder(decoder_input, hidden, encoder_outputs)
                outputs[:, t] = output.argmax(-1)
                decoder_input = output.argmax(-1)
            return outputs

In [17]:
print(en_batch.shape, fr_batch.shape)
attn = Attention(256, 256, 192)
enc = SimpleEncoder(len(en_vocab), 128, 256, 256, 0.5)
dec = SimpleDecoder(len(fr_vocab), 128, 256, 256, 0.5, attn)
tmp = Seq2Seq(enc, dec, th.device('cpu'))
x = tmp(en_batch, fr_batch)
x.shape

torch.Size([64, 56]) torch.Size([64, 58])


KeyboardInterrupt: 

In [18]:
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)

In [10]:
batch_size = 128

train_iter = DataLoader(train_data, batch_size=batch_size, shuffle=True, collate_fn=generate_batch)
valid_iter = DataLoader(valid_data, batch_size=batch_size, shuffle=True, collate_fn=generate_batch)
test_iter = DataLoader(test_data, batch_size=batch_size, shuffle=True, collate_fn=generate_batch)

In [40]:
to_lang = 'en'

input_dim = len(fr_vocab) if to_lang == 'en' else len(en_vocab)
output_dim = len(en_vocab) if to_lang == 'en' else len(fr_vocab)
enc_emb_dim = 64
dec_emb_dim = 64
enc_hid_dim = 128
dec_hid_dim = 128
attn_dim = 16
enc_dropout = 0.5
dec_dropout = 0.5

attention = Attention(enc_hid_dim, dec_hid_dim, attn_dim)
encoder = SimpleEncoder(input_dim, enc_emb_dim, enc_hid_dim, dec_hid_dim, enc_dropout)
decoder = SimpleDecoder(output_dim, dec_emb_dim, enc_hid_dim, dec_hid_dim, dec_dropout, attention)
model = Seq2Seq(encoder, decoder, device).to(device)

model.apply(init_weights)
model

Seq2Seq(
  (encoder): SimpleEncoder(
    (embedding): Embedding(28871, 64)
    (rnn): GRU(64, 128, batch_first=True, bidirectional=True)
    (fc): Linear(in_features=256, out_features=128, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): SimpleDecoder(
    (attention): Attention(
      (attn): Linear(in_features=384, out_features=16, bias=True)
    )
    (embedding): Embedding(20633, 64)
    (rnn): GRU(320, 128, batch_first=True)
    (fc): Linear(in_features=448, out_features=20633, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [41]:
optimizer = th.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss(ignore_index=en_vocab['<pad>'] if to_lang == 'en' else fr_vocab['<pad>'])

In [42]:
import gc

valid_text = "Il vaut mieux voir une fois que d'entendre mille fois."
valid_text = fr_tokenizer(clean_lines(valid_text))
valid_text = [fr_vocab['<bos>']] + [fr_vocab[token] for token in valid_text] + [fr_vocab['<eos>']]
valid_text = th.tensor(valid_text).unsqueeze(0).to(device)
valid_label = "It is better to see once than to hear a thousand times."

def train(model: nn.Module,
          dataloader: DataLoader,
          optimizer: th.optim.Optimizer,
          criterion: nn.Module,
          clip: float,
          to_lang='en'):

    model.train()
    epoch_loss = 0
    progress_bar = tqdm(dataloader, desc='Training', leave=False, total=len(dataloader))
    for i, (src, trg) in enumerate(progress_bar):
        if to_lang == 'en':
            src, trg = trg, src
        src = src.to(device)
        trg = trg.to(device)
        optimizer.zero_grad()
        output = model(src, trg)
        output_dim = output.shape[-1]
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        loss = criterion(output, trg)
        loss.backward()
        th.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
        progress_bar.set_postfix(loss=loss.item())

        if i % 5 == 0:
            gc.collect()
        th.cuda.empty_cache()

    return epoch_loss / len(dataloader)

def evaluate(model: nn.Module,
             dataloader: DataLoader,
             criterion: nn.Module,
             to_lang='en'):

    model.eval()
    epoch_loss = 0
    with th.no_grad():
        progress_bar = tqdm(dataloader, desc='Evaluating', leave=False, total=len(dataloader))
        for src, trg in progress_bar:
            if to_lang == 'en':
                src, trg = trg, src
            src = src.to(device)
            trg = trg.to(device)
            output = model(src, trg, teacher_forcing_ratio=0) # turn off teacher forcing
            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)
            loss = criterion(output, trg)
            epoch_loss += loss.item()
            progress_bar.set_postfix(loss=loss.item())

        output = model.translate(valid_text, to_lang)
        output = output.squeeze(0)
        output = ' '.join([en_vocab.get_itos()[idx] for idx in output])
        print(f'Predicted: {output}')

        th.cuda.empty_cache()

    return epoch_loss / len(dataloader)

In [43]:
def run(model: nn.Module,
        n_epochs: int,
        train_dataloader: DataLoader,
        valid_dataloader: DataLoader,
        optimizer: th.optim.Optimizer,
        criterion: nn.Module,
        clip: float,
        to_lang='en'):

    train_losses = []
    valid_losses = []
    best_valid_loss = float('inf')
    for epoch in range(n_epochs):
        train_loss = train(model, train_dataloader, optimizer, criterion, clip, to_lang)
        valid_loss = evaluate(model, valid_dataloader, criterion, to_lang)
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            th.save(model.state_dict(), 'seq2seq-model.pt')
            print('Best model saved')
        print(f'Epoch: {epoch+1:02}')
        print(f'\tTrain Loss: {train_loss:.3f}')
        print(f'\t Val. Loss: {valid_loss:.3f}')
        train_losses.append(train_loss)
        valid_losses.append(valid_loss)

        th.cuda.empty_cache()

    return train_losses, valid_losses

In [44]:
# memory usage
print(f'Memory Usage: {th.cuda.memory_allocated() / 1024**3:.2f} GB')

Memory Usage: 1.28 GB


In [None]:
model

                                                                        

src:	<bos> il vaut mieux voir une fois que dentendre mille fois <eos>
pred:	it is not that the that is not to the to the to the to the <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos>
trg:	it is better to see once than to hear a thousand times

Epoch: 01
	Train Loss: 6.143
	 Val. Loss: 6.143


                                                                        

src:	<bos> il vaut mieux voir une fois que dentendre mille fois <eos>
pred:	it is more than to the that i have been to to to to <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos>
trg:	it is better to see once than to hear a thousand times

Epoch: 02
	Train Loss: 5.354
	 Val. Loss: 5.628


                                                                        

src:	<bos> il vaut mieux voir une fois que dentendre mille fois <eos>
pred:	it is better better than to the time to speak with the time <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos>
trg:	it is better to see once than to hear a thousand times

Epoch: 03
	Train Loss: 4.760
	 Val. Loss: 5.326


                                                                        

src:	<bos> il vaut mieux voir une fois que dentendre mille fois <eos>
pred:	it is better better than once again once again once again once again <eos> once again <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos>
trg:	it is better to see once than to hear a thousand times

Epoch: 04
	Train Loss: 4.367
	 Val. Loss: 5.225


                                                                        

src:	<bos> il vaut mieux voir une fois que dentendre mille fois <eos>
pred:	it is better better than once again once again once again once again once again once again <eos> once again <eos> once again <eos> once again <eos> once again <eos> once again <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> once again <eos> once again <eos> once again <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> once again <eos> once again <eos> once again <eos> <eos> <eos> <eos> <eos> <eos> <eos> once again <eos> once again <eos> once again <eos> <eos> <eos> <eos> <eos> <eos> <eos> once again <eos> once again <eos> once again <eos> <eos> <eos> <eos> <eos>
trg:	it is better to see once than to hear a thousand times

Epoch: 05
	Train Loss: 4.117
	 Val. Loss: 5.157


Training:  75%|███████▍  | 588/785 [06:01<02:04,  1.58it/s, loss=3.74]

In [38]:
def translate(model: nn.Module, input_text: str):
    model.eval()
    tokenizer = fr_tokenizer if to_lang == 'en' else en_tokenizer
    if to_lang == 'en':
        from_vocab, to_vocab = fr_vocab, en_vocab
    else:
        from_vocab, to_vocab = en_vocab, fr_vocab

    input_text = tokenizer(clean_lines(input_text))
    input_text = [from_vocab['<bos>']] + [from_vocab[token] for token in input_text] + [from_vocab['<eos>']]
    input_text = th.tensor(input_text).unsqueeze(0).to(device)
    output = model.translate(input_text)
    output = ' '.join(to_vocab.get_itos()[idx] for idx in output.squeeze(0))
    return output

model.load_state_dict(th.load('seq2seq-model.pt'))
translate(model, 'Je suis un étudiant.')

'<unk> i am a supporter of the <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos>'