# Sequence-to-sequence learning

Seq2seq is a deep learning paradigm built on an encoder-decoder architecture. In general, seq2seq uses two models of the same kind: one acts as the encoder and the other as the decoder.


In [1]:
import torch 
from torch import nn 
import random 
import numpy as np
import spacy 
import datasets 
import torchtext
import tqdm
import evaluate

In [2]:
seed = 0 
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True
en_nlp = spacy.load("en_core_web_sm")
de_nlp = spacy.load("de_core_news_sm")

We use the Multi30k dataset, which contains two language features: `en` (English) and `de` (German).

In [3]:
dataset = datasets.load_dataset("bentrevett/multi30k")
dataset

DatasetDict({
    train: Dataset({
        features: ['en', 'de'],
        num_rows: 29000
    })
    validation: Dataset({
        features: ['en', 'de'],
        num_rows: 1014
    })
    test: Dataset({
        features: ['en', 'de'],
        num_rows: 1000
    })
})

The data is strictly paired: each example already contains both the `en` sentence and its `de` counterpart.

In [4]:
train_data, valid_data, test_data = (
    dataset["train"],
    dataset["validation"],
    dataset["test"],
)
train_data[0]

{'en': 'Two young, White males are outside near many bushes.',
 'de': 'Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.'}

## Tokenizer
Next, we tokenize each sentence, i.e. split it into individual tokens.

For example:

`good morning` -> `[good, morning]`

For this we use spaCy's `en_core_web_sm` and `de_core_news_sm` models, which provide the tokenizers for English and German respectively.


In [5]:
string  = "what a lovely day today"
[token.text for token in en_nlp.tokenizer(string)] 

['what', 'a', 'lovely', 'day', 'today']

### Creating the tokenizer function


In [6]:
def tokenize(examples, en_nlp, de_nlp, max_length, lower, sos_token, eos_token):
    # split each sentence into tokens, keeping at most max_length of them
    en_tokens = [token.text for token in en_nlp.tokenizer(examples["en"])[:max_length]]
    de_tokens = [token.text for token in de_nlp.tokenizer(examples["de"])[:max_length]]
    if lower:
        en_tokens = [token.lower() for token in en_tokens]
        de_tokens = [token.lower() for token in de_tokens]
    # add the start/end markers whether or not we lowercase
    en_tokens = [sos_token] + en_tokens + [eos_token]
    de_tokens = [sos_token] + de_tokens + [eos_token]
    return {"en_tokens": en_tokens, "de_tokens": de_tokens}
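To see what the function does without loading the spaCy models, here is a minimal sketch that swaps in `str.split` as a stand-in tokenizer. The helper name `toy_tokenize` is hypothetical, and whitespace splitting is only an assumption for illustration: unlike spaCy, it does not separate punctuation from words.

```python
def toy_tokenize(text, max_length, lower, sos_token, eos_token):
    # str.split is only a stand-in for the spaCy tokenizer (assumption).
    tokens = text.split()[:max_length]
    if lower:
        tokens = [token.lower() for token in tokens]
    # Wrap the sequence in start/end markers whether or not we lowercase.
    return [sos_token] + tokens + [eos_token]


toy_tokens = toy_tokenize("What a lovely day today", 1_000, True, "<sos>", "<eos>")
print(toy_tokens)
# ['<sos>', 'what', 'a', 'lovely', 'day', 'today', '<eos>']
```

Truncation happens before the markers are added, so a sequence can be up to `max_length + 2` tokens long.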
        

In [7]:
max_length  =1_000
lower = True
sos_token  = "<sos>"
eos_token = "<eos>"

fn_kwargs = {
    "en_nlp"  : en_nlp,
    "de_nlp"  : de_nlp ,
    "max_length" : max_length, 
    "lower" :lower,
    "sos_token" : sos_token,
    "eos_token" : eos_token
}
 
train_data = train_data.map(tokenize,fn_kwargs=fn_kwargs)
valid_data= valid_data.map(tokenize,fn_kwargs=fn_kwargs)
test_data= test_data.map(tokenize,fn_kwargs=fn_kwargs)

In [8]:
train_data[0]

{'en': 'Two young, White males are outside near many bushes.',
 'de': 'Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.',
 'en_tokens': ['<sos>',
  'two',
  'young',
  ',',
  'white',
  'males',
  'are',
  'outside',
  'near',
  'many',
  'bushes',
  '.',
  '<eos>'],
 'de_tokens': ['<sos>',
  'zwei',
  'junge',
  'weiße',
  'männer',
  'sind',
  'im',
  'freien',
  'in',
  'der',
  'nähe',
  'vieler',
  'büsche',
  '.',
  '<eos>']}

Next, we'll build the _vocabulary_ for the source and target languages. The vocabulary is used to associate each unique token in our dataset with an index (an integer), e.g. "hello" = 1, "world" = 2, "bye" = 3, "hates" = 4, etc. When feeding text data to our model, we convert the string into tokens and then the tokens into numbers using the vocabulary as a look up table, e.g. "hello world" becomes `["hello", "world"]` which becomes `[1, 2]` using the example indices given. We do this as neural networks cannot operate on strings, only numerical values.


In short, the vocabulary simply maps each token to a number.
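As a rough illustration of that mapping, the sketch below builds a tiny vocabulary with plain Python. The helper `build_toy_vocab` and the toy corpus are hypothetical, and the ordering rule (most frequent first, ties broken alphabetically) is an assumption made so the result is deterministic; it is not a description of how torchtext orders tokens.

```python
from collections import Counter


def build_toy_vocab(token_lists, min_freq, specials):
    # Count how often each token occurs across all examples.
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Special tokens come first, so the first special gets index 0.
    itos = list(specials)
    # Keep tokens seen at least `min_freq` times, most frequent first
    # (ties broken alphabetically; an assumption for this sketch).
    for token, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_freq and token not in specials:
            itos.append(token)
    stoi = {token: index for index, token in enumerate(itos)}
    return stoi, itos


toy_corpus = [["hello", "world"], ["hello", "again"], ["bye", "world"]]
toy_stoi, toy_itos = build_toy_vocab(toy_corpus, min_freq=2, specials=["<unk>", "<pad>"])
print(toy_itos)  # ['<unk>', '<pad>', 'hello', 'world']

toy_ids = [toy_stoi[token] for token in ["hello", "world"]]
print(toy_ids)  # [2, 3]
```

Tokens that appear fewer than `min_freq` times ("again" and "bye" here) are left out of the vocabulary entirely.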

In [9]:
import torchtext.vocab
# <unk> marks unknown (out-of-vocabulary) tokens, <pad> is the padding token

min_freq =2 
unk_token = "<unk>"
pad_token="<pad>"

special_tokens = [
    unk_token,
    pad_token,
    sos_token,
    eos_token
]

en_vocab = torchtext.vocab.build_vocab_from_iterator(
    train_data["en_tokens"],
    min_freq=min_freq,
    specials=special_tokens,
)

de_vocab = torchtext.vocab.build_vocab_from_iterator(
    train_data["de_tokens"],
    min_freq=min_freq,
    specials=special_tokens,
)



In [10]:
en_vocab.get_itos()[:10]

['<unk>', '<pad>', '<sos>', '<eos>', 'a', '.', 'in', 'the', 'on', 'man']

In [11]:
en_vocab.get_itos()[9]

'man'

In [12]:
en_vocab.get_stoi()["the"]  # get_stoi() returns the string-to-index mapping
len(en_vocab), len(de_vocab)

(5893, 7853)

In [13]:
# And we can get the token corresponding to that index to prove it's the `<unk>` token.

en_vocab.get_itos()[0]

'<unk>'

# TorchText Vocab Gist

Below is a concise example showcasing how to build a vocabulary in **TorchText**, 
handle special tokens (like `<eos>`, `<sos>`, `<unk>`, `<pad>`), and convert between 
tokens and indices (via `stoi` and `itos` lookups).

## Special Tokens

- `<eos>` = End Of Sentence  
- `<sos>` = Start Of Sentence  
- `<unk>` = Unknown token (for out-of-vocabulary words)  
- `<pad>` = Padding token  

## Example Workflow

1. **Install/Import Dependencies**
2. **Define Special Tokens**
3. **Define Tokenizer and Preprocessing**
   - Add `<sos>` and `<eos>` to the sequence before/after tokenizing.
4. **Build Vocabulary**  
   - Use `build_vocab_from_iterator` to construct the vocabulary from your dataset.
5. **Lookup Tokens/Indices**
   - `vocab[token]` → returns the index (`stoi`: string-to-index).
   - `vocab.lookup_token(index)` → returns the token (`itos`: index-to-string).
   - `vocab.lookup_tokens(indices)` → batch lookup for multiple indices.
6. **Default Index**  
   - Once `set_default_index` has been called, a token that is not in the vocabulary maps to the `<unk>` index.

---

```python
import torch
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# 1. Define your special tokens
SPECIAL_TOKENS = ["<unk>", "<pad>", "<sos>", "<eos>"]

# 2. Create a tokenizer
#    Using a basic tokenizer provided by torchtext
tokenizer = get_tokenizer("basic_english")

# 3. Add <sos> and <eos> around your tokens 
def yield_tokens(data_iter):
    """
    data_iter should yield raw text strings.
    We wrap each sentence with <sos> and <eos>.
    """
    for text in data_iter:
        yield [ "<sos>" ] + tokenizer(text) + [ "<eos>" ]

# Example dataset
train_data = [
    "hello world",
    "how are you doing",
    "hello again"
]

# 4. Build the vocabulary
#    We pass the token generator to build_vocab_from_iterator
vocab = build_vocab_from_iterator(
    yield_tokens(train_data),
    specials=SPECIAL_TOKENS
)

# Set the default index to <unk> token
vocab.set_default_index(vocab["<unk>"])

# 5. Look up tokens or indices
sample_text = "hello world"
tokens = [ "<sos>" ] + tokenizer(sample_text) + [ "<eos>" ]
print("Tokens:", tokens)

# Convert tokens to indices (stoi)
indices = vocab(tokens)
print("Indices:", indices)

# Convert indices back to tokens (itos)
restored_tokens = vocab.lookup_tokens(indices)
print("Restored Tokens:", restored_tokens)

# 6. Direct stoi and itos usage
idx_of_hello = vocab["hello"]
token_of_idx = vocab.lookup_token(idx_of_hello)
print(f"Index of 'hello': {idx_of_hello}")
print(f"Token at index {idx_of_hello}: {token_of_idx}")
```

We already know that our `<unk>` token has index zero, as it is the first element in our special tokens list, but we can also check that both vocabularies use the same index for the unknown token and for the padding token.

In [14]:
assert en_vocab[unk_token] == de_vocab[unk_token]
assert en_vocab[pad_token] == de_vocab[pad_token]

unk_index = en_vocab[unk_token]
pad_index = en_vocab[pad_token]

Using the `set_default_index` method we can set what value is returned when we try and get the index of a token outside of our vocabulary. In this case, the index of the unknown token, `<unk>`.
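The fallback behaviour can be pictured with a small stand-in class. `ToyVocab` below is hypothetical and only mimics the lookup described above; in particular, raising `KeyError` when no default is set is a choice made for this sketch, not a statement about what the torchtext `Vocab` class does.

```python
class ToyVocab:
    def __init__(self, itos):
        self.itos = list(itos)
        self.stoi = {token: index for index, token in enumerate(self.itos)}
        self.default_index = None

    def set_default_index(self, index):
        # Index returned for any token that is not in the vocabulary.
        self.default_index = index

    def __getitem__(self, token):
        if token in self.stoi:
            return self.stoi[token]
        if self.default_index is None:
            # Sketch-only behaviour when no fallback has been configured.
            raise KeyError(token)
        return self.default_index


toy_vocab = ToyVocab(["<unk>", "<pad>", "the"])
toy_vocab.set_default_index(toy_vocab["<unk>"])
print(toy_vocab["the"])  # 2
print(toy_vocab["The"])  # 0, because "The" (capitalised) is not in the vocabulary
```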


In [15]:
en_vocab.set_default_index(unk_index)
de_vocab.set_default_index(unk_index)

In [16]:
en_vocab["The"]

0

In [17]:
en_vocab.get_itos()[0]

'<unk>'

We can see the numerical representation of each example by applying the `numericalize_example` function.

In [18]:
def numericalize_example(example, en_vocab, de_vocab):
    en_ids = en_vocab.lookup_indices(example["en_tokens"])
    de_ids = de_vocab.lookup_indices(example["de_tokens"])
    return {"en_ids": en_ids, "de_ids": de_ids}

In [19]:
fn_kwargs = {"en_vocab": en_vocab, "de_vocab": de_vocab}

train_data = train_data.map(numericalize_example, fn_kwargs=fn_kwargs)
valid_data = valid_data.map(numericalize_example, fn_kwargs=fn_kwargs)
test_data = test_data.map(numericalize_example, fn_kwargs=fn_kwargs)

In [20]:
train_data[0]

{'en': 'Two young, White males are outside near many bushes.',
 'de': 'Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.',
 'en_tokens': ['<sos>',
  'two',
  'young',
  ',',
  'white',
  'males',
  'are',
  'outside',
  'near',
  'many',
  'bushes',
  '.',
  '<eos>'],
 'de_tokens': ['<sos>',
  'zwei',
  'junge',
  'weiße',
  'männer',
  'sind',
  'im',
  'freien',
  'in',
  'der',
  'nähe',
  'vieler',
  'büsche',
  '.',
  '<eos>'],
 'en_ids': [2, 16, 24, 15, 25, 778, 17, 57, 80, 202, 1312, 5, 3],
 'de_ids': [2, 18, 26, 253, 30, 84, 20, 88, 7, 15, 110, 7647, 3171, 4, 3]}

In [21]:
en_vocab.lookup_tokens(train_data[0]["en_ids"])

['<sos>',
 'two',
 'young',
 ',',
 'white',
 'males',
 'are',
 'outside',
 'near',
 'many',
 'bushes',
 '.',
 '<eos>']

One other thing that the `datasets` library handles for us with the `Dataset` class is converting features to the correct type. Our indices in each example are currently basic Python integers. However, they need to be converted to PyTorch tensors in order to use them with PyTorch. We could convert them just before we pass them into the model, however it is more convenient to do it now.

The `with_format` method converts features indicated by the `columns` argument to a given `type`. Here, we specify the type as "torch" (for PyTorch) and the columns to be "en_ids" and "de_ids" (the features which we want to convert to PyTorch tensors). By default, `with_format` will remove any features not in the list of features passed to `columns`. We want to keep those features, which we can do with `output_all_columns=True`.


In [22]:
data_type = "torch"
format_columns = ["en_ids", "de_ids"]

train_data = train_data.with_format(
    type=data_type, columns=format_columns, output_all_columns=True
)

valid_data = valid_data.with_format(
    type=data_type,
    columns=format_columns,
    output_all_columns=True,
)

test_data = test_data.with_format(
    type=data_type,
    columns=format_columns,
    output_all_columns=True,
)

In [23]:
train_data[0]  # en_ids and de_ids are now PyTorch tensors

{'en_ids': tensor([   2,   16,   24,   15,   25,  778,   17,   57,   80,  202, 1312,    5,
            3]),
 'de_ids': tensor([   2,   18,   26,  253,   30,   84,   20,   88,    7,   15,  110, 7647,
         3171,    4,    3]),
 'en': 'Two young, White males are outside near many bushes.',
 'de': 'Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.',
 'en_tokens': ['<sos>',
  'two',
  'young',
  ',',
  'white',
  'males',
  'are',
  'outside',
  'near',
  'many',
  'bushes',
  '.',
  '<eos>'],
 'de_tokens': ['<sos>',
  'zwei',
  'junge',
  'weiße',
  'männer',
  'sind',
  'im',
  'freien',
  'in',
  'der',
  'nähe',
  'vieler',
  'büsche',
  '.',
  '<eos>']}

# Creating data loaders
The final step of preparing the data is to create the data loaders. These can be iterated upon to return a batch of data, each batch being a dictionary containing the numericalized English and German sentences (which have also been padded) as PyTorch tensors.

First, we need to create a function that collates, i.e. combines, a batch of examples into a batch. The `collate_fn` below takes a "batch" as input (a list of examples), we then separate out the English and German indices for each example in the batch, and pass each one to the `pad_sequence` function. `pad_sequence` takes a list of tensors, pads each one to the length of the longest tensor using the `padding_value` (which we set to `pad_index`, the index of our `<pad>` token) and then returns a `[max length, batch size]` shaped tensor, where `batch size` is the number of examples in the batch and `max length` is the length of the longest tensor in the batch. We put each tensor into a dictionary and then return it.

The `get_collate_fn` takes in the padding token index and returns the `collate_fn` defined inside it. This technique, of defining a function inside another and returning it, is known as a [closure](<https://en.wikipedia.org/wiki/Closure_(computer_programming)>). It allows the `collate_fn` to continually use the value of `pad_index` it was created with without creating a class or using global variables. 
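The padding and the closure can both be sketched with plain Python lists. `get_toy_collate_fn` is a hypothetical stand-in that handles a single list of id sequences; it imitates the `[max length, batch size]` layout described above but does not call `pad_sequence` or produce tensors.

```python
def get_toy_collate_fn(pad_index):
    def collate_fn(batch):
        # `pad_index` is captured from the enclosing scope (the closure).
        max_length = max(len(ids) for ids in batch)
        padded = [ids + [pad_index] * (max_length - len(ids)) for ids in batch]
        # Transpose [batch size, max length] into [max length, batch size].
        return [list(column) for column in zip(*padded)]

    return collate_fn


toy_collate = get_toy_collate_fn(pad_index=1)
toy_batch = toy_collate([[2, 16, 3], [2, 4, 9, 5, 3]])
print(toy_batch)  # [[2, 2], [16, 4], [3, 9], [1, 5], [1, 3]]
```

Each inner list is one time step across the batch, which is why the first row contains only `<sos>` indices and the shorter sentence ends in padding.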

In [37]:
def get_collate_fn(pad_index):
    def collate_fn(batch):
        # gather the index tensors per language, then pad them to the longest sequence in the batch
        batch_en_ids = [example["en_ids"] for example in batch]
        batch_de_ids = [example["de_ids"] for example in batch]
        batch_en_ids = nn.utils.rnn.pad_sequence(batch_en_ids, padding_value=pad_index)
        batch_de_ids = nn.utils.rnn.pad_sequence(batch_de_ids, padding_value=pad_index)
        batch = {
            "en_ids": batch_en_ids,
            "de_ids": batch_de_ids,
        }
        return batch

    return collate_fn
    

In [38]:
def get_data_loader(dataset, batch_size, pad_index, shuffle=False):
    collate_fn = get_collate_fn(pad_index)
    data_loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=batch_size,
        collate_fn=collate_fn,
        shuffle=shuffle,
    )
    return data_loader

In [39]:
batch_size = 128

train_data_loader = get_data_loader(train_data, batch_size, pad_index, shuffle=True)
valid_data_loader = get_data_loader(valid_data, batch_size, pad_index)
test_data_loader = get_data_loader(test_data, batch_size, pad_index)

In [41]:
for i,v in enumerate(train_data_loader):
    if i<2:
        print(f"value of the ids that we numericalise :  \n {v['en_ids']}, \n Total length of the sentence {len(v['en_ids'])}")

value of the ids that we numericalise :  
 tensor([[   2,    2,    2,  ...,    2,    2,    2],
        [  48,   74,    4,  ...,    4,    4, 4345],
        [  30,   19,  444,  ..., 1430,    9,   10],
        ...,
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1]]), 
 Total length of the sentence 34
value of the ids that we numericalise :  
 tensor([[  2,   2,   2,  ...,   2,   2,   2],
        [  4,   4,   4,  ...,   4,   4,   4],
        [  9, 485,  24,  ..., 357,  14,   9],
        ...,
        [  1,   1,   1,  ...,   1,   1,   1],
        [  1,   1,   1,  ...,   1,   1,   1],
        [  1,   1,   1,  ...,   1,   1,   1]]), 
 Total length of the sentence 30


### Seq2Seq

For the final part of the implementation, we'll implement the seq2seq model. This will handle:

-   receiving the input/source sentence
-   using the encoder to produce the context vectors
-   using the decoder to produce the predicted output/target sentence

Our full model will look like this:

![](assets/seq2seq4.png)

The `Seq2Seq` model takes in an `Encoder`, `Decoder`, and a `device` (used to place tensors on the GPU, if it exists).

For this implementation, we have to ensure that the number of layers and the hidden (and cell) dimensions are equal in the `Encoder` and `Decoder`. This is not a general requirement: a sequence-to-sequence model does not necessarily need the same number of layers or the same hidden dimension sizes. However, if we did something like having a different number of layers then we would need to make decisions about how this is handled. For example, if our encoder has 2 layers and our decoder only has 1, how is this handled? Do we average the two context vectors output by the encoder? Do we pass both through a linear layer? Do we only use the context vector from the highest layer? Etc.

Our `forward` method takes the source sentence, target sentence and a teacher forcing ratio. The teacher forcing ratio is used when training our model. When decoding, at each time-step we will predict what the next token in the target sequence will be from the previous tokens decoded, $\hat{y}_{t+1}=f(s_t^L)$. With probability equal to the teacher forcing ratio (`teacher_forcing_ratio`) we will use the actual ground-truth next token in the sequence as the input to the decoder during the next time-step. However, with probability `1 - teacher_forcing_ratio`, we will use the token that the model predicted as the next input to the model, even if it doesn't match the actual next token in the sequence.

The first thing we do in the `forward` method is to create an `outputs` tensor that will store all of our predictions, $\hat{Y}$.

We then feed the input/source sentence, `src`, into the encoder and receive its final hidden and cell states.

The first input to the decoder is the start of sequence (`<sos>`) token. As our `trg` tensor already has the `<sos>` token prepended (all the way back when we tokenized our English sentences) we get our $y_1$ by slicing into it. We know how long our target sentences should be (`trg_length`), so we loop that many times. The last token input into the decoder is the one **before** the `<eos>` token -- the `<eos>` token is never input into the decoder.

During each iteration of the loop, we:

-   pass the input, previous hidden and previous cell states ($y_t, s_{t-1}, c_{t-1}$) into the decoder
-   receive a prediction, next hidden state and next cell state ($\hat{y}_{t+1}, s_{t}, c_{t}$) from the decoder
-   place our prediction, $\hat{y}_{t+1}$/`output` in our tensor of predictions, $\hat{Y}$/`outputs`
-   decide if we are going to "teacher force" or not
    -   if we do, the next `input` is the ground-truth next token in the sequence, $y_{t+1}$/`trg[t]`
    -   if we don't, the next `input` is the predicted next token in the sequence, $\hat{y}_{t+1}$/`top1`, which we get by doing an `argmax` over the output tensor

Once we've made all of our predictions, we return our tensor full of predictions, $\hat{Y}$/`outputs`.
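The loop described above can be sketched without any neural network. In this hypothetical example, `toy_decode` and `echo_plus_one` are made-up names, a single `state` value stands in for the hidden and cell states, and the "prediction" is already a token id, so the `argmax` step of the real model is skipped.

```python
import random


def toy_decode(trg, decoder_step, teacher_forcing_ratio, rng):
    # `trg` is a list of token ids that starts with <sos> and ends with <eos>.
    outputs = [None] * len(trg)  # position 0 is never filled
    decoder_input = trg[0]  # the <sos> token
    state = None  # stands in for the hidden and cell states
    for t in range(1, len(trg)):
        prediction, state = decoder_step(decoder_input, state)
        outputs[t] = prediction
        teacher_force = rng.random() < teacher_forcing_ratio
        # Next input: ground truth if teacher forcing, else our own prediction.
        decoder_input = trg[t] if teacher_force else prediction
    return outputs


def echo_plus_one(token, state):
    # Hypothetical decoder step: "predicts" the input id plus one.
    return token + 1, state


toy_trg = [2, 10, 20, 30, 3]
forced = toy_decode(toy_trg, echo_plus_one, 1.0, random.Random(0))
free = toy_decode(toy_trg, echo_plus_one, 0.0, random.Random(0))
print(forced)  # [None, 3, 11, 21, 31]
print(free)    # [None, 3, 4, 5, 6]
```

With a ratio of 1.0 every input comes from `toy_trg`, so the last input is 30, the token before `<eos>`. With a ratio of 0.0 the decoder only ever sees `<sos>` and its own predictions.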

**Note**: our decoder loop starts at 1, not 0. This means the 0th element of our `outputs` tensor remains all zeros. So our `trg` and `outputs` look something like:

$$
\begin{align*}
\text{trg} = [<sos>, &y_1, y_2, y_3, <eos>]\\
\text{outputs} = [0, &\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]
\end{align*}
$$

Later on when we calculate the loss, we cut off the first element of each tensor to get:

$$
\begin{align*}
\text{trg} = [&y_1, y_2, y_3, <eos>]\\
\text{outputs} = [&\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]
\end{align*}
$$
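The same alignment can be shown with plain lists. This is only an illustrative sketch: `drop_first_step` is a hypothetical helper, and the string placeholders stand in for the target ids and the prediction vectors.

```python
def drop_first_step(trg, outputs):
    # Position 0 holds <sos> in `trg` and an unused slot in `outputs`.
    return trg[1:], outputs[1:]


toy_trg = ["<sos>", "y1", "y2", "y3", "<eos>"]
toy_outputs = [None, "p1", "p2", "p3", "p4"]

loss_trg, loss_outputs = drop_first_step(toy_trg, toy_outputs)
aligned = list(zip(loss_outputs, loss_trg))
print(aligned)
# [('p1', 'y1'), ('p2', 'y2'), ('p3', 'y3'), ('p4', '<eos>')]
```

After the slice, each prediction is paired with the token it was supposed to produce, and the final prediction is compared against `<eos>`.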
