# Sequence-to-sequence learning

Seq2seq is a deep learning paradigm built on an encoder-decoder architecture. In general, seq2seq uses two models of the same kind, one acting as the encoder and the other as the decoder.


In [1]:
import torch 
from torch import nn 
import random 
import numpy as np
import spacy 
import datasets 
import torchtext
import tqdm
import evaluate

In [2]:
seed = 0 
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True
en_nlp = spacy.load("en_core_web_sm")
de_nlp = spacy.load("de_core_news_sm")

We use the Multi30K dataset, which contains two language features: `en` (English) and `de` (German).

In [3]:
dataset = datasets.load_dataset("bentrevett/multi30k")
dataset

DatasetDict({
    train: Dataset({
        features: ['en', 'de'],
        num_rows: 29000
    })
    validation: Dataset({
        features: ['en', 'de'],
        num_rows: 1014
    })
    test: Dataset({
        features: ['en', 'de'],
        num_rows: 1000
    })
})

The data is strictly paired: every example already contains both the `en` sentence and its `de` counterpart.

In [4]:
train_data, valid_data, test_data = (
    dataset["train"],
    dataset["validation"],
    dataset["test"],
)
train_data[0]

{'en': 'Two young, White males are outside near many bushes.',
 'de': 'Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.'}

## Tokenizer
Each sentence is tokenized, that is, split into individual tokens.

For example:

`good morning` -> `[good, morning]`

Tokenization here uses spaCy's `en_core_web_sm` and `de_core_news_sm` models, which provide the tokenizers for `en` and `de`.


In [5]:
string  = "what a lovely day today"
[token.text for token in en_nlp.tokenizer(string)] 

['what', 'a', 'lovely', 'day', 'today']
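
To make the splitting step concrete without loading a spaCy model, here is a minimal sketch based on a regular expression. `simple_tokenize` is a hypothetical helper, not part of spaCy, and its rule (runs of word characters, or single punctuation marks) is only a rough stand-in for spaCy's tokenizer, which handles many more cases such as contractions and abbreviations.

```python
import re


def simple_tokenize(text):
    # Hypothetical stand-in for a real tokenizer: a token is either a run of
    # word characters or a single non-space, non-word character (punctuation).
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Two young, White males are outside near many bushes.")
print(tokens)
```

On this sentence the rule separates the comma and the final period from the neighbouring words, which is the behaviour the rest of the notebook relies on.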

### Building the tokenizer function


In [7]:
def tokenize(examples, en_nlp, de_nlp, max_length, lower, sos_token, eos_token):
    # Tokenize with spaCy and keep at most max_length tokens per sentence.
    en_tokens = [token.text for token in en_nlp.tokenizer(examples["en"])[:max_length]]
    de_tokens = [token.text for token in de_nlp.tokenizer(examples["de"])[:max_length]]
    if lower:
        en_tokens = [token.lower() for token in en_tokens]
        de_tokens = [token.lower() for token in de_tokens]
    # Always add the start/end markers, whether or not the tokens were lowercased.
    en_tokens = [sos_token] + en_tokens + [eos_token]
    de_tokens = [sos_token] + de_tokens + [eos_token]
    return {"en_tokens": en_tokens, "de_tokens": de_tokens}
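
The order of the steps in `tokenize` matters: the sentence is truncated first, then lowercased, and only then wrapped in the start and end markers, so the markers are never cut off by `max_length`. The sketch below replays those steps on a token list that has already been split. `wrap_tokens` is a hypothetical helper introduced for illustration and does not call spaCy.

```python
def wrap_tokens(tokens, max_length, lower, sos_token="<sos>", eos_token="<eos>"):
    # Hypothetical helper mirroring the post-processing steps of `tokenize`.
    tokens = tokens[:max_length]                # 1. truncate
    if lower:
        tokens = [t.lower() for t in tokens]    # 2. optionally lowercase
    return [sos_token] + tokens + [eos_token]   # 3. add the markers last


short = wrap_tokens(["What", "a", "lovely", "day", "today"], max_length=3, lower=True)
print(short)  # ['<sos>', 'what', 'a', 'lovely', '<eos>']
```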
        

In [9]:
max_length = 1_000
lower = True
sos_token = "<sos>"
eos_token = "<eos>"

fn_kwargs = {
    "en_nlp": en_nlp,
    "de_nlp": de_nlp,
    "max_length": max_length,
    "lower": lower,
    "sos_token": sos_token,
    "eos_token": eos_token,
}

train_data = train_data.map(tokenize, fn_kwargs=fn_kwargs)
valid_data = valid_data.map(tokenize, fn_kwargs=fn_kwargs)
test_data = test_data.map(tokenize, fn_kwargs=fn_kwargs)


In [10]:
train_data[0]

{'en': 'Two young, White males are outside near many bushes.',
 'de': 'Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.',
 'en_tokens': ['<sos>',
  'two',
  'young',
  ',',
  'white',
  'males',
  'are',
  'outside',
  'near',
  'many',
  'bushes',
  '.',
  '<eos>'],
 'de_tokens': ['<sos>',
  'zwei',
  'junge',
  'weiße',
  'männer',
  'sind',
  'im',
  'freien',
  'in',
  'der',
  'nähe',
  'vieler',
  'büsche',
  '.',
  '<eos>']}

Next, we'll build the _vocabulary_ for the source and target languages. The vocabulary is used to associate each unique token in our dataset with an index (an integer), e.g. "hello" = 1, "world" = 2, "bye" = 3, "hates" = 4, etc. When feeding text data to our model, we convert the string into tokens and then the tokens into numbers using the vocabulary as a look up table, e.g. "hello world" becomes `["hello", "world"]` which becomes `[1, 2]` using the example indices given. We do this as neural networks cannot operate on strings, only numerical values.


In short, the vocabulary simply maps each token to a number.
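
As a rough illustration of what a vocabulary builder does, the sketch below counts tokens, drops those seen fewer than `min_freq` times, and assigns indices with the special tokens first. `build_vocab` is a hypothetical pure-Python helper, not the torchtext implementation, and its tie-breaking order among equally frequent tokens is an assumption.

```python
from collections import Counter


def build_vocab(token_lists, min_freq=1, specials=()):
    # Count how often each token occurs across all sentences.
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Special tokens take the first indices, in the order given.
    itos = list(specials)
    # Remaining tokens: most frequent first, ties broken alphabetically.
    for token, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and token not in specials:
            itos.append(token)
    stoi = {token: index for index, token in enumerate(itos)}
    return stoi, itos


sentences = [["hello", "world"], ["hello", "again"], ["bye", "world"]]
stoi, itos = build_vocab(sentences, min_freq=2, specials=["<unk>", "<pad>"])
print(itos)  # ['<unk>', '<pad>', 'hello', 'world']
```

With `min_freq=2`, the tokens `again` and `bye` occur only once and are left out, so they would later have to be represented by `<unk>`.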

In [None]:
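import torchtext.vocab

# <unk> replaces unknown (out-of-vocabulary) tokens; <pad> is the padding token.
# Tokens seen fewer than min_freq times in the training data are left out of the vocabulary.
min_freq = 2
unk_token = "<unk>"
pad_token = "<pad>"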

special_tokens = [
    unk_token,
    pad_token,
    sos_token,
    eos_token
]

en_vocab = torchtext.vocab.build_vocab_from_iterator(
    train_data["en_tokens"],
    min_freq=min_freq,
    specials=special_tokens,
)

de_vocab = torchtext.vocab.build_vocab_from_iterator(
    train_data["de_tokens"],
    min_freq=min_freq,
    specials=special_tokens,
)

In [13]:
en_vocab.get_itos()[:10]

['<unk>', '<pad>', '<sos>', '<eos>', 'a', '.', 'in', 'the', 'on', 'man']

In [14]:
en_vocab.get_itos()[9]

'man'

In [17]:
en_vocab.get_stoi()["the"]  # get_stoi() returns the string-to-index mapping
len(en_vocab), len(de_vocab)

(5893, 7853)

In [19]:
# Index 0 maps back to the `<unk>` token, the first entry in `special_tokens`.

en_vocab.get_itos()[0]

'<unk>'
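
Reserving an index for `<unk>` is what lets unseen words be encoded at all. The sketch below copies a few entries from the `en_vocab.get_itos()[:10]` output above into a plain dictionary and uses `dict.get` to show the fallback. `encode` and `decode` are hypothetical helpers; with a torchtext `Vocab`, the same fallback requires calling `set_default_index` first.

```python
# Toy mapping using a handful of the indices printed earlier in this notebook.
stoi = {"<unk>": 0, "<pad>": 1, "<sos>": 2, "<eos>": 3, "a": 4, "man": 9}
itos = {index: token for token, index in stoi.items()}


def encode(tokens):
    # Tokens missing from the mapping fall back to the <unk> index.
    return [stoi.get(token, stoi["<unk>"]) for token in tokens]


def decode(indices):
    return [itos[index] for index in indices]


ids = encode(["<sos>", "a", "man", "juggles", "<eos>"])
print(ids)          # [2, 4, 9, 0, 3]
print(decode(ids))  # ['<sos>', 'a', 'man', '<unk>', '<eos>']
```

The round trip is lossy by design: once `juggles` has been mapped to index 0, decoding can only recover `<unk>`.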

# TorchText Vocab Gist

Below is a concise example showcasing how to build a vocabulary in **TorchText**, 
handle special tokens (like `<eos>`, `<sos>`, `<unk>`, `<pad>`), and convert between 
tokens and indices (via `stoi` and `itos` lookups).

## Special Tokens

- `<eos>` = End Of Sentence  
- `<sos>` = Start Of Sentence  
- `<unk>` = Unknown token (for out-of-vocabulary words)  
- `<pad>` = Padding token  
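
To show where `<pad>` comes in, here is a minimal sketch that pads already-numericalized sentences of different lengths to a common length so they can be stacked into one batch. `pad_batch` is a hypothetical helper, and index 1 for `<pad>` is an assumption that matches the special-token order used in this document.

```python
PAD_INDEX = 1  # assumed index of <pad>


def pad_batch(sequences, pad_index=PAD_INDEX):
    # Pad every sequence on the right up to the length of the longest one.
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]


batch = pad_batch([[2, 4, 9, 3], [2, 4, 3]])
print(batch)  # [[2, 4, 9, 3], [2, 4, 3, 1]]
```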

## Example Workflow

1. **Install/Import Dependencies**
2. **Define Special Tokens**
3. **Define Tokenizer and Preprocessing**
   - Add `<sos>` and `<eos>` to the sequence before/after tokenizing.
4. **Build Vocabulary**  
   - Use `build_vocab_from_iterator` to construct the vocabulary from your dataset.
5. **Lookup Tokens/Indices**
   - `vocab[token]` → returns the index (`stoi`: string-to-index).
   - `vocab.lookup_token(index)` → returns the token (`itos`: index-to-string).
   - `vocab.lookup_tokens(indices)` → batch lookup for multiple indices.
6. **Default Index**  
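   - A token that is not in the vocabulary maps to the default index, which is set to the `<unk>` index with `set_default_index`; without a default index, the lookup raises an error.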

---

```python
import torch
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# 1. Define your special tokens
SPECIAL_TOKENS = ["<unk>", "<pad>", "<sos>", "<eos>"]

# 2. Create a tokenizer
#    Using a basic tokenizer provided by torchtext
tokenizer = get_tokenizer("basic_english")

# 3. Add <sos> and <eos> around your tokens 
def yield_tokens(data_iter):
    """
    data_iter should yield raw text strings.
    We wrap each sentence with <sos> and <eos>.
    """
    for text in data_iter:
        yield [ "<sos>" ] + tokenizer(text) + [ "<eos>" ]

# Example dataset
train_data = [
    "hello world",
    "how are you doing",
    "hello again"
]

# 4. Build the vocabulary
#    We pass the token generator to build_vocab_from_iterator
vocab = build_vocab_from_iterator(
    yield_tokens(train_data),
    specials=SPECIAL_TOKENS
)

# Set the default index to <unk> token
vocab.set_default_index(vocab["<unk>"])

# 5. Look up tokens or indices
sample_text = "hello world"
tokens = [ "<sos>" ] + tokenizer(sample_text) + [ "<eos>" ]
print("Tokens:", tokens)

# Convert tokens to indices (stoi)
indices = vocab(tokens)
print("Indices:", indices)

# Convert indices back to tokens (itos)
restored_tokens = vocab.lookup_tokens(indices)
print("Restored Tokens:", restored_tokens)

# 6. Direct stoi and itos usage
idx_of_hello = vocab["hello"]
token_of_idx = vocab.lookup_token(idx_of_hello)
print(f"Index of 'hello': {idx_of_hello}")
print(f"Token at index {idx_of_hello}: {token_of_idx}")
```


### Noi