<td>
<a href="https://colab.research.google.com/github/raoulg/straattaal/blob/main/notebooks/demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</td>

In [None]:
# install necessary dependencies
try:
   import google.colab
   IN_COLAB = True
except ImportError:
   IN_COLAB = False

if IN_COLAB:
   !pip install slanggen
from pathlib import Path
import requests
import torch

from slanggen import datatools
from slanggen import models

I scraped the "Amsterdamse straattaal woordenboek", see [link](https://www.mijnwoordenboek.nl/dialect/Amsterdamse%20straattaal).
I will download the scraped word list and use it to train the model.

In [None]:
def download(url, datafile: Path):
    datadir = datafile.parent
    if not datadir.exists():
        print(f"Creating directory {datadir}")
        datadir.mkdir(parents=True)

    if not datafile.exists():
        print(f"Downloading {url} to {datafile}")
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # stop here rather than save an error page
        with datafile.open("wb") as f:
            f.write(response.content)
    else:
        print(f"File {datafile} already exists, skipping download")

url = "https://github.com/raoulg/straattaal/blob/main/assets/straattaal.txt?raw=true"
datafile = Path("data/straattaal.txt")
download(url, datafile)


Let's have a look at the first ten words.

In [None]:
processed_words = datatools.load_data(datafile)
processed_words[:10], len(processed_words)

We have 453 words in total. Notice the extra start `<s>` and stop `</s>` tags, which will be used to train the model.
We will use a Byte Pair Encoding (BPE) algorithm to learn the subword units from the corpus.
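
To make the idea concrete, here is a minimal sketch of a single BPE merge step in plain Python: count all pairs of adjacent symbols and glue the most frequent pair together. This is an illustration only, not the `slanggen` implementation; the helpers `count_pairs` and `merge_pair` and the toy words are made up for this example.

```python
from collections import Counter


def count_pairs(words):
    """Count how often each pair of adjacent symbols occurs."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` by one merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged


# toy corpus (made up); every word starts out as a list of characters
toy_words = [list("gabber"), list("gappie"), list("gast")]
best_pair = count_pairs(toy_words).most_common(1)[0][0]
toy_merged = merge_pair(toy_words, best_pair)
print(best_pair, toy_merged)
```

Repeating this step until the vocabulary reaches the requested size yields subword units that sit between single characters and whole words.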

Let's have a look at the first ten tokens, generated by the BPE algorithm from the dataset.

In [None]:
tokenizer = models.buildBPE(corpus=processed_words, vocab_size=100)
list(tokenizer.get_vocab())[:10]

You can see that a token sits somewhere between a word and a character.
We can now encode a word and see which tokens are created:

In [None]:
enc = tokenizer.encode("waggie")
print(f'The subtokens of "waggie" are\n {enc.tokens} \nwith ids\n {enc.ids}')

And reconstruct the word from the ids:

In [None]:
tokenizer.decode(enc.ids)

Let's process the sequences. We will:
- transform words into subtokens, and then into arbitrary integers
- add zeros to make all sequences the same length (padding)
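
The padding step can be sketched as follows. This is a minimal illustration, assuming padding is added on the right with id 0; the notebook does not show how `datatools.preprocess` does it, and `pad_sequences` and the token ids are made up.

```python
def pad_sequences(sequences, pad_id=0):
    """Right-pad every sequence with `pad_id` up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


# made-up token ids for three words of different lengths
toy_ids = [[1, 17, 42, 2], [1, 8, 2], [1, 5, 9, 30, 2]]
toy_padded = pad_sequences(toy_ids)
for row in toy_padded:
    print(row)
```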

In [None]:
padded_sequences = datatools.preprocess(processed_words, tokenizer)
padded_sequences

Every word is now a list of integers. We shift the sequence by one position, so that the target (what the model must predict) is the next token in the sequence.
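
A minimal sketch of such a shift in plain Python. It assumes the input drops the last token and the target drops the first; whether `datatools.ShiftedDataset` does exactly this is not shown here, and `shift` and the ids are made up.

```python
def shift(sequence):
    """Return (input, target): the target is the input moved one step ahead."""
    return sequence[:-1], sequence[1:]


toy_sequence = [1, 17, 42, 2, 0, 0]  # made-up padded token ids
toy_x, toy_y = shift(toy_sequence)
print("input: ", toy_x)
print("target:", toy_y)
```

At every position the target holds the token that follows the corresponding input token.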

In [None]:
dataset = datatools.ShiftedDataset(padded_sequences)
dataset

In [None]:
x, y = dataset[0]
print(f"input: {x}")
print(f"output: {y}")

We will use a `DataLoader`, which batches the sequences and shuffles the dataset.
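
What batching and shuffling amount to can be sketched in plain Python. This is not how `torch.utils.data.DataLoader` is implemented; `make_batches` is a hypothetical helper, and the 453 integers merely stand in for the 453 words.

```python
import random


def make_batches(items, batch_size, shuffle=True, seed=0):
    """Optionally shuffle the items, then cut them into consecutive batches."""
    items = list(items)
    if shuffle:
        random.Random(seed).shuffle(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# 453 dummy items, batch size 16
toy_batches = make_batches(range(453), batch_size=16)
print(len(toy_batches), len(toy_batches[0]), len(toy_batches[-1]))
```

With 453 items and a batch size of 16 this gives 29 batches, the last of which holds only 5 items.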

In [None]:
# import torch dataloader
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=16, shuffle=True)
x, y = next(iter(loader))
x.shape, y.shape

Let's look at the batch shapes across the full dataset:

In [None]:
for x, y in loader:
    print(x.shape, y.shape)

We will use the vocabulary size as the output size of the model.
The model now:
- takes as input: a sequence of integers
- produces as output: for every possible BPE token, a score expressing how likely it is to be the next token in the sequence.
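
A minimal sketch of how per-token scores turn into next-token probabilities (softmax). It assumes the model returns unnormalized scores that still have to be normalized; the notebook does not show what `SlangRNN` returns, and the scores are made up.

```python
import math


def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


toy_scores = [2.0, 0.5, -1.0, 0.0]  # one made-up score per token in a 4-token vocabulary
toy_probs = softmax(toy_scores)
print([round(p, 3) for p in toy_probs])
```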

In [None]:
# Define the vocab size based on the tokenizer
vocab_size = tokenizer.get_vocab_size()
vocab_size

We can now set up all the ingredients:
- the model uses 64 dimensions to represent the language
- we can calculate the loss (the difference between the actual next token and the predicted next token)
- the optimizer will tell the model in which direction to adjust the weights
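
To illustrate the loss, here is a minimal sketch of cross-entropy for a single prediction: the negative log of the probability given to the correct token. `nn.CrossEntropyLoss` works on raw scores and by default averages over all positions; this toy version takes probabilities directly, and the distributions are made up.

```python
import math


def cross_entropy(probabilities, target_index):
    """Negative log of the probability assigned to the correct token."""
    return -math.log(probabilities[target_index])


# made-up predictions over a 4-token vocabulary; the true next token is index 2
confident_right = [0.05, 0.05, 0.85, 0.05]
confident_wrong = [0.85, 0.05, 0.05, 0.05]
loss_right = cross_entropy(confident_right, 2)
loss_wrong = cross_entropy(confident_wrong, 2)
print(f"{loss_right:.3f} {loss_wrong:.3f}")
```

The loss is small when the model puts most probability on the correct token and large when it is confidently wrong.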

In [None]:
from torch import nn, optim
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Hyperparameters
config = {
    "vocab_size": vocab_size,
    "embedding_dim": 64,
    "hidden_dim": 64,
    "num_layers": 2,
    "output_dim": vocab_size,
}

model = models.SlangRNN(config)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.1)
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=50, min_lr=1e-4)

Let's train for 800 epochs. This means we will present the full dataset of 453 words to the model 800 times.

In [None]:
# Import required libraries
from loguru import logger
import torch

# Set number of training epochs
epochs = 800
history = []
last_lr = 0

# Main training loop - iterate through epochs
# an epoch is a complete pass through the dataset
for epoch in range(epochs):
   loss = 0

   # Inner loop - process each batch of data
   # the loss of every batch of 16 samples is added to the epoch total
   for x, y in loader:
       # Clear gradients left over from the previous update
       optimizer.zero_grad()

       # Initialize hidden state for RNN/LSTM
       hidden = model.init_hidden(x)

       # Forward pass - get model predictions for the next tokens
       output, hidden = model(x, hidden)

       # Calculate loss by comparing predictions to targets
       # Reshape output and target tensors to match expected dimensions
       loss += loss_fn(output.view(-1, vocab_size), y.view(-1))

   # Backward pass - compute gradients of the summed epoch loss
   loss.backward()
   # Update model parameters using optimizer
   # as written, this happens once per epoch, after all batches
   optimizer.step()

   # Adjust learning rate based on loss
   scheduler.step(loss)

   # Store loss value for plotting/monitoring
   history.append(loss.item())

   # Get current learning rate from scheduler
   curr_lr = scheduler.get_last_lr()

   # Log training progress every 10 epochs
   if (epoch+1) % 10 == 0:
       logger.info(f'Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}')

       # Log if learning rate has changed
       if last_lr != curr_lr:
           last_lr = curr_lr
           logger.info(f"Current learning rate: {curr_lr}")
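
The learning-rate schedule used in the loop above can be illustrated with a simplified, plain-Python version of the reduce-on-plateau rule: if the loss has not improved for more than `patience` epochs, multiply the learning rate by `factor`, but never go below `min_lr`. This is a sketch, not the PyTorch implementation (it ignores options such as cooldown); `reduce_on_plateau` and the loss values are made up.

```python
def reduce_on_plateau(losses, lr, factor=0.1, patience=50, min_lr=1e-4, threshold=1e-4):
    """Return the learning rate after each epoch under a reduce-on-plateau rule."""
    best = float("inf")
    bad_epochs = 0
    lrs = []
    for loss in losses:
        if loss < best * (1 - threshold):  # a real improvement
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:  # waited long enough: shrink the learning rate
            lr = max(lr * factor, min_lr)
            bad_epochs = 0
        lrs.append(lr)
    return lrs


# made-up loss curve that improves three times and then stalls
toy_losses = [10, 9, 8] + [8] * 6
toy_lrs = reduce_on_plateau(toy_losses, lr=0.1, patience=2)
print(toy_lrs)
```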

Let's see whether the model has been learning.

In [None]:
import matplotlib.pyplot as plt
plt.plot(history)
plt.ylabel('Loss')
plt.xlabel('Epochs')
plt.ylim(0, 40)

We can now save, and load, the trained model for use.

In [None]:
modeldir = Path("artefacts")
if not modeldir.exists():
    modeldir.mkdir(parents=True)

modelfile = modeldir / "model.pth"
torch.save(model.state_dict(), modelfile)

In [None]:
model = models.SlangRNN(config)
model.load_state_dict(torch.load(modelfile))

We can now pick a starting letter, e.g. 'a', and feed the model a sequence consisting of the start token and that letter.
The model then predicts next tokens one at a time; the loop below stops when it samples the `<pad>` token or reaches `max_length`.
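
The effect of the `temperature` parameter can be sketched in plain Python: scores are divided by the temperature and exponentiated, and the next token is drawn with probability proportional to the result. This mirrors the `div(temperature).exp()` followed by `torch.multinomial` in the cell that follows, but `sample_with_temperature` and the scores are made up for illustration.

```python
import math
import random


def sample_with_temperature(scores, temperature=1.0, rng=random):
    """Sample an index with probability proportional to exp(score / temperature)."""
    weights = [math.exp(s / temperature) for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]


toy_scores = [2.0, 1.0, 0.0]  # made-up scores for a 3-token vocabulary
rng = random.Random(0)
for temperature in (0.2, 1.0, 5.0):
    draws = [sample_with_temperature(toy_scores, temperature, rng) for _ in range(1000)]
    counts = [draws.count(i) for i in range(len(toy_scores))]
    print(temperature, counts)
```

A low temperature makes the highest-scoring token dominate, while a high temperature spreads the draws more evenly and gives more varied words.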

In [None]:
# Initialize generation parameters
start_letter = 'a'
max_length = 20
temperature = 1.0

# Get token indices for start token and first letter
# we now have "<s>a" as the initial input sequence
start_token_idx = tokenizer.encode("<s>").ids[0]
start_letter_idx = tokenizer.encode(start_letter).ids[0]

# Create initial input sequence
# we translate "<s>a" to tokens (numbers)
input_seq = torch.tensor([[start_token_idx, start_letter_idx]], dtype=torch.long)
generated_word = [start_letter_idx]
print(f"Initial input sequence: {input_seq}")

# Initialize model's hidden state
hidden = model.init_hidden(input_seq)

# Generate the remaining tokens one by one
for _ in range(max_length - 1):
   # Get model prediction without computing gradients
   with torch.no_grad():
       output, hidden = model(input_seq, hidden)

   # Process model output and apply temperature scaling
   output = output.squeeze(0)
   output = output[-1, :].view(-1).div(temperature).exp()

   # Sample next token based on model probabilities
   next_token = torch.multinomial(output, 1).item()

   # Stop if padding token is generated
   if next_token == tokenizer.token_to_id("<pad>"):
       break

   # Add token to generated sequence and update input
   generated_word.append(next_token)
   input_seq = torch.tensor([generated_word], dtype=torch.long)

This generates a list of token ids:

In [None]:
generated_word

Which we can decode into a word:

In [None]:
tokenizer.decode(generated_word)

We can loop this process to generate multiple words.

In [None]:
models.sample_n(processed_words, n=10, model=model, tokenizer=tokenizer, max_length=20, temperature=1.0)

And save everything for later use.

In [None]:
tokenizer_file = modeldir / "tokenizer.json"
tokenizer.save(str(tokenizer_file))
torch.save(model.state_dict(), modelfile)