<a href="https://colab.research.google.com/github/yaya-sy/LLMReasoningCourse/blob/main/labs/Lab1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 1: Improving the output quality of LLMs with better decoding algorithms

In this lab, we will see how to improve an LLM's French-to-English translations through better sampling algorithms and hyper-parameters.



## Data loading

Clone this GitHub repo: https://github.com/yaya-sy/LLMReasoningCourse.git

In the folder `labs/lab1/data`, you will find a parallel corpus for English and French, meaning that the $i^{th}$ line of the file `fr.txt` is the French translation of the $i^{th}$ line of the file `en.txt`.

In [None]:
!git clone https://github.com/yaya-sy/LLMReasoningCourse.git

Write a function `load_data` that returns the data in this format:

```json
{
  "fr": {"dev": [french development corpus], "test": [french test corpus]},
  "en": {"dev": [english development corpus], "test": [english test corpus]}
}
```

Use 30% of the data for dev and the remainder for test. Shuffle the translation pairs, but make sure the two languages remain aligned.

In [None]:
import random

def load_data(path):
    # Read both sides of the parallel corpus (assuming the files are UTF-8 encoded)
    with open(f"{path}/fr.txt", "r", encoding="utf-8") as f:
        fr_lines = f.readlines()
    with open(f"{path}/en.txt", "r", encoding="utf-8") as f:
        en_lines = f.readlines()

    # Shuffle the pairs together so that the two languages stay aligned
    pairs = list(zip(en_lines, fr_lines))
    random.shuffle(pairs)
    split_idx = int(len(pairs) * 0.3)  # first 30% for dev, the rest for test
    splits = {"dev": pairs[:split_idx], "test": pairs[split_idx:]}

    data = {"fr": {"dev": [], "test": []}, "en": {"dev": [], "test": []}}
    for split, split_pairs in splits.items():
        for en, fr in split_pairs:
            data["fr"][split].append(fr.strip())
            data["en"][split].append(en.strip())

    return data

Why are we doing this? What is the difference between the dev and test corpora? Which split should we use to tune our algorithms?
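
Whichever split is used for tuning, it helps if the split itself is reproducible: `random.shuffle` without a seed produces a different dev/test partition on every run. A minimal sketch, assuming a fixed seed is acceptable for the lab; `split_pairs` and its `seed` argument are hypothetical names, not part of the course code, and the toy corpus is made up for illustration:

```python
import random
from typing import Dict, List

def split_pairs(en_lines: List[str], fr_lines: List[str],
                dev_ratio: float = 0.3, seed: int = 0) -> Dict[str, Dict[str, List[str]]]:
    """Shuffle aligned sentence pairs with a fixed seed and split them into dev/test."""
    if len(en_lines) != len(fr_lines):
        raise ValueError("both sides of a parallel corpus must have the same number of lines")
    pairs = list(zip(en_lines, fr_lines))
    random.Random(seed).shuffle(pairs)  # local RNG: the global random state is left untouched
    split_idx = int(len(pairs) * dev_ratio)
    splits = {"dev": pairs[:split_idx], "test": pairs[split_idx:]}
    data: Dict[str, Dict[str, List[str]]] = {"fr": {}, "en": {}}
    for name, chunk in splits.items():
        data["en"][name] = [en.strip() for en, _ in chunk]
        data["fr"][name] = [fr.strip() for _, fr in chunk]
    return data

# Toy corpus: sentence i in English is paired with sentence i in French
en = [f"sentence {i}" for i in range(10)]
fr = [f"phrase {i}" for i in range(10)]
first = split_pairs(en, fr)
second = split_pairs(en, fr)  # same seed, hence the same partition
```

With the same seed, two calls return identical splits, so results obtained on the dev set can be compared across notebook restarts.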

## Load the model

We will use `HuggingFaceTB/SmolLM2-135M-Instruct` for this lab.

In [None]:
# load the model and the tokenizer. Load the model on the GPU if available
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct", dtype=torch.bfloat16, token="")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

model = model.to(device)

What arguments in the model loader can we use to reduce the memory footprint of the model?
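
To get a feel for what the `dtype` argument changes, the memory taken by the weights can be estimated as the parameter count times the number of bytes per parameter. A rough sketch, assuming about 135 million parameters (rounded from the model name) and ignoring activations and any cache; `weight_memory_mb` is a hypothetical helper:

```python
# Bytes used to store one parameter in each dtype
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}

def weight_memory_mb(num_params: int, dtype: str) -> float:
    """Approximate memory taken by the weights alone, in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6

num_params = 135_000_000  # assumption: rounded from the "135M" in the model name
for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: ~{weight_memory_mb(num_params, dtype):.0f} MB")
```

On a loaded model, `sum(p.numel() * p.element_size() for p in model.parameters())` gives the exact figure in bytes.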

In [None]:
conversations = [{"role": "user", "content": "J'adore le chocolat."}, {"role": "assistant", "content": "Ok, cool."}]

In [None]:
templated = tokenizer.apply_chat_template(conversations, tokenize=False)

In [None]:
ids = tokenizer(templated)

In [None]:
tokenizer.convert_ids_to_tokens(ids["input_ids"])

In [None]:
c = torch.tensor([0.41, 1.2])

## Generation loop

The following class is the main class we use for generating text from the LLM. You will implement the missing parts.

In [None]:
from typing import List, Union, Optional
from transformers import PreTrainedModel, PreTrainedTokenizer

class Generator:
    def __init__(self, model: PreTrainedModel, tokenizer: PreTrainedTokenizer):
        self.model = model
        self.model.eval()
        self.w = self.model.lm_head.weight.data # output projection matrix, shape (vocab_size, hidden_size)
        self.tokenizer = tokenizer
        self.eos = self.tokenizer.eos_token_id
        self.prompt = "You are given a French text, provide faithful translation to English."

    def tokenize(self, texts: List[str]):
        """Tokenize the texts"""
        conversations = [
            [{"role": "user", "content": f'{self.prompt}\n\nHere is the French text: {text}.\n\nYour English translation:'}]
            for text in texts]
        # TODO: call apply_chat_template from the tokenizer class with the right arguments to output torch tensors of the token ids
        # Which padding side to use? why?
        templated = self.tokenizer.apply_chat_template(conversations, tokenize=False, add_generation_prompt=True)
        tokenized = self.tokenizer(templated, padding_side="left", padding=True, return_tensors="pt")
        return tokenized

    def decode(self, generated_token_ids):
        decoded = []
        where_eos_is_reached = generated_token_ids == self.eos
        for idx, row in enumerate(where_eos_is_reached):
            if row.any(): # this sequence contains at least one eos token
                eos_idx = row.int().argmax() # position of the first eos
                decoded.append(generated_token_ids[idx, :eos_idx].tolist())
            else: # otherwise return the unfinished generation
                decoded.append(generated_token_ids[idx].tolist())
        return self.tokenizer.batch_decode(decoded, skip_special_tokens=True)

    def logits(self, c):
        return c @ self.w.T

    def softmax(self, logits):
        """Normalizes the logits into probabilities"""
        # see: https://stackoverflow.com/questions/42599498/numerically-stable-softmax
        logits = logits - logits.max(dim=-1, keepdim=True).values # out-of-place, so the caller's tensor is not modified
        scores = logits.exp()
        return scores / scores.sum(dim=-1, keepdim=True)

    @torch.no_grad()
    def generate(self,
                 texts: List[str],
                 temperature: Optional[float]=None,
                 top_k: Optional[int]=None,
                 top_p: Optional[float]=None,
                 max_new_tokens: int=16):

        batch_size = len(texts)
        tokenized = self.tokenize(texts)
        token_ids = tokenized.to(self.model.device) # contains {input_ids: ..., attention_mask: ...}

        generated = torch.empty((batch_size, 0), dtype=torch.long, device=self.model.device) # will contain the generated token ids
        while generated.shape[-1] < max_new_tokens: # generate until max_new_tokens is reached
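            outputs = self.model(**token_ids, output_hidden_states=True)
            h = outputs.hidden_states[-1][:, -1, :] # last-layer hidden state at the last position
            manual_logits = self.logits(h) # manual projection onto the vocabulary, for comparison with outputs.logits
            logits = outputs.logits[:, -1, :].float() # next-token logits actually used for sampling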
            vocab_size = logits.shape[-1]
            if temperature is not None and temperature > 0:
                logits /= temperature

            if top_k is not None:
                # indices of the (vocab_size - top_k) lowest-scoring tokens
                lowest = torch.topk(logits, k=vocab_size-top_k, largest=False, dim=-1).indices
                # set them to -inf so that only the top_k tokens can be sampled
                logits.scatter_(-1, lowest, float('-inf'))

            if top_p is not None:
                # filter the logits for top_p by setting float('-Inf') to the token logits that don't reach the top_p
                pass

            probabilities = self.softmax(logits)
            next_tokens = torch.multinomial(probabilities, 1).long()
            token_ids["input_ids"] = torch.cat((token_ids["input_ids"], next_tokens), dim=-1)
            token_ids["attention_mask"] = torch.cat((token_ids["attention_mask"], torch.ones_like(next_tokens)), dim=-1)
            generated = torch.cat((generated, next_tokens), dim=-1)

        # decode the generations
        decoded = self.decode(generated)
        return decoded

Implement and test the generation class
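
For the `top_p` branch left open in `generate`, one common formulation is nucleus sampling: keep the smallest set of highest-probability tokens whose cumulative probability reaches `top_p`, and mask the rest with `-inf`. A minimal sketch under that assumption; `top_p_filter` is a hypothetical helper name, not part of the course code, and the toy distribution is made up for illustration:

```python
import torch

def top_p_filter(logits: torch.Tensor, top_p: float) -> torch.Tensor:
    """Return a copy of `logits` (batch, vocab) with tokens outside the nucleus set to -inf."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    sorted_probs = torch.softmax(sorted_logits, dim=-1)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Drop a token if the probability mass *before* it already reaches top_p;
    # the most likely token therefore always survives.
    remove_sorted = (cumulative - sorted_probs) >= top_p
    # Map the mask from sorted order back to vocabulary order
    remove = torch.zeros_like(remove_sorted).scatter(-1, sorted_idx, remove_sorted)
    return logits.masked_fill(remove, float("-inf"))

# Toy distribution over 4 tokens with probabilities 0.5, 0.3, 0.15, 0.05
logits = torch.tensor([[0.5, 0.3, 0.15, 0.05]]).log()
filtered = top_p_filter(logits, top_p=0.7)
kept = torch.isfinite(filtered)  # True where a token can still be sampled
```

Inside `generate`, such a helper could be called as `logits = top_p_filter(logits, top_p)` just before the softmax.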

In [None]:
generator = Generator(model=model,
                      tokenizer=tokenizer)

In [None]:
generator.generate(texts=["J'adore le chocolat.", "Ce film est vraiment magnifique !"])

In [None]:
generator.generate(texts=["J'adore le chocolat.", "Ce film est vraiment magnifique !"], temperature=0.4, top_p=0.7, max_new_tokens=16)

We are using batch generation here. Is it faster than generating one sequence at a time? What are the limits of batch generation?

What are the solutions to improve the generation speed? See: https://huggingface.co/blog/continuous_batching

In [None]:
from tqdm.notebook import tqdm
def translate_corpus(corpus: List[str], batch_size: int=16, t=0.2):
    # sort by prompt length so that sequences in the same batch need little padding
    data = sorted(enumerate(corpus), key=lambda x: generator.tokenize([x[1]]).input_ids.shape[-1])

    translated = [None] * len(corpus)

    for i in tqdm(range(0, len(data), batch_size)):
        indices, texts = zip(*data[i : i + batch_size])

        preds = generator.generate(list(texts), temperature=t, top_p=0.8, max_new_tokens=64)

        # Place predictions back in original slots
        for idx, pred in zip(indices, preds):
            translated[idx] = pred

    return translated

In [None]:
# translate the french dev corpus to English
translated = translate_corpus(data["fr"]["dev"])

In [None]:
list(zip(data["fr"]["dev"][:10], data["en"]["dev"][:10]))

In [None]:
list(zip(data["fr"]["dev"][:10], translated[:10]))

Translate the French dev corpus to English at different temperatures: `[0.1, 0.4, 1.0, 1.5, 2.0, 4.0, 8.0]` and plot the log-likelihood of the English translations given the French texts. As a reminder, the log-likelihood is defined as:

$$
\log \mathcal{L}(\mathcal{D}, \theta) = \frac{1}{N} \sum\limits_{(S, T) \in \mathcal{D}} \sum\limits_{i=1}^{|T|} \log p(T_{i}|T_{<i}, S; \theta)
$$

where $S$ is the source sentence (the French sentence) and $T$ is its English translation.
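
The inner sum of this formula can be computed from a model's logits with a log-softmax followed by a gather over the target tokens. A minimal sketch on toy tensors, assuming the logits at position $i$ already predict token $T_i$ (with a causal LM, the logits first have to be shifted by one position) and that a mask marks the real, non-padding target tokens; `sequence_log_likelihood` is a hypothetical helper, not part of the course code:

```python
import torch

def sequence_log_likelihood(logits: torch.Tensor,
                            targets: torch.Tensor,
                            mask: torch.Tensor) -> torch.Tensor:
    """Sum of log p(T_i | T_<i, S) over the unmasked positions of each sequence.

    logits:  (batch, length, vocab) scores, where position i predicts targets[:, i]
    targets: (batch, length) token ids of the translation
    mask:    (batch, length) 1 for real target tokens, 0 for padding
    """
    log_probs = torch.log_softmax(logits.float(), dim=-1)
    # Pick the log-probability assigned to each target token
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (token_log_probs * mask).sum(dim=-1)

# Toy example: uniform logits over a vocabulary of 4 tokens
logits = torch.zeros(2, 3, 4)
targets = torch.tensor([[1, 2, 3], [0, 1, 0]])
mask = torch.tensor([[1, 1, 1], [1, 1, 0]])  # the second sequence ends with one padding token
per_sequence = sequence_log_likelihood(logits, targets, mask)
# The 1/N factor, taking N as the number of sentence pairs (an assumption)
corpus_average = per_sequence.mean()
```

With uniform logits every token has probability $1/4$, so each sequence scores its number of real tokens times $\log(1/4)$, which gives a quick sanity check before plugging in real model outputs.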

In [None]:
def log_likelihood(corpus: List[str], batch_size: int=16):
    pass

In [None]:
# use sacreBLEU to evaluate the translations with the BLEU and chrF metrics: https://github.com/mjpost/sacrebleu