## Testing CausalLM Models for Text Generation

### We Will Explore a Few Models Here

- DistilGPT2, as in the tutorial
- The ELI5 dataset

We will then explore a few other models and add complexity as we go: larger LLMs with quantization and LoRA.
We can try Falcon 7B and BLOOM 3B again, but this time we start simple.

We will also explore the Alpaca template for text generation.
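
As a rough illustration of what an Alpaca-style prompt looks like, here is a minimal sketch. The template wording is reproduced from memory of the Stanford Alpaca project and should be checked against the original before training. The helper name `build_alpaca_prompt` and the sample dialogue are hypothetical.

```python
# Alpaca-style prompt templates (wording assumed to match the Stanford Alpaca project).
ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{response}"
)
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{response}"
)


def build_alpaca_prompt(instruction: str, context: str = "", response: str = "") -> str:
    """Fill the template. Leave `response` empty at inference time (hypothetical helper)."""
    if context:
        return ALPACA_WITH_INPUT.format(
            instruction=instruction, input=context, response=response
        )
    return ALPACA_NO_INPUT.format(instruction=instruction, response=response)


# Hypothetical sample input, only for illustration.
example = build_alpaca_prompt(
    "Summarise the conversation between the people.",
    context="#Person1#: Hi! #Person2#: Hello!",
)
print(example)
```

For fine-tuning, the reference answer would go into `response`; for generation, the prompt ends right after `### Response:` so the model continues from there.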

### Resources

- Hugging Face task guide: [Causal Language Modeling](https://huggingface.co/docs/transformers/tasks/language_modeling)
- Fine-tuning Llama 2 on a custom dataset: [GitHub](https://github.com/curiousily/Get-Things-Done-with-Prompt-Engineering-and-LangChain/blob/master/14.fine-tuning-llama-2-7b-on-custom-dataset.ipynb)

In [1]:
### Imports
import pandas as pd
import torch
import numpy as np
from transformers import (
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer
)


In [3]:
from datasets import load_dataset

eli5 = load_dataset("eli5", split="train_asks[:10000]", trust_remote_code=True)
eli5 = eli5.train_test_split(test_size=0.2)
eli5["train"][0]

{'q_id': '28i7kc',
 'title': 'How far in advance were/are the flybys and encounters for Galileo and Cassini planned?',
 'selftext': 'Do we know enough about the orbital solutions of the various moons and their spheres of influence to essentially have each and every flyby planned as soon as the probe enters the system from interplanetary transfer?  \nIncluding minor course corrections during the duration of the mission...\n\nOR, are various flybys and mission highlights known only after the probe enters a system and the orbits calculated more precisely?\n\nFor some context, Cassini is making a close flyby of Titan today, June 17th, 2014.  Would this flyby have been planned before launch in 1997, during interplanetary transfer, upon capture by Saturn, or relatively recently?',
 'document': '',
 'subreddit': 'askscience',
 'answers': {'a_id': ['cibbugl'],
  'text': ["I can't give a full answer, but I can partially answer your question about the Titan flyby. Many flybys within a system tha

In [2]:
training_texts = [item["answers"]["text"] for item in eli5["train"]]

NameError: name 'eli5' is not defined

In [29]:
joined_texts = [" ".join(x) for x in training_texts]

In [30]:
joined_texts[:2]

["Folding@home is probably one of the better ones out there, SETI@home is still around there, but as computers got more advanced they have been able to get servers that can process all their loads without distributed processing, [I'd probably recommend a different BIONIC application](_URL_0_), SETI@home now actually just does a higher detail second pass on the SETI data.\n\nAlso, your computer is idle, but understand an idle computer uses less power than a non-idle power. Many of these projects can result in a significant increase in your power bill. Yeah, I've always thought this made a lot of sense.  There are some research problems that are limited by the compute power/time it would take to solve them, and if they can be  organized in a way amenable to distributed computing, volunteering idle computer time can be a big help!\n\nYou mentioned folding proteins.  This is an early application, and is one of the most successful, volunteer, distributed computing projects. Vijay Pande at S

In [3]:
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token

In [4]:
eli_flat = eli5.flatten()

In [31]:
def preprocess_texts(item):
    inputs = tokenizer([" ".join(x) for x in item["answers.text"]], truncation=True, padding="max_length")
    inputs["labels"] = inputs["input_ids"].copy()
    return inputs


In [5]:
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])

In [6]:
tokenized_eli5 = eli_flat.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli_flat["train"].column_names,
)

Map (num_proc=4):   0%|          | 0/8000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1131 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (3333 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1139 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1058 > 1024). Running this sequence through the model will result in indexing errors


Map (num_proc=4):   0%|          | 0/2000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1054 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1258 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2224 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1308 > 1024). Running this sequence through the model will result in indexing errors


In [7]:
block_size = 128


def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
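
To make the chunking step concrete, here is a small toy run of the same logic on made-up token ids. The function name `group_texts_toy` and the sample batch are hypothetical, and `block_size` is passed as an argument instead of being read from a global. The "longer than the specified maximum sequence length" warnings emitted during tokenization should be harmless as long as every sequence is re-chunked this way before it reaches the model.

```python
def group_texts_toy(examples, block_size):
    """Same logic as group_texts, with block_size passed explicitly."""
    # Concatenate the per-example lists into one long list per key.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated[next(iter(examples))])
    # Drop the remainder that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM training the labels are a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result


# Hypothetical batch: three "tokenized" texts of lengths 3, 4 and 3.
batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
grouped = group_texts_toy(batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Ten tokens with a block size of 4 give two full blocks; the last two tokens are dropped. Note that blocks can span the boundary between two original texts.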

In [8]:
lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)

Map (num_proc=4):   0%|          | 0/8000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/2000 [00:00<?, ? examples/s]

In [64]:
# eli_my = map(lambda data_set: {k: v["text"] for k, v in data_set if k in ["answers"]}, eli5.items())

In [34]:
tok = eli_flat.map(preprocess_texts, batched=True, num_proc=8, remove_columns=eli_flat["train"].column_names)

Map (num_proc=8):   0%|          | 0/8000 [00:00<?, ? examples/s]

Map (num_proc=8):   0%|          | 0/2000 [00:00<?, ? examples/s]

In [9]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

In [10]:
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

In [11]:
training_args = TrainingArguments(
    output_dir="./data/eli5_tutorial",
    optim="paged_adamw_32bit",
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    save_safetensors=True,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer
)
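
The batch settings above interact: with gradient accumulation, the optimizer only steps once every few forward/backward passes. A minimal sketch of the arithmetic, assuming a single device; the helper names are hypothetical, and the steps-per-epoch figure is only an approximation because the exact rounding depends on the `transformers` version.

```python
import math


def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of samples that contribute to one optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def approx_steps_per_epoch(num_samples, per_device_batch_size,
                           gradient_accumulation_steps, num_devices=1):
    """Approximate optimizer steps per epoch (rounding may differ in Trainer)."""
    batch = effective_batch_size(
        per_device_batch_size, gradient_accumulation_steps, num_devices
    )
    return math.ceil(num_samples / batch)


# Values from the TrainingArguments above, single device assumed.
print(effective_batch_size(8, 4))  # 32
```

So `per_device_train_batch_size=8` with `gradient_accumulation_steps=4` behaves roughly like a batch of 32 per optimizer step, at the memory cost of a batch of 8.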



In [12]:
trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 230 | No log | 3.780864 |
| 460 | No log | 3.755306 |
| 690 | 3.890600 | 3.743536 |
| 920 | 3.890600 | 3.738033 |


TrainOutput(global_step=1146, training_loss=3.842975023945381, metrics={'train_runtime': 129.2389, 'train_samples_per_second': 283.862, 'train_steps_per_second': 8.867, 'total_flos': 1197751642619904.0, 'train_loss': 3.842975023945381, 'epoch': 2.0})

In [13]:
trainer.save_model('./data/eli5_test_tut')

model.safetensors:   0%|          | 0.00/328M [00:00<?, ?B/s]

In [23]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 41.93
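
The perplexity printed above is the exponential of the mean evaluation loss. When the loss is collected batch by batch, the batches should be combined by token count before exponentiating; averaging per-batch perplexities gives a different (larger or equal) number. A minimal sketch with made-up loss values; the function name and the sample numbers are hypothetical.

```python
import math


def perplexity_from_batches(batches):
    """batches: list of (mean_loss, num_tokens) pairs, one per batch."""
    total_nll = sum(loss * n for loss, n in batches)
    total_tokens = sum(n for _, n in batches)
    # Perplexity is exp of the token-weighted mean negative log-likelihood.
    return math.exp(total_nll / total_tokens)


# Sanity check: a model that is uniform over 4 tokens has loss ln(4),
# so its perplexity should be approximately 4.
print(perplexity_from_batches([(math.log(4), 128)]))

# Hypothetical per-batch losses, to compare the two ways of averaging.
sample_batches = [(1.0, 10), (3.0, 10)]
corpus_ppl = perplexity_from_batches(sample_batches)
naive_ppl = sum(math.exp(loss) for loss, _ in sample_batches) / len(sample_batches)
print(corpus_ppl, naive_ppl)
```

A perplexity of k can be read loosely as the model being as uncertain as if it were choosing uniformly among k tokens at each position.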


In [20]:
prompt = "After we wrap our base model model with PeftModel along with the config"

In [21]:
from transformers import pipeline

generator = pipeline("text-generation", model="./data/eli5_test_tut")


In [22]:
generator(prompt, max_length=100)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "After we wrap our base model model with PeftModel along with the configurator, these will only change if the other model fits correctly with the model's input frame.\n\nNow, let's take a look at the model's power sources:\nWe see the input and output (pulse frequency, RF output, input) which is basically the total input input we get. That means the power source is still on the PWM-1.1 level (that's why current current"}]

In [32]:
def get_perplexity_of_model(model, tokenizer, dataset, block_size = None) -> float:
    dataset = dataset.flatten()
    def tokenize_function(examples):
        tok_examples = tokenizer([" ".join(x) for x in examples["answers.text"]])
        if block_size is not None:
            # Concatenate all texts.
            concatenated_examples = {k: sum(tok_examples[k], []) for k in tok_examples.keys()}
            total_length = len(concatenated_examples[list(tok_examples.keys())[0]])
            # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
            # customize this part to your needs.
            if total_length >= block_size:
                total_length = (total_length // block_size) * block_size
            # Split by chunks of block_size.
            result = {
                k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
                for k, t in concatenated_examples.items()
            }
            result["labels"] = result["input_ids"].copy()
            return result
        else:
            tok_examples["labels"] = tok_examples["input_ids"].copy()
            return tok_examples
    
    def compute_perplexity(batch):
        input_ids = batch["input_ids"]
        with torch.no_grad():
            outputs = model(input_ids=input_ids, labels=input_ids)
        logits = outputs.logits
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = input_ids[..., 1:].contiguous()
        # Flatten the tensors to calculate cross-entropy loss
        loss = torch.nn.functional.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
        return torch.exp(loss)
        
    tokenized_datasets = dataset.map(tokenize_function, batched=True)
    # Compute perplexity
    perplexity = tokenized_datasets["test"].map(compute_perplexity, batched=True).compute()
    
    # Calculate average perplexity
    average_perplexity = torch.stack(perplexity).mean().item()
    return average_perplexity
    

In [8]:
loaded_model = AutoModelForCausalLM.from_pretrained("./data/eli5_test_tut")

In [9]:
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token

In [34]:
eli5 = load_dataset("eli5", split="train_asks[:10000]", trust_remote_code=True)
eli5 = eli5.train_test_split(test_size=0.2)

In [35]:
get_perplexity_of_model(loaded_model, tokenizer, eli5)

Map:   0%|          | 0/8000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

TypeError: GPT2LMHeadModel.forward() got an unexpected keyword argument 'q_id'
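
The `TypeError` indicates that a raw dataset column (`q_id`) reached the model's `forward()` as a keyword argument. The earlier tokenization cell avoids this by passing `remove_columns` to `map`, and `Trainer` drops unused columns by default. When calling a model by hand, one option is to keep only the keys that `forward()` accepts. The sketch below shows the idea with a stand-in function; `forward_stub` and `keep_model_inputs` are hypothetical names, and the stub is not the real `GPT2LMHeadModel.forward` signature.

```python
import inspect


def forward_stub(input_ids=None, attention_mask=None, labels=None):
    """Stand-in for a model's forward(); NOT the real GPT2LMHeadModel signature."""
    return {"num_sequences": len(input_ids)}


def keep_model_inputs(batch: dict, forward_fn) -> dict:
    """Keep only the keys that forward_fn accepts as parameters."""
    accepted = set(inspect.signature(forward_fn).parameters)
    return {k: v for k, v in batch.items() if k in accepted}


# Hypothetical batch that still carries raw dataset columns.
raw_batch = {
    "q_id": ["28i7kc"],
    "title": ["..."],
    "input_ids": [[1, 2, 3]],
    "attention_mask": [[1, 1, 1]],
}

try:
    forward_stub(**raw_batch)
except TypeError as err:
    print("raw batch fails:", err)

clean_batch = keep_model_inputs(raw_batch, forward_stub)
print(sorted(clean_batch))  # ['attention_mask', 'input_ids']
output = forward_stub(**clean_batch)
```

With a real model, the token id lists would also need to be converted to tensors of equal length (for example by chunking or padding) before the call.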

### Testing Creating Data for LLM

In [3]:
from datasets import load_dataset

In [4]:
dataset = load_dataset('knkarthick/dialogsum')


In [1]:
from transformers import AutoModelForCausalLM, pipeline

In [None]:
INSTRUCTION = "Summarise the conversation between the people."
def generate_training_prompt(input, output):
    return f"""
    Instruction: {INSTRUCTION}\n\n
    Input: {input} \n\n

In [5]:
loaded_model = AutoModelForCausalLM.from_pretrained("ybelkada/falcon-7b-sharded-bf16")

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

In [6]:

loaded_model.generation_config

GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2
}

In [2]:
generator = pipeline(model="ybelkada/falcon-7b-sharded-bf16")

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

In [4]:
generator("Can you tell me what is the largest country in the world today?")

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


[{'generated_text': 'Can you tell me what is the largest country in the world today?\nThe largest country in the'}]