<a href="https://colab.research.google.com/github/sasidhar92/5-minute-AI/blob/main/genAI_201.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Welcome to 5 minute AI**




*   Foundation Models
*   Drawbacks of LLMs
*   Finetuning


Large language models (LLMs) are trained on vast amounts of text from the internet. They learn the statistical patterns between words and sentences, and from those patterns they build a working model of the language.
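To make "learning patterns between words" concrete, here is a minimal sketch of the simplest statistical language model, a bigram counter. It is only an illustration: real LLMs use neural networks over subword tokens, and the toy corpus and function names below are made up for this example.

```python
from collections import Counter, defaultdict

def train_bigram_counts(corpus):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            counts[current_word][next_word] += 1
    return counts

def next_word_probabilities(counts, word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    followers = counts[word]
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()} if total else {}

# Hypothetical toy corpus, used only for illustration
toy_corpus = [
    "the movie was great",
    "the movie was boring",
    "the movie was great fun",
]
bigram_counts = train_bigram_counts(toy_corpus)
print(next_word_probabilities(bigram_counts, "was"))
```

In this toy corpus "great" follows "was" twice as often as "boring", so the model rates it as twice as likely. The model knows nothing about words it has never seen, which is the same limitation the pitfalls below describe.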

Pitfalls:

* Because LLMs are trained on broad internet data, their knowledge is generic. For example, if the training data doesn't contain healthcare data, an LLM might not be able to answer our healthcare queries.
* Internet data can also contain outdated or false information. When you ask an LLM about recent news events or a logic problem, it might not give the correct answer.

Various techniques have been developed to combat these pitfalls.

For example, finetuning an LLM on domain-specific data such as healthcare data shifts its statistical distribution from broad internet data toward the healthcare domain.

Alternatively, other techniques allow an LLM to answer queries by looking up authorized or approved data.
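The second idea, answering from approved data, can be pictured with a toy sketch. It assumes a simple word-overlap score to pick the most relevant approved document and then places that document in the prompt. The documents, function names, and prompt wording are all hypothetical, and production systems typically use embedding-based search rather than word overlap.

```python
def retrieve(query, approved_documents, top_k=1):
    """Rank approved documents by how many query words they share."""
    query_words = set(query.lower().split())
    scored = []
    for doc in approved_documents:
        overlap = len(query_words & set(doc.lower().split()))
        scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep only documents that share at least one word with the query
    return [doc for overlap, doc in scored[:top_k] if overlap > 0]

def build_prompt(query, approved_documents):
    """Place the retrieved context in front of the user's question."""
    context = retrieve(query, approved_documents)
    context_text = "\n".join(context) if context else "(no approved source found)"
    return f"Answer using only this context:\n{context_text}\n\nQuestion: {query}"

# Hypothetical approved documents, used only for illustration
approved_documents = [
    "Clinic opening hours are 9am to 5pm on weekdays.",
    "Flu vaccines are available from October each year.",
]
prompt = build_prompt("When are flu vaccines available?", approved_documents)
print(prompt)
```

The model's weights stay unchanged with this approach; only the prompt changes, which is why it helps with recent or restricted information.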

# **Section**


In this section, we will discuss finetuning techniques. For simplicity, we will use a small model to learn concepts such as parameter-efficient finetuning (PEFT) with Low-Rank Adaptation (LoRA).
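As a rough picture of what "low rank" means, the sketch below freezes a weight matrix `W` and adds a trainable update built from two thin matrices, `B` and `A`, scaled by `alpha / r`. The hidden size of 768 is an assumption matching roberta-base, the random values stand in for real pretrained weights, and the initialization (small random `A`, zero `B`) follows the common LoRA convention rather than anything stated in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 768      # hidden size (assumed; matches roberta-base)
r = 4        # LoRA rank
alpha = 32   # LoRA scaling factor

W = rng.normal(size=(d, d))               # frozen pretrained weight (random stand-in)
A = rng.normal(scale=0.01, size=(r, d))   # trainable, small random init
B = np.zeros((d, r))                      # trainable, zero init so the update starts at 0

def lora_forward(x):
    """Frozen projection plus the scaled low-rank update (alpha / r) * B @ A."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(2, d))
# With B = 0 the adapted layer reproduces the frozen layer exactly
print(np.allclose(lora_forward(x), x @ W.T))

full_params = W.size            # parameters updated by full finetuning
lora_params = A.size + B.size   # parameters updated by LoRA
print(full_params, lora_params)
```

For one 768 x 768 matrix, LoRA with rank 4 trains 6,144 values instead of 589,824, which is where the savings come from.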

We are using a sentence classification model built on RoBERTa. RoBERTa is based on BERT (Bidirectional Encoder Representations from Transformers). In this exercise:

* We will set up the required libraries from Hugging Face to access the RoBERTa classification model
* We will use the RoBERTa model to classify some example movie reviews as positive or negative
* We will see that the pretrained model is generic and doesn't classify the movie reviews correctly
* We will use a new dataset from Stanford called the Stanford Sentiment Treebank
* We could of course develop a new model from scratch, but we can also use finetuning techniques like LoRA from the PEFT library to finetune the pretrained RoBERTa model
* We will convert the new dataset into tokens, feed them into the RoBERTa model and retrain a small portion of the model
* This will allow the model to learn the new statistical distribution of movie reviews
* We will rerun the previous movie review examples
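The "convert the dataset into tokens" step in the list above can be pictured with a toy sketch. It assumes a whitespace tokenizer with a made-up vocabulary, whereas the real RoBERTa tokenizer splits text into subword units with its own ids. What carries over is the idea: text becomes integer ids, overly long inputs are truncated, and shorter sequences in a batch are padded with a mask marking the real tokens.

```python
def toy_tokenize(sentence, vocab, max_length=6, truncation_side="left"):
    """Map words to integer ids, adding new words to `vocab` as they appear."""
    ids = [vocab.setdefault(word, len(vocab)) for word in sentence.lower().split()]
    if len(ids) > max_length:
        # "left" truncation drops the earliest tokens and keeps the ending
        ids = ids[-max_length:] if truncation_side == "left" else ids[:max_length]
    return ids

def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one in the batch and build a mask."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch_ids]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_ids]
    return input_ids, attention_mask

vocab = {"<pad>": 0}  # hypothetical vocabulary, grown on the fly
sentences = ["contains no wit", "a masterful blend of humor and heart"]
batch = [toy_tokenize(s, vocab) for s in sentences]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

Padding to the longest sequence in each batch, rather than to a fixed maximum, keeps batches small when most sentences are short.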

In [None]:
# setting up the transformers library from Hugging Face
!pip install datasets transformers peft evaluate
import numpy as np

from datasets import load_dataset, DatasetDict, Dataset

from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer)

from peft import PeftModel, PeftConfig, get_peft_model, LoraConfig
import evaluate, torch


In [None]:
# we are loading the stanford sentiment dataset
dataset = load_dataset("glue", "sst2")

# print the first 5 examples in the training dataset
dataset['train'][:5]


{'sentence': ['hide new secretions from the parental units ',
  'contains no wit , only labored gags ',
  'that loves its characters and communicates something rather beautiful about human nature ',
  'remains utterly satisfied to remain the same throughout ',
  'on the worst revenge-of-the-nerds clichés the filmmakers could dredge up '],
 'label': [0, 0, 1, 0, 0],
 'idx': [0, 1, 2, 3, 4]}

In [None]:
model_checkpoint = 'roberta-base'

id2label = {0: "Negative", 1: "Positive"}
label2id = {"Negative":0, "Positive":1}

# generating the new model to train from checkpoint using the hugging face transformers library
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, num_labels=2, id2label=id2label, label2id=label2id)

# creating a tokenizer using the autotokenizer function

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)

if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))


# since we are finetuning the pretrained model, we tokenize the Stanford Sentiment dataset for it
def tokenize_function(newData):
    # extract text
    text = newData["sentence"]

    #tokenize and truncate text
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=512
    )

    return tokenized_inputs

tokenized_dataset = dataset.map(tokenize_function, batched=True)


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Below are some example reviews for the pretrained RoBERTa model to classify. We will compare these results with the finetuned model's results later.

In [None]:
example_reviews = [ "A thrilling ride from start to finish, with superb performances and a plot that keeps you on the edge of your seat.",
                   "Lacking originality and depth, this film fails to leave a lasting impression.",
                    "Despite its stunning visuals, the film falls flat due to a predictable and uninspired storyline.",
                    "A masterful blend of humor and heart, making it a must-watch for fans of the genre.",
                    "The movie's pacing drags, but its powerful message and compelling characters redeem it."]

print("Pretrained model predictions:")
for review in example_reviews:
    inputs = tokenizer.encode(review, return_tensors="pt")
    logits = model(inputs).logits
    predictions = torch.argmax(logits)
    print(review + " - " + id2label[predictions.tolist()])


Pretrained model predictions:
A thrilling ride from start to finish, with superb performances and a plot that keeps you on the edge of your seat. - Negative
Lacking originality and depth, this film fails to leave a lasting impression. - Negative
Despite its stunning visuals, the film falls flat due to a predictable and uninspired storyline. - Negative
A masterful blend of humor and heart, making it a must-watch for fans of the genre. - Negative
The movie's pacing drags, but its powerful message and compelling characters redeem it. - Negative
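Every review receives the same label, which fits the earlier warning that the classifier weights are newly initialized. One way to see how little an untrained head "prefers" a class is to turn its logits into probabilities with a softmax. The logits below are hypothetical values chosen for illustration, not numbers taken from this run.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]  # shift for numerical stability
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical logits for [Negative, Positive] from an untrained head:
# the two scores are close, so neither class is strongly preferred
untrained_logits = [0.08, 0.05]
probs = softmax(untrained_logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```

Because `argmax` only reports the larger score, a near coin-flip like this still prints a definite label, so identical labels across all reviews say little about the reviews themselves.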


## As the example above shows, the base (pretrained) RoBERTa model fails to classify the reviews correctly: its classification head is newly initialized, so it labels every review Negative. Time to finetune the model.

In [None]:
# finetuning the pretrained model

peft_config = LoraConfig(task_type="SEQ_CLS",
                        r=4,
                        lora_alpha=32,
                        lora_dropout=0.01,
                        target_modules = ['query'])

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# setting variables/ hyperparameters to tune the model
lr = 1e-3
batch_size = 32
num_epochs = 2


trainable params: 665,858 || all params: 125,313,028 || trainable%: 0.5314


## As shown above, the model has about 125M parameters in total, but to finetune it on our new dataset we train only about 0.53% of them, roughly 665K parameters.
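The trainable count can be reproduced by hand. This sketch assumes roberta-base has a hidden size of 768 and 12 encoder layers, and that PEFT keeps the classification head (the `classifier.dense` and `classifier.out_proj` layers named in the earlier warning) trainable for the `SEQ_CLS` task. Under those assumptions the arithmetic matches the printed figure.

```python
hidden_size = 768   # assumed roberta-base hidden size
num_layers = 12     # assumed number of encoder layers
num_labels = 2
r = 4               # LoRA rank from the config above

# LoRA adds A (r x hidden) and B (hidden x r) to each layer's query projection
lora_params = num_layers * (r * hidden_size + hidden_size * r)

# Classification head: dense (hidden -> hidden) and out_proj (hidden -> labels), with biases
head_params = (hidden_size * hidden_size + hidden_size) + (hidden_size * num_labels + num_labels)

trainable = lora_params + head_params
print(lora_params, head_params, trainable)  # 73728 592130 665858
```

Most of the trainable parameters sit in the classification head; the LoRA adapters themselves account for only 73,728 values.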

In [None]:
# setting up trainer function
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
accuracy = evaluate.load("accuracy")
def metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=1)
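    # accuracy.compute already returns {"accuracy": value}; return it directly
    # so the Trainer receives a scalar instead of a nested dict it cannot log
    return accuracy.compute(predictions=predictions, references=labels)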



training_args = TrainingArguments(
    output_dir= model_checkpoint + "-lora-text-classification",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=metrics,
)

# train model
trainer.train()




Epoch,Training Loss,Validation Loss,Accuracy
1,0.2312,0.223814,{'accuracy': 0.9174311926605505}
2,0.2023,0.214968,{'accuracy': 0.9277522935779816}


Trainer is attempting to log a value of "{'accuracy': 0.9174311926605505}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.9277522935779816}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.


TrainOutput(global_step=4210, training_loss=0.23399433942418768, metrics={'train_runtime': 655.2928, 'train_samples_per_second': 205.554, 'train_steps_per_second': 6.425, 'total_flos': 2846046144957912.0, 'train_loss': 0.23399433942418768, 'epoch': 2.0})
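The numbers in this output can be cross-checked with a little arithmetic. The sketch assumes training ran on a single device without gradient accumulation, so one optimizer step consumes one batch of 32 examples.

```python
import math

train_examples = 67349   # size of the SST-2 train split shown earlier
eval_examples = 872      # size of the validation split
batch_size = 32
num_epochs = 2

# The last, partially filled batch still counts as a step
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 2105 4210

# Accuracy after epoch 2, converted back to a count of correct predictions
correct = round(0.9277522935779816 * eval_examples)
print(correct, eval_examples - correct)  # 809 63
```

The step count matches `global_step=4210`, and the final accuracy corresponds to 809 of the 872 validation sentences being classified correctly.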

In [None]:
model.to('cpu')

print("Finetuned model predictions:")

for review in example_reviews:
    inputs = tokenizer.encode(review, return_tensors="pt").to("cpu")
    logits = model(inputs).logits
    predictions = torch.max(logits,1).indices
    print(review + " - " + id2label[predictions.tolist()[0]])


Finetuned model predictions:
A thrilling ride from start to finish, with superb performances and a plot that keeps you on the edge of your seat. - Positive
Lacking originality and depth, this film fails to leave a lasting impression. - Negative
Despite its stunning visuals, the film falls flat due to a predictable and uninspired storyline. - Negative
A masterful blend of humor and heart, making it a must-watch for fans of the genre. - Positive
The movie's pacing drags, but its powerful message and compelling characters redeem it. - Positive


## As shown above, the finetuned model performs better on the example reviews. Now that we have finetuned a model, let's take a deeper dive into finetuning.