<a href="https://colab.research.google.com/github/saravanan-nj/notebooks/blob/main/qlora-tiny-llm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Install Requirements

In [None]:
!pip install -q accelerate==1.3.0 peft==0.14.0 bitsandbytes==0.45.2 transformers==4.48.3 trl==0.14.0 datasets==3.1.0 pretty_midi # fsspec==2024.10.0

In [2]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModel,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer, SFTConfig, DataCollatorForCompletionOnlyLM

In [3]:
# Load the training dataset
from google.colab import drive
import pandas as pd

df = pd.read_csv('/content/imdb/train_data.txt', delimiter=":::", names=["index", "title", "genre", "description"])

  df = pd.read_csv('/content/imdb/train_data.txt', delimiter=":::", names=["index", "title", "genre", "description"])


In [4]:
dataset_df = pd.DataFrame()
def get_text(row):
  return f"""
### Instruction:
Given a movie description, read the description, understand and analyse the story of the movie based on the given description and return the genre of the movie.

### Description:
{row["description"]}

### Genre:
{row["genre"].strip()}
<|endresponse>
"""
dataset_df["description"] = df["description"]
dataset_df["genre"] = df["genre"]
dataset_df["instruction"] = "Given a movie description, read the description, understand and analyse the story of the movie based on the given description and return the genre of the movie."
# dataset_df["text"] = dataset_df.apply(get_text, axis=1)

In [5]:
from datasets import Dataset
dataset = Dataset.from_pandas(dataset_df)

In [6]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

In [7]:
model_name = "crumb/nano-mistral"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [8]:
device_map = "auto"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

In [9]:
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

training_arguments = SFTConfig(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=0,
    logging_steps=100,
    learning_rate=1e-5,
    weight_decay=0.01,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="cosine",
    report_to="tensorboard",
    max_seq_length=256,
    packing=False
)

In [11]:
def print_tokens_with_ids(txt):
    tokens = tokenizer.tokenize(txt, add_special_tokens=False)
    token_ids = tokenizer.encode(txt, add_special_tokens=False)
    print(list(zip(tokens, token_ids)))


def formatted_prompts_func(example):
  output_texts = []
  for i in range(len(example["description"])):
    output_texts.append(f"""
### Instruction:
Given a movie description, read the description, understand and analyse the story of the movie based on the given description and return the genre of the movie.

### Description:
{example["description"][i]}

### Genre:
{example["genre"][i].strip()}
<|endresponse>
""")
  # print_tokens_with_ids(output_texts[0])
  return output_texts

response_template ="\n### Genre:"
collator = DataCollatorForCompletionOnlyLM(tokenizer.encode(response_template, add_special_tokens=False)[:2], tokenizer=tokenizer)


trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=training_arguments,
    processing_class=tokenizer,
    formatting_func=formatted_prompts_func,
    data_collator=collator
)
trainer.train()
trainer.model.save_pretrained("imdb-classifier")

Map:   0%|          | 0/54214 [00:00<?, ? examples/s]

Step,Training Loss
100,3.6947
200,3.7081
300,3.6328
400,3.545
500,3.4071
600,3.2821
700,3.1782
800,3.0684
900,2.9536
1000,2.8413


In [12]:
!zip -r imdb-classifier.zip results imdb-classifier

  adding: results/ (stored 0%)
  adding: results/runs/ (stored 0%)
  adding: results/runs/Feb11_16-39-00_07073f1bde0a/ (stored 0%)
  adding: results/runs/Feb11_16-39-00_07073f1bde0a/events.out.tfevents.1739292955.07073f1bde0a.745.8 (deflated 61%)
  adding: results/runs/Feb11_16-39-00_07073f1bde0a/events.out.tfevents.1739293047.07073f1bde0a.745.9 (deflated 61%)
  adding: results/runs/Feb11_16-39-00_07073f1bde0a/events.out.tfevents.1739292854.07073f1bde0a.745.6 (deflated 61%)
  adding: results/runs/Feb11_16-39-00_07073f1bde0a/events.out.tfevents.1739292259.07073f1bde0a.745.4 (deflated 61%)
  adding: results/runs/Feb11_16-39-00_07073f1bde0a/events.out.tfevents.1739292887.07073f1bde0a.745.7 (deflated 61%)
  adding: results/runs/Feb11_16-39-00_07073f1bde0a/events.out.tfevents.1739292554.07073f1bde0a.745.5 (deflated 61%)
  adding: results/runs/Feb11_16-39-00_07073f1bde0a/events.out.tfevents.1739291944.07073f1bde0a.745.2 (deflated 61%)
  adding: results/runs/Feb11_16-39-00_07073f1bde0a/events

In [16]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, '/content/imdb-classifier')
model = model.merge_and_unload()
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200, truncation=True)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
original_pipeline = pipeline(task="text-generation", model=base_model, tokenizer=tokenizer, max_length=200, truncation=True)

prompt_template = """
### Instruction:
Given a movie description, read the description, understand and analyse the story of the movie based on the given description and return the genre of the movie.

### Description:
{}

### Genre:
"""

Device set to use cuda:0
Device set to use cuda:0


In [17]:
movie_description = "Müzeyyen is a young woman who lives a drug-fuelled, chaotic life outside the norms of society. Her brother Ali interrupts his own successful, bourgeois life in Bolu, and goes to Antalya to meet with his recently- discovered sister. There, the legitimate child - Ali - will be tested through unguessed - at sins, while the illegitimate one - Müzeyyen - will be tempted in turn by Ali's hope."
print(len(movie_description))
movie_input = prompt_template.format(movie_description)
result = pipe(movie_input)
original_result = original_pipeline(movie_input)
print(result[0]["generated_text"])
print("*" * 60)
print(original_result[0]["generated_text"])

390

### Instruction:
Given a movie description, read the description, understand and analyse the story of the movie based on the given description and return the genre of the movie.

### Description:
Müzeyyen is a young woman who lives a drug-fuelled, chaotic life outside the norms of society. Her brother Ali interrupts his own successful, bourgeois life in Bolu, and goes to Antalya to meet with his recently- discovered sister. There, the legitimate child - Ali - will be tested through unguessed - at sins, while the illegitimate one - Müzeyyen - will be tempted in turn by Ali's hope.

### Genre:
comedy
<|endresponse>
<|endresponse>
<|endresponse>
<|endresponse>
<|endresponse>
<|endresponse>
<|end
************************************************************

### Instruction:
Given a movie description, read the description, understand and analyse the story of the movie based on the given description and return the genre of the movie.

### Description:
Müzeyyen is a young woman who lives

In [18]:
movie_description = "Shy bookworm Tommy vacations in Palm Springs, never expecting an amorous adventure. But once he spots Brendan, he can't stop fantasizing about being with the sensuous, dark-haired man. However, Brendan's friends have their own ideas, and continue to pull him into their orgy of alcohol and sex. Finally alone at the pool, Brendan approaches Tommy and they spend an afternoon exploring the desert town and each other. Just as Tommy saves Brendan from a poolside accident, Brendan's friends come back - and try to force the two apart. True love conquers all, as Tommy's vacation climaxes in a passionate declaration of love and desire. LOVE INN EXILE boasts several unforgettable erotic scenes - including the strangely tender, drunken ménage a trois between Brendan's friends, a spicy tequila shot off Brendan's shoulder, and, of course, Tommy's ultimate fulfillment by his ideal man. LOVE INN EXILE is filmed entirely on location at Inn Exile in Palm Springs. With music by adult film star Sharon Kane, this video makes exile seem like a perfectly good punishment!"
movie_input = prompt_template.format(movie_description[:450])
result = pipe(movie_input)
original_result = original_pipeline(movie_input)
print(result[0]["generated_text"])
print("*" * 60)
print(original_result[0]["generated_text"])


### Instruction:
Given a movie description, read the description, understand and analyse the story of the movie based on the given description and return the genre of the movie.

### Description:
Shy bookworm Tommy vacations in Palm Springs, never expecting an amorous adventure. But once he spots Brendan, he can't stop fantasizing about being with the sensuous, dark-haired man. However, Brendan's friends have their own ideas, and continue to pull him into their orgy of alcohol and sex. Finally alone at the pool, Brendan approaches Tommy and they spend an afternoon exploring the desert town and each other. Just as Tommy saves Brendan from 

### Genre:
comedy
<|endresponse>
  | 1
<|endresponse>
  | 2
<|endresponse>
  | 3
<|endresponse>
 
************************************************************

### Instruction:
Given a movie description, read the description, understand and analyse the story of the movie based on the given description and return the genre of the movie.

### Descript