<a href="https://colab.research.google.com/github/saravanan-nj/notebooks/blob/main/qlora-tiny-llm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Install Requirements

In [3]:
!pip install -q accelerate==1.3.0 peft==0.14.0 bitsandbytes==0.45.2 transformers==4.48.3 trl==0.14.0 datasets==3.1.0 pretty_midi # fsspec==2024.10.0

In [4]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModel,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer, SFTConfig, DataCollatorForCompletionOnlyLM

In [6]:
# Load the training dataset
from google.colab import drive
import pandas as pd

# A multi-character separator is handled by the Python parser engine; requesting it explicitly avoids a ParserWarning.
df = pd.read_csv('/content/imdb/train_data.txt', delimiter=":::", names=["index", "title", "genre", "description"], engine="python")
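
If the fields in `train_data.txt` are padded with spaces around `:::` (an assumption; the file contents are not shown in this notebook), splitting on `:::` alone leaves that padding on every value. The sketch below shows one way to split and strip a record in plain Python. `parse_record` is a hypothetical helper and the sample line is made up for illustration.

```python
def parse_record(line, sep=":::"):
    """Split one raw line into stripped index/title/genre/description fields."""
    # maxsplit=3 keeps any later ':::' inside the description intact.
    parts = [part.strip() for part in line.rstrip("\n").split(sep, 3)]
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}")
    index, title, genre, description = parts
    return {
        "index": int(index),
        "title": title,
        "genre": genre,
        "description": description,
    }


# Hypothetical sample line, not taken from the dataset.
sample = "1 ::: Example Movie (2001) ::: drama ::: A short plot summary."
record = parse_record(sample)
print(record["genre"])
```

With pandas, a similar effect could be obtained by calling `.str.strip()` on the string columns after loading.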


In [7]:
dataset_df = pd.DataFrame()
def get_text(row):
  return f"""
### Instruction:
Given a movie description, read the description, understand and analyse the story of the movie based on the given description and return the genre of the movie.

### Description:
{row["description"]}

### Genre:
{row["genre"]}
<|endresponse>
"""
dataset_df["description"] = df["description"]
dataset_df["genre"] = df["genre"]
dataset_df["instruction"] = "Given a movie description, read the description, understand and analyse the story of the movie based on the given description and return the genre of the movie."
# dataset_df["text"] = dataset_df.apply(get_text, axis=1)
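
The same prompt layout appears three times in this notebook: in `get_text`, in the formatting function passed to the trainer, and in the inference template. A minimal sketch of a single builder that could serve both training and inference is shown below. `build_prompt`, `INSTRUCTION`, and `RESPONSE_END` are hypothetical names introduced here; the layout itself is copied from the cells in this notebook.

```python
INSTRUCTION = (
    "Given a movie description, read the description, understand and analyse "
    "the story of the movie based on the given description and return the "
    "genre of the movie."
)
RESPONSE_END = "<|endresponse>"  # end marker exactly as written in the notebook


def build_prompt(description, genre=None):
    """Return the inference prompt, or the full training text when genre is given."""
    prompt = (
        f"\n### Instruction:\n{INSTRUCTION}\n\n"
        f"### Description:\n{description}\n\n"
        f"### Genre:\n"
    )
    if genre is not None:
        # Training text: append the label and the end marker.
        prompt += f"{genre}\n{RESPONSE_END}\n"
    return prompt


train_text = build_prompt("A short plot summary.", "drama")
infer_text = build_prompt("A short plot summary.")
print(train_text.startswith(infer_text))
```

Building both strings from one function keeps the training and inference prompts from drifting apart when the wording is edited.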

In [8]:
from datasets import Dataset
dataset = Dataset.from_pandas(dataset_df)

In [9]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

In [10]:
model_name = "crumb/nano-mistral"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


In [12]:
device_map = "auto"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

RuntimeError: CUDA is required but not available for bitsandbytes. Please consider installing the multi-platform enabled version of bitsandbytes, which is currently a work in progress. Please check currently supported platforms and installation instructions at https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend
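
The error above indicates that no CUDA device was available when the 4-bit model was loaded; in Colab this usually means the runtime has no GPU attached. A minimal sketch of a guard that only requests quantization when CUDA is present follows. `build_load_kwargs` is a hypothetical helper, and falling back to an unquantized CPU load is an assumption about what is acceptable for this model size.

```python
def build_load_kwargs(cuda_available, quantization_config=None):
    """Choose from_pretrained keyword arguments based on CUDA availability."""
    if cuda_available and quantization_config is not None:
        # bitsandbytes 4-bit loading needs a CUDA device.
        return {"quantization_config": quantization_config, "device_map": "auto"}
    # No GPU: skip quantization and let the model load with default settings.
    return {}


# Intended use in the notebook (shown as comments so this block runs without torch):
#   kwargs = build_load_kwargs(torch.cuda.is_available(), bnb_config)
#   model = AutoModelForCausalLM.from_pretrained(model_name, **kwargs)
print(build_load_kwargs(False, quantization_config="placeholder"))
```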

In [56]:
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

training_arguments = SFTConfig(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=0,
    logging_steps=100,
    learning_rate=1e-5,
    weight_decay=0.01,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="cosine",
    report_to="tensorboard",
    max_seq_length=512,
    packing=False
)
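
As a sanity check on these settings, the step count can be worked out by hand. The `Map` progress line in this notebook reports 9,560 examples, and the zip listing shows `checkpoint-2390`, which matches one epoch at batch size 4. The sketch below reproduces that arithmetic. `optimizer_steps` is a hypothetical helper, and the rounding rules (last partial batch kept, warmup rounded up) are assumptions about how the trainer counts steps.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                    num_epochs, num_devices=1):
    """Approximate number of optimizer updates for a full training run."""
    # Assumes the last partial batch is kept rather than dropped.
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return steps_per_epoch * num_epochs


total_steps = optimizer_steps(9560, per_device_batch_size=4, grad_accum_steps=1, num_epochs=1)
# warmup_ratio=0.03, assumed to be rounded up to a whole step.
warmup_steps = math.ceil(total_steps * 0.03)
print(total_steps, warmup_steps)
```

Under these assumptions the run takes 2,390 optimizer steps, of which roughly the first 72 are warmup.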

In [57]:
def formatting_prompts_func(example):
  output_texts = []
  for i in range(len(example["description"])):
    output_texts.append(f"""
### Instruction:
Given a movie description, read the description, understand and analyse the story of the movie based on the given description and return the genre of the movie.

### Description:
{example["description"][i]}

### Genre:
{example["genre"][i]}
<|endresponse>
""")
  return output_texts

response_template = " ### Genre:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)


trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=training_arguments,
    processing_class=tokenizer,
    formatting_func=formatting_prompts_func,
    data_collator=collator
)
trainer.train()
trainer.model.save_pretrained("imdb-classifier")

Map:   0%|          | 0/9560 [00:00<?, ? examples/s]

| Step | Training Loss |
|------|---------------|
| 100  | 3.6651 |
| 200  | 3.5978 |
| 300  | 3.5030 |
| 400  | 3.3140 |
| 500  | 3.2470 |
| 600  | 3.1531 |
| 700  | 3.0156 |
| 800  | 2.9398 |
| 900  | 2.8529 |
| 1000 | 2.7954 |


In [58]:
!zip -r imdb-classifier.zip results imdb-classifier

updating: results/ (stored 0%)
updating: results/checkpoint-2390/ (stored 0%)
updating: results/checkpoint-2390/adapter_model.safetensors (deflated 7%)
updating: results/checkpoint-2390/README.md (deflated 66%)
updating: results/checkpoint-2390/training_args.bin (deflated 51%)
updating: results/checkpoint-2390/trainer_state.json (deflated 73%)
updating: results/checkpoint-2390/special_tokens_map.json (deflated 73%)
updating: results/checkpoint-2390/optimizer.pt (deflated 8%)
updating: results/checkpoint-2390/adapter_config.json (deflated 54%)
updating: results/checkpoint-2390/rng_state.pth (deflated 25%)
updating: results/checkpoint-2390/scheduler.pt (deflated 56%)
updating: results/checkpoint-2390/tokenizer.json (deflated 85%)
updating: results/checkpoint-2390/tokenizer.model (deflated 55%)
updating: results/checkpoint-2390/tokenizer_config.json (deflated 69%)
updating: results/runs/ (stored 0%)
updating: results/runs/Feb09_13-18-11_41e14a874267/ (stored 0%)
updating: results/runs/Feb

In [62]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

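base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
# PEFT injects the adapter into base_model in place, and merge_and_unload() returns that
# same module with the LoRA weights merged in, so base_model no longer holds the original weights.
model = PeftModel.from_pretrained(base_model, '/content/imdb-classifier')
model = model.merge_and_unload()
# Load a second, untouched copy of the base model so the comparison is against the original weights.
original_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200, truncation=True)
original_pipeline = pipeline(task="text-generation", model=original_model, tokenizer=tokenizer, max_length=200, truncation=True)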

prompt_template = """
### Instruction:
Given a movie description, read the description, understand and analyse the story of the movie based on the given description and return the genre of the movie.

### Description:
{}

### Genre:
"""

Device set to use cuda:0
Device set to use cuda:0


In [65]:
movie_description = "Müzeyyen is a young woman who lives a drug-fuelled, chaotic life outside the norms of society. Her brother Ali interrupts his own successful, bourgeois life in Bolu, and goes to Antalya to meet with his recently- discovered sister. There, the legitimate child - Ali - will be tested through unguessed - at sins, while the illegitimate one - Müzeyyen - will be tempted in turn by Ali's hope."
print(len(movie_description))
movie_input = prompt_template.format(movie_description)
result = pipe(movie_input)
original_result = original_pipeline(movie_input)
print(result[0]["generated_text"])
print("*" * 60)
print(original_result[0]["generated_text"])

390

### Instruction:
Given a movie description, read the description, understand and analyse the story of the movie based on the given description and return the genre of the movie.

### Description:
Müzeyyen is a young woman who lives a drug-fuelled, chaotic life outside the norms of society. Her brother Ali interrupts his own successful, bourgeois life in Bolu, and goes to Antalya to meet with his recently- discovered sister. There, the legitimate child - Ali - will be tested through unguessed - at sins, while the illegitimate one - Müzeyyen - will be tempted in turn by Ali's hope.

### Genre:
 fiction

### Genre:
 fiction

### Genre:
 fiction

### Genre:
 fiction

### Genre:
 fiction

### Genre:
 fiction

************************************************************

### Instruction:
Given a movie description, read the description, understand and analyse the story of the movie based on the given description and return the genre of the movie.

### Description:
Müzeyyen is a young wom
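
The fine-tuned model's output in this notebook repeats the `### Genre:` block until the length limit is reached, so the prediction has to be pulled out of the generated text. A minimal sketch of that post-processing step follows. `extract_genre` is a hypothetical helper, and the sample string is a shortened stand-in for `result[0]["generated_text"]`.

```python
def extract_genre(generated_text, marker="### Genre:"):
    """Return the first non-empty line after the first genre marker, or None."""
    _, found, tail = generated_text.partition(marker)
    if not found:
        return None
    for line in tail.splitlines():
        line = line.strip()
        # Stop if another section or the end marker appears before any label.
        if line.startswith("###") or line.startswith("<|endresponse"):
            return None
        if line:
            return line
    return None


# Shortened stand-in for the pipeline output shown in this notebook.
sample_output = (
    "\n### Description:\nSome plot.\n\n"
    "### Genre:\n fiction\n\n### Genre:\n fiction\n"
)
print(extract_genre(sample_output))
```

Another option, not used in this notebook, would be to cap generation with the `max_new_tokens` argument so that only the first label is produced.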

In [64]:
movie_description = "Shy bookworm Tommy vacations in Palm Springs, never expecting an amorous adventure. But once he spots Brendan, he can't stop fantasizing about being with the sensuous, dark-haired man. However, Brendan's friends have their own ideas, and continue to pull him into their orgy of alcohol and sex. Finally alone at the pool, Brendan approaches Tommy and they spend an afternoon exploring the desert town and each other. Just as Tommy saves Brendan from a poolside accident, Brendan's friends come back - and try to force the two apart. True love conquers all, as Tommy's vacation climaxes in a passionate declaration of love and desire. LOVE INN EXILE boasts several unforgettable erotic scenes - including the strangely tender, drunken ménage a trois between Brendan's friends, a spicy tequila shot off Brendan's shoulder, and, of course, Tommy's ultimate fulfillment by his ideal man. LOVE INN EXILE is filmed entirely on location at Inn Exile in Palm Springs. With music by adult film star Sharon Kane, this video makes exile seem like a perfectly good punishment!"
movie_input = prompt_template.format(movie_description[:450])
result = pipe(movie_input)
original_result = original_pipeline(movie_input)
print(result[0]["generated_text"])
print("*" * 60)
print(original_result[0]["generated_text"])


### Instruction:
Given a movie description, read the description, understand and analyse the story of the movie based on the given description and return the genre of the movie.

### Description:
Shy bookworm Tommy vacations in Palm Springs, never expecting an amorous adventure. But once he spots Brendan, he can't stop fantasizing about being with the sensuous, dark-haired man. However, Brendan's friends have their own ideas, and continue to pull him into their orgy of alcohol and sex. Finally alone at the pool, Brendan approaches Tommy and they spend an afternoon exploring the desert town and each other. Just as Tommy saves Brendan from 

### Genre:
 romance

### Genre:
 romance

### Genre:
 romance

### Genre:
 romance

### Genre:
 romance

### Genre:
 romance


************************************************************

### Instruction:
Given a movie description, read the description, understand and analyse the story of the movie based on the given description and return the genr