<a href="https://www.kaggle.com/code/golammostofas/finetune-llama3-2-3b-instruction-model?scriptVersionId=213393073" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Install dependencies

In [1]:
%%capture
%pip install -U transformers 
%pip install -U datasets 
%pip install -U accelerate 
%pip install -U peft 
%pip install trl==0.12.2
%pip install -U bitsandbytes 
%pip install -U wandb

# Accessing the Llama 3.2 Lightweight Models

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, TextStreamer
import torch


base_model = "/kaggle/input/llama-3.2/transformers/3b-instruct/1"

tokenizer = AutoTokenizer.from_pretrained(base_model)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    return_dict=True,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)


Set `pad_token_id` to the end-of-sequence token ID when it is missing, so that generation does not emit padding warnings.

In [3]:
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id
if model.config.pad_token_id is None:
    model.config.pad_token_id = model.config.eos_token_id
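To illustrate why a pad token ID is needed at all, here is a minimal plain-Python sketch of batch padding. It assumes right-padding; `pad_batch` is a hypothetical helper (not a `transformers` API) and the ID `9` is a placeholder standing in for the EOS token ID.

```python
def pad_batch(sequences, pad_id):
    """Right-pad lists of token IDs to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Padded positions hold pad_id and are masked out with 0
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


EOS_ID = 9  # placeholder value, reused as the pad ID as in the cell above
ids, mask = pad_batch([[1, 2, 3], [4]], pad_id=EOS_ID)
print(ids)
print(mask)
```

Without a pad ID there is nothing to fill the shorter rows with, which is why the tokenizer and model config fall back to the EOS token here.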

In [4]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Markdown output helper

In [5]:
from IPython.display import Markdown, display



def _output(text):
    display(Markdown(text))

In [6]:
messages = [{"role": "user", "content": "Who is Vincent van Gogh?"}]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

outputs = pipe(prompt, max_new_tokens=120, do_sample=True)

_output(outputs[0]["generated_text"].split('<|start_header_id|>assistant<|end_header_id|>')[1])



Vincent van Gogh (1853-1890) was a Dutch post-impressionist artist, widely considered one of the greatest painters in history. He is famous for his bold, expressive, and emotionally charged works of art that captured the beauty of the natural world and the human experience.

**Early Life and Career**

Van Gogh was born in Groot-Zundert, Netherlands, to a Protestant pastor's family. He was the eldest of six children, and his father's strict upbringing and expectations weighed heavily on him. Van Gogh struggled with mental health issues, including depression and anxiety
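The cell above recovers the reply by splitting the generated text on the assistant header. A slightly more defensive variant could look like the sketch below; `extract_reply` is a hypothetical helper name, and the sample string is made up for illustration.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"


def extract_reply(generated_text, header=ASSISTANT_HEADER):
    """Return the text after the last assistant header, or the whole text if the header is absent."""
    _, found, reply = generated_text.rpartition(header)
    return reply.strip() if found else generated_text.strip()


sample = (
    "<|start_header_id|>user<|end_header_id|>\n\nWho is Vincent van Gogh?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\nA Dutch painter."
)
print(extract_reply(sample))
```

Using `rpartition` avoids an `IndexError` when the header is missing, which indexing the result of `split(...)` with `[1]` would raise.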

# Fine-tuning Llama 3.2 3B Instruct

In [7]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import (
    LoraConfig,
    PeftModel,
    prepare_model_for_kbit_training,
    get_peft_model,
)
import os, torch, wandb
from datasets import load_dataset
from trl import SFTTrainer, setup_chat_format

# Hugging Face login

In [8]:
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("HUGGINGFACE_WRITE_MODE_TOKEN")
login(token = hf_token)

# Weights & Biases login

Log in to Weights & Biases with the API key and start a new project for this run. [Get an API key](https://wandb.ai/)

In [9]:
wb_token = user_secrets.get_secret("wandb_api_key")

wandb.login(key=wb_token)
run = wandb.init(
    project='Fine-tune Llama 3.2', 
    job_type="training", 
    anonymous="allow"
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mgolammostofa10001[0m ([33mgolammostofa10001-dhaka-university-of-engineernig-and-te[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Tracking run with wandb version 0.19.1
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/kaggle/working/wandb/run-20241216_175034-lr98c8c4[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mdutiful-wave-10[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/golammostofa10001-dhaka-university-of-engineernig-and-te/Fine-tune%20Llama%203.2[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/golammostofa10001-dhaka-university-of-engineernig-and-te/Fine-tune%20Llama%203.2/runs/lr98c

In [10]:
base_model = "/kaggle/input/llama-3.2/transformers/3b-instruct/1"
new_model = "llama-3.2-3b-it-Ecommerce-ChatBot"
dataset_name = "bitext/Bitext-customer-support-llm-chatbot-training-dataset"

# Loading the model and tokenizer

In [11]:
# Set torch dtype and attention implementation
if torch.cuda.get_device_capability()[0] >= 8:
    !pip install -qqq flash-attn
    torch_dtype = torch.bfloat16
    attn_implementation = "flash_attention_2"
else:
    torch_dtype = torch.float16
    attn_implementation = "eager"
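The branch above keys on the major CUDA compute capability: GPUs at 8.x or newer (Ampere onward) get bfloat16 with FlashAttention-2, older ones fall back to float16 with eager attention. The same decision as a pure function, so it can be checked without a GPU; `select_precision` is a hypothetical helper and the dtype is returned as a string rather than a `torch` dtype.

```python
def select_precision(compute_capability_major):
    """Mirror the branch above: bf16 + FlashAttention-2 on capability >= 8, else fp16 + eager."""
    if compute_capability_major >= 8:
        return "bfloat16", "flash_attention_2"
    return "float16", "eager"


# A T4 reports compute capability 7.5, an A100 reports 8.0
for gpu, major in [("T4", 7), ("A100", 8)]:
    print(gpu, select_precision(major))
```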

In [12]:
# QLoRA config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    bnb_4bit_use_double_quant=True,
)
# Load model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation=attn_implementation
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)


In [13]:
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id
if model.config.pad_token_id is None:
    model.config.pad_token_id = model.config.eos_token_id
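To get a feel for what the 4-bit configuration above buys, the following back-of-the-envelope sketch compares weight storage at 16 and 4 bits per parameter. The parameter count of roughly 3.2 billion is an assumption, not a figure from this notebook, and the estimate ignores quantization constants, activations, optimizer state, and any layers kept in higher precision.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate storage for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


N_PARAMS = 3.2e9  # assumption: approximate size of a "3B" model

for label, bits in [("float16", 16), ("4-bit NF4", 4)]:
    print(f"{label}: {weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```

The ratio is what matters here: 4-bit storage needs about a quarter of the float16 footprint, which is what makes fine-tuning feasible on a single small GPU.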

# Loading and processing the dataset

In [14]:
# Load the dataset and keep a small random subset
dataset = load_dataset(dataset_name, split="train")
dataset = dataset.shuffle(seed=65).select(range(1000))  # Only use 1000 samples for a quick demo

# System prompt prepended to every conversation
instruction = """You are a top-rated customer service agent named John. 
    Be polite to customers and answer all their questions.
    """

def format_chat_template(row):
    # Build a system/user/assistant conversation and render it with the tokenizer's chat template
    row_json = [{"role": "system", "content": instruction},
                {"role": "user", "content": row["instruction"]},
                {"role": "assistant", "content": row["response"]}]
    row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)
    return row

# Add the rendered conversation as a new "text" column
dataset = dataset.map(
    format_chat_template,
    num_proc=4,
)


README.md:   0%|          | 0.00/11.9k [00:00<?, ?B/s]

(…)t_Training_Dataset_27K_responses-v11.csv:   0%|          | 0.00/19.2M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/26872 [00:00<?, ? examples/s]


Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

In [15]:
dataset['text'][3]

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a top-rated customer service agent named John. \n    Be polite to customers and answer all their questions.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\ncould you tell me about the options for shipping?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nOf course, I'd be delighted to provide you with information about our shipping options! Here are the various choices we offer:\n\n1. Standard Shipping: This option typically arrives within {{Date Range}} business days, catering to non-urgent items and ensuring a cost-effective delivery.\n\n2. Expedited Shipping: If you're looking for a faster option, choose expedited shipping. Your items will reach you within {{Date Range}} business days, offering a balance between speed and affordability.\n\n3. Overnight Shipping: For urgent needs, we have overnight shipping. This ensures your items are delivered on the next business day, offering the highest level o

In [16]:
dataset

Dataset({
    features: ['flags', 'instruction', 'category', 'intent', 'response', 'text'],
    num_rows: 1000
})
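The `text` column holds each conversation rendered as a single string with header and end-of-turn markers, as in the sample printed earlier. The sketch below reproduces only that visible layout in plain Python. It is a simplification: the real template is stored with the tokenizer and may add more, and `format_llama3_chat` is a hypothetical helper name.

```python
def format_llama3_chat(messages, add_generation_prompt=False):
    """Render messages in the header/eot layout seen in the sample above (simplified)."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        # Surrounding whitespace is trimmed, matching the printed sample
        text += f"{message['content'].strip()}<|eot_id|>"
    if add_generation_prompt:
        # An open assistant header tells the model to start its reply
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


demo = [
    {"role": "system", "content": "Be polite.\n    "},
    {"role": "user", "content": "Where is my order?"},
]
print(format_llama3_chat(demo, add_generation_prompt=True))
```

For training rows the assistant turn is included and no generation prompt is appended; for inference the conversation stops at the user turn and the open assistant header is added.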

In [17]:
# Split the dataset into train (80%) and test (20%)
train_test_split = dataset.train_test_split(test_size=0.2, seed=42)

# Access train and test sets
train_dataset = train_test_split["train"]
test_dataset = train_test_split["test"]

In [18]:
train_dataset

Dataset({
    features: ['flags', 'instruction', 'category', 'intent', 'response', 'text'],
    num_rows: 800
})
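Conceptually, a seeded 80/20 split shuffles the row indices once and cuts the list in two. The sketch below shows that idea with the standard library only. It does not reproduce the exact shuffling used by `datasets`, so the selected rows will differ; `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_rows, test_size, seed):
    """Shuffle row indices with a fixed seed and split them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(1000, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split reproducible across runs, which keeps validation losses comparable between experiments.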

# Setting up the model

In [19]:
import bitsandbytes as bnb

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if 'lm_head' in lora_module_names:  # needed for 16 bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)

modules = find_all_linear_names(model)
modules

['up_proj', 'k_proj', 'gate_proj', 'v_proj', 'o_proj', 'down_proj', 'q_proj']
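The helper above boils down to taking the last component of each dotted module path and dropping `lm_head`. A minimal sketch of that step on plain strings, without loading a model; the module paths are illustrative examples and `leaf_names` is a hypothetical helper.

```python
def leaf_names(module_names, exclude=("lm_head",)):
    """Collect the last path component of each module name, minus excluded names."""
    leaves = {name.split(".")[-1] for name in module_names}
    return sorted(leaves - set(exclude))


example_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.1.mlp.down_proj",
    "lm_head",
]
print(leaf_names(example_names))
```

Because LoRA targets are matched by name suffix, one entry such as `q_proj` covers that projection in every decoder layer.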

In [20]:
# LoRA config
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=modules
)

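# Clear any existing chat template so that setup_chat_format can install its own
if hasattr(tokenizer, "chat_template") and tokenizer.chat_template is not None:
    tokenizer.chat_template = None

# Apply the ChatML format: this adds its special tokens and resizes the embeddings.
# Note: the "text" column was already rendered with the original Llama 3 template in cell 14.
model, tokenizer = setup_chat_format(model, tokenizer)
model = get_peft_model(model, peft_config)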

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


In [21]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128258, 3072)
        (layers): ModuleList(
          (0-27): 28 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=3072, out_features=3072, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=3072, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=3072, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lor
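The printed `q_proj` shows how small the LoRA update is relative to the frozen layer: two thin matrices of rank 16 sit beside a 3072 x 3072 base weight. The arithmetic below uses only the shapes from the printout and the `r` and `lora_alpha` values from the config; `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters in lora_A (in_features x r) plus lora_B (r x out_features)."""
    return in_features * r + r * out_features


in_f, out_f, r, alpha = 3072, 3072, 16, 32
base = in_f * out_f
added = lora_param_count(in_f, out_f, r)

print(f"frozen base weights:    {base:,}")
print(f"trainable LoRA weights: {added:,} ({added / base:.2%} of base)")
print(f"update scaling (lora_alpha / r): {alpha / r}")
```

Only the `lora_A` and `lora_B` matrices receive gradients; the 4-bit base layer stays frozen.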

In [22]:

# Hyperparameters
training_arguments = TrainingArguments(
    output_dir=new_model,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    num_train_epochs=1,
    eval_strategy="steps",
    eval_steps=0.2,
    logging_steps=1,
    warmup_steps=10,
    logging_strategy="steps",
    learning_rate=2e-4,
    fp16=False,
    bf16=False,
    group_by_length=True,
    report_to="wandb"
)
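Several of these arguments interact: the batch size and gradient accumulation determine the number of optimizer steps, a fractional `eval_steps` is interpreted as a share of those steps, and `warmup_steps` shapes the learning rate. The sketch below works through the numbers for 800 training samples. It assumes a single training process and the default linear scheduler, and both helper functions are hypothetical names written for illustration.

```python
import math


def training_schedule(n_samples, per_device_batch, grad_accum, epochs, eval_ratio):
    """Return (total optimizer steps, evaluation interval in steps)."""
    batches_per_epoch = math.ceil(n_samples / per_device_batch)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_steps = steps_per_epoch * epochs
    return total_steps, math.ceil(total_steps * eval_ratio)


def linear_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


total, eval_every = training_schedule(800, per_device_batch=1, grad_accum=2, epochs=1, eval_ratio=0.2)
print(f"optimizer steps: {total}, evaluate every {eval_every} steps")

for step in (0, 5, 10, 200, 400):
    print(f"step {step:>3}: lr = {linear_lr(step, 2e-4, 10, total):.2e}")
```

With these settings each optimizer step sees an effective batch of 2 samples, and evaluation runs five times over the epoch.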

In [23]:
# Set up the supervised fine-tuning (SFT) trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    peft_config=peft_config,
    max_seq_length=512,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/800 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [24]:
trainer.train()



| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 80   | 0.8559        | 0.831459        |
| 160  | 0.6107        | 0.752098        |
| 240  | 0.6612        | 0.703468        |
| 320  | 0.6914        | 0.669036        |
| 400  | 0.8356        | 0.653162        |


Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.


TrainOutput(global_step=400, training_loss=0.7855093836784363, metrics={'train_runtime': 534.7463, 'train_samples_per_second': 1.496, 'train_steps_per_second': 0.748, 'total_flos': 2453385587226624.0, 'train_loss': 0.7855093836784363, 'epoch': 1.0})
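The reported numbers can be cross-checked with a few lines of arithmetic. Assuming the validation loss is a mean per-token cross-entropy in nats, its exponential is the perplexity, and the throughput figures follow from the runtime. The values below are copied from the output above; `perplexity` is a hypothetical helper.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


# Validation losses logged during training
eval_losses = {80: 0.831459, 160: 0.752098, 240: 0.703468, 320: 0.669036, 400: 0.653162}
for step, loss in eval_losses.items():
    print(f"step {step}: eval loss {loss:.4f} -> perplexity {perplexity(loss):.3f}")

# Throughput recomputed from train_runtime, 800 samples and 400 steps
train_runtime = 534.7463
print(f"samples/s: {800 / train_runtime:.3f}, steps/s: {400 / train_runtime:.3f}")
```

The validation loss falls at every evaluation, so one epoch on 800 samples had not yet saturated.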

In [25]:
wandb.finish()

[34m[1mwandb[0m:                                                                                
[34m[1mwandb[0m: 
[34m[1mwandb[0m: Run history:
[34m[1mwandb[0m:               eval/loss █▅▃▂▁
[34m[1mwandb[0m:            eval/runtime █▆▅▆▁
[34m[1mwandb[0m: eval/samples_per_second ▁▃▄▃█
[34m[1mwandb[0m:   eval/steps_per_second ▁▃▄▃█
[34m[1mwandb[0m:             train/epoch ▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇███
[34m[1mwandb[0m:       train/global_step ▁▁▁▁▁▂▂▂▃▃▃▃▃▃▃▃▃▃▃▄▄▄▄▄▄▅▅▆▆▆▆▆▆▆▆▇▇▇██
[34m[1mwandb[0m:         train/grad_norm ▇▅▆▆█▄▅▅▆▆▄▆▄▄▆▃▄▄▅▃▄▃▅▂▃▄▄▄▄█▄▄▆▆▁▄▄▃▃▄
[34m[1mwandb[0m:     train/learning_rate ▃████▇▇▇▇▇▆▆▆▅▅▅▄▄▄▄▄▄▄▄▄▃▃▃▃▃▂▂▂▂▂▁▁▁▁▁
[34m[1mwandb[0m:              train/loss █▅▃▅▄▃▃▃▂▂▂▃▂▃▂▃▃▂▃▁▂▃▁▁▁▂▁▂▂▂▂▃▃▂▂▁▁▁▁▃
[34m[1mwandb[0m: 
[34m[1mwandb[0m: Run summary:
[34m[1mwandb[0m:                eval/loss 0.65316
[34m[1mwandb[0m:             eval/runtime 37.2283
[34m[1mwandb[0m:  eval/samples_per_second 5.372
[34m[1mw

# Model Inference

In [26]:
messages = [{"role": "system", "content": instruction},
    {"role": "user", "content": "could you tell me about the options for shipping?"}]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    
inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=150, num_return_sequences=1)

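# Decode only the newly generated tokens so the prompt is not echoed back
# and a reply containing the word "assistant" is not cut short
generated = outputs[0][inputs["input_ids"].shape[1]:]
text = tokenizer.decode(generated, skip_special_tokens=True)

_output(text)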


We understand your curiosity about the various shipping options we offer. Our goal is to provide you with a seamless and convenient experience. Here are the details of our shipping options:

- Standard Shipping: This option typically takes {{Date Range}} business days for delivery. It's a reliable choice for those who value a hassle-free experience.

- Expedited Shipping: If you need your items sooner, we offer expedited shipping with delivery in {{Date Range}} business days. This is ideal for urgent purchases or if you're in a rush.

- Overnight Shipping: For an even faster delivery, we provide overnight shipping, ensuring you receive your items by {{Date}}. This is perfect for critical purchases or if you need your items urgently.

- In-Store Pickup

# Saving and uploading the model

In [27]:
# Save the fine-tuned adapter locally, then upload it to the Hugging Face Hub
trainer.model.save_pretrained(new_model)
trainer.model.push_to_hub(new_model, use_temp_dir=False)



README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/1.67G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/MdGolamMostofa/llama-3.2-3b-it-Ecommerce-ChatBot/commit/33514cff2656ffad39a9c924fda8c8db73018c3d', commit_message='Upload model', commit_description='', oid='33514cff2656ffad39a9c924fda8c8db73018c3d', pr_url=None, repo_url=RepoUrl('https://huggingface.co/MdGolamMostofa/llama-3.2-3b-it-Ecommerce-ChatBot', endpoint='https://huggingface.co', repo_type='model', repo_id='MdGolamMostofa/llama-3.2-3b-it-Ecommerce-ChatBot'), pr_revision=None, pr_num=None)

# Merging and Exporting Fine-tuned Llama 3.2 

[Next Notebook](https://www.kaggle.com/code/golammostofas/merging-and-exporting-fine-tuned-llama-3-2)

# Reference

**Reference:** https://www.datacamp.com/tutorial/fine-tuning-llama-3-2