<a href="https://colab.research.google.com/github/ritikjain51/llm-finetuning/blob/main/Fine_Tuning_LLAMA2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!nvidia-smi

Fri May 24 06:10:29 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   67C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
pip install accelerate peft bitsandbytes transformers trl

## Import Required Packages

In [3]:
import os
import torch
import re
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline, logging
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

Llama2 uses the Chat Model

Prompt Template
```
<s>[INST] <<SYS>>
System Prompt
<<SYS>>

User prompt [/INST] Model Answer </s>

```

Sample Prompt

```
<s>[INST] Me gradué hace poco de la carrera de medicina ¿Me podrías aconsejar para conseguir rápidamente un puesto de trabajo? [/INST] Esto vale tanto para médicos como para cualquier otra profesión tras finalizar los estudios aniversarios y mi consejo sería preguntar a cuántas personas haya conocido mejor. En este caso, mi primera opción sería hablar con otros profesionales médicos, echar currículos en hospitales y cualquier centro de salud. En paralelo, trabajaría por mejorar mi marca personal como médico mediante un blog o formas digitales de comunicación como los vídeos. Y, para mejorar las posibilidades de encontrar trabajo, también participaría en congresos y encuentros para conseguir más contactos. Y, además de todo lo anterior, seguiría estudiando para presentarme a las oposiciones y ejercer la medicina en el sector público de mi país. </s>


```

### Convert Dataset to LLAMA2 Prompt Template

In [63]:
# Load the dataset
dataset = load_dataset('timdettmers/openassistant-guanaco')

# Shuffle the dataset and slice it
dataset = dataset['train'].shuffle(seed=42).select(range(1000))

# Define a function to transform the data
def transform_conversation(example):
    conversation_text = example['text']
    segments = conversation_text.split('###')

    reformatted_segments = []

    # Iterate over pairs of segments
    for i in range(1, len(segments) - 1, 2):
        human_text = segments[i].strip().replace('Human:', '').strip()

        # Check if there is a corresponding assistant segment before processing
        if i + 1 < len(segments):
            assistant_text = segments[i+1].strip().replace('Assistant:', '').strip()

            # Apply the new template
            reformatted_segments.append(f'<s>[INST] {human_text} [/INST] {assistant_text} </s>')
        else:
            # Handle the case where there is no corresponding assistant segment
            reformatted_segments.append(f'<s>[INST] {human_text} [/INST] </s>')

    return {'text': ''.join(reformatted_segments)}


# Apply the transformation
transformed_dataset = dataset.map(transform_conversation)


Repo card metadata block was not found. Setting CardData to empty.


In [5]:
print(transformed_dataset.data[0][0])

<s>[INST] Me gradué hace poco de la carrera de medicina ¿Me podrías aconsejar para conseguir rápidamente un puesto de trabajo? [/INST] Esto vale tanto para médicos como para cualquier otra profesión tras finalizar los estudios aniversarios y mi consejo sería preguntar a cuántas personas haya conocido mejor. En este caso, mi primera opción sería hablar con otros profesionales médicos, echar currículos en hospitales y cualquier centro de salud. En paralelo, trabajaría por mejorar mi marca personal como médico mediante un blog o formas digitales de comunicación como los vídeos. Y, para mejorar las posibilidades de encontrar trabajo, también participaría en congresos y encuentros para conseguir más contactos. Y, además de todo lo anterior, seguiría estudiando para presentarme a las oposiciones y ejercer la medicina en el sector público de mi país. </s>


### Intiaizing Parameters

- Model Parameters
- Model Quantization (BitsAndBytes)
- PEFT Parameters (LoRA)
- Training Parameters
- Trainer Paramaters

In [51]:
model_name = "NousResearch/Llama-2-7b-chat-hf"

new_model_name = "Llama-2-chat-finetuned"


############################
# QLoRA Parameters
# LoRA Attention Dimension (Rank)
lora_r = 64

# LoRA scaling factor
lora_alpha = 16

# Dropout Probability
lora_dropout = 0.1



###################################
# BitsAndBytes Parameters

# Activate 4-bit precisoin for base model loading
use_4bit = True

# Compute dtype for 4bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4, nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base model
use_nested_quant = False

In [52]:
############ Training Arguments ################
# Output Directory
output_dir = "./results"

# Number of Training Epochs
num_train_epochs = 1

# Enable fp16/bf16 training
fp16 = False
bf16 = False

# Batch size per GPU for Training
per_device_train_batch_size = 4

# Batch size for Evaluation
per_device_eval_batch_size = 4

# Number of update Steps to accumulate gredient
gradient_accumulation_steps = 1

# Enable gredient checkpointing
gredient_checkpointing = True

# Maximum gradient normal (Gradient Clipping)
max_grad_norm = 0.3

# Initial learning Rate (AdamW Optimizer)
learning_rate = 2e-4

# Weight Decay to apply on all layers
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning Rate Schedule
lr_scheduler_type = "cosine"

# Number of training steps
max_steps = -1

# Ratio of steps for linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequence into batches with same length
group_by_length = True

# Save checkpoint every X updated steps
save_steps = 0

# Log every X step
logging_steps=25



In [53]:
############# SFT Parameters ###################
### Supervised Fine-tuning

# Maximum sequence length to use
max_seq_length = True

# Pack multiple short examples in same input sequence
packing = False

# Load the entire model
device_map = {"": 0}


## Load the model, dataset and Start Fine-tuning

In [54]:
## Load the Dataset
dataset = transformed_dataset
# dataset = load_dataset("mlabonne/guanaco-llama2-1k", split="train")

# Load Tokenizer and model with QLoRA Configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit = use_4bit,
    bnb_4bit_quant_type = bnb_4bit_quant_type,
    bnb_4bit_compute_dtype = compute_dtype,
    bnb_4bit_use_double_quant = use_nested_quant
)

In [55]:
bnb_config

BitsAndBytesConfig {
  "_load_in_4bit": true,
  "_load_in_8bit": false,
  "bnb_4bit_compute_dtype": "float16",
  "bnb_4bit_quant_storage": "uint8",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": false,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": true,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}

In [56]:
# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)


In [57]:
# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



In [58]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code = True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [59]:
## Load LoRA Configuration

peft_config = LoraConfig(
    lora_alpha = lora_alpha,
    lora_dropout = lora_dropout,
    r = lora_r,
    bias = "none",
    task_type="CAUSAL_LM"
)

peft_config

LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type='CAUSAL_LM', inference_mode=False, r=64, target_modules=None, lora_alpha=16, lora_dropout=0.1, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, use_dora=False, layer_replication=None)

In [60]:
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs = num_train_epochs,
    per_device_train_batch_size = per_device_train_batch_size,
    gradient_accumulation_steps = gradient_accumulation_steps,
    optim = optim,
    save_steps = save_steps,
    logging_steps = logging_steps,
    learning_rate = learning_rate,
    weight_decay = weight_decay,
    fp16 = fp16,
    bf16=bf16,
    max_grad_norm = max_grad_norm,
    max_steps = max_steps,
    warmup_ratio = warmup_ratio,
    group_by_length = group_by_length,
    lr_scheduler_type = lr_scheduler_type,
    report_to = "tensorboard"

)

In [65]:
## Trainer Parameters

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    peft_config = peft_config,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = training_arguments,
    packing=packing
)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [66]:
## Train model

trainer.train()

Step,Training Loss
25,0.0
50,0.0
75,0.0
100,0.0
125,0.0
150,0.0
175,0.0
200,0.0
225,0.0
250,0.0


TrainOutput(global_step=250, training_loss=0.0, metrics={'train_runtime': 224.3644, 'train_samples_per_second': 4.457, 'train_steps_per_second': 1.114, 'total_flos': 39845388288000.0, 'train_loss': 0.0, 'epoch': 1.0})

In [67]:
trainer.model.save_pretrained(new_model_name)



In [68]:
%load_ext tensorboard
%tensorboard --logdir results/runs

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


<IPython.core.display.Javascript object>

### Inference using pipeline

In [74]:
prompt = "How large language models can help humans?"

pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer,
                max_length=400)

prompt_template = "[INST]{prompt}[\INST]"

result = pipe(prompt_template.format(prompt = prompt))

print(result[0]["generated_text"])

[INST]How large language models can help humans?[\INST]  Large language models, such as transformer-based models like BERT, RoBERTa, and XLNet, have revolutionized the field of natural language processing (NLP) in recent years. everybody is talking about them, and they have achieved state-of-the-art results on a wide range of NLP tasks. But how do these models actually help humans? Here are some ways in which large language models can help humans:

1. Improved language understanding: Large language models can help humans by improving our understanding of language. By learning to predict the next word in a sequence of text, these models can help us better understand the context and meaning of a sentence or paragraph.
2. Language translation: Large language models can be trained on large datasets of text in multiple languages, allowing them to learn to translate between languages. This can help humans by enabling us to communicate with people who speak different languages.
3. Text genera

In [80]:
# Empty VRAM

del model
del trainer
del tokenizer
del pipe
del dataset
del transformed_dataset
import gc
gc.collect()
gc.collect()

0

Accessing New Llama2 Model

In [81]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map = device_map
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



In [82]:
peft_model = PeftModel.from_pretrained(base_model, new_model_name)
peft_model = peft_model.merge_and_unload()


In [83]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_cache=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [84]:
pipe = pipeline("text-generation", model=peft_model, tokenizer=tokenizer, max_length=300)

In [85]:
result = pipe(prompt_template.format(prompt=prompt))

print(result[0]["generated_text"])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


[INST]How large language models can help humans?[\INST]  Large language models, such as transformer-based models like BERT, RoBERTa, and XLNet, have revolutionized the field of natural language processing (NLP) in recent years. Here are some ways in which large language models can help humans:

1. Improved language understanding: Large language models can analyze and understand language in ways that were previously impossible. They can recognize subtle nuances in language, such as the context and tone of a sentence, and use this understanding to improve language translation, question answering, and other NLP tasks.
2. Enhanced language generation: Large language models can generate high-quality text that is coherent and natural-sounding. This can be useful for a variety of applications, such as chatbots, language translation, and content generation.
3. Better text summarization: Large language models can summarize long documents or articles into shorter summaries that capture the most 