**Notebook created for CS685 Project - LLM Prompt Recovery**

This notebook is based on the Unsloth starter notebook for fine-tuning unsloth/llama-2-7b (https://github.com/unslothai/unsloth), adapted here to fine-tune unsloth/gemma-2b-it-bnb-4bit.

The following dependencies are installed:
- datasets: loads data from the Hugging Face Hub
- huggingface_hub: pushes models to the Hub
- bitsandbytes: quantizes language models (used here for 4-bit loading)
- accelerate: handles device placement and mixed-precision training
- trl: provides the SFTTrainer class for supervised fine-tuning
- peft: implements LoRA adapters
- xformers: provides memory-efficient attention kernels
- unsloth: patches Hugging Face models for faster fine-tuning

In [None]:
# Re-run this cell whenever the Colab session times out and the runtime resets
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.26" trl peft accelerate bitsandbytes huggingface_hub datasets

Collecting unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-by7q1i3a/unsloth_98ba6626f8ab47f2890e3584de068335
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-by7q1i3a/unsloth_98ba6626f8ab47f2890e3584de068335
  Resolved https://github.com/unslothai/unsloth.git to commit 47ffd39abd02338e8a5f226d0f529347fb7e5f89
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tyro (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Using cached tyro-0.8.4-py3-none-any.whl (102 kB)
Collecting datasets>=2.16.0 (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Using cached datasets-2.19.1-py3-none-any.whl (542 kB)
Collecting dil

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
import pandas as pd
import os
from datasets import Dataset
from datasets import load_dataset
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments

In [None]:
from transformers import EarlyStoppingCallback
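
EarlyStoppingCallback stops training once the monitored metric has failed to improve for a set number of consecutive evaluations. The following is a simplified pure-Python sketch of that patience logic, not the library's implementation; the function name `first_stop_index` and the loss values are made up for illustration. It also shows that a run with a single evaluation (one epoch, evaluated per epoch) can never reach a patience of 3.

```python
def first_stop_index(eval_losses, patience, threshold=0.0):
    """Return the index of the evaluation at which training would stop, or None.

    Simplified model of patience-based early stopping on a loss
    (lower is better). Hypothetical helper, for illustration only.
    """
    best = None
    bad_evals = 0  # consecutive evaluations without improvement
    for i, loss in enumerate(eval_losses):
        if best is None or (loss < best and abs(loss - best) > threshold):
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
        if bad_evals >= patience:
            return i
    return None


# Illustrative loss curves (not results from this notebook)
stop_at = first_stop_index([1.00, 0.90, 0.95, 0.93, 0.92], patience=3)
single_eval = first_stop_index([0.99], patience=3)
print(stop_at, single_eval)
```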

In [None]:
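# Default settings
max_seq_length = 2048  # maximum sequence length in tokens
dtype = None           # None lets Unsloth auto-detect (float16 on a T4, which lacks bfloat16)
load_in_4bit = True    # load the model with 4-bit quantization to reduce memory use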

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-2b-it-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)



==((====))==  Unsloth: Fast Gemma patching release 2024.5
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
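
To give a rough sense of why 4-bit loading matters on a 14.7 GB T4, the sketch below estimates the memory taken by the weights alone at two precisions. The parameter count is a round-number assumption, and the estimate ignores quantization constants, any layers kept in higher precision, activations, and optimizer state.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


ASSUMED_PARAMS = 2.5e9  # assumption: round figure for a ~2B-class model

fp16_gb = weight_memory_gb(ASSUMED_PARAMS, 16)
int4_gb = weight_memory_gb(ASSUMED_PARAMS, 4)
print(f"16-bit weights: {fp16_gb:.2f} GB, 4-bit weights: {int4_gb:.2f} GB")
```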


In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 8,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj","gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    # random_state = 3407,
    # use_rslora = False,
    # loftq_config = None,
)
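
With `r = 8` and `lora_alpha = 16`, each targeted projection gets two small trainable matrices, and the adapter update is scaled by `lora_alpha / r`. The sketch below works through the parameter count for one projection; the 2048 x 2048 shape is an illustrative assumption, not a value read from the model.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns A (r x d_in) and B (d_out x r), so it adds r * (d_in + d_out).
    """
    return r * (d_in + d_out)


r, lora_alpha = 8, 16
scaling = lora_alpha / r  # factor applied to the low-rank update B @ A

d_in, d_out = 2048, 2048  # assumption: illustrative projection shape
full_params = d_in * d_out
added_params = lora_param_count(d_in, d_out, r)
fraction = added_params / full_params

print(f"scaling={scaling}, added={added_params}, fraction={fraction:.4%}")
```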

In [None]:
# The data is put into a consistent format for fine-tuning. The model receives the original and rewritten text as input, and the rewrite prompt is the output label.
gemma_prompt = """Instruction:\nBelow, the "Original Text" passage has been rewritten/transformed/improved into "Rewritten Text" by Gemma and Llama large language models with a certain prompt/instruction. Your task is to carefully analyze the differences between the "Original Text" and "Rewritten Text", and try to infer the specific prompt or instruction that was likely given to the LLM to rewrite/transform/improve the text in this way.\n
### Original Text:
{}

### Rewritten Text:
{}

### Prompt:
{}
"""
def formatting_prompts_func(train_data):
    # With batched=True, each column arrives as a list of values
    rewrite_prompts = train_data["RewritePrompt"]
    original_texts  = train_data["OriginalText"]
    target_texts    = train_data["TargetText"]
    texts = []
    for original, target, prompt in zip(original_texts, target_texts, rewrite_prompts):
        # Format one example per row, using the per-row values rather than the whole columns
        text = gemma_prompt.format(original, target, prompt)
        texts.append(text)
    return { "text" : texts, }

# Data is loaded from the Hugging Face Hub and mapped to the format above
# The first 5,000 training samples and the first test sample are loaded
train_dataset = load_dataset("tuhinatripathi/llm-prompt-recovery", split = 'train[:5000]')
eval_dataset = load_dataset("tuhinatripathi/llm-prompt-recovery", split = 'test[:1]')
train_data_map = train_dataset.map(formatting_prompts_func, batched = True,)
eval_data_map = eval_dataset.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/1 [00:00<?, ? examples/s]
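
When `map` is called with `batched = True`, the function receives a dictionary whose values are lists (one entry per row), and it should return lists of the same length. The sketch below mimics this with a plain dictionary so the per-row pairing is easy to inspect; the shortened template, the `format_batch` name, and the toy rows are made up for illustration.

```python
# Shortened stand-in for the full prompt template (assumption, for illustration)
TEMPLATE = "### Original Text:\n{}\n\n### Rewritten Text:\n{}\n\n### Prompt:\n{}\n"


def format_batch(batch):
    """Build one formatted string per row from column-wise lists."""
    texts = []
    for original, target, prompt in zip(
        batch["OriginalText"], batch["TargetText"], batch["RewritePrompt"]
    ):
        texts.append(TEMPLATE.format(original, target, prompt))
    return {"text": texts}


# Toy batch in the column-wise layout a batched map call would pass in
toy_batch = {
    "OriginalText": ["The cat sat.", "It rained."],
    "TargetText": ["Yon feline did sit.", "Rain fell, relentless."],
    "RewritePrompt": ["Rewrite in archaic English.", "Make it dramatic."],
}

formatted = format_batch(toy_batch)
print(formatted["text"][0])
```

Each output string should contain only its own row's values, which is a quick sanity check worth running on a few real samples before training.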

In [None]:
trainer_args = TrainingArguments(
    push_to_hub=True,
    output_dir="gemma2b-5kdata",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    warmup_steps=int(0.06 * 2250),  # = 135 warmup steps
    learning_rate=2e-4,
    fp16=True,
    logging_steps=20,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    seed=3407,
    save_strategy="epoch",
    save_total_limit=3,
    load_best_model_at_end=True,
    evaluation_strategy="epoch"
)

# Clear the CUDA cache before starting training
torch.cuda.empty_cache()

output_dir = trainer_args.output_dir

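# Check for the most recent checkpoint (by modification time) and load its weights if a
# full pytorch_model.bin is present. Checkpoints of a LoRA-wrapped model typically hold only
# adapter weights, in which case nothing is loaded here.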
checkpoint_dir = None
if os.path.exists(output_dir) and os.listdir(output_dir):
    checkpoints = [os.path.join(output_dir, d) for d in os.listdir(output_dir) if d.startswith("checkpoint")]
    if checkpoints:
        checkpoint_dir = max(checkpoints, key=os.path.getmtime)
        model_path = os.path.join(checkpoint_dir, 'pytorch_model.bin')
        if os.path.exists(model_path):
            model.load_state_dict(torch.load(model_path))

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_data_map,
    eval_dataset=eval_data_map,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    packing=False,
    args=trainer_args,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]
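
The training arguments above imply a specific number of optimizer steps, which is useful for judging how long the warmup lasts. The sketch below works through the arithmetic under the assumption of a single GPU; it approximates how the step count is derived and is not taken from the trainer's logs.

```python
num_samples = 5000        # rows in the training split used here
per_device_batch = 1      # per_device_train_batch_size
grad_accum = 4            # gradient_accumulation_steps
num_epochs = 1            # num_train_epochs
num_devices = 1           # assumption: a single GPU

# Samples consumed per optimizer update
effective_batch = per_device_batch * grad_accum * num_devices

# Optimizer updates per epoch and in total
steps_per_epoch = num_samples // effective_batch
total_steps = steps_per_epoch * num_epochs

# Same expression as in the training arguments
warmup_steps = int(0.06 * 2250)
warmup_fraction = warmup_steps / total_steps

print(total_steps, warmup_steps, round(warmup_fraction, 3))
```

Under these assumptions the warmup covers about 10.8% of training rather than 6%, because the constant 2250 does not match the step count of this run.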

Map:   0%|          | 0/1 [00:00<?, ? examples/s]
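
The checkpoint lookup in the cell above selects a directory by modification time. Since the Hugging Face Trainer names checkpoint directories `checkpoint-<step>`, an alternative is to pick the highest step number. The sketch below shows that approach with the standard library only; `latest_checkpoint` is a hypothetical helper, and the demo directories are created in a temporary folder. The resulting path could then be passed as `trainer.train(resume_from_checkpoint=path)`, which also restores optimizer and scheduler state.

```python
import os
import re
import tempfile


def latest_checkpoint(output_dir):
    """Return the path of the checkpoint-<step> directory with the highest step, or None."""
    if not os.path.isdir(output_dir):
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path):
            step = int(match.group(1))
            if step > best_step:
                best_step, best_path = step, path
    return best_path


# Demo with throwaway directories (step numbers are arbitrary)
with tempfile.TemporaryDirectory() as demo_dir:
    for step in (500, 1250, 1000):
        os.mkdir(os.path.join(demo_dir, f"checkpoint-{step}"))
    demo_latest = os.path.basename(latest_checkpoint(demo_dir))

print(demo_latest)
```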

In [None]:
trainer_stats = trainer.train()
#trainer.push_to_hub("llm-prompt-recovery/gemma-2b-1")
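# Note: the first positional argument of Trainer.push_to_hub is the commit message;
# the target repository is derived from output_dir ("gemma2b-5kdata")
trainer.push_to_hub("tuhinatripathi/gemma2b-lpr-5kdata-it")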

Epoch  Training Loss  Validation Loss
1      1.036          0.993864




CommitInfo(commit_url='https://huggingface.co/tuhinatripathi/gemma2b-5kdata/commit/c789d18344f6b8161215700ee0885805319c158f', commit_message='tuhinatripathi/gemma2b-lpr-5kdata-it', commit_description='', oid='c789d18344f6b8161215700ee0885805319c158f', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
model_save_path = '/content/drive/MyDrive/SFT-gemma2b-5000data-it'
trainer.save_model(model_save_path)