# Kaggle Optimization Note

This notebook is optimized for execution on Kaggle.

**Benefits of using Kaggle:**
*   **Data Persistence:** Kaggle lets you save your work, datasets, and trained models so that they persist between sessions. This is a significant advantage over Google Colab's free tier, where data is often lost when the runtime disconnects.
*   **GPU Resources:** Kaggle provides free access to GPUs, including the T4, for up to 30 hours per week, which is suitable for training models like the one in this notebook.
*   **Integrated Datasets:** Easily access and use datasets hosted on Kaggle.

While Google Colab might offer faster runtime environments in some cases, the data persistence and generous GPU quota on Kaggle make it a suitable platform for training tasks where saving progress and models is important. This notebook leverages these Kaggle features.
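As a minimal sketch of the persistence point, the helper below picks a directory for checkpoints and exports. It assumes `/kaggle/working` is the output directory on Kaggle (the path used later in this notebook) and falls back to a local folder elsewhere. The name `resolve_output_dir` and the `outputs` fallback are hypothetical, introduced only for illustration.

```python
from pathlib import Path

def resolve_output_dir(kaggle_dir="/kaggle/working", fallback="outputs"):
    """Return the Kaggle working dir if it exists, else a local fallback dir."""
    candidate = Path(kaggle_dir)
    if candidate.is_dir():
        return candidate
    # Not on Kaggle (assumption): create and use a local folder instead.
    local = Path(fallback)
    local.mkdir(parents=True, exist_ok=True)
    return local

out_dir = resolve_output_dir()
print(f"Artifacts will be written under: {out_dir}")
```

Routing every save path through one helper like this keeps the notebook runnable both on Kaggle and on a local machine.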

## Training

In [None]:
%%capture
!pip install unsloth

In [None]:
# Manage your secrets from the "Add-ons" menu in the top navigation of the editor
import os  # needed for os.environ below

from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

# Set your HF token & username as environment variables
# This step is optional for most models. However, if you are training large models on an A100
# or accessing gated models on Hugging Face that require authentication, you need this.
os.environ["HF_TOKEN"] = user_secrets.get_secret("HUGGING_FACE_HUB_TOKEN")
# Replace with your username
os.environ["HF_USERNAME"] = "Your_userName"
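The cell above only works on Kaggle, because `kaggle_secrets` is not importable elsewhere. A minimal sketch of a fallback, assuming the token may also be supplied through an `HF_TOKEN` environment variable, is shown below. The helper name `get_hf_token` is hypothetical and not part of any library.

```python
import os

def get_hf_token(secret_name="HUGGING_FACE_HUB_TOKEN", env_var="HF_TOKEN"):
    """Return the token from Kaggle secrets if possible, else from the environment, else None."""
    try:
        from kaggle_secrets import UserSecretsClient  # only importable on Kaggle
        return UserSecretsClient().get_secret(secret_name)
    except Exception:
        # Not on Kaggle, or the secret is not attached to this notebook.
        return os.environ.get(env_var)

token = get_hf_token()
print("HF token found" if token else "No HF token set; only public models can be downloaded")
```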

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
# load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    # You can select any of the models below:
    # "unsloth/Qwen2.5-0.5B", "unsloth/Qwen2.5-1.5B", "unsloth/Qwen2.5-3B"
    # "unsloth/Qwen2.5-14B",  "unsloth/Qwen2.5-32B",  "unsloth/Qwen2.5-72B",
    # and also all Instruct, Math, and Coding versions!
    model_name = "unsloth/Meta-Llama-3.1-8B", # You can choose any model from the lists above that works on both Colab and Kaggle.
    max_seq_length = max_seq_length,
    dtype = dtype,
    # load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # LoRA rank. Suggested values: 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"], # Modules to apply LoRA
    lora_alpha = 16, # Scaling factor for LoRA
    lora_dropout = 0, # Dropout rate for LoRA. 0 is optimized for no dropout
    bias = "none",    # Bias type. "none" is optimized
    use_gradient_checkpointing = "unsloth", # Use gradient checkpointing for memory efficiency. "unsloth" is optimized
    random_state = 3407, # Random seed for reproducibility
    use_rslora = False,  # Enable rank-stabilized LoRA (optional)
    loftq_config = None, # Configuration for LoftQ quantization (optional)
)
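To get a feel for how the rank `r` affects adapter size, the sketch below counts LoRA parameters with the standard formula `r * (in_features + out_features)` per adapted linear layer. The layer shapes are assumptions based on the commonly published Llama-3.1-8B configuration (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 layers). Check `model.config` before relying on them.

```python
# Assumed Llama-3.1-8B shapes; verify against model.config.
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024
NUM_LAYERS = 32

# (in_features, out_features) for each module named in target_modules.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(rank, shapes=MODULE_SHAPES, num_layers=NUM_LAYERS):
    """Each adapter adds A (rank x in) and B (out x rank): rank * (in + out) weights."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

for r in (8, 16, 32, 64):
    print(f"r={r}: {lora_param_count(r):,} trainable LoRA parameters")
```

Under these assumed shapes, `r = 16` corresponds to 41,943,040 trainable parameters, and the count scales linearly with the rank.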


In [None]:
from datasets import load_dataset

# Define the Alpaca-style prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Retrieve the End-of-Sequence (EOS) token from the tokenizer
# The EOS token is crucial as it marks the end of a sequence, helping the model
# understand where the input or output ends. This is especially important for
# tasks like text generation to ensure proper formatting and termination.
EOS_TOKEN = tokenizer.eos_token

# Function to format the dataset examples into the Alpaca-style prompt
def formatting_prompts_func(examples):
    texts = []
    for example in examples["data"]:  # Iterate through the list of dictionaries
        instruction = example["instruction"]  # Extract the instruction
        input_text = example["input"]         # Extract the input context
        output = example["output"]            # Extract the expected output
        # Format the text using the Alpaca prompt template and append the EOS token
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)  # Add the formatted text to the list
    return {"text": texts}  # Return the formatted texts as a dictionary
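The formatting step can be checked without loading a model. The sketch below applies the same template to a toy batch. The EOS string `</s>` is a stand-in (the real value comes from `tokenizer.eos_token`), and `format_batch` is a hypothetical name that mirrors the logic of `formatting_prompts_func`.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

def format_batch(examples, eos_token):
    """Turn a batch of {'data': [record, ...]} into {'text': [prompt, ...]}."""
    texts = []
    for example in examples["data"]:
        text = ALPACA_PROMPT.format(
            example["instruction"], example["input"], example["output"]
        )
        texts.append(text + eos_token)  # EOS tells the model where to stop
    return {"text": texts}

# Toy batch in the shape that dataset.map(..., batched=True) passes in.
batch = {"data": [{"instruction": "Add the numbers", "input": "2 and 3", "output": "5"}]}
formatted = format_batch(batch, eos_token="</s>")
print(formatted["text"][0])
```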


In [None]:
import json
from datasets import Dataset

# Load the dataset from a JSON file (the context manager closes the file handle)
with open('/kaggle/input/formatted_train_data.json') as f:
    new_dataset = json.load(f)

# Convert list to a Hugging Face Dataset if needed
if isinstance(new_dataset, list):
    new_dataset = Dataset.from_dict({"data": new_dataset})

# Format the dataset using the Alpaca-style prompt function
dataset = new_dataset.map(formatting_prompts_func, batched=True)

# Save the processed Hugging Face dataset to disk for future use
dataset.save_to_disk("/kaggle/working/formatted_dataset")
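Because the formatting function indexes `instruction`, `input`, and `output` directly, a single malformed record raises a `KeyError` partway through `map`. A minimal sketch of a pre-check, assuming the JSON file holds a list of dictionaries, is shown below. The helper name `find_invalid_records` is hypothetical.

```python
REQUIRED_KEYS = ("instruction", "input", "output")

def find_invalid_records(records):
    """Return (index, missing_keys) pairs for records that lack required fields."""
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            # A non-dict entry is treated as missing every field.
            problems.append((i, list(REQUIRED_KEYS)))
            continue
        missing = [k for k in REQUIRED_KEYS if k not in rec]
        if missing:
            problems.append((i, missing))
    return problems

# Toy data: the second record has no "output" field.
sample = [
    {"instruction": "Summarize", "input": "Some text", "output": "A summary"},
    {"instruction": "Summarize", "input": "Other text"},
]
print(find_invalid_records(sample))
```

Running this on the loaded list before calling `Dataset.from_dict` makes it easy to report or drop bad rows first.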

In [None]:
dataset[0]

In [None]:
from trl import SFTTrainer  # Import SFTTrainer for supervised fine-tuning
from transformers import TrainingArguments  # Import training arguments
from unsloth import is_bfloat16_supported  # Check if bfloat16 is supported

trainer = SFTTrainer(
    model = model,  # Pre-trained model
    tokenizer = tokenizer,  # Tokenizer for text processing
    train_dataset = dataset,  # Training dataset
    dataset_text_field = "text",  # Field containing text data
    max_seq_length = max_seq_length,  # Max sequence length for inputs
    dataset_num_proc = 2,  # Number of processes for dataset preprocessing
    packing = False,  # Disable sequence packing for simplicity
    args = TrainingArguments(  # Training configuration
        per_device_train_batch_size = 2,  # Batch size per device
        gradient_accumulation_steps = 4,  # Steps to accumulate gradients
        warmup_steps = 5,  # Warmup steps for learning rate
        num_train_epochs = 1,  # Number of training epochs (ignored while max_steps is set)
        max_steps = 60,  # Max training steps; a positive value overrides num_train_epochs
        learning_rate = 2e-4,  # Learning rate
        fp16 = not is_bfloat16_supported(),  # Use fp16 if bfloat16 unsupported
        bf16 = is_bfloat16_supported(),  # Use bfloat16 if supported
        logging_steps = 1,  # Log every step
        optim = "adamw_8bit",  # Optimizer with 8-bit precision
        weight_decay = 0.01,  # Weight decay for regularization
        lr_scheduler_type = "linear",  # Linear learning rate scheduler
        seed = 3407,  # Random seed for reproducibility
        output_dir = "outputs",  # Directory to save outputs
        report_to = "none",  # Disable reporting (e.g., WandB)
    ),
)
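The batch settings above determine how much data the run sees. The sketch below works through the arithmetic, assuming a single GPU. The helper name `training_budget` is hypothetical.

```python
def training_budget(per_device_batch, grad_accum, max_steps, num_devices=1):
    """Return (effective batch size, total sequences seen) for a run."""
    # Gradients from grad_accum micro-batches are summed before each optimizer step.
    effective_batch = per_device_batch * grad_accum * num_devices
    return effective_batch, effective_batch * max_steps

effective, seen = training_budget(per_device_batch=2, grad_accum=4, max_steps=60)
print(f"Effective batch size: {effective}")
print(f"Sequences seen in 60 steps: {seen}")
```

With the values in this notebook, each optimizer step covers 8 sequences, so 60 steps touch 480 training examples. If the dataset is larger than that, the run stops before completing one epoch.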

In [None]:
trainer_stats = trainer.train()

## Testing the Model / Inference

### Normal Test

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Identify key points missing from the given answer", # instruction
        '''"Answer": "The documentation of health impacts from ambient air pollution has been challenging due to exposure assessment complexities, confounding factors like smoking, infections, and allergies, and the intricacies of studying large populations. Recently, sophisticated global studies have conclusively shown that air pollution significantly affects human health. Respiratory symptoms, notably complicating chronic bronchitis, are the predominant adverse effects across different pollution types. Air pollution is linked to elevated risks of heart and lung disease mortality, even at sub-toxic levels. While there's limited evidence of air pollution as a primary cause of cancer, certain emission sources may contribute to cancer risk, especially in exceptional cases.",
        "key_points": [
            "Exposure assessment complexities and confounding factors make studying health impacts of air pollution challenging.",
            "Global studies have shown that air pollution significantly affects human health.",
            "Respiratory symptoms, such as chronic bronchitis, are predominant adverse effects of air pollution.",
            "Air pollution is linked to elevated risks of heart and lung disease mortality, even at sub-toxic levels.",
            "Mucosal irritation, including bronchitis, nasal irritation, and conjunctivitis, occurs with high pollution exposure.",
            "Eye irritation can be severe due to particulates or high concentrations of photochemical oxidants and aldehydes.",
            "Certain emission sources may contribute to cancer risk, though evidence of air pollution as a primary cause of cancer is limited."
        ]''', # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)
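`tokenizer.batch_decode(outputs)` returns the full prompt followed by the generated text. A minimal sketch for pulling out only the answer is shown below. It assumes the `### Response:` marker appears once, as in the template above, and the EOS string used in the example is a stand-in for `tokenizer.eos_token`. The helper name `extract_response` is hypothetical.

```python
def extract_response(decoded, marker="### Response:", eos_token=None):
    """Return the text after the response marker, without the EOS token."""
    _, found, tail = decoded.partition(marker)
    if not found:
        # Marker missing: return the whole string rather than guessing.
        return decoded.strip()
    if eos_token:
        tail = tail.split(eos_token, 1)[0]
    return tail.strip()

# Stand-in for one element of tokenizer.batch_decode(outputs).
decoded = "### Instruction:\nAdd\n\n### Input:\n2 and 3\n\n### Response:\n5<|end_of_text|>"
print(extract_response(decoded, eos_token="<|end_of_text|>"))
```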

### Stream Test

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Identify key points missing from the given answer look for all the points carefully and print nothing else and think twice before you answer and give the most appropriate answer", # instruction
        ''' "Answer": "The International Atomic Energy Agency's Marine Environmental Laboratory (IAEA-MEL) has played a vital role in ensuring the quality of marine radioactivity measurements for nearly three decades. Through numerous global and regional intercomparison exercises and the supply of reference materials, Analytical Quality Control Services (AQCS) has emerged as a critical component of the IAEA's mission. Recognizing the increasing importance of environmental data in economic, ecological, and legal decision-making, the IAEA-MEL is evaluating ways to further enhance data quality. Proposed enhancements include a comprehensive approach to total quality management for analytical laboratories, encompassing quality assurance programs and manuals, intercomparison exercises with effective feedback mechanisms, and laboratory accreditation.",
        "key_points": [
            "IAEA-MEL has ensured the quality of marine radioactivity measurements for nearly three decades.",
            "Analytical Quality Control Services (AQCS) is a critical component of the IAEA's mission.",
            "IAEA-MEL organized 41 intercomparison exercises for radionuclides and produced 35 reference materials.",
            "Environmental data is increasingly important in economic, ecological, and legal decision-making.",
            "Proposed enhancements include total quality management for analytical laboratories.",
            "Certified reference material production according to ISO standards is part of the proposed enhancements.",
            "Quality assessment and control training is included in the initiatives to improve data quality."
        ]''', # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 750)

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

### To download the weights for future use, run the code below and download the zip file

In [None]:
import shutil

shutil.make_archive("/kaggle/working/qwen2.5_march_lora", 'zip', "/kaggle/working/lora_model")
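Before downloading, it can help to confirm that the archive contains what you expect. The sketch below zips a folder with `shutil.make_archive` and lists the archive contents with `zipfile`. It runs against a temporary directory with a placeholder file so that it works anywhere. The helper name `archive_and_list` and the file used here are illustrative only.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

def archive_and_list(source_dir, archive_base):
    """Zip source_dir to archive_base + '.zip' and return (path, sorted member names)."""
    archive_path = shutil.make_archive(str(archive_base), "zip", str(source_dir))
    with zipfile.ZipFile(archive_path) as zf:
        return archive_path, sorted(zf.namelist())

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "lora_model"
    src.mkdir()
    (src / "adapter_config.json").write_text("{}")  # placeholder adapter file
    path, names = archive_and_list(src, Path(tmp) / "lora_export")
    print(path)
    print(names)
```

On Kaggle, the same helper could be pointed at `/kaggle/working/lora_model`. Keeping the archive outside the folder being zipped avoids including the zip in itself.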
