To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth?tab=readme-ov-file#-installation-instructions).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save) (e.g., for llama.cpp).

In [1]:
%%capture
!pip install unsloth "xformers==0.0.28.post2"
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

<a name="Data"></a>
### Dataset Preparation Default
This section is based on the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), a filtered version of the original 52K-example [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). The cell below loads a dataset saved on disk at `/content/formatted_dataset`; you can replace this code section with your own data prep.

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).
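
To make the idea of completion-only training concrete, here is a minimal sketch of label masking. It is an illustration of the concept, not TRL's implementation: the helper name `mask_prompt_labels` and the token ids are made up, and it assumes you already know how many tokens belong to the prompt. Labels set to `-100` are skipped by PyTorch's cross-entropy loss by default, so only the response tokens contribute to the loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def mask_prompt_labels(input_ids, prompt_length):
    """Return labels that ignore the first `prompt_length` tokens (hypothetical helper)."""
    if not 0 <= prompt_length <= len(input_ids):
        raise ValueError("prompt_length must be between 0 and len(input_ids)")
    # Prompt positions get IGNORE_INDEX; response positions keep their token ids
    return [IGNORE_INDEX] * prompt_length + list(input_ids[prompt_length:])


# Made-up token ids: 4 prompt tokens followed by 3 response tokens
input_ids = [101, 7, 8, 9, 42, 43, 2]
labels = mask_prompt_labels(input_ids, prompt_length=4)
print(labels)  # [-100, -100, -100, -100, 42, 43, 2]
```

In practice the prompt length comes from tokenizing the prompt part of each example separately, which is what a completion-only data collator automates.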

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!
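
As a small sketch of what the note above means, the snippet below appends an end-of-sequence marker to each formatted training example. The `EOS_TOKEN` string and the shortened `PROMPT` template here are stand-ins chosen for illustration; in the notebook you would use `tokenizer.eos_token` and the full Alpaca template.

```python
# Stand-in value for illustration; in the notebook use tokenizer.eos_token instead
EOS_TOKEN = "<|end_of_text|>"

# Shortened Alpaca-style template (assumption: same three slots as the full template)
PROMPT = "### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n{}"


def format_example(instruction, input_text, output):
    """Fill the template and terminate the example with the EOS token."""
    return PROMPT.format(instruction, input_text, output) + EOS_TOKEN


text = format_example("Add the numbers.", "2 and 3", "5")
print(text.endswith(EOS_TOKEN))  # True
```

Because every training example ends with the EOS token, the model learns to emit it after a complete answer, which is what lets generation stop.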

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [None]:
from unsloth import FastLanguageModel
import torch
from datasets import load_from_disk

# Define the Alpaca-style prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Use tokenizer's EOS token
EOS_TOKEN = tokenizer.eos_token  # Make sure this is defined in your environment

# Load the dataset from disk
dataset = load_from_disk("/content/formatted_dataset")

# Check if the data is loaded correctly
print("Current columns in the dataset:", dataset.column_names)

# Formatting function to map "text" if separate instruction, input, and output columns exist
def formatting_prompts_func(examples):
    # Check if separate columns exist
    if "instruction" in dataset.column_names and "input" in dataset.column_names and "output" in dataset.column_names:
        instructions = examples["instruction"]
        inputs = examples["input"]
        outputs = examples["output"]
    else:
        # If only "text" field exists, assume it is already formatted
        return examples  # No need to reformat if already in Alpaca style

    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Format with Alpaca-style prompt and add EOS_TOKEN
        # (input_text avoids shadowing Python's built-in input())
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)

    return {"text": texts}

# View a sample before transformation if separate columns exist
if "instruction" in dataset.column_names:
    print("Sample of dataset before transformation:")
    print(dataset.select(range(3)))

# Map the formatting function to the dataset if needed
dataset = dataset.map(formatting_prompts_func, batched=True)

# View a sample of the dataset after transformation
print("\nSample of dataset after transformation:")
print(dataset.select(range(3))["text"])


**Load the Custom Dataset If It Is Already Formatted with the Alpaca Prompt Template**

In [None]:
from datasets import load_from_disk

# Load the dataset
dataset = load_from_disk("/content/formatted_dataset")

# Check column names
print("Columns in the dataset:", dataset.column_names)

# Preview some data (slicing returns a dict of column values for the first 3 rows)
print("Sample data:", dataset[:3])


In [None]:
from google.colab import drive
drive.mount('/content/drive')

**Three Fine-Tuning Approaches and Training Settings**

In [None]:
from unsloth import FastLanguageModel
import torch
from datasets import load_from_disk
from transformers import (
    TrainingArguments,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
)
from trl import SFTTrainer
import gc
import time
import matplotlib.pyplot as plt

# ======== Configuration ========
MODEL_NAME = "unsloth/Meta-Llama-3.1-8B"
DATASET_PATH = "/content/formatted_dataset"
MAX_SEQ_LENGTH = 1024
TARGET_MODULES = ["q_proj", "v_proj"]  # Reduced for memory efficiency

# ======== Load Dataset ========
try:
    dataset = load_from_disk(DATASET_PATH)
    assert "text" in dataset.column_names
    print(f"Loaded dataset with {len(dataset)} samples")
except Exception as e:
    raise RuntimeError(f"Dataset loading failed: {str(e)}")

# ======== Base Model Setup ========
def load_base_model():
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    )

    return FastLanguageModel.from_pretrained(
        model_name=MODEL_NAME,
        max_seq_length=MAX_SEQ_LENGTH,
        dtype=torch.float16,
        load_in_4bit=True,
        quantization_config=quantization_config,
        device_map="auto",
        use_cache=False,
        # FlashAttention-2 needs an Ampere or newer GPU, so it is left disabled for the Tesla T4.
        # On a supported GPU you can pass attn_implementation="flash_attention_2" here.
    )

# Define tokenizer outside the functions to make it globally accessible
model, tokenizer = load_base_model()  # Load the model and tokenizer here

# ======== Training Configurations ========
def get_training_args(method_name):
    return TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_ratio=0.1,
        max_steps=5,
        learning_rate=5e-5,
        fp16=True,
        logging_steps=1,
        optim="paged_adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        output_dir=f"./outputs/{method_name}",
        report_to="none",
        save_strategy="no",
        remove_unused_columns=False,
    )

# ======== LoRA Methods ========
def train_lora():
    # model, tokenizer = load_base_model()  # Remove this line as they are loaded globally now
    model_lora = FastLanguageModel.get_peft_model( # Renamed to avoid conflict with global 'model'
        model,
        r=16,
        target_modules=TARGET_MODULES,
        lora_alpha=32,
        lora_dropout=0.05,
        bias="none",
        use_gradient_checkpointing=True,
        use_rslora=False,
    )

    trainer = SFTTrainer(
        model=model_lora,  # Use model_lora here
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=MAX_SEQ_LENGTH,
        tokenizer=tokenizer,  # Now accessible globally
        args=get_training_args("lora"),
        packing=True,
    )

    start_time = time.time()
    trainer.train()
    return model_lora, trainer, time.time() - start_time # Return model_lora

def train_qlora():
    # model, tokenizer = load_base_model()  # Remove this line as they are loaded globally now
    model_qlora = FastLanguageModel.get_peft_model( # Renamed to avoid conflict with global 'model'
        model,
        r=8,  # Lower rank for 4-bit
        target_modules=TARGET_MODULES,
        lora_alpha=16,
        lora_dropout=0.1,
        bias="none",
        use_gradient_checkpointing=True,
        use_rslora=False,  # The 4-bit quantization comes from the base model, not from this call
    )

    trainer = SFTTrainer(
        model=model_qlora,  # Use model_qlora here
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=MAX_SEQ_LENGTH,
        tokenizer=tokenizer,  # Now accessible globally
        args=get_training_args("qlora"),
        packing=True,
    )

    start_time = time.time()
    trainer.train()
    return model_qlora, trainer, time.time() - start_time # Return model_qlora

# ======== BitFit Method ========
def train_bitfit():
    # Approximates BitFit through the PEFT wrapper: a rank-1 LoRA (the wrapper needs r > 0)
    # combined with bias="all", which marks the model's bias terms as trainable.
    model_bitfit = FastLanguageModel.get_peft_model(
        model,
        r=1,  # Smallest positive rank accepted by get_peft_model
        target_modules=TARGET_MODULES,
        lora_alpha=1,  # Keeps the LoRA scaling factor (alpha / r) at 1
        lora_dropout=0.0,  # No dropout
        bias="all",  # Update all bias parameters
        use_gradient_checkpointing=True,
        use_rslora=False,
    )

    trainer = SFTTrainer(
        model=model_bitfit,  # Use model_bitfit here
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=MAX_SEQ_LENGTH,
        tokenizer=tokenizer,  # Now accessible globally
        args=get_training_args("bitfit"),
        packing=True,
    )

    start_time = time.time()
    trainer.train()
    return model_bitfit, trainer, time.time() - start_time # Return model_bitfit

# ======== Experiment Runner ========
results = {}

def run_experiment(method_name, trainer_func):
    # Memory cleanup
    gc.collect()
    torch.cuda.empty_cache()

    torch.cuda.reset_peak_memory_stats()
    start_time = time.time()

    model_exp = None  # Initialize model to None # Renamed to avoid conflict
    trainer = None  # Initialize trainer to None

    try:
        model_exp, trainer, duration = trainer_func()  # Use model_exp to store returned model
        peak_mem = torch.cuda.max_memory_reserved()

        # The final log entry is the training summary (it has no "loss" key),
        # so take the most recent step entry that does report a loss.
        losses = [log["loss"] for log in trainer.state.log_history if "loss" in log]
        loss = losses[-1] if losses else float("nan")  # NaN leaves an empty bar in the plot

        results[method_name] = {
            "time": duration,
            "memory": peak_mem,
            "params": sum(p.numel() for p in model_exp.parameters() if p.requires_grad), # Use model_exp
            "loss": loss,  # Assign the handled loss value
        }
    finally:
        # Export only if training produced a model; otherwise let the original error surface
        if model_exp is not None:
            model_exp.save_pretrained_gguf(f"./outputs/{method_name}_gguf", tokenizer=tokenizer)
            print(f"✅ Saved {method_name} model in GGUF format")

        # Drop references and free GPU memory before the next experiment
        del model_exp, trainer
        gc.collect()
        torch.cuda.empty_cache()

# ======== Execute Experiments ========
print("=== Running LoRA ===")
run_experiment("LoRA", train_lora)

print("\n=== Running QLoRA ===")
run_experiment("QLoRA", train_qlora)

print("\n=== Running BitFit ===")
run_experiment("BitFit", train_bitfit)

# ======== Results Visualization ========
def plot_results(results):
    fig, axs = plt.subplots(2, 2, figsize=(18, 12))

    # Training Time
    axs[0,0].bar(results.keys(), [v["time"] for v in results.values()], color='skyblue')
    axs[0,0].set_title("Training Time Comparison", fontsize=14)
    axs[0,0].set_ylabel("Seconds", fontsize=12)

    # Memory Usage
    axs[0,1].bar(results.keys(), [v["memory"]/1e9 for v in results.values()], color='lightgreen')
    axs[0,1].set_title("Peak GPU Memory Usage", fontsize=14)
    axs[0,1].set_ylabel("GB", fontsize=12)

    # Trainable Parameters
    axs[1,0].bar(results.keys(), [v["params"]/1e6 for v in results.values()], color='salmon')
    axs[1,0].set_title("Trainable Parameters", fontsize=14)
    axs[1,0].set_ylabel("Millions", fontsize=12)

    # Training Loss
    axs[1,1].bar(results.keys(), [v["loss"] for v in results.values()], color='gold')
    axs[1,1].set_title("Final Training Loss", fontsize=14)
    axs[1,1].set_ylabel("Loss", fontsize=12)

    plt.tight_layout()
    plt.show()

plot_results(results)

# ======== Print Numerical Results ========
print("\n=== Comparative Results ===")
for method, metrics in results.items():
    print(f"\n{method}:")
    print(f"  Training Time: {metrics['time']:.2f}s")
    print(f"  Peak Memory: {metrics['memory']/1e9:.2f}GB")
    print(f"  Trainable Params: {metrics['params']/1e6:.2f}M")
    print(f"  Final Loss: {metrics['loss']:.4f}")

**The rest of the notebook contains demos; modify them as needed**

**Direct inference with ground-truth data**

In [None]:
import torch
import json
from transformers import AutoTokenizer
from unsloth import FastLanguageModel

# Define the Alpaca-style prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Load the tokenizer (it is the same for all three fine-tuned models)
tokenizer = AutoTokenizer.from_pretrained("./outputs/BitFit_gguf")

# Define model paths
model_paths = {
    "BitFit": "./outputs/BitFit_gguf",
    "LoRA": "./outputs/LoRA_gguf",
    "QLoRA": "./outputs/QLoRA_gguf"
}

# Load ground truth data from JSON
with open("ground_truth.json", "r") as f:
    ground_truth_data = json.load(f)

# Function to generate responses from all models
def generate_responses(question):
    inputs = tokenizer([alpaca_prompt.format(question, "", "")], return_tensors="pt").to("cuda")
    model_responses = {}

    for model_name, model_path in model_paths.items():
        # Load model: from_pretrained returns a (model, tokenizer) pair and places the model on the GPU
        model, _ = FastLanguageModel.from_pretrained(model_path)
        FastLanguageModel.for_inference(model)  # Enable Unsloth optimization

        # Generate output
        outputs = model.generate(**inputs, max_new_tokens=200, use_cache=True)
        response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

        # Store response
        model_responses[model_name] = response

        # Free GPU memory after inference
        del model
        torch.cuda.empty_cache()

    return model_responses

# Perform inference on all ground truth prompts
results = []
for entry in ground_truth_data:
    question = entry["prompt"]
    ground_truth = entry["response"]

    model_responses = generate_responses(question)

    results.append({
        "prompt": question,
        "ground_truth": ground_truth,
        "model_responses": model_responses
    })

# Save results to a JSON file
output_file = "model_comparison_results.json"
with open(output_file, "w") as f:
    json.dump(results, f, indent=4)

print(f"✅ Results saved in {output_file}")


**Evaluation Samples**
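
The evaluation script in this section reads `evaluation_results_organized.json`, and its expected layout can be inferred from the keys that `evaluate_models` accesses. The sketch below shows a file of that shape; every value in it is a placeholder invented for illustration, not real data.

```python
import json

# Placeholder content; only the key names mirror what evaluate_models reads
example_data = {
    "groups": [
        {
            "group_name": "example_group",
            "questions": [
                {
                    "question": "Placeholder question?",
                    "groundtruth": "Placeholder reference answer.",
                    "responses": {
                        "LoRA": "Placeholder answer from the LoRA model.",
                        "QLoRA": "Placeholder answer from the QLoRA model.",
                        "BitFit": "Placeholder answer from the BitFit model.",
                    },
                }
            ],
        }
    ]
}

# Serialize and parse again, as a file written to disk and re-read would be
round_tripped = json.loads(json.dumps(example_data, indent=4))
first_question = round_tripped["groups"][0]["questions"][0]
print(sorted(first_question.keys()))  # ['groundtruth', 'question', 'responses']
```

Note that this layout differs from the flat list written to `model_comparison_results.json` by the inference cell, so those results need to be regrouped before evaluation.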

In [None]:
!pip install rouge-score
!pip install sacrebleu bert-score pandas seaborn matplotlib
!pip install numpy matplotlib seaborn editdistance scikit-learn
!pip install nltk
import nltk
nltk.download('wordnet')

In [None]:
import json
import logging
import editdistance
from nltk.translate import meteor_score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
import bert_score

# Setup logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

# Function to calculate BLEU score
def calculate_bleu(ground_truth, response):
    ground_truth_tokens = ground_truth.lower().split()
    response_tokens = response.lower().split()

    # BLEU with NLTK's default uniform weights over 1- to 4-grams, plus smoothing
    smoothing = SmoothingFunction().method4
    bleu_score = sentence_bleu([ground_truth_tokens], response_tokens, smoothing_function=smoothing)
    return bleu_score

# Function to calculate METEOR score
def calculate_meteor(ground_truth, response):
    hypothesis_tokens = response.lower().split()
    reference_tokens = ground_truth.lower().split()
    return meteor_score.single_meteor_score(reference_tokens, hypothesis_tokens)

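# Function to approximate TER (Translation Edit Rate) as word-level edit distance per reference word
def calculate_ter(ground_truth, response):
    gt_words = ground_truth.split()
    response_words = response.split()
    return editdistance.eval(gt_words, response_words) / max(len(gt_words), 1)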

# Function to calculate ROUGE score
def calculate_rouge(ground_truth, response):
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(ground_truth, response)
    return scores["rouge1"].fmeasure, scores["rouge2"].fmeasure, scores["rougeL"].fmeasure

# Function to calculate BERTScore
def calculate_bertscore(ground_truth, response):
    P, R, F1 = bert_score.score([response], [ground_truth], lang="en")
    return F1.item()  # Return the F1 score (semantic similarity)

# Function to calculate token-level F1 Score (Quantitative Metric)
def calculate_f1(ground_truth, response):
    from collections import Counter  # Local import keeps this function self-contained

    gt_tokens = ground_truth.lower().split()
    response_tokens = response.lower().split()

    if not gt_tokens or not response_tokens:
        return 0.0  # Return 0 if either is empty

    # Count tokens shared by both texts (multiset intersection); no padding is needed
    common = Counter(gt_tokens) & Counter(response_tokens)
    num_common = sum(common.values())
    if num_common == 0:
        return 0.0

    # Precision over response tokens, recall over ground-truth tokens
    precision = num_common / len(response_tokens)
    recall = num_common / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

# Function to calculate CER (Character Error Rate)
def calculate_cer(ground_truth, response):
    if not ground_truth or not response:
        return 1.0 if ground_truth or response else 0.0
    return editdistance.eval(ground_truth, response) / max(len(ground_truth), 1)

# Evaluation function
def evaluate_models(data):
    results = {}

    for group in data["groups"]:
        group_name = group["group_name"]
        group_results = []

        for entry in group["questions"]:
            question = entry["question"]
            ground_truth = entry["groundtruth"]
            responses = entry["responses"]

            # Evaluate responses here...
            model_results = {}
            for model_name, response in responses.items():
                bleu = calculate_bleu(ground_truth, response)
                meteor = calculate_meteor(ground_truth, response)
                ter = calculate_ter(ground_truth, response)
                rouge1, rouge2, rougeL = calculate_rouge(ground_truth, response)
                bertscore = calculate_bertscore(ground_truth, response)
                f1 = calculate_f1(ground_truth, response)
                cer = calculate_cer(ground_truth, response)

                # Store the model evaluation results
                model_results[model_name] = {
                    "f1_score": f1,
                    "cer": cer,
                    "bleu_score": bleu,
                    "meteor_score": meteor,
                    "ter_score": ter,
                    "rouge1_score": rouge1,
                    "rouge2_score": rouge2,
                    "rougeL_score": rougeL,
                    "bertscore": bertscore,
                }

            # Append results for this question
            group_results.append({
                "question": question,
                "groundtruth": ground_truth,
                "results": model_results
            })

        results[group_name] = group_results

    return results

# Main script
if __name__ == "__main__":
    # Load evaluation dataset
    with open("evaluation_results_organized.json", 'r', encoding='utf-8') as file:
        data = json.load(file)

    logging.info("Loaded JSON data successfully!")

    # Perform evaluation
    evaluation_results = evaluate_models(data)

    # Save results to JSON
    with open("evaluation_results_final.json", 'w', encoding='utf-8') as file:
        json.dump(evaluation_results, file, indent=4)

    logging.info("Evaluation complete! Results saved to 'evaluation_results_final.json'")



In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "What defines arterial hypertension according to the National High Blood Pressure Education Programme (NHBPEP) for below 13 years?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 200, use_cache = True)
tokenizer.batch_decode(outputs)

In [None]:
import torch
from unsloth import FastLanguageModel
from transformers import AutoTokenizer

# ======== Load Models ========
BASE_MODEL_NAME = "unsloth/Meta-Llama-3.1-8B"
MODEL_PATHS = {
    "Base Model": None,  # Base model will be loaded from Hugging Face
    "BitFit": "./outputs/BitFit_gguf",
    "LoRA": "./outputs/LoRA_gguf",
    "QLoRA": "./outputs/QLoRA_gguf"
}

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME)

# Define the Alpaca-style prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Input Prompt
instruction = "How should resistant hypertension be managed in children?"
input_text = ""

# Tokenize input
def tokenize_input(instruction, input_text):
    prompt = alpaca_prompt.format(instruction, input_text, "")
    return tokenizer([prompt], return_tensors="pt").to("cuda")

# Generate response function
def generate_response(model, inputs):
    outputs = model.generate(**inputs, max_new_tokens=200, use_cache=True)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

# Run inference for each model
responses = {}

for model_name, model_path in MODEL_PATHS.items():
    print(f"\n=== Running inference with {model_name} ===")

    # Load the 4-bit base model; fine-tuned variants add their saved weights on top
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=BASE_MODEL_NAME,
        max_seq_length=1024,
        dtype=torch.float16,
        load_in_4bit=True,
        device_map="auto"
    )
    if model_path is not None:
        # Expects adapter weights (LoRA/QLoRA/BitFit) to be present in model_path
        model.load_adapter(model_path)

    # ✅ Ensure model is correctly referenced
    FastLanguageModel.for_inference(model)  # Enable fast inference

    # Prepare input
    inputs = tokenize_input(instruction, input_text)

    # Generate output
    response = generate_response(model, inputs)

    # Store response
    responses[model_name] = response

    # Print response
    print(response)

    # Clean up memory
    del model
    torch.cuda.empty_cache()

# Print all responses together
print("\n=== Final Comparative Results ===")
for model_name, response in responses.items():
    print(f"\n{model_name} Response:\n{response}")


**Default Unsloth Inference Scripts**

In [None]:
# Assuming you've already loaded the model and tokenizer from your fine-tuning process
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the trained model and tokenizer
model = AutoModelForCausalLM.from_pretrained('/content/lora_model', device_map="auto")  # Your saved output directory; device_map places the model on the GPU
tokenizer = AutoTokenizer.from_pretrained('/content/lora_model')  # Your saved tokenizer

# For faster inference with the custom model (if applicable)
#FastLanguageModel.for_inference(model)  # Enable faster inference (optional if supported)

# Set the model to evaluation mode
model.eval()

# Define a sample instruction and input from your dataset for testing
instruction = "A 10-year-old girl is presented for a blood pressure check in your practice. How do you proceed with the measurement?"
input_text = ""  # No input in this case or you can provide specific input
output_text = ""  # Leave this blank for generation

# Format the prompt (stored in `prompt` so the reusable `alpaca_prompt` template is not overwritten)
prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:""".format(instruction, input_text)

# Tokenize the formatted prompt
inputs = tokenizer(
    [prompt], return_tensors="pt", truncation=True, padding=True, max_length=512
).to("cuda")  # Assuming you're using GPU (cuda), otherwise use .to("cpu")

# Generate the response with sampling enabled
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,  # Required for temperature, top_p, and top_k to take effect
    temperature=0.7,  # Add some randomness to the generation
    top_p=0.9,  # Use top-p sampling for better diversity
    top_k=50,  # Limit the sampling pool for diversity
    use_cache=True
)

# Decode the generated output
generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Print the result
print(f"Instruction: {instruction}")
print(f"Input: {input_text}")
print(f"Generated Response: {generated_text[0]}")

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the full output!

In [None]:
# alpaca_prompt
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "A 10-year-old girl is presented for a blood pressure check in your practice. How do you proceed with the measurement?", # instruction
        "First, ensure the girl rests and relaxes for 5 minutes. Then, use an age- and size-appropriate cuff, and measure the blood pressure on the upper arm. If unusual values occur, measure blood pressure in both arms and in the legs.", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
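
To give a feel for why an adapter-only save is small, the sketch below counts LoRA parameters for the `q_proj` and `v_proj` targets used earlier with `r=16`. The layer shapes and layer count are assumptions about Llama-3.1-8B (hidden size 4096, 32 layers, `v_proj` output width 1024), and `lora_param_count` is a helper written for this example. Each adapted layer adds two matrices of shapes `r x in` and `out x r`.

```python
def lora_param_count(rank, shapes, num_layers):
    """Count LoRA parameters: each adapted (in, out) layer adds rank * (in + out) values."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers


# Assumed (in_features, out_features) per projection for Llama-3.1-8B
shapes = {"q_proj": (4096, 4096), "v_proj": (4096, 1024)}

n_params = lora_param_count(rank=16, shapes=shapes.values(), num_layers=32)
print(n_params)  # 6815744

# Rough file size if each value is stored in 2 bytes (float16)
print(f"{n_params * 2 / 1e6:.1f} MB")  # 13.6 MB
```

Under these assumptions the adapter holds a few million values, compared with roughly eight billion in the base model, which is why the base weights must be available separately when loading it.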

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.
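
As a rough sketch of what "merging" a LoRA adapter means, the snippet below folds the low-rank update into a base weight matrix using the standard LoRA formula `W' = W + (alpha / r) * B @ A`. It uses tiny hand-written matrices and plain Python lists; the helper names are invented for this example and this is not Unsloth's merge code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A for an (out x in) weight matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]  # 2 x 2 base weight
lora_a = [[1.0, 2.0]]              # r x in, with rank r = 1
lora_b = [[0.5], [-0.5]]           # out x r

merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)  # [[2.0, 2.0], [-1.0, -1.0]]
```

After merging, the model no longer needs the adapter at inference time, which is why a merged checkpoint can be served by tools that expect ordinary weights.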

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and save to `q8_0` by default. All other methods, such as `q4_k_m`, are also available. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
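
To illustrate the idea behind a block-wise 8-bit format such as `q8_0`, here is a simplified sketch: values are grouped into blocks of 32, and each block stores one scale plus 32 signed 8-bit integers. This is a teaching example under those assumptions, not llama.cpp's code; for instance, it keeps the scale as a Python float rather than a 16-bit value.

```python
BLOCK_SIZE = 32  # assumed block length for q8_0-style quantization


def quantize_block(values):
    """Map a block of floats to (scale, int8-range integers)."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 1.0  # largest magnitude maps to +/-127
    quants = [round(v / scale) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from a quantized block."""
    return [scale * q for q in quants]


# Example block with alternating signs and growing magnitude
values = [(-1) ** i * i / 31 for i in range(BLOCK_SIZE)]
scale, quants = quantize_block(values)
restored = dequantize_block(scale, quants)

max_error = max(abs(a - b) for a, b in zip(values, restored))
print(f"scale={scale:.6f}, max abs error={max_error:.6f}")
```

The rounding error per value is at most half the block scale, which is why 8-bit blocks stay close to the original weights while lower-bit formats such as `q4_k_m` trade more accuracy for smaller files.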

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )