# Focused Learning: Performance-Efficiency Tradeoffs in PEFT Methods

## Learning Objectives
- Understand the fundamental tradeoffs between performance and efficiency in PEFT methods
- Explore how to measure and quantify these tradeoffs systematically
- Implement experiments to evaluate different aspects of the efficiency-performance spectrum
- Develop a framework for selecting optimal PEFT configurations based on specific requirements

## Paper Reference
This notebook explores concepts from the paper "Parameter Efficient Fine Tuning: A Comprehensive Analysis Across Applications" (arXiv:2404.13506v2).

Specifically, we focus on Section 4, which discusses the evaluation considerations and tradeoffs in PEFT methods:

> "PEFT has emerged as a compelling approach for tailoring large pre-trained models to specific tasks while minimizing computational demands. Our review found that leveraging PEFT across diverse applications presents several key challenges that require careful consideration, as practitioners consider applying PEFT for their applications:" (Section 4, Page 6)

The paper highlights the key tradeoff: "A) Balancing Efficiency and Performance: A core challenge lies in striking a delicate balance between reducing trainable parameters and maintaining robust performance." (Section 4, Page 6)

## 1. Introduction to Performance-Efficiency Tradeoffs

Parameter-Efficient Fine-Tuning (PEFT) methods aim to reduce the number of trainable parameters while maintaining performance comparable to full fine-tuning. However, there is an inherent tradeoff between efficiency and performance that must be carefully navigated.

In this notebook, we'll explore the multi-dimensional nature of this tradeoff, considering not just parameter counts but also:
- Computational efficiency (training time, memory usage)
- Task performance (accuracy, F1 score, etc.)
- Generalization capabilities
- Data efficiency requirements
- Inference speed and resource needs

We'll implement experiments to quantify these tradeoffs and develop a framework for selecting the optimal PEFT configuration based on specific requirements and constraints.

In [None]:
# Install necessary libraries
!pip install torch transformers datasets peft matplotlib numpy pandas seaborn scikit-learn memory_profiler psutil tqdm

In [None]:
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm
import time
import os
import psutil
from memory_profiler import profile
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding
)
from datasets import load_dataset
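from dataclasses import dataclass

from peft import (
    get_peft_model as _peft_get_peft_model,
    LoraConfig,
    PrefixTuningConfig,
    PromptEncoderConfig,
    TaskType,
    PeftType,
    PeftConfig,
    PeftModel,
    AdaLoraConfig
)


# To the editor's knowledge, PEFT does not provide a BitFit config class, so
# this notebook uses a small local stand-in (not a PEFT API) and applies
# bias-only tuning by hand.
@dataclass
class BitFitConfig:
    bias_term: str = "all"
    task_type: TaskType = TaskType.SEQ_CLS


def get_peft_model(model, peft_config):
    """Dispatch to PEFT, except for the local BitFitConfig stand-in."""
    if isinstance(peft_config, BitFitConfig):
        # BitFit: train only the bias terms plus the task-specific classifier head
        for name, param in model.named_parameters():
            param.requires_grad = name.endswith("bias") or "classifier" in name
        return model
    return _peft_get_peft_model(model, peft_config)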

# Set the seed for reproducibility
seed = 42
torch.manual_seed(seed)
np.random.seed(seed)

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## 2. Defining the Efficiency-Performance Dimensions

Before we can effectively evaluate the tradeoffs, we need to define the key dimensions that characterize the efficiency-performance spectrum of PEFT methods. Based on the paper, these include:

In [None]:
# Define key dimensions for evaluating efficiency-performance tradeoffs
efficiency_performance_dimensions = {
    "Parameter Efficiency": {
        "metrics": [
            "Trainable parameter count",
            "Parameter percentage (relative to full model)",
            "Parameter reduction ratio"
        ],
        "importance": "Primary measure of PEFT efficiency; lower is generally better",
        "measurement": "Count parameters that require gradients"
    },
    "Computational Efficiency": {
        "metrics": [
            "Training time (seconds per epoch)",
            "Memory usage (peak GPU/RAM)",
            "FLOPs per training step"
        ],
        "importance": "Critical for resource-constrained environments",
        "measurement": "Time training runs, monitor memory usage, count operations"
    },
    "Task Performance": {
        "metrics": [
            "Accuracy",
            "F1 score",
            "Domain-specific metrics (BLEU, ROUGE, etc.)",
            "Performance gap vs. full fine-tuning"
        ],
        "importance": "The core measure of model effectiveness on the target task",
        "measurement": "Evaluate on validation/test sets"
    },
    "Generalization": {
        "metrics": [
            "Performance on out-of-distribution data",
            "Transfer learning capability",
            "Robustness to perturbations"
        ],
        "importance": "Measures how well the model generalizes beyond training data",
        "measurement": "Test on varied datasets, evaluate with perturbations"
    },
    "Data Efficiency": {
        "metrics": [
            "Performance vs. training data size",
            "Few-shot capabilities",
            "Sample efficiency curves"
        ],
        "importance": "Critical in domains with limited data availability",
        "measurement": "Train with varying dataset sizes, measure performance"
    },
    "Inference Efficiency": {
        "metrics": [
            "Inference throughput (samples/second)",
            "Inference memory usage",
            "Deployment size"
        ],
        "importance": "Important for production deployment scenarios",
        "measurement": "Time inference runs, monitor resource usage"
    }
}

# Display the dimensions and their metrics
print("Efficiency-Performance Dimensions for PEFT Evaluation")
print("===================================================")
for dimension, details in efficiency_performance_dimensions.items():
    print(f"\n{dimension}")
    print("-" * len(dimension))
    print(f"Importance: {details['importance']}")
    print("Metrics:")
    for metric in details['metrics']:
        print(f"  - {metric}")
    print(f"Measurement: {details['measurement']}")

## 3. Implementing Measurement Functions

Now, let's implement functions to measure these efficiency and performance metrics for different PEFT methods.

In [None]:
def measure_parameter_efficiency(model, full_model=None):
    """Measure parameter efficiency metrics.

    The reference size is the total parameter count of `full_model` (or of
    `model` itself when no reference is given). Counting only parameters with
    requires_grad on `full_model` is unreliable here: get_peft_model may freeze
    and modify the base model in place, so after wrapping, its trainable count
    can equal that of the PEFT model and the percentage would read 100%.
    """
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())

    # For adapters injected in place, this total includes the (small) adapter weights
    reference = full_model if full_model is not None else model
    reference_params = sum(p.numel() for p in reference.parameters())

    param_percentage = (trainable_params / reference_params) * 100
    param_reduction = 1 - (trainable_params / reference_params)

    return {
        "trainable_params": trainable_params,
        "total_params": total_params,
        "param_percentage": param_percentage,
        "param_reduction": param_reduction
    }

def measure_computational_efficiency(training_func, model, dataset, batch_size=8, num_epochs=1):
    """Measure computational efficiency metrics during training"""
    # Start time and memory tracking
    start_time = time.time()
    start_memory = psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024  # MB
    
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
        start_gpu_memory = torch.cuda.memory_allocated() / 1024 / 1024  # MB
    else:
        start_gpu_memory = 0  # keeps the GPU delta below defined on CPU-only machines
    
    # Run the training function
    training_func(model, dataset, batch_size, num_epochs)
    
    # End time and memory tracking
    end_time = time.time()
    end_memory = psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024  # MB
    
    if torch.cuda.is_available():
        end_gpu_memory = torch.cuda.memory_allocated() / 1024 / 1024  # MB
        peak_gpu_memory = torch.cuda.max_memory_allocated() / 1024 / 1024  # MB
    else:
        end_gpu_memory = 0
        peak_gpu_memory = 0
    
    # Calculate metrics
    training_time = end_time - start_time
    memory_usage = end_memory - start_memory
    gpu_memory_usage = end_gpu_memory - start_gpu_memory
    
    # Calculate samples per second
    total_samples = len(dataset) * num_epochs
    samples_per_second = total_samples / training_time
    
    return {
        "training_time_seconds": training_time,
        "training_time_per_epoch": training_time / num_epochs,
        "samples_per_second": samples_per_second,
        "memory_usage_mb": memory_usage,
        "peak_gpu_memory_mb": peak_gpu_memory,
        "gpu_memory_usage_mb": gpu_memory_usage
    }

def measure_task_performance(model, eval_dataset, metric_func):
    """Measure task performance metrics"""
    # Run evaluation
    results = metric_func(model, eval_dataset)
    
    # If the results are a single value, wrap it in a dict
    if not isinstance(results, dict):
        results = {"performance": results}
    
    return results

def measure_inference_efficiency(model, dataset, batch_size=1):
    """Measure inference efficiency metrics"""
    model.eval()  # Set model to evaluation mode
    
    # Create a dataloader
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)
    
    # Start timing and memory tracking
    start_time = time.time()
    
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
    
    # Run inference
    with torch.no_grad():
        for batch in dataloader:
            # Process batch
            if isinstance(batch, dict):
                outputs = model(**{k: v.to(device) for k, v in batch.items() if isinstance(v, torch.Tensor)})
            else:
                outputs = model(batch[0].to(device))
    
    # End timing and memory tracking
    end_time = time.time()
    
    if torch.cuda.is_available():
        peak_gpu_memory = torch.cuda.max_memory_allocated() / 1024 / 1024  # MB
    else:
        peak_gpu_memory = 0
    
    # Calculate metrics
    inference_time = end_time - start_time
    samples_per_second = len(dataset) / inference_time
    
    return {
        "inference_time_seconds": inference_time,
        "samples_per_second": samples_per_second,
        "peak_gpu_memory_mb": peak_gpu_memory
    }

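Single wall-clock measurements like the ones above can be noisy, and on a GPU the clock may be read before queued kernels have finished. The following is a minimal sketch of a more robust timer. The helper name `time_callable` and its `sync` parameter are assumptions introduced for illustration and are not part of the notebook; `sync` could be `torch.cuda.synchronize` when a GPU is in use.

```python
import statistics
import time


def time_callable(fn, warmup=1, repeats=5, sync=None):
    """Return the median wall-clock time of fn() in seconds.

    `sync` is an optional zero-argument callable (for example
    torch.cuda.synchronize) invoked around each clock read so that
    asynchronous GPU work is included in the measurement.
    """
    for _ in range(warmup):  # warm-up runs are executed but not timed
        fn()
    timings = []
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Usage with a cheap CPU-only workload
median_seconds = time_callable(lambda: sum(range(10_000)), warmup=1, repeats=3)
print(f"median: {median_seconds:.6f}s")
```

Taking the median over a few repeats is one reasonable choice; for full training runs a single repeat may be all that is affordable, in which case the synchronization step is the more important part.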
## 4. Experiment Setup: Testing Tradeoffs on a Classification Task

Let's set up an experiment to evaluate the efficiency-performance tradeoffs of different PEFT methods on a text classification task.

In [None]:
# Load a dataset for our experiment (SST-2 sentiment classification)
dataset = load_dataset("glue", "sst2")
print(dataset)

In [None]:
# Preprocess the dataset
def preprocess_function(examples, tokenizer, max_length=128):
    return tokenizer(examples["sentence"], truncation=True, padding="max_length", max_length=max_length)

# Define the model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize the dataset
tokenized_datasets = dataset.map(
    lambda examples: preprocess_function(examples, tokenizer),
    batched=True
)
tokenized_datasets = tokenized_datasets.remove_columns(["sentence", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

# Define a training function for our experiments
def train_model(model, train_dataset, batch_size=16, num_epochs=3):
    """Training function for efficiency measurement"""
    # Define training arguments
    training_args = TrainingArguments(
        output_dir="./results",
        learning_rate=2e-5,
        per_device_train_batch_size=batch_size,
        num_train_epochs=num_epochs,
        weight_decay=0.01,
        logging_steps=max(1, len(train_dataset) // batch_size // 4),  # never 0 on tiny subsets
        save_strategy="no",
        report_to="none"
    )
    
    # Define trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset
    )
    
    # Train the model
    trainer.train()
    
    return trainer

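# Define an evaluation function
def compute_metrics(eval_pred):
    """Compute accuracy from the logits and labels that Trainer passes in."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, predictions)}

def evaluate_model(model, eval_dataset):
    """Evaluation function for performance measurement"""
    # Define evaluation arguments
    eval_args = TrainingArguments(
        output_dir="./eval_results",
        per_device_eval_batch_size=32,
        report_to="none"
    )

    # Define evaluator; without compute_metrics, evaluate() reports the loss
    # and runtime statistics but no "eval_accuracy"
    evaluator = Trainer(
        model=model,
        args=eval_args,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics
    )

    # Evaluate the model
    results = evaluator.evaluate()

    return results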

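The subset sizes and batch sizes used in this notebook translate into very few optimizer updates, which is worth keeping in mind when judging results or choosing step-based settings such as `logging_steps`. A small worked calculation, assuming a single device, no gradient accumulation, and that the last partial batch is kept (the helper name `count_optimizer_steps` is hypothetical):

```python
import math


def count_optimizer_steps(num_examples, batch_size, num_epochs):
    """Optimizer updates for a single-device run that keeps the last partial batch."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return batches_per_epoch * num_epochs


# Settings that appear in this notebook's experiments
for n, b, e in [(100, 8, 1), (500, 8, 3), (1000, 8, 3)]:
    print(f"{n} examples, batch {b}, {e} epoch(s): {count_optimizer_steps(n, b, e)} updates")
```

With only a handful of updates, differences between methods at the smallest settings may reflect noise more than the methods themselves.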
## 5. PEFT Methods Configuration

Now, let's define the PEFT methods we want to compare and their configurations.

In [None]:
def get_peft_configurations():
    """Define PEFT configurations to compare"""
    peft_configs = {
        "LoRA-r4": LoraConfig(
            r=4,
            lora_alpha=16,
            target_modules=["query", "key", "value"],
            lora_dropout=0.1,
            bias="none",
            task_type=TaskType.SEQ_CLS
        ),
        "LoRA-r8": LoraConfig(
            r=8,
            lora_alpha=16,
            target_modules=["query", "key", "value"],
            lora_dropout=0.1,
            bias="none",
            task_type=TaskType.SEQ_CLS
        ),
        "LoRA-r16": LoraConfig(
            r=16,
            lora_alpha=32,
            target_modules=["query", "key", "value"],
            lora_dropout=0.1,
            bias="none",
            task_type=TaskType.SEQ_CLS
        ),
        "LoRA-r32": LoraConfig(
            r=32,
            lora_alpha=64,
            target_modules=["query", "key", "value"],
            lora_dropout=0.1,
            bias="none",
            task_type=TaskType.SEQ_CLS
        ),
        "PrefixTuning-len16": PrefixTuningConfig(
            task_type=TaskType.SEQ_CLS,
            num_virtual_tokens=16  # prefix length; PrefixTuningConfig has no prefix_length argument
        ),
        "PrefixTuning-len32": PrefixTuningConfig(
            task_type=TaskType.SEQ_CLS,
            num_virtual_tokens=32
        ),
        # PromptEncoderConfig is PEFT's P-tuning configuration (a learned prompt
        # encoder); plain prompt tuning would use PromptTuningConfig instead.
        # The "PromptTuning-*" keys are kept as labels for these runs.
        "PromptTuning-len16": PromptEncoderConfig(
            task_type=TaskType.SEQ_CLS,
            num_virtual_tokens=16,
            encoder_hidden_size=128
        ),
        "PromptTuning-len32": PromptEncoderConfig(
            task_type=TaskType.SEQ_CLS,
            num_virtual_tokens=32,
            encoder_hidden_size=128
        ),
        "BitFit": BitFitConfig(
            bias_term="all",
            task_type=TaskType.SEQ_CLS
        ),
        # Depending on the installed PEFT version, AdaLoraConfig may also require
        # total_step (the planned number of training steps).
        "AdaLoRA": AdaLoraConfig(
            r=8,
            lora_alpha=16,
            target_modules=["query", "key", "value"],
            lora_dropout=0.1,
            task_type=TaskType.SEQ_CLS
        ),
        "LoRA-AllLayers": LoraConfig(
            r=8,
            lora_alpha=16,
            target_modules=["query", "key", "value", "dense", "output.dense"],
            lora_dropout=0.1,
            bias="none",
            task_type=TaskType.SEQ_CLS
        )
    }

    return peft_configs

# Get PEFT configurations
peft_configs = get_peft_configurations()

# Print PEFT configuration details
print("PEFT Configurations for Comparison")
print("=================================")
for method, config in peft_configs.items():
    print(f"\n{method}")
    print("-" * len(method))
    print(f"Configuration Type: {type(config).__name__}")

    # Print attributes based on configuration type
    if isinstance(config, LoraConfig) or isinstance(config, AdaLoraConfig):
        print(f"Rank (r): {config.r}")
        print(f"Alpha: {config.lora_alpha}")
        print(f"Target Modules: {config.target_modules}")
        print(f"Dropout: {config.lora_dropout}")
    elif isinstance(config, PrefixTuningConfig):
        print(f"Prefix Length (virtual tokens): {config.num_virtual_tokens}")
    elif isinstance(config, PromptEncoderConfig):
        print(f"Virtual Tokens: {config.num_virtual_tokens}")
        print(f"Encoder Hidden Size: {config.encoder_hidden_size}")
    elif isinstance(config, BitFitConfig):
        print(f"Bias Term: {config.bias_term}")

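The LoRA configurations above differ mainly in rank, and the number of parameters LoRA adds can be worked out by hand: each adapted linear layer gains a rank-by-input matrix and an output-by-rank matrix. The sketch below assumes BERT-base dimensions (12 encoder layers, hidden size 768, with query, key, and value each a 768-to-768 linear layer); the helper name `lora_param_count` is hypothetical.

```python
def lora_param_count(in_features, out_features, rank, num_adapted_modules):
    """Parameters added by LoRA: A is (rank x in), B is (out x rank) per module."""
    per_module = rank * in_features + out_features * rank
    return per_module * num_adapted_modules


# Assumptions: BERT-base has 12 encoder layers with hidden size 768, and
# query/key/value are each a 768 -> 768 linear layer (3 modules per layer).
hidden_size, num_layers, modules_per_layer = 768, 12, 3
for rank in (4, 8, 16, 32):
    added = lora_param_count(
        hidden_size, hidden_size, rank, num_layers * modules_per_layer
    )
    print(f"r={rank:>2}: {added:,} LoRA parameters")
```

The count grows linearly with rank. The trainable total reported for a sequence-classification model will typically be somewhat higher than this figure, since the classification head is usually kept trainable as well.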
## 6. Conducting the Tradeoff Experiments

Now, let's conduct our experiments to measure the efficiency-performance tradeoffs for each PEFT method.

In [None]:
def run_tradeoff_experiments(batch_size=16, num_epochs=1, subset_size=1000):
    """Run experiments to measure efficiency-performance tradeoffs"""
    results = []
    
    # Create subset of data for faster experimentation
    train_subset = tokenized_datasets["train"].select(range(subset_size))
    eval_subset = tokenized_datasets["validation"].select(range(subset_size // 2))
    
    # First, measure full fine-tuning as a baseline
    print("\nMeasuring Full Fine-tuning (Baseline)...")
    base_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2).to(device)
    
    # Measure parameter efficiency
    param_metrics = measure_parameter_efficiency(base_model)
    
    # Measure computational efficiency
    comp_metrics = measure_computational_efficiency(train_model, base_model, train_subset, batch_size, num_epochs)
    
    # Measure task performance
    perf_metrics = measure_task_performance(base_model, eval_subset, evaluate_model)
    
    # Measure inference efficiency
    inf_metrics = measure_inference_efficiency(base_model, eval_subset)
    
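    # Combine all metrics. comp_metrics and inf_metrics share the keys
    # "samples_per_second" and "peak_gpu_memory_mb", so the inference values
    # overwrite the training ones; keep the training values under explicit keys.
    full_metrics = {
        "method": "Full Fine-tuning",
        **param_metrics,
        **comp_metrics,
        **perf_metrics,
        **inf_metrics,
        "train_samples_per_second": comp_metrics["samples_per_second"],
        "train_peak_gpu_memory_mb": comp_metrics["peak_gpu_memory_mb"]
    }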
    
    results.append(full_metrics)
    
    # Now measure each PEFT method
    for method_name, peft_config in peft_configs.items():
        print(f"\nMeasuring {method_name}...")
        
        # Create base model
        base_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2).to(device)
        
        # Create PEFT model
        peft_model = get_peft_model(base_model, peft_config)
        
        # Measure parameter efficiency
        param_metrics = measure_parameter_efficiency(peft_model, base_model)
        
        # Measure computational efficiency
        comp_metrics = measure_computational_efficiency(train_model, peft_model, train_subset, batch_size, num_epochs)
        
        # Measure task performance
        perf_metrics = measure_task_performance(peft_model, eval_subset, evaluate_model)
        
        # Measure inference efficiency
        inf_metrics = measure_inference_efficiency(peft_model, eval_subset)
        
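        # Combine all metrics. comp_metrics and inf_metrics share the keys
        # "samples_per_second" and "peak_gpu_memory_mb", so the inference values
        # overwrite the training ones; keep the training values under explicit keys.
        peft_metrics = {
            "method": method_name,
            **param_metrics,
            **comp_metrics,
            **perf_metrics,
            **inf_metrics,
            "train_samples_per_second": comp_metrics["samples_per_second"],
            "train_peak_gpu_memory_mb": comp_metrics["peak_gpu_memory_mb"]
        }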
        
        results.append(peft_metrics)
    
    return results

# Run experiments with a small subset and fewer epochs for demonstration
experiment_results = run_tradeoff_experiments(batch_size=8, num_epochs=1, subset_size=100)

In [None]:
# Convert results to DataFrame for analysis
results_df = pd.DataFrame(experiment_results)

# Display key metrics
display_cols = [
    "method", "param_percentage", "trainable_params", 
    "training_time_seconds", "eval_accuracy", 
    "peak_gpu_memory_mb", "samples_per_second"
]

# Display formatted results
display_df = results_df[display_cols].copy()
display_df["trainable_params"] = display_df["trainable_params"].apply(lambda x: f"{x:,}")
display_df["param_percentage"] = display_df["param_percentage"].apply(lambda x: f"{x:.4f}%")
display_df["training_time_seconds"] = display_df["training_time_seconds"].apply(lambda x: f"{x:.2f}s")
display_df["eval_accuracy"] = display_df["eval_accuracy"].apply(lambda x: f"{x:.4f}")
display_df["peak_gpu_memory_mb"] = display_df["peak_gpu_memory_mb"].apply(lambda x: f"{x:.1f} MB")
display_df["samples_per_second"] = display_df["samples_per_second"].apply(lambda x: f"{x:.1f}")

print("\nEfficiency-Performance Tradeoff Results")
print("=======================================\n")
print(display_df.to_string(index=False))

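Section 2 lists "Performance gap vs. full fine-tuning" as a task-performance metric, and the results table makes it straightforward to derive. A minimal sketch follows, assuming each result row is a dict with `method` and `eval_accuracy` entries; the helper name `performance_gap` is hypothetical and the numbers are placeholders, not measured results.

```python
def performance_gap(rows, baseline_method="Full Fine-tuning", metric="eval_accuracy"):
    """Return {method: (absolute_gap, relative_gap_percent)} vs. the baseline row."""
    baseline = next(r[metric] for r in rows if r["method"] == baseline_method)
    gaps = {}
    for row in rows:
        if row["method"] == baseline_method:
            continue
        absolute = row[metric] - baseline
        gaps[row["method"]] = (absolute, 100.0 * absolute / baseline)
    return gaps


# Placeholder numbers for illustration only; these are not measured results
example_rows = [
    {"method": "Full Fine-tuning", "eval_accuracy": 0.90},
    {"method": "LoRA-r8", "eval_accuracy": 0.88},
    {"method": "BitFit", "eval_accuracy": 0.85},
]
for method, (abs_gap, rel_gap) in performance_gap(example_rows).items():
    print(f"{method}: {abs_gap:+.3f} absolute, {rel_gap:+.2f}% relative")
```

The same function could be applied to `experiment_results` directly, since that list holds one dict per method.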
## 7. Visualizing Efficiency-Performance Tradeoffs

Let's create visualizations to better understand the tradeoffs between different dimensions.

In [None]:
# Set the style for our visualizations
sns.set_style("whitegrid")
plt.rcParams.update({'font.size': 12})

# 1. Parameter Efficiency vs. Performance
plt.figure(figsize=(12, 8))
scatter = plt.scatter(
    results_df["param_percentage"],
    results_df["eval_accuracy"],
    s=results_df["training_time_seconds"] * 5,  # Size represents training time
    alpha=0.7,
    c=np.arange(len(results_df)),  # Color by index
    cmap="viridis"
)

# Add labels for each point
for i, row in results_df.iterrows():
    plt.annotate(
        row["method"],
        (row["param_percentage"], row["eval_accuracy"]),
        textcoords="offset points",
        xytext=(0, 10),
        ha='center',
        fontsize=10
    )

plt.title("Parameter Efficiency vs. Performance Tradeoff")
plt.xlabel("Parameter Percentage (%)")
plt.ylabel("Evaluation Accuracy")
plt.xscale("log")  # Log scale for better visualization of parameter percentages
plt.grid(True, linestyle='--', alpha=0.7)

# Add a legend for bubble size (training time); marker area is seconds * 5
sizes = [5, 10, 20]
labels = ["5 s", "10 s", "20 s"]
for size, label in zip(sizes, labels):
    plt.scatter([], [], s=size*5, alpha=0.7, color='gray', label=label)
plt.legend(title="Training time", loc="lower right")

plt.tight_layout()
plt.show()

In [None]:
# 2. Training Time vs. Performance
plt.figure(figsize=(12, 8))
scatter = plt.scatter(
    results_df["training_time_seconds"],
    results_df["eval_accuracy"],
    s=results_df["param_percentage"] * 10,  # Size represents parameter percentage
    alpha=0.7,
    c=np.arange(len(results_df)),  # Color by index
    cmap="viridis"
)

# Add labels for each point
for i, row in results_df.iterrows():
    plt.annotate(
        row["method"],
        (row["training_time_seconds"], row["eval_accuracy"]),
        textcoords="offset points",
        xytext=(0, 10),
        ha='center',
        fontsize=10
    )

plt.title("Training Time vs. Performance Tradeoff")
plt.xlabel("Training Time (seconds)")
plt.ylabel("Evaluation Accuracy")
plt.grid(True, linestyle='--', alpha=0.7)

# Add a legend for bubble size (parameter percentage)
sizes = [0.1, 1, 10, 100]
labels = ["0.1%", "1%", "10%", "100%"]
for size, label in zip(sizes, labels):
    plt.scatter([], [], s=size*10, alpha=0.7, color='gray', label=label)
plt.legend(title="Parameter %", loc="lower right")

plt.tight_layout()
plt.show()

In [None]:
# 3. Memory Usage vs. Performance
plt.figure(figsize=(12, 8))
scatter = plt.scatter(
    results_df["peak_gpu_memory_mb"],
    results_df["eval_accuracy"],
    s=results_df["param_percentage"] * 10,  # Size represents parameter percentage
    alpha=0.7,
    c=np.arange(len(results_df)),  # Color by index
    cmap="viridis"
)

# Add labels for each point
for i, row in results_df.iterrows():
    plt.annotate(
        row["method"],
        (row["peak_gpu_memory_mb"], row["eval_accuracy"]),
        textcoords="offset points",
        xytext=(0, 10),
        ha='center',
        fontsize=10
    )

plt.title("Memory Usage vs. Performance Tradeoff")
plt.xlabel("Peak GPU Memory Usage (MB)")
plt.ylabel("Evaluation Accuracy")
plt.grid(True, linestyle='--', alpha=0.7)

# Add a legend for bubble size (parameter percentage)
sizes = [0.1, 1, 10, 100]
labels = ["0.1%", "1%", "10%", "100%"]
for size, label in zip(sizes, labels):
    plt.scatter([], [], s=size*10, alpha=0.7, color='gray', label=label)
plt.legend(title="Parameter %", loc="lower right")

plt.tight_layout()
plt.show()

In [None]:
# 4. Multi-dimensional comparison - Radar chart
# Normalize the metrics for comparison
metrics_to_normalize = [
    "param_percentage",
    "training_time_seconds",
    "eval_accuracy",
    "peak_gpu_memory_mb",
    "samples_per_second"
]

normalized_df = results_df.copy()

# Invert metrics where lower is better
for metric in ["param_percentage", "training_time_seconds", "peak_gpu_memory_mb"]:
    max_val = normalized_df[metric].max()
    normalized_df[f"{metric}_inv"] = (max_val - normalized_df[metric]) / max_val

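# Normalize metrics where higher is better
for metric in ["eval_accuracy", "samples_per_second"]:
    min_val = normalized_df[metric].min()
    max_val = normalized_df[metric].max()
    value_range = max_val - min_val
    if value_range == 0:
        # All methods tie on this metric (likely on small eval subsets); avoid 0/0
        normalized_df[f"{metric}_norm"] = 1.0
    else:
        normalized_df[f"{metric}_norm"] = (normalized_df[metric] - min_val) / value_range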

# Select methods to include in the radar chart (to avoid overcrowding)
methods_to_include = ["Full Fine-tuning", "LoRA-r8", "PrefixTuning-len32", "BitFit", "AdaLoRA"]
radar_df = normalized_df[normalized_df["method"].isin(methods_to_include)]

# Create radar chart
categories = [
    "Parameter Efficiency",
    "Training Speed",
    "Performance",
    "Memory Efficiency",
    "Inference Speed"
]

# Number of categories
N = len(categories)

# Create angle for each category
angles = [n / float(N) * 2 * np.pi for n in range(N)]
angles += angles[:1]  # Close the loop

# Create radar chart figure
fig, ax = plt.subplots(figsize=(12, 10), subplot_kw=dict(polar=True))

# Draw one axis per variable and add labels
plt.xticks(angles[:-1], categories, size=12)

# Draw ylabels
ax.set_rlabel_position(0)
plt.yticks([0.25, 0.5, 0.75], ["0.25", "0.5", "0.75"], color="grey", size=10)
plt.ylim(0, 1)

# Plot data
for i, method in enumerate(methods_to_include):
    method_data = radar_df[radar_df["method"] == method]
    
    # Get values for each dimension
    values = [
        method_data["param_percentage_inv"].values[0],
        method_data["training_time_seconds_inv"].values[0],
        method_data["eval_accuracy_norm"].values[0],
        method_data["peak_gpu_memory_mb_inv"].values[0],
        method_data["samples_per_second_norm"].values[0]
    ]
    
    # Close the loop
    values += values[:1]
    
    # Plot values
    ax.plot(angles, values, linewidth=2, linestyle='solid', label=method)
    ax.fill(angles, values, alpha=0.1)

# Add legend
plt.legend(loc='upper right', bbox_to_anchor=(0.1, 0.1))
plt.title("Multi-dimensional Comparison of PEFT Methods", size=15, y=1.1)

plt.tight_layout()
plt.show()

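The scatter plots show the tradeoff visually; the set of methods that are not beaten on both axes at once (the Pareto front) can also be extracted programmatically. The sketch below assumes rows with `param_percentage` (lower is better) and `eval_accuracy` (higher is better). The helper name `pareto_front` is hypothetical, and the sample rows are placeholders rather than measured results.

```python
def pareto_front(rows, cost_key="param_percentage", benefit_key="eval_accuracy"):
    """Return rows not dominated by any other row (lower cost, higher benefit)."""
    front = []
    for candidate in rows:
        dominated = any(
            other[cost_key] <= candidate[cost_key]
            and other[benefit_key] >= candidate[benefit_key]
            and (
                other[cost_key] < candidate[cost_key]
                or other[benefit_key] > candidate[benefit_key]
            )
            for other in rows
        )
        if not dominated:
            front.append(candidate)
    return sorted(front, key=lambda r: r[cost_key])


# Placeholder rows for illustration only; these are not measured results
sample = [
    {"method": "A", "param_percentage": 100.0, "eval_accuracy": 0.90},
    {"method": "B", "param_percentage": 0.4, "eval_accuracy": 0.88},
    {"method": "C", "param_percentage": 0.8, "eval_accuracy": 0.87},
    {"method": "D", "param_percentage": 0.1, "eval_accuracy": 0.85},
]
print([row["method"] for row in pareto_front(sample)])
```

In this placeholder data, C uses more parameters than B yet scores lower, so it drops out. The same idea extends to other cost columns such as training time or peak memory.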
## 8. Data Efficiency Experiment

One important aspect mentioned in the paper is data efficiency. Let's conduct an experiment to evaluate how different PEFT methods perform with varying amounts of training data.

In [None]:
def run_data_efficiency_experiment(methods=("Full Fine-tuning", "LoRA-r8", "BitFit", "PrefixTuning-len32")):
    """Experiment to measure performance with varying dataset sizes"""
    results = []
    
    # Define dataset sizes to test
    dataset_sizes = [100, 250, 500, 1000]
    
    for method_name in methods:
        for dataset_size in dataset_sizes:
            print(f"\nTesting {method_name} with {dataset_size} training examples...")
            
            # Create training subset
            train_subset = tokenized_datasets["train"].select(range(dataset_size))
            eval_subset = tokenized_datasets["validation"].select(range(200))  # Fixed eval set size
            
            # Create base model
            base_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2).to(device)
            
            if method_name == "Full Fine-tuning":
                model = base_model
            else:
                # Get PEFT configuration
                peft_config = peft_configs[method_name]
                
                # Create PEFT model
                model = get_peft_model(base_model, peft_config)
            
            # Train the model
            train_model(model, train_subset, batch_size=8, num_epochs=3)
            
            # Evaluate the model
            eval_results = evaluate_model(model, eval_subset)
            
            # Store results
            results.append({
                "method": method_name,
                "dataset_size": dataset_size,
                "accuracy": eval_results["eval_accuracy"],
                "loss": eval_results["eval_loss"]
            })
    
    return results

# Run the data efficiency experiment with a subset of methods
data_efficiency_results = run_data_efficiency_experiment(["Full Fine-tuning", "LoRA-r8", "BitFit", "PrefixTuning-len32"])

In [None]:
# Convert data efficiency results to DataFrame
data_eff_df = pd.DataFrame(data_efficiency_results)

# Display the results
print("\nData Efficiency Results")
print("======================")
print(data_eff_df.to_string(index=False))

# Create a line plot to visualize data efficiency
plt.figure(figsize=(12, 8))

# Group by method and dataset size
for method in data_eff_df["method"].unique():
    method_data = data_eff_df[data_eff_df["method"] == method]
    plt.plot(
        method_data["dataset_size"],
        method_data["accuracy"],
        marker='o',
        linewidth=2,
        label=method
    )

plt.title("Data Efficiency: Performance vs. Training Data Size")
plt.xlabel("Number of Training Examples")
plt.ylabel("Evaluation Accuracy")
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend()
plt.tight_layout()
plt.show()

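One way to summarize a data-efficiency curve in a single number is the training-set size at which a method first reaches a target accuracy. The following sketch estimates this by linear interpolation between measured points. The helper name `examples_to_reach` is hypothetical and the curve values are placeholders, not measured results.

```python
def examples_to_reach(sizes, accuracies, target):
    """Smallest training-set size at which the interpolated curve reaches target.

    Linearly interpolates between measured points; returns None if the
    target is never reached within the measured range.
    """
    points = sorted(zip(sizes, accuracies))
    if points[0][1] >= target:
        return float(points[0][0])
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 < target <= y1:
            return x0 + (target - y0) * (x1 - x0) / (y1 - y0)
    return None


# Placeholder curve for illustration only; these are not measured results
sizes = [100, 250, 500, 1000]
accuracies = [0.60, 0.70, 0.80, 0.85]
print(examples_to_reach(sizes, accuracies, target=0.75))
print(examples_to_reach(sizes, accuracies, target=0.90))
```

Applied per method to `data_eff_df`, this would give a rough ranking of data efficiency, though with only four dataset sizes the interpolation is coarse.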
## 9. Rank vs. Performance Experiment for LoRA

The paper mentions that the choice of hyperparameters like rank in LoRA can significantly impact the efficiency-performance tradeoff. Let's conduct an experiment to explore this relationship.

In [None]:
def run_lora_rank_experiment():
    """Experiment to measure how LoRA rank affects performance and efficiency"""
    results = []
    
    # Define LoRA ranks to test
    ranks = [1, 2, 4, 8, 16, 32, 64]
    
    # Create training and evaluation subsets
    train_subset = tokenized_datasets["train"].select(range(500))
    eval_subset = tokenized_datasets["validation"].select(range(200))
    
    for rank in ranks:
        print(f"\nTesting LoRA with rank {rank}...")
        
        # Create base model
        base_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2).to(device)
        
        # Create LoRA configuration
        lora_config = LoraConfig(
            r=rank,
            lora_alpha=rank * 2,  # Scale alpha with rank
            target_modules=["query", "key", "value"],
            lora_dropout=0.1,
            bias="none",
            task_type=TaskType.SEQ_CLS
        )
        
        # Create LoRA model
        lora_model = get_peft_model(base_model, lora_config)
        
        # Measure parameter efficiency
        param_metrics = measure_parameter_efficiency(lora_model, base_model)
        
        # Train the model
        train_model(lora_model, train_subset, batch_size=8, num_epochs=3)
        
        # Evaluate the model
        eval_results = evaluate_model(lora_model, eval_subset)
        
        # Measure inference efficiency
        inf_metrics = measure_inference_efficiency(lora_model, eval_subset)
        
        # Store results
        results.append({
            "rank": rank,
            "trainable_params": param_metrics["trainable_params"],
            "param_percentage": param_metrics["param_percentage"],
            "accuracy": eval_results["eval_accuracy"],
            "loss": eval_results["eval_loss"],
            "inference_time": inf_metrics["inference_time_seconds"],
            "samples_per_second": inf_metrics["samples_per_second"]
        })
    
    return results

# Run the LoRA rank experiment
lora_rank_results = run_lora_rank_experiment()

In [None]:
# Convert LoRA rank results to DataFrame
lora_rank_df = pd.DataFrame(lora_rank_results)

# Display the results
print("\nLoRA Rank Experiment Results")
print("============================")
display_df = lora_rank_df.copy()
display_df["trainable_params"] = display_df["trainable_params"].apply(lambda x: f"{x:,}")
display_df["param_percentage"] = display_df["param_percentage"].apply(lambda x: f"{x:.4f}%")
display_df["accuracy"] = display_df["accuracy"].apply(lambda x: f"{x:.4f}")
display_df["inference_time"] = display_df["inference_time"].apply(lambda x: f"{x:.4f}s")
print(display_df.to_string(index=False))

# Create visualizations
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 7))

# Plot rank vs. accuracy
ax1.plot(lora_rank_df["rank"], lora_rank_df["accuracy"], marker='o', linewidth=2)
ax1.set_title("LoRA Rank vs. Accuracy")
ax1.set_xlabel("LoRA Rank (r)")
ax1.set_ylabel("Evaluation Accuracy")
ax1.grid(True, linestyle='--', alpha=0.7)

# Plot parameter percentage vs. accuracy
ax2.scatter(
    lora_rank_df["param_percentage"],
    lora_rank_df["accuracy"],
    s=lora_rank_df["rank"] * 5,
    alpha=0.7
)
# Add labels for each point
for _, row in lora_rank_df.iterrows():
    ax2.annotate(
        # iterrows() upcasts the numeric row to float, so cast rank back to int
        f"r={int(row['rank'])}",
        (row["param_percentage"], row["accuracy"]),
        textcoords="offset points",
        xytext=(0, 10),
        ha='center',
        fontsize=10
    )
ax2.set_title("Parameter Percentage vs. Accuracy")
ax2.set_xlabel("Parameter Percentage (%)")
ax2.set_ylabel("Evaluation Accuracy")
ax2.grid(True, linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()
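
The parameter-percentage axis above follows directly from how LoRA is built: each adapted weight matrix of shape `d_out x d_in` gains two low-rank factors, so the added parameter count grows linearly with `r`. The sketch below works through that arithmetic. The helper `lora_param_count`, the rank values, and the architecture numbers (12 layers, hidden size 768, a BERT-base-like encoder) are assumptions for illustration only; the notebook's actual `model_name` may differ, and with `TaskType.SEQ_CLS` the classification head is typically kept trainable as well, so measured trainable counts will be somewhat higher.

```python
def lora_param_count(rank, d_in, d_out, num_adapted_matrices):
    """Parameters added by LoRA: every adapted (d_out x d_in) weight gets
    a factor A of shape (rank x d_in) and a factor B of shape (d_out x rank)."""
    return num_adapted_matrices * rank * (d_in + d_out)


# Assumed (hypothetical) BERT-base-like architecture, not read from the notebook's model
hidden_size = 768
num_layers = 12
adapted_per_layer = 3  # query, key, value, matching target_modules above

lora_counts = {
    r: lora_param_count(r, hidden_size, hidden_size, num_layers * adapted_per_layer)
    for r in (4, 8, 16, 32)  # illustrative ranks
}

for r, n in lora_counts.items():
    print(f"r={r:>2}: {n:,} LoRA parameters")
```

Doubling the rank doubles the added parameters, which is why the scatter plot's x-axis spacing mirrors the rank spacing.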

## 10. Creating a PEFT Selection Framework

Based on our experiments and the findings from the paper, let's create a framework to help select the most appropriate PEFT method based on specific requirements and constraints.

In [None]:
def peft_method_selector(importance_weights, available_methods=None):
    """Select the most appropriate PEFT method based on importance weights"""
    # Default available methods if none provided
    if available_methods is None:
        available_methods = [
            "Full Fine-tuning", "LoRA", "AdaLoRA", "BitFit", 
            "PrefixTuning", "PromptTuning"
        ]
    
    # Heuristic method profiles on a 0-1 scale (higher is better), informed by
    # our experiments and the paper; treat them as rough priors, not measurements
    method_profiles = {
        "Full Fine-tuning": {
            "parameter_efficiency": 0.1,  # Low efficiency (uses 100% of parameters)
            "computational_efficiency": 0.2,  # Low efficiency (slow training, high memory)
            "performance": 1.0,  # High performance (baseline)
            "generalization": 0.9,  # High generalization
            "data_efficiency": 0.3,  # Low data efficiency (needs more data)
            "inference_efficiency": 0.7  # Moderate inference efficiency
        },
        "LoRA": {
            "parameter_efficiency": 0.9,  # Very high efficiency (<1% of parameters)
            "computational_efficiency": 0.8,  # High efficiency (faster training, lower memory)
            "performance": 0.95,  # Very good performance (close to full fine-tuning)
            "generalization": 0.8,  # Good generalization
            "data_efficiency": 0.7,  # Good data efficiency
            "inference_efficiency": 0.8  # Good inference efficiency
        },
        "AdaLoRA": {
            "parameter_efficiency": 0.85,  # High efficiency
            "computational_efficiency": 0.7,  # Good efficiency
            "performance": 0.97,  # Excellent performance
            "generalization": 0.85,  # Very good generalization
            "data_efficiency": 0.75,  # Good data efficiency
            "inference_efficiency": 0.75  # Good inference efficiency
        },
        "BitFit": {
            "parameter_efficiency": 0.95,  # Extremely high efficiency (<0.1% of parameters)
            "computational_efficiency": 0.9,  # Very high efficiency
            "performance": 0.85,  # Good performance (some drop from full fine-tuning)
            "generalization": 0.75,  # Decent generalization
            "data_efficiency": 0.8,  # Very good data efficiency
            "inference_efficiency": 0.9  # Very good inference efficiency
        },
        "PrefixTuning": {
            "parameter_efficiency": 0.9,  # Very high efficiency
            "computational_efficiency": 0.8,  # High efficiency
            "performance": 0.85,  # Good performance
            "generalization": 0.8,  # Good generalization
            "data_efficiency": 0.7,  # Good data efficiency
            "inference_efficiency": 0.7  # Good inference efficiency
        },
        "PromptTuning": {
            "parameter_efficiency": 0.95,  # Extremely high efficiency
            "computational_efficiency": 0.85,  # Very high efficiency
            "performance": 0.8,  # Moderate performance
            "generalization": 0.7,  # Moderate generalization
            "data_efficiency": 0.8,  # Very good data efficiency
            "inference_efficiency": 0.8  # Good inference efficiency
        }
    }
    
    # Filter available methods
    filtered_methods = {k: v for k, v in method_profiles.items() if k in available_methods}
    
    # Calculate weighted scores
    method_scores = {}
    for method, profile in filtered_methods.items():
        weighted_score = sum(profile[dim] * weight for dim, weight in importance_weights.items())
        method_scores[method] = weighted_score
    
    # Sort methods by score
    sorted_methods = sorted(method_scores.items(), key=lambda x: x[1], reverse=True)
    
    return sorted_methods

# Example usage: performance and parameter efficiency are weighted equally highest,
# followed by computational efficiency
example_weights = {
    "parameter_efficiency": 0.3,
    "computational_efficiency": 0.2,
    "performance": 0.3,
    "generalization": 0.1,
    "data_efficiency": 0.05,
    "inference_efficiency": 0.05
}

recommended_methods = peft_method_selector(example_weights)

# Display recommendations
print("PEFT Method Recommendations")
print("==========================")
print(f"Importance Weights: {example_weights}")
print("\nRecommended Methods (in order):")
for method, score in recommended_methods:
    print(f"{method}: {score:.4f}")
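
The weighted sum in `peft_method_selector` assumes that every key in `importance_weights` is one of the six profile dimensions and that the weights are on a comparable scale. A misspelled key would raise a `KeyError`, and weights that do not sum to 1 make scores hard to compare across weight sets. The following is a minimal sketch of a validation step that could run before scoring; the helper name `normalize_importance_weights` and the choice to fill missing dimensions with zero are assumptions, not part of the original framework.

```python
EXPECTED_DIMENSIONS = (
    "parameter_efficiency",
    "computational_efficiency",
    "performance",
    "generalization",
    "data_efficiency",
    "inference_efficiency",
)


def normalize_importance_weights(raw_weights, dimensions=EXPECTED_DIMENSIONS):
    """Validate raw importance ratings and rescale them to sum to 1.

    Hypothetical helper: missing dimensions are treated as weight 0.
    """
    unknown = set(raw_weights) - set(dimensions)
    if unknown:
        raise ValueError(f"Unknown dimensions: {sorted(unknown)}")

    filled = {dim: float(raw_weights.get(dim, 0.0)) for dim in dimensions}
    if any(value < 0 for value in filled.values()):
        raise ValueError("Weights must be non-negative")

    total = sum(filled.values())
    if total == 0:
        raise ValueError("At least one weight must be positive")

    return {dim: value / total for dim, value in filled.items()}


# Ratings on any scale (here 0-10) are rescaled to fractions of the total
normalized = normalize_importance_weights({"performance": 10, "parameter_efficiency": 5})
print(normalized)
```

The returned dictionary always contains all six dimensions, so it can be passed to a scorer that iterates over the weights without extra key checks.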

## 11. Interactive PEFT Method Selector

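Let's create a tool that helps users select the most appropriate PEFT method based on their specific requirements and constraints. Because notebook cells run non-interactively here, the cell below simulates the interaction with predefined ratings for several representative scenarios.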

In [None]:
def interactive_peft_selector():
    """Interactive tool to help select the most appropriate PEFT method"""
    print("PEFT Method Selector")
    print("===================")
    print("\nRate the importance of each dimension on a scale of 0-10:")
    
    # Collect importance weights
    dimensions = [
        "parameter_efficiency",
        "computational_efficiency",
        "performance",
        "generalization",
        "data_efficiency",
        "inference_efficiency"
    ]
    
    # In a real interactive environment, you would collect each rating with input().
    # For this notebook, every dimension defaults to 5, and these ratings are
    # scored below as the "Custom" scenario so they are not silently discarded.
    weights = {dim: 5 for dim in dimensions}
    
    # Example ratings for different scenarios
    scenarios = {
        "Custom": weights,
        "Performance Critical": {
            "parameter_efficiency": 3,
            "computational_efficiency": 2,
            "performance": 10,
            "generalization": 7,
            "data_efficiency": 3,
            "inference_efficiency": 5
        },
        "Resource Constrained": {
            "parameter_efficiency": 10,
            "computational_efficiency": 9,
            "performance": 5,
            "generalization": 3,
            "data_efficiency": 7,
            "inference_efficiency": 8
        },
        "Limited Data": {
            "parameter_efficiency": 7,
            "computational_efficiency": 5,
            "performance": 6,
            "generalization": 8,
            "data_efficiency": 10,
            "inference_efficiency": 4
        },
        "Deployment Focused": {
            "parameter_efficiency": 8,
            "computational_efficiency": 6,
            "performance": 7,
            "generalization": 5,
            "data_efficiency": 3,
            "inference_efficiency": 10
        },
        "Balanced": {
            "parameter_efficiency": 6,
            "computational_efficiency": 6,
            "performance": 7,
            "generalization": 6,
            "data_efficiency": 5,
            "inference_efficiency": 5
        }
    }
    
    # Display recommendations for each scenario
    for scenario, scenario_weights in scenarios.items():
        # Normalize weights
        total = sum(scenario_weights.values())
        normalized_weights = {k: v/total for k, v in scenario_weights.items()}
        
        # Get recommendations
        recommendations = peft_method_selector(normalized_weights)
        
        print(f"\n{scenario} Scenario")
        print("-" * (len(scenario) + 9))
        print("Importance Weights:")
        for dim, weight in normalized_weights.items():
            print(f"  {dim}: {weight:.2f}")
        print("\nRecommended Methods:")
        for i, (method, score) in enumerate(recommendations[:3], 1):
            print(f"  {i}. {method} (score: {score:.4f})")
        print(f"\nSpecific Configuration Advice for {recommendations[0][0]}:")
        
        # Provide specific configuration advice based on top recommendation
        top_method = recommendations[0][0]
        if top_method == "LoRA" or top_method == "AdaLoRA":
            if normalized_weights["performance"] > 0.3:
                print("  - Use higher rank (r=16 or r=32) for better performance")
            else:
                print("  - Use lower rank (r=4 or r=8) for better efficiency")
            print("  - Target attention layers (query, key, value) for most domains")
            print("  - Consider adding dense layers for more complex tasks")
        elif top_method == "PrefixTuning":
            if normalized_weights["performance"] > 0.3:
                print("  - Use longer prefix length (32-64) for better performance")
            else:
                print("  - Use shorter prefix length (8-16) for better efficiency")
            print("  - Consider combining with BitFit for better performance")
        elif top_method == "BitFit":
            print("  - Works well with larger models where biases capture more information")
            print("  - Consider combining with LoRA for better performance-efficiency tradeoff")
        elif top_method == "Full Fine-tuning":
            print("  - Use gradient accumulation for larger batch sizes with limited memory")
            print("  - Consider mixed-precision training or 8-bit optimizers to reduce memory requirements")
            print("  - Explore using LoRA with high rank as an alternative")
        else:
            # Avoid an empty advice section for methods without a dedicated branch
            print("  - No method-specific advice defined; start from library defaults and tune from there")

# Run the interactive selector
interactive_peft_selector()
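
The selector above simulates user input with fixed values. In an environment where prompts are available, the ratings could be collected with `input()`. The sketch below shows one way to do that with basic validation: blank or unparsable answers fall back to a default, and out-of-range answers are clamped to 0-10. The function name `collect_ratings` and its `reader` parameter are hypothetical; `reader` exists only so the prompt source can be swapped for scripted answers, as in the demonstration.

```python
def collect_ratings(dimensions, reader=input, default=5.0):
    """Ask for a 0-10 importance rating per dimension.

    Hypothetical helper: blank or invalid answers use `default`,
    and numeric answers are clamped to the range [0, 10].
    """
    ratings = {}
    for dim in dimensions:
        raw = reader(f"{dim} (0-10) [{default}]: ").strip()
        if not raw:
            ratings[dim] = default
            continue
        try:
            value = float(raw)
        except ValueError:
            value = default
        ratings[dim] = min(10.0, max(0.0, value))
    return ratings


# Non-interactive demonstration: scripted answers stand in for a live prompt
scripted_answers = iter(["10", "", "abc", "42"])
demo_ratings = collect_ratings(
    ["performance", "generalization", "data_efficiency", "inference_efficiency"],
    reader=lambda prompt: next(scripted_answers),
)
print(demo_ratings)
```

Calling `collect_ratings(dimensions)` with the default `reader` would prompt on the console; the resulting ratings can then be normalized and scored like any scenario above.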

## 12. Conclusion: Key Insights and Best Practices

Based on our experiments and the findings from the paper, we can distill several key insights and best practices for navigating the performance-efficiency tradeoffs in PEFT methods:

### Key Insights from Our Experiments

1. **No One-Size-Fits-All Solution**: Different PEFT methods excel in different dimensions. The optimal choice depends on specific requirements and constraints.

2. **Parameter Efficiency ≠ Performance Sacrifice**: Some methods (like LoRA with appropriate rank) can achieve performance very close to full fine-tuning while using <1% of parameters.

3. **Data Efficiency Varies**: PEFT methods show different behavior with limited data. Some methods (like BitFit) excel in low-data regimes, making them suitable for domains with limited labeled data.

4. **Hyperparameter Sensitivity**: The performance of PEFT methods can be significantly influenced by hyperparameter choices. For example, LoRA's rank parameter creates a direct tradeoff between efficiency and performance.

5. **Multi-dimensional Tradeoffs**: The efficiency-performance tradeoff is multi-dimensional, involving parameters, computation, memory, performance, and generalization. Optimizing for one dimension often affects others.

### Best Practices for PEFT Method Selection

1. **Clearly Define Requirements**: Before selecting a PEFT method, clearly define your requirements across all dimensions: parameter efficiency, computational efficiency, performance, generalization, data efficiency, and inference efficiency.

2. **Prioritize Dimensions**: Determine which dimensions are most important for your specific application. This will guide the selection of the most appropriate PEFT method.

3. **Consider Application Domain**: Different domains may benefit from different PEFT methods. For instance, NLP tasks often work well with LoRA and BitFit, while vision tasks might benefit from adapter approaches.

4. **Test Multiple Methods**: Given the significant variation in performance across methods, it's beneficial to experiment with multiple PEFT approaches for your specific task.

5. **Optimize Hyperparameters**: Once you've selected a method, carefully tune its hyperparameters to find the optimal balance between efficiency and performance for your specific requirements.

6. **Consider Combining Methods**: Some of the best results come from combining multiple PEFT methods (e.g., BitFit+LoRA), leveraging the strengths of each approach.

7. **Benchmark Systematically**: Use a systematic benchmarking approach that considers all relevant dimensions, not just parameter count or accuracy.
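
One way to make systematic benchmarking concrete is to keep only the configurations that are not dominated on the dimensions you care about, that is, the Pareto front. The sketch below does this for two dimensions, a cost to minimize and a score to maximize. The function `pareto_front` is a hypothetical helper, and the run entries are placeholder numbers for illustration only, not measured results from this notebook.

```python
def pareto_front(configs, cost_key, score_key):
    """Return configs that no other config beats on both cost (lower is
    better) and score (higher is better), sorted by increasing cost."""
    front = []
    for candidate in configs:
        dominated = any(
            other[cost_key] <= candidate[cost_key]
            and other[score_key] >= candidate[score_key]
            and (
                other[cost_key] < candidate[cost_key]
                or other[score_key] > candidate[score_key]
            )
            for other in configs
        )
        if not dominated:
            front.append(candidate)
    return sorted(front, key=lambda c: c[cost_key])


# Placeholder values for illustration only; these are NOT experimental results
placeholder_runs = [
    {"name": "config_a", "param_percentage": 0.1, "accuracy": 0.80},
    {"name": "config_b", "param_percentage": 0.3, "accuracy": 0.85},
    {"name": "config_c", "param_percentage": 0.5, "accuracy": 0.84},
    {"name": "config_d", "param_percentage": 1.0, "accuracy": 0.86},
]

front = pareto_front(placeholder_runs, "param_percentage", "accuracy")
print([c["name"] for c in front])
```

Here `config_c` is dropped because `config_b` uses fewer parameters and scores higher; the remaining configurations each represent a different, defensible tradeoff. The same function could be applied to rows of `lora_rank_df` converted with `to_dict("records")`.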

By following these insights and best practices, practitioners can effectively navigate the complex tradeoffs involved in PEFT methods and select the approach that best meets their specific requirements and constraints.

## References

1. Balne, C. C. S., Bhaduri, S., Roy, T., Jain, V., & Chadha, A. (2024). Parameter Efficient Fine Tuning: A Comprehensive Analysis Across Applications. arXiv:2404.13506v2.

2. Wu, Z., Arora, A., Wang, Z., Geiger, A., Jurafsky, D., Manning, C. D., & Potts, C. (2024). ReFT: Representation finetuning for language models. arXiv preprint arXiv:2404.03592.

3. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., ... & Chen, W. (2021). LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

4. Li, X. L., & Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.

5. Zaken, E. B., Ravfogel, S., & Goldberg, Y. (2021). BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199.

6. Zhang, N., Li, L., Chen, X., Deng, S., Bi, Z., Tan, C., Huang, F., & Chen, H. (2021). Differentiable prompt makes pre-trained language models better few-shot learners. arXiv preprint arXiv:2108.13161.