# Focused Learning: Comparative Analysis Frameworks for PEFT Methods

## Learning Objectives
- Implement a robust framework for evaluating and comparing different PEFT methods
- Understand the metrics and benchmarks used to evaluate PEFT techniques
- Analyze the trade-offs between parameter efficiency, performance, and computational cost
- Develop methodologies for fair and systematic comparison across diverse applications

## Paper Reference
This notebook explores concepts from the paper "Parameter Efficient Fine Tuning: A Comprehensive Analysis Across Applications" (arXiv:2404.13506v2).

Specifically, we focus on Tables 1-3 and Section 4, which present comparative analyses of PEFT methods:

> "PEFT has emerged as a compelling approach for tailoring large pre-trained models to specific tasks while minimizing computational demands. Our review found that leveraging PEFT across diverse applications presents several key challenges that require careful consideration..." (Section 4, Page 6)

The paper provides a comprehensive comparative analysis across different PEFT techniques and applications, which we'll explore in this focused learning notebook.

## 1. Introduction to Comparative Analysis of PEFT Methods

When evaluating Parameter-Efficient Fine-Tuning (PEFT) methods, it's crucial to have a structured framework for comparison. Different techniques can vary widely in their parameter efficiency, computational requirements, and performance across various tasks and domains.

In this notebook, we'll develop and demonstrate a comprehensive framework for comparing PEFT methods, inspired by the analysis presented in the paper. We'll explore:

1. Key metrics for evaluating PEFT methods
2. Implementation of a comparative analysis framework
3. Visualization techniques for comparing methods
4. Application-specific comparisons

In [None]:
# Install necessary libraries
!pip install torch transformers accelerate datasets peft matplotlib numpy pandas seaborn tabulate tqdm

In [None]:
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm
import time
from tabulate import tabulate

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding
)
from datasets import load_dataset
from peft import (
    get_peft_model,
    LoraConfig,
    PrefixTuningConfig,
    PromptEncoderConfig,
    TaskType,
    PeftType,
    PeftConfig,
    PeftModel,
    # PEFT ships no BitFitConfig class, so BitFit is not imported here
    AdaLoraConfig
)

# Set the seed for reproducibility
seed = 42
torch.manual_seed(seed)
np.random.seed(seed)

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## 2. Metrics for Evaluating PEFT Methods

Based on the paper, we'll define a set of key metrics for evaluating PEFT methods:

### 2.1 Parameter Efficiency Metrics

1. **Trainable Parameter Count**: The number of parameters that are updated during fine-tuning
2. **Parameter Percentage**: The percentage of trainable parameters relative to the full model size
3. **Parameter Reduction Ratio**: The fraction of trainable parameters saved by using PEFT (1 - trainable PEFT parameters / trainable parameters under full fine-tuning)

### 2.2 Performance Metrics

1. **Task-Specific Metrics**: Accuracy, F1 score, BLEU, etc., depending on the task
2. **Performance Gap**: The difference in performance between PEFT and full fine-tuning
3. **Performance Retention**: The percentage of full fine-tuning performance retained by PEFT

### 2.3 Computational Efficiency Metrics

1. **Training Time**: The time required to train the model
2. **Memory Usage**: The peak memory consumption during training
3. **Training Speed**: The number of samples processed per second

Let's implement functions to compute these metrics:

In [None]:
def count_parameters(model):
    """Count the number of trainable parameters in a model"""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def count_total_parameters(model):
    """Count the total number of parameters in a model"""
    return sum(p.numel() for p in model.parameters())

def parameter_efficiency_metrics(model, full_model):
    """Compute parameter efficiency metrics"""
    trainable_params = count_parameters(model)
    total_params = count_total_parameters(model)
    full_trainable_params = count_parameters(full_model)
    full_total_params = count_total_parameters(full_model)
    
    param_percentage = trainable_params / full_trainable_params * 100
    param_reduction = 1 - (trainable_params / full_trainable_params)
    
    return {
        "trainable_params": trainable_params,
        "total_params": total_params,
        "param_percentage": param_percentage,
        "param_reduction": param_reduction
    }

def performance_metrics(peft_results, full_results, metric_name="accuracy"):
    """Compute performance metrics"""
    peft_performance = peft_results[metric_name]
    full_performance = full_results[metric_name]
    
    performance_gap = full_performance - peft_performance
    performance_retention = peft_performance / full_performance * 100
    
    return {
        "peft_performance": peft_performance,
        "full_performance": full_performance,
        "performance_gap": performance_gap,
        "performance_retention": performance_retention
    }

def computational_efficiency_metrics(peft_time, full_time, peft_memory, full_memory, samples_per_epoch):
    """Compute computational efficiency metrics"""
    training_time_reduction = 1 - (peft_time / full_time)
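    # GPU memory readings are 0 on CPU-only runs, so guard the division
    memory_usage_reduction = 1 - (peft_memory / full_memory) if full_memory else float("nan")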
    peft_speed = samples_per_epoch / peft_time
    full_speed = samples_per_epoch / full_time
    speed_improvement = peft_speed / full_speed
    
    return {
        "peft_training_time": peft_time,
        "full_training_time": full_time,
        "training_time_reduction": training_time_reduction,
        "peft_memory_usage": peft_memory,
        "full_memory_usage": full_memory,
        "memory_usage_reduction": memory_usage_reduction,
        "peft_speed": peft_speed,
        "full_speed": full_speed,
        "speed_improvement": speed_improvement
    }
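
To make the parameter and performance definitions from Section 2 concrete, here is a small worked example. It is a minimal sketch: the helper name `summarize_peft_run` and all numbers are made up for illustration and are not measurements from this notebook or the paper.

```python
def summarize_peft_run(trainable_params, full_params, peft_score, full_score):
    """Return the Section 2 parameter and performance metrics for one run."""
    return {
        # Share of the full model's parameters that the PEFT method trains, in percent
        "param_percentage": trainable_params / full_params * 100,
        # Fraction of trainable parameters saved relative to full fine-tuning
        "param_reduction": 1 - trainable_params / full_params,
        # Absolute score lost relative to full fine-tuning
        "performance_gap": full_score - peft_score,
        # Share of the full fine-tuning score that PEFT keeps, in percent
        "performance_retention": peft_score / full_score * 100,
    }

# Hypothetical run: 1M trainable parameters out of 100M, accuracy 0.90 vs. 0.92
summary = summarize_peft_run(1_000_000, 100_000_000, peft_score=0.90, full_score=0.92)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Note that the parameter percentage is reported in percent while the reduction ratio is a fraction, matching the functions defined in this section.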

## 3. Implementing a Comparative Analysis Framework

Now, let's implement a framework for systematically comparing different PEFT methods. We'll use the GLUE benchmark's SST-2 dataset for sentiment analysis as an example task.

In [None]:
# Load the SST-2 dataset
dataset = load_dataset("glue", "sst2")
print(dataset)

In [None]:
# Preprocess the dataset
def preprocess_function(examples, tokenizer, max_length=128):
    # Truncate only; DataCollatorWithPadding pads each batch to its longest sequence
    return tokenizer(examples["sentence"], truncation=True, max_length=max_length)

# Define the model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize the dataset
tokenized_datasets = dataset.map(
    lambda examples: preprocess_function(examples, tokenizer),
    batched=True
)
tokenized_datasets = tokenized_datasets.remove_columns(["sentence", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

# Define the data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

### 3.1 Define PEFT Configurations

Let's define configurations for different PEFT methods that we want to compare:

In [None]:
def get_peft_configs():
    """Define PEFT configurations for comparison"""
    peft_configs = {
        "LoRA_r8": LoraConfig(
            r=8,
            lora_alpha=16,
            target_modules=["query", "key", "value"],
            lora_dropout=0.1,
            bias="none",
            task_type=TaskType.SEQ_CLS
        ),
        "LoRA_r16": LoraConfig(
            r=16,
            lora_alpha=32,
            target_modules=["query", "key", "value"],
            lora_dropout=0.1,
            bias="none",
            task_type=TaskType.SEQ_CLS
        ),
        "LoRA_r32": LoraConfig(
            r=32,
            lora_alpha=64,
            target_modules=["query", "key", "value"],
            lora_dropout=0.1,
            bias="none",
            task_type=TaskType.SEQ_CLS
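        ),
        "PrefixTuning": PrefixTuningConfig(
            task_type=TaskType.SEQ_CLS,
            num_virtual_tokens=20,  # prefix length; PrefixTuningConfig has no prefix_length argument
        ),
        # PromptEncoderConfig implements P-tuning (a trained prompt encoder), not plain prompt tuning
        "PTuning": PromptEncoderConfig(
            task_type=TaskType.SEQ_CLS,
            num_virtual_tokens=20,
            encoder_hidden_size=128
        ),
        # BitFit (training only bias terms) has no config class in PEFT, so it is left out of
        # this dictionary; it would need a manual loop that freezes every non-bias parameter
        "AdaLoRA": AdaLoraConfig(
            init_r=12,   # starting rank of each adapted matrix
            target_r=8,  # average rank the budget scheduler aims for
            lora_alpha=16,
            target_modules=["query", "key", "value"],
            lora_dropout=0.1,
            # Assumption: recent PEFT releases may also require total_step (total training steps)
            task_type=TaskType.SEQ_CLS
        )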
    }
    
    return peft_configs
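
Before training anything, the LoRA parameter budgets can be estimated by hand. The sketch below assumes the standard LoRA parameterization (one `rank x d_in` matrix and one `d_out x rank` matrix per adapted module) and BERT-base dimensions (hidden size 768, 12 encoder layers). The helper names are hypothetical and not part of PEFT.

```python
HIDDEN_SIZE = 768       # BERT-base hidden size
NUM_LAYERS = 12         # BERT-base encoder layers
TARGETS_PER_LAYER = 3   # query, key, and value projections

def lora_param_count(d_in, d_out, rank):
    """Parameters added by one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

def estimate_lora_params(rank):
    """Total adapter parameters when query/key/value are adapted in every layer."""
    per_module = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, rank)
    return per_module * TARGETS_PER_LAYER * NUM_LAYERS

for rank in (8, 16, 32):
    print(f"r={rank}: {estimate_lora_params(rank):,} adapter parameters")
```

With `task_type=TaskType.SEQ_CLS`, the classification head is also trained, so `print_trainable_parameters()` should report somewhat more than these adapter-only estimates.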

### 3.2 Training and Evaluation Function

Now, let's implement a function to train and evaluate models with different PEFT configurations:

In [None]:
import gc

def compute_accuracy(eval_pred):
    """Turn the (logits, labels) pair passed by Trainer into an accuracy score"""
    logits, labels = eval_pred
    if isinstance(logits, tuple):  # some models return extra outputs after the logits
        logits = logits[0]
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}

def timed_train(trainer):
    """Run trainer.train() and return (elapsed seconds, peak GPU memory in bytes)"""
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        # Drop leftovers from earlier runs, then restart peak tracking
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
    start_time = time.time()
    trainer.train()
    elapsed = time.time() - start_time
    peak_memory = torch.cuda.max_memory_allocated() if use_cuda else 0
    return elapsed, peak_memory

def train_and_evaluate(model_name, peft_configs, tokenized_datasets, metric_name="accuracy", num_epochs=3):
    """Train and evaluate different PEFT configurations"""
    results = []
    eval_key = "eval_accuracy"  # compute_accuracy reports accuracy only

    # Shared training arguments. Every method uses the same settings here, although
    # PEFT methods often prefer a higher learning rate than full fine-tuning.
    training_args = TrainingArguments(
        output_dir="./results",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=num_epochs,
        weight_decay=0.01,
        evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
        save_strategy="epoch",
        load_best_model_at_end=True,
        push_to_hub=False,
        report_to="none"
    )

    def build_trainer(model):
        """Create a Trainer that reports accuracy at evaluation time"""
        return Trainer(
            model=model,
            args=training_args,
            train_dataset=tokenized_datasets["train"],
            eval_dataset=tokenized_datasets["validation"],
            tokenizer=tokenizer,
            data_collator=data_collator,
            compute_metrics=compute_accuracy,  # without this, eval_accuracy is never produced
        )

    # Train and evaluate full fine-tuning first
    print("Training full fine-tuning model...")
    full_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2).to(device)
    full_trainable_params = count_parameters(full_model)

    trainer = build_trainer(full_model)
    full_training_time, full_memory_usage = timed_train(trainer)
    full_eval_results = trainer.evaluate()

    # Store full model results
    full_results = {
        "method": "Full Fine-tuning",
        "trainable_params": full_trainable_params,
        "param_percentage": 100.0,
        "training_time": full_training_time,
        "memory_usage": full_memory_usage,
        metric_name: full_eval_results[eval_key]
    }
    results.append(full_results)

    # Release the full model so it does not inflate the PEFT memory readings
    del trainer, full_model

    # Train and evaluate PEFT models
    for method_name, peft_config in peft_configs.items():
        print(f"\nTraining {method_name} model...")

        # Create a fresh base model and wrap it with the PEFT method
        base_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2).to(device)
        peft_model = get_peft_model(base_model, peft_config)
        peft_trainable_params = count_parameters(peft_model)
        param_percentage = peft_trainable_params / full_trainable_params * 100
        peft_model.print_trainable_parameters()

        # Train PEFT model and measure time and peak memory
        trainer = build_trainer(peft_model)
        peft_training_time, peft_memory_usage = timed_train(trainer)
        peft_eval_results = trainer.evaluate()

        # Calculate performance metrics relative to full fine-tuning
        perf_metrics = performance_metrics(
            {metric_name: peft_eval_results[eval_key]},
            {metric_name: full_results[metric_name]},
            metric_name
        )

        # Store PEFT model results
        peft_results = {
            "method": method_name,
            "trainable_params": peft_trainable_params,
            "param_percentage": param_percentage,
            "training_time": peft_training_time,
            "memory_usage": peft_memory_usage,
            metric_name: peft_eval_results[eval_key],
            "performance_gap": perf_metrics["performance_gap"],
            "performance_retention": perf_metrics["performance_retention"],
            "training_time_reduction": 1 - (peft_training_time / full_training_time),
            # Peak memory is 0 on CPU-only runs, so guard the division
            "memory_reduction": (1 - peft_memory_usage / full_memory_usage) if full_memory_usage else float("nan")
        }
        results.append(peft_results)

        # Free this run's model before the next method is trained
        del trainer, peft_model, base_model

    return results

### 3.3 Run the Comparative Analysis

Now, let's run our comparative analysis framework on the selected PEFT methods:

In [None]:
# Get PEFT configurations
peft_configs = get_peft_configs()

# Run the comparative analysis
results = train_and_evaluate(model_name, peft_configs, tokenized_datasets, "accuracy", num_epochs=3)

# Convert results to DataFrame for easier analysis
results_df = pd.DataFrame(results)

## 4. Visualizing Comparative Results

Let's create visualizations to compare the performance of different PEFT methods based on our evaluation results.

In [None]:
# Display the results table
print("Comparative Analysis Results:")
display_cols = [
    "method", "trainable_params", "param_percentage", "accuracy", 
    "performance_retention", "training_time", "training_time_reduction"
]
display_df = results_df[display_cols].copy()
display_df["trainable_params"] = display_df["trainable_params"].apply(lambda x: f"{x:,}")
display_df["param_percentage"] = display_df["param_percentage"].apply(lambda x: f"{x:.2f}%")
display_df["accuracy"] = display_df["accuracy"].apply(lambda x: f"{x:.4f}")
display_df["performance_retention"] = display_df["performance_retention"].apply(lambda x: f"{x:.2f}%" if not pd.isna(x) else "100.00%")
display_df["training_time"] = display_df["training_time"].apply(lambda x: f"{x:.2f}s")
display_df["training_time_reduction"] = display_df["training_time_reduction"].apply(lambda x: f"{x*100:.2f}%" if not pd.isna(x) else "0.00%")

# Print the table using tabulate
print(tabulate(display_df, headers="keys", tablefmt="grid", showindex=False))

In [None]:
# Visualize parameter efficiency vs. performance
plt.figure(figsize=(12, 6))
full_model_accuracy = results_df[results_df["method"] == "Full Fine-tuning"]["accuracy"].values[0]

# Filter out the full model for better visualization
peft_results = results_df[results_df["method"] != "Full Fine-tuning"].copy()

# Create scatter plot
sns.scatterplot(
    data=peft_results, 
    x="param_percentage", 
    y="accuracy", 
    hue="method", 
    size="training_time",
    sizes=(100, 500),
    alpha=0.7
)

# Add a horizontal line for full model accuracy
plt.axhline(y=full_model_accuracy, color='r', linestyle='--', label=f"Full Fine-tuning ({full_model_accuracy:.4f})")

# Add labels for each point
for i, row in peft_results.iterrows():
    plt.annotate(
        row["method"], 
        (row["param_percentage"], row["accuracy"]),
        textcoords="offset points", 
        xytext=(0, 10), 
        ha='center'
    )

plt.title("Parameter Efficiency vs. Performance")
plt.xlabel("Parameter Percentage (%)")
plt.ylabel("Accuracy")
plt.xscale("log")
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend(title="Method")
plt.tight_layout()
plt.show()

In [None]:
# Visualize training time vs. performance
plt.figure(figsize=(12, 6))

sns.scatterplot(
    data=results_df, 
    x="training_time", 
    y="accuracy", 
    hue="method", 
    size="param_percentage",
    sizes=(50, 500),
    alpha=0.7
)

# Add labels for each point
for i, row in results_df.iterrows():
    plt.annotate(
        row["method"], 
        (row["training_time"], row["accuracy"]),
        textcoords="offset points", 
        xytext=(0, 10), 
        ha='center'
    )

plt.title("Training Time vs. Performance")
plt.xlabel("Training Time (seconds)")
plt.ylabel("Accuracy")
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend(title="Method")
plt.tight_layout()
plt.show()

In [None]:
# Visualize the efficiency-performance trade-off
plt.figure(figsize=(15, 6))

# Build the three series shown in the bar charts below
peft_methods = results_df["method"].tolist()
param_efficiency = 100 - results_df["param_percentage"]
normalized_accuracy = results_df["accuracy"] / results_df["accuracy"].max() * 100
time_efficiency = (1 - (results_df["training_time"] / results_df["training_time"].max())) * 100

# Plot parameter efficiency
plt.subplot(1, 3, 1)
bars = plt.bar(peft_methods, param_efficiency, color=sns.color_palette("viridis", len(peft_methods)))
plt.title("Parameter Efficiency (higher is better)")
plt.xlabel("Method")
plt.ylabel("Efficiency (%)")
plt.xticks(rotation=45, ha='right')
plt.ylim(0, 105)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Plot normalized accuracy
plt.subplot(1, 3, 2)
bars = plt.bar(peft_methods, normalized_accuracy, color=sns.color_palette("viridis", len(peft_methods)))
plt.title("Normalized Accuracy (higher is better)")
plt.xlabel("Method")
plt.ylabel("Normalized Accuracy (%)")
plt.xticks(rotation=45, ha='right')
plt.ylim(0, 105)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Plot time efficiency
plt.subplot(1, 3, 3)
bars = plt.bar(peft_methods, time_efficiency, color=sns.color_palette("viridis", len(peft_methods)))
plt.title("Time Efficiency (higher is better)")
plt.xlabel("Method")
plt.ylabel("Efficiency (%)")
plt.xticks(rotation=45, ha='right')
plt.ylim(0, 105)
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

## 5. Performance Across Different Tasks

One of the key contributions of the paper is the analysis of PEFT methods across different application domains. Let's recreate and analyze some of the findings from the paper.

In [None]:
# Data from the paper on LoRA performance across domains
domain_data = {
    "Application": [
        "Commonsense Reasoning",
        "Arithmetic Reasoning",
        "Video Text Generation",
        "Medical Imaging",
        "Protein Models",
        "Code Review",
        "Speech Synthesis"
    ],
    "Backbone": [
        "LLaMA-7B",
        "LLaMA-7B",
        "CLIP, LLaMA-7B",
        "ResNet-50, ViT",
        "ESM2",
        "LLaMA-6.7B",
        "WavLM, Whisper"
    ],
    "PEFT_Method": [
        "LoRA",
        "LoRA",
        "LoRA + AGAdapter",
        "LoRA + BitFit",
        "LoRA + BitFit",
        "Zero-init + LoRA",
        "LoRA"
    ],
    "Param_Percentage": [0.83, 0.83, 0.81, 0.81, 0.81, 0.8, 0.8],
    "Performance_Retention": [95, 90, 97, 93, 99, 94, 98],
    "Primary_Benefit": [
        "High accuracy with few parameters",
        "Computational efficiency",
        "Multi-modal adaptation",
        "Reduced data requirements",
        "Improved prediction accuracy",
        "Fast fine-tuning",
        "Enhanced fairness scores"
    ]
}

domain_df = pd.DataFrame(domain_data)

In [None]:
# Display the data table
print("LoRA Performance Across Application Domains (Based on Paper Data):")
print(tabulate(domain_df, headers="keys", tablefmt="grid", showindex=False))

In [None]:
# Visualize performance retention across domains
plt.figure(figsize=(12, 6))
bars = plt.bar(
    domain_df["Application"],
    domain_df["Performance_Retention"],
    color=sns.color_palette("viridis", len(domain_df))
)
plt.axhline(y=90, color='r', linestyle='--', label='90% Retention Threshold')
plt.title("LoRA Performance Retention Across Application Domains")
plt.xlabel("Application Domain")
plt.ylabel("Performance Retention (%)")
plt.xticks(rotation=45, ha='right')
plt.ylim(80, 105)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Add data labels on the bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.5,
             f'{height:.0f}%', ha='center', va='bottom')

plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Visualize parameter percentage vs. performance retention
plt.figure(figsize=(12, 6))
plt.scatter(
    domain_df["Param_Percentage"],
    domain_df["Performance_Retention"],
    s=100,
    c=np.arange(len(domain_df)),
    cmap="viridis",
    alpha=0.7
)

# Add labels for each point
for i, row in domain_df.iterrows():
    plt.annotate(
        row["Application"], 
        (row["Param_Percentage"], row["Performance_Retention"]),
        textcoords="offset points", 
        xytext=(0, 10), 
        ha='center'
    )

plt.title("Parameter Efficiency vs. Performance Retention Across Domains")
plt.xlabel("Parameter Percentage (%)")
plt.ylabel("Performance Retention (%)")
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

## 6. Comparing Different PEFT Methods on Commonsense Reasoning

The paper provides a detailed comparison of different PEFT methods on commonsense reasoning tasks. Let's recreate and analyze this comparison.

In [None]:
# Data from Table 1 in the paper (LLaMA-7B model)
commonsense_data_7b = {
    "Method": ["ChatGPT", "PrefT", "AdapterS", "AdapterP", "LoRA", "DoRA (half)", "DoRA", "LoReFT"],
    "Params_Percentage": [None, 0.110, 0.990, 3.540, 0.830, 0.430, 0.840, 0.031],
    "Average_Accuracy": [77.0, 64.6, 70.8, 72.3, 74.7, 77.5, 78.1, 80.2]
}

# Data from Table 1 in the paper (LLaMA-13B model)
commonsense_data_13b = {
    "Method": ["ChatGPT", "PrefT", "AdapterS", "AdapterP", "LoRA", "DoRA (half)", "DoRA", "LoReFT"],
    "Params_Percentage": [None, 0.030, 0.800, 2.890, 0.670, 0.350, 0.680, 0.025],
    "Average_Accuracy": [77.0, 68.4, 79.5, 81.5, 80.5, 80.8, 81.5, 83.3]
}

commonsense_df_7b = pd.DataFrame(commonsense_data_7b)
commonsense_df_13b = pd.DataFrame(commonsense_data_13b)

# Add model column
commonsense_df_7b["Model"] = "LLaMA-7B"
commonsense_df_13b["Model"] = "LLaMA-13B"

# Combine the data
commonsense_df = pd.concat([commonsense_df_7b, commonsense_df_13b], ignore_index=True)

In [None]:
# Display the data table
print("PEFT Methods Comparison on Commonsense Reasoning:")
display_df = commonsense_df.copy()
# pandas stores the missing ChatGPT value as NaN, so test with pd.notna rather than "is not None"
display_df["Params_Percentage"] = display_df["Params_Percentage"].apply(lambda x: f"{x:.3f}%" if pd.notna(x) else "N/A")
display_df["Average_Accuracy"] = display_df["Average_Accuracy"].apply(lambda x: f"{x:.1f}%")
print(tabulate(display_df, headers="keys", tablefmt="grid", showindex=False))

In [None]:
# Visualize accuracy comparison (plt.subplots below creates the figure)

# Filter out ChatGPT for parameter percentage comparison
filtered_df = commonsense_df[commonsense_df["Method"] != "ChatGPT"].copy()

# Create two bar charts side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 8))

# Plot LLaMA-7B
df_7b = filtered_df[filtered_df["Model"] == "LLaMA-7B"].sort_values(by="Average_Accuracy")
bars1 = ax1.bar(
    df_7b["Method"],
    df_7b["Average_Accuracy"],
    color=sns.color_palette("viridis", len(df_7b))
)
ax1.axhline(y=77.0, color='r', linestyle='--', label='ChatGPT (77.0%)')
ax1.set_title("LLaMA-7B: Accuracy on Commonsense Reasoning")
ax1.set_xlabel("PEFT Method")
ax1.set_ylabel("Average Accuracy (%)")
ax1.set_ylim(60, 85)
ax1.grid(axis='y', linestyle='--', alpha=0.7)
ax1.legend()

# Add parameter percentage annotations
for i, bar in enumerate(bars1):
    ax1.text(
        bar.get_x() + bar.get_width()/2.,
        bar.get_height() + 0.5,
        f'{df_7b["Params_Percentage"].iloc[i]:.3f}%',
        ha='center',
        va='bottom',
        rotation=0,
        fontsize=9
    )

# Plot LLaMA-13B
df_13b = filtered_df[filtered_df["Model"] == "LLaMA-13B"].sort_values(by="Average_Accuracy")
bars2 = ax2.bar(
    df_13b["Method"],
    df_13b["Average_Accuracy"],
    color=sns.color_palette("viridis", len(df_13b))
)
ax2.axhline(y=77.0, color='r', linestyle='--', label='ChatGPT (77.0%)')
ax2.set_title("LLaMA-13B: Accuracy on Commonsense Reasoning")
ax2.set_xlabel("PEFT Method")
ax2.set_ylabel("Average Accuracy (%)")
ax2.set_ylim(60, 85)
ax2.grid(axis='y', linestyle='--', alpha=0.7)
ax2.legend()

# Add parameter percentage annotations
for i, bar in enumerate(bars2):
    ax2.text(
        bar.get_x() + bar.get_width()/2.,
        bar.get_height() + 0.5,
        f'{df_13b["Params_Percentage"].iloc[i]:.3f}%',
        ha='center',
        va='bottom',
        rotation=0,
        fontsize=9
    )

plt.tight_layout()
plt.show()

In [None]:
# Create a scatter plot of parameter percentage vs. accuracy
plt.figure(figsize=(14, 7))

# Filter out ChatGPT
filtered_df = commonsense_df[commonsense_df["Method"] != "ChatGPT"].copy()

# Plot LLaMA-7B
plt.scatter(
    filtered_df[filtered_df["Model"] == "LLaMA-7B"]["Params_Percentage"],
    filtered_df[filtered_df["Model"] == "LLaMA-7B"]["Average_Accuracy"],
    s=150,
    marker='o',
    label='LLaMA-7B',
    alpha=0.7
)

# Plot LLaMA-13B
plt.scatter(
    filtered_df[filtered_df["Model"] == "LLaMA-13B"]["Params_Percentage"],
    filtered_df[filtered_df["Model"] == "LLaMA-13B"]["Average_Accuracy"],
    s=150,
    marker='s',
    label='LLaMA-13B',
    alpha=0.7
)

# Add horizontal line for ChatGPT
plt.axhline(y=77.0, color='r', linestyle='--', label='ChatGPT (77.0%)')

# Add method labels
for model in ["LLaMA-7B", "LLaMA-13B"]:
    model_df = filtered_df[filtered_df["Model"] == model]
    for i, row in model_df.iterrows():
        plt.annotate(
            row["Method"],
            (row["Params_Percentage"], row["Average_Accuracy"]),
            textcoords="offset points",
            xytext=(0, 10 if model == "LLaMA-7B" else -15),
            ha='center',
            fontsize=9
        )

plt.title("Parameter Efficiency vs. Accuracy on Commonsense Reasoning")
plt.xlabel("Parameter Percentage (%)")
plt.ylabel("Average Accuracy (%)")
plt.xscale("log")
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend()
plt.tight_layout()
plt.show()

## 7. Analysis of Results and Implications

Based on our comparative analysis and the data from the paper, we can draw several key insights about PEFT methods and their evaluation frameworks:

### 7.1 Parameter Efficiency vs. Performance Trade-off

1. **Efficiency-Performance Spectrum**: Different PEFT methods occupy different positions on the efficiency-performance spectrum. Some methods (like LoReFT) achieve remarkable parameter efficiency with minimal performance degradation, while others prioritize performance at the cost of more parameters.

2. **Non-monotonic Relationship**: More trainable parameters do not guarantee better performance. In Table 1, LoReFT trains only 0.025-0.031% of parameters yet posts the highest average accuracy, while AdapterP trains the most (2.89-3.54%) and still trails DoRA on LLaMA-7B.

3. **Diminishing Returns**: Beyond some budget, adding trainable parameters appears to yield little additional accuracy. The LoRA rank sweep (r=8, 16, 32) in Section 3 is a direct way to test this on a given task.
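
One way to make the efficiency-performance spectrum precise is to compute which methods are Pareto-optimal, meaning no other method is both cheaper in parameters and more accurate. The sketch below applies this to the LLaMA-7B rows of Table 1 as transcribed in Section 6; the helper `pareto_front` is a hypothetical name introduced here, not something from the paper.

```python
def pareto_front(points):
    """Return the names of points that no other point dominates.

    Each point is (name, cost, score); lower cost and higher score are better.
    """
    front = []
    for name, cost, score in points:
        dominated = any(
            other_cost <= cost and other_score >= score
            and (other_cost < cost or other_score > score)
            for _, other_cost, other_score in points
        )
        if not dominated:
            front.append(name)
    return front

# LLaMA-7B rows from Table 1: (method, trainable parameters in %, average accuracy in %)
llama_7b = [
    ("PrefT", 0.110, 64.6),
    ("AdapterS", 0.990, 70.8),
    ("AdapterP", 3.540, 72.3),
    ("LoRA", 0.830, 74.7),
    ("DoRA (half)", 0.430, 77.5),
    ("DoRA", 0.840, 78.1),
    ("LoReFT", 0.031, 80.2),
]

print(pareto_front(llama_7b))
# → ['LoReFT']

without_loreft = [p for p in llama_7b if p[0] != "LoReFT"]
print(pareto_front(without_loreft))
# → ['PrefT', 'DoRA (half)', 'DoRA']
```

LoReFT dominates every other row because it has both the smallest parameter share and the highest accuracy. With LoReFT set aside, the frontier spans a genuine trade-off from PrefT (cheapest) to DoRA (most accurate).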

### 7.2 Application-Specific Considerations

1. **Domain Dependency**: The performance of PEFT methods can vary significantly across different application domains. While LoRA shows consistent performance (90-99% retention) across domains, the optimal method may differ based on the specific application.

2. **Model Size Effects**: PEFT methods often benefit from larger backbones. In Table 1, every method scores higher with LLaMA-13B than with LLaMA-7B, and all methods except PrefT exceed the ChatGPT reference of 77.0% at 13B, compared with only three at 7B. Table 1 includes no full fine-tuning baseline, so it cannot show how the gap to full fine-tuning changes with model size.

3. **Task Complexity**: More complex tasks (like arithmetic reasoning) may benefit from different PEFT approaches compared to simpler tasks.

### 7.3 Evaluation Framework Considerations

1. **Multi-dimensional Evaluation**: A comprehensive evaluation of PEFT methods should consider multiple dimensions, including parameter efficiency, performance, training time, memory usage, and generalization.

2. **Standardized Benchmarks**: To ensure fair comparisons, it's important to use standardized benchmarks and consistent evaluation protocols across different methods.

3. **Hyperparameter Sensitivity**: The performance of PEFT methods can be sensitive to hyperparameters. A fair comparison should account for this by optimizing hyperparameters for each method or using consistent hyperparameter settings.

4. **Computational Efficiency Metrics**: Beyond parameter counts, metrics like training time, memory usage, and inference speed provide important insights into the practical efficiency of PEFT methods.
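
When several dimensions must be weighed together, one simple option is to min-max normalize each metric and combine them with explicit weights. The sketch below is only one possible scheme: the helper names, the weights, and the example numbers are all assumptions chosen for illustration, and the ranking changes with the weights.

```python
def min_max(values, higher_is_better=True):
    """Scale values to [0, 1] so that 1 is always the best value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]  # all runs tie on this metric
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]

def composite_scores(runs, weights):
    """Weighted sum of normalized metrics.

    runs: list of dicts, each with a "method" key plus metric values.
    weights: {metric_name: (weight, higher_is_better)}.
    """
    totals = [0.0] * len(runs)
    for metric, (weight, higher_is_better) in weights.items():
        column = min_max([run[metric] for run in runs], higher_is_better)
        totals = [total + weight * value for total, value in zip(totals, column)]
    return {run["method"]: total for run, total in zip(runs, totals)}

# Hypothetical results (not measured): accuracy, trainable parameter share, training time
runs = [
    {"method": "A", "accuracy": 0.92, "param_percentage": 1.60, "training_time": 900.0},
    {"method": "B", "accuracy": 0.91, "param_percentage": 0.40, "training_time": 700.0},
    {"method": "C", "accuracy": 0.88, "param_percentage": 0.10, "training_time": 650.0},
]
weights = {
    "accuracy": (0.6, True),
    "param_percentage": (0.2, False),
    "training_time": (0.2, False),
}

for method, score in composite_scores(runs, weights).items():
    print(f"{method}: {score:.3f}")
```

Reporting the weights alongside the scores keeps the comparison transparent, since a different weighting can reorder the methods.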

## 8. Conclusion and Best Practices

Based on our analysis and the findings from the paper, we can distill some best practices for evaluating and comparing PEFT methods:

### 8.1 Best Practices for PEFT Evaluation

1. **Comprehensive Metrics**: Include a diverse set of metrics covering parameter efficiency, performance, computational efficiency, and generalization.

2. **Baseline Comparison**: Always compare PEFT methods against full fine-tuning and zero-shot performance as baselines.

3. **Diverse Tasks**: Evaluate methods on a diverse set of tasks to assess their versatility across different application domains.

4. **Hyperparameter Sensitivity Analysis**: Analyze the sensitivity of PEFT methods to their hyperparameters to ensure robust conclusions.

5. **Resource Constraints**: Consider the target deployment environment and its resource constraints when evaluating PEFT methods.

6. **Statistical Significance**: Run multiple trials with different random seeds to assess the statistical significance of performance differences.

7. **Visualization Tools**: Use effective visualizations to communicate the trade-offs between different dimensions of PEFT methods.
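
The multi-seed practice above can be sketched with the standard library alone. The example assumes each method was trained once per seed with the same seeds in the same order. The accuracies are invented for illustration, and the helper names are hypothetical.

```python
import statistics

def summarize_seeds(scores):
    """Return (mean, sample standard deviation) of one method's per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_difference(scores_a, scores_b):
    """Mean per-seed difference (a - b) and its standard error.

    Assumes both lists were produced with the same seeds, in the same order.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    std_err = statistics.stdev(diffs) / len(diffs) ** 0.5
    return statistics.mean(diffs), std_err

# Hypothetical validation accuracies for five seeds (not measured results)
lora_scores = [0.921, 0.918, 0.924, 0.919, 0.923]
full_scores = [0.926, 0.925, 0.929, 0.922, 0.928]

for name, scores in [("LoRA", lora_scores), ("Full fine-tuning", full_scores)]:
    mean, std = summarize_seeds(scores)
    print(f"{name}: {mean:.4f} +/- {std:.4f}")

mean_diff, std_err = paired_difference(full_scores, lora_scores)
print(f"Mean gap (full - LoRA): {mean_diff:.4f} (standard error {std_err:.4f})")
```

As a rough rule of thumb, a gap that is small relative to its standard error may be seed noise rather than a real difference between methods; a formal test would need more seeds than shown here.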

### 8.2 Future Directions

The paper suggests several promising directions for future research on PEFT methods and their evaluation:

1. **Task-Agnostic PEFT**: Developing PEFT methods that are universally applicable across different downstream tasks without the need for task-specific adaptation.

2. **Privacy-Preserving PEFT**: Exploring techniques like federated learning or homomorphic encryption for privacy-preserving PEFT in sensitive domains like healthcare.

3. **Limited Data Scenarios**: Enhancing the robustness of PEFT methods in scenarios with limited labeled data through techniques like active learning or curriculum learning.

4. **Interpretability**: Improving the interpretability of fine-tuned models to understand how PEFT methods affect the model's decision-making process.

5. **Automated PEFT Selection**: Developing methods to automatically select the most appropriate PEFT technique for a given task and model architecture.

By following these best practices and exploring these future directions, researchers and practitioners can more effectively leverage PEFT methods to make deep learning more accessible, efficient, and adaptable across diverse applications.

## References

1. Balne, C. C. S., Bhaduri, S., Roy, T., Jain, V., & Chadha, A. (2024). Parameter Efficient Fine Tuning: A Comprehensive Analysis Across Applications. arXiv:2404.13506v2.

2. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., ... & Chen, W. (2021). LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

3. Wu, Z., Arora, A., Wang, Z., Geiger, A., Jurafsky, D., Manning, C. D., & Potts, C. (2024). ReFT: Representation finetuning for language models. arXiv preprint arXiv:2404.03592.

4. Li, X. L., & Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.

5. Zaken, E. B., Ravfogel, S., & Goldberg, Y. (2021). BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199.