# LLaMA-Reviewer: Advancing Code Review Automation with Large Language Models through Parameter-Efficient Fine-Tuning

## Paper Information
- **Title**: LLaMA-Reviewer: Advancing Code Review Automation with Large Language Models through Parameter-Efficient Fine-Tuning
- **Authors**: Junyi Lu, Lei Yu, Xiaojia Li, Li Yang, Chun Zuo
- **Paper Link**: [arXiv:2308.11148v2](https://arxiv.org/abs/2308.11148v2)
- **Replication package**: [doi.org/10.5281/zenodo.7991113](https://doi.org/10.5281/zenodo.7991113)

## Abstract Summary
This paper presents LLaMA-Reviewer, a framework that fine-tunes LLaMA (Large Language Model Meta AI) for automated code review using Parameter-Efficient Fine-Tuning (PEFT) methods. The framework performs competitively with state-of-the-art code review models while training fewer than 1% of the model's parameters. It addresses three core tasks: review necessity prediction, review comment generation, and code refinement.

## 1. Environment Setup and Dependencies

Install required packages for implementing LLaMA-Reviewer with LangChain and related tools.

In [None]:
# Install core dependencies
!pip install -q transformers==4.36.0
!pip install -q peft==0.7.1  # Parameter-Efficient Fine-Tuning library
!pip install -q bitsandbytes==0.41.3  # For 8-bit quantization
!pip install -q accelerate==0.25.0
!pip install -q datasets==2.16.0
!pip install -q langchain==0.1.0
!pip install -q langchain-community  # unpinned so pip picks the release compatible with langchain==0.1.0
!pip install -q torch==2.1.2
!pip install -q einops==0.7.0
!pip install -q deepeval==0.20.90  # For evaluation metrics
!pip install -q rouge-score==0.1.2
!pip install -q nltk==3.8.1
!pip install -q scikit-learn matplotlib seaborn  # used by the evaluation and plotting cells

In [None]:
import torch
import transformers
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer
)
from peft import (
    LoraConfig,
    PrefixTuningConfig,
    get_peft_model,
    TaskType,
    prepare_model_for_kbit_training
)
from datasets import Dataset, load_dataset
import pandas as pd
import numpy as np
from typing import Dict, List, Optional, Tuple
import json
from dataclasses import dataclass
from langchain.llms import HuggingFacePipeline
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
import warnings
warnings.filterwarnings('ignore')

print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## 2. Data Preparation and Prompt Templates

Implementation of the prompt templates as described in the paper (Figure 3).

In [None]:
@dataclass
class CodeReviewTask:
    """Dataclass for code review tasks"""
    name: str
    instruction: str
    input_format: str
    output_format: str

# Define the three core tasks from the paper
REVIEW_TASKS = {
    "review_necessity_prediction": CodeReviewTask(
        name="Review Necessity Prediction",
        instruction="Determine whether the provided diff hunk requires a code review. Respond with either 'yes' or 'no'.",
        input_format="The diff hunk is: '{diff_hunk}'",
        output_format="yes/no"
    ),
    "review_comment_generation": CodeReviewTask(
        name="Review Comment Generation",
        instruction="Review the given code and provide a constructive code review comment.",
        input_format="The code is: '{code}'",
        output_format="{comment}"
    ),
    "code_refinement": CodeReviewTask(
        name="Code Refinement",
        instruction="Refine the given code based on the provided code review comment.",
        input_format="The comment is: '{comment}'\nThe code is: '{source_code}'",
        output_format="{target_code}"
    )
}

def create_prompt_template(task: CodeReviewTask) -> str:
    """Create the prompt template as per Figure 3 in the paper"""
    template = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""
    return template

# Example usage
print("Prompt template structure:")
print(create_prompt_template(REVIEW_TASKS["review_necessity_prediction"]))
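
To make the template concrete, the sketch below fills it for the code refinement task, once with the target included (as for training) and once with the response slot left empty (as for inference). It is self-contained and repeats the template text; the helper name `build_prompt` and the sample comment and code are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch: fill the instruction/input/response template for one example.
# build_prompt is a hypothetical helper, not part of the paper's code.
TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def build_prompt(instruction: str, input_text: str, output: str = "") -> str:
    """Return a training prompt (output given) or an inference prompt (output empty)."""
    return TEMPLATE.format(instruction=instruction, input=input_text, output=output)


instruction = "Refine the given code based on the provided code review comment."
input_text = (
    "The comment is: 'Add type hints'\n"
    "The code is: 'def add(a, b):\n    return a + b'"
)
target = "def add(a: int, b: int) -> int:\n    return a + b"

train_prompt = build_prompt(instruction, input_text, target)  # prompt + target
infer_prompt = build_prompt(instruction, input_text)          # ends at the response marker

print(train_prompt)
```

The inference prompt is a strict prefix of the training prompt, which is what lets the fine-tuned model continue from `### Response:` at generation time.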

## 3. Mock Dataset Creation

Create small synthetic datasets that mimic the structure (not the content or scale) of the CRer and Tufano datasets used in the paper.

In [None]:
def create_mock_code_review_dataset(task_type: str, num_samples: int = 100) -> Dataset:
    """Create mock dataset for code review tasks"""
    
    if task_type == "review_necessity_prediction":
        # Mock diff hunks
        diff_hunks = [
            "+    if (user != null && user.isActive()) {",
            "-    // TODO: implement this function",
            "+    logger.debug('Processing user: ' + user.id);",
            "-    return x + y;",
            "+    return x + y; // Fixed addition"
        ]
        
        data = []
        for i in range(num_samples):
            diff = diff_hunks[i % len(diff_hunks)]
            needs_review = "yes" if i % 2 == 0 else "no"
            data.append({
                "diff_hunk": diff,
                "label": needs_review
            })
    
    elif task_type == "review_comment_generation":
        # Mock code snippets and comments
        code_snippets = [
            "public void processUser(User user) {\n    user.save();\n}",
            "def calculate_sum(a, b):\n    return a + b",
            "const fetchData = async () => {\n    const res = await fetch(url);\n    return res;\n}"
        ]
        
        comments = [
            "Consider adding null check before saving user",
            "Add type hints for better code clarity",
            "Missing error handling for fetch operation"
        ]
        
        data = []
        for i in range(num_samples):
            data.append({
                "code": code_snippets[i % len(code_snippets)],
                "comment": comments[i % len(comments)]
            })
    
    elif task_type == "code_refinement":
        # Mock code refinement examples
        refinements = [
            {
                "source_code": "public void save(User u) {\n    u.save();\n}",
                "comment": "Add null check before saving",
                "target_code": "public void save(User u) {\n    if (u != null) {\n        u.save();\n    }\n}"
            },
            {
                "source_code": "def add(a, b):\n    return a + b",
                "comment": "Add type hints",
                "target_code": "def add(a: int, b: int) -> int:\n    return a + b"
            }
        ]
        
        data = []
        for i in range(num_samples):
            ref = refinements[i % len(refinements)]
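            data.append(ref)
    
    else:
        # Fail early instead of hitting an undefined `data` below
        raise ValueError(f"Unknown task type: {task_type}")
    
    return Dataset.from_pandas(pd.DataFrame(data))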

# Create sample datasets
rnp_dataset = create_mock_code_review_dataset("review_necessity_prediction", 50)
rcg_dataset = create_mock_code_review_dataset("review_comment_generation", 50)
cr_dataset = create_mock_code_review_dataset("code_refinement", 50)

print(f"Review Necessity Prediction dataset size: {len(rnp_dataset)}")
print(f"Review Comment Generation dataset size: {len(rcg_dataset)}")
print(f"Code Refinement dataset size: {len(cr_dataset)}")
print("\nSample from RNP dataset:")
print(rnp_dataset[0])

## 4. PEFT Configuration: LoRA and Prefix-Tuning

Implementation of the two PEFT methods described in the paper.

In [None]:
class PEFTConfigFactory:
    """Factory for creating PEFT configurations as per paper specifications"""
    
    @staticmethod
    def create_lora_config(r: int = 16, lora_alpha: int = 16) -> LoraConfig:
        """
        Create LoRA configuration based on paper parameters:
        - r (rank): 8 or 16 as tested in the paper
        - lora_alpha: scaling factor set to 16
        """
        return LoraConfig(
            r=r,
            lora_alpha=lora_alpha,
            target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
            lora_dropout=0.1,
            bias="none",
            task_type=TaskType.CAUSAL_LM,
        )
    
    @staticmethod
    def create_prefix_config(num_virtual_tokens: int = 10, prefix_projection: bool = True) -> PrefixTuningConfig:
        """
        Create Prefix-tuning configuration:
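        - num_virtual_tokens: prefix length, 10 as per paper
        - prefix_projection: if True, PEFT reparameterizes the prefix through an MLP.
          Note: this is standard prefix-tuning, not the zero-init attention variant
          used in the paper (PEFT's AdaptionPromptConfig is closer to that method).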
        """
        return PrefixTuningConfig(
            task_type=TaskType.CAUSAL_LM,
            num_virtual_tokens=num_virtual_tokens,
            encoder_hidden_size=None,  # Will be set automatically
            prefix_projection=prefix_projection,
        )

# Create configurations
lora_config_r8 = PEFTConfigFactory.create_lora_config(r=8)
lora_config_r16 = PEFTConfigFactory.create_lora_config(r=16)
prefix_config = PEFTConfigFactory.create_prefix_config()

print("LoRA Configuration (r=16):")
print(lora_config_r16)
print("\nPrefix-tuning Configuration:")
print(prefix_config)
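
LoRA keeps each pretrained weight matrix W frozen and learns a low-rank update, so the effective weight is W + (alpha / r) · B A, where B has shape d_out × r and A has shape r × d_in. Each adapted matrix therefore adds r · (d_in + d_out) trainable parameters. The sketch below turns that into a count, assuming LLaMA-7B dimensions (32 layers, hidden size 4096) and square attention projections; `lora_param_count` is a hypothetical helper.

```python
# Minimal sketch: count LoRA trainable parameters for square attention projections.
# Assumes LLaMA-7B dimensions (32 layers, hidden size 4096).

def lora_param_count(hidden_size: int, num_layers: int, rank: int, modules_per_layer: int) -> int:
    """Each adapted matrix adds A (rank x hidden) and B (hidden x rank)."""
    per_matrix = rank * (hidden_size + hidden_size)
    return per_matrix * modules_per_layer * num_layers


HIDDEN, LAYERS = 4096, 32
for rank in (8, 16):
    for n_modules in (2, 4):
        count = lora_param_count(HIDDEN, LAYERS, rank, n_modules)
        print(f"r={rank}, modules/layer={n_modules}: {count / 1e6:.1f}M trainable parameters")
```

Under these assumptions, adapting two projections per layer gives about 4.2M parameters at r=8 and 8.4M at r=16, which matches the figures quoted in the results section of this notebook. Adapting all four projections, as `create_lora_config` above does, would double those counts on LLaMA-7B.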

## 5. Model Loading and PEFT Application

Due to computational constraints, we'll use a smaller model to demonstrate the concept.

In [None]:
def load_model_with_peft(model_name: str = "microsoft/phi-2", peft_config=None, load_in_8bit: bool = True):
    """
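    Load model with PEFT configuration.
    Note: In the paper, they use LLaMA-7B. Here we use a smaller model for demonstration.
    The LoRA target_modules defined above use LLaMA-style layer names; other
    architectures may name their attention projections differently.
    """
    
    # 8-bit quantization config for memory efficiency.
    # Double quantization and the "nf4" quant type are 4-bit-only options
    # (set through the bnb_4bit_* arguments), so they are not used here.
    bnb_config = BitsAndBytesConfig(load_in_8bit=load_in_8bit)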
    
    # Load base model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config if load_in_8bit else None,
        device_map="auto",
        trust_remote_code=True
    )
    
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    
    # Apply PEFT if config provided
    if peft_config:
        # The k-bit preparation step is only needed for a quantized base model
        if load_in_8bit:
            model = prepare_model_for_kbit_training(model)
        model = get_peft_model(model, peft_config)
        
        # Print trainable parameters
        trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
        all_params = sum(p.numel() for p in model.parameters())
        print(f"Trainable params: {trainable_params:,} || All params: {all_params:,} || Trainable%: {100 * trainable_params / all_params:.2f}")
    
    return model, tokenizer

# Example: Load model with LoRA
print("Loading model with LoRA (r=16)...")
# Uncomment below to actually load the model (requires GPU)
# model_lora, tokenizer = load_model_with_peft(peft_config=lora_config_r16)
print("Model loading demonstration complete.")
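
The computational constraint is mostly weight memory. The sketch below gives a rough estimate for a 6.7B-parameter model (the LLaMA-7B size used elsewhere in this notebook) at several precisions. It counts weights only and ignores activations, gradients, optimizer state, and quantization overhead, so real usage will be higher; `weight_memory_gib` is a hypothetical helper.

```python
# Rough sketch: memory needed just to hold the weights at a given precision.

def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Bytes for the weights alone, converted to GiB."""
    return num_params * bits_per_param / 8 / 2**30


N_PARAMS = 6.7e9  # assumed LLaMA-7B parameter count
for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>5}: {weight_memory_gib(N_PARAMS, bits):5.1f} GiB")
```

Halving the precision halves the weight footprint, which is why 8-bit loading is combined with PEFT here: the frozen base model is stored compactly and only the small adapter is trained.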

## 6. Training Pipeline

Implementation of the training pipeline for the three code review tasks.

In [None]:
class CodeReviewTrainer:
    """Trainer for code review tasks using PEFT"""
    
    def __init__(self, model, tokenizer, task_type: str):
        self.model = model
        self.tokenizer = tokenizer
        self.task = REVIEW_TASKS[task_type]
    
    def prepare_dataset(self, dataset: Dataset) -> Dataset:
        """Prepare dataset with proper formatting"""
        
        def format_example(example):
            # Fill the task's input format; extra keys in the example are ignored
            input_text = self.task.input_format.format(**example)
            
            # Pick the target field. 'target_code' is checked first because
            # code refinement examples also carry a 'comment' field as input.
            output_text = ''
            for key in ('target_code', 'label', 'comment'):
                if key in example:
                    output_text = example[key]
                    break
            
            prompt = create_prompt_template(self.task).format(
                instruction=self.task.instruction,
                input=input_text,
                output=output_text
            )
            
            return {'text': prompt}
        
        return dataset.map(format_example)
    
    def get_training_args(self, output_dir: str, num_epochs: int = 5) -> TrainingArguments:
        """Get training arguments based on paper specifications"""
        return TrainingArguments(
            output_dir=output_dir,
            num_train_epochs=num_epochs,
            per_device_train_batch_size=4,
            per_device_eval_batch_size=4,
            gradient_accumulation_steps=16,  # Effective batch size of 64
            warmup_steps=100,
            logging_steps=10,
            save_strategy="epoch",
            evaluation_strategy="epoch",
            learning_rate=3e-4,  # LoRA learning rate from paper
            fp16=torch.cuda.is_available(),  # fp16 mixed precision requires a CUDA device
            report_to=[],
        )

# Example trainer setup
print("Code Review Trainer configured successfully.")
print("Training arguments based on paper specifications:")
trainer = CodeReviewTrainer(None, None, "review_necessity_prediction")
print(trainer.get_training_args("./output"))
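
`prepare_dataset` above only produces a `text` column; before training, that text still has to be tokenized and paired with labels. One common choice for instruction tuning is to compute the loss only on the response tokens by setting the prompt positions in `labels` to -100, the index that PyTorch's cross-entropy loss ignores by default. Whether the paper masks the prompt this way is an assumption here. The sketch works on plain token-id lists; `build_inputs_and_labels` and the ids are hypothetical.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_inputs_and_labels(
    prompt_ids: List[int],
    response_ids: List[int],
    eos_id: int,
    max_length: int,
) -> Dict[str, List[int]]:
    """Concatenate prompt and response, masking the prompt out of the loss."""
    # Truncation may cut off the EOS token or part of the response
    input_ids = (prompt_ids + response_ids + [eos_id])[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids + [eos_id])[:max_length]
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }


# Made-up token ids: three prompt tokens, two response tokens, EOS id 2
example = build_inputs_and_labels(
    prompt_ids=[11, 12, 13], response_ids=[21, 22], eos_id=2, max_length=8
)
print(example)
```

In practice the two id lists would come from tokenizing the prompt (up to and including `### Response:`) and the target separately. Hugging Face causal LM heads shift the labels internally, so `input_ids` and `labels` stay aligned position by position.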

## 7. Evaluation Metrics Implementation

Implementation of the evaluation metrics: precision, recall, and F1 via scikit-learn for review necessity prediction, and BLEU-4 via NLTK for the generation tasks.

In [None]:
from typing import List, Dict, Tuple
from sklearn.metrics import precision_recall_fscore_support
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import nltk
nltk.download('punkt', quiet=True)

class CodeReviewEvaluator:
    """Evaluator for code review tasks"""
    
    @staticmethod
    def evaluate_review_necessity(predictions: List[str], labels: List[str]) -> Dict[str, float]:
        """Evaluate review necessity prediction (binary classification)"""
        # Convert to binary
        pred_binary = [1 if p.lower() == 'yes' else 0 for p in predictions]
        label_binary = [1 if l.lower() == 'yes' else 0 for l in labels]
        
        precision, recall, f1, _ = precision_recall_fscore_support(
            label_binary, pred_binary, average='binary'
        )
        
        return {
            'precision': precision,
            'recall': recall,
            'f1': f1
        }
    
    @staticmethod
    def calculate_bleu4(predictions: List[str], references: List[str]) -> float:
        """Calculate BLEU-4 score for generation tasks"""
        smoothing = SmoothingFunction().method1
        bleu_scores = []
        
        for pred, ref in zip(predictions, references):
            # Tokenize
            pred_tokens = pred.split()
            ref_tokens = ref.split()
            
            # Calculate BLEU-4
            score = sentence_bleu(
                [ref_tokens], 
                pred_tokens,
                weights=(0.25, 0.25, 0.25, 0.25),
                smoothing_function=smoothing
            )
            bleu_scores.append(score)
        
        return np.mean(bleu_scores) * 100  # Return as percentage
    
    @staticmethod
    def evaluate_generation_task(predictions: List[str], references: List[str]) -> Dict[str, float]:
        """Evaluate comment generation and code refinement tasks"""
        bleu4 = CodeReviewEvaluator.calculate_bleu4(predictions, references)
        
        return {
            'bleu4': bleu4
        }

# Test evaluation metrics
evaluator = CodeReviewEvaluator()

# Test review necessity evaluation
test_preds = ['yes', 'no', 'yes', 'yes', 'no']
test_labels = ['yes', 'no', 'no', 'yes', 'no']
rnp_metrics = evaluator.evaluate_review_necessity(test_preds, test_labels)
print("Review Necessity Prediction Metrics:")
print(f"Precision: {rnp_metrics['precision']:.3f}")
print(f"Recall: {rnp_metrics['recall']:.3f}")
print(f"F1: {rnp_metrics['f1']:.3f}")

# Test BLEU-4 calculation
test_pred_text = ["Add null check before saving"]
test_ref_text = ["Consider adding null check before saving user"]
bleu_score = evaluator.calculate_bleu4(test_pred_text, test_ref_text)
print(f"\nBLEU-4 Score: {bleu_score:.2f}")
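
To show what the precision, recall, and F1 figures mean for the positive class `yes`, the sketch below recomputes them by hand from true positives, false positives, and false negatives on the same five test predictions. `binary_prf` is a hypothetical helper written for illustration; it is not a replacement for the scikit-learn call above.

```python
from typing import Dict, List


def binary_prf(predictions: List[str], labels: List[str], positive: str = "yes") -> Dict[str, float]:
    """Precision, recall, and F1 for the positive class, computed from raw counts."""
    pairs = [(p.lower() == positive, l.lower() == positive) for p, l in zip(predictions, labels)]
    tp = sum(1 for p, l in pairs if p and l)      # predicted yes, labeled yes
    fp = sum(1 for p, l in pairs if p and not l)  # predicted yes, labeled no
    fn = sum(1 for p, l in pairs if l and not p)  # predicted no, labeled yes
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall, "f1": f1}


preds = ["yes", "no", "yes", "yes", "no"]
golds = ["yes", "no", "no", "yes", "no"]
result = binary_prf(preds, golds)
print(result)
```

Here there are two true positives, one false positive, and no false negatives, so precision is 2/3, recall is 1, and F1 is 0.8.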

## 8. LangChain Integration for Inference

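LangChain `PromptTemplate` objects hold the prompt for each task. The prediction methods below are mock heuristics that stand in for a fine-tuned model, so the class runs without a GPU.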

In [None]:
class LLaMAReviewerChain:
    """LangChain-based inference for LLaMA-Reviewer"""
    
    def __init__(self, model_path: str = None):
        self.chains = {}
        self._setup_chains()
    
    def _setup_chains(self):
        """Setup LangChain chains for each task"""
        
        # Review Necessity Prediction Chain
        rnp_template = PromptTemplate(
            input_variables=["diff_hunk"],
            template="""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Determine whether the provided diff hunk requires a code review. Respond with either 'yes' or 'no'.

### Input:
The diff hunk is: '{diff_hunk}'

### Response:"""
        )
        
        # Review Comment Generation Chain
        rcg_template = PromptTemplate(
            input_variables=["code"],
            template="""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Review the given code and provide a constructive code review comment.

### Input:
The code is: '{code}'

### Response:"""
        )
        
        # Code Refinement Chain
        cr_template = PromptTemplate(
            input_variables=["comment", "source_code"],
            template="""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Refine the given code based on the provided code review comment.

### Input:
The comment is: '{comment}'
The code is: '{source_code}'

### Response:"""
        )
        
        # Store templates (chains would be created with actual LLM)
        self.chains = {
            'review_necessity': rnp_template,
            'comment_generation': rcg_template,
            'code_refinement': cr_template
        }
    
    def predict_review_necessity(self, diff_hunk: str) -> str:
        """Predict if code review is needed"""
        # In practice, this would use the actual model
        return "yes" if "TODO" in diff_hunk or "debug" in diff_hunk else "no"
    
    def generate_comment(self, code: str) -> str:
        """Generate review comment"""
        # Mock implementation
        if "null" not in code and "User" in code:
            return "Consider adding null check before processing user"
        return "Code looks good"
    
    def refine_code(self, source_code: str, comment: str) -> str:
        """Refine code based on comment"""
        # Mock implementation: insert a guard clause after the first opening brace
        if "null check" in comment:
            return source_code.replace("{", "{\n    if (user == null) return;", 1)
        return source_code

# Test the chain
reviewer = LLaMAReviewerChain()

# Test review necessity
test_diff = "+    // TODO: implement validation"
necessity = reviewer.predict_review_necessity(test_diff)
print(f"Review needed for diff '{test_diff}': {necessity}")

# Test comment generation
test_code = "public void processUser(User user) {\n    user.save();\n}"
comment = reviewer.generate_comment(test_code)
print(f"\nGenerated comment: {comment}")

# Test code refinement
refined = reviewer.refine_code(test_code, comment)
print(f"\nRefined code:\n{refined}")
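
Once the mock methods are replaced by a real model, the raw generation needs post-processing. A Hugging Face text-generation pipeline returns the prompt together with the continuation by default, so the answer has to be cut out after the `### Response:` marker. The sketch below shows one way to do that and to normalize the yes/no answer. Both helper names, the sample output, and the fallback to `no` for unrecognized answers are assumptions.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(generated_text: str) -> str:
    """Return the text after the first response marker, or the whole text if absent."""
    _, found, tail = generated_text.partition(RESPONSE_MARKER)
    return tail.strip() if found else generated_text.strip()


def normalize_necessity(answer: str) -> str:
    """Map a free-form answer to 'yes' or 'no' based on its first word."""
    words = answer.strip().lower().split()
    first = words[0].strip(".,!'\"") if words else ""
    return "yes" if first == "yes" else "no"  # assumption: anything else counts as 'no'


# Hypothetical raw output: prompt followed by the model's continuation
raw_output = (
    "### Instruction:\nDetermine whether the provided diff hunk requires a code review.\n\n"
    "### Input:\nThe diff hunk is: '+    // TODO: implement validation'\n\n"
    "### Response:\n Yes. The TODO should be resolved first."
)

answer = extract_response(raw_output)
print(answer)
print(normalize_necessity(answer))
```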

## 9. Results Analysis and Visualization

Recreating key results from the paper's evaluation.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Results from Tables V and VI in the paper
results_data = {
    'Review Comment Generation': {
        'Models': ['Transformer-b', 'CodeT5', 'CodeReviewer', 'LLaMA-Reviewer (Prefix)', 'LLaMA-Reviewer (LoRA)'],
        'BLEU-4': [4.76, 4.83, 5.32, 5.16, 5.70],
        'Parameters (M)': [220, 220, 220, 1.2, 8.4]
    },
    'Code Refinement': {
        'Models': ['CodeT5', 'CodeReviewer', 'LLaMA-Reviewer (Prefix)', 'LLaMA-Reviewer (LoRA r=8)', 'LLaMA-Reviewer (LoRA r=16)'],
        'BLEU-4': [80.82, 82.61, 76.71, 81.59, 82.27],
        'Parameters (M)': [220, 220, 1.2, 4.2, 8.4]
    }
}

# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

for idx, (task, data) in enumerate(results_data.items()):
    ax = axes[idx]
    
    # Create bar plot
    x = np.arange(len(data['Models']))
    width = 0.35
    
    # BLEU-4 scores (single series, so bars are centered on the ticks)
    bars1 = ax.bar(x, data['BLEU-4'], width, label='BLEU-4', color='skyblue')
    
    # Add value labels
    for bar in bars1:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.2f}', ha='center', va='bottom')
    
    # Formatting
    ax.set_xlabel('Models')
    ax.set_ylabel('BLEU-4 Score')
    ax.set_title(f'{task} Performance')
    ax.set_xticks(x)
    ax.set_xticklabels(data['Models'], rotation=45, ha='right')
    ax.legend()
    ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

# Parameter efficiency comparison
fig, ax = plt.subplots(figsize=(10, 6))

models = ['Full Fine-tuning', 'LLaMA-Reviewer\n(LoRA r=16)', 'LLaMA-Reviewer\n(LoRA r=8)', 'LLaMA-Reviewer\n(Prefix)']
params = [6700, 8.4, 4.2, 1.2]  # in millions
colors = ['red', 'green', 'blue', 'orange']

bars = ax.bar(models, params, color=colors, alpha=0.7)

# Add percentage labels
for i, (bar, param) in enumerate(zip(bars, params)):
    height = bar.get_height()
    percentage = (param / 6700) * 100 if i > 0 else 100
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{param}M\n({percentage:.2f}%)', ha='center', va='bottom')

ax.set_ylabel('Trainable Parameters (Millions)')
ax.set_title('Parameter Efficiency: PEFT vs Full Fine-tuning')
ax.set_ylim(0, 7000)
plt.tight_layout()
plt.show()
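
The plots are easier to read alongside the actual gaps. The sketch below computes the absolute and relative BLEU-4 difference between the LoRA variant and CodeReviewer for both tasks. The numbers are copied from `results_data` above; `compare` is a hypothetical helper.

```python
from typing import Tuple

# BLEU-4 values copied from results_data above
bleu4 = {
    "Review Comment Generation": {"baseline": 5.32, "lora": 5.70},
    "Code Refinement": {"baseline": 82.61, "lora": 82.27},
}


def compare(baseline: float, candidate: float) -> Tuple[float, float]:
    """Return (absolute difference, relative difference in percent)."""
    diff = candidate - baseline
    return diff, 100 * diff / baseline


for task, scores in bleu4.items():
    diff, rel = compare(scores["baseline"], scores["lora"])
    print(f"{task}: {diff:+.2f} BLEU-4 ({rel:+.2f}%) vs CodeReviewer")
```

On these figures the LoRA model is ahead on comment generation and slightly behind on code refinement, which is the sense in which the results are described as competitive.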

## 10. Research Extension Template

Template for extending this research with your own experiments.

In [None]:
class ResearchExtension:
    """Template for extending LLaMA-Reviewer research"""
    
    def __init__(self):
        self.experiment_log = []
    
    def experiment_1_different_llm(self):
        """
        Experiment: Test with different LLMs (e.g., CodeLLaMA, StarCoder)
        """
        print("Experiment 1: Testing different LLMs")
        print("- Try CodeLLaMA for better code understanding")
        print("- Compare with StarCoder for code-specific tasks")
        print("- Evaluate trade-offs between unified and code-specific LLMs")
    
    def experiment_2_advanced_peft(self):
        """
        Experiment: Test newer PEFT methods
        """
        print("\nExperiment 2: Advanced PEFT methods")
        print("- QLoRA: 4-bit quantization + LoRA")
        print("- AdaLoRA: Adaptive rank allocation")
        print("- IA3: Infused Adapter by Inhibiting and Amplifying")
    
    def experiment_3_multilingual_review(self):
        """
        Experiment: Extend to multilingual code review
        """
        print("\nExperiment 3: Multilingual code review")
        print("- Test on non-English comments")
        print("- Mixed language code bases")
        print("- Cross-language code refinement")
    
    def experiment_4_realtime_integration(self):
        """
        Experiment: Integration with development tools
        """
        print("\nExperiment 4: Real-time integration")
        print("- VS Code extension using the model")
        print("- GitHub Actions integration")
        print("- Performance optimization for real-time use")

# Show research extension ideas
research = ResearchExtension()
research.experiment_1_different_llm()
research.experiment_2_advanced_peft()
research.experiment_3_multilingual_review()
research.experiment_4_realtime_integration()
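
As a starting point for the QLoRA idea in Experiment 2, the configuration sketch below pairs a 4-bit NF4 quantization config with a LoRA adapter config. It is a config fragment only, and it assumes the `transformers`, `peft`, and `bitsandbytes` versions installed at the top of this notebook. The LoRA hyperparameters simply mirror `create_lora_config` and are not tuned or taken from the paper.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig, TaskType

# 4-bit quantization settings for the frozen base model (QLoRA-style)
qlora_bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for matmuls at runtime
)

# LoRA adapter trained on top of the quantized model.
# Assumption: LLaMA-style projection names, as in create_lora_config.
qlora_adapter_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
```

The first object would be passed as `quantization_config` to `AutoModelForCausalLM.from_pretrained`, and the second to `get_peft_model` after `prepare_model_for_kbit_training`.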

## Summary and Key Takeaways

### Paper Contributions:
1. **First application of LLMs to code review automation** - Demonstrated that general-purpose LLMs can match specialized models
2. **PEFT paradigm for code review** - Trained fewer than 1% of the model's parameters while keeping performance competitive
3. **Comprehensive evaluation** - Tested on multiple datasets and tasks with thorough ablation studies
4. **Open-source implementation** - Made code and models publicly available

### Key Technical Insights:
- **LoRA outperforms Prefix-tuning** for code review tasks
- **Input representation matters** - Keeping code format close to pre-training data improves performance
- **Instruction tuning helps** but the effect is task-dependent
- **Larger LoRA rank (r=16)** provides better performance than r=8

### Practical Applications:
- Automated PR review systems
- IDE integration for real-time code suggestions
- Training data generation for code quality models
- Multi-language code review support

### Future Research Directions:
1. Testing with larger LLaMA models (13B, 70B)
2. Exploring code-specific LLMs (CodeLLaMA)
3. Real-time optimization for production use
4. Multi-modal approaches (code + documentation)
5. Integration with existing development workflows