# Focused Learning: Instruction Tuning with LLaMA for Domain Adaptation
## Deep Dive into Code-Domain Instruction Following

### Learning Objectives:
- Understand the two-stage fine-tuning approach
- Master instruction tuning for code review domain adaptation
- Compare Code Alpaca vs mixed NL+PL instruction datasets
- Implement custom instruction templates for code tasks

### Paper References:
- **Section III.B**: Instruction Tuning on LLaMA (Page 3)
- **Section V.C**: The Impact of Instruction Tuning (Pages 8-9)
- **Figure 3**: Prompt template and instruction formats
- **Table VIII**: Impact of instruction tuning results

## 1. Understanding Instruction Tuning

**Definition**: Instruction tuning adapts pre-trained LLMs to follow natural language instructions by fine-tuning on datasets with instruction-input-output triplets.

**LLaMA-Reviewer's Approach**:
1. **Stage 1**: Instruction tuning on code-centric domain data (Code Alpaca)
2. **Stage 2**: Task-specific fine-tuning for each code review subtask

The intent is to teach the model to follow task instructions before it sees task-specific data. Section 5 examines how much this actually changes downstream scores for each task.
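
To make the triplet format concrete, here is a minimal sketch that splits one instruction-input-output example into the prompt the model reads and the response it is expected to produce. The wording follows the Alpaca-style template used throughout this notebook; the helper name `split_prompt_and_target` is hypothetical and not taken from the paper.

```python
from typing import Optional, Tuple

# Alpaca-style prompt prefixes (same wording as the template in this notebook).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def split_prompt_and_target(instruction: str, output: str,
                            input_text: Optional[str] = None) -> Tuple[str, str]:
    """Return (prompt, target). Hypothetical helper for illustration only."""
    if input_text:
        prompt = PROMPT_WITH_INPUT.format(instruction=instruction, input=input_text)
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=instruction)
    return prompt, output


prompt, target = split_prompt_and_target(
    "Explain what this code does",
    "Adds two numbers.",
    "def add(a, b): return a + b",
)

# During training the model sees prompt + target; at inference only the prompt.
training_text = prompt + target
print(training_text)
```

Keeping the prompt and the target separate makes it easy to reuse the same formatting at inference time, where the text simply stops after `### Response:`.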

In [None]:
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Optional
from dataclasses import dataclass
import random
from collections import defaultdict

# Set random seed for reproducibility
random.seed(42)
np.random.seed(42)

@dataclass
class InstructionExample:
    """Data structure for instruction-following examples"""
    instruction: str
    input: Optional[str] = None
    output: str = ""
    source: str = "custom"  # 'alpaca', 'code_alpaca', 'custom'
    
    def to_prompt(self, template: str = "alpaca") -> str:
        """Convert to formatted prompt"""
        if template == "alpaca":
            if self.input:
                return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{self.instruction}

### Input:
{self.input}

### Response:
{self.output}"""
            else:
                return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{self.instruction}

### Response:
{self.output}"""
        
        return f"Instruction: {self.instruction}\nOutput: {self.output}"

# Visualize the two-stage approach
def visualize_two_stage_training():
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
    
    # Stage 1: Instruction Tuning
    ax1.set_xlim(0, 10)
    ax1.set_ylim(0, 8)
    
    # Base LLaMA
    base_rect = plt.Rectangle((1, 6), 8, 1, facecolor='lightblue', edgecolor='black', linewidth=2)
    ax1.add_patch(base_rect)
    ax1.text(5, 6.5, 'Pre-trained LLaMA', ha='center', va='center', fontweight='bold')
    
    # Instruction tuning data
    code_alpaca_rect = plt.Rectangle((2, 4), 3, 1, facecolor='lightgreen', edgecolor='black')
    ax1.add_patch(code_alpaca_rect)
    ax1.text(3.5, 4.5, 'Code Alpaca\n(20K examples)', ha='center', va='center', fontsize=10)
    
    alpaca_rect = plt.Rectangle((5.5, 4), 3, 1, facecolor='lightyellow', edgecolor='black')
    ax1.add_patch(alpaca_rect)
    ax1.text(7, 4.5, 'Alpaca\n(52K examples)', ha='center', va='center', fontsize=10)
    
    # Result
    result_rect = plt.Rectangle((3, 2), 4, 1, facecolor='lightcoral', edgecolor='black', linewidth=2)
    ax1.add_patch(result_rect)
    ax1.text(5, 2.5, 'Instruction-tuned\nLLaMA', ha='center', va='center', fontweight='bold')
    
    # Arrows
    ax1.arrow(5, 5.8, 0, -0.5, head_width=0.2, head_length=0.1, fc='black', ec='black')
    ax1.arrow(3.5, 3.8, 0.5, -1.0, head_width=0.15, head_length=0.1, fc='green', ec='green')
    ax1.arrow(7, 3.8, -0.5, -1.0, head_width=0.15, head_length=0.1, fc='orange', ec='orange')
    
    ax1.set_title('Stage 1: Instruction Tuning', fontsize=14, fontweight='bold')
    ax1.axis('off')
    
    # Stage 2: Task-specific fine-tuning
    ax2.set_xlim(0, 10)
    ax2.set_ylim(0, 8)
    
    # Instruction-tuned model
    inst_rect = plt.Rectangle((3, 6), 4, 1, facecolor='lightcoral', edgecolor='black', linewidth=2)
    ax2.add_patch(inst_rect)
    ax2.text(5, 6.5, 'Instruction-tuned\nLLaMA', ha='center', va='center', fontweight='bold')
    
    # Task-specific datasets
    tasks = ['RNP', 'RCG', 'CR']
    colors = ['lightblue', 'lightgreen', 'lightyellow']
    
    for i, (task, color) in enumerate(zip(tasks, colors)):
        x = 1 + i * 2.5
        task_rect = plt.Rectangle((x, 4), 2, 1, facecolor=color, edgecolor='black')
        ax2.add_patch(task_rect)
        ax2.text(x + 1, 4.5, f'{task}\nDataset', ha='center', va='center', fontsize=10)
        
        # Result
        result_rect = plt.Rectangle((x, 2), 2, 1, facecolor='gold', edgecolor='black')
        ax2.add_patch(result_rect)
        ax2.text(x + 1, 2.5, f'{task}\nModel', ha='center', va='center', fontweight='bold', fontsize=10)
        
        # Arrows
        ax2.arrow(5, 5.8, x + 1 - 5, -1.0, head_width=0.1, head_length=0.1, fc='red', ec='red')
        ax2.arrow(x + 1, 3.8, 0, -0.5, head_width=0.1, head_length=0.1, fc='black', ec='black')
    
    ax2.set_title('Stage 2: Task-specific Fine-tuning', fontsize=14, fontweight='bold')
    ax2.axis('off')
    
    plt.tight_layout()
    plt.show()

visualize_two_stage_training()

## 2. Creating Code-Domain Instruction Datasets

The paper uses the Code Alpaca dataset, which contains programming-related instruction-following examples. Let's create similar examples for code review tasks.
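
Alpaca-style datasets are typically distributed as JSON records with `instruction`, `input`, and `output` fields. The sketch below parses records of that shape and validates them before use. The two records are made up for illustration, and `load_triplets` is a hypothetical helper, not part of any dataset tooling.

```python
import json
from typing import Dict, List

# Two made-up records in the instruction/input/output shape.
RAW = r'''[
  {"instruction": "Write a function that reverses a string.",
   "input": "",
   "output": "def reverse(s):\n    return s[::-1]"},
  {"instruction": "Explain what this code does",
   "input": "print(sum(range(5)))",
   "output": "It prints 10, the sum of 0 through 4."}
]'''

REQUIRED_KEYS = {"instruction", "input", "output"}


def load_triplets(raw_json: str) -> List[Dict[str, str]]:
    """Parse records, reject malformed ones, and skip empty instructions/outputs."""
    triplets = []
    for record in json.loads(raw_json):
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"record is missing keys: {sorted(missing)}")
        if not record["instruction"].strip() or not record["output"].strip():
            continue  # nothing to learn from an empty instruction or response
        triplets.append({key: record[key] for key in sorted(REQUIRED_KEYS)})
    return triplets


triplets = load_triplets(RAW)
with_input = [t for t in triplets if t["input"]]
print(f"{len(triplets)} records, {len(with_input)} with a non-empty input")
```

An empty `input` is kept rather than dropped, since the no-input variant of the prompt template handles that case.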

In [None]:
class CodeInstructionGenerator:
    """Generate code-domain instruction examples"""
    
    def __init__(self):
        # Instruction templates for different code tasks
        self.instruction_templates = {
            'code_explanation': [
                "Explain what this code does",
                "Provide a detailed explanation of the following code",
                "Describe the functionality of this code snippet"
            ],
            'code_review': [
                "Review this code and suggest improvements",
                "Identify potential issues in the following code",
                "Provide constructive feedback for this code"
            ],
            'code_refactoring': [
                "Refactor this code to make it more readable",
                "Improve the following code",
                "Rewrite this code following best practices"
            ],
            'bug_detection': [
                "Find potential bugs in this code",
                "Identify errors in the following code",
                "Check this code for any issues"
            ],
            'documentation': [
                "Write documentation for this function",
                "Generate docstring for the following code",
                "Create comments explaining this code"
            ]
        }
        
        # Sample code snippets
        self.code_examples = [
            {
                'language': 'python',
                'code': '''def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)''',
                'explanation': 'Calculates the nth Fibonacci number using recursion',
                'issues': ['Inefficient due to repeated calculations', 'No input validation'],
                'improved': '''def fibonacci(n: int) -> int:
    """Calculate the nth Fibonacci number efficiently."""
    if not isinstance(n, int) or n < 0:
        raise ValueError("Input must be a non-negative integer")
    
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a'''
            },
            {
                'language': 'javascript',
                'code': '''function processUser(user) {
    user.name = user.name.toUpperCase();
    user.save();
    return user;
}''',
                'explanation': 'Processes a user object by converting name to uppercase and saving',
                'issues': ['No null checks', 'Missing error handling', 'Mutates input object'],
                'improved': '''async function processUser(user) {
    if (!user || !user.name) {
        throw new Error('Invalid user object');
    }
    
    try {
        const processedUser = {
            ...user,
            name: user.name.toUpperCase()
        };
        await processedUser.save();
        return processedUser;
    } catch (error) {
        console.error('Error processing user:', error);
        throw error;
    }
}'''
            },
            {
                'language': 'java',
                'code': '''public class Calculator {
    public int divide(int a, int b) {
        return a / b;
    }
}''',
                'explanation': 'A simple calculator class with a division method',
                'issues': ['Division by zero not handled', 'No input validation'],
                'improved': '''public class Calculator {
    /**
     * Divides two integers safely.
     * @param dividend The number to be divided
     * @param divisor The number to divide by
     * @return The result of division
     * @throws IllegalArgumentException if divisor is zero
     */
    public double divide(int dividend, int divisor) {
        if (divisor == 0) {
            throw new IllegalArgumentException("Cannot divide by zero");
        }
        return (double) dividend / divisor;
    }
}'''
            }
        ]
    
    def generate_instruction_examples(self, num_examples: int = 50) -> List[InstructionExample]:
        """Generate diverse instruction-following examples for code tasks"""
        examples = []
        task_types = list(self.instruction_templates.keys())
        
        for _ in range(num_examples):
            task_type = random.choice(task_types)
            code_example = random.choice(self.code_examples)
            instruction = random.choice(self.instruction_templates[task_type])
            
            if task_type == 'code_explanation':
                output = f"This {code_example['language']} code {code_example['explanation']}."
            elif task_type == 'code_review':
                issues = code_example['issues']
                output = "Issues found:\n" + "\n".join([f"- {issue}" for issue in issues[:2]])
            elif task_type == 'code_refactoring':
                output = f"Here's the improved version:\n\n```{code_example['language']}\n{code_example['improved']}\n```"
            elif task_type == 'bug_detection':
                output = f"Potential bugs found: {', '.join(code_example['issues'][:2])}"
            elif task_type == 'documentation':
                output = f"Documentation: {code_example['explanation']}"
            
            example = InstructionExample(
                instruction=instruction,
                input=code_example['code'],
                output=output,
                source='code_alpaca_like'
            )
            examples.append(example)
        
        return examples
    
    def generate_code_review_instructions(self) -> List[InstructionExample]:
        """Generate specific instruction examples for code review tasks"""
        examples = [
            # Review Necessity Prediction
            InstructionExample(
                instruction="Determine whether the provided diff hunk requires a code review. Respond with either 'yes' or 'no'.",
                input="+ console.log('Debug: user data', userData);",
                output="yes",
                source="rnp_task"
            ),
            InstructionExample(
                instruction="Determine whether the provided diff hunk requires a code review. Respond with either 'yes' or 'no'.",
                input="+ // Updated documentation",
                output="no",
                source="rnp_task"
            ),
            
            # Review Comment Generation  
            InstructionExample(
                instruction="Review the given code and provide a constructive code review comment.",
                input="function login(user) {\n    return authenticate(user);\n}",
                output="Consider adding input validation to check if 'user' is null or undefined before calling authenticate().",
                source="rcg_task"
            ),
            
            # Code Refinement
            InstructionExample(
                instruction="Refine the given code based on the provided code review comment.",
                input="Comment: Add null check\nCode: function save(user) { user.persist(); }",
                output="function save(user) {\n    if (user) {\n        user.persist();\n    }\n}",
                source="cr_task"
            )
        ]
        
        return examples

# Generate example instruction datasets
generator = CodeInstructionGenerator()

# General code instructions (Code Alpaca style)
code_examples = generator.generate_instruction_examples(20)

# Specific code review instructions
review_examples = generator.generate_code_review_instructions()

print("Sample Code Instruction Examples:\n")
for i, example in enumerate(code_examples[:3]):
    print(f"Example {i+1}:")
    print(example.to_prompt())
    print("\n" + "="*60 + "\n")

print("\nCode Review Task Examples:\n")
for i, example in enumerate(review_examples[:2]):
    print(f"Review Example {i+1}:")
    print(example.to_prompt())
    print("\n" + "="*60 + "\n")

## 3. Comparing Instruction Datasets: Code vs Mixed

The paper compares three approaches:
- **PL**: Only Code Alpaca (programming language instructions)
- **PL + NL**: Code Alpaca + Alpaca (mixed programming and natural language)
- **No Instruction Tuning**: Direct task fine-tuning

**Key Finding**: In the figures used below, code-only instructions (PL) outperform mixed instructions (PL + NL) on RCG and CR, while the mixed set is slightly ahead on RNP.
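
The comparison can be checked with simple arithmetic. The sketch below takes the Table VIII figures as they are transcribed in this notebook and reports the PL minus PL+NL difference for each metric. The dictionary layout and the `metric_deltas` helper are illustrative assumptions.

```python
from typing import Dict

# Scores as transcribed in this notebook (RNP: F1, RCG and CR: BLEU).
RESULTS: Dict[str, Dict[str, float]] = {
    "no_instruction": {"rnp_f1": 70.20, "rcg_bleu": 5.58, "cr_bleu": 81.87},
    "pl_only": {"rnp_f1": 69.34, "rcg_bleu": 5.64, "cr_bleu": 81.59},
    "pl_plus_nl": {"rnp_f1": 69.82, "rcg_bleu": 5.23, "cr_bleu": 81.17},
}


def metric_deltas(setting_a: str, setting_b: str) -> Dict[str, float]:
    """Per-metric difference setting_a - setting_b, rounded to two decimals."""
    return {
        metric: round(RESULTS[setting_a][metric] - RESULTS[setting_b][metric], 2)
        for metric in RESULTS[setting_a]
    }


pl_vs_mixed = metric_deltas("pl_only", "pl_plus_nl")
for metric, delta in pl_vs_mixed.items():
    leader = "PL" if delta > 0 else "PL+NL"
    print(f"{metric}: {delta:+.2f} ({leader} ahead)")
```

With these figures, PL leads on RCG and CR, and PL+NL is slightly ahead on RNP.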

In [None]:
class InstructionDatasetAnalyzer:
    """Analyze different instruction dataset compositions"""
    
    def __init__(self):
        self.dataset_stats = {
            'alpaca': {
                'size': 52000,
                'domain': 'general',
                'languages': ['natural language'],
                'task_types': ['writing', 'reasoning', 'qa', 'classification'],
                'complexity': 'varied'
            },
            'code_alpaca': {
                'size': 20000,
                'domain': 'programming',
                'languages': ['python', 'javascript', 'java', 'c++'],
                'task_types': ['code_generation', 'debugging', 'explanation'],
                'complexity': 'high'
            }
        }
        
        # Results from Table VIII in the paper
        self.results = {
            'no_instruction': {
                'rnp_f1': 70.20,
                'rcg_bleu': 5.58,
                'cr_bleu': 81.87
            },
            'pl_only': {
                'rnp_f1': 69.34,
                'rcg_bleu': 5.64,
                'cr_bleu': 81.59
            },
            'pl_plus_nl': {
                'rnp_f1': 69.82,
                'rcg_bleu': 5.23,
                'cr_bleu': 81.17
            }
        }
    
    def analyze_instruction_diversity(self) -> Dict:
        """Analyze diversity in instruction types"""
        
        # Generate sample instruction types
        generator = CodeInstructionGenerator()
        code_examples = generator.generate_instruction_examples(100)
        
        # Count task types
        task_counts = defaultdict(int)
        for example in code_examples:
            # Simple classification based on instruction text
            instruction_lower = example.instruction.lower()
            if 'explain' in instruction_lower:
                task_counts['explanation'] += 1
            elif 'review' in instruction_lower or 'improve' in instruction_lower:
                task_counts['review'] += 1
            elif 'refactor' in instruction_lower or 'rewrite' in instruction_lower:
                task_counts['refactoring'] += 1
            elif 'bug' in instruction_lower or 'error' in instruction_lower:
                task_counts['debugging'] += 1
            elif 'document' in instruction_lower or 'comment' in instruction_lower:
                task_counts['documentation'] += 1
            else:
                task_counts['other'] += 1
        
        return dict(task_counts)
    
    def visualize_dataset_comparison(self):
        """Visualize the impact of different instruction datasets"""
        fig, axes = plt.subplots(2, 2, figsize=(15, 10))
        
        # 1. Dataset composition
        ax1 = axes[0, 0]
        
        datasets = ['No Instruction', 'Code Only (PL)', 'Mixed (PL+NL)']
        code_portions = [0, 100, 28]  # Approximate percentages
        nl_portions = [0, 0, 72]
        
        width = 0.6
        x = np.arange(len(datasets))
        
        bars1 = ax1.bar(x, code_portions, width, label='Code Instructions', color='lightblue')
        bars2 = ax1.bar(x, nl_portions, width, bottom=code_portions, label='NL Instructions', color='lightcoral')
        
        ax1.set_xlabel('Instruction Dataset Type')
        ax1.set_ylabel('Percentage')
        ax1.set_title('Instruction Dataset Composition')
        ax1.set_xticks(x)
        ax1.set_xticklabels(datasets)
        ax1.legend()
        ax1.grid(axis='y', alpha=0.3)
        
        # 2. Performance comparison - RNP F1
        ax2 = axes[0, 1]
        
        rnp_scores = [self.results[key]['rnp_f1'] for key in ['no_instruction', 'pl_only', 'pl_plus_nl']]
        bars = ax2.bar(datasets, rnp_scores, color=['gray', 'green', 'orange'])
        
        ax2.set_xlabel('Instruction Dataset')
        ax2.set_ylabel('F1 Score')
        ax2.set_title('Review Necessity Prediction Performance')
        ax2.grid(axis='y', alpha=0.3)
        
        # Add value labels
        for bar, score in zip(bars, rnp_scores):
            height = bar.get_height()
            ax2.text(bar.get_x() + bar.get_width()/2., height,
                     f'{score:.2f}', ha='center', va='bottom')
        
        # 3. Performance comparison - RCG BLEU
        ax3 = axes[1, 0]
        
        rcg_scores = [self.results[key]['rcg_bleu'] for key in ['no_instruction', 'pl_only', 'pl_plus_nl']]
        bars = ax3.bar(datasets, rcg_scores, color=['gray', 'green', 'orange'])
        
        ax3.set_xlabel('Instruction Dataset')
        ax3.set_ylabel('BLEU-4 Score')
        ax3.set_title('Review Comment Generation Performance')
        ax3.grid(axis='y', alpha=0.3)
        
        # Add value labels
        for bar, score in zip(bars, rcg_scores):
            height = bar.get_height()
            ax3.text(bar.get_x() + bar.get_width()/2., height,
                     f'{score:.2f}', ha='center', va='bottom')
        
        # 4. Task type diversity
        ax4 = axes[1, 1]
        
        task_counts = self.analyze_instruction_diversity()
        tasks = list(task_counts.keys())
        counts = list(task_counts.values())
        
        wedges, texts, autotexts = ax4.pie(counts, labels=tasks, autopct='%1.1f%%', startangle=90)
        ax4.set_title('Code Instruction Task Distribution')
        
        plt.tight_layout()
        plt.show()
        
        # Print key insights
        print("Key Insights from the Paper:")
        print("\n1. Code-only instructions (PL) perform better than mixed (PL+NL)")
        print("   - RCG: 5.64 vs 5.23 BLEU-4")
        print("   - Reason: Diverse NL instructions may be overwhelming for code tasks")
        
        print("\n2. Instruction tuning helps most for RCG (comment generation)")
        print("   - Complex task benefits from instruction following")
        print("   - RNP and CR drop slightly (-0.86 F1 and -0.28 BLEU)")
        
        print("\n3. Domain-specific instructions are more effective")
        print("   - Code Alpaca focuses on programming tasks")
        print("   - Better alignment with downstream code review tasks")

# Run the analysis
analyzer = InstructionDatasetAnalyzer()
analyzer.visualize_dataset_comparison()

## 4. Implementing Instruction Tuning Pipeline

Let's implement a complete instruction tuning pipeline that matches the paper's approach.
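
When fine-tuning on formatted prompts, one common choice is to compute the loss only on response tokens by setting the labels of prompt tokens to `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. Whether the paper does exactly this is an assumption here. The sketch uses a toy whitespace tokenizer in place of a real one; `toy_tokenize` and `build_labels` are hypothetical names.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def toy_tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Whitespace tokenizer stand-in; assigns a new id the first time a token is seen."""
    return [vocab.setdefault(token, len(vocab)) for token in text.split()]


def build_labels(prompt: str, response: str, vocab: Dict[str, int]) -> Dict[str, List[int]]:
    """Concatenate prompt and response ids; mask the prompt positions in the labels."""
    prompt_ids = toy_tokenize(prompt, vocab)
    response_ids = toy_tokenize(response, vocab)
    return {
        "input_ids": prompt_ids + response_ids,
        "labels": [IGNORE_INDEX] * len(prompt_ids) + response_ids,
    }


vocab: Dict[str, int] = {}
example = build_labels(
    "### Instruction: Respond yes or no ### Response:",
    "yes",
    vocab,
)
print(example["input_ids"])
print(example["labels"])
```

The model still reads the full sequence in `input_ids`; only the positions whose label differs from `-100` contribute to the loss.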

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from typing import Dict, Any

class InstructionTuningPipeline:
    """Complete instruction tuning pipeline for code review tasks"""
    
    def __init__(self, model_name: str = "microsoft/phi-2"):
        """
        Initialize with a smaller model for demonstration.
        In practice, this would be LLaMA-7B.
        """
        self.model_name = model_name
        self.instruction_template = self._get_alpaca_template()
        
    def _get_alpaca_template(self) -> str:
        """Get the Alpaca prompt template used in the paper"""
        return """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""
    
    def prepare_instruction_dataset(self, examples: List[InstructionExample]) -> List[str]:
        """Convert instruction examples to formatted prompts"""
        formatted_examples = []
        
        for example in examples:
            if example.input:
                prompt = self.instruction_template.format(
                    instruction=example.instruction,
                    input=example.input,
                    output=example.output
                )
            else:
                # No input template
                prompt = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{example.instruction}

### Response:
{example.output}"""
            
            formatted_examples.append(prompt)
        
        return formatted_examples
    
    def create_stage1_dataset(self, code_only: bool = True) -> List[str]:
        """Create Stage 1 instruction tuning dataset"""
        generator = CodeInstructionGenerator()
        
        if code_only:
            # Code Alpaca style examples
            examples = generator.generate_instruction_examples(100)
            print("Created Code-only instruction dataset (PL)")
        else:
            # Mixed Code + General examples
            code_examples = generator.generate_instruction_examples(50)
            
            # Add some general Alpaca-style examples
            general_examples = [
                InstructionExample(
                    instruction="Write a short story about a robot",
                    output="Once upon a time, there was a helpful robot named Ada who loved to solve problems...",
                    source="alpaca"
                ),
                InstructionExample(
                    instruction="Explain the concept of machine learning",
                    output="Machine learning is a subset of artificial intelligence that enables computers to learn...",
                    source="alpaca"
                )
            ]
            
            examples = code_examples + general_examples
            print("Created Mixed instruction dataset (PL + NL)")
        
        return self.prepare_instruction_dataset(examples)
    
    def create_stage2_dataset(self, task_type: str) -> List[str]:
        """Create Stage 2 task-specific dataset"""
        generator = CodeInstructionGenerator()
        
        if task_type == "rnp":
            # Review Necessity Prediction examples
            examples = [
                InstructionExample(
                    instruction="Determine whether the provided diff hunk requires a code review. Respond with either 'yes' or 'no'.",
                    input="+ console.log('Debug info:', data);",
                    output="yes"
                ),
                InstructionExample(
                    instruction="Determine whether the provided diff hunk requires a code review. Respond with either 'yes' or 'no'.",
                    input="+ // Updated documentation",
                    output="no"
                )
            ] * 25  # Duplicate for larger dataset
            
        elif task_type == "rcg":
            # Review Comment Generation examples
            examples = [
                InstructionExample(
                    instruction="Review the given code and provide a constructive code review comment.",
                    input="function process(user) {\n    user.save();\n}",
                    output="Consider adding null check for 'user' parameter before calling save() method."
                )
            ] * 50
            
        elif task_type == "cr":
            # Code Refinement examples
            examples = [
                InstructionExample(
                    instruction="Refine the given code based on the provided code review comment.",
                    input="Comment: Add null check\nCode: function save(data) { data.persist(); }",
                    output="function save(data) {\n    if (data) {\n        data.persist();\n    }\n}"
                )
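            ] * 50
        
        else:
            # Fail early instead of hitting an undefined `examples` below
            raise ValueError(
                f"Unknown task_type {task_type!r}; expected 'rnp', 'rcg', or 'cr'"
            )
        
        print(f"Created {task_type.upper()} task-specific dataset with {len(examples)} examples")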
        return self.prepare_instruction_dataset(examples)
    
    def simulate_training(self, stage1_data: List[str], stage2_data: List[str], task_name: str) -> Dict[str, Any]:
        """Simulate the two-stage training process"""
        
        print(f"\n=== Simulated Training for {task_name.upper()} ===")
        print(f"Stage 1: Instruction tuning with {len(stage1_data)} examples")
        print(f"Stage 2: Task-specific fine-tuning with {len(stage2_data)} examples")
        
        # Placeholder numbers for illustration only; nothing is trained or measured here
        base_performance = 0.6
        stage1_improvement = 0.05 if len(stage1_data) > 50 else 0.02
        stage2_improvement = 0.15
        
        metrics = {
            'base_model': base_performance,
            'after_stage1': base_performance + stage1_improvement,
            'after_stage2': base_performance + stage1_improvement + stage2_improvement,
            'total_improvement': stage1_improvement + stage2_improvement
        }
        
        return metrics
    
    def run_experiment(self, compare_instruction_types: bool = True):
        """Run the complete instruction tuning experiment"""
        results = {}
        
        if compare_instruction_types:
            # Compare Code-only vs Mixed instructions
            for instruction_type, code_only in [("Code-only (PL)", True), ("Mixed (PL+NL)", False)]:
                print(f"\n{'='*60}")
                print(f"Experiment: {instruction_type}")
                print(f"{'='*60}")
                
                stage1_data = self.create_stage1_dataset(code_only=code_only)
                
                # Test on all three tasks
                for task in ["rnp", "rcg", "cr"]:
                    stage2_data = self.create_stage2_dataset(task)
                    metrics = self.simulate_training(stage1_data, stage2_data, task)
                    
                    key = f"{instruction_type}_{task}"
                    results[key] = metrics
                    
                    print(f"\n{task.upper()} Results:")
                    print(f"  Base model: {metrics['base_model']:.3f}")
                    print(f"  After instruction tuning: {metrics['after_stage1']:.3f} (+{metrics['after_stage1'] - metrics['base_model']:.3f})")
                    print(f"  After task fine-tuning: {metrics['after_stage2']:.3f} (+{metrics['total_improvement']:.3f})")
        
        return results

# Run the instruction tuning experiment
pipeline = InstructionTuningPipeline()
experiment_results = pipeline.run_experiment()

## 5. Analyzing Instruction Tuning Effects

Let's look at how instruction tuning affects each task and why the effect differs between them.
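
The three metrics sit on different scales (F1 near 70, BLEU near 5 and near 80), so absolute differences are hard to compare across tasks. A minimal sketch that converts the Table VIII figures used in this notebook into relative changes; `relative_change` is a hypothetical helper.

```python
from typing import Dict, Tuple


def relative_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before


# (without instruction tuning, with Code Alpaca instruction tuning)
SCORES: Dict[str, Tuple[float, float]] = {
    "RNP": (70.20, 69.34),  # F1
    "RCG": (5.58, 5.64),    # BLEU
    "CR": (81.87, 81.59),   # BLEU
}

changes = {task: relative_change(before, after) for task, (before, after) in SCORES.items()}
for task, pct in changes.items():
    print(f"{task}: {pct:+.2f}%")
```

On this scale every change stays within about 1.3% of the baseline, and RCG is the only task that moves upward.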

In [None]:
def analyze_instruction_tuning_effects():
    """Analyze why instruction tuning has different effects on different tasks"""
    
    # Data from the paper (Table VIII)
    paper_results = {
        'RNP': {
            'no_instruction': 70.20,
            'with_instruction': 69.34,
            'improvement': -0.86
        },
        'RCG': {
            'no_instruction': 5.58,
            'with_instruction': 5.64,
            'improvement': 0.06
        },
        'CR': {
            'no_instruction': 81.87,
            'with_instruction': 81.59,
            'improvement': -0.28
        }
    }
    
    # Visualize the effects
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
    
    # 1. Performance comparison
    tasks = list(paper_results.keys())
    no_inst = [paper_results[task]['no_instruction'] for task in tasks]
    with_inst = [paper_results[task]['with_instruction'] for task in tasks]
    
    x = np.arange(len(tasks))
    width = 0.35
    
    bars1 = ax1.bar(x - width/2, no_inst, width, label='No Instruction Tuning', color='lightcoral')
    bars2 = ax1.bar(x + width/2, with_inst, width, label='With Instruction Tuning', color='lightblue')
    
    ax1.set_xlabel('Tasks')
    ax1.set_ylabel('Performance Score')
    ax1.set_title('Instruction Tuning Effect by Task')
    ax1.set_xticks(x)
    ax1.set_xticklabels(tasks)
    ax1.legend()
    ax1.grid(axis='y', alpha=0.3)
    
    # Add improvement annotations
    for i, task in enumerate(tasks):
        improvement = paper_results[task]['improvement']
        color = 'green' if improvement > 0 else 'red'
        ax1.annotate(f'{improvement:+.2f}', 
                    xy=(i, max(no_inst[i], with_inst[i]) + 1),
                    ha='center', va='bottom', color=color, fontweight='bold')
    
    # 2. Task complexity analysis
    ax2.axis('off')
    
    analysis_text = """Why Instruction Tuning Effects Vary:

1. Review Necessity Prediction (RNP):
   • Binary classification task
   • Simple input-output mapping
   • Instruction tuning may add complexity
   • Slight decrease: -0.86 F1

2. Review Comment Generation (RCG):
   • Complex NL generation task
   • Benefits from instruction following
   • Requires understanding of review intent
   • Small improvement: +0.06 BLEU

3. Code Refinement (CR):
   • Code transformation task
   • Pattern-based improvements
   • May not need explicit instructions
   • Minimal impact: -0.28 BLEU

Key Insight:
Instruction tuning helps most with tasks
requiring contextual understanding and
multi-step reasoning (like RCG)."""
    
    ax2.text(0.05, 0.95, analysis_text, transform=ax2.transAxes,
             fontsize=11, verticalalignment='top', fontfamily='monospace')
    
    plt.tight_layout()
    plt.show()
    
    # Print detailed analysis
    print("Detailed Analysis:\n")
    
    print("1. Task Complexity vs Instruction Benefit:")
    task_complexity = {
        'RNP': 'Low (binary classification)',
        'RCG': 'High (natural language generation)',  
        'CR': 'Medium (code transformation)'
    }
    
    for task in tasks:
        improvement = paper_results[task]['improvement']
        complexity = task_complexity[task]
        print(f"   {task}: {complexity} → {improvement:+.2f} change")
    
    print("\n2. Why Code-only Instructions Work Better:")
    print("   • Domain alignment: Code tasks need code-related instructions")
    print("   • Reduced noise: General NL instructions may confuse code models")
    print("   • Efficiency: Smaller, focused dataset trains faster")
    
    print("\n3. Practical Implications:")
    print("   • Use instruction tuning selectively for complex tasks")
    print("   • Focus on domain-specific instruction datasets")
    print("   • Consider task characteristics when designing instructions")

analyze_instruction_tuning_effects()

## 6. Best Practices for Code-Domain Instruction Tuning

Based on the paper's findings and our analysis, here are key best practices:
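
A recurring practical step when building a focused instruction set is turning target task-type proportions into integer example counts that sum exactly to the dataset size. Below is a minimal sketch using largest-remainder rounding; the function name `allocate_counts` and the example proportions are illustrative assumptions rather than values from the paper.

```python
import math
from typing import Dict


def allocate_counts(proportions: Dict[str, float], size: int) -> Dict[str, int]:
    """Split `size` examples across task types according to `proportions`."""
    total = sum(proportions.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"proportions must sum to 1.0, got {total}")

    exact = {name: share * size for name, share in proportions.items()}
    counts = {name: math.floor(value) for name, value in exact.items()}

    # Hand out the examples lost to flooring, largest fractional part first.
    leftover = size - sum(counts.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


counts = allocate_counts(
    {"code_review": 0.4, "code_explanation": 0.3, "bug_detection": 0.2, "documentation": 0.1},
    size=1000,
)
print(counts)
```

Plain `int(size * proportion)` can silently drop examples when the products are not whole numbers; the largest-remainder step keeps the total equal to `size`.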

In [None]:
class InstructionTuningBestPractices:
    """Best practices for instruction tuning in code domains"""
    
    def __init__(self):
        self.practices = {
            'dataset_design': [
                "Use domain-specific instructions (Code Alpaca > Mixed)",
                "Maintain consistent prompt template format",
                "Balance instruction diversity with domain focus",
                "Include negative examples for binary tasks"
            ],
            'training_strategy': [
                "Two-stage approach: instruction tuning then task-specific",
                "Lower learning rates for instruction tuning (5e-5)",
                "Longer training for complex generation tasks",
                "Monitor for instruction-task alignment"
            ],
            'task_selection': [
                "Prioritize complex NL generation tasks (RCG)",
                "Consider skipping for simple classification (RNP)",
                "Evaluate cost-benefit for each task type",
                "Use confidence scores to decide instruction necessity"
            ],
            'evaluation': [
                "Compare with and without instruction tuning",
                "Measure instruction following capability separately",
                "Test on held-out instruction types",
                "Monitor for catastrophic forgetting"
            ]
        }
    
    def create_optimal_instruction_dataset(self, target_task: str, size: int = 1000) -> List[InstructionExample]:
        """Create optimized instruction dataset based on best practices"""
        
        generator = CodeInstructionGenerator()
        examples = []
        
        if target_task == "rcg":  # Review Comment Generation benefits most
            # Focus on review-related instructions
            instruction_types = {
                'code_review': 0.4,  # 40% review instructions
                'code_explanation': 0.3,  # 30% explanation
                'bug_detection': 0.2,  # 20% debugging
                'documentation': 0.1   # 10% documentation
            }
            
            for inst_type, proportion in instruction_types.items():
                count = int(size * proportion)
                for _ in range(count):
                    template = random.choice(generator.instruction_templates[inst_type])
                    code_ex = random.choice(generator.code_examples)
                    
                    if inst_type == 'code_review':
                        output = f"Issues: {', '.join(code_ex['issues'][:2])}"
                    elif inst_type == 'code_explanation':
                        output = f"This code {code_ex['explanation']}"
                    elif inst_type == 'bug_detection':
                        output = f"Potential bugs: {code_ex['issues'][0]}"
                    else:
                        output = f"Documentation needed for {code_ex['explanation']}"
                    
                    examples.append(InstructionExample(
                        instruction=template,
                        input=code_ex['code'],
                        output=output,
                        source=f"optimized_{inst_type}"
                    ))
        
        elif target_task == "rnp":  # Simple classification - minimal instruction tuning
            # Focus on decision-making instructions
            decision_instructions = [
                "Classify this code change",
                "Determine if review is needed",
                "Analyze this diff"
            ]
            
            for _ in range(size // 4):  # Much smaller dataset
                instruction = random.choice(decision_instructions)
                code_ex = random.choice(generator.code_examples)
                
                # Placeholder label picked at random for this demo; real RNP training data needs ground-truth labels
                examples.append(InstructionExample(
                    instruction=instruction,
                    input=code_ex['code'],
                    output=random.choice(["needs review", "approved"]),
                    source="optimized_rnp"
                ))
        
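        else:
            # Fail loudly rather than silently returning an empty dataset
            raise ValueError(f"Unsupported target_task {target_task!r}; expected 'rcg' or 'rnp'")
        
        print(f"Created optimized instruction dataset for {target_task.upper()}: {len(examples)} examples")
        return examples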
    
    def display_best_practices(self):
        """Display comprehensive best practices guide"""
        
        print("🎯 INSTRUCTION TUNING BEST PRACTICES FOR CODE REVIEW")
        print("=" * 60)
        
        for category, practices in self.practices.items():
            print(f"\n📋 {category.replace('_', ' ').title()}:")
            for i, practice in enumerate(practices, 1):
                print(f"   {i}. {practice}")
        
        print("\n\n💡 KEY RECOMMENDATIONS FROM PAPER:")
        recommendations = [
            "Use Code Alpaca instead of mixed Alpaca + Code Alpaca",
            "Apply instruction tuning primarily for Review Comment Generation",
            "Consider skipping instruction tuning for simple classification tasks",
            "Maintain consistent prompt templates across training and inference",
            "Monitor for domain shift when using general instruction datasets"
        ]
        
        for i, rec in enumerate(recommendations, 1):
            print(f"   {i}. {rec}")
        
        print("\n\n🔬 EXPERIMENTAL GUIDELINES:")
        guidelines = [
            "Always compare with baseline (no instruction tuning)",
            "Test both Code-only and Mixed instruction datasets", 
            "Measure instruction following capability independently",
            "Use held-out instruction types for evaluation",
            "Monitor computational overhead vs. performance gains"
        ]
        
        for i, guideline in enumerate(guidelines, 1):
            print(f"   {i}. {guideline}")

# Demonstrate best practices
best_practices = InstructionTuningBestPractices()
best_practices.display_best_practices()

# Create optimized datasets
print("\n\n📊 CREATING OPTIMIZED INSTRUCTION DATASETS:")
print("=" * 50)

rcg_dataset = best_practices.create_optimal_instruction_dataset("rcg", 200)
rnp_dataset = best_practices.create_optimal_instruction_dataset("rnp", 100)

print("\nIllustrative dataset sizes (this notebook's heuristic, not figures from the paper):")
print(f"  RCG (complex): {len(rcg_dataset)} examples")
print(f"  RNP (simple): {len(rnp_dataset)} examples")
print(f"  Ratio: {len(rcg_dataset)/len(rnp_dataset):.1f}:1")
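
One detail in the instruction-mix code above is that truncating `size * proportion` to an integer can leave the dataset slightly short of the requested size for some inputs. A possible way around this, sketched below, is largest-remainder allocation. The function name `allocate_counts` is hypothetical and the mix shown simply mirrors the RCG proportions used in this notebook.

```python
import math
from typing import Dict


def allocate_counts(size: int, proportions: Dict[str, float]) -> Dict[str, int]:
    """Split `size` into integer counts that follow `proportions` and sum to `size`."""
    total = sum(proportions.values())
    if size < 0 or total <= 0:
        raise ValueError("size must be >= 0 and proportions must sum to a positive value")

    # Exact (fractional) share for each instruction type, normalised by the total
    exact = {name: size * p / total for name, p in proportions.items()}
    counts = {name: math.floor(share) for name, share in exact.items()}

    # Hand the leftover examples to the types with the largest fractional remainders
    leftover = size - sum(counts.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


rcg_mix = {
    "code_review": 0.4,
    "code_explanation": 0.3,
    "bug_detection": 0.2,
    "documentation": 0.1,
}
print(allocate_counts(200, rcg_mix))
print(allocate_counts(7, rcg_mix))
```

For sizes that divide evenly this gives the same counts as simple truncation; for awkward sizes it keeps the total exact while staying as close to the target proportions as integers allow.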

## 7. Future Directions and Advanced Techniques

Based on the paper's findings, here are promising directions for improving instruction tuning in code domains:

In [None]:
def visualize_future_directions():
    """Visualize future research directions for instruction tuning"""
    
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))
    
    # 1. Adaptive Instruction Tuning
    ax1.set_xlim(0, 10)
    ax1.set_ylim(0, 8)
    
    # Current approach
    current_rect = plt.Rectangle((1, 6), 8, 1, facecolor='lightblue', edgecolor='black')
    ax1.add_patch(current_rect)
    ax1.text(5, 6.5, 'Fixed Instruction Dataset', ha='center', va='center', fontweight='bold')
    
    # Adaptive approach
    adaptive_rect = plt.Rectangle((1, 4), 8, 1, facecolor='lightgreen', edgecolor='black')
    ax1.add_patch(adaptive_rect)
    ax1.text(5, 4.5, 'Adaptive Instructions Based on Task Difficulty', ha='center', va='center', fontweight='bold')
    
    # Performance feedback
    feedback_rect = plt.Rectangle((1, 2), 8, 1, facecolor='lightcoral', edgecolor='black')
    ax1.add_patch(feedback_rect)
    ax1.text(5, 2.5, 'Performance-Driven Instruction Selection', ha='center', va='center', fontweight='bold')
    
    ax1.arrow(5, 5.8, 0, -0.5, head_width=0.2, head_length=0.1, fc='blue', ec='blue')
    ax1.arrow(5, 3.8, 0, -0.5, head_width=0.2, head_length=0.1, fc='green', ec='green')
    
    ax1.set_title('1. Adaptive Instruction Tuning', fontweight='bold')
    ax1.axis('off')
    
    # 2. Multi-Modal Instructions
    # NOTE: the values below are illustrative placeholders, not measurements from the paper
    modalities = ['Code', 'Comments', 'Documentation', 'AST', 'Dependencies']
    values = [100, 60, 40, 30, 20]
    colors = ['blue', 'green', 'orange', 'red', 'purple']
    
    bars = ax2.barh(modalities, values, color=colors, alpha=0.7)
    ax2.set_xlabel('Information Content (%, illustrative)')
    ax2.set_title('2. Multi-Modal Code Understanding', fontweight='bold')
    ax2.grid(axis='x', alpha=0.3)
    
    # 3. Instruction Complexity Analysis
    # NOTE: the benefit percentages are illustrative placeholders, not results from the paper
    complexity_levels = ['Simple\n(Binary)', 'Medium\n(Classification)', 'Complex\n(Generation)', 'Very Complex\n(Reasoning)']
    instruction_benefit = [10, 30, 70, 90]
    
    bars = ax3.bar(complexity_levels, instruction_benefit, color=['red', 'orange', 'yellow', 'green'])
    ax3.set_ylabel('Instruction Tuning Benefit (%, illustrative)')
    ax3.set_title('3. Task Complexity vs Instruction Benefit', fontweight='bold')
    ax3.grid(axis='y', alpha=0.3)
    
    # Add value labels
    for bar, value in zip(bars, instruction_benefit):
        height = bar.get_height()
        ax3.text(bar.get_x() + bar.get_width()/2., height + 2,
                 f'{value}%', ha='center', va='bottom')
    
    # 4. Advanced Techniques Timeline
    ax4.axis('off')
    
    timeline_text = """🚀 FUTURE RESEARCH DIRECTIONS:

📊 1. Adaptive Instruction Selection:
   • Dynamic instruction difficulty adjustment
   • Task-specific instruction generation
   • Performance-feedback loops

🔄 2. Multi-Stage Instruction Curricula:
   • Progressive complexity increase
   • Domain-specific instruction sequences
   • Cross-task instruction transfer

🎯 3. Personalized Instructions:
   • Developer-specific instruction styles
   • Codebase-aware instructions
   • Project context integration

🧠 4. Meta-Learning for Instructions:
   • Learn to generate instructions
   • Few-shot instruction adaptation
   • Instruction effectiveness prediction

💡 5. Integration with Code Analysis:
   • AST-aware instruction tuning
   • Semantic code understanding
   • Multi-modal code representation"""
    
    ax4.text(0.05, 0.95, timeline_text, transform=ax4.transAxes,
             fontsize=11, verticalalignment='top', fontfamily='monospace')
    
    plt.tight_layout()
    plt.show()

# Advanced instruction tuning techniques
class AdvancedInstructionTechniques:
    """Advanced techniques for code-domain instruction tuning"""
    
    def __init__(self):
        self.techniques = {
            'adaptive_selection': {
                'description': 'Dynamically select instructions based on model performance',
                'implementation': 'Monitor task-specific metrics and adjust instruction complexity',
                'benefits': ['Optimal instruction-task alignment', 'Reduced training time'],
                'challenges': ['Complex implementation', 'Requires performance monitoring']
            },
            'hierarchical_instructions': {
                'description': 'Multi-level instruction hierarchy from simple to complex',
                'implementation': 'Progressive instruction curriculum with increasing complexity',
                'benefits': ['Better learning stability', 'Improved generalization'],
                'challenges': ['Curriculum design complexity', 'Training time increase']
            },
            'code_aware_instructions': {
                'description': 'Instructions that incorporate code structure and semantics',
                'implementation': 'AST-based instruction generation with semantic understanding',
                'benefits': ['Better code understanding', 'Domain-specific improvements'],
                'challenges': ['Requires code parsing', 'Language-specific implementation']
            }
        }
    
    def demonstrate_adaptive_instruction(self) -> str:
        """Demonstrate adaptive instruction selection"""
        
        example = """
📊 ADAPTIVE INSTRUCTION EXAMPLE (hypothetical numbers):

Task: Review Comment Generation
Current Performance: 5.2 BLEU-4
Target Performance: 6.0 BLEU-4

Step 1: Analyze Current Weaknesses
  - Poor at detecting null pointer issues
  - Weak security vulnerability identification
  
Step 2: Generate Targeted Instructions
  - "Identify potential null pointer dereferences in this code"
  - "Find security vulnerabilities in the following function"
  
Step 3: Measure Improvement
  - Re-evaluate on held-out test set
  - Adjust instruction mix based on results
  
Hoped-for Outcome (illustrative): +0.3 BLEU-4 improvement
"""
        return example

# Run future directions analysis
visualize_future_directions()

# Demonstrate advanced techniques
advanced_tech = AdvancedInstructionTechniques()
print(advanced_tech.demonstrate_adaptive_instruction())

print("\n🎯 IMPLEMENTATION PRIORITIES:")
priorities = [
    "1. Task-specific instruction dataset creation (immediate)",
    "2. Instruction effectiveness measurement (short-term)", 
    "3. Adaptive instruction selection (medium-term)",
    "4. Multi-modal code instruction tuning (long-term)",
    "5. Meta-learning for instruction generation (research)"
]

for priority in priorities:
    print(f"   {priority}")
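
The adaptive selection idea above (analyze weaknesses, then shift the instruction mix toward them) can be sketched as a simple reweighting step. Everything here is an assumption for illustration: the function name `reweight_mix`, the `floor` parameter, and the per-category error rates are hypothetical and do not come from the paper.

```python
from typing import Dict


def reweight_mix(error_rates: Dict[str, float], floor: float = 0.05) -> Dict[str, float]:
    """Turn per-category error rates into sampling proportions that sum to 1.

    Categories where the model does worse receive a larger share. `floor`
    guarantees every category a minimum share so none is dropped entirely.
    """
    if not error_rates:
        raise ValueError("error_rates must not be empty")
    if any(not 0.0 <= e <= 1.0 for e in error_rates.values()):
        raise ValueError("error rates must lie in [0, 1]")
    n = len(error_rates)
    if floor < 0 or floor * n > 1.0:
        raise ValueError("floor is too large for the number of categories")

    total_error = sum(error_rates.values())
    if total_error == 0:
        # No observed weaknesses: fall back to a uniform mix
        return {name: 1.0 / n for name in error_rates}

    spare = 1.0 - floor * n  # probability mass left after the guaranteed floor
    return {name: floor + spare * e / total_error for name, e in error_rates.items()}


# Hypothetical held-out error rates per instruction category
observed = {"null_pointer": 0.6, "security": 0.3, "style": 0.1}
mix = reweight_mix(observed)
for name, share in mix.items():
    print(f"{name}: {share:.3f}")
```

In a real loop the error rates would be re-measured on a held-out set after each round, and the new mix would drive how many instructions of each category are sampled next.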

## 8. Key Takeaways and Summary

### From the Paper's Research:

1. **Instruction Tuning Strategy**:
   - Two-stage approach: instruction tuning → task-specific fine-tuning
   - Code-only instructions (Code Alpaca) outperform mixed datasets
   - Effects differ by task: only RCG improved, while RNP and CR regressed

2. **Task-Specific Effects**:
   - **RCG (Review Comment Generation)**: The only task that improved, and only slightly (+0.06 BLEU-4)
   - **RNP (Review Necessity Prediction)**: Negative effect (-0.86 F1)
   - **CR (Code Refinement)**: Slight negative effect (-0.28 BLEU-4)

3. **Why Code-Only Works Better**:
   - Domain alignment with downstream tasks
   - Reduced confusion from diverse general instructions
   - Focused training on relevant instruction types

### Practical Implementation Guidelines:

1. **When to Use Instruction Tuning**:
   - Complex generation tasks (especially NL generation)
   - Tasks requiring contextual understanding
   - Multi-step reasoning problems

2. **When to Skip Instruction Tuning**:
   - Simple classification tasks
   - Pattern-based transformations
   - Tasks with clear input-output mappings

3. **Best Practices**:
   - Use domain-specific instruction datasets
   - Maintain consistent prompt templates
   - Monitor instruction-task alignment
   - Consider computational overhead vs. benefits
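
To make the "consistent prompt templates" guideline concrete, one option is to route both training and inference through a single prompt builder, so the two can never drift apart. The sketch below assumes the standard Alpaca-style wording (matching the opening of the template shown earlier in this notebook); the exact template should be checked against the paper's Figure 3. The names `build_prompt` and `build_training_text` are hypothetical.

```python
from typing import Optional

# Alpaca-style templates (assumed wording; verify against the paper's Figure 3)
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_prompt(instruction: str, input_text: Optional[str] = None) -> str:
    """Prompt fed to the model at inference time (ends right before the response)."""
    if input_text:
        return PROMPT_WITH_INPUT.format(instruction=instruction, input=input_text)
    return PROMPT_NO_INPUT.format(instruction=instruction)


def build_training_text(instruction: str, output: str, input_text: Optional[str] = None) -> str:
    """Training text: the same prompt followed by the target output."""
    return build_prompt(instruction, input_text) + output


prompt = build_prompt("Review this change", "def f(x): return {x}")
train_text = build_training_text("Review this change", "Looks fine.", "def f(x): return {x}")
print(train_text)
```

Because the training text is built from the inference prompt, any later change to the template applies to both stages at once.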

### Future Research Opportunities:

1. **Adaptive Instruction Selection**: Dynamic instruction tuning based on performance
2. **Multi-Modal Instructions**: Incorporating code structure and documentation
3. **Meta-Learning**: Learning to generate effective instructions
4. **Personalization**: Developer and codebase-specific instructions

The LLaMA-Reviewer results summarized here suggest applying instruction tuning selectively: it gave review comment generation a small gain and did not help review necessity prediction or code refinement, so its effect should be measured per task before it is adopted.